Surf on Chart: Training RL models for coin trading
Agents were trained to perform coin trading using reinforcement learning
algorithms such as DQN, DDQN, A2C, and PPO. The Tensorforce library was used to
implement the agents, and multiple tests were conducted to find suitable
hyperparameter values. In our experiments, the PPO agent generated the highest
profit. However, because training frequently fell into local optima, these
results cannot be fully trusted.
RL Agents
- A2C: Advantage Actor-Critic. An actor-critic method that uses the advantage,
  defined as the state-action value minus the state value, as its learning
  signal. The critic learns from this advantage, so the agent learns how much
  "more" an action earns relative to the baseline rather than only the absolute
  gain it gets.
- PPO: Proximal Policy Optimization. A learning technique that simplifies the
  heavy computations of TRPO (Trust Region Policy Optimization), an actor-critic
  method in the same family as A2C, while maintaining its performance. During
  the update, the policy change is clipped so that updates stay within a trusted
  region, which makes learning more stable.
- DQN: Deep Q-Network. Uses a deep neural network as the Q function in
  Q-learning, an off-policy temporal-difference (TD) control method. The Q
  function is updated toward the largest Q value at the next state s'. While the
  agent mostly selects the action with the highest Q value, sufficient
  exploration is still performed.
- DDQN: Double Deep Q-Network. A variant of DQN in which the online network
  selects the action at the next state and the target network evaluates the
  Q-value of that chosen action, reducing the overestimation of Q-values seen in
  plain DQN (the two targets are contrasted in the sketch after this list).
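The toy snippet below illustrates the update quantities described above with
made-up numbers; it is not code from our agents or from Tensorforce, just the
textbook formulas for the advantage, the PPO clipped objective, and the DQN
versus DDQN targets.

```python
import numpy as np

gamma, eps_clip = 0.99, 0.2

# A2C / PPO: advantage = Q(s, a) - V(s)
q_sa, v_s = 1.3, 1.0
advantage = q_sa - v_s                          # how much "more" the action earns

# PPO: clipped surrogate objective for a single sample
ratio = 1.35                                    # pi_new(a|s) / pi_old(a|s)
clipped_ratio = np.clip(ratio, 1 - eps_clip, 1 + eps_clip)
ppo_objective = min(ratio * advantage, clipped_ratio * advantage)

# DQN vs DDQN targets for one transition (s, a, r, s')
r = 0.5
q_online_next = np.array([0.2, 0.9, 0.4])       # online network Q(s', .)
q_target_next = np.array([0.3, 0.7, 0.8])       # target network Q(s', .)

dqn_target = r + gamma * q_target_next.max()                      # max over target net
ddqn_target = r + gamma * q_target_next[q_online_next.argmax()]   # select with online, evaluate with target

print(advantage, ppo_objective, dqn_target, ddqn_target)
```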
Parameter Tuning
Hyperparameters were tuned through repeated experiments. Default values were
taken from the documentation of the Tensorforce library, which we used to
implement the RL algorithms, and we then tested whether larger or smaller
values performed better than those defaults. Hyperparameter values reported in
reference papers were tested as well. The search range was then narrowed in the
direction of the values that outperformed the defaults.
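As an illustration of this procedure, the sketch below sweeps a single PPO
hyperparameter (the learning rate) around a default value with Tensorforce's
`Agent`, `Environment`, and `Runner`. The gym CartPole environment and the
specific values are stand-ins only; our experiments used a custom trading
environment and different budgets, which are not shown here.

```python
from tensorforce import Agent, Environment, Runner

# NOTE: a sketch of the sweep procedure, not our actual training code.
# 'CartPole-v1' stands in for the custom trading environment, and the learning
# rates below are example values bracketing a default.
results = {}
for lr in (1e-4, 3e-4, 1e-3):
    environment = Environment.create(
        environment='gym', level='CartPole-v1', max_episode_timesteps=500)
    agent = Agent.create(
        agent='ppo', environment=environment,
        batch_size=10, learning_rate=lr)

    # Train for a fixed budget with this hyperparameter value
    Runner(agent=agent, environment=environment).run(num_episodes=200)

    # One greedy evaluation episode (no exploration, no experience stored)
    states = environment.reset()
    internals = agent.initial_internals()
    terminal, total = False, 0.0
    while not terminal:
        actions, internals = agent.act(
            states=states, internals=internals,
            independent=True, deterministic=True)
        states, terminal, reward = environment.execute(actions=actions)
        total += reward
    results[lr] = total

    agent.close()
    environment.close()

print(results)   # keep the better direction and narrow the range around it
```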
Problem of the model constantly choosing the same action "Hold"
An endemic problem of AI investment has been reported by previous researchers
of stock and coin trading models: as models are trained longer, they tend to
keep selecting the action “Hold”. When an agent performs “Hold”, it neither
buys nor sells any assets. We observed the same problem in our own research.
In the real world, however, a strategy of neither buying nor selling does not
actually preserve the initial balance, because the value of cash continues to
decline. We reflected this reality in our trader environment by applying a
penalty to the value of cash as timesteps pass. Models trained in the revised
environment chose “Hold” less often. Additionally, we introduced “Long” and
“Short” positions to the environment so that agents could invest much more
aggressively, which further reduced the tendency of trained models to only
“Hold”.
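A minimal sketch of these two changes is given below; the decay rate, function
names, and position bookkeeping are illustrative assumptions, not the exact
code or values used in our environment.

```python
# Illustrative sketch only: the decay rate and the long/short accounting are
# placeholder assumptions, not the values used in our trader environment.
CASH_DECAY_PER_STEP = 1e-5      # hypothetical per-tick devaluation of cash

def apply_cash_decay(cash: float) -> float:
    """Reduce the value of uninvested cash each timestep, so that holding
    cash forever is no longer a 'free' strategy."""
    return cash * (1.0 - CASH_DECAY_PER_STEP)

def position_pnl(position: str, entry_price: float,
                 current_price: float, size: float) -> float:
    """Profit and loss of a 'long' or 'short' position of the given size."""
    if position == 'long':
        return (current_price - entry_price) * size
    if position == 'short':
        return (entry_price - current_price) * size
    return 0.0   # 'hold' / no open position
```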
Tuning episode length for best performance
At the beginning of the experiments, the whole chart (Bitcoin
open/close/high/low/volume data from 2017-12-09 23:00:00 to 2018-02-03 12:15:00
at a 5-minute timestep, roughly 15,000 ticks in total) was used. In this
setting, however, the loss did not decrease stably, so the model was not
trained properly. To resolve the problem, we did some research and consulted
the GitHub community of the Tensorforce library. A researcher struggling with
the same symptom (the model did not seem to be learning) had reported it there,
and the Tensorforce developers explained that an overly long training episode
can be problematic and recommended reducing the amount of training data.
Following this advice, we shortened the episode length and reran the
experiments. Three different slices of the chart, each 500 ticks long, were
used as training data. Initially, the trainer code picked three random slices.
However, because we aimed to train agents with different RL algorithms and
compare them, we had to eliminate the randomness and fix the training slices. A
bull market, a bear market, and a box-patterned market were selected as
training data, since a trained model should be able to make a profit in any
environment after the training phase. A bull market is one with a continuously
rising coin price, a bear market is one with a continuously falling coin price,
and a box-patterned market has both falling and rising phases without any
outstanding trend in either direction.
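For reference, a minimal sketch of cutting three fixed 500-tick training slices
from the chart is shown below; the file name and start indices are
placeholders, not the actual values used in our experiments.

```python
import pandas as pd

# Sketch of fixing the three 500-tick training slices. The chart file and the
# start indices are placeholders; in practice the slices were chosen to cover
# a bull, a bear, and a box-patterned period of the 5-minute Bitcoin chart.
EPISODE_LEN = 500

chart = pd.read_csv('btc_5min_ohlcv.csv')                    # placeholder file name

slice_starts = {'bull': 1000, 'bear': 6000, 'box': 11000}    # placeholder indices

training_slices = {
    name: chart.iloc[start:start + EPISODE_LEN].reset_index(drop=True)
    for name, start in slice_starts.items()
}
```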