Can a reinforcement learning agent be trained to trade in the stock market? Implementation in R

Let's create a prototype reinforcement learning (RL) agent that will master the skill of trading.

Since the prototype is implemented in R, I encourage R users and programmers to engage with the ideas presented in this material.

This is a translation of my English article: Can Reinforcement Learning Trade Stock? Implementation in R.

I want to warn code hunters that this note contains only the neural network code, adapted for R.

If my Russian was not up to the mark in places, please point out the mistakes (the text was prepared with the help of an automatic translator).

Introduction to the problem

I advise you to start your dive into the topic with this article by DeepMind.

It will introduce you to the idea of using a Deep Q-Network (DQN) to approximate the value function, which is crucial in Markov decision processes.

I also recommend delving into the mathematics using the preprint of this book by Richard S. Sutton and Andrew G. Barto: Reinforcement Learning.
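
As a reminder (this formula is not in the original note), the tabular Q-learning update that DQN approximates with a neural network is:

Q(s, a) ← Q(s, a) + α * [r + γ * max over a' of Q(s', a') − Q(s, a)],

where s' is the next state, r the reward, α the learning rate and γ the discount factor.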

Below I will present an extended version of the original DQN, which incorporates additional ideas that help the algorithm converge quickly and efficiently, namely:

Deep Double Dueling Noisy neural networks with prioritized sampling from the experience replay buffer.

What makes this approach better than classic DQN?

- Double: there are two networks, one of which is trained while the other estimates the Q-values for the next state.
- Dueling: there are neurons that explicitly estimate the state value and the action advantages.
- Noisy: noise matrices are applied to the weights of the intermediate layers, where the means and standard deviations are themselves trainable weights.
- Prioritized sampling: mini-batches drawn from the replay buffer favor the examples on which earlier training passes produced large residuals; these residuals are kept in an auxiliary array (a minimal sketch of such sampling follows this list).
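
As a minimal illustration of the prioritized-sampling idea only (this helper is not part of the article's code; the buffer layout and all names are assumptions), transitions can be drawn with probability proportional to their stored residuals:

```r
# Hypothetical sketch of prioritized sampling from a replay buffer.
# Assumes the buffer is a data.frame of transitions and `residuals` is a parallel
# numeric vector holding the last known absolute TD errors.
sample_prioritized <- function(buffer, residuals, batch_size = 32, alpha = 0.6) {
  # probabilities proportional to |residual|^alpha, with a small constant so that
  # every transition keeps a non-zero chance of being replayed
  prios <- (abs(residuals) + 1e-6) ^ alpha
  probs <- prios / sum(prios)
  idx   <- sample(seq_len(nrow(buffer)), size = batch_size, replace = TRUE, prob = probs)
  list(batch = buffer[idx, , drop = FALSE], indices = idx)
}

# After a training step, the residuals at `indices` would be overwritten with the
# new absolute TD errors produced by the value network.
```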

Well, what about trading performed by a DQN agent? This is an interesting topic in its own right.

There are reasons why it is interesting:

- Absolute freedom in choosing the representations of states, actions, rewards and the NN architecture. You can enrich the input space with anything you consider worth trying, from news to other stocks and indices.
- The trading logic maps naturally onto the reinforcement learning logic: the agent performs discrete (or continuous) actions, is rewarded sparsely (after closing a trade or after a period expires), the environment is partially observable and may contain information about the next steps, and trading is an episodic game (see the toy environment sketch after this list).
- You can compare DQN results against several baselines, such as indices and technical trading systems.
- The agent can continuously learn new information and thus adapt to the changing rules of the game.
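
To make the episodic framing concrete, here is a toy sketch of an environment step function. It is entirely hypothetical (the article does not include its environment code); a long-only position, three discrete actions and the transaction cost value are assumptions:

```r
# Toy environment step: 1 = buy, 2 = sell, 3 = do nothing; reward is paid only
# when a position is closed, which makes the reward sparse.
env_step <- function(state, action, prices, t, transaction_cost = 0.5) {
  reward <- 0
  if (action == 1 && state$position == 0) {          # open a long position
    state$position    <- 1
    state$entry_price <- prices[t]
  } else if (action == 2 && state$position == 1) {   # close the position
    reward         <- (prices[t] - state$entry_price) - transaction_cost
    state$position <- 0
  }
  done <- t >= length(prices)                        # the episode ends with the data
  list(state = state, reward = reward, done = done)
}

# usage sketch:
# state <- list(position = 0, entry_price = NA)
# res   <- env_step(state, action = 1, prices = cumsum(rnorm(100)), t = 1)
```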

In order not to stretch the material, let's look straight at the code of the NN, which I want to share, since it is one of the more mysterious parts of the whole project.

Below is the R code for the value neural network, which uses Keras to build our RL agent.

```r
# configure critic NN
library('keras')
library('R6')

learning_rate <- 1e-3
state_names_length <- 12 # just for example
acts <- c('buy', 'sell', 'hold') # action set (buy, sell, do nothing); needed below for length(acts)

# advantage stream normalization: A' = A - mean(A)
a_CustomLayer <- R6::R6Class(
  "CustomLayer"
  , inherit = KerasLayer
  , public = list(

    call = function(x, mask = NULL) {
      x - k_mean(x, axis = 2, keepdims = T)
    }

  )
)

a_normalize_layer <- function(object) {
  create_layer(a_CustomLayer, object, list(name = 'a_normalize_layer'))
}

# value stream broadcast: repeats the scalar state value for every action
v_CustomLayer <- R6::R6Class(
  "CustomLayer"
  , inherit = KerasLayer
  , public = list(

    call = function(x, mask = NULL) {
      k_concatenate(list(x, x, x), axis = 2)
    }

    , compute_output_shape = function(input_shape) {

      output_shape = input_shape
      output_shape[[2]] <- input_shape[[2]] * 3L

      output_shape
    }
  )
)

v_normalize_layer <- function(object) {
  create_layer(v_CustomLayer, object, list(name = 'v_normalize_layer'))
}

# noisy dense layer: trainable mu/sigma weights plus factorized Gaussian noise
noise_CustomLayer <- R6::R6Class(
  "CustomLayer"
  , inherit = KerasLayer
  , lock_objects = FALSE
  , public = list(

    initialize = function(output_dim) {
      self$output_dim <- output_dim
    }

    , build = function(input_shape) {

      self$input_dim <- input_shape[[2]]

      sqr_inputs <- self$input_dim ** (1/2)

      self$sigma_initializer <- initializer_constant(.5 / sqr_inputs)

      self$mu_initializer <- initializer_random_uniform(minval = (-1 / sqr_inputs), maxval = (1 / sqr_inputs))

      self$mu_weight <- self$add_weight(
        name = 'mu_weight',
        shape = list(self$input_dim, self$output_dim),
        initializer = self$mu_initializer,
        trainable = TRUE
      )

      self$sigma_weight <- self$add_weight(
        name = 'sigma_weight',
        shape = list(self$input_dim, self$output_dim),
        initializer = self$sigma_initializer,
        trainable = TRUE
      )

      self$mu_bias <- self$add_weight(
        name = 'mu_bias',
        shape = list(self$output_dim),
        initializer = self$mu_initializer,
        trainable = TRUE
      )

      self$sigma_bias <- self$add_weight(
        name = 'sigma_bias',
        shape = list(self$output_dim),
        initializer = self$sigma_initializer,
        trainable = TRUE
      )

    }

    , call = function(x, mask = NULL) {

      # sample from the noise distribution
      e_i = k_random_normal(shape = list(self$input_dim, self$output_dim))
      e_j = k_random_normal(shape = list(self$output_dim))

      # we use the factorized case of Fortunato et al.
      eW = k_sign(e_i) * (k_sqrt(k_abs(e_i))) * k_sign(e_j) * (k_sqrt(k_abs(e_j)))
      eB = k_sign(e_j) * (k_abs(e_j) ** (1/2))

      # see section 3 of Fortunato et al.
      noise_injected_weights = k_dot(x, self$mu_weight + (self$sigma_weight * eW))
      noise_injected_bias = self$mu_bias + (self$sigma_bias * eB)
      output = k_bias_add(noise_injected_weights, noise_injected_bias)

      output

    }

    , compute_output_shape = function(input_shape) {

      output_shape <- input_shape
      output_shape[[2]] <- self$output_dim

      output_shape

    }
  )
)

noise_add_layer <- function(object, output_dim) {
  create_layer(
    noise_CustomLayer
    , object
    , list(
      name = 'noise_add_layer'
      , output_dim = as.integer(output_dim)
      , trainable = T
    )
  )
}

critic_input <- layer_input(
  shape = c(as.integer(state_names_length))
  , name = 'critic_input'
)

common_layer_dense_1 <- layer_dense(
  units = 20
  , activation = "tanh"
)

critic_layer_dense_v_1 <- layer_dense(
  units = 10
  , activation = "tanh"
)

critic_layer_dense_v_2 <- layer_dense(
  units = 5
  , activation = "tanh"
)

critic_layer_dense_v_3 <- layer_dense(
  units = 1
  , name = 'critic_layer_dense_v_3'
)

critic_layer_dense_a_1 <- layer_dense(
  units = 10
  , activation = "tanh"
)

# critic_layer_dense_a_2 <- layer_dense(
#   units = 5
#   , activation = "tanh"
# )

critic_layer_dense_a_3 <- layer_dense(
  units = length(acts)
  , name = 'critic_layer_dense_a_3'
)

# value stream
critic_model_v <- critic_input %>%
  common_layer_dense_1 %>%
  critic_layer_dense_v_1 %>%
  critic_layer_dense_v_2 %>%
  critic_layer_dense_v_3 %>%
  v_normalize_layer

# advantage stream
critic_model_a <- critic_input %>%
  common_layer_dense_1 %>%
  critic_layer_dense_a_1 %>%
  # critic_layer_dense_a_2 %>%
  noise_add_layer(output_dim = 5) %>%
  critic_layer_dense_a_3 %>%
  a_normalize_layer

# Q = V + A'
critic_output <- layer_add(
  list(
    critic_model_v
    , critic_model_a
  )
  , name = 'critic_output'
)

critic_model_1 <- keras_model(
  inputs = critic_input
  , outputs = critic_output
)

critic_optimizer = optimizer_adam(lr = learning_rate)

keras::compile(
  critic_model_1
  , optimizer = critic_optimizer
  , loss = 'mse'
  , metrics = 'mse'
)

# smoke test on random inputs
train.x <- rnorm(state_names_length * 10)
train.x <- array(train.x, dim = c(10, state_names_length))

predict(critic_model_1, train.x)

critic_model_2 <- critic_model_1 # handle for the second (target) network of the Double DQN setup
```
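
The training loop itself is not part of this note. As a rough sketch of how the "Double" part could be wired up with two such networks (all names, the discount factor and the batch variables are assumptions, not the article's code): the online network picks the best next action, the target network evaluates it. Note that in the snippet above critic_model_2 is just a second handle; in a full implementation the target network would be a separate copy of the same architecture whose weights are synchronized with the online network from time to time.

```r
# Sketch of a double-DQN target, assuming `model_online` and `model_target` are two
# networks with the architecture above and the batch arrays come from the replay buffer.
double_dqn_targets <- function(model_online, model_target,
                               rewards, next_states, dones, gamma = 0.99) {
  q_online_next <- predict(model_online, next_states)   # used only to pick actions
  q_target_next <- predict(model_target, next_states)   # used to evaluate them
  best_actions  <- max.col(q_online_next)                # argmax over actions, per row
  q_eval <- q_target_next[cbind(seq_len(nrow(q_target_next)), best_actions)]
  rewards + gamma * q_eval * (1 - dones)                 # no bootstrap at episode end
}

# periodic synchronization of a separate target network, e.g.:
# set_weights(critic_model_2, get_weights(critic_model_1))
```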

I used this source to adapt the Python code for the noisy part of the network: github repo.

This neural network looks like this:

[figure: diagram of the value neural network]

Recall that in the dueling architecture we use the following equation (equation 1):

Q = A' + V, where

A' = A - avg(A);
Q = state-action value;
V = state value;
A = advantage.
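
A tiny numeric illustration of equation 1 (the numbers are made up):

```r
# One state, three actions; values are purely illustrative.
V <- 2.0                 # state value, broadcast to every action (v_normalize_layer)
A <- c(1.0, 0.0, -1.0)   # raw advantages
A_prime <- A - mean(A)   # centered advantages, as done in a_normalize_layer
Q <- V + A_prime         # per-action Q-values: 3.0 2.0 1.0
```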

The other variables in the code speak for themselves. Note also that this architecture is only good for a specific task, so do not take it for granted.

The rest of the code is probably too boilerplate to publish, and it will be more interesting for a programmer to write it yourself.

And now, the experiments. The agent was tested in a sandbox, far from the realities of trading on a live market with a real broker.

Phase I

We run our agent against a synthetic dataset. Our transaction cost is 0.5:

[figure: agent performance on the synthetic dataset]

The result is excellent: the maximum average episodic reward in this experiment should be 1.5.

[figure: training curves]

We see the critic loss (the critic is what the value network is called in the actor-critic approach), the average reward per episode, the cumulative reward, and a sample of the most recent rewards.

Phase II

We train our agent on an arbitrarily chosen exchange symbol that demonstrates interesting behavior: a flat start, rapid growth in the middle, and a dreary end. There are about 4300 days in our training set. The transaction cost is set at 0.1 USD (purposely low); the reward is the USD profit/loss after closing a buy/sell trade of 1.0 share.

Source: finance.yahoo.com/quote/algn?ltr=1

[figure: NASDAQ: ALGN price history]
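
For reproducibility, one way to pull the same daily series (assuming the quantmod package; this snippet is not part of the original note):

```r
# Fetch ALGN daily data from Yahoo Finance via quantmod.
library(quantmod)

algn <- getSymbols("ALGN", src = "yahoo", auto.assign = FALSE,
                   from = "2001-01-01")      # roughly the 4300+ trading days mentioned above
head(algn)                                   # Open, High, Low, Close, Volume, Adjusted
close_prices <- as.numeric(Cl(algn))         # closing prices as a plain numeric vector
```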

After tuning some parameters (keeping the NN architecture the same), we arrived at the following result:

[figure: training curves on NASDAQ: ALGN]

It turned out not bad, because in the end the agent learned to make a profit by pressing the three buttons on its console.

[figure: trades placed by the agent on the price chart]

red marker = sell, green marker = buy, gray marker = do nothing

Note that, at its peak, the average reward per episode exceeded the realistic transaction cost one would encounter in real trading.

It is a pity that the stock is falling like crazy because of bad news...

Concluding remarks

Trading with RL is not only difficult, but also useful. When your robot does it better than you, it is time to spend your personal time on education and health.

I hope this was an interesting journey for you. If you liked this story, wave your hand. If there is enough interest, I can continue and show how policy gradient methods work, using the R language and the Keras API.

I also want to thank my friends who are passionate about neural networks for their advice.

If you have any questions, I am always here.