
I'm trying to gain an intuitive understanding of deep reinforcement learning. In deep Q-networks (DQN) we store all actions/states/rewards in a memory array and, at the end of the episode, "replay" them through our neural network. This makes sense to me because we are trying to build out our rewards matrix: we check whether our episode ended in a reward and scale that back through the matrix.

I would think the sequence of actions that led to the reward state is what is important to capture - this sequence of actions (and not the actions independently) is what led us to our reward state.

In the Atari DQN paper by Mnih et al., and in many tutorials since, we see the practice of randomly sampling from the memory array and training on those samples. So if we have a memory of:

$(action\,a, state\,1) \rightarrow (action\,b, state\,2) \rightarrow (action\,c, state\,3) \rightarrow (action\,d, state\,4) \rightarrow reward!$

We may train on a mini-batch of:

[(action c, state 3), (action b, state 2), reward!]
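For concreteness, here is roughly how I picture the memory and the sampling step (a minimal sketch of my own; `ReplayMemory` and the field names are illustrative, not taken from the paper):

```python
import random
from collections import deque, namedtuple

# One stored experience: (state, action, reward, next_state)
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-size buffer that stores transitions and samples them uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling: the mini-batch ignores the original episode ordering
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=10000)
# during an episode: memory.push(s, a, r, s_next) after every step
# later: batch = memory.sample(32)  # may contain (action c, state 3) but not its neighbours
```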

The reason given is:

Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.

or from this PyTorch tutorial:

By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.

My intuition would tell me the sequence is what is most important in reinforcement learning. Most episodes have a delayed reward, so most action/state pairs have no reward (and are not "reinforced"). The only way to bring a portion of the reward back to these earlier states is to retroactively spread the reward across the sequence (through the discounted future-reward term in the Q-learning update, $r + \gamma \max_{a'} Q(s', a')$).

A random sampling of the memory bank breaks our sequence, so how does that help when you are trying to back-fill a Q (reward) matrix?

Perhaps this is more similar to a Markov model, where every state should be considered independent? Where is the error in my intuition?

Stephen Rauch
ZAR

3 Answers


The de-correlation effect is more important than following the sequence of trajectories in this case.

Single-step Q-learning does not rely on trajectories to learn. Working one step at a time is slightly less efficient in TD learning - a Q($\lambda$) algorithm, which averages over multiple trajectory lengths, might well work better if it were not for the instability of using function approximators.

Instead, DQN-based learning bootstraps across single steps (state, action, reward, next state). It doesn't need longer trajectories, and in fact, due to the bias caused by correlation, the neural network might suffer if you tried to use them. Even with experience replay, the bootstrapping - using one set of estimates to refine another - can be unstable. So other stabilising influences are beneficial too, such as using a frozen copy of the network to estimate the TD target $R + \gamma \max_{a'} Q(S', a')$ - sometimes written $R + \gamma \max_{a'} \hat{q}(S', a', \theta^{-})$, where $\theta^{-}$ is a periodically refreshed copy of the learnable parameters $\theta$ of the $\hat{q}$ function.
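To make that single-step update concrete, the loss might be computed along these lines (a rough PyTorch-style sketch, not code from the paper; `q_net`, `target_net`, `gamma` and the batch tensors are placeholder names):

```python
import torch
import torch.nn.functional as F

# Assumed shapes: states (B, state_dim), actions (B,) of dtype long, rewards (B,),
# next_states (B, state_dim), dones (B,) with 1.0 marking terminal transitions.
def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(S, A) from the network currently being trained
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target R + gamma * max_a' q_hat(S', a', theta^-) from the frozen copy;
    # no gradient flows through the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    return F.smooth_l1_loss(q_values, td_target)
```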

It might still be possible to use longer trajectories, sampled randomly, to get a TD target estimate based on more steps. This can be beneficial for reducing the bias from bootstrapping, at the expense of adding variance due to sampling from a larger space of possible trajectories (and "losing" parts of trajectories, or altering the predicted return, because of exploratory actions). However, the single-step method presented in the DQN paper has shown success, and it is not clear which problems would benefit from longer trajectories. You might like to experiment with the options though - it is not an open-and-shut case, and since the DQN paper, various other refinements have been published.
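For example, an $n$-step variant of the same target could look something like this (again only a sketch; the transition fields and the frozen `target_net` are assumptions carried over from the snippet above):

```python
import torch

def n_step_target(transitions, target_net, gamma=0.99):
    """Discounted n-step return: r_0 + gamma*r_1 + ... + gamma^n * max_a' Q(s_n, a').

    `transitions` is a short consecutive slice of the replay memory; each element
    is assumed to have .reward, .next_state (a 1-D tensor) and a .done flag.
    """
    g, discount = 0.0, 1.0
    for t in transitions:
        g += discount * t.reward
        discount *= gamma
        if t.done:                      # the episode ended inside the slice
            return g
    # Bootstrap from the frozen network at the end of the slice
    with torch.no_grad():
        bootstrap = target_net(t.next_state.unsqueeze(0)).max(dim=1).values.item()
    return g + discount * bootstrap
```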

Neil Slater

My intuition would tell me the sequence is what is most important in reinforcement learning. Most episodes have a delayed reward, so most action/state pairs have no reward (and are not "reinforced"). The only way to bring a portion of the reward back to these earlier states is to retroactively spread the reward across the sequence (through the discounted future-reward term in the Q-learning update, $r + \gamma \max_{a'} Q(s', a')$).

While your intuition is correct w.r.t. RL in general (i.e. it is always better to deal with complete trajectories with all their dependencies), the problem arises when we use ML algorithms like gradient descent to train a model. These methods require i.i.d. (independent and identically distributed) samples, and hence there is a conflict. By using an experience replay store, we can give approximately i.i.d. samples to the model. After long training it will eventually learn to build the dependencies between states itself.

This should address the OP's key doubt.
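As a small illustration of the difference (a hypothetical sketch, with a plain Python list standing in for the replay store):

```python
import random

buffer = [f"transition_{i}" for i in range(1000)]  # stand-in for stored (s, a, r, s') tuples

# Consecutive samples: strongly correlated, all drawn from the same stretch of one episode
consecutive_batch = buffer[-32:]

# Uniform random samples: approximately i.i.d. draws from the replay distribution,
# which is what SGD-style updates implicitly assume
iid_batch = random.sample(buffer, 32)
```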

Allohvk

DQN uses experience replay to break correlations between sequential experiences. For every state, the next state is affected by the current action, so taking experiences sequentially would result in instabilities due to the internal correlations between experiences. An experience consists of a state, an action, a reward and the next state - everything that is needed to learn, at least in temporal-difference fashion. As such, experience replay lets us combine Monte Carlo (which uses full episodes) and temporal difference (which uses single experiences) in one, more robust, scheme. Note that what is needed is not the size of any single reward but an indication of whether the agent is on the right track, which is best given by an average over several experiences.

Also, using the same network parameters to get both the prediction and the target Q-values is like updating a guess with another guess - "a dog chasing its own tail." Using a target network, which is just a clone (stale copy) of the prediction network, helps break this other kind of correlation between targets and predictions. While the prediction network is updated using the experiences, the target network is only updated periodically: after every so many updates of the prediction network, its weights are copied to the target network.
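A rough sketch of that schedule (assuming PyTorch-style `q_net` / `target_net` modules, an `optimizer`, and a hypothetical `dqn_loss` helper that computes the TD loss against the target network):

```python
def train_step(step, q_net, target_net, optimizer, batch, sync_every=1000):
    """One prediction-network update, plus an occasional hard copy into the target network."""
    loss = dqn_loss(q_net, target_net, *batch)   # targets come from the stale clone
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically refresh the stale copy so the targets only move every `sync_every` updates
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```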

EArwa