3

My team and I started digging into RL for a specific application. We have plenty of data from an agent carrying out suboptimal policies (states and rewards...).

It is too costly for us to emulate the agent (executing an action and assessing the reward), meaning our only option is to learn the optimal policy from our dataset using an off-policy algorithm. However, I still do not know how to do that: Q-learning, while off-policy, still seems to require an emulator.

Could you please provide me with some guidelines on how to do that? Which type of algorithm should I use?

user26616

1 Answer

4

Q-learning does not in fact need to be online, nor does it need an emulator; it can learn exclusively from experience replay. If you put all your history into a table of (state, action, reward, next state) tuples and then sample from it, it should be possible to train your agent that way.

To do this, you will need to skip the algorithm steps that take actions and store results. The algorithm will then learn from the data you have. It will just not be possible to collect more. Depending on the problem you are trying to solve, this could be OK, or it may inhibit learning.
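As a purely illustrative sketch of the data-handling side, assuming the logged history has already been exported to NumPy arrays (the file names and shapes below are assumptions, not anything from your description):

```python
# Turning logged experience into a replay table that can be sampled at random.
import numpy as np

# Hypothetical files holding the logged transitions, all of equal length N:
states      = np.load("states.npy")        # shape (N, state_dim)
actions     = np.load("actions.npy")       # shape (N,), integer action ids
rewards     = np.load("rewards.npy")       # shape (N,)
next_states = np.load("next_states.npy")   # shape (N, state_dim)

rng = np.random.default_rng(0)

def sample_batch(batch_size=32):
    """Sample a random mini-batch of (S, A, R, S') rows from the fixed history."""
    idx = rng.integers(0, len(states), size=batch_size)
    return states[idx], actions[idx], rewards[idx], next_states[idx]
```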

RL algorithms learning optimal control in complex environments benefit from sampling near to their current policy, so it is possible in your case that your agent will reach a limit on what it can learn from historic data. It may end up still quite far from optimal behaviour, although it should stand a reasonable chance of improving on the best that the historic data shows.

If you need to use function approximation (e.g. a neural network) due to the size of the state and action spaces, then take extra care, because it will be hard to detect whether the action values have converged correctly. This is because you are learning the optimal Q values, and you will have no test data that demonstrates what those should be (to collect that data, you would need to follow the optimal policy and measure the total reward).

Here is roughly what an experience-replay-only Q-learning algorithm would look like:

Input: History $H$, consisting of rows of $S,A,R,S'$

Initialise the NN for calculating $\hat{q}(s,a)$

Repeat until NN converges:

$\qquad$ Sample $S,A,R,S'$ from $H$

$\qquad$ $tdtarget \leftarrow R + \gamma \max_{a'}[\hat{q}(S',a')]$

$\qquad$ Train the NN for a single step, so that $\hat{q}(S,A) \rightarrow tdtarget$
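If it helps to see that loop concretely, here is one possible rendering in PyTorch. The network size, optimiser, learning rate and problem dimensions are placeholders, and terminal-state handling is omitted just as it is in the pseudocode above:

```python
# A rough sketch of the single-transition, replay-only Q-learning update.
# All sizes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99   # placeholders for your problem

# \hat{q}(s, .) as a small MLP that outputs one value per discrete action
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(S, A, R, S_next):
    """One gradient step pushing q(S, A) towards the TD target."""
    q_sa = q_net(S)[A]                                # \hat{q}(S, A)
    with torch.no_grad():
        td_target = R + gamma * q_net(S_next).max()   # R + gamma * max_a' \hat{q}(S', a')
    loss = (q_sa - td_target) ** 2
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```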

You can make use of mini-batch processing to generate multiple $tdtarget$ values at once and train on them. A worthwhile improvement for stability is to use a frozen copy of the neural network when calculating the $tdtarget$ value, and only update it every N steps to be a copy of the most recent network, with N perhaps around 1000 steps.
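Sketching those two refinements together, and reusing the `q_net`, `optimiser`, `gamma` and `sample_batch` names from the snippets above:

```python
# Mini-batch training with a frozen target copy of the network, refreshed every N steps.
import copy
import torch
import torch.nn.functional as F

target_net = copy.deepcopy(q_net)   # frozen copy used only for computing td_target
sync_every = 1000                   # the "N" above; 1000 is just an example value

for step in range(100_000):         # stopping rule (e.g. loss plateau) is up to you
    S, A, R, S_next = (torch.as_tensor(x) for x in sample_batch(64))
    S, S_next, R = S.float(), S_next.float(), R.float()

    q_sa = q_net(S).gather(1, A.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = R + gamma * target_net(S_next).max(dim=1).values

    loss = F.mse_loss(q_sa, td_target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())   # refresh the frozen copy
```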

Neil Slater