
The value of a state $s$ under a certain policy $\pi$, $V^\pi(s)$, is defined as the "expected return" starting from state $s$. More precisely, it is defined as

$$ V^\pi(s) = \mathbb{E}\left(R_t \mid s_t = s \right) $$

where the return $R_t$ is defined as

$$ R_t = \sum_{k=0}^\infty \gamma^k r_{t+k+1} $$

which is a sum of "discounted" rewards after time $t$, i.e. starting from time $t+1$.
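
To make sure I am reading the formula right, here is a tiny sketch (in Python, with a made-up discount factor and reward sequence, purely for illustration) of that discounted sum, truncated to a finite number of rewards:

```python
# Illustrative values only: gamma and the reward sequence are made up.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}

# R_t = sum_{k >= 0} gamma^k * r_{t+k+1}, truncated to the rewards we have
R_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(R_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```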

$V^\pi(s)$ can also be interpreted, even more precisely, as the expected cumulative future discounted reward. Each word in this phrase refers to a specific part of the formula above:

  1. "Expected" refers to the "expected value"
  2. "Cumulative" refers to the summation
  3. "Future" refers to the fact that it's an expected value of a future quantity with respect to the present quantity, i.e. $s_t = s$.
  4. "Discounted" refers to the "gamma" factor, which is a way to adjust the importance of how much we value rewards at future time steps, i.e. starting from $t + 1$.
  5. "Reward" refers to the main quantity of interested, i.e. the reward received from the environment.

Meanwhile, I've heard the term "expected reward", but I am not sure whether it refers to the same concept, that is, whether "expected reward" and "expected return" are the same thing.

I know there's also the concept of "expected value of the next reward", often denoted as $\mathcal{R}^a_{ss'}$, and defined as

$$ \mathcal{R}^a_{ss'} = \mathbb{E}\left(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \right) $$

which is the expected value of the reward at the next time step, i.e. at time step $t+1$, given that taking action $a$ in state $s$ brings us to state $s'$.
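
If the reward for a given transition is stochastic, I imagine $\mathcal{R}^a_{ss'}$ being computed from its distribution, along these lines (the `reward_dist` table and its values are hypothetical):

```python
# Hypothetical model: maps (s, a, s') to a list of (reward, probability) pairs.
reward_dist = {
    ("s", "a", "s_next"): [(0.0, 0.5), (1.0, 0.3), (5.0, 0.2)],
}

def expected_next_reward(s, a, s_next):
    # R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']
    return sum(r * p for r, p in reward_dist[(s, a, s_next)])

print(expected_next_reward("s", "a", "s_next"))  # 0.5*0.0 + 0.3*1.0 + 0.2*5.0 = 1.3
```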

Is the "expected reward" actually $\mathcal{R}^a_{ss'}$ instead of $V^\pi(s)$?

1 Answer


Is the "expected reward" actually $\mathcal{R}^a_{ss'}$ instead of $V^\pi(s)$?

In short, yes.

Although there is some context attached: $\mathcal{R}^a_{ss'}$ is the expected reward in the context of a specific action and state transition. You will also find $\mathcal{R}^a_{s}$ used for the expected reward given only the current state and action (which works fine, but moves some terms around in the Bellman equations, as sketched below).
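
As a sketch of what "moves around some terms" means, using the first-edition-style notation from the question (with $\mathcal{P}^a_{ss'}$ the transition probability and $\pi(s,a)$ the probability of picking $a$ in $s$), the Bellman equation for $V^\pi$ can be written either as

$$ V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right] $$

or, equivalently, as

$$ V^\pi(s) = \sum_a \pi(s,a) \left[ \mathcal{R}^a_{s} + \gamma \sum_{s'} \mathcal{P}^a_{ss'} V^\pi(s') \right] $$

where $\mathcal{R}^a_{s} = \sum_{s'} \mathcal{P}^a_{ss'} \mathcal{R}^a_{ss'}$.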

"Return" may also be called "Utility".

RL suffers a bit from naming differences; however, the meaning of reward is not one of them.

Notation differences also abound, and in Sutton & Barto Reinforcement Learning: An Introduction (2nd edition), you will find:

  • $R_t$ is a placeholder for the reward received at time $t$, a random variable.

  • $G_t$ is a placeholder for the return received after time $t$, and you can express the value equation as $v_{\pi}(s) = \mathbb{E}[G_t|S_t=s] = \mathbb{E}[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}|S_t=s]$

  • $r$ is a specific reward value

  • You won't see "expected reward" used directly in an equation from the book, as the notation in the revised book relies on summing over the distribution of reward values (an example is shown just after this list).
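
For example (if I remember the second edition's notation correctly), the Bellman equation for $v_\pi$ is written by summing over the joint distribution $p(s', r \mid s, a)$, so individual reward values $r$ appear instead of an explicit expected-reward term:

$$ v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right] $$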

In some RL contexts, such as control in continuous problems with function approximation, it is more convenient to work with maximising the average reward than with maximising the expected return. But this is not quite the same as "expected reward", due to differences in context: the average reward includes averaging over the expected state distribution when following the policy.
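
Roughly, the average-reward objective (again in second edition style, as far as I recall) looks like

$$ r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[ R_t \mid A_{0:t-1} \sim \pi \right] = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r $$

where $\mu_\pi$ is the steady-state distribution of states under $\pi$; that last sum is where the averaging over the state distribution comes in.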

Neil Slater