
Why do we learn the action-value function $Q(s,a)$ rather than the state-value function $V(s)$, at least for deterministic environments?

$V$ is much smaller than $Q$, and in a deterministic environment the two are trivially related: just scan over all possible actions and simulate one step ahead to recover $Q$ (which is what the Q-learning algorithm must do anyway). That is, $Q(s,a) = r(s,a) + \gamma V(s')$, where $s'$ is the state reached from $s$ after taking action $a$.
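For concreteness, here is a minimal sketch of that one-step lookahead, assuming access to a deterministic simulator `step(s, a)` returning `(r, s_next)` and an enumerable `actions(s)`; all the names are made up for illustration:

```python
# Hypothetical sketch: recover Q(s, .) from V by one-step lookahead,
# assuming a deterministic simulator.  `step`, `actions`, `V`, and `gamma`
# are stand-ins, not part of any particular library.

def q_from_v(s, V, actions, step, gamma=0.99):
    """Compute Q(s, a) for every action a, given V and a deterministic model."""
    q = {}
    for a in actions(s):
        r, s_next = step(s, a)          # simulate one deterministic transition
        q[a] = r + gamma * V[s_next]    # Q(s,a) = r(s,a) + gamma * V(s')
    return q
```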

So why not just learn the $V$ values? Is the main reason stochastic environments?

Related: What is the Q function and what is the V function in reinforcement learning?

xyzzyrz

1 Answer


$V(s)$ is a total function: it assigns a value to every state $s$, at least once that state has been thoroughly explored. It abstracts away the details of exploring the state space.

$Q(s, a)$ is a partial function, which reveals our current ignorance. When the agent encounters state $s$, it is free to explore or exploit. Early on we may be better off systematically exploring the action space from that state; as we learn a greater portion of $Q$, we might favor exploiting, in order to navigate toward some promising state $s'$.
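As a rough sketch of that choice, assuming a tabular `Q` stored as a dict keyed by (state, action) pairs and a hypothetical `actions(s)` enumerator (neither comes from the question itself):

```python
import random

# Partially known Q drives the explore/exploit decision: actions never tried
# from state s are simply absent from the table.  All names are hypothetical.

def choose_action(s, Q, actions, epsilon=0.1):
    tried = {a: Q[(s, a)] for a in actions(s) if (s, a) in Q}
    untried = [a for a in actions(s) if (s, a) not in Q]
    if untried and (not tried or random.random() < epsilon):
        return random.choice(untried)   # explore: fill in our ignorance
    return max(tried, key=tried.get)    # exploit: best currently known action
```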

While not inherent in the theoretical problem formulation, real-world problems are often drawn from smoothly differentiable spaces and exhibit clusters of adjacent "dead end" states. Given a limited exploration budget, telling the agent to avoid states already known to have low values for some actions may well be prudent, despite the possibility of a large reward for exploring some novel action from such a state.
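Under a limited budget, that pruning might look something like the following sketch; `Q`, `actions`, and `dead_end_threshold` are hypothetical names, and the threshold would be problem-specific:

```python
# Budget-conscious exploration: before spending a rollout on a novel action,
# drop candidates whose current estimate already looks like a dead end.

def exploration_candidates(s, Q, actions, dead_end_threshold=-10.0):
    candidates = []
    for a in actions(s):
        estimate = Q.get((s, a))
        if estimate is None or estimate > dead_end_threshold:
            candidates.append(a)        # unknown or still-promising actions
    return candidates                   # spend the exploration budget only here
```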


Consider a blind robot tailor that must thread a needle. It has a thread gripper mounted on a linear actuator with a range of one meter, similar to what you might find in a 3D printer. A stationary needle is present somewhere, perhaps connected to a loom. We get unit reward for threading the needle.

Action space: move the gripper a distance X to the left, then 1 cm up, which will hopefully thread the needle. We treat those movements as a single atomic action.

After a while an external agent evaluates whether the attempt succeeded.

The state space is very small, as there is exactly one initial condition, with the gripper starting at the origin, from which an experienced tailor will always achieve unit reward.

Rather than treating it as continuous, we might choose to view the $Q$ space as discretized to millimeter increments of the X motion. Looking at $Q$ gives us a burn-down list of actions we still need to explore; looking at $V$'s constant unit reward doesn't really assist with the learning goal.
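A toy illustration of that burn-down view, with hypothetical names and the X motion discretized to 1 mm steps over the one-meter range:

```python
# Each untried offset is an action whose Q value we have yet to learn.
offsets_mm = range(0, 1001)          # candidate "move X mm left" actions
s0 = "gripper_at_origin"             # the single initial state
Q = {}                               # learned (state, action) values so far

untried = [x for x in offsets_mm if (s0, x) not in Q]
print(f"{len(untried)} actions left to explore from {s0}")

# By contrast, V collapses to a single number for the single state; the
# eventual unit reward says nothing about *which* offset threads the needle.
V = {s0: 1.0}
```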

Some real-world situations, such as this one, are not smoothly differentiable. Often much of the space is smooth, yet we must locate one nonlinear region, or a sequence of them, to obtain a reward. This is where the contrast between $V$ and $Q$ is strongest: when the state space has not yet been fully explored.

J_H