You're right that it's not UC-secure, for exactly the reason you say. It allows offline dictionary attacks. Here's how that problem manifests in the UC model:
Consider this particular environment:
- Environment chooses honest party's password $pw$ uniformly from some known polynomial-size dictionary $\mathcal{D}$ (without loss of generality, $\mathcal{D} = \{1, \ldots, m\}$).
- Environment initiates the protocol and waits for honest party to report an output $K^*$
- Environment requests from the adversary a sequence $k_1, \ldots, k_m$. The adversary wins the game if $k_{pw} = K^*$.
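It is easy for an adversary to win this game in the real world interaction with probability one. Having seen $r$ and $r'$ during the protocol, the adversary sets $k_i = \mathcal{O}(i,r,r')$ for every $i \in \mathcal{D}$, so $k_{pw}$ is exactly the key the honest party outputs.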
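The real-world attack can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the honest party's key is taken to be $\mathcal{O}(pw, r, r')$ with $r$ and $r'$ sent in the clear, SHA-256 stands in for the random oracle, and the names `oracle` and `real_world_game` are hypothetical.

```python
import hashlib
import secrets


def oracle(pw: int, r: bytes, r_prime: bytes) -> bytes:
    """Stand-in for the random oracle O (assumption: SHA-256 over a
    length-prefixed encoding of the three inputs)."""
    h = hashlib.sha256()
    for part in (str(pw).encode(), r, r_prime):
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


def real_world_game(m: int) -> bool:
    """Play the environment's game once in the real world.

    Returns True if the adversary wins, i.e. k_pw == K*.
    """
    # Environment: pick pw uniformly from D = {1, ..., m}.
    pw = secrets.randbelow(m) + 1

    # Protocol: the nonces r, r' cross the network in the clear.
    r = secrets.token_bytes(16)
    r_prime = secrets.token_bytes(16)

    # Honest party's output (assumed to be O(pw, r, r')).
    k_star = oracle(pw, r, r_prime)

    # Adversary: knows only r and r', evaluates O on the whole dictionary.
    ks = {i: oracle(i, r, r_prime) for i in range(1, m + 1)}

    return ks[pw] == k_star
```

The adversary never needs to know which entry is the right one; the environment does the comparison, which is why `real_world_game` returns `True` on every run.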
It is also not hard to see that no adversary can win this game in the ideal world interaction with probability better than $1/m + 1/2^\ell$ (where $\ell$ is the output length of $\mathcal{O}$). Observe the following:
- The only way to get the honest party to report output in the ideal world is for the simulator to initiate an online password guess.
- Since the honest $pw$ is uniform in $\mathcal{D}$, the probability of this guess being correct is $1/m$.
- Assuming the online guess is incorrect, the functionality delivers a uniform value $K^*$ to the honest party, independent of everything else. In particular, $K^*$ is independent of whatever value $k_{pw}$ the adversary sends to the environment, so the two can match only with probability $1/2^\ell$.
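The three observations above can be checked empirically with a toy model of the ideal world. This is a rough sketch, not the actual functionality: it assumes the simulator makes a single guess that is independent of $pw$, that a correct guess lets the simulator choose the honest party's key, and that an incorrect guess makes the functionality pick $K^*$ fresh. The names `fresh_key`, `ideal_world_game` and `estimate_win_rate`, and the choice $\ell = 256$, are hypothetical.

```python
import random

KEY_BYTES = 32  # ell = 256 bits, an arbitrary choice for this sketch


def fresh_key(rng: random.Random) -> bytes:
    """Sample a uniform ell-bit string."""
    return rng.getrandbits(8 * KEY_BYTES).to_bytes(KEY_BYTES, "big")


def ideal_world_game(m: int, rng: random.Random) -> bool:
    """Play the environment's game once against a toy ideal world.

    Returns True if the simulator wins, i.e. k_pw == K*.
    """
    # Environment: pick pw uniformly from D = {1, ..., m}.
    pw = rng.randint(1, m)

    # Simulator: exactly one online guess, made without knowing pw.
    guess = rng.randint(1, m)
    sim_key = fresh_key(rng)

    # Toy functionality: a correct guess lets the simulator fix the key;
    # otherwise K* is fresh and never revealed to the simulator.
    k_star = sim_key if guess == pw else fresh_key(rng)

    # Simulator's answer to the environment: it can only vouch for k_guess.
    ks = {i: fresh_key(rng) for i in range(1, m + 1)}
    ks[guess] = sim_key

    return ks[pw] == k_star


def estimate_win_rate(m: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the simulator's winning probability."""
    rng = random.Random(seed)
    wins = sum(ideal_world_game(m, rng) for _ in range(trials))
    return wins / trials
```

Under these assumptions the winning probability is $1/m + (1 - 1/m)/2^\ell$, so a call such as `estimate_win_rate(10, 20000)` should land close to $1/10$, in contrast to the real world where the adversary always wins.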
I think the essence of your question relates to the fact that extraction has to happen at the time of the protocol execution (that's what makes it an online password guess).
In this protocol, as soon as the parties have exchanged $r$ and $r'$, the real-world honest party will output a key, so the simulator will need to send a password guess to make the same thing happen in the ideal world.
But what if the adversary hasn't even queried $\mathcal{O}$ yet -- what will the simulator do?
This protocol fails on two counts. It does not limit the adversary to one password guess: every time the adversary queries $\mathcal{O}(pw',r,r')$, the simulator would have to know whether $pw'$ is correct in order to produce a consistent response. And it does not confine that guess to the period of time that the protocol is running.