This answer is written for classification, and I am not sure whether these approaches can be extended to reinforcement learning.
As you figured out, accuracy should not be used as a metric for a dataset as imbalanced as the one you have. Instead, look at a metric such as the Area Under the ROC Curve (AUC). If you had unlimited data, you could simply rebalance by removing samples from the majority class. In many cases, however, data is scarce and you want to use as much of it as possible; removing data can have a disastrous effect in many applications.
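As a small illustration of why AUC is the more informative metric here (this sketch uses scikit-learn and made-up numbers, neither of which appear in your question):

```python
# Illustration only: on a 95/5 class split, a model that always predicts
# the majority class looks great on accuracy but useless on AUC.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 95 + [1] * 5)   # heavily imbalanced labels
y_score = np.zeros(100)                  # constant score: "always the majority class"

print(accuracy_score(y_true, y_score > 0.5))  # 0.95 -- misleadingly high
print(roc_auc_score(y_true, y_score))         # 0.5  -- no discriminative power
```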
So what are good and convenient ways of handling this?
Add weights to the loss function: one weight for class A and one for class B. By increasing the magnitude of the loss for class B, the model is less likely to get stuck in the suboptimal solution of just predicting one class (a minimal sketch follows below).
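A minimal sketch of a class-weighted loss in PyTorch; the class counts and the inverse-frequency weighting are illustrative assumptions, not taken from your setup:

```python
import torch
import torch.nn as nn

# Assumed counts for the majority class A and the minority class B (illustrative).
n_a, n_b = 9000, 1000

# A common heuristic: weight each class inversely to its frequency,
# so mistakes on the rare class B cost more.
weights = torch.tensor([1.0 / n_a, 1.0 / n_b])
weights = weights / weights.sum()

criterion = nn.CrossEntropyLoss(weight=weights)

# Dummy usage: a batch of 8 samples with 2-class logits and integer labels.
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
```

If you are using Keras instead, the class_weight argument of model.fit gives you a similar shortcut without touching the loss itself.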
Use another objective (loss) function. The F1-score, for example, can be implemented and used as a differentiable objective; see https://stackoverflow.com/a/65320239/12229416 for one version (a sketch of the idea follows below).
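The linked answer is written for Keras/TensorFlow. Below is a minimal sketch of the same idea in PyTorch: a "soft" F1 that replaces hard counts with predicted probabilities so the score stays differentiable. The function name and the binary setup are my own assumptions, not from the link.

```python
import torch

def soft_f1_loss(logits, targets, eps=1e-7):
    """Differentiable surrogate for 1 - F1 on a binary problem.

    logits:  raw model outputs, shape (batch,)
    targets: ground-truth labels in {0, 1}, shape (batch,)
    """
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum()          # soft true positives
    fp = (probs * (1 - targets)).sum()    # soft false positives
    fn = ((1 - probs) * targets).sum()    # soft false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - soft_f1                    # minimize 1 - F1 to maximize F1

# Dummy usage
logits = torch.randn(8, requires_grad=True)
targets = torch.randint(0, 2, (8,)).float()
loss = soft_f1_loss(logits, targets)
loss.backward()  # gradients flow because every step is differentiable
```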
What is great about these approaches is that they allow you to use all of the data.