I am a beginner in programming, but I managed to get a little Pong game done. For my studies I had to understand an AI that solved the LunarLander-v2 environment of the Gymnasium API. It used deep learning and the cross-entropy method and was written in PyTorch.
I had the idea to put this AI into my Pong game to move one of the rackets. That worked, as far as I can tell. The problem is that the program doesn't learn.
A short description of the Pong game: it consists of two rackets (rectangles), one on each side of the screen, and one ball that starts moving in a random direction. The ball bounces off the rackets and heads back the other way. The goal is to get the ball past the racket on the other side to score a goal.
In my game the ball speeds up every time it is hit by a racket. To compensate, a racket speeds up when the same action (up or down) is carried out several times in a row, so that the player can still hit the ball as it gets faster.
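In case it helps, this is roughly how the speed-up works. It is only a simplified sketch; the constants and variable names here are made up for illustration, not my actual code:

```python
# Simplified sketch of the speed-up rules described above
# (constants and names are illustrative, not the real code).
BALL_SPEEDUP = 1.1       # ball gets faster on every racket hit
RACKET_ACCEL = 0.5       # racket gains speed while the same action repeats
RACKET_BASE_SPEED = 4.0

def update_speeds(ball_speed, racket_speed, hit_racket, action, last_action):
    if hit_racket:                              # ball bounced off a racket this frame
        ball_speed *= BALL_SPEEDUP
    if action == last_action and action != 0:   # holding "up" or "down"
        racket_speed += RACKET_ACCEL
    else:                                       # direction changed or "stay still"
        racket_speed = RACKET_BASE_SPEED
    return ball_speed, racket_speed
```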
I changed the right racket to a rectangle that covers the whole right side of the screen. That way the randomness of the ball movement does not produce random goals against the left side, which is controlled by the AI. My thought was that this way the AI "only" has to learn to follow the ball with the racket in order to hit it. To shape this behaviour I put in some rewards. I played around a bit with the features and rewards, but the results were always pretty bad.
Here is my last try:
- A hit gives 50 points, and conceding a goal takes away 30 points.
- There is a dynamic penalty that grows the further the ball is from the center of the racket; it is subtracted from the overall reward.
- Moving the racket up or down gives a small reward.
- Finally, I added a reward of one point for every frame (after the first 100, so the ball's spawning position doesn't count) in which the y-position of the ball lies between the upper and lower end of the racket.
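Put into code, the per-frame reward looks roughly like this. Again a simplified sketch: the exact constants (e.g. the 0.05 scale on the distance penalty) and helper values such as `dist_to_center` are just illustrative:

```python
# Rough sketch of the per-frame reward described above
# (constants and variable names are illustrative, not the exact code).
def frame_reward(frame, hit, conceded, moved, dist_to_center, ball_in_front):
    reward = 0.0
    if hit:                            # ball bounced off the AI's racket
        reward += 50.0
    if conceded:                       # ball went past the AI's racket
        reward -= 30.0
    reward -= 0.05 * dist_to_center    # dynamic penalty: ball distance from racket center
    if moved:                          # racket moved up or down this frame
        reward += 0.1
    if frame > 100 and ball_in_front:  # ball's y lies between the racket's ends
        reward += 1.0
    return reward
```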
Furthermore, I gave the AI 5 features (a sketch of how I assemble them into an observation follows the list):
- The y-position of the racket
- The y-position of the ball
- The speed of the racket
- A boolean indicating whether the ball is at the same y-height as any part of the racket's surface
- A boolean indicating whether the ball has hit the racket
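Roughly, the observation vector is built like this. The names and scaling constants are placeholders, not my real values; the scaling corresponds to the normalization I mention further down:

```python
import numpy as np

# Rough sketch of how I build the 5-feature observation
# (names and scaling constants are illustrative).
SCREEN_HEIGHT = 600.0
MAX_RACKET_SPEED = 20.0

def make_observation(racket_y, ball_y, racket_speed, ball_level_with_racket, ball_hit_racket):
    return np.array([
        racket_y / SCREEN_HEIGHT - 0.5,          # racket y-position, roughly centred on 0
        ball_y / SCREEN_HEIGHT - 0.5,            # ball y-position
        racket_speed / MAX_RACKET_SPEED,         # racket speed
        1.0 if ball_level_with_racket else 0.0,  # ball at same y-height as the racket surface
        1.0 if ball_hit_racket else 0.0,         # ball hit the racket this frame
    ], dtype=np.float32)
```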
For whatever reason, the mean reward of a batch of 1000 playthroughs always ended up around -200, and the reward threshold of the system (the 98th percentile, in the hope of better results) was between -30 and -100 points. These were the results after 1000 episodes. (In my head, one hit plus the movement reward should give enough positive reward to outweigh the negative reward from the dynamic distance penalty and from conceding a goal, but it doesn't seem to work like that...)
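For reference, the batch filtering follows the usual cross-entropy-method recipe; this is a simplified sketch of that step, not my exact code:

```python
import numpy as np

# Simplified sketch of the cross-entropy elite-filtering step.
# `batch` is a list of (episode_observations, episode_actions, total_reward).
def filter_batch(batch, percentile=98):
    rewards = [total_reward for _, _, total_reward in batch]
    reward_bound = np.percentile(rewards, percentile)  # the "threshold" mentioned above
    reward_mean = float(np.mean(rewards))

    train_obs, train_acts = [], []
    for obs, acts, total_reward in batch:
        if total_reward >= reward_bound:               # keep only the elite episodes
            train_obs.extend(obs)
            train_acts.extend(acts)
    return train_obs, train_acts, reward_bound, reward_mean
```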
I am using the Adam optimizer with a learning rate of 0.01. I divided all features by a fixed number (to normalize them around zero); this kept the probabilities the AI uses to choose the next action reasonable. The actions are 0, 1 and 2, with 0 = stay still, 1 = up, 2 = down.
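The network and training step follow the same cross-entropy-method tutorial structure. Roughly like this, where the hidden layer size and other details are placeholders rather than my exact values:

```python
import torch
import torch.nn as nn
import numpy as np

# Rough sketch of the policy network and one training step
# (hidden size and other details are placeholders).
net = nn.Sequential(
    nn.Linear(5, 64),    # 5 input features as listed above
    nn.ReLU(),
    nn.Linear(64, 3),    # 3 actions: 0 = stay, 1 = up, 2 = down
)
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()   # expects raw logits, so no softmax layer in the net

def choose_action(observation):
    logits = net(torch.as_tensor(observation).unsqueeze(0))
    probs = torch.softmax(logits, dim=1).data.numpy()[0]
    return np.random.choice(len(probs), p=probs)   # sample the action from the policy

def train_step(train_obs, train_acts):
    optimizer.zero_grad()
    logits = net(torch.as_tensor(np.array(train_obs), dtype=torch.float32))
    loss = loss_fn(logits, torch.as_tensor(train_acts, dtype=torch.long))
    loss.backward()
    optimizer.step()
    return loss.item()
```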
My main questions, in the hope that answering them will solve all the problems =P, are:
Are the rewards and features that I give to the AI good enough for it to be able to learn?
If not, how do I find good ones? Is there a systematic way?
Am I missing something? (I am pretty sure I am (; )
Thank you for your time. In hope of good news, K