Our design of algorithms class requires all students to enroll in an online AI competition, where each team has to come up with a bot. Before the final lockdown, each team is allowed to challenge any other team in order to test their strategies, including the random bot provided by the course assistants.
For the first testing round, each team had to play $10$ matches against a random bot provided by the course staff. By random I mean a bot that chooses a random move from the set of moves available to it in the current game state. For a draw with the random bot you get $0$ points, for a win you get $+1$ and for a loss you get $-1$ points.
Unlike other teams, we chose to avoid hardcoding, going instead with an altered version of the minimax algorithm that fits this game. Needless to say, our strategy is far from flawless, but it's a lot better than what most others came up with.
Relevant facts:
$\bullet$ We lost $5$ times and won $5$ times in the testing round. So we got $0$ points.
$\bullet$ During our practice matches we got an $80\%$ win rate against the random bot.
$\bullet$ Also during the practice matches, we played against many of the other competing teams. One of them had a very weak, hardcoded strategy; we won all $4$ of our matches against their bot. Another team, whose bot was also hardcoded, played us $4$ times: they beat us once and we beat them $3$ times. The former scored $10/10$ in the testing matches, while the latter scored $7/10$ ($3$ of their $10$ matches were draws).
$\bullet$ Neither of the $2$ teams I mentioned above updated their strategy between the time they played against us and the time of the testing.
$\bullet$ We had the worst score of all the teams tested in this round, though more than half of them were much weaker than us (as we saw in the practice matches).
$\bullet$ The rules of the game in question can be found here (MSE link).
Not much can be done about our wasted time, but I would really love to see if there's a mathematical way to quantify that their grading is really flawed. I'm certain the randomness factor is quite relevant here, but I don't have any training in probability theory or chaos theory, so I can't model this situation.
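The most I can manage is a naive sanity check, sketched below. It rests on assumptions I can't justify: that the matches are independent, that our true win probability against the random bot equals the $80\%$ practice rate (which is itself only an estimate), and that draws can be ignored. Under those assumptions the number of wins in $10$ matches would follow a binomial distribution, and the code computes the chance of winning $5$ or fewer. The function and variable names are just my own.

```python
from math import comb


def binomial_tail_at_most(n: int, k: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Assumptions (not established facts): independent matches, a true win
# probability equal to the 0.8 practice rate, and no draws.
p_practice = 0.8
n_matches = 10
wins_observed = 5

prob = binomial_tail_at_most(n_matches, wins_observed, p_practice)
print(f"P(at most {wins_observed} wins in {n_matches}) = {prob:.4f}")
```

If I did this right, it comes out to roughly $0.033$, i.e. about a $3\%$ chance of doing this badly if the practice rate were accurate. That only says our result was unlikely under this simple model; it doesn't by itself show that the grading is flawed.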
How would you mathematically prove the grading system is wrong?