Welcome to Heliosphan

Kaggle Chess Ratings Competition

Chess ratings - Elo versus the Rest of the World - This competition aims to discover whether other approaches can predict the outcome of chess games more accurately than the workhorse Elo rating system.

Data Set

65,053 rows of data of the form:

Month # White Player # Black Player # Score
1 73 1246 0.5

Score is 1 if white wins, 0 if black wins, 0.5 indicates a draw.
There are 8631 players and the month value ranges from 1 to 105, hence there is some scope for modelling player's changing skill level.


Predict outcomes for games for which no score has been revealed, the games to be predicted occurred after the period represented by the training data. The primary goal of the organisers was to find a chess player rating system better than the existing ELO rating system.


In a nutshell, I define each player as having a rating value. The probability of player A beating player B is simply A's rating minus B's rating. From there I use a gradient descent update rule to find the optimal ratings. Details as follows...

1) Each player (P) has a rating initialised to zero (Rp).

2) Probability of player A winning against player B is defined as W(A) = Ra - Rb. Any resulting probabilities outside the range [0,1] are simply truncated to [0,1]

3) Iterate over all training games and

   3.1) Calculate the prediction and square it.

   3.2) Calculate the squared error's rate of change (gradient) with respect to A's rating and also to B's rating.

   3.3) Update each player's rating using the gradient from 3.2 and a learning rate (a wide range of values work well, but 0.01 works well)

   3.4) Keep iterating until some probe RMSE figure stops falling (where probe RMSE is the RMSE calculated on a set of games that aren't being trained with)

4) To create a submission I do all of the above but use *all* of the available training data (no probe set). The number of iterations is now fixed at the number of iterations recorded during a previous run that used a probe set (I recorded when the error on the probe set stopped falling)

Some degree of tweaking was necessary to make this competitive. Most notably:

5) W(A) = (Ra - Rb) * 0.04   (found experimentally)

6) Game score (0, 0.5 or 1.0) used for error calculation is modified such that:
    better_score = (score - trainMeanScore) * 0.43) + trainMeanScore   (found experimentally)

7) Regularization. I tried decaying ratings towards zero but this didn't work. Oddly regularizing away from zero did. One possible hypothesis is that the 'reverse' regularization is helping the player ratings distribution towards its true shape, and that that shape has a long tailed distribution. Regularization factor is 0.005 (although this gets the learning rate applied on top of it, thus it's really 0.0005).

8) Anomaly. If I run the model and plot prediction error (Y-axis) versus player rating (X-axis), then I get a straight diagonal line. Predictions for low rated players are typically too high and vice versa. You might expect this to be a result of the 'reverse' regularization, but it appears to be a deeper feature of the data/model. The obvious (but dumb/wrong) way to compensate for this is to directly modify player ratings, this doesn't work because a given player generally plays against players of a similar rating, thus two competing players both get adjusted in the same direction and the anomaly remains. Adjusting by prediction error squared, i.e. stretching out the distribution of player ratings didn't work either.

The method that worked was to create a 3D/surface plot of player A rating (X) versus player B rating (Y) versus prediction error (Z). I then directly adjust player ratings at each game based on this plot.

9) Time. Adding in a linear 'slope' over time to each player's rating resulted in no improvement in predictive ability whatsoever. However, training against the most recent half of the data did make an improvement. Thus it seems there may be some change in ratings over time that is non-linear. Whether it can be modelled without over-fitting (too many variables!) is debatable; Given the relatively small improvement from using the most recent 50% of training data I would guess not.


All of the above ultimately got me to 10th position on the final leaderboard with an RMSE score of 0.70018 (lower is better), very closely behind the winning score of 0.69477. Also a number of interesting developments were made public on the forums after I had carried out the above work, so I possibly could have improved this score further. Possibly the most beneficial change would have been to use a probe data set that more accurately represented the set being used to score competition submissions - I used a random scattering of games across the whole data set while others observed better results using a probe set made up from the most recent games.

Also the final reported position is somewhat better than my position on the leaderboard in the final weeks of the competition where I had slipped down to approx 23rd place after making my final submission sometime earlier. The final scores were based on a broader data set and this seems to have significantly altered the final standings, boosting my entry (entries?) up to 12th place. (My high water mark was a brief stint at 4th place). This noise in the scoring system is a problem inherent to small data sets.

(some time in 2010)

Copyright 2010, 2011 Colin Green.
This article is licensed under a Creative Commons Attribution 3.0 License