What we've learnt from trueskill #3: the computer never forgets

Published on: May 7, 2018

Who here remembers the result of the M2- C Final at Final GB Trials last year?

Unless you've already clicked on the link, your answer is probably going to be 'no' or even 'come again?'. And that would be understandable - this is hardly the gold medal race for the Olympics or the final of the Grand at Henley. But believe it or not, this obscure little race had a huge impact on some rowers.

Don't believe it? Take a look at this rower's skill score over time:

[Chart: skill score over time]

Looking at this, you can see this rower's skill is relatively stable over time - until suddenly, their score collapses by 10 points in a single race. They go from slightly above average to significantly below average, all at once. Their pair partner suffers a similar, if not quite so devastating, 6-point deduction from the same loss.

To be clear, these sorts of collapses are extremely rare - this is the only one we've been able to find (although we haven't been looking very hard). And there is some context to this - the rowers in that pair won the C/D Semi-Final, and then promptly came last in the C Final. Whether through injury, exhaustion or just "CBA", they were being highly inconsistent, and this probably confused the computer.

But this isn't a pain-free novelty either. This person went on to row in the following year's boat race, so this odd result may have reduced the quality of predictions and will have affected the scores of other people they have raced with or competed against.

Lessons?

Are there any lessons here? It's not clear. Sure, major up or down revisions like this on the basis of obscure races aren't ideal. And the hangover from this odd result went on to have 'downstream' effects - as mentioned above, this person rowed in the following year's boat race, so it may have destabilised future predictions. But we do need to update people's scores on the basis of their results.

Weighting?

I guess there are three ways we could approach this differently. Firstly, we could introduce some sort of weighting system that downrates what you might call the 'offcut' results. Sure, everyone will be smashing it in the A Final and the precursor races (e.g. the A/B Semi). But once you've lost that, should we really be placing much weight on the result of the E Final? Similarly, pointless time trials (e.g. race for lanes, or 'first 2 to the Final, the rest to Reps') are highly susceptible to rowers 'going slow', since they know the result makes little difference to the final outcome, or to racing tactically to conserve energy.
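
To make that first option concrete, here's a minimal sketch of one way a weighting might work, using the Python trueskill library: compute the normal update, then only move part of the way towards it for low-stakes races. The race categories, weight values and blending rule are all invented for illustration - this is not how our system currently works.

```python
# Sketch only: blend the full TrueSkill update back towards the pre-race
# rating for low-stakes 'offcut' races. The weights and the blending rule
# are invented for illustration; they're not part of the trueskill library.
import trueskill

env = trueskill.TrueSkill()

# Hypothetical per-race weights (1.0 = a normal, full-strength update).
RACE_WEIGHTS = {
    "A Final": 1.0,
    "A/B Semi": 1.0,
    "E Final": 0.4,
    "race for lanes": 0.25,
}

def weighted_update(before, after, weight):
    """Move only `weight` of the way from the old rating to the new one."""
    mu = before.mu + weight * (after.mu - before.mu)
    sigma = before.sigma + weight * (after.sigma - before.sigma)
    return env.create_rating(mu, sigma)

# Example: a pair loses an E Final, so the hit to their scores is dampened.
crew_a = [env.create_rating(28.0, 4.0), env.create_rating(27.0, 4.5)]
crew_b = [env.create_rating(24.0, 5.0), env.create_rating(23.5, 5.0)]
new_a, new_b = env.rate([crew_a, crew_b], ranks=[1, 0])  # crew_b beat crew_a

weight = RACE_WEIGHTS["E Final"]
crew_a = [weighted_update(old, new, weight) for old, new in zip(crew_a, new_a)]
crew_b = [weighted_update(old, new, weight) for old, new in zip(crew_b, new_b)]
```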

The counterargument, though, is that (pointless time trials aside) these more minor results are exactly where you sort the wheat from the chaff and get good comparisons between peers - whether at Brit Champs or GB trials. The smaller GB trials results are a rare example of pairs racing between top-level club and university athletes, which makes them an invaluable source of information.

Ignore some races?

The second, more problematic, approach would be to declare that certain races are ignored, or their impact limited, on the basis that their freak outcome goes against all other known information. But this approach is riddled with issues: what makes a freak outcome? How do you detect it? If you were to cap the down-rating from an individual race, by how much? And wouldn't this have negative knock-on effects in non-elite contexts? If retired athletes reappear, we don't want them retaining excessively high scores. (If we introduced a ratings floor, as we discussed in our previous article, the impact of this would be lessened.)
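
For what it's worth, the mechanics of a cap are straightforward even if the policy questions aren't. A hedged sketch, with an entirely invented cap of 4 points:

```python
# Sketch only: limit how far a single race can pull a rating down.
# MAX_DROP is an invented figure; choosing it sensibly is the hard part.
import trueskill

env = trueskill.TrueSkill()
MAX_DROP = 4.0

def cap_downrating(before, after, max_drop=MAX_DROP):
    """Apply the normal update, but never let mu fall by more than max_drop."""
    mu = max(after.mu, before.mu - max_drop)
    return env.create_rating(mu, after.sigma)

# Example: the 10-point collapse from earlier would be clipped to 4 points.
before = env.create_rating(27.0, 3.0)
after = env.create_rating(17.0, 3.0)   # the freak result
print(cap_downrating(before, after))   # mu comes out at 23.0
```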

Just add more ... uncertainty

The third approach would be to add more uncertainty into our calculations. At the moment, as you race, the level of uncertainty associated with your skill score reduces, because the computer has cumulatively more information to work with. A residual amount of uncertainty is preserved, so it's unusual to have an uncertainty level lower than 3. We could increase this, however, based on the gap since a rower's previous race - e.g. a year's gap would restore your uncertainty back to its original level of 10. This would prevent the computer from 'locking in' odd scores and allow greater movement. But given that we mostly only have one or two results per rower per year, this is likely to stop scores from converging at all.
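
As a rough illustration of what that might look like: widen a rower's uncertainty before each race according to how long they have been away, ramping back up to the starting value of 10 after a full year. The blending rule below is just one possible shape, not something we've implemented.

```python
# Sketch only: widen a rower's uncertainty before a race based on how long
# it has been since they last raced. Ramping linearly (in variance) back to
# the starting sigma of 10 over a year is one arbitrary choice among many.
import math

SIGMA_START = 10.0   # uncertainty given to a brand-new rower
SIGMA_FLOOR = 3.0    # residual uncertainty after plenty of racing

def inflate_sigma(sigma, days_since_last_race, full_reset_days=365):
    """Blend the current sigma back towards SIGMA_START as the gap grows."""
    frac = min(days_since_last_race / full_reset_days, 1.0)
    # Blend variances (sigma squared) rather than the sigmas themselves.
    var = (1 - frac) * sigma ** 2 + frac * SIGMA_START ** 2
    return max(math.sqrt(var), SIGMA_FLOOR)

print(inflate_sigma(3.0, 30))    # ~4.1: a month off widens things a little
print(inflate_sigma(3.0, 365))   # 10.0: a year off restores the starting level
```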

A slightly different approach again would be to add a method of rating the 'volatility' - i.e. inconsistency - of each rower. This was implemented in the Glicko-2 rating system. If a rower was found to have more volatile results, their overall skill score would be downrated less by a single bad result. This has merits as an approach, but can't be implemented in TrueSkill and won't help with the situation of retired rowers making surprise reappearances.
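
This isn't Glicko-2 itself, but as a toy sketch of the idea: track how much a rower's score has been swinging about recently, and scale back the hit from a single shock result when their history is already noisy. Every name and constant here is invented for illustration.

```python
# Toy sketch of the volatility idea (not Glicko-2 itself, and not something
# the trueskill library supports): rowers whose recent results already swing
# around a lot have a single shock result dampened. Constants are invented.
from collections import deque

class VolatilityTracker:
    def __init__(self, window=6, baseline=2.0):
        self.recent_changes = deque(maxlen=window)  # recent |mu change| values
        self.baseline = baseline                    # a 'normal' change size

    def record(self, mu_change):
        self.recent_changes.append(abs(mu_change))

    def dampening(self):
        """Return a factor in (0, 1]; noisier histories give smaller factors."""
        if not self.recent_changes:
            return 1.0
        avg_change = sum(self.recent_changes) / len(self.recent_changes)
        return min(1.0, self.baseline / max(avg_change, self.baseline))

# Example: a rower with a jumpy history sees a 10-point hit cut to 4 points.
tracker = VolatilityTracker()
for change in (5.0, -6.0, 4.0):
    tracker.record(change)
print(-10.0 * tracker.dampening())   # -4.0 rather than -10.0
```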

As ever, we continue to reflect on the best way forward, but for the time being we're not planning to make any changes (although the third option above strikes us as the most promising next step). We want to see what happens when we add significantly more data to the system, and whether these creases we've identified get ironed out.

But we're open to your views - if you have any thoughts on what we've outlined above, let us know via Twitter or email (contact dot rowingstats at gmail dot com).