What we've learnt from trueskill #1: bubbles

Published on: March 3, 2018

In this series, we look at some of the subtler points that come up in our experimental ranking system.

Trueskill is based on a system of predicting the outcomes of rowing races, given information on who is in each crew and their previous scores. After each race, those scores are updated based on the result and the strength of the opposition.
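To make "updated based on the result and the strength of the opposition" concrete, here is a minimal sketch in Python. It is not the trueskill calculation itself: it swaps in a much simpler Elo-style update between two crews. The names `expected_win`, `update`, `SCALE` and `K`, and the values given to them, are assumptions made up for illustration.

```python
from typing import Tuple

# Illustrative Elo-style update, NOT the actual trueskill maths.
# SCALE and K are made-up values for a scale where the average score is 100.
SCALE = 10.0  # assumed: a 10-point gap means roughly 10:1 odds
K = 2.0       # assumed: the most points a single race can move


def expected_win(score_a: float, score_b: float) -> float:
    """Predicted probability that A beats B, from the score gap alone."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / SCALE))


def update(winner: float, loser: float) -> Tuple[float, float]:
    """Return the new (winner, loser) scores after one race."""
    # The less expected the win, the bigger the "surprise" and the bigger the move.
    surprise = 1.0 - expected_win(winner, loser)
    return winner + K * surprise, loser - K * surprise


# A 100-point crew gains more from beating a 110 than from beating a 90.
gain_vs_strong = update(100, 110)[0] - 100
gain_vs_weak = update(100, 90)[0] - 100
```

The point to notice is that every number here comes from a gap between two scores. Nothing in the update ever refers to an absolute standard.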

By design, the scores that the trueskill system attributes to rowers are only derived from race outcomes relative to other rowers. There isn't any way the computer can say that a particular performance is stronger than another - in other words, there isn't an absolute scale against which to compare results - it's all relative.

We can use this to produce rankings, and by quoting an individual's score you can gauge how good they are relative to the general population. For example, if someone has a score of 105, and the average score is 100, we could infer that they are an above-average rower.

But what happens when two people who've never raced against each other come up head to head?

If they've been competing against a broad range of people at major events (e.g. Fours Head, Marlow Regatta) then we can feel pretty confident that their skill scores are robust and have reasonable predictive value.

But imagine someone who has only raced at minor, regional regattas, clocking up "easy" wins. They could obtain a large number of points, and given that they would never race more elite athletes, there would be no data for the computer to "correct" their scores with.

Their score would only be meaningful within the "bubble" of the small world of regional heads and regattas, and would not be comparable to the scores of athletes racing in more competitive events. Because these rowers don't mix with the rest, two distinct silos of scores are formed - neither is validly comparable with the other.
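One way to spot bubbles like this is to treat rowers as linked whenever they have appeared in the same race, and then look for groups with no links between them. The following is a minimal sketch in Python, assuming race data arrives as a list of rower names per race. The function `find_bubbles` and the example names are hypothetical, not part of our actual system.

```python
from collections import defaultdict


def find_bubbles(races):
    """Group rowers into bubbles: sets linked by a chain of shared races."""
    parent = {}

    def find(x):
        # Follow links up to the representative rower of x's group.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we go
            x = parent[x]
        return x

    for rowers in races:
        if not rowers:
            continue
        first = find(rowers[0])
        for other in rowers[1:]:
            # Everyone in the same race ends up in the same group.
            parent[find(other)] = first

    groups = defaultdict(set)
    for rower in list(parent):
        groups[find(rower)].add(rower)
    return list(groups.values())


# Made-up example: Dee and Eve never meet anyone from the other group.
races = [["Ann", "Bea"], ["Bea", "Cat"], ["Dee", "Eve"]]
bubbles = find_bubbles(races)
```

Here Ann and Cat have never raced each other, but both have raced Bea, so they share a bubble. Dee and Eve form a second one. A single shared race is a very thin link, though, so rowers in the same group may still be only weakly comparable.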

However, of course, you don't know that to begin with - you can still compare their scores and make a numerical prediction. It's just that the prediction will be nonsense, because the scores are derived from different bubbles.

You could, for example, compare the score of a top GB female athlete with that of a reasonable club male rower and conclude that the woman would win easily (which could well be the case). But as should be obvious, the GB athlete's score here is derived only from racing other women, so the comparison breaks down when applied to male athletes.

It's for this reason that at the moment we have real trouble working out how to appraise lightweight rowers. Unlike men and women, who almost always race in Men (sometimes described as Open) or Women categories, lightweights sometimes race as lightweights and sometimes race as normal rowers - their position will normally fluctuate during the season. So should we produce a separate ranking, based only on races where we know everyone was racing as a lightweight? Can we include the results of lightweight GB trials with the rest of the group, or are we making an incorrect comparison across bubbles?

A further example is junior rowers. For a time they will compete only against other juniors, and their score will be based on that. But when they grow up and start racing as adults, are their prior scores still valid? Should we start again and reset them? What about juniors who really are very good - how do you compare a 26-year-old who is winning the Temple at Henley against a junior world champion?

And what happens when somebody competes against others they have never raced before (e.g. at international events)? Do we award them the default set of points, giving the losing Olympic athletes mediocre scores, or do we have to add some sort of correcting value? Or would we somehow have to calculate the standard of the event and then award unknown competitors points based on that standard? Should the U23 world champs be considered on the same basis as the Marlow or Met regattas?

We've introduced some workarounds already. Firstly, we avoid small or minor events (not that we would be likely to get crew data for them anyway) and don't bother with junior events at all - at least for the time being. And for international events (World Cup, World Champs and Olympics) we try to compensate for the high standard of the event by giving newcomers a starting score of 110 (10 points higher than normal).
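The newcomer rule can be sketched in a few lines of Python. The 100 and 110 starting values are the ones described above. The function name `starting_score` and the event-type labels are hypothetical, chosen only for illustration.

```python
DEFAULT_START = 100        # the normal starting score
INTERNATIONAL_START = 110  # newcomers at international events start 10 higher

# Hypothetical labels; the real data may name these events differently.
INTERNATIONAL_EVENTS = {"world cup", "world champs", "olympics"}


def starting_score(event_type: str) -> int:
    """Score given to a rower with no history, based on their first event."""
    if event_type.strip().lower() in INTERNATIONAL_EVENTS:
        return INTERNATIONAL_START
    return DEFAULT_START


olympic_newcomer = starting_score("Olympics")       # 110
club_newcomer = starting_score("Marlow Regatta")    # 100
```

This is a blunt fix: it treats every international event as the same standard, which is exactly the kind of assumption the questions above are probing.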

As you can probably tell, we don't have immediate answers to these questions. But we continue to think about them, and we'll keep updating our experimental system as we go. If you have any thoughts, drop us a message at @rowingstats.

Next in this series (coming soon): the curse of the GB triallist