Evaluation of Sports Performance: Cognitive Biases, Vectors and Visualization Challenges

This article is the cover story of a project that we developed or supported:

Off_the_charts

Date: 2023-06-15

Evaluation of human performance is notoriously difficult, even for an expert eye. The difficulty does not necessarily lie in the perception of performance, although that may require overcoming cognitive biases, but in the consistency of evaluations when different subjects are observed and/or when several observations take place at different points in time. Adding to the complexity, individuals may work in teams to achieve certain goals, so task completion or failure only tells one part of the story. For team sports, such a cynical, results-only approach just does not cut it.

Performance evaluation methodologies

To understand the intriguing process underpinning human evaluation of human performance, a mandatory starting point is to compare human assessments against a dataset that gives a complete picture of the observed performance, using data points collected according to a strict methodology. And you must be able to do this at scale, if you want your results to be meaningful.

This is precisely what Pappalardo et al. tried to achieve in 2017 by comparing the statistics of football players during games with the ratings these players received from experts in sports newspapers the next day. As in many sports since sabermetrics became part of popular culture (a shift the Moneyball book and movie largely contributed to), football statistics are abundant and the collection methodology is sufficiently consistent. The researchers built performance vectors with n features representing events or actions of players during Serie A games (e.g., a goal, a successful tackle or a failed pass), and subsequently used these vectors as benchmarks for the ratings published in three different Italian sports newspapers. The paper offers many insightful conclusions:

  • Even ignoring individual cognitive biases, a number of contextual considerations play an important role, such as a game’s anticipated result as estimated by bookmakers (at a macro level) or the quality of a pass, which can’t really be captured in statistics (at a micro level).
  • Judges proceed through heuristics, focusing on a small number of noticeable events and discarding the vast majority.
  • In particular, the team found that only about 20 features (events) were considered during the evaluation.
  • The features judges focused on varied depending on the players’ position on the pitch (technical features for goalkeepers and forwards, but collective features for defenders and midfielders).
  • Human judges tend to agree with each other, although they still disagree about 20% of the time.

This last aspect may have driven Pappalardo and his team to focus on data-driven assessments, even though these imply dropping contextual aspects that may not be systematically mirrored in the available data.

Source: Human Perception of Performance, Pappalardo et al., 2017

In 2019, Pappalardo teamed up with Wyscout to present PlayeRank, a metric "measuring performance quality in all of its facets [...] as performance is an inherently multidimensional concept". In a nutshell, they gave each feature a weight to obtain a vector, which they subsequently multiplied by a vector containing the feature counts for a specific game. Through this process they obtained a single number: a rating which they used to rank players. It may sound extremely complex (and it is quite complex when you read the paper), but you have probably come across this exact methodology if you have ever played fantasy sports: players’ scores are obtained by multiplying two vectors, a feature-count vector and a feature-weight vector.

performance rating of player u in match m
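To make the mechanics concrete, here is a minimal sketch of that two-vector multiplication; the feature names and weights below are purely illustrative, not the learned PlayeRank weights.

```python
import numpy as np

# Illustrative feature weights (made up for this example, not the learned PlayeRank weights)
weights = np.array([
    1.0,   # goal
    0.6,   # assist
    0.1,   # accurate pass
    -0.4,  # red card
])

# Feature counts for player u in match m: 1 goal, 0 assists, 52 accurate passes, 0 red cards
counts = np.array([1, 0, 52, 0])

# The rating is simply the scalar product of the weight vector and the feature-count vector
rating = float(np.dot(weights, counts))
print(rating)  # 6.2
```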

Building on this approach, a team led by Aydemir at the University of Ankara proposed in 2021 an even more advanced methodology taking into account game and competition difficulty, as well as a time-decay rate specifying the importance given to older matches. The paper describes very clever ways to incorporate contextual adjustments and it's definitely worth a deep dive if you are into complex modelling.

Game difficulty assessed using Elo scores based on roster values
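As an illustration of the time-decay idea, here is a minimal sketch that down-weights older matches; the exponential form and the half-life value are assumptions made for the example, not the paper's exact formulation.

```python
import numpy as np

def time_weighted_average(ratings, days_ago, half_life=90.0):
    """Average match ratings with exponentially decaying weights.

    half_life: number of days after which a match counts half as much
    (illustrative value, not taken from the paper).
    """
    ratings = np.asarray(ratings, dtype=float)
    days_ago = np.asarray(days_ago, dtype=float)
    weights = 0.5 ** (days_ago / half_life)
    return float(np.sum(weights * ratings) / np.sum(weights))

# Three matches: the most recent one dominates the average
print(time_weighted_average(ratings=[70, 50, 40], days_ago=[7, 120, 300]))
```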

The approach may be too complex for real-life applications unless you are a scout for a professional football club. The scaling applied at various stages of the computation process would likely hinder mainstream users from finding the metric relevant. Accessibility is key to adoption, so users must be able to make a rough estimate of the result through a heuristic process. That’s why Fantasy Sports platforms use simplified matrices as a good compromise between understandability and a fair reflection of performance quality given context. As the only absolute requirements are consistency across players (notably across different positions) and consistency over time, simple weight vectors are the perfect tool for the job. Simplification can result in frustration for users, but that's not necessarily a bad thing for a gaming platform, as frustration is a necessary evil to maximize the release of dopamine.

Feeding the information back to end-users

Using the vector approach, it is straightforward to compute a player's rating or score. However, a user who only receives the computed score cannot back-solve the exact nature of a player's performance. The result itself does not say anything about strengths and weaknesses, as many vectors can lead to the same score; the result simply indicates that a player performed well or poorly relative to others according to the scoring matrix. The mathematical explanation lies in one of the most fundamental principles of information theory: an integer or a number with few decimals (the output of the vector multiplication) can only carry a limited amount of information, quantified in bits. Even if you add colors or other types of decoration, there is a limit to the information capacity of the rendered digits. And this amount of information is disproportionately small compared to the amount of information carried by the input. This raises the following question: is it possible to increase the information capacity of the output, knowing that the output MUST be a numeric value (an integer) to allow comparisons?
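To make that loss of information concrete, here is a small sketch (with made-up weights) in which two very different performances collapse to the same score:

```python
import numpy as np

# Made-up weights for: goal, assist, accurate pass, tackle won
weights = np.array([10.0, 6.0, 0.1, 1.0])

# Player A scores a goal but is otherwise quiet; player B has a strong all-around game
player_a = np.array([1, 0, 20, 2])
player_b = np.array([0, 1, 60, 2])

print(np.dot(weights, player_a))  # 14.0
print(np.dot(weights, player_b))  # 14.0 -> same score, very different performances
```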

In the rest of this article, we use the Sorare matrix for illustration purposes. This matrix presents a decent level of complexity (c.50 features) and appropriate weights based on our experience across several Fantasy Football platforms*. Sorare scores range from 0 (floor) to 100 (cap), with a median around 45 for starting players (players start games with a base score of 35). The matrix is available in our GitHub repository of the data visualization project (scoringMatrix here).
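With that structure (base score of 35, floor of 0, cap of 100), a score would be computed roughly as follows; the feature names and weights are illustrative and not the actual values from the scoringMatrix file.

```python
import numpy as np

BASE, FLOOR, CAP = 35.0, 0.0, 100.0

def sorare_style_score(weights, counts):
    """Base score plus weighted feature counts, clamped to [FLOOR, CAP]."""
    raw = BASE + float(np.dot(weights, counts))
    return min(max(raw, FLOOR), CAP)

# Illustrative weights for: goal, assist, accurate pass, yellow card
weights = np.array([30.0, 20.0, 0.2, -5.0])
counts = np.array([1, 0, 40, 1])
print(sorare_style_score(weights, counts))  # 35 + 30 + 8 - 5 = 68.0
```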

Despite its extreme reliance on data, the Fantasy Sports industry surprisingly lacks solutions to visualize the output of complex scoring matrices. Most Fantasy Sports companies stick to unidimensional representations, i.e. a single number that is the result of the multiplication of two vectors as explained above. This fails to capture the complexity of their scoring, and to represent what actually happened on the pitch.

Adding a second dimension was an important step implemented by SorareData*, the main provider of data dedicated to Sorare's Fantasy Sports platforms (covering football, baseball and basketball).

Based on user feedback, the split between Decisive Score and All Around Score is highly valuable.

The terms for the two sub-scores were coined by Sorare themselves to explain how they built the matrix in the first place. The Decisive Score includes a limited number of low-frequency-but-high-impact features such as goals or assists (positive impact), or penalties conceded and red cards (negative impact). The All Around Score encompasses a large number of features that are less important when taken individually, like passes or tackles, but which, taken together, can also reflect a truly outstanding performance.
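A minimal sketch of such a split, assuming each feature is simply tagged as decisive or all-around; the feature names and weights are illustrative, not Sorare's actual matrix:

```python
# (name, weight, is_decisive) -- illustrative values only, not Sorare's actual matrix
FEATURES = [
    ("goal",           30.0, True),
    ("assist",         20.0, True),
    ("red_card",      -15.0, True),
    ("accurate_pass",   0.2, False),
    ("tackle_won",      1.5, False),
]

def sub_scores(counts):
    """Return (decisive, all_around) point totals for a dict of feature counts."""
    decisive = sum(w * counts.get(name, 0) for name, w, is_dec in FEATURES if is_dec)
    all_around = sum(w * counts.get(name, 0) for name, w, is_dec in FEATURES if not is_dec)
    return decisive, all_around

print(sub_scores({"goal": 1, "accurate_pass": 45, "tackle_won": 3}))  # (30.0, 13.5)
```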

Sorare scores as displayed on the Sorare Website
Sorare scores as displayed on the SorareData website

There are therefore two sub-scores, which make it possible to compare the more technical skills expected of certain positions with more collective skills, and that is consistent with Pappalardo’s findings in the 2017 paper.

The second dimension alone is not necessarily sufficient to compare athletes based on their respective strengths and weaknesses. Certainly not if you try to compare them across various positions (forwards, midfielders, defenders or goalkeepers).

To provide a detailed breakdown, the gaming and sports industries typically use the same tools: polar charts or radar charts. These charts are easy to understand at a glance and represent the percentile in which the athlete sits for each dimension. In other words, they show the athlete’s strengths and weaknesses compared to a predetermined benchmark. For Sorare, the radar could be split to represent the multiple dimensions of each of the Decisive Score and the All Around Score.

StatsBomb - Midfielder view
Sorare matrix - double-radar view (butterfly radar)
blue: Erling Haaland (avg. c.70) / red: Edin Dzeko (avg. c.50)
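For readers who want to reproduce this kind of view, here is a minimal radar-chart sketch with matplotlib; the categories and percentile values are made up, and this is a single generic radar rather than the exact butterfly layout shown above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up percentiles (0-100) for six categories
categories = ["Goals", "Assists", "Passing", "Dribbling", "Tackling", "Aerials"]
percentiles = [92, 60, 55, 70, 30, 85]

# One angle per category; repeat the first point to close the polygon
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
values = percentiles + percentiles[:1]
angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, color="tab:blue")
ax.fill(angles, values, color="tab:blue", alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 100)
plt.show()
```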

However, many features of a football scoring matrix are relatively rare events, in particular on the decisive side. For example, with one goal, an athlete is in the 90th percentile, even on a logarithmic scale. So the shape of the resulting “butterfly radar” does not differentiate a player with an average score of 50 from a player with an average score of 70 if their respective scores rely heavily on goal-scoring.
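A quick way to see why rare events saturate the radar: if the vast majority of players score zero goals in a game, a single goal already beats more than 90% of the sample (the player counts below are made up):

```python
import numpy as np

# Made-up distribution of goals per player per game: most players score none
goals = np.array([0] * 920 + [1] * 65 + [2] * 13 + [3] * 2)

# Share of players strictly below one goal: a single goal already beats 92% of the sample
print(np.mean(goals < 1) * 100)  # 92.0
```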

The fact that one would expect to identify differences in scoring optically, based on the size of the area, leads to an interesting observation though: the message carried by the shape itself can be more complex than initially envisaged; not only do the vertices carry information, the area does too. If you can make the area grow proportionally to the athlete’s score, you can leverage the information from the vector multiplication result AND, at the same time, provide a breakdown of the athlete’s strengths and weaknesses through the vertices.

The first step is to identify the area corresponding to the base score, i.e. 35 for Sorare football (it could be zero for matrices allowing negative scores), and to arrange the features of the scoring matrix in a circle around the base. The base can then expand or shrink (proportionally) along each segment. Features can be combined by categories to reduce the number of segments, in particular on the All Around side. Note that negative points eat into the base along their segments until the center is reached, at which point a "doughnut hole" forms and spreads outwards from the center. When that happens, all the negative points associated with features get aggregated into the doughnut hole, which again results in a situation where the output (the hole at the center) carries significantly less information than the input (the negative points associated with each feature). For that reason, the solution is not entirely satisfactory. But you can judge for yourself based on this video comparing three very similar players (same average score, same position).

Arranging features around the base
off_the_charts app - demo
3 midfielders with similar scores (avg. between 72 and 74 over the period)
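For readers curious about the mechanics, here is one possible way to map per-category net points to segment radii so that the covered area grows with the score; this is a simplified sketch under our own assumptions, not necessarily the exact geometry used by the off_the_charts app.

```python
import numpy as np

BASE = 35.0  # Sorare base score

def segment_radii(category_points, base_radius=1.0):
    """Radius per segment: the base circle expands or shrinks with each category's net points.

    A wedge's area grows roughly with the square of its radius, so radii scale with the
    square root of (base + points) to keep the area roughly proportional to the points.
    Negative categories shrink towards the center; radii are floored at zero
    (the 'doughnut hole' case).
    """
    pts = np.asarray(category_points, dtype=float)
    scaled = np.clip((BASE + pts) / BASE, 0.0, None)
    return base_radius * np.sqrt(scaled)

# Example: four categories, one of them heavily negative
print(segment_radii([30.0, 8.0, 0.0, -40.0]))  # last radius collapses to 0 (hole)
```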

And it's not just a demo video: to let you explore further, we have built a web app implementing this visualization technique as well as other features. It was not an overly ambitious project from a development perspective, but you should still be able to play with it here. It covers pretty much all football players in the world, and stats are available for all competitions covered by Sorare (which means basically all competitions covered by their data provider Opta). We do not store any information about the players and their stats; the app fetches the information directly from Sorare each time a user looks a player up. In the interest of performance, scores are only pulled from July 1st, 2022 onwards. The source code of the app is available on GitHub.

*Disclosure: At the inception of his entrepreneurial journey in 2020, I supported the founder of SorareData, a data provider focusing on Sorare's fantasy sports platforms. More information about this project is available here. I acted as Product Owner of SorareData for about a year, providing the team with support for feature conception and designing the paywall. Although I rarely catch up with the team, I still own equity and I am an active user, as I use their tools when I register Fantasy Sports lineups on Sorare. I was also a business angel of Sorare at its inception and exited in 2021. I still own Sorare NFTs from that period, which I use to play on a regular basis.
