In November 2017, many major news outlets reported that the Tokyo 2020 Olympics would see the debut of artificially intelligent judges, a monumental first for the sport of gymnastics and the field of artificial intelligence.
The technology, currently in development by Fujitsu and the Japan Gymnastics Association, uses sophisticated 3D laser sensors to track athletes’ movements in real time. It measures the positions of an athlete’s hands and feet and matches them against a dictionary of human movements, allowing it to accurately identify the skill performed, how many twists were executed, and so on.
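To make the matching idea concrete, here is a minimal, hypothetical sketch. The skill names and feature vectors below are invented for illustration only; the real system works on streams of 3D sensor data, and its internals are not public:

```python
import math

# Hypothetical, highly simplified sketch: each skill in the "dictionary"
# is represented as a small vector of features (e.g. tuck angle, body
# position, twist count). A real system would compare whole time series
# of 3D sensor readings, not a single feature vector.
SKILL_DICTIONARY = {
    "double tuck": [1.0, 0.2, 0.0],
    "double layout": [0.1, 0.1, 0.0],
    "full-twisting double tuck": [1.0, 0.2, 1.0],
}

def identify_skill(measured_features):
    """Return the dictionary skill closest to the measured features."""
    def distance(template):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(measured_features, template)))
    return min(SKILL_DICTIONARY, key=lambda name: distance(SKILL_DICTIONARY[name]))

print(identify_skill([0.95, 0.25, 0.9]))  # closest to the full-twisting double tuck
```

Even in this toy form, the limits of a fixed dictionary are visible: any movement not in the library is forced onto its nearest known neighbor.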
For the upcoming 2020 Olympics, the technology will act only as an assistant to the judges, but researchers eventually hope to eliminate the need for human judges entirely. Given the complex nature of gymnastics routines, there are many questions as to whether this vision can be fully realized with Fujitsu’s proposed model. There are also ethical and security-related concerns that will be difficult to overcome. Weighing the pros and cons of the technology can help us understand whether the need for artificially intelligent judges is justified.
Strange as it may seem that AI could one day replace human judges, some sports have already begun implementing technology to automate timing and scoring procedures. At the 2018 Winter Olympics, Sven Kramer won the men’s 5,000-meter speed skating race over second-place finisher Ted-Jan Bloemen by two hundredths of a second. To any spectator, let alone any judge watching, the two skaters appeared to finish at the same time; the naked eye could not determine who finished first. The referees needed slow-motion replay footage to resolve the minuscule difference between the skaters.
A difference like that can have profound effects on the outcome of a competition. In gymnastics, tenths, hundredths, or even thousandths of a point can drastically change an athlete’s placing in the standings. At the 2011 World Championships, Jordyn Wieber won the all-around gold over Viktoria Komova by 0.032, a margin smaller than any single deduction in the Code of Points. Had another panel of judges scored them, the outcome would not necessarily have been the same. This is one of the many issues that Fujitsu’s new AI technology aims to solve. In practice, though, how effective will this kind of technology be?
Sophisticated movement tracking will produce more accurate scores, particularly for the execution score. Scoring a gymnast’s execution is where the AI technology will be most helpful: many gymnastics skills are performed at high speed, so a judge may misperceive deductions or miss them entirely. For example, the Code of Points differentiates between a step on landing that is shoulder width apart, which incurs a 0.3 deduction, and a narrower step, which incurs only 0.1. That 0.2 difference can determine whether a gymnast finishes on the podium or qualifies for a major competition. This is part of the reason judging panels have multiple judges: cumulatively, the panel can try to account for every deduction. Fujitsu’s technology, however, may provide a more accurate score while eliminating the panel altogether.
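A rule like the step-width deduction is exactly the kind of objective criterion a machine can apply consistently. As a sketch of how it could be encoded (the function name is invented and the threshold logic is a simplification of the actual Code of Points language):

```python
def landing_step_deduction(step_width_m, shoulder_width_m):
    """Illustrative encoding of the step-width rule described above:
    a step roughly shoulder width or wider costs 0.3, a narrower step
    costs 0.1, and a stuck landing (no step) costs nothing."""
    if step_width_m <= 0:
        return 0.0
    if step_width_m >= shoulder_width_m:
        return 0.3
    return 0.1

print(landing_step_deduction(0.5, 0.4))  # wide step: 0.3
print(landing_step_deduction(0.2, 0.4))  # small step: 0.1
```

With precise measurements of foot positions from the sensors, this kind of rule never has to be judged by eye at full speed.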
Introducing AI into gymnastics judging may also alleviate the issue of bias. There are rules already in place that attempt to prevent obvious biases. Judging panels are composed of judges from multiple nations, for example, and if one judge’s score is drastically different from the others, the head judge can advise the deviating judge to change their score.
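A minimal sketch of how such a deviation check might work, assuming a simple median-based rule (the judge labels, scores, and tolerance value below are all illustrative, not taken from the Code of Points):

```python
from statistics import median

def flag_deviating_judges(panel_scores, tolerance=0.2):
    """Flag judges whose score is more than `tolerance` away from the
    panel median; the head judge could then ask them to reconsider."""
    panel_median = median(panel_scores.values())
    return [judge for judge, score in panel_scores.items()
            if abs(score - panel_median) > tolerance]

print(flag_deviating_judges({"A": 8.9, "B": 9.0, "C": 8.5}))  # ['C']
```

The weakness the article goes on to describe is already visible here: a bias smaller than the tolerance slips through undetected.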
Subtle biases can still go undetected, however, particularly when a judge is scoring a routine by an athlete from their own nation. Andrew Duong, a Berkeley student, performed a statistical analysis of scores from the 2008 Beijing Olympics, comparing them against judges’ nationalities. He found positive biases, reflected in higher scores, when a judge scored an athlete of the same nation. Conversely, some judges showed negative biases toward out-group nations, resulting in lower scores.
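The core of that kind of analysis can be sketched in a few lines. The scores below are synthetic and only illustrate the method of comparing same-nation against different-nation scoring; they are not Duong’s data:

```python
from statistics import mean

# Synthetic example data: (judge_nation, athlete_nation, score).
scores = [
    ("USA", "USA", 9.1), ("RUS", "USA", 8.8), ("CHN", "USA", 8.9),
    ("USA", "RUS", 8.7), ("RUS", "RUS", 9.0), ("CHN", "RUS", 8.8),
]

same = [s for judge, athlete, s in scores if judge == athlete]
diff = [s for judge, athlete, s in scores if judge != athlete]

# A positive value suggests judges score their own athletes higher.
bias = mean(same) - mean(diff)
print(f"same-nation bias: {bias:+.3f}")  # same-nation bias: +0.250
```

A real study would also need a significance test, since a small gap can arise by chance; the point is simply that the bias is measurable after the fact, not visible to the head judge in the moment.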
Whether these biases are conscious or not, they have adverse effects on competitions. The difference between one judge’s score and the others’ might be too small to flag as bias, yet it still affects the final score, putting athletes at an unfair advantage or disadvantage.
This issue may also skew athletes’ perception of the quality of their routines. If an athlete competed a routine with similar mistakes at two different competitions, the scores could differ simply because of who was on the judging panel. In theory, Fujitsu’s technology will solve this problem by guaranteeing that routines are scored the same way at every competition.
Furthermore, it may mitigate the issue of fatigue. Accurate judging is demanding: judges must track multiple features at once, usually at high speed, which depletes cognitive resources. With multiple flights occurring per day, and each competition lasting two or three hours, judges often work from early in the morning to late in the evening. As the day goes on, it becomes increasingly difficult to maintain the same quality of scoring. One fix would be to rotate the panels of judges throughout the day, but differences in scores may persist because a new set of judges may, for example, have a different style of judging. Executing the same algorithm for each athlete should eliminate these disparities.
In a perfect world, this is exactly how an algorithm would perform: execute a sequence of actions in an unbiased manner. The problem is that the chosen sequence is developed by a technical expert and potentially other stakeholders, who will have their own biases. Those biases will be transferred to the algorithm.
How transparent will these algorithms be? Even if athletes and coaches have a general understanding of how scores are calculated, it will be difficult to recognize and call out biases; trust falls to the developers to ensure that the algorithm scores athletes fairly. Athletes will also want to know how they can improve. Can they learn how if the algorithms are not transparent?
Although introducing AI may solve the issue of inconsistent judging, gymnastics is an unpredictable sport. A gymnast may make a mistake and cover it up by performing a different skill. At the 2017 U.S. Championships, Jordan Chiles intended to perform a double wolf turn, but she lost her balance after the first turn and ended up spinning into a standing triple spin. After her routine ended, the judges took extra time to decide how the turns should be credited in her difficulty score.
Fujitsu’s proposed technology would also have trouble scoring such a routine: it relies on a fixed dictionary of movements, and it is unclear how it would handle unfamiliar skills. The same situation arises when athletes submit new skills to the Code of Points, which requires performing the skill successfully (i.e., without falling). Would the system be able to add the new skill to its knowledge base?
Another issue facing AI judging is how to deduct for artistry in floor and beam routines. Artistry is one of the things that make gymnastics such a unique sport. Beyond just performing skill after skill, floor routines in particular allow a gymnast to express their personality, creating a personal blend of dance, leaps, and tumbling lines in conjunction with music.
Judging the artistic aspect of a routine has been a contentious subject for human judges for years, as the definition of an “artistic” routine is vaguely worded in the Code of Points. In the 2017 Code of Points, one of the possible deductions for the E-score is described as “insufficient artistry of performance,” including lack of confidence, fluency, and personal style. Can Fujitsu’s current vision for AI judging be extended to include artistry deductions? So far, it is only being built to map the athlete’s perceived actions to a set library of movements. The aforementioned qualities — personal style in particular — cannot be reduced to a set number of movements. The idea that any machine, or any human for that matter, would attempt to do so would be absurd, as there is no limit to the number of dance movements that can be created.
Artistry is highly subjective; what one person finds artistic, another may not. Some prefer the Russian gymnasts, known for their beautiful balletic style, while others are drawn to the precise movements of the Chinese gymnasts, who have their own iconic style. Compare these gymnasts to college gymnasts, whose routines tend to feature hip-hop inspired movements set to upbeat dance music. Someone unaccustomed to this type of unconventional routine may not consider it artistic, so is it possible to design an algorithm that captures a universal account of artistry?
On one hand, if the same algorithm used to detect artistry (or lack thereof) in a routine is used across all competitions, this guarantees consistent judging. The problem is that this doesn’t completely eliminate subjectivity. What the algorithm considers artistic will be at the discretion of its creators, the developers, and the experts involved. This is a small, centralized group of people who are speaking on behalf of the gymnastics community, and there is no guarantee that all viewpoints will be considered.
This approach could work if many different opinions were taken into account with a large committee overseeing the creation of the algorithm. Still, there’s a level of skepticism about finding a concrete way of defining “artistry,” since many are already doubtful of the current scoring system’s methods. Perhaps this speaks to a bigger issue in the sport as to how artistry should be accounted for in an athlete’s score, but in the end, because of this lack of consensus, artificially intelligent judging will be most effective and reliable if it evaluates solely objective deductions.
If judging relied solely on AI technology, what would happen if a device suffered a technical failure in the middle of a routine? What if a security breach caused the system to malfunction? There is no room for failure, because such an incident could have catastrophic effects on the competition: an athlete could receive an inaccurate score, or worse, no score at all. The only apparent remedies, having the athlete repeat the routine or consulting video replay footage, are cumbersome and would interrupt the competition. Given the risks, it is worth asking whether competitions should depend entirely on technology.
Gary Marcus points out that neural networks are, at their core, a complicated combination of statistical techniques, and statistical methods are susceptible to errors and deviations. This means the mechanisms behind deep learning may not be reliable enough for use in competitions, where mistakes cannot be afforded; the consequences would be too great.
Though human biases may show through in scores, it is easier to monitor and mitigate those issues during a competition. Fujitsu’s technology may be able to judge bar routines and vault skills autonomously, but for beam and floor, where choreography is involved, the responsibilities should be divided: provided the software is secure and thoroughly tested, it can handle the D-score calculation and objective E-score deductions, while human judges remain responsible for subjective E-score deductions such as artistry.
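Under the FIG system, the final score is the D-score plus the E-score, and the E-score starts from a 10.0 base before deductions. A minimal sketch of how the proposed split might combine machine and human input (the function name and the numbers in the example are illustrative):

```python
def final_score(d_score, objective_deductions, artistry_deductions):
    """Hybrid scoring sketch: the software supplies the D-score and the
    objective E-score deductions (landings, steps, bent knees), while
    human judges supply the subjective artistry deductions."""
    e_score = 10.0 - sum(objective_deductions) - sum(artistry_deductions)
    return d_score + e_score

print(final_score(5.8, [0.1, 0.3], [0.1]))  # 5.8 + (10.0 - 0.5) = 15.3
```

Because the two sets of deductions are simply summed, the machine and human contributions stay cleanly separated and each can be audited on its own.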
Rather than replace human judges completely, perhaps artificially intelligent judges should act only as assistants, because human judges are better equipped to handle novel situations. Ultimately, this will help ensure a future where athletes are scored as fairly as possible while competitions still reap the benefits of accurate real-time scoring.
Article by Shannon Lee