The Pros and Cons of Artificially Intelligent Judges in Gymnastics

In November 2017, many major news outlets reported that the Tokyo 2020 Olympics would see the debut of artificially intelligent judges, a monumental first for the sport of gymnastics and the field of artificial intelligence.

The technology, currently in development by Fujitsu and the Japan Gymnastics Association, will use sophisticated 3D laser sensors to track athletes’ movements in real time. The main idea is that it will measure the positions of an athlete’s hands and feet and match them against a dictionary of human movements, allowing it to accurately identify the skill performed, how many twists were executed, and so on.
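
Fujitsu has not published the internals of this matching step, but the idea can be illustrated with a minimal Python sketch, assuming a nearest-neighbor comparison between a tracked pose sequence and a dictionary of reference movements; the skill names, array shapes, and distance threshold below are all hypothetical:

```python
import numpy as np

# Hypothetical dictionary: skill name -> reference pose sequence,
# stored as (frames x joints x 3) arrays of joint coordinates.
SKILL_DICTIONARY = {
    "double_tuck": np.zeros((30, 18, 3)),  # placeholder reference data
    "full_twist":  np.zeros((30, 18, 3)),
}

def identify_skill(observed, dictionary, max_distance=0.5):
    """Return the dictionary skill closest to the observed pose sequence,
    or None if nothing is close enough (a novel or covered-up skill)."""
    best_name, best_dist = None, float("inf")
    for name, reference in dictionary.items():
        # Mean joint-wise Euclidean distance; a real system would need
        # temporal alignment first (e.g. dynamic time warping).
        dist = np.mean(np.linalg.norm(observed - reference, axis=-1))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None
```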

For the upcoming 2020 Olympics, the technology will act only as an assistant to the judges, but researchers eventually hope to eliminate the need for human judges entirely. There are many questions as to whether this vision can be fully realized with Fujitsu’s proposed model, given the complex nature of gymnastics routines. There are also ethical and security-related concerns that will be difficult to overcome. Weighing the pros and cons of the technology can help us understand whether the need for artificially intelligent judges is justified.

Strange as it may seem that AI could one day replace human judges, some sports have already begun implementing technology to automate timing and scoring procedures. At the 2018 Winter Olympics, Sven Kramer won the men’s 5000 meter speed skating race over second-place finisher Ted-Jan Bloemen by two hundredths of a second. During the race, to any spectator, let alone any judge watching, the two skaters appeared to finish at the same time; the naked eye could not determine who finished first. The referees needed slow motion replay to resolve the minuscule difference between the skaters.

A difference like that can have profound effects on the outcome of a competition. In gymnastics, tenths, hundredths, or even thousandths of a point can drastically change an athlete’s placing in the standings. At the 2011 World Championships, Jordyn Wieber won the all-around gold over Viktoria Komova by 0.032, a margin smaller than any single deduction in the Code of Points. Had a different panel judged them, the outcome would not necessarily have been the same. This is one of the many issues that Fujitsu’s new AI technology aims to solve. In practice, though, how effective will this kind of technology be?

Sophisticated movement tracking should produce more accurate results, particularly for the execution score. Scoring the gymnast’s execution is where the AI technology will be most helpful: many gymnastics skills are performed at high speed, so a judge may misperceive deductions or miss them entirely. For example, the Code of Points distinguishes between a step on a landing that is about shoulder width apart, which incurs a 0.3 deduction, and a narrower step, which incurs 0.1. That 0.2 difference can determine whether a gymnast finishes on the podium or qualifies for a major competition. This is part of the reason why judging panels have multiple judges: cumulatively, they can try to account for all deductions. Fujitsu’s technology, however, may be able to replace the panel altogether while providing a more accurate score.
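
To make the example concrete, here is a minimal sketch of how a sensor-measured landing step might be mapped to the deductions just described; the 0.45 m shoulder width and the 5 cm noise floor are illustrative assumptions, not Code of Points values:

```python
def landing_step_deduction(step_width_m, shoulder_width_m=0.45):
    """Map a measured landing step (in meters) to an E-score deduction.

    Follows the article's example: a step about shoulder width apart
    costs 0.3, a narrower step costs 0.1, and movement below the
    sensor's noise floor counts as no step at all.
    """
    if step_width_m < 0.05:              # below noise floor: no step
        return 0.0
    if step_width_m < shoulder_width_m:  # small step
        return 0.1
    return 0.3                           # shoulder width or wider

print(landing_step_deduction(0.30))  # 0.1
print(landing_step_deduction(0.60))  # 0.3
```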

Introducing AI into gymnastics judging may also alleviate the issue of bias. There are rules already in place that attempt to prevent obvious biases. Judging panels are composed of judges from multiple nations, for example, and if one judge’s score is drastically different from the others, the head judge can advise the deviating judge to change their score.
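
The exact procedure varies by discipline, but the underlying idea, averaging a panel while flagging a deviating judge, can be sketched roughly as follows; the trimming rule and the 0.3 tolerance are illustrative assumptions:

```python
from statistics import mean

def aggregate_panel(scores, tolerance=0.3):
    """Drop the highest and lowest E-scores, average the rest, and
    flag any judge whose score deviates from that average by more
    than `tolerance` -- the kind of outlier a head judge might query."""
    trimmed = sorted(scores)[1:-1]
    panel_score = mean(trimmed)
    flagged = [i for i, s in enumerate(scores)
               if abs(s - panel_score) > tolerance]
    return round(panel_score, 3), flagged

# The judge at index 3 sits well above the consensus and gets flagged.
print(aggregate_panel([8.5, 8.6, 8.4, 9.3, 8.5]))  # (8.533, [3])
```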

Subtle biases can still go undetected, however, particularly when a judge is scoring a routine by an athlete from their own nation. Using results from the 2008 Beijing Olympics, Andrew Duong, a Berkeley university student, performed a statistical analysis of the scores against judges’ nationalities. He found positive biases when a judge scored an athlete of the same nation, evident in higher scores; conversely, some judges showed negative biases toward athletes from other nations, resulting in lower scores.
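
Duong’s actual analysis was more involved, but the core comparison, whether judges score same-nation athletes above the panel consensus, can be illustrated with a toy version; the records below are invented for illustration:

```python
import statistics

# Invented records: (judge_nation, athlete_nation, judge_score, panel_mean)
records = [
    ("USA", "USA", 9.1, 8.9), ("USA", "ROU", 8.7, 8.8),
    ("RUS", "RUS", 9.0, 8.8), ("RUS", "CHN", 8.6, 8.7),
    ("CHN", "CHN", 8.9, 8.8), ("CHN", "USA", 8.8, 8.9),
]

# Residual = how far a judge sits above or below the panel consensus.
same  = [s - m for jn, an, s, m in records if jn == an]
other = [s - m for jn, an, s, m in records if jn != an]

print("same-nation mean residual: ", statistics.mean(same))   # positive
print("other-nation mean residual:", statistics.mean(other))  # negative
# Over thousands of real scores, a persistent gap between these two
# means (tested for statistical significance) is the signature of bias.
```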

Whether these biases are conscious or not, they have adverse effects on competitions. The difference between one judge’s score and the others’ may be too small to call out as bias, so it goes uncorrected and still affects the final score, putting athletes at an unfair advantage or disadvantage.

This issue may also skew athletes’ sense of the quality of their routines. If an athlete competed a routine at two different competitions with similar mistakes, the scores might differ simply because of where each competition was held. In theory, Fujitsu’s technology will solve this problem by guaranteeing that routines are scored in the same manner at all competitions.

Furthermore, it may mitigate the issue of fatigue. Judging accurately is demanding: judges must track multiple features at once, usually at high speed, which depletes cognitive resources. With multiple sessions per day, each lasting two or three hours, judges often work from early in the morning to late in the evening. As the day goes on, it becomes increasingly difficult to maintain the same quality of scoring. One fix would be to rotate the panels of judges throughout the day, but differences in scores may still arise because a new set of judges may have a different style of judging. Having the same algorithm executed for each athlete should eliminate these disparities.

In a perfect world, this is exactly how an algorithm would perform: execute a sequence of actions in an unbiased manner. The problem is that the chosen sequence is developed by a technical expert and potentially other stakeholders, who will have their own biases. Those biases will be transferred to the algorithm.

How transparent will these algorithms be? Even if athletes and coaches have a general understanding of how scores are calculated, it will be difficult to recognize and call out biases; trust falls to the developers to ensure that the algorithm scores athletes fairly. Athletes will also want to know how they can improve. Can they learn how if the algorithms are not transparent?

Although introducing AI may solve the problem of inconsistent judging, gymnastics is, in reality, an unpredictable sport. A gymnast may make a mistake and cover it up by performing a different skill. At the 2017 U.S. Championships, Jordan Chiles intended to perform a double wolf turn, but she lost her balance after the first turn and ended up spinning into a standing triple spin. After her routine ended, the judges took extra time to decide how the turns should be credited in her difficulty score.

Fujitsu’s proposed technology would also have trouble determining the score in such cases: it relies on a fixed dictionary of movements, and it is unclear how it would handle unfamiliar skills. The same situation arises when athletes submit new skills to the Code of Points, which requires completing the skill successfully (i.e. without falling). Will the AI be able to add the new skill to its knowledge base?
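
The safest behavior for a recognition system here is arguably to refuse to guess. Reusing the hypothetical identify_skill matcher from the earlier sketch, an unmatched element could be deferred to humans rather than forced into the nearest known skill:

```python
def score_element(observed, dictionary, max_distance=0.5):
    """Credit an element only if it confidently matches the dictionary;
    otherwise defer to a human panel instead of mis-crediting it.

    A deferred element could later be added to the dictionary once the
    new skill is officially recognized in the Code of Points.
    """
    match = identify_skill(observed, dictionary, max_distance)
    if match is None:
        return {"status": "human_review", "skill": None}
    return {"status": "credited", "skill": match}
```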

Another issue facing AI judging is how to deduct for artistry in floor and beam routines. Artistry is one of the things that make gymnastics such a unique sport. Beyond just performing skill after skill, floor routines in particular allow a gymnast to express their personality, creating a personal blend of dance, leaps, and tumbling lines in conjunction with music.

Judging the artistic aspect of a routine has been a contentious subject for human judges for years, as the definition of an “artistic” routine is vaguely worded in the Code of Points. In the 2017 Code of Points, one of the possible deductions for the E-score is described as “insufficient artistry of performance,” including lack of confidence, fluency, and personal style. Can Fujitsu’s current vision for AI judging be extended to include artistry deductions? So far, it is only being built to map the athlete’s perceived actions to a set library of movements. The aforementioned qualities — personal style in particular — cannot be reduced to a set number of movements. The idea that any machine, or any human for that matter, would attempt to do so would be absurd, as there is no limit to the number of dance movements that can be created.

Artistry is highly subjective; what one person finds artistic, another may not. Some prefer the Russian gymnasts, known for their beautiful balletic style, while others are more drawn to the precise movements of the Chinese gymnasts, who have their own iconic style. Compare these gymnasts to college gymnasts, whose routines tend to feature hip-hop themed movements set to upbeat dance music. Someone unaccustomed to this type of unconventional routine may not consider it artistic. Is it possible, then, to design an algorithm that captures a universal account of artistry?

On one hand, if the same algorithm used to detect artistry (or lack thereof) in a routine is used across all competitions, this guarantees consistent judging. The problem is that this doesn’t completely eliminate subjectivity. What the algorithm considers artistic will be at the discretion of its creators, the developers, and the experts involved. This is a small, centralized group of people who are speaking on behalf of the gymnastics community, and there is no guarantee that all viewpoints will be considered.

This approach could work if many different opinions were taken into account with a large committee overseeing the creation of the algorithm. Still, there’s a level of skepticism about finding a concrete way of defining “artistry,” since many are already doubtful of the current scoring system’s methods. Perhaps this speaks to a bigger issue in the sport as to how artistry should be accounted for in an athlete’s score, but in the end, because of this lack of consensus, artificially intelligent judging will be most effective and reliable if it evaluates solely objective deductions.

Assuming judging relied solely on AI technology, what would happen if a device suffered a technical failure in the middle of a routine? What if a security breach caused the system to malfunction? There is no room for failure, because such an incident could have catastrophic effects on the competition: an athlete could receive an inaccurate score, or worse, no score at all. The only apparent remedies, having the athlete repeat the routine or falling back on video replay footage, are cumbersome and would interrupt the competition. Given the risks, this raises the question of whether competitions should depend entirely on technology.

Gary Marcus points out that neural networks are, at their core, a complicated combination of statistical techniques, and statistical techniques produce errors. This means the mechanisms behind deep learning may not be reliable enough to use in practice, since competitions cannot afford any mistakes. The consequences would be too great.

Though human biases may show through in the scores, it is easier to monitor and mitigate those issues during a competition. Fujitsu’s technology may be able to judge bar routines and vault autonomously, but for beam and floor, where choreography is involved, the work should be divided: provided the software is secure and thoroughly tested, it can handle the D-score calculation and objective E-score deductions, while human judges remain responsible for subjective E-score deductions such as artistry.
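
Under that division of labor, combining the two portions into a final score is simple arithmetic; a sketch using the open-ended scoring structure (D-score plus an E-score built down from a 10.0 base):

```python
def final_score(d_score, objective_deductions, artistry_deduction):
    """Combine the AI-judged portion (D-score and objective E-score
    deductions such as steps and bent knees) with the human-judged
    subjective deduction for artistry."""
    e_score = max(10.0 - sum(objective_deductions) - artistry_deduction, 0.0)
    return round(d_score + e_score, 3)

# Example: 5.8 D-score, AI finds 0.1 + 0.3 in objective errors, and
# the human panel takes 0.2 for artistry: 5.8 + 9.4 = 15.2.
print(final_score(5.8, [0.1, 0.3], 0.2))  # 15.2
```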

Rather than replace human judges completely, perhaps artificially intelligent judges should act only as assistants, because human judges are better equipped to handle novel situations. Ultimately, this would help ensure a future where athletes are scored as fairly as possible while competitions still reap the benefits of accurate real-time scoring.

Article by Shannon Lee

16 thoughts on “The Pros and Cons of Artificially Intelligent Judges in Gymnastics”

  1. I agree with the AI in gymnastics, but just to help judges. I also think they should be able to watch in slow motion, if that doesn’t exist already.

  2. I think at least there should be something like Kathy Johnson’s handstand protractor or something that measures height and distance on vault and tumbling, and whether a leap was in a full split or not. Full AI judging would reduce gymnastics to just a bunch of tricks, which I don’t think anyone wants.

  3. Are there any concerns about judges losing their income? I gather that this isn’t their only job but it might help some of them get by. Full AI judging might take that extra income away from them. I’m wondering how judges themselves feel about this.

  4. This is such a well-written article! As long as this technology checks out and tests consistently, I would love the idea of splitting up human and AI judging for beam and floor. Vault and bars are much more limited in the vocabulary of movement that they entail, so AI should be quite effective with those, I’d say! I can’t wait to see how this process plays out.

  5. I’m opposed to AI. The only place I see it having zero issues is vault, and even then I really don’t think it’ll be all that much more precise than the human judges, since vault is the one event that is 100% objective in its scoring. As the article mentioned, the two big issues are novel situations and artistry/originality. Since the AI has a set dictionary of skills, it may misinterpret new skills as a different element (like if the Schafer was debuted with AI, maybe it would try to count it as a side somersault or something) or count it as a deduction when it really isn’t. And the same thing with what happened to Chiles at nationals.

    Plus, implementing AI as the main source of scoring might help with E-scores, but it would derail everything they’ve tried to do in terms of preserving artistry. There is no way to have AI interpret artistry; it only reinforces this idea of throwing big skills as cleanly as possible and ignoring artistry. And I really don’t think AI is worth dumping money into for the maybe one competition a year where you have a 2011 Wieber-Komova or 2015 Uneven Bars EF situation (although in my opinion they very clearly did not deserve the same score in the latter). The Code of Points is specific enough to separate the routines from one another, as opposed to “if you hit, you all get an 8.9,” but broad enough that there aren’t deductions you literally can’t humanly catch. For sports like sprinting and speed skating it makes sense, because the difference between first and second, or third and off the podium, is often so narrow that the human eye literally can’t see it. But the scoring for gymnastics isn’t that specific.

    I think the only things they need to change are to 1) give the judges a more comprehensive view of the gymnast’s routine so they can judge appropriately, plus slow motion if it isn’t already there, and 2) make the Code much more specific about the definition of artistry, and appropriately crack down on penalizing routines that aren’t artistic, because routines as devoid of artistry as Jade Carey’s floor routine from Worlds should never be able to make the podium regardless of how difficult and clean they are. I just don’t think there is anything AI can do that humans can’t that would change anything.

    • Agree with everything you said. I would add that slow motion replay should be used for anything under review, like they have in figure skating. And I wouldn’t be opposed to some sort of sensor to ensure that out-of-bounds instances are properly called on vault and floor exercise.

  6. Pet peeve here… It’s not Berkley University (which is also spelled wrong), it’s the University of California Berkeley, or just Berkeley, or UC Berkeley, or California, or just Cal. Great article, this just always irks me! Xoxo, A Berkeley alum

    • I fixed the typo, but the writer of this article was saying “a Berkeley university student” meaning a university student at Berkeley, not “a Berkeley University student” (which is why the U wasn’t capitalized!).

  7. Well, the picture above the article shows how bad Gabby Douglas’s form is. I guess artificial intelligence judges wouldn’t have given her the gold medal.

    • If you watch the video, she does the skill correctly. Sometimes photographers don’t know the correct moment to capture in a skill, and they publish photos that show the wrong moment.

  8. Unpointed toes, a bent knee, an unarched back… But perhaps the best solution would be to replace the gymnasts with robots? Artificial intelligence would judge artificial beings devoid of any personality, as most American gymnasts are: that would be an absolute dream for most of you, wouldn’t it?

  9. Cool article. I was thinking that if they want to use AI, they will have to rewrite the Code of Points around rules the AI can check.

    For example, vault: your center of gravity 2 meters above the table is no deduction, 1.5 meters is 0.1, and 1 meter is 0.3.

    OR: your feet must be 1 meter above the table; less is a 0.1 deduction.

    Gymnasts will need to know what the AI is looking for, and the whole Code would have to be rewritten for this 😛

  10. I would worry about less wealthy nations in this scenario. If the AI technology was made publicly available to purchase it is likely that wealthy nations and gyms would have the capability to allow their gymnasts to practice with the AI, forming an even greater barrier for gymnasts from nations that cannot afford to invest heavily in their training.

  11. Pingback: Around the Gymternet: About last night | The Gymternet
