Probullstats Blog

Mar 24, 2021, 8:59 am

Nearly Everything About Judging

By: Slade Long


Judging in rodeo and bull riding has been a more or less contentious issue for the forty or so years I've been around the sport, and in all likelihood, since the first judged rodeo. Almost everyone has an opinion about judges' opinions. Here we will examine how judging works, the problems with it, and what might be done about them.

Judges have two basic jobs: rule enforcement and scoring. In the timed events, judging is mostly about rule enforcement, and in riding events, scoring also comes into play. We'll go deeper into the scoring side of it, because that causes the most controversy, but we will deal with rule enforcement as well because there are some common factors that bleed over and can affect outcomes.

In bull riding, bronc riding and bareback riding, judges score the animal and the rider. The score for each is an opinion, and that means these events are similar to other judged sports, such as gymnastics, figure skating, dressage, and so on. It's worth noting that because these are Olympic sports, the judging for them has been studied, debated, tweaked and tuned more than judging in rodeo ever has, and all the arguing and tweaking has mostly failed to eliminate controversy.

In rodeo riding events, and especially in bull riding, it is important that the judges are able to evaluate stock performance accurately. When judges get it wrong, it is often a result of inaccurate stock evaluation.

The root of most judging controversy is that scoring done by judges is subjective. This is kind of a non-negotiable fact. There isn't an acceptable way to remove all subjectivity from the scoring process in riding events or bucking stock competitions. The judges do sometimes get it wrong, and that can be proven just by observing what they've written down without watching a single animal buck. In some sense, this is just part of the game. When the judges get it wrong, problems arise.

These problems are often made worse by the perception of other parties. Contestants, stock contractors, producers, media, fans and assorted administrative staff see this in different ways and can have competing interests in "fixing" judging. Their perception of what is actually happening and what should be done about it is often way off base, even if their intentions are honest. Many judging remedies have been tried, and some of them made matters worse. There is not enough space here to list all the ideas that have been proposed or implemented, but they tend to follow a pattern.

Some ideas about how to improve judging involve minimizing subjectivity. The most ambitious of these is the PBR's experiment in using onboard electronic sensors to evaluate the performance of bulls. This was a long shot at best, and at present there's no evidence it was effective. A huge chunk of the other ideas involves changing the system in some way or addressing judge bias directly.


The System

Bucking stock and riding events are judged by a wide variety of systems now. The simplest is two judges using 0-25 whole points for the stock and rider, which the PRCA used for years. Many of the tweaks to this involve adding more judges and/or splitting the points they are allowed to give into finer increments. The PRCA now allows judges to use half-point increments. The PBR commonly uses 4 judges using half-point increments. ABBI has used 4 to 6 judges using quarter-point increments. In practice, allowing the judges to mark in finer increments and adding additional judges both have the primary effect of suppressing extreme scores on the high and low end.
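For illustration, here's a rough Python sketch of how a panel's marks might turn into a final score. The 0-50-per-judge model and the doubling are assumptions for the sake of the example, not any organization's official arithmetic:

```python
# Minimal sketch of turning a panel's marks into a final score. The
# 0-50-per-judge model and the doubling are illustrative assumptions,
# not any organization's official arithmetic.

def round_to_increment(mark, increment):
    """Snap a raw mark to the allowed increment (1, 0.5, or 0.25)."""
    return round(mark / increment) * increment

def final_score(judge_totals):
    """Average the judges' 0-50 totals and scale to a 0-100 score."""
    return 2 * sum(judge_totals) / len(judge_totals)

# Two judges, whole points (the old PRCA setup):
print(final_score([44, 43]))                      # 87.0

# Four judges, half-point increments (the PBR setup):
raw = [44.3, 43.6, 44.9, 42.2]
marks = [round_to_increment(m, 0.5) for m in raw]
print(final_score(marks))                         # 87.5
```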

This is why you have seen better rides than the 96.5-point PBR record, but have not seen any 96.5 or higher scores recently. All of the PBR's ten highest scores happened before May of 2004 - 17 years ago. Shortly thereafter the PBR went to 4 judges full time, which has the side effect of slightly suppressing the highest scores.

There are other schemes in which the high and low scores are dropped, or an outlying score is dropped. Dropping a score from the high or low side of the distribution obviously adds to the suppression of extreme scores by moving the final score closer to the mean. This is an important point to remember - the primary effect of adding judges, adding precision, or dropping extreme scores is that they all move the resulting final score closer to the mean of the distribution of judges' scores. This is what I mean when I say systems "favor the mean."
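Here's a quick sketch of that suppression effect, using made-up panel marks:

```python
# A made-up six-judge panel (0-50 totals per judge). Dropping the high
# and low marks only matters when a judge steps away from the pack -
# and when one does, the trimmed average erases it.

def mean(marks):
    return sum(marks) / len(marks)

def drop_high_low_mean(marks):
    """Average after discarding the single highest and lowest marks."""
    return mean(sorted(marks)[1:-1])

panel = [44.0, 44.0, 44.5, 44.5, 45.0, 49.5]  # one judge went high
print(2 * mean(panel))                        # 90.5 with all marks kept
print(2 * drop_high_low_mean(panel))          # 89.0 - the 49.5 is gone
```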

All of these system-based judging schemes tend to diminish the impact of each individual judge. So if you have 6 judges and you are dropping the high and low scores, each of your judges plays less of a role in the outcome than they would if there were 2-4 judges and you kept all their scores no matter what.

Are these schemes good or bad? It depends. If you plan to hire judges who aren't that good at judging, and those inferior judges tend to throw out wildly high or low scores a lot, they may serve you well. Outside of that specific situation, systems that favor the mean may have a number of negative effects.

The problem with systems in general is that they change your mindset. Many of these systems are not very well thought out. Some are quite complex and someone put a lot of thought into them. Some are knee-jerk reactions to some incident in the past. Whichever one you employ, you will tend to believe that the judges who give you the best results within that system are the "best" judges. The truth is that the judges you choose to hire have a much greater impact than the system does. If you put bad judges into a perfect system you will get bad results. If you put great judges out there with no system at all you will get good results.

The premise of systems that favor the mean is not bad. If you hired 10 to 1000 competent judges and tweaked the system to heavily favor the mean they would likely produce good results, and those results would be accurate to a high degree of precision. The flaw is that there are not that many quality judges around in the first place, and events almost never use more than 6 judges. The biggest flaw of systems that favor the mean is that they allow judges who are extremely conservative to thrive and multiply while killing off judges who spread their markings over the available range of scores.
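That "killing off" deserves a closer look. Here's a hypothetical simulation - the compression factor, noise level, and score range are all made up - showing how a judge who tracks the true ride value ends up looking like the panel outlier when everyone else hugs the middle:

```python
# Hypothetical simulation of the culling problem. Four judges compress
# every mark 70% of the way toward 44 (the conservative habit); a fifth
# judge tracks the true ride value. If oversight grades judges by how
# closely they match the panel consensus, the accurate judge looks like
# the worst judge on the panel. All numbers here are made up.

import random
random.seed(7)

def simulate(rides=10000):
    gap = {"accurate": 0.0, "conservative": 0.0}
    for _ in range(rides):
        true_value = random.uniform(38, 49)
        conservative = [44 + 0.3 * (true_value - 44) + random.gauss(0, 0.5)
                        for _ in range(4)]
        accurate = true_value + random.gauss(0, 0.5)
        consensus = (sum(conservative) + accurate) / 5
        gap["accurate"] += abs(accurate - consensus)
        gap["conservative"] += abs(conservative[0] - consensus)
    return {k: round(v / rides, 2) for k, v in gap.items()}

print(simulate())
# Roughly {'accurate': 1.6, 'conservative': 0.5} - the judge nearest
# the truth is the one a consensus metric flags as the outlier.
```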


Bias

Bias is often seen as a big issue in judging. Many of the systems were created with the idea that judge bias is the biggest single problem, and making the whole process more mechanical may eliminate bias. There are also standard practices in place because of potential bias, such as various means of quarantine, preventing judges from seeing one another's scores, etc.

Many specific complaints about judges arise from suspicion of bias for or against a particular animal, rider, or stock contractor. Many feel the judges are biased towards the highest performing animals or riders.

All of these suspicions about bias are understandable. All judges are indeed biased, just as every other human being on the planet is. Whether you realize it or not, your perception of just about everything is filtered through your own experiences, and this constitutes bias. You may not even realize it is happening, and it can be hard to recognize and resist. In just about every specific case I know of where a judge has been accused of bias, the accuser was actually more biased. I don't think this is unusual. I think it comes up as a major concern among stock owners because most stock owners watch every event from a much more biased perspective than the judges do. They pay a great deal more attention to their own animals. There is nothing wrong with this, except that it can lead to bad policies. Judges are usually the least partisan group at any given event, and stock contractors are often the most partisan, yet suspicions and accusations tend to flow the other way.

This is not to say that judging bias doesn't exist. It does, and can be demonstrated just by looking at numbers in some cases. But, it usually turns up where you least expect it and can only be seen by looking at large amounts of data.

A concrete example of this can be seen by looking at PBR long round scores over many years. These are mostly random draw rounds in which a score of ~87 or better is considered "good" and will consistently place. If you look at all the riders over the course of 5 or 10 years, you will notice that some guys hit for 87+ more often than others. If you compare riders of similar ability but different physical size, you will find the smaller guys tend to hit more often. You can also find that the riders with the best record of snatching 87+ point scores out of random draw rounds are all PBR World Champions in the year following their first title. They tend to hit significantly more often than their average for all the years before they won the title. This is an example of judge bias, but it's not exactly a smoking gun. This kind of bias is hard to escape, and the judges probably exhibit it to a lesser degree than most of the other people on hand at any given event. It is about expectations. Everyone in the arena pays attention when it is Bruiser's or Bushwacker's turn because they expect to see something special. Likewise with McBride, Mauney, Lockwood, and Leme. Judges are not immune to this.
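If you have the ride records, checking this is straightforward. Here's a sketch - the record format is hypothetical - comparing a rider's 87+ hit rate before their first title with the season after it:

```python
# Sketch of how this kind of bias shows up in bulk data. Hypothetical
# record format: (rider, year, score) for qualified rides in random-draw
# long rounds.

def hit_rate(scores, threshold=87.0):
    """Share of qualified rides at or above the threshold score."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)

def before_after(records, rider, title_year):
    before = [s for r, y, s in records if r == rider and y < title_year]
    after = [s for r, y, s in records if r == rider and y == title_year + 1]
    return hit_rate(before), hit_rate(after)

# Usage with made-up data; a real analysis needs years of ride records:
records = [("Rider A", 2002, 86.5), ("Rider A", 2003, 88.0),
           ("Rider A", 2005, 89.5), ("Rider A", 2005, 87.5)]
print(before_after(records, "Rider A", 2004))  # (0.5, 1.0)
```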

Popular concerns about bias are usually about a specific kind of bias - where an individual judge is a bad actor and is trying to game the results in favor of a friend or against an enemy or just personal preference. This kind of situation is possible, and has surely happened before. However, given that it is possible to detect bias that even the judges themselves are unaware of, finding evidence that a given judge is a bad actor is pretty easy. With competent oversight, there is zero chance that a bad actor judge could operate undetected. Whether or not competent oversight exists should be a bigger concern.
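Here's what that kind of check might look like, assuming each judge's individual marks are recorded (the record format and names are made up):

```python
# Sketch of a basic bad-actor check. Hypothetical record format:
# (judge, rider, judge_mark, panel_mean). A judge gaming results for
# one rider will sit consistently above (or below) the rest of the
# panel on that rider's outs, and near the panel everywhere else.

def mean_gap(records, judge, rider=None):
    """Average gap between a judge's mark and the panel mean,
    optionally restricted to one rider's outs."""
    gaps = [m - p for j, r, m, p in records
            if j == judge and (rider is None or r == rider)]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Usage with made-up marks: Judge X runs +1.5 on Rider A, flat otherwise.
records = [("Judge X", "Rider A", 45.5, 44.0),
           ("Judge X", "Rider A", 45.0, 43.5),
           ("Judge X", "Rider B", 43.0, 43.0),
           ("Judge X", "Rider C", 42.5, 42.5)]
print(mean_gap(records, "Judge X", "Rider A"))  # 1.5
print(mean_gap(records, "Judge X"))             # 0.75 overall
```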

Aside from devious intent or inherent biases that every human being carries around, judges may instead be biased in favor of self preservation. This may be better understood as bias in favor of the person or organization who hired the judge.

If the employer openly desires higher scores at an event, or expresses opinions about various riders or stock, these things can influence a judge's scores and rule enforcement decisions whether the judge is aware of it or not. Most guys who judge want to do a good job of it, and be honest and fair. In the real world, whether they succeed or not often depends on their employer's definition of success. This happens all the time and can be shown by looking at the data. If the employer has expectations, over time the judging will trend toward fulfilling those expectations because they will hire judges who fulfill them and cull judges who do not. If their expectations are flawed, then judging at their events will eventually reflect those flaws to some extent. You do NOT want judges who are primarily concerned with figuring out what their employer wants to hear and producing scores to fit their employer's ideal.


Getting It (as) Right (as possible)

What's the best way to ensure consistent and accurate judging? In a nutshell, hire the best judges. The problem with this is that people have differing opinions about who the best judges are, and their opinions are usually far more biased than the judges' scores are.

Stock owners may (and often do) value judges who tend to give their animals high marks. Organizations may value judges who are complained about the least. People who hire judges for events may value judges they are friends with, or they may hire the judges who pester them for assignments the most. Contestants may prefer judges who are less likely to badger them with petty rule infractions.

Sometimes people believe that because judging is subjective by nature, there is zero objectivity involved. In that view, if a judge marks a bull 22, then by definition that is a 22-point bull, and if they tend to agree with the judge, the judge is "good," and if not, "bad." But anyone can plainly see the difference between a bull that is average and a bull that is great, and whether a rider is in control or not. There is an abstract concept of objective value that both stock and rider display, and there are a couple of ways to demonstrate this.

1: Imagine you were able to wire both bull and rider with flexible bodysuits covered in hundreds of sensors that could collect tens of thousands of data points for each ride, accurately recording the physics of a ride, AND you had plenty of on-hand processing power to parse this physical data into an accurate and consistent set of values that would represent the performance of animal and rider. The PBR's sensor project was a crude take on this, but in the real world it is not practical. Just writing the code to parse the data would be a monumental task with infinite pitfalls. Theoretically it is a valid approach, and the impracticality of it does not mean the true values do not exist.

2: Imagine you recorded video of five different rides from exactly the same ground level arena perspective, and removed any visual aspect of the environment that could introduce bias. Rider identities unknown, no chute signage, no sponsor patches, all the bulls the same size and color, no sound, no slow motion, etc. Then you ask 100 well-qualified people to individually score each ride in an isolation booth, one by one, with no time limit, infinite replays, and to only concern themselves with scoring accuracy. The mean or average of the 100 scores for each ride would be extremely close to a true objective value. You would essentially be crowdsourcing scores in a perfect environment. This is also impractical, but theoretically could give you very accurate values.
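A quick simulation shows why the crowdsourcing scenario works, assuming each qualified scorer lands near the true value with independent error (the true value and error spread here are made up):

```python
# Simulation of the 100-scorer booth. Each scorer lands near the true
# value with independent error; the panel average tightens roughly with
# the square root of the panel size. TRUE_VALUE and the error spread
# are made up.

import random
random.seed(3)

TRUE_VALUE = 88.0

def panel_average(n, spread=1.5):
    """One panel of n scorers, each off by independent random error."""
    return sum(TRUE_VALUE + random.gauss(0, spread) for _ in range(n)) / n

for n in (2, 4, 10, 100):
    misses = [abs(panel_average(n) - TRUE_VALUE) for _ in range(1000)]
    print(f"{n:>3} scorers: worst miss in 1000 panels = {max(misses):.2f}")
# The worst miss shrinks from a couple of points with 2 scorers to a
# fraction of a point with 100.
```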

Neither scenario is a real world solution, but they do illustrate that reasonably concrete score values do exist. You want to hire judges who can come close to these values on the spot, under pressure, and in a noisy environment filled with distractions. Professional judges can do this, and some of them are quite good at it. Clearly there are judges who are not that good at it as well. They may compensate by consciously playing it safe and trying to get their scores in the middle every time, or they may just guess and hope for the best. Even the best judges are likely to miss once in a while, but they will miss by less.

So what's the answer? You can't create the perfect scenario, but with good oversight you can identify the judges who can score accurately in the environment where scoring has to be done - judges who are aware of their own biases, and can set them aside and work efficiently and accurately. Use the judges who take the job seriously and are the most proficient at doing it.

Good oversight includes the ability to evaluate judge performance using data AND outside the data as well. The person in charge must be capable of judging stock and riders at least as well as the average pro level judge. There's not a foolproof mathematical formula for evaluating scoring accuracy. Good oversight also means not introducing more bias into an environment that is already loaded with it, even unintentionally. You want judges who can consistently hit close to the theoretical perfect score every time, and you don't want to put anything in their brains that would distract from that objective.
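As one example of a data input - and only one input, for the reasons above - here's a sketch that ranks judges by how far they typically land from the rest of the panel (the record format is hypothetical):

```python
# One possible data input for oversight - a screen, not a verdict.
# Hypothetical record format: (judge, out_id, mark). Each judge is
# compared against the mean of the OTHER judges' marks on the same out.

from collections import defaultdict

def deviation_report(records):
    """Average gap between each judge and the rest of the panel."""
    by_out = defaultdict(dict)
    for judge, out_id, mark in records:
        by_out[out_id][judge] = mark
    gaps = defaultdict(list)
    for marks in by_out.values():
        for judge, mark in marks.items():
            others = [m for j, m in marks.items() if j != judge]
            if others:
                gaps[judge].append(abs(mark - sum(others) / len(others)))
    return {j: round(sum(g) / len(g), 2) for j, g in gaps.items()}

records = [("J1", "out1", 44.0), ("J2", "out1", 44.5), ("J3", "out1", 41.0),
           ("J1", "out2", 43.0), ("J2", "out2", 43.5), ("J3", "out2", 46.5)]
print(deviation_report(records))  # J3 stands out - a flag to look
                                  # closer, not proof of anything
```

Remember the simulation in the Systems section, though - a judge who stands apart from a conservative panel may be the one getting it right. That is exactly why the data can't replace a person who can actually judge.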

Judging will never be perfect, but the approach of hiring the judges who do the best job of it, and minimizing assignments to those who don't will get us as close as we can get to perfection. There aren't that many excellent judges out there, and finding and developing more of them should be a priority. That means identifying judges who have potential and talent, and giving them experience. Like anything else, it's not something you can teach just anyone to do at a professional level.

