I decided to catch a few minutes of the Indiana State/Ohio State game last Sunday when this happened…
If the video doesn’t work, here’s a description: As Ohio State’s Ques Glover tries to finish a reverse layup, Paul Szelc calls a foul but it doesn’t appear there was any contact. Indiana State’s Josiah LeGree complains with the sort of respectful demeanor that you’d expect from a freshman who comes off the bench. And I believe him.
This wasn’t an interesting event on its own, but it grabbed my attention because I knew that among big-time refs, Paul Szelc is more likely to blow his whistle than almost anyone else. I know this because I spent way more time than I should have tinkering with a model to estimate the effect every college basketball official has on a game’s foul count.
This model controls for the following in each game:
Other officials on the crew [1]
The fouling tendencies of the teams involved
The final margin of victory (closer games tend to have more fouls)
Number of overtime periods
With those inputs, applying a ridge regression across all games from the past three seasons gives us a handle on officiating tendencies. Let’s call this measure Fouls Above Average, or FAA. Fitting, because so many officials depend on the FAA to function properly in order to make their game assignments. These values are now posted on the referee ratings page and in the box scores, and will be updated occasionally.
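As a rough illustration of the idea (not the actual model behind FAA), here is a small ridge regression in plain Python. The games are synthetic and noise-free, generated from made-up official effects so the fit can be sanity-checked. The column layout, the choice to penalize only the official terms, and the penalty value `lam` are all assumptions.

```python
# Toy ridge regression for official effects. The data, names, and penalty
# setup are illustrative assumptions, not the real FAA model.
OFFICIALS = ["A", "B", "C", "D"]
TRUE_EFFECT = {"A": 2.0, "B": 0.0, "C": 0.0, "D": -2.0}  # made-up effects to recover

def make_game(crew, tendency, margin, ot):
    """Build one noise-free synthetic game from the made-up effects."""
    fouls = tendency - 0.1 * margin + 4 * ot + sum(TRUE_EFFECT[o] for o in crew)
    return {"crew": crew, "tendency": tendency, "margin": margin, "ot": ot, "fouls": fouls}

NO_A, NO_B = ("B", "C", "D"), ("A", "C", "D")
NO_C, NO_D = ("A", "B", "D"), ("A", "B", "C")
# (crew, combined team foul tendency, final margin, overtime periods)
GAMES = [
    make_game(NO_A, 34, 15, 0), make_game(NO_A, 37, 3, 0), make_game(NO_A, 40, 12, 1),
    make_game(NO_B, 35, 2, 1), make_game(NO_B, 37, 20, 0), make_game(NO_B, 39, 8, 0),
    make_game(NO_C, 33, 10, 0), make_game(NO_C, 37, 6, 1), make_game(NO_C, 41, 14, 0),
    make_game(NO_D, 36, 18, 0), make_game(NO_D, 37, 11, 0), make_game(NO_D, 38, 1, 1),
]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_faa(games, officials, lam=0.1):
    """Ridge fit: intercept and game controls unpenalized, official terms penalized."""
    n_controls = 4  # intercept, team tendency, margin, overtime periods
    rows = [
        [1.0, g["tendency"], g["margin"], g["ot"]]
        + [1.0 if o in g["crew"] else 0.0 for o in officials]
        for g in games
    ]
    y = [g["fouls"] for g in games]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    for k in range(n_controls, p):
        xtx[k][k] += lam  # the ridge penalty shrinks official effects toward zero
    beta = solve(xtx, xty)
    return dict(zip(officials, beta[n_controls:]))

faa = fit_faa(GAMES, OFFICIALS)
for name in OFFICIALS:
    print(name, round(faa[name], 2))
```

In this toy setup the recovered effects for A and D come out slightly smaller in magnitude than the 2.0 used to generate the data. That is the ridge penalty at work: it pulls every official toward average, and it also keeps the fit solvable even though the crew indicators always sum to three.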
The model tells us that Paul Szelc has an FAA of +1.9: going forward, he’s expected to call 1.9 more fouls per game than the average D-I official. Even though he’s extreme, in the grand scheme of things that’s really not a lot. The Indiana State/Ohio State game was quite the hack-fest, with 51 fouls called between the two teams. But if you replace Szelc with an average ref, you’d expect 49 fouls. And if you replace him with someone from the other end of the spectrum, you get down to 47. Better, but still a lot more fouls than your average game.
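The replacement arithmetic can be sketched in a couple of lines. The function name is made up for illustration, and the -1.9 used for "the other end of the spectrum" is an assumption chosen to mirror Szelc's +1.9.

```python
def expected_fouls_with_swap(observed_fouls, current_faa, replacement_faa=0.0):
    """Expected foul count if one official is swapped out, all else equal."""
    return observed_fouls - current_faa + replacement_faa

# Swap Szelc (+1.9) for an average official (FAA of 0).
print(round(expected_fouls_with_swap(51, 1.9)))        # 49
# Swap him for an official at the opposite extreme (assumed -1.9).
print(round(expected_fouls_with_swap(51, 1.9, -1.9)))  # 47
```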
Ultimately, the most important factors in predicting fouls called in a game are the tendencies of the teams involved and how they actually play in that game. Then you can start thinking about what small effect the officials will have on top of that.
So this is one of those things that is statistically significant - the effects we are measuring for each official are definitely real - but not usually meaningful. Games with lots of actual fouls will have a lot of fouls called regardless of who is working the game. And games with few fouls will not have many whistles. Based on that, I don’t think there’s anything here that is actionable on a team level (although don’t let me stop you from trying).
One area where it might have a bit of utility is in how crews are constructed to work games. The thing about these tendencies is that they are surprisingly persistent. I did a similar exercise 11 years ago and the refs that were extreme cases are still doing their thing. Jerry Heater (+1.9) was one of the most frequent foul callers. John Hampton (-1.9), Roger Ayers (-1.7), and Ted Valentine (-1.4) were some of the least frequent.
We can look at this in a more systematic way. Here’s a plot of all officials who qualified for a rating (by working at least 40 games) in both 2022 and 2025:
And as officials get more games, their values become more reliable. Here are officials that worked at least 80 games in both periods:
Some officials do change, but those are the exceptions. The differences the model picks up on aren’t random and FAA is remarkably stable over time. There aren’t any dots in the upper left or lower right corners of the plot.
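One way to put a number on that stability is to correlate each official's rating across the two periods, and to check whether anyone jumped from one extreme to the other. The sketch below assumes ratings are stored as (earlier, later) pairs. The five officials and their values are placeholders invented for illustration, not actual FAA ratings, and the `cutoff` of 1.0 is arbitrary.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def corner_officials(ratings, cutoff=1.0):
    """Officials who flipped extremes, i.e. dots in the plot's off-diagonal corners."""
    return [
        name
        for name, (earlier, later) in ratings.items()
        if (earlier <= -cutoff and later >= cutoff)
        or (earlier >= cutoff and later <= -cutoff)
    ]

# Placeholder ratings: name -> (earlier period, later period). Not real data.
ratings = {
    "Ref A": (1.8, 1.6),
    "Ref B": (-1.5, -1.7),
    "Ref C": (0.2, -0.1),
    "Ref D": (-0.6, -0.2),
    "Ref E": (0.9, 1.2),
}

earlier = [pair[0] for pair in ratings.values()]
later = [pair[1] for pair in ratings.values()]
print(round(pearson(earlier, later), 2))
print(corner_officials(ratings))
```

A correlation near 1 together with an empty list of corner cases is what stable tendencies would look like in numbers.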
Technology has given officials better tools to improve their work, and a shrinking home court advantage hints that games are called more fairly than they were 10 or 20 years ago. But what we can learn here is that it seems no amount of corrective action can make Paul Szelc and Roger Ayers - two respected officials who worked the Final Four last year - call the game the same way.
That might seem troubling but let’s turn that frown upside down and use this to our advantage. A simple path to improving officiating consistency on a game level is to avoid grouping officials who have similar extreme tendencies. And since we know each official’s tendency with reasonable accuracy, this goal is achievable.
For instance, the November 30 game between Rutgers and Texas A&M featured 48 fouls and 59 free throws. And it was almost entirely unaffected by late-game fouling - there were only 2 fouls in the final four minutes. The combined FAA of the crew was +3.6, one of the highest of the season. On the flip side a December 17 game between St. Bonaventure and Siena had 22 fouls and was reffed by a crew with a combined FAA of -3.5.
Now, I cherry-picked these games. It doesn’t always work out so predictably. In truth, my viewing of a Paul Szelc phantom call was a lucky coincidence. He’s not going to get fooled on that call most of the time. Maybe he calls that 15% of the time and John Hampton calls it 10% of the time. Those differences add up over the course of a game, but not so reliably that we can guarantee Paul Szelc will always be involved in foul-fests and all John Hampton games will be free-flowing works of art. It’s mostly up to the teams to chart the course.
All that said, it’s uncanny how predictable the trends are for individual officials. And given that, it’s within the control of officiating coordinators to avoid extreme pairings. It wouldn’t take much. Of the 2,562 games I have officiating data for this season, just 52 (2.0%) had a crew with an aggregate FAA outside of +/- 3. Scheduling is complicated, but maybe it wouldn’t take a lot of effort to prevent those crews from existing.
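A minimal sketch of that kind of crew check, assuming FAA values are kept in a simple name-to-rating lookup. The ratings are the ones quoted in this post. The three crews are hypothetical combinations rather than actual assignments, and the function names and `limit` parameter are illustrative.

```python
# FAA values quoted in the post; "Inexperienced Ref" is the pooled +0.1 rating.
FAA = {
    "Paul Szelc": 1.9,
    "Jerry Heater": 1.9,
    "John Hampton": -1.9,
    "Roger Ayers": -1.7,
    "Ted Valentine": -1.4,
    "Inexperienced Ref": 0.1,
}

def crew_faa(crew, faa=FAA):
    """Aggregate FAA for a crew: the sum of its members' ratings."""
    return sum(faa[name] for name in crew)

def is_extreme(crew, faa=FAA, limit=3.0):
    """True when the crew's aggregate FAA falls outside +/- limit."""
    return abs(crew_faa(crew, faa)) > limit

# Hypothetical crews, for illustration only.
whistle_heavy = ["Paul Szelc", "Jerry Heater", "Inexperienced Ref"]
whistle_light = ["John Hampton", "Roger Ayers", "Ted Valentine"]
balanced = ["Paul Szelc", "John Hampton", "Inexperienced Ref"]

for crew in (whistle_heavy, whistle_light, balanced):
    print(round(crew_faa(crew), 1), is_extreme(crew))

# Share of this season's games with an extreme crew, from the counts above.
print(round(100 * 52 / 2562, 1))  # 2.0
```

Pairing one official from each extreme, as in the third crew, lands the aggregate close to zero, which is the whole idea.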
Constructing more neutral crews would make officiating at least slightly more consistent from game to game. And now that we know the behavior of individual officials is predictable over time, it wouldn’t be a difficult change to make.
[1] Any refs with fewer than 40 games over the past 3 seasons (D-I vs. D-I only) are combined into a monolithic “Inexperienced Ref” for the purposes of the regression. Inexperienced Ref ended up with a nearly neutral +0.1 rating.