We’ll get to the winners of the H.U.M.A.N. poll in due time, but the first thing I was excited to check when the season ended was whether the collection of H.U.M.A.N.’s that participated would have added value to my preseason ratings. This was the ultimate motivation for the project.
This was a tough year for the H.U.M.A.N. poll to debut. I spent some time last offseason improving the preseason ratings for the new era of increased player movement, and things went really well. Some of this was undoubtedly luck, but some of it was improvement in the preseason ratings model. I’ll eventually have a post or 20 with plenty of obnoxious self-congratulatory words on how good the preseason ratings turned out. But for now, just know the computer made it tough on the humans.
The goal, though, wasn’t for the humans to beat the computer. That result would have been kind of embarrassing for me, actually. The goal was for the humans to add value to my model. In order to test that, I converted the humans’ ratings into a pseudo-AdjEM so we could make direct comparisons.
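If you want to play along at home, here’s a minimal sketch of one way to do a conversion like that: quantile-map the poll’s rank order onto the preseason AdjEM distribution. The column names and data below are made up for illustration, and the actual conversion used for the poll may differ.

```python
import numpy as np
import pandas as pd

# Illustrative data: 'kp_pre' is the preseason AdjEM and 'human_rank'
# is the H.U.M.A.N. poll rank (1 = best). Names and values are made up.
df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "kp_pre": [20.1, 13.3, 6.0, -2.5],
    "human_rank": [1, 3, 2, 4],
})

# Quantile mapping: the team the humans ranked Nth gets the Nth-best
# preseason AdjEM value, putting both ratings on the same scale.
sorted_em = np.sort(df["kp_pre"].to_numpy())[::-1]  # best to worst
df["human_em"] = sorted_em[df["human_rank"].to_numpy() - 1]
```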
For instance, the biggest discrepancy between the humans and the computer was Stanford. The computer had them #42 with a rating of 13.30. The humans had them #96 with a rating of 5.99, meaning I overshot the humans by 7.31 points. Despite talent that would have suggested a better season, Stanford finished #105 with a rating of 6.57. The humans almost nailed it. Way to go, humans!
Of course, we can’t just rely on a one-off. We have 362 data points and we need to look at them all. First, let’s look at a comparison between how the preseason ratings performed and how the H.U.M.A.N. poll performed in predicting the final kenpom rating.
There’s a difference here, and you can kind of tell visually that the kenpom preseason ratings were better than the H.U.M.A.N. ratings. My ratings were slightly more clustered around the solid blue line, which represents perfection. We would have done fine using the H.U.M.A.N. poll, which is a fun finding from all of this. But fine isn’t good enough, especially when there’s room for improvement.
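To put a number on “more clustered,” you can compare each predictor’s error against the final rating. Continuing the illustrative sketch from above (the 'kp_final' values are made up, and these aren’t necessarily the error metrics I use):

```python
# Add a made-up final rating to the illustrative frame.
df["kp_final"] = [17.8, 10.2, 8.1, -4.0]

# How close was each preseason predictor to the final rating?
for col in ["kp_pre", "human_em"]:
    err = df[col] - df["kp_final"]
    print(f"{col}: MAE = {err.abs().mean():.2f}, "
          f"RMSE = {np.sqrt((err ** 2).mean()):.2f}")
```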
Now let’s look at the skill of the H.U.M.A.N. poll relative to my preseason ratings. We can do this by comparing the difference between the H.U.M.A.N. rating and the preseason kp rating to the difference between the preseason and final kp ratings. If the H.U.M.A.N.’s are providing some help, then they should have skill in identifying the directional errors in the kp preseason rating.
This is not an easy graph to interpret, but the horizontal axis is the difference between the H.U.M.A.N. poll and my preseason ratings. Values right of zero (positive values) represent teams that my ratings were higher on than the humans. Basically, positive values are where the humans thought I was too optimistic about a team’s rating. Stanford is the rightmost point.
The vertical axis is the difference between my preseason ratings and my final ratings. Values above zero (positive values) represent teams that my preseason ratings were actually too optimistic about. So points in the top right are teams that the humans thought I was too optimistic on and that I actually was too optimistic on.
If the humans had crushed it, there would be a bunch of points in the top-right and bottom-left. Instead, there are points all over. The blue line is the best-fit line and it slopes in the right direction, at least. When the humans and my computer had a difference in opinion, there was a slight tendency for the humans to add value.
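In code, the two axes and the best-fit line look something like this (same illustrative frame as above; a positive slope is the “humans add value” signal):

```python
# Horizontal axis: positive = the computer was higher than the humans.
x = df["kp_pre"] - df["human_em"]
# Vertical axis: positive = the preseason rating was too optimistic.
y = df["kp_pre"] - df["kp_final"]

# Slope of the best-fit line; a positive slope means the humans had
# some skill at spotting the computer's directional errors.
slope, intercept = np.polyfit(x, y, 1)
print(f"best-fit slope = {slope:.2f}")
```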
But for every Stanford, there was a case like Buffalo, who the humans inexplicably loved, ranking them 155th. My computer was more pessimistic, ranking them 253rd, but even that was way too optimistic, as the Bulls went 4-27, finishing 348th!
Then there was a case like High Point, who ended up being the biggest riser from my preseason ratings, going from 282nd to 114th. The problem was the humans didn’t provide any significant help, ranking the Panthers 271st.
Seeing this, I thought maybe the humans were more skillful with more famous teams. I designed the exercise so that people got more matchups involving teams expected to be better. For instance, Buffalo appeared on 167 ballots, High Point on 121, and Stanford on 271.
Maybe if we give more weight to the teams with more votes, we’ll see better skill from the humans. It turns out this approach does give slightly better results, as sketched below. Then again, we might be overfitting our data, and this is a relationship that might not hold for this exercise in future seasons.
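A sketch of that weighting, using a made-up 'ballots' column (note that np.polyfit squares the weighted residuals, so weighting by ballot count means passing the square root):

```python
# Weight each team's point by its ballot count ('ballots' is made up).
df["ballots"] = [271, 167, 200, 121]

# np.polyfit minimizes sum((w * residual)**2), so sqrt(ballots)
# weights each squared residual by the raw ballot count.
w = np.sqrt(df["ballots"])
slope_w, intercept_w = np.polyfit(x, y, 1, w=w)
print(f"weighted slope = {slope_w:.2f} (vs. unweighted {slope:.2f})")
```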
In the end, the optimal weighting for the H.U.M.A.N. poll would provide a slight correction to the kp ratings. Stanford at #42 would be #50. Florida Atlantic, who was preseason kp #37 and H.U.M.A.N. #20, would have been #34 in a blended rating. Buffalo would have gone from #255 to #234. It ends up being about an 80/20 mix of computer and human rating.
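The blend itself is just a weighted average. With the illustrative frame from above:

```python
# An 80/20 mix of computer and human ratings, per the result above.
df["blend_em"] = 0.8 * df["kp_pre"] + 0.2 * df["human_em"]
df["blend_rank"] = df["blend_em"].rank(ascending=False).astype(int)
print(df[["team", "kp_pre", "human_em", "blend_em", "blend_rank"]])
```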
Maybe that’s a win/win. The computer’s estimates are pretty robust, but y’all can have a little influence. It’s also possible we might be able to tease out which voters have some skill and which don’t, and squeeze out a little more value from the people. Anyway, thanks to everyone who participated in this experiment. I consider it a success and it’s something we can build on for a slightly better version next season. I’m looking forward to continuing to work on this.
The next step is to acknowledge our heroes who won this thing. Plus, we’ll do a little bonus analysis. Which teams tripped people up the most? Coming soon…