On Black Boxes and AI

Brady and I have spent four years now devising ways to create an advantage betting the NFL, and we’ve come a long way from the model’s earliest iterations.  One of the key differentiators we want to emphasize between First and Thirty and other handicapping sites is our clear and transparent process.  We found other handicapping sites rely too heavily on past performance or are too opaque about how they arrived at a wager recommendation.  In response, I will detail our full methodology here.

This post is long and intentionally designed to be technical, so for the uninitiated, a simple explanation: First and Thirty uses a machine learning model that takes advanced football statistics and our in-house power rankings to derive a spread.

For starters, a bit of history about how we arrived at the present model: the first prediction model I devised took Football Outsiders’ DVOA statistic and regressed it against Pro Football Reference’s Simple Rating System (SRS).  This allowed me to map each team’s DVOA to an SRS value, take the difference between the two teams’ values, and adjust for home field advantage to compute a spread.
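For the curious, here is a minimal sketch of that first approach, assuming a plain one-variable least-squares fit.  The numbers are made up for illustration and are not real DVOA or SRS values; the function names are mine, not anything from the original model.  The home field adjustment is left out here because the size of that adjustment in the first model isn't stated.

```python
# Illustrative sketch only: every number below is invented, not real DVOA/SRS data.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical season-end (DVOA %, SRS) pairs for five teams.
dvoa = [-30.0, -15.0, 0.0, 15.0, 30.0]
srs = [-9.0, -4.5, 0.0, 4.5, 9.0]

slope, intercept = fit_line(dvoa, srs)

def dvoa_to_srs(d):
    """Map a DVOA value onto the SRS (points) scale using the fitted line."""
    return slope * d + intercept

def rating_gap(home_dvoa, away_dvoa):
    """Home team's mapped SRS minus the away team's, before any home field adjustment."""
    return dvoa_to_srs(home_dvoa) - dvoa_to_srs(away_dvoa)

print(rating_gap(10.0, -10.0))
```

Because SRS is already expressed in points, the gap between two mapped values can be read directly as a neutral-field margin.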

Subsequent model iterations used ensemble methods, a machine learning technique, to adjust for the nonlinear nature of the relationship between DVOA and SRS.  This was helpful, but I wanted to go deeper and build a more sophisticated model.

That sophistication came at a cost.  Last year’s model used a vast array of features, and while I used a number of techniques to control for overfitting, I can’t say I was wildly successful.  This model was the closest to what I would classify as an artificial intelligence, and it displayed some uncanny behavior that I never grew quite comfortable with.  The most egregious example was how the model seemingly shifted its expectations in Weeks 18+.  No input to the model training indicated that these were the playoff weeks, but the output in those weeks was starkly different than throughout the regular season, clearly favoring underdogs.

There’s a lot that I learned from this about the dangers of black box models and artificial intelligence, but I think Stephen Hawking put it quite succinctly when he wrote:

The real risk with AI isn’t malice but competence. A superintelligent AI will be extremely good at accomplishing its goals, and if those goals aren’t aligned with ours, we’re in trouble. You’re probably not an evil ant-hater who steps on ants out of malice, but if you’re in charge of a hydroelectric green energy project and there’s an anthill in the region to be flooded, too bad for the ants. Let’s not place humanity in the position of those ants.

Stephen Hawking

Suffice it to say that we don’t want our bets to be in that position either.

The second issue that has plagued the model over the years is its lack of a forward-looking component.  Football Outsiders does a commendable job with their DAVE ratings, but there’s a diverse set of opinions about team quality as a season evolves, and many teams, especially in this pandemic year, are altered considerably by injuries and shifts in team composition.

The final model this year takes only two inputs for each team: Brady’s power ranking and the team’s DVOA (DAVE in the early season, WEI DVOA later on).  These DVOA inputs should be adjusted for injuries where they can be, but the incorporation of a power ranking will allow us to make the largest adjustments on the dial.

I strongly feel that this model will be less prone to overfitting, and by operating in a lower dimensional space it allows us to visually “see” the model and avoid the black box tendencies above. 

It also asks less of machine learning itself.  Machine learning is really good at mapping inputs to outputs, but it does terribly at handling abstract concepts and has no true ability to reason.  These limitations are something I’ve become acutely aware of as the model has evolved.

DVOA values are mapped to SRS in much the same way this year as they were when I originally devised the model years ago.  But mapping Brady’s power ranking can’t be as scientific.  In lieu of kidnapping Brady and forcing him to watch tape of every NFL game since 2002 in a Jets-themed dungeon, I drew up a stratified random sampling of three “slates” of 32 teams across various seasons, asking him to power rank each slate.  These power rankings were regressed against SRS, allowing me to map power rankings to SRS values.
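As a rough illustration of how slates like these could be drawn, here is a sketch of stratified sampling.  It assumes the strata are bands of SRS, which is only one plausible choice; the pool of team-seasons, the band count, and the function name are all placeholders invented for the example.

```python
import random

def build_slates(team_seasons, n_slates=3, slate_size=32, n_strata=8, seed=0):
    """team_seasons: list of (label, srs) tuples. Returns n_slates lists of labels.

    Assumption: strata are equal-sized SRS bands, and every slate draws the
    same number of team-seasons from each band, without repeats across slates.
    """
    rng = random.Random(seed)
    ordered = sorted(team_seasons, key=lambda ts: ts[1])
    per_stratum = slate_size // n_strata
    stratum_len = len(ordered) // n_strata
    slates = [[] for _ in range(n_slates)]
    for s in range(n_strata):
        start = s * stratum_len
        end = len(ordered) if s == n_strata - 1 else start + stratum_len
        stratum = ordered[start:end]
        # One draw per band, then deal the picks out to the slates.
        picks = rng.sample(stratum, per_stratum * n_slates)
        for i, slate in enumerate(slates):
            chunk = picks[i * per_stratum:(i + 1) * per_stratum]
            slate.extend(label for label, _ in chunk)
    for slate in slates:
        rng.shuffle(slate)  # hide the SRS ordering from whoever does the ranking
    return slates

# Hypothetical pool: 320 placeholder team-seasons with synthetic SRS values.
pool_rng = random.Random(1)
pool = [(f"team_season_{i}", pool_rng.gauss(0.0, 6.0)) for i in range(320)]

slates = build_slates(pool)
print([len(s) for s in slates])
```

Drawing evenly from each band keeps every slate from being dominated by mediocre teams, so the ranker sees the full range from great to terrible in each one.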

After each season going forward I can take his final rankings and feed them to the model, allowing it to update each year.

A machine learning purist will note that this is not the most scientifically “correct” process.  Brady will of course have used SRS to create his training data power rankings in the first place, creating a data leakage problem.  But this concern misses the entire purpose of the exercise; by applying human reason to sift through multiple sources of data, we are using ensemble methods as a mapping tool instead of expecting them to function as a prediction tool.

Nevertheless, this concern about data leakage is very much accurate, so in lieu of the more commonly used random forest regressor I have opted for the use of extremely randomized trees.
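To show what separates extremely randomized trees from a random forest, here is a toy pure-Python sketch.  It is not the production model: the training data is synthetic, the hyperparameters are arbitrary, and a real build would more likely lean on a library implementation.  The key difference is that each candidate split uses a cut point drawn at random rather than one optimized against the training data, and every tree sees the full sample instead of a bootstrap resample.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, rng, min_leaf=5):
    """rows: list of ((power_rank, dvoa), srs). Returns a nested tuple or a leaf value."""
    best = None
    if len(rows) >= 2 * min_leaf:
        for feature in range(len(rows[0][0])):
            column = [x[feature] for x, _ in rows]
            lo, hi = min(column), max(column)
            if lo == hi:
                continue
            cut = rng.uniform(lo, hi)  # random cut point, not a searched-for optimum
            left = [r for r in rows if r[0][feature] < cut]
            right = [r for r in rows if r[0][feature] >= cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y for _, y in left]) + sse([y for _, y in right])
            if best is None or score < best[0]:
                best = (score, feature, cut, left, right)
    if best is None:
        return mean([y for _, y in rows])  # leaf: average SRS of the rows that landed here
    _, feature, cut, left, right = best
    return (feature, cut, build_tree(left, rng, min_leaf), build_tree(right, rng, min_leaf))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        feature, cut, left, right = tree
        tree = left if x[feature] < cut else right
    return tree

def fit_extra_trees(rows, n_trees=100, min_leaf=5, seed=0):
    rng = random.Random(seed)
    return [build_tree(rows, rng, min_leaf) for _ in range(n_trees)]  # no bootstrap

def predict(forest, x):
    return mean([predict_tree(tree, x) for tree in forest])

# Synthetic training rows: (power rank 1-32, DVOA %) -> SRS. Invented numbers.
data_rng = random.Random(42)
rows = []
for _ in range(300):
    rank = data_rng.randint(1, 32)
    dvoa = 2.0 * (16.5 - rank) + data_rng.gauss(0.0, 5.0)
    srs = 0.3 * dvoa + data_rng.gauss(0.0, 1.5)
    rows.append(((rank, dvoa), srs))

forest = fit_extra_trees(rows)
print(predict(forest, (1, 30.0)), predict(forest, (32, -30.0)))
```

Because the cut points are random, individual trees have less opportunity to latch onto quirks in the training targets, and averaging many of them yields a smoother surface.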

The model can be visualized thusly, with the vertical axis representing SRS and the horizontal axes representing Brady’s power ranking and FO’s DVOA statistic, respectively:

For each game, the home and away teams are each placed on this plane, the away team’s SRS is subtracted from the home team’s, and a 2-point home field advantage is applied to the home team to determine the final spread.
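The arithmetic for a single game is simple enough to sketch.  The two SRS values below are invented, the function names are placeholders, and expressing the result as a negative number for the favorite is my assumption about notation rather than something the model dictates.

```python
HOME_FIELD_ADVANTAGE = 2.0  # points, the figure quoted for this year's model

def projected_home_margin(home_srs, away_srs, hfa=HOME_FIELD_ADVANTAGE):
    """Expected home margin of victory; positive means the home team is favored."""
    return home_srs - away_srs + hfa

def as_betting_line(home_margin):
    """Conventional notation (assumed here): the favorite is listed with a minus sign."""
    return -home_margin

# Hypothetical game: the surface puts the home team at +4.5 SRS, the visitor at +1.0.
margin = projected_home_margin(4.5, 1.0)
print(margin, as_betting_line(margin))  # 5.5 -5.5
```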

You will notice some intriguing tendencies about the shape of this plane.  First, for the upper echelons of our power rank and of DVOA, the model exhibits a plateau.  My explanation for this is that the NFL perhaps holds an inherent “skill ceiling.”  This would make intuitive sense; the human body can only be pushed so hard before reaching the limits of its biological potential, and if there are any athletes on the planet who might ever reach these limits, they are in the NFL.

Second, the worst of the terrible teams, the bottom-ranked, lowest DVOA scum, have an incredibly sharp falloff in their SRS, far more dramatic than at any other point on the surface.  My best explanation for this is that it’s hard evidence of tanking.  I naturally suspect we will be doing a lot of betting against the worst-ranked teams, but if anyone falls into “tanking” territory, they may turn into a consistent target for us.

In closing: we’re sharp as hell.  Bet with us.

