NBA Playoff Predictor Model Methodology
Introduction
We built a statistical model using multiple linear regression to predict every series in the 2024 NBA playoffs. The model is based on 13 seasons of advanced team stats, from 2010-11 through 2022-23, and the outcomes of every playoff series in that span. To see the model’s predictions for the 2024 NBA Playoffs, check out this piece. This article walks through the process, methodology, and logic of the model, then covers its limitations and recommended use.
Data Collection and Cleaning
The model’s independent variable data comes from BasketballReference’s Advanced Stats pane from the 2010-11 season to 2022-23. We populated an Excel sheet with each season’s data, which came out to 390 rows (30 teams x 13 seasons). To account for changes over time in NBA play styles, such as offensive rating climbing dramatically over that time span, we rescaled each value within its season. This means that the team with the number one offensive rating in the league in 2010-11 has the same value in the dataset (1) as the best offensive team in 2022-23.
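The within-season rescaling can be sketched in a few lines. The article does not say which rescaling was used, so this example assumes min-max scaling to the range 0 to 1 (a rank-based scale would also put the league leader at 1). The authors worked in Excel and R; Python is used here for illustration only, and the field names and ratings are made up.

```python
from collections import defaultdict


def rescale_within_season(rows, stat):
    """Min-max scale `stat` to [0, 1] separately for each season.

    `rows` is a list of dicts with hypothetical keys "season", "team", and `stat`.
    Returns new dicts; the input rows are left unchanged.
    """
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row[stat])

    scaled = []
    for row in rows:
        values = by_season[row["season"]]
        low, high = min(values), max(values)
        # Guard against a season in which every team has the same value.
        value = 0.0 if high == low else (row[stat] - low) / (high - low)
        scaled.append({**row, stat: value})
    return scaled


# Illustrative numbers only, not real team ratings.
rows = [
    {"season": "2010-11", "team": "A", "ortg": 112.0},
    {"season": "2010-11", "team": "B", "ortg": 106.0},
    {"season": "2010-11", "team": "C", "ortg": 109.0},
    {"season": "2022-23", "team": "D", "ortg": 119.0},
    {"season": "2022-23", "team": "E", "ortg": 111.0},
]
scaled = rescale_within_season(rows, "ortg")
```

Under this assumption, teams A and D both end up with a value of 1 even though their raw offensive ratings differ by seven points, which is the effect the rescaling is meant to achieve.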
To train the model, we listed out the record of every team in each playoff series from 2011-2023. This resulted in another 390 rows (15 series x 13 seasons x 2 teams). In other words, each series appeared twice, once with each team as “Team One” and the other as “Team Two.” The dependent variable is the win percentage of Team One in the series. For example, the Cavaliers-Warriors series in the 2016 NBA Finals, which the Cavaliers won 4-3, appears once with the Cavaliers at a 57.14% win rate and once with the Warriors at a 42.86% win rate. The goal of the model is to predict series win percentage based on the assortment of advanced stats from BasketballReference.
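As a small illustration of this mirrored layout, the sketch below turns one series result into its two training rows. The function name and dictionary keys are hypothetical, not taken from the authors' spreadsheet.

```python
def series_rows(season, team_a, wins_a, team_b, wins_b):
    """Return the two mirrored rows for one playoff series.

    Each row records the series win percentage of whichever team is Team One.
    """
    games = wins_a + wins_b
    return [
        {"season": season, "team_one": team_a, "team_two": team_b,
         "win_pct": wins_a / games},
        {"season": season, "team_one": team_b, "team_two": team_a,
         "win_pct": wins_b / games},
    ]


# 2016 NBA Finals: Cavaliers beat the Warriors 4-3.
rows = series_rows(2016, "Cavaliers", 4, "Warriors", 3)
cavs_pct = round(rows[0]["win_pct"] * 100, 2)      # 57.14
warriors_pct = round(rows[1]["win_pct"] * 100, 2)  # 42.86
```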
To get a statistical picture for both teams in each series, we measured the difference between Team One and Team Two for each independent variable. We did this by subtracting Team One’s metrics from Team Two’s, so each value is Team Two minus Team One. For example, if the Hawks (Team One) had an offensive rating of 110.5 and the Wizards (Team Two) had an offensive rating of 112.0, the offensive rating value for the Hawks-Wizards series was 112.0 - 110.5 = 1.5. This also means that the same series, but with the Wizards as Team One, would have an offensive rating value of -1.5. This resulted in an array of values for each series capturing the deltas between the two teams for each independent variable.
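The delta calculation can be sketched as follows, using the sign convention from the example above (Team Two minus Team One). The function and key names are hypothetical.

```python
def stat_deltas(team_one_stats, team_two_stats):
    """Team Two minus Team One for every stat present in both dicts."""
    return {
        stat: team_two_stats[stat] - team_one_stats[stat]
        for stat in team_one_stats
        if stat in team_two_stats
    }


# Offensive ratings from the Hawks-Wizards example in the text.
hawks = {"ortg": 110.5}
wizards = {"ortg": 112.0}

hawks_as_team_one = stat_deltas(hawks, wizards)    # {"ortg": 1.5}
wizards_as_team_one = stat_deltas(wizards, hawks)  # {"ortg": -1.5}
```

Because every series appears twice with the roles swapped, each delta in the dataset has a mirror image with the opposite sign.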
Data Analysis
With the cleaned dataset, we used R to create a multiple linear regression model. To begin, we included all of BasketballReference’s advanced stats. This included age, wins, losses, win percentage, pythagorean wins, pythagorean losses, margin of victory, strength of schedule, simple rating system, offensive rating, defensive rating, net rating, net rating rank, pace, free throw rate, three point attempt rate, true shooting, and the Offense and Defense Four Factors (eFG%, TOV rate, offensive rebound percentage, and free throws per field goal attempt). The dependent variable was the win percentage of “Team One.” The initial model had an R² value of 0.4389 and several variables with large p-values. To reduce noise within the model, we removed the variable with the highest p-value and re-ran the model, repeating until no variable had a p-value greater than 0.10.
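A minimal sketch of the elimination loop above, assuming each variable's significance is read off as a coefficient p-value after every refit. The regression itself is hidden behind a hypothetical callable, `fit_pvalues`, which in the authors' workflow would correspond to refitting the model in R. The p-values in the stand-in below are invented for the demonstration.

```python
def backward_eliminate(predictors, fit_pvalues, threshold=0.10):
    """Drop the least significant predictor until none exceeds `threshold`.

    `fit_pvalues` is a hypothetical callable: given a list of predictor names,
    it refits the regression and returns a dict of {name: p_value}.
    """
    kept = list(predictors)
    while kept:
        pvalues = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= threshold:
            break  # every remaining predictor clears the threshold
        kept.remove(worst)
    return kept


# Stand-in for a real regression fit; these p-values are made up.
FAKE_PVALUES = {"pace": 0.62, "age": 0.31, "ortg": 0.04, "srs": 0.01}


def fake_fit(names):
    return {name: FAKE_PVALUES[name] for name in names}


selected = backward_eliminate(["pace", "age", "ortg", "srs"], fake_fit)
```

In a real fit the p-values change each time a variable is dropped, which is why the model has to be refit on every pass instead of removing all weak variables at once.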
The resulting model had a shorter list of independent variables: net rating rank, wins, losses, win percentage, margin of victory, simple rating system, offensive rating, free throw rate, three point attempt rate, true shooting, eFG%, turnover rate, opponent eFG%, and opponent turnover rate. Several of these variables are closely related to one another. To account for this, we multiplied wins, losses, and win percentage together to capture the inherent interaction between winning and losing games. True shooting and eFG% are also calculated with similar formulas, creating a strong interaction between the two, so we multiplied those as well. This finalized model had a slightly improved adjusted R² of 0.4436, meaning that approximately 44% of the variance in series win percentage is explained by the model.
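The two steps in this paragraph can be sketched as follows. The first function builds the product terms, assuming a single three-way product for wins, losses, and win percentage and a two-way product for true shooting and eFG%. The second applies the standard adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors. Column names and input values are hypothetical.

```python
def add_interactions(row):
    """Return a copy of `row` with the two product terms added."""
    out = dict(row)
    out["wins_x_losses_x_win_pct"] = row["wins"] * row["losses"] * row["win_pct"]
    out["ts_x_efg"] = row["ts_pct"] * row["efg_pct"]
    return out


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Illustrative delta values for one series, not real data.
row = {"wins": 2.0, "losses": -2.0, "win_pct": 0.5, "ts_pct": 0.5, "efg_pct": 0.25}
with_terms = add_interactions(row)

# Illustrative inputs: R-squared of 0.5 with 101 rows and 10 predictors.
example_adj = adjusted_r_squared(0.5, 101, 10)
```

Adjusted R² only rises when a change to the model improves the fit by more than the added or removed predictors would by chance, which is why it is the fairer number for comparing the initial and final models.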
See a summary of the series model below.
Assumptions and Limitations
The relatively low R² value indicates that there are a myriad of other factors that matter to playoff series outcomes that the model doesn’t capture. Some of these factors can’t be measured accurately, such as luck and momentum. Others, such as playoff experience, depth, and injuries, can be. A future model should look to account for these other measurable factors to get a more complete view of each playoff team.
The model also relies on the differences between teams in their advanced metrics to calculate their predicted playoff success. It’s plausible that the difference in metrics doesn’t tell the entire story. For example, two teams with elite offenses will have a small delta in their offensive rating, just like two teams with poor offenses will. In the latter case, the teams’ other metrics may become more important: winning the rebounding battle or the turnover differential might matter more between two poor offenses than between two elite ones. An ideal model should account for these different situations.
Another limitation of this model is that it utilizes regular season statistics to predict playoff outcomes. Postseason basketball is an entirely different game from the regular season in a myriad of ways. For one, defensive liabilities are often played off the floor due to matchup hunting in the playoffs. This reduces the effectiveness of guards who can’t defend, and it opens up the floor for mobile bigs and small ball lineups to replace interior defenders such as Rudy Gobert. The playoff game is also much more physical, and thus we see players who can play through contact such as Nikola Jokic, Kawhi Leonard, and LeBron James elevate their game in the postseason. Finally, easy buckets are few and far between in the playoffs. Lackadaisical defense during the regular season allows strong offensive teams to consistently find quality shots. During the playoffs, every point must be earned, so contested shot-makers such as Jimmy Butler and Jamal Murray tend to stand ahead of the pack when the lights are brightest.
Overall, this model should be used more as a thought-provoking experiment than as a serious predictor of playoff success. It’s possible that it can expose inefficiencies in the betting markets by revealing underrated teams, such as the Pacers and Suns in 2024. However, given the number of other factors that play into playoff outcomes, the model should not be used alone to predict individual series. It is just one of many ways to evaluate teams as they prepare for their playoff runs.
Contributors: Jason Taylor and Samuel Rui