Explaining ESPN’s Real Plus-Minus

Last Monday, ESPN debuted their newest basketball statistic: Real Plus-Minus (RPM). Unsurprisingly, it was met with a variety of reactions ranging from excited to confused to baffled.

I think it’s great that people are critiquing the stat and studying its strengths and weaknesses. I’ve read a lot of really insightful discussion about how to best use it, with opinions all over the map. However, I’ve also encountered a number of misconceptions about RPM. In this piece, I’d like to correct, clear up, and emphasize a handful of points that I think are important for anyone who wants to use this stat.

1. Real Plus-Minus is not new.

RPM is the latest version of xRAPM[1], a stat created by Jeremias Engelmann (with contributions from Steve Ilardi, who authored ESPN’s introduction to RPM). Engelmann’s stat is meant to improve upon RAPM[2]. That stat attempts to fix the problems in APM[3], which was conceived as a better version of conventional plus-minus (which you can find in most box scores). All of these metrics have the same goal: to quantify how much a player hurts or helps his team when he’s on the court. Studying this lineage can help us understand why Real Plus-Minus exists, where it excels, and where it struggles.

We begin with a different sport and country. The Montreal Canadiens created the vanilla version of plus-minus during the ’50s. After becoming a standard hockey stat, it eventually made its way to basketball during the 2000s. Its computation is simple. If a player’s team outscores his opponent by 5 points while he’s on the court, his plus-minus is +5. You can easily convert this into a per-minute or per-possession metric.
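To make the computation concrete, here’s a toy sketch in Python. The stint data is invented for illustration:

```python
# A minimal sketch of raw plus-minus, with made-up stint data.
# Each stint while the player is on the court:
# (points scored by his team, points allowed, possessions played)
stints = [(12, 7, 10), (8, 11, 9), (15, 9, 12)]

# Raw plus-minus: net points while the player is on the court
plus_minus = sum(scored - allowed for scored, allowed, _ in stints)

# Convert to a per-100-possessions rate
possessions = sum(p for _, _, p in stints)
per_100 = 100 * plus_minus / possessions

print(plus_minus)         # 8
print(round(per_100, 1))  # 25.8
```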

The problem with plus-minus is that it doesn’t account for the quality of a player’s teammates and opponents. As an estimator of how much a player helps or hurts his team, it’s inherently flawed. Plus-minus has a statistical property called bias–no matter how much data you collect (i.e. even if you could make teams play thousands of games), it will systematically misrepresent certain players’ contributions.

This is easily illustrated with an example. If Jeremy Lamb plays all of his minutes with Kevin Durant, Lamb will look pretty good. Sure enough, Lamb has a raw plus-minus of +8.0, ranking him just ahead of LeBron James. Is Lamb really as good as LeBron, or is this a reflection of the fact that he plays a lot of his minutes with the likely MVP (and against opponents’ second units)? Plus-minus can’t answer this.

Which leads us to adjusted plus-minus, or APM. This metric tries to isolate a specific player’s impact by adjusting his plus-minus numbers for the quality of his teammates and opponents. To continue our example (with made-up numbers), APM might use the minutes Durant plays without Lamb to find that he’s worth an additional 9 points per 100 possessions. This would mean Lamb actually loses the Thunder one point per 100, making him a below average player[4]. APM can use minutes where one teammate plays without the other to figure out how much each contributes when they play together.
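Footnote 4 mentions that APM comes from a regression over stints. As a toy illustration of that setup (the stint matrix and margins below are invented, chosen to match the made-up Durant/Lamb numbers above):

```python
import numpy as np

# Rows are stints; columns are players. +1 if the player was on the court
# for the team, -1 if for the opponent, 0 if sitting.
# Columns: [Durant, Lamb, an opposing player]
X = np.array([
    [1, 1, -1],   # Durant and Lamb together
    [1, 0, -1],   # Durant without Lamb
    [0, 1, -1],   # Lamb without Durant
], dtype=float)

# Observed scoring margin (points per 100 possessions) in each stint
y = np.array([8.0, 9.0, -1.0])

# Ordinary least squares: each coefficient is a player's APM
apm, *_ = np.linalg.lstsq(X, y, rcond=None)
print(apm)  # roughly [9, -1, 0]: Durant +9, Lamb -1 per 100
```

The minutes Durant and Lamb play apart (rows 2 and 3) are what let the regression split their joint +8 into separate ratings.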

Unfortunately, what makes APM better also makes it volatile. If two players share the court for a large portion of their minutes, the rare instances where one plays without the other can have disproportionately large effects on both players’ ratings[5].

Imagine Durant and Lamb play all of their minutes together during the first 81 games of the season. However, for their last game, the Thunder decide to rest Durant for the playoffs. In that game, Reggie Jackson and Derek Fisher catch fire and the Thunder destroy their opponent. When APM is rating players, it will see that the Thunder looked great the only time Lamb played without Durant and wrongly conclude that Durant was actually holding Lamb back the whole season[6]. Even though this example is a bit contrived, there are a lot of real-world instances where small sample randomness leads to bizarre APM results.

How do you work around this problem? That’s where regularization–the R in RAPM–comes in.

In APM, a statistical technique called linear regression estimates each player’s rating. RAPM employs a modification of this called ridge regression[7]. This method of estimating ratings has the effect of pulling values toward some pre-determined expectation known as a prior. RAPM uses a rating of 0 as its prior–in other words, it’s skeptical of a player who rates as strongly above or below average, unless it has a lot of data to back that up. Using ridge regression reduces the impact of the small sample size problems I described above. Values that stray too far from the prior, especially without lots of data to back them up, will be pulled back into (hopefully) a more reasonable estimate. RAPM doesn’t entirely fix the extreme Durant-Lamb case I outlined earlier, but it does a good job with more realistic scenarios. While randomness can still have an effect, the damage is less than it is for APM.
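Here’s a minimal sketch of that shrinkage, reusing the same invented stint data (footnote 7 gives the criterion being minimized; lambda, the strength of the pull toward 0, is a tuning knob I’ve picked arbitrarily):

```python
import numpy as np

# Same toy stint matrix as the APM sketch: [Durant, Lamb, opponent]
X = np.array([
    [1, 1, -1],
    [1, 0, -1],
    [0, 1, -1],
], dtype=float)
y = np.array([8.0, 9.0, -1.0])

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^-1 X'y.
    # lam = 0 recovers plain least squares (APM); larger lam
    # pulls the coefficients toward the 0 prior.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

print(ridge(X, y, 0.0))  # plain APM: roughly [9, -1, 0]
print(ridge(X, y, 5.0))  # Durant's and Lamb's ratings shrink toward 0
```

With so little data, the regularized estimates sit much closer to 0 than the raw APM values, which is exactly the skepticism described above.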

RAPM solves the major problems its predecessors face, but can still be improved upon. Its biggest weakness is that its method of reducing statistical noise also causes it to ignore some useful data. This is where–finally–expected RAPM (xRAPM) comes in.

Let’s revisit those priors. Instead of automatically giving everyone a 0 prior, xRAPM assigns each player a value that tries to predict what his RAPM should be. Since ridge regression pulls ratings toward priors, Durant’s xRAPM moves toward a higher number than Lamb’s (assuming the priors rate Durant better than Lamb). The prior is a mathematical way to say, “If you’re not sure who should get credit for this, it’s probably the guy who we already think is better.” If you can devise a method to intelligently set these priors[8], you will improve the statistic’s predictive power.
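Extending the ridge sketch, a nonzero prior just changes the target the coefficients get pulled toward. The prior vector below is made up (it only encodes “we already think Durant is good”):

```python
import numpy as np

# Same toy stint matrix as before: [Durant, Lamb, opponent]
X = np.array([
    [1, 1, -1],
    [1, 0, -1],
    [0, 1, -1],
], dtype=float)
y = np.array([8.0, 9.0, -1.0])

def ridge_with_prior(X, y, lam, prior):
    # Minimizes ||y - Xb||^2 + lam * ||b - prior||^2,
    # the criterion from footnote 7 with a general prior.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y + lam * prior)

zero_prior = np.zeros(3)                     # RAPM's assumption
durant_prior = np.array([5.0, 0.0, 0.0])     # xRAPM-style informed prior

print(ridge_with_prior(X, y, 5.0, zero_prior))    # heavy shrinkage toward 0
print(ridge_with_prior(X, y, 5.0, durant_prior))  # Durant's rating stays higher
```

Same data, same regularization strength; the only difference is where the skepticism points, and Durant keeps more of the credit when the prior already favors him.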

So how are these priors set? One major part of the prior is a player’s performance in the previous season. Intuitively, this makes sense. If a player was really good last year, he’ll probably be good this year. Another component is based on box score stats. Although these numbers are prone to misrepresenting certain types of players, they can still give xRAPM useful information that helps it separate out different players’ contributions. In addition to these, other types of data like height and age help determine priors. All of these factors give xRAPM clues about how good it should expect a given player to be.

That was a lot to take in, so let’s recap:

  • Plus-minus measures how well a player’s team performs when he plays, but doesn’t adjust for context.
  • Adjusted plus-minus (APM) accounts for the quality of a player’s teammates and opponents, but struggles when players get most of their minutes together and is susceptible to sample size issues.
  • Regularized adjusted plus-minus (RAPM) smooths out APM’s more extreme results, but tends to be a little too conservative.
  • Expected regularized adjusted plus-minus (xRAPM) brings in other types of data to improve RAPM. This stat is what Real Plus-Minus is based on.

So there you have it–a short history of Real Plus-Minus, which, aside from some possible tweaking, is not a new statistic.

2. Real Plus-Minus does NOT measure how well a player has performed this season. 

Statistics can be divided into two categories. Descriptive statistics tell us about what happened in the past. For instance, I can check how many page views this blog post has. Predictive statistics try to forecast what will happen in the future. I could create a model that estimates how many page views I’ll get over the next 24 hours. The difference between these is subtle, but important.

Real Plus-Minus is meant to be predictive. It’s interested in how well a player will perform in the future, rather than what he did in the past. RPM’s emphasis on prediction explains why it uses some of the tricks it does.

For instance, I mentioned earlier that RPM uses data from previous seasons in its priors. If my primary goal is to evaluate how well a player did this season, it wouldn’t make a lot of sense to use data from other seasons. However, if I want to predict what will happen in the future, the older numbers can help me differentiate between players who have been consistently good (and will likely keep being good) and players who are merely going through a hot streak (and will likely regress to their mean).

This has a number of implications. One is that RPM tends to be skeptical of player improvements (or regressions) that exceed what is expected for a player that age. This season, Anthony Davis improved much faster than most 20-21 year old players. People who watch basketball know that Davis is super talented and accelerated growth is expected from him. However, Real Plus-Minus doesn’t understand this and suspects that Davis’s numbers might be a random blip. As a result, Real Plus-Minus is liable to underestimate Davis’s impact this season[9].

On a less technical note, RPM’s focus on prediction makes it a poor way to determine who should get end-of-season awards. I think this is an important point to emphasize because ESPN does exactly this in its introduction to RPM, using it to argue that Taj Gibson is a better candidate than Jamal Crawford for 6th Man of the Year. RPM is optimized to predict the future, not evaluate the past.

3. Offense and defense are equally important in Real Plus-Minus.

RPM separates players’ offensive and defensive contributions (into ORPM and DRPM, respectively) and counts each of these equally in overall RPM. It seems obvious that both ends of the court are equally valuable, but it’s important to appreciate the ramifications of this. Again, I’ll illustrate this with an example.

Let’s look at James Harden. The Beard is regarded as a great offensive player, and his ORPM of 5.69 (fourth-best in the league) bears this out. He’s also viewed as a laughably poor defender, which is confirmed by his DRPM of -2.66 (77th out of 91 shooting guards). When we analyze the components of his game, his numbers agree with the eye-test.

However, this doesn’t hold up when we consider his game as a whole. Many analysts and fans consider Harden to be a top-10 player. But RPM ranks him 46th overall, just ahead of Robin Lopez. What gives?

This discrepancy exists because Real Plus-Minus forces us to weight Harden’s offense and defense equally. A subjective evaluation of his game makes it easy to focus on all the great things he does on offense and forgive his shortcomings on the other end. I found myself surprised by how low he ranked, but after thinking about it, I have to agree. In order to think Harden is a top-10 player, you must believe one of the following:

  1. Harden is close to league average on defense.
  2. Harden is significantly better on offense than Chris Paul and Stephen Curry.
  3. A point added on offense is worth more than a point added on defense.

I don’t agree with any of these statements, so I’m forced to concede that Harden isn’t a top-10 player.

I’m grouping the next two together because they both deal with how to interpret Real Plus-Minus scores.

4. If Player X has an RPM of 2.0 and Player Y has an RPM of 1.0, Player X is not twice as good as Player Y. 

5. A player with a negative RPM does not necessarily hurt his team.

I’ve seen these and similar errors in more than a few places. RPM values are reported as relative to an average NBA player. In the first case, you could say “Compared to a league average player, Player X adds twice as many points per 100 possessions as Player Y” and be correct, but that’s a bit of a mouthful.

A better way to look at RPM scores is to compare them to either A) a player’s backup or B) a replacement level player.

CJ Watson gives us a good example of why backup comparison is important. The consensus view of Watson is that he’s the Pacers’ most important bench player and the team was clearly affected during his recent injury. However, Watson’s RPM of -0.08 doesn’t particularly impress. An uninformed glance at that number might make you say, “Okay, he’s average, but does he help the team at all?”

The answer: YES!

When Watson was out, third-string guard Donald Sloan picked up most of his minutes. What’s Sloan’s RPM? An atrocious -7.7, the fourth-worst rating in the league. If Sloan takes 12 of Watson’s 19 minutes per game, the Pacers lose around 1.8 points. For one bench player, that’s a huge impact[10]. Even if the Pacers had a below average backup point guard (say, Ish Smith, who rates at -1.94), he would be a huge help by virtue of keeping Sloan off the court[11].
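The arithmetic behind that estimate is simple once you convert minutes to possessions. The pace figure below is my assumption (roughly league-average possessions per 48 minutes), not an official number:

```python
# Rough cost of Sloan taking Watson's minutes, using the RPM values above.
watson_rpm = -0.08   # points per 100 possessions vs. average
sloan_rpm = -7.7
minutes_swapped = 12.0
pace = 96.0          # assumed possessions per 48 minutes

# Possessions covered by those minutes, then the per-game point swing
possessions = minutes_swapped / 48 * pace
points_lost = (watson_rpm - sloan_rpm) * possessions / 100

print(round(points_lost, 1))  # about 1.8 points per game
```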

The replacement level comparison is important because it gives us a way to determine whether a player is helpful or harmful, and how much so. Replacement level is the theoretical cutoff between someone good enough to play in the NBA and someone who can’t quite make it. If a player falls below that point, he’s no longer contributing and should be replaced by someone from outside the league. Therefore, a player’s value can be measured by how much better he performs than a replacement level player[12].

This is the idea behind ESPN’s Wins Above Replacement metric (WAR). That stat uses Real Plus-Minus to estimate how many points a player has added (or subtracted) over the entire season. It then determines how many wins those points would be expected to add. This is useful because “wins” is a more relevant and understandable unit than “points above average.”
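A hedged sketch of the idea: the replacement level comes from footnote 12, but the points-per-win conversion below is my own rough assumption, not ESPN’s published constant:

```python
# Sketch of a wins-above-replacement calculation from an RPM value.
REPLACEMENT_LEVEL = -2.35  # points per 100 possessions (footnote 12)
POINTS_PER_WIN = 30.0      # assumed season-long conversion, not ESPN's

def wins_above_replacement(rpm, possessions_played):
    # Points added over a replacement player across the season,
    # then converted to wins.
    points_above = (rpm - REPLACEMENT_LEVEL) * possessions_played / 100
    return points_above / POINTS_PER_WIN

# e.g. an exactly average player (RPM = 0) over ~5000 possessions
print(round(wins_above_replacement(0.0, 5000), 1))  # about 3.9 wins
```

Note that even an average player produces meaningful wins this way, which is the whole point of measuring against replacement level rather than against average.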

6. Position is relevant when using Real Plus-Minus to evaluate a player.

Unsurprisingly, different positions tend to be better at different things. Point guards tend to be more focused on offense, while centers are more likely to be gifted defenders. This should have an effect on how we value different players.

Let’s consider the following question: Does Roy Hibbert or Paul George contribute more to Indiana’s league-leading defense?  DRPM gives Hibbert a rating of 3.52 while George is at 2.61. If you stopped your analysis there, you’d conclude that Hibbert is the key to the Pacers’ stinginess.

However, when you adjust for position, this isn’t the case. An average center has a DRPM of 1.78, while small forwards average a rating of 0.04. Hibbert exceeds his positional average by just 1.74 points per 100, while George does so by an impressive 2.57 points. When we consider position, you can make a compelling argument that George provides more defensive value than Hibbert[13].
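The adjustment itself is just a subtraction against the positional average, which is easy to verify:

```python
# Positional adjustment from the Hibbert/George comparison,
# using the positional average DRPMs cited above.
positional_avg_drpm = {"C": 1.78, "SF": 0.04}

def drpm_above_position(drpm, position):
    # How much a player's DRPM exceeds the average at his position
    return drpm - positional_avg_drpm[position]

print(round(drpm_above_position(3.52, "C"), 2))   # Hibbert: 1.74
print(round(drpm_above_position(2.61, "SF"), 2))  # George: 2.57
```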

7. The quality of a player’s teammates and opponents DOES NOT impact his rating.

I discussed this earlier, but I want to emphasize it because I’ve noticed a lot of people making this mistake. The entire goal of any adjusted plus-minus stat is to filter out other players’ effects on raw plus-minus in order to isolate one player’s contributions. If you notice someone has a better rating than you would expect, it doesn’t make sense to attribute it to him sharing the court with a superstar or playing against second-units. There are plenty of reasons to think a player’s RPM doesn’t reflect his true value, some of which I’ve mentioned here. That isn’t one of them.

This doesn’t mean lineup factors don’t have any effect on RPM. Players who spend most of their time with players who complement them well will get a boost to their ratings, and vice versa. For an extreme example of this, imagine if you tried to play the five highest-rated centers. RPM predicts that they would outscore opponents by 20 points per 100 possessions, but I’m quite confident they’d perform much worse. In more realistic cases, lineups that don’t have enough shooting, creating, etc. could hurt those players’ ratings[14].

It’s perfectly reasonable to use lineup factors to doubt a player’s Real Plus-Minus, but you have to use deeper analysis than simply pointing out the players he tends to play with and against.

1. Short for “expected regularized adjusted plus-minus”

2. Regularized adjusted plus-minus

3. Adjusted plus-minus. We’re losing letters by the sentence.

4. Of course, it’s not quite this simple, since Durant’s APM score is also affected by Lamb’s. In practice, these values are estimated by setting up several equations and running a regression to estimate coefficients that represent each player’s APM.

5. This is called collinearity. In statistics, when two variables are highly correlated, it can be hard to determine which has more causal effect on the outcome.

6. The opposite could happen, too. If the Thunder laid an egg that game, Durant’s rating would get inflated.

7. For the stats nerds: Ridge regression modifies the optimization criterion for standard linear regression. Instead of simply trying to minimize the residuals, it minimizes the sum of the residuals and a term based on the distance between the coefficients and the priors.

8. This isn’t that hard when your baseline is to assume every player is equally good. But finding the best way to do it is an ongoing topic of study.

9. This effect tends to be stronger for players who improve later in their careers because RPM expects little to no improvement from them.

10. That’s roughly equal to the difference between the Chicago Bulls and Charlotte Bobcats.

11. It’s worth noting that Sloan probably isn’t that bad. His increase in minutes coincided with the team’s overall regression, so he likely gets a disproportionate share of the blame for that. Still, other metrics and the eye-test make him out to be pretty terrible for an NBA player.

12. NBA replacement level is set at -2.35 points per 100.

13. The basic difference here is what you’re comparing them to. Compared to an average NBA player, Hibbert’s better. But compared to their positional averages, George is better. Seeing as a realistic lineup would almost always have Hibbert replacing another big man, I think the positional comparison makes more sense.

14. The Real Plus-Minus framework could be extended to account for this by including coaches as members of every lineup they play. This would give them some of the credit for their ability to play lineups with chemistry. That might already be part of Real Plus-Minus, although I imagine that if ESPN were computing coach values they would publish that somewhere. Maybe it’ll be added in the future.