Explaining ESPN’s Real Plus-Minus

Last Monday, ESPN debuted their newest basketball statistic: Real Plus-Minus (RPM). Unsurprisingly, it was met with a variety of reactions ranging from excited to confused to baffled.

I think it’s great that people are critiquing the stat and studying its strengths and weaknesses. I’ve read a lot of really insightful discussion about how to best use it, with opinions all over the map. However, I’ve also encountered a number of misconceptions about RPM. In this piece, I’d like to correct, clear up, and emphasize a handful of points that I think are important for anyone who wants to use this stat.

1. Real Plus-Minus is not new.

RPM is the latest version of xRAPM1, a stat created by Jeremias Engelmann (with contributions from Steve Ilardi, who authored ESPN’s introduction to RPM). Engelmann’s stat is meant to improve upon RAPM2. That stat attempts to fix the problems in APM3, which was conceived as a better version of conventional plus-minus (which you can find in most box scores). All of these metrics have the same goal: to quantify how much a player hurts or helps his team when he’s on the court. Studying this lineage can help us understand why Real Plus-Minus exists, where it excels, and where it struggles.

We begin with a different sport and country. The Montreal Canadiens created the vanilla version of plus-minus during the ’50s. After becoming a standard hockey stat, it eventually made its way to basketball during the 2000s. Its computation is simple. If a player’s team outscores his opponent by 5 points while he’s on the court, his plus-minus is +5. You can easily convert this into a per-minute or per-possession metric.
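As a rough illustration of that computation, here is a minimal sketch. The function names and the sample numbers are made up for this example and do not come from ESPN.

```python
def raw_plus_minus(team_points, opponent_points):
    """Point differential while the player is on the court."""
    return team_points - opponent_points


def per_100_possessions(plus_minus, possessions):
    """Scale a raw plus-minus to a per-100-possession rate."""
    return 100.0 * plus_minus / possessions


# Hypothetical on-court totals: the team scores 55 and allows 50
# over 50 possessions.
pm = raw_plus_minus(55, 50)
rate = per_100_possessions(pm, 50)
print(pm, rate)
```

The same scaling works per minute if you swap possessions for minutes played.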

The problem with plus-minus is that it doesn’t account for the quality of a player’s teammates and opponents. As an estimator of how much a player helps or hurts his team, it’s inherently flawed. Plus-minus has a statistical property called bias–no matter how much data you collect (i.e. even if you could make teams play thousands of games), it will systematically misrepresent certain players’ contributions.

This is easily illustrated with an example. If Jeremy Lamb plays all of his minutes with Kevin Durant, Lamb will look pretty good. Sure enough, Lamb has a raw plus-minus of +8.0, ranking him just ahead of LeBron James. Is Lamb really as good as LeBron, or is this a reflection of the fact that he plays a lot of his minutes with the likely MVP (and against opponents’ second units)? Plus-minus can’t answer this.

Which leads us to adjusted plus-minus, or APM. This metric tries to isolate a specific player’s impact by adjusting his plus-minus numbers for the quality of his teammates and opponents. To continue our example (with made-up numbers), APM might use the minutes Durant plays without Lamb to find that he’s worth an additional 9 points per 100 possessions. This would mean Lamb actually loses the Thunder one point per 100, making him a below-average player4. APM can use minutes where one teammate plays without the other to figure out how much each contributes when they play together.

Unfortunately, what makes APM better also makes it volatile. If two players share the court for a large portion of their minutes, the rare instances where one plays without the other can have disproportionately large effects on both players’ ratings5.

Imagine Durant and Lamb play all of their minutes together during the first 81 games of the season. However, for their last game, the Thunder decide to rest Durant for the playoffs. In that game, Reggie Jackson and Derek Fisher catch fire and the Thunder destroy their opponent. When APM is rating players, it will see that the Thunder looked great the only time Lamb played without Durant and wrongly conclude that Durant was actually holding Lamb back the whole season6. Even though this example is a bit contrived, there are a lot of real-world instances where small sample randomness leads to bizarre APM results.
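To make the Durant-Lamb example concrete, here is a toy sketch of the regression idea behind APM. It is heavily simplified: it models only two players, ignores opponents and other teammates, and uses made-up margins (the +20 blowout is an assumption chosen for illustration). A real APM model fits every player in the league at once.

```python
import numpy as np

# Columns: [Durant on court, Lamb on court]. Opponents and other teammates
# are ignored here, which a real APM model would not do.
X_normal = np.array([
    [1.0, 1.0],  # Durant and Lamb together
    [1.0, 0.0],  # Durant without Lamb
])
y_normal = np.array([8.0, 9.0])  # made-up margins per 100 possessions

apm_normal, *_ = np.linalg.lstsq(X_normal, y_normal, rcond=None)

# Contrived season: 81 games together, then one game of Lamb without Durant
# in which the Thunder win big (+20 per 100, a made-up number).
X_fluke = np.array([[1.0, 1.0]] * 81 + [[0.0, 1.0]])
y_fluke = np.array([8.0] * 81 + [20.0])

apm_fluke, *_ = np.linalg.lstsq(X_fluke, y_fluke, rcond=None)

print("normal season [Durant, Lamb]:", apm_normal)
print("fluke season  [Durant, Lamb]:", apm_fluke)
```

In the first case the fit credits Durant with +9 and Lamb with -1. In the second, a single game is enough to flip the ratings to roughly -12 for Durant and +20 for Lamb, which is the volatility described above.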

How do you work around this problem? That’s where regularization–the R in RAPM–comes in.

In APM, a statistical technique called linear regression estimates each player’s rating. RAPM employs a modification of this called ridge regression7. This method of estimating ratings has the effect of pulling values toward some pre-determined expectation known as a prior. RAPM uses a rating of 0 as its prior–in other words, it’s skeptical of a player who rates as strongly above or below average, unless it has a lot of data to back that up. Using ridge regression reduces the impact of the small sample size problems I described above. Values that stray too far from the prior, especially without lots of data to back them up, will be pulled back into (hopefully) a more reasonable estimate. RAPM doesn’t entirely fix the extreme Durant-Lamb case I outlined earlier, but it does a good job with more realistic scenarios. While randomness can still have an effect, the damage is less than it is for APM.
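Here is a hedged sketch of how that shrinkage plays out on the contrived Durant-Lamb season. It uses the textbook closed-form ridge solution with a prior of 0. The penalty strength `lam=10.0` and the game margins are arbitrary illustrative values, not anything taken from RAPM itself.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate that shrinks every rating toward 0."""
    n_players = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_players), X.T @ y)


# Columns: [Durant on court, Lamb on court]; 81 games together, then one
# game of Lamb without Durant. All margins are made up.
X = np.array([[1.0, 1.0]] * 81 + [[0.0, 1.0]])
y = np.array([8.0] * 81 + [20.0])

ols = ridge_fit(X, y, lam=0.0)     # no penalty: plain least squares (APM-style)
ridge = ridge_fit(X, y, lam=10.0)  # lam=10 is an arbitrary illustrative value

print("APM-style  [Durant, Lamb]:", ols)
print("ridge      [Durant, Lamb]:", ridge)
```

With no penalty the fit returns roughly -12 and +20. With the penalty, both ratings land in the low single digits. Lamb still edges out Durant, which matches the caveat that regularization softens the extreme case without fully fixing it.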

RAPM solves the major problems its predecessors face, but can still be improved upon. Its biggest weakness is that its method of reducing statistical noise also causes it to ignore some useful data. This is where–finally–expected RAPM (xRAPM) comes in.

Let’s revisit those priors. Instead of automatically giving everyone a 0 prior, xRAPM assigns each player a value that tries to predict what his RAPM should be. Since ridge regression pulls ratings toward priors, Durant’s xRAPM moves toward a higher number than Lamb’s (assuming the priors rate Durant better than Lamb). The prior is a mathematical way to say, “If you’re not sure who should get credit for this, it’s probably the guy who we already think is better.” If you can devise a method to intelligently set these priors8, you will improve the statistic’s predictive power.
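A minimal sketch of ridge regression with non-zero priors, again on the contrived Durant-Lamb season. The prior values (+6 for Durant, -1 for Lamb), the penalty strength, and the margins are all assumptions invented for this example. The actual xRAPM priors and tuning are not described here.

```python
import numpy as np


def ridge_fit_with_prior(X, y, lam, prior):
    """Ridge estimate that shrinks ratings toward `prior` instead of 0."""
    n_players = X.shape[1]
    # Fit only the part of the results that the prior does not explain.
    residual = y - X @ prior
    shift = np.linalg.solve(X.T @ X + lam * np.eye(n_players), X.T @ residual)
    return prior + shift


# Columns: [Durant on court, Lamb on court]; margins are made up.
X = np.array([[1.0, 1.0]] * 81 + [[0.0, 1.0]])
y = np.array([8.0] * 81 + [20.0])

zero_prior = ridge_fit_with_prior(X, y, lam=10.0, prior=np.array([0.0, 0.0]))
with_prior = ridge_fit_with_prior(X, y, lam=10.0, prior=np.array([6.0, -1.0]))

print("prior of 0       [Durant, Lamb]:", zero_prior)
print("informed prior   [Durant, Lamb]:", with_prior)
```

With a prior of 0, the one fluke game still leaves Lamb rated above Durant. With the informed prior, the ambiguous credit goes to the player already believed to be better, and Durant comes out ahead.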

So how are these priors set? One major part of the prior is a player’s performance in the previous season. Intuitively, this makes sense. If a player was really good last year, he’ll probably be good this year. Another component is based on box score stats. Although these numbers are prone to misrepresenting certain types of players, they can still give xRAPM useful information that helps it separate out different players’ contributions. In addition to these, other types of data like height and age help determine priors. All of these factors give xRAPM clues about how good it should expect a given player to be.

That was a lot to take in, so let’s recap:

  • Plus-minus measures how well a player’s team performs when he plays, but doesn’t adjust for context.
  • Adjusted plus-minus (APM) accounts for the quality of a player’s teammates and opponents, but struggles when players get most of their minutes together and is susceptible to sample size issues.
  • Regularized adjusted plus-minus (RAPM) smooths out APM’s more extreme results, but tends to be a little too conservative.
  • Expected regularized adjusted plus-minus (xRAPM) brings in other types of data to improve RAPM. This stat is what Real Plus-Minus is based on.

So there you have it–a short history of Real Plus-Minus, which, aside from some possible tweaking, is not a new statistic.

2. Real Plus-Minus does NOT measure how well a player has performed this season. 

Statistics can be divided into two categories. Descriptive statistics tell us about what happened in the past. For instance, I can check how many page views this blog post has. Predictive statistics try to forecast what will happen in the future. I could create a model that estimates how many page views I’ll get over the next 24 hours. The difference between these is subtle, but important.

Real Plus-Minus is meant to be predictive. It’s interested in how well a player will perform in the future, rather than what he did in the past. RPM’s emphasis on prediction explains why it uses some of the tricks it does.

For instance, I mentioned earlier that RPM uses data from previous seasons in its priors. If my primary goal is to evaluate how well a player did this season, it wouldn’t make a lot of sense to use data from other seasons. However, if I want to predict what will happen in the future, the older numbers can help me differentiate between players who have been consistently good (and will likely keep being good) and players who are merely going through a hot streak (and will likely regress to their mean).

This has a number of implications. One is that RPM tends to be skeptical of player improvements (or regressions) that exceed what is expected for a player that age. This season, Anthony Davis improved much faster than most 20-21 year old players. People who watch basketball know that Davis is super talented and accelerated growth is expected from him. However, Real Plus-Minus doesn’t understand this and suspects that Davis’s numbers might be a random blip. As a result, Real Plus-Minus is liable to underestimate Davis’s impact this season9.

On a less technical note, RPM’s focus on prediction makes it a poor way to determine who should get end-of-season awards. I think this is an important point to emphasize because ESPN does exactly this in its introduction to RPM, using it to argue that Taj Gibson is a better candidate than Jamal Crawford for 6th Man of the Year. RPM is optimized to predict the future, not evaluate the past.

3. Offense and defense are equally important in Real Plus-Minus.

RPM separates players’ offensive and defensive contributions (into ORPM and DRPM, respectively) and counts each of these equally in overall RPM. It seems obvious that both ends of the court are equally valuable, but it’s important to appreciate the ramifications of this. Again, I’ll illustrate this with an example.

Let’s look at James Harden. The Beard is regarded as a great offensive player, and his ORPM of 5.69 (fourth-best in the league) bears this out. He’s also viewed as a laughably poor defender, which is confirmed by his DRPM of -2.66 (77th out of 91 shooting guards). When we analyze the components of his game, his numbers agree with the eye-test.

However, this doesn’t hold up when we consider his game as a whole. Many analysts and fans consider Harden to be a top-10 player. But RPM ranks him 46th overall, just ahead of Robin Lopez. What gives?

This discrepancy exists because Real Plus-Minus forces us to weight Harden’s offense and defense equally. A subjective evaluation of his game makes it easy to focus on all the great things he does on offense and forgive his shortcomings on the other end. I found myself surprised by how low he ranked, but after thinking about it, I have to agree. In order to think Harden is a top-10 player, you must believe one of the following:

  1. Harden is close to league average on defense.
  2. Harden is significantly better on offense than Chris Paul and Stephen Curry.
  3. A point added on offense is worth more than a point added on defense.

I don’t agree with any of these statements, so I’m forced to concede that Harden isn’t a top-10 player.

I’m grouping the next two together because they both deal with how to interpret Real Plus-Minus scores.

4. If Player X has an RPM of 2.0 and Player Y has an RPM of 1.0, Player X is not twice as good as Player Y. 

5. A player with a negative RPM does not necessarily hurt his team.

I’ve seen these and similar errors in more than a few places. RPM values are reported as relative to an average NBA player. In the first case, you could say “Compared to a league average player, Player X adds twice as many points per 100 possessions as Player Y” and be correct, but that’s a bit of a mouthful.

A better way to look at RPM scores is to compare them to either A) a player’s backup or B) a replacement level player.

CJ Watson gives us a good example of why backup comparison is important. The consensus view of Watson is that he’s the Pacers’ most important bench player and the team was clearly affected during his recent injury. However, Watson’s RPM of -0.08 doesn’t particularly impress. An uninformed glance at that number might make you say, “Okay, he’s average, but does he help the team at all?”

The answer: YES!

When Watson was out, third-string guard Donald Sloan picked up most of his minutes. What’s Sloan’s RPM? An atrocious -7.7, the fourth-worst rating in the league. If Sloan takes 12 of Watson’s 19 minutes per game, the Pacers lose around 1.8 points. For one bench player, that’s a huge impact10. Even if the Pacers had a below-average backup point guard (say, Ish Smith, who rates at -1.94), he would be a huge help by virtue of keeping Sloan off the court11.
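The arithmetic behind that estimate can be sketched as follows. The RPM values and the 12 minutes come from the paragraph above. The pace of 95 possessions per 48 minutes is an assumption added here to convert minutes into possessions, and the function name is hypothetical.

```python
WATSON_RPM = -0.08          # from the article
SLOAN_RPM = -7.7            # from the article
MINUTES_SHIFTED = 12        # from the article
POSSESSIONS_PER_48 = 95.0   # ASSUMPTION: rough team pace, not from the article


def points_lost_per_game(regular_rpm, fill_in_rpm, minutes, pace):
    """Expected points lost per game when a fill-in takes `minutes`."""
    possessions = pace * minutes / 48.0
    # RPM is expressed per 100 possessions, so scale the gap accordingly.
    return (regular_rpm - fill_in_rpm) * possessions / 100.0


loss = points_lost_per_game(
    WATSON_RPM, SLOAN_RPM, MINUTES_SHIFTED, POSSESSIONS_PER_48
)
print(round(loss, 2))
```

Under that pace assumption the result comes out at about 1.8 points per game, in line with the figure quoted above. A slightly slower or faster pace moves it by only a few hundredths.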

The replacement level comparison is important because it gives us a way to determine whether a player is helpful or harmful, and how much so. Replacement level is the theoretical cutoff between someone good enough to play in the NBA and someone who can’t quite make it. If a player falls below that point, he’s no longer contributing and should be replaced by someone from outside the league. Therefore, a player’s value can be measured by how much better he performs than a replacement level player12.
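As a small sketch of that comparison, the snippet below re-expresses RPM relative to replacement level instead of league average. It uses the -2.35 replacement level given in the footnotes and the Watson and Sloan ratings quoted earlier. The function name is made up for this example.

```python
REPLACEMENT_LEVEL = -2.35  # points per 100 possessions, per the footnotes


def points_above_replacement(rpm):
    """Rating relative to a replacement-level player, per 100 possessions."""
    return rpm - REPLACEMENT_LEVEL


watson = points_above_replacement(-0.08)  # C.J. Watson's RPM from the article
sloan = points_above_replacement(-7.7)    # Donald Sloan's RPM from the article
print(round(watson, 2), round(sloan, 2))
```

Seen this way, a player with a slightly negative RPM like Watson still sits comfortably above replacement level, while Sloan falls well below it.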

This is the idea behind ESPN’s Wins Above Replacement metric (WAR). That stat uses Real Plus-Minus to estimate how many points a player has added (or subtracted) over the entire season. It then determines how many wins those points would be expected to add. This is useful because “wins” is a more relevant and understandable unit than “points above average.”

6. Position is relevant when using Real Plus-Minus to evaluate a player.

Unsurprisingly, different positions tend to be better at different things. Point guards tend to be more focused on offense, while centers are more likely to be gifted defenders. This should have an effect on how we value different players.

Let’s consider the following question: Does Roy Hibbert or Paul George contribute more to Indiana’s league-leading defense?  DRPM gives Hibbert a rating of 3.52 while George is at 2.61. If you stopped your analysis there, you’d conclude that Hibbert is the key to the Pacers’ stinginess.

However, when you adjust for position, this isn’t the case. An average center has a DRPM of 1.78, while small forwards average a rating of 0.04. Hibbert exceeds his positional average by 1.74 points per 100, while George does so by 2.57 points. When we consider position, you can make a compelling argument that George provides more defensive value than Hibbert13.
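The positional adjustment is just a subtraction, sketched below with the DRPM values and positional averages quoted above as inputs. The dictionary layout and function name are illustrative choices, not part of RPM.

```python
# Average DRPM by position, as quoted in the article.
POSITION_AVG_DRPM = {"C": 1.78, "SF": 0.04}

# (player, position, DRPM), as quoted in the article.
PLAYERS = [
    ("Roy Hibbert", "C", 3.52),
    ("Paul George", "SF", 2.61),
]


def drpm_above_position(position, drpm):
    """DRPM relative to the average player at the same position."""
    return drpm - POSITION_AVG_DRPM[position]


adjusted = {
    name: round(drpm_above_position(pos, drpm), 2)
    for name, pos, drpm in PLAYERS
}
print(adjusted)
```

Sorting players by this adjusted value rather than by raw DRPM answers the question "how much better is he than a typical player at his position?" instead of "how much better is he than a typical NBA player?"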

7. The quality of a player’s teammates and opponents DOES NOT impact his rating.

I discussed this earlier, but I want to emphasize it because I’ve noticed a lot of people making this mistake. The entire goal of any adjusted plus-minus stat is to filter out other players’ effects on raw plus-minus in order to isolate one player’s contributions. If you notice someone has a better rating than you would expect, it doesn’t make sense to attribute it to him sharing the court with a superstar or playing against second units. There are plenty of reasons to think a player’s RPM doesn’t reflect his true value, some of which I’ve mentioned here. That isn’t one of them.

This doesn’t mean lineup factors don’t have any effect on RPM. Players who spend most of their time with players who complement them well will get a boost to their ratings, and vice versa. For an extreme example of this, imagine if you tried to play the five highest-rated centers. RPM predicts that they would outscore opponents by 20 points per 100 possessions, but I’m quite confident they’d perform much worse. In more realistic cases, lineups that don’t have enough shooting, creating, etc. could hurt those players’ ratings14.

It’s perfectly reasonable to use lineup factors to doubt a player’s Real Plus-Minus, but you have to use deeper analysis than simply pointing out the players he tends to play with and against.

1. Short for “expected regularized adjusted plus-minus”

2. Regularized adjusted plus-minus

3. Adjusted plus-minus. We’re losing letters by the sentence.

4. Of course, it’s not quite this simple, since Durant’s APM score is also affected by Lamb’s. In practice, these values are estimated by setting up several equations and running a regression to estimate coefficients that represent each player’s APM.

5. This is called collinearity. In statistics, when two variables are highly correlated, it can be hard to determine which has more causal effect on the outcome.

6. The opposite could happen, too. If the Thunder laid an egg that game, Durant’s rating would get inflated.

7. For the stats nerds: Ridge regression modifies the optimization criterion for standard linear regression. Instead of simply trying to minimize the sum of squared residuals, it minimizes the sum of squared residuals plus a penalty term based on the squared distance between the coefficients and the priors.

8. This isn’t that hard when your baseline is to assume every player is equally good. But finding the best way to do it is an ongoing topic of study.

9. This effect tends to be stronger for players who improve later in their careers because RPM expects little to no improvement from them.

10. That’s roughly equal to the difference between the Chicago Bulls and Charlotte Bobcats.

11. It’s worth noting that Sloan probably isn’t that bad. His increase in minutes coincided with the team’s overall regression, so he likely gets a disproportionate share of the blame for that. Still, other metrics and the eye-test make him out to be pretty terrible for an NBA player.

12. NBA replacement level is set at -2.35 points per 100.

13. The basic difference here is what you’re comparing them to. Compared to an average NBA player, Hibbert’s better. But compared to their positional averages, George is better. Seeing as a realistic lineup would almost always have Hibbert replacing another big man, I think the positional comparison makes more sense.

14. The Real Plus-Minus framework could be extended to account for this by including coaches as members of every lineup they play. This would give them some of the credit for their ability to play lineups with chemistry. That might already be part of Real Plus-Minus, although I imagine that if ESPN were computing coach values they would publish that somewhere. Maybe it’ll be added in the future.

22 thoughts on “Explaining ESPN’s Real Plus-Minus”

  1. Randy Marsh says:

    Who’s Jared? ..
    Pretty good writeup – probably the best one so far. Would have been the first writeup without a mistake/misconception in it if it weren’t for stating that the ‘x’ in xRAPM stood for ‘expected’. It stands for nothing, really.
    The Hibbert vs. George argument on defense can be debated. It *does* expect Hibbert to have more positive impact on defense. Yes, you have to play players of each position, but if you had two units Guard_Guard_P.George_Forward_C and Guard_Guard_Forward_Forward_R.Hibbert with all non P.George/R.Hibbert players being rated a 0 on defense, the unit with Hibbert would be expected to be the better defensive unit.

    • James says:

      “with all non P.George/R.Hibbert players being rated a 0 on defense”
      That’s not a fair comparison for exactly the reasons stated in the article. That’s equivalent to saying Hibbert + an average SF is better than George + a bad C. The article correctly points out Hibbert + average SF is worse than George + average C (according to Real Plus Minus).
      Although I’d be interested to know if replacement level is the same across all positions, and if the replacement level of -2.35 accounts for that.

      • Exactly. If you put George and Hibbert in lineups with positionally average defenders, George’s unit is +3.50 on defense while Hibbert’s is +3.11. If you give them four generic league average defenders, Hibbert’s unit projects better. But seeing as basically all NBA lineups are made of 1-2 point guards, 2-3 wings, and 1-2 bigs, I think the positional adjustment is important. Although it does depend on exactly what question you’re trying to answer.

        That’s a really good question about replacement level. RPM has 306 players above replacement level. Here’s the positional breakdown:
        55 PGs (18.0%)
        63 SGs (20.6%)
        61 SFs (20.0%)
        68 PFs (22.2%)
        59 Cs (19.3%)

        This data isn’t perfect as position isn’t a completely discrete variable, but it suggests to me that replacement level is pretty similar across positions if we assume teams want equal numbers of each. I imagine if any position had a significantly different replacement level it’d be either center or point guard. The former because sometimes you just need a big body to bang with other bigs in certain situations, and the latter because you need someone who can bring up the ball and start the offense. But that’s just my intuition with absolutely no facts to back it up.

        • James says:

          I looked into this some more today. First of all, ESPN’s replacement level may be too low: approximately 300 players have positive RPM and WAR out of 437, which is 69%. Compare with baseball where only 57% of players have positive WAR. 300 players is 10 players per team and I think there’s almost no chance that the average 10th player on a bench is above replacement level. But maybe it’s right – I don’t know enough about basketball players’ salaries and how replaceable end of bench guys are to know exactly.

          More importantly, I’m curious about what (if any) positional adjustments ESPN made or should have made when calculating WAR. ESPN’s methodology clearly thinks power forward is the easiest position on the floor, as PF has the most players with positive RPM and WAR, while small forward and center are in the middle of the ranges for both metrics. However point guard and shooting guard return very inconsistent results.

          Point guard has the fewest players with positive WAR, but nearly the most in positive RPM, while SG is nearly backwards: fewest positive RPM, while in the middle of positive WAR. I can’t figure out why either – the Top 40 players by WAR at PG played 5% fewer total minutes than SG, but PGs played more than SFs, who were most similar between the two metrics.

          In conclusion, RPM and WAR are correlated well for SF, PF, and C, but something is off with PG and SG. Also, there needs to be something done for PF – either a positional adjustment is in order or there’s some bias when determining a player’s position that causes ESPN to place disproportionately many good players at PF over SF and C.

          • A few thoughts:

            For reasons unclear to me, RPM doesn’t rate all NBA players. A total of 482 players played at least one minute this season. If we assume those not rated by RPM were below replacement level, that means just 63% of NBA players have a positive WAR.

            Also, I recall MLB has September call-ups. I don’t know how many of these players actually play and how many of them perform below replacement, but that seems like it might artificially depress baseball’s positive WAR rate.

            As far as positional replacement level goes, it’s important to remember that basketball positions are very fuzzy. For instance, several teams use power forwards as their backup centers. Several of the positive WAR PFs get most or all of their minutes at center.

            When I computed correlations between RPM and WAR I found that every position was within 0.01 of 0.905, except center, which was at 0.88.

            Do you know if any of the inter-positional effects you observed are significant? I wouldn’t be surprised if some of them are just randomness within these distributions.

  2. James says:

    “Offense matters more than defense.”

    It’s quite possible this is true. In the NFL offense is more important than defense, and in the NHL defense is more important, while it’s balanced in MLB. You can tell this by looking at the spread in points scored/allowed – if the standard deviation for offenses is larger than the standard deviation for defenses, then offense is more important (i.e. the best offense is better than the best defense, the worst offense is worse than the worst defense). Here’s proof: http://tangotiger.com/index.php/site/comments/spread-of-offense-v-defense

    There are many possible reasons for this, but the biggest one is that quarterbacks have a disproportionate effect on offense in the NFL, while goalies have a similar effect on defense in the NHL. Furthermore, imagine the NBA – if your best offensive player is a point guard, he will still be an excellent offensive player no matter your opponent. However, if you have an excellent defensive guard, his talent is ‘wasted’ if the other team’s best player is a post player. Another way to think about it – if I’m the Heat, I run the majority of offensive possessions through LeBron, maximizing his impact on offense and minimizing the impact of my other players, and I can focus my offense to take advantage of my opponent’s defensive weakness, while avoiding their strength.

    • You bring up a really good point, and I should have been more precise in my phrasing. What I should have written was “A point added on offense is equal to a point added on defense.”

      But while we’re on this point, I tested the standard deviations and it looks like ORPM’s (2.24) is slightly but significantly higher than DRPM’s (2.07).

      I think a couple of factors mitigate the “waste” of defensive ability. One, team defense is probably more important in today’s NBA than 1-on-1 defense. Two, a lot of teams will shift assignments around so their best perimeter/interior defender is on the opponent’s best perimeter/interior scorer.

      Another interesting note is that if you remove LeBron James and Kevin Durant from the sample, the difference almost entirely disappears. So I should probably test this on data from different years somewhat far apart.

  3. Some elaboration on point #2:

    The RPM model is indeed tuned to provide optimum predictive out-of-sample (OOS) accuracy within the current season–in essence, it is a “True Talent” estimate of the current season. However, the only appropriate way mathematically to validate your model’s accuracy of measuring what has actually happened at all is to look at OOS lineup-based testing. By definition, this will build a little too much “regression to the mean” into the model, but that is simply something that must be accepted.

    Where some people may have trouble is that player RPMs over the current season, because of the priors (from previous seasons and box scores) and regression-to-the-mean, will NOT sum to the team’s actual performance. They WILL sum to a best estimate of the team’s True Talent level.

    The only way to make the RPM data “explain” better would be to force the RPMs to sum to the team’s actual adjusted rating by adding some sort of constant to every player’s rating on the team. Unfortunately, that is a questionable and generic bias, and is in essence trying to “explain” what may likely be randomness/luck.

  4. Great point, Daniel.

    I think this ties into some interesting questions about how we should view a player’s season in hindsight. For instance, should we believe that Kevin Durant had a better season than LeBron James if we think that the difference was mostly a matter of luck/randomness/hot shooting? I think most people, at least right now, would say we should regard Durant’s current season as better, but shouldn’t predict him to be better in the future. I’m okay with that. Although it’s interesting that baseball seems to be moving towards being unimpressed by players who benefit from randomness.
