Stat Exploration: SportVU Data

A few years ago, some mathematically inclined minds decided to tackle the NBA and mine its secrets — we’ve seen industry after industry taken over by numbers, at least partially, and sports was next. One disciple of Bill James, the father of sabermetrics in baseball, gave everything in the box score a value, summed those values together, adjusted for pace and league averages, and invented the popular PER metric. But he wasn’t alone: Win Shares, popular because of its inclusion on Basketball-Reference, and Wins Produced, the child of Dr. Berri, sprang up to explain wins and which players mattered. They all used complicated formulas built on intuitive methods, breaking the game down into discrete components and assigning values based on hard data. Every player’s value could be calculated, and the wins generally lined up with the team’s win column.

Berri was probably the most aggressive proponent of a single, all-encompassing stat. It supposedly had God-like power, and he confidently explained the entirety of the game in relation to his stat, evaluating decisions and veering into philosophy, stating that scoring was overrated and accuracy underrated — all anyone needed to understand was his stat. Malcolm Gladwell wrote a popular piece about how Berri’s metrics explained wins at a high level and how Iverson, during his MVP season, was actually the 91st best player in the league. The economist had cracked basketball, finding the relationship between wins and player stats through statistics, a relationship that explained 95% of team wins.

So … that was it. The eggheads had solved basketball, and it was time to criticize front offices, coaches, and even fans for believing that a player like Iverson was useful to a team.

What did we miss?

Our arrogance blinded us to the mistakes. Wins Produced did indeed align with a team’s win total, but why does this matter? If all you’re doing is taking team efficiency stats that we know correlate highly with wins and assigning them to players in whatever order you want, they’re going to sum to wins anyway. There was a widely read series comparing player metrics and how well they performed, and it popularized the idea that some metrics “explain what happened” well. But this is, honestly, rubbish. We already know what happened; we can look at the real win totals. A player metric that sums to a team’s win total doesn’t explain productivity. We have no basis for saying that. It’s locked in theory, untested.

One of the problems in modern basketball is that while we are focused on statistics now, we are not focused on the scientific method. Wins Produced failed because it was never properly evaluated. To validate its claims, it needs to be tested on a new set of data to show that it has actual utility. And basketball conveniently supplies that new data: it’s called the next season. If you want to tout your metric as the Holy Grail, then you need to predict how well a team will do with a new player — and then run hundreds more of these micro-tests over a large number of seasons. We shouldn’t be striving for a basketball world drenched in statistics; we should be aiming for basketball science, where we search for meaning and consistency, replicating our results to show the power of our discoveries.

Years ago, Hollinger developed his PER metric by assigning values to various box score stats based largely on feel and simple logic. For assists, for instance, he reasoned that since a passer is involved in one of three steps — getting open, passing, and making the shot — the passer should get one-third of the credit. This was the stone age of NBA analytics, and he had limited means of building a scientifically rigorous model. Mock him for guessing the value of an assist, but how was this different from Win Shares or Wins Produced? How could we be sure our models were even useful?

In a critique now famous among those in the NBA analytics-sphere — The Pot Calling the Kettle Black — NBA player metrics were evaluated on their ability to predict a future season given the minutes distribution and the stats beforehand. Essentially, what’s important is that our models are tested out-of-sample. Fitting a model to a set of points we already have is no great feat, and the world is filled with overfit models offering nothing but a high R-squared. The real trick is predicting data points the model has never seen.
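To make that concrete, here is a minimal sketch of what an out-of-sample test looks like, using synthetic data as a stand-in for real team-seasons; the point is simply that a model earns its keep on seasons it never saw during fitting, not on its in-sample R-squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in: rows are team-seasons, X holds minutes-weighted
# player metrics, y holds team wins.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, 1.5, 0.5, 0.0, -1.0]) + rng.normal(scale=2, size=300)

train, test = slice(0, 250), slice(250, 300)   # later "seasons" held out
model = LinearRegression().fit(X[train], y[train])

print("in-sample R^2:", r2_score(y[train], model.predict(X[train])))
print("out-of-sample R^2:", r2_score(y[test], model.predict(X[test])))
```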

The SportVU era

With optical tracking data providing an overflowing wealth of information, we can now fill in the cracks in how we judge players with statistics. Even in the public sphere, where we see only the tip of the iceberg, we now have information on every single shot, along with corresponding details like the game clock at the time of the shot, the closest defender, the defender’s distance, the shot’s distance, the number of dribbles, and more. Thus, with this new information, we can do away with the previous, flawed metrics — why use usage when we have touches per game and time of possession? Why use TS% when we can break down Kyrie Irving’s shots by every zone and type of shot, catch-and-shoot or off the dribble? He shot 5% higher on pull-up shots — and that’s all we need, assuming we’ve already adjusted for defender distance and the like. If we can pin down the exact details on a shot, we’ll know everything about it.

This is the type of arrogance that destroyed the first wave of NBA metrics. We have one year of data and few means of evaluating how useful these stats are. Most people have leaned on the catch-and-shoot versus pull-up FG%’s — I’ve been guilty, though I’ve tried to point out the limitations — but we don’t yet understand how much of those differences is noise and how much is true skill. We can cite drives per minute, but aside from describing what players do, what use is it? What does it matter if Ty Lawson drives one more time a game than Goran Dragic? What value does that add?

Likewise, the rim protection stats are highly cited as well, supporting the claim of Roy Hibbert’s monstrous defensive impact. But Robin Lopez is rated highly too; we don’t yet know how meaningful the stats are. I also got access to the shot-log data for every single shot (tracked by SportVU) in the 2014 season. From there, I could calculate a defender’s FG% against and even break it down by zones that make more sense — stats.NBA.com’s dashboard for this is lacking. I translated these numbers into “points saved” based on the usual FG% of the shooter. I wasn’t sure how to deal with defender distance and where the cutoff should be, however. If you’re the closest defender and the guy who scored was 10 feet away, was that your bad defense if you couldn’t even contest? Conversely, was the guy open because the closest defender messed up? So I included both the raw stats based on the closest defender and a version where a defender is only judged on field goals within four feet.
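For transparency, here is roughly how the points saved idea works, as a sketch with made-up shot-log rows (the field names are illustrative, not the actual SportVU columns): credit the closest defender with the gap between the shooter’s expected points and the actual result, and optionally restrict to shots where the defender was within four feet.

```python
# Hypothetical shot-log rows; field names are illustrative, not SportVU's.
def points_saved(shots):
    totals = {}
    for s in shots:
        expected = s["shooter_usual_fg_pct"] * s["shot_value"]
        actual = s["shot_value"] if s["made"] else 0.0
        d = s["closest_defender"]
        totals[d] = totals.get(d, 0.0) + expected - actual
    return totals

shots = [
    {"closest_defender": "Hibbert", "shooter_usual_fg_pct": 0.60,
     "shot_value": 2, "made": False, "defender_dist_ft": 2.1},
    {"closest_defender": "Hibbert", "shooter_usual_fg_pct": 0.55,
     "shot_value": 2, "made": True, "defender_dist_ft": 5.8},
]
print(points_saved(shots))   # ~0.3: saved 1.2 on the miss, gave back 0.9

# The stricter version only judges a defender on shots within four feet.
print(points_saved([s for s in shots if s["defender_dist_ft"] <= 4.0]))
```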

But then I thought, why does this matter? How much of this is noise? I divined a means of testing a few of these new defensive toys from SportVU without succumbing too much to single season-itis by integrating the information into my 14-year sample of data. This still has issues, but I’m using a conservative method with cross-validation and other tweaks that make it difficult for new variables to find significance — they really have to prove themselves.
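For those curious what “difficult to find significance” means in practice, the spirit of it looks something like the penalized, cross-validated regression below. This is a generic sketch with synthetic data, not my exact pipeline: the shrinkage pulls coefficients toward zero, so a new variable only keeps a meaningful weight if it genuinely improves out-of-fold prediction.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-ins: eight "box score" features and one new SportVU
# variable that, by construction, has no real relationship with the target.
rng = np.random.default_rng(1)
X_old = rng.normal(size=(500, 8))
x_new = rng.normal(size=(500, 1))
y = X_old @ rng.normal(size=8) + rng.normal(size=500)  # y ignores x_new

X = np.hstack([X_old, x_new])
model = LassoCV(cv=5).fit(X, y)     # L1 penalty chosen by cross-validation
print("new-variable coefficient:", model.coef_[-1])  # shrunk to ~0 if useless
```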

Statistical significance from initial tests

I’ve provided a summary of my results below. Every basic type of defensive SportVU stat I tested had significance in the model, which was surprising, though there were only three basic types: rim protection, shot defense, and contested rebound percentage. Since the sample is so small, I haven’t provided any coefficients — I don’t have confidence in them and I don’t want to lead anyone down the wrong track. Plus, given the method I used, the coefficients are shrunken anyway. This isn’t the first study of SportVU stats (you can see one here and another here), but it’s probably the most comprehensive on defense in the public domain.

Rim protection stats (within five feet of a defender and five feet of the rim):

Opponent FGA per 100 possessions
Points saved per 100 possessions

Points saved uses the field-goal percentage you can see on stats.NBA.com and measures the points a defender saved versus the average field-goal percentage of a shot within five feet. Thus, opponent FGA might seem redundant, but I was looking for other effects, like a deterrent effect where a rim protector provides extra value simply by being active in defending shots at the rim. This data doesn’t capture diverted shots: field goals scared away from the rim and redirected elsewhere. Opponent FGA, and points saved, might be proxies for this too. Also, remember that all these variables are tested together with other stats, like the traditional steals and blocks along with field-goal defense stats. Alas, opponent FGA wasn’t significant, but rim protection points saved was, and it was actually the most dominant SportVU variable. This is one of the missing puzzle pieces in player metrics — player FG% defense that doesn’t show up in a box score — and it’s publicly available too.
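Here is how that calculation looks in miniature. The 55% league-average baseline is a placeholder for illustration, not the actual figure used in the model:

```python
# Back-of-the-envelope rim-protection points saved. The league-average FG%
# within five feet is assumed to be 0.55 here purely for illustration.
def rim_points_saved(opp_fga, opp_fg_pct, lg_avg_fg_pct=0.55, shot_value=2):
    return (lg_avg_fg_pct - opp_fg_pct) * opp_fga * shot_value

# e.g., a center who faces 8 rim attempts a game and holds opponents to 45%
print(rim_points_saved(opp_fga=8, opp_fg_pct=0.45))  # 1.6 points saved per game
```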

Shots defended stats (based on closest defender):

Points saved per 100 possessions
Points saved (position adjusted) per 100 possessions
Points saved within four feet per 100 possessions
Points saved within four feet (position adjusted) per 100 possessions
Average distance as closest defender

Points saved is calculated the same way as above, except that it covers every shot within 30 feet of the rim and is based on four zones: 0 to 4 feet, 4 to 10 feet, 10 feet to the three-point line, and three-pointers within 30 feet. The position adjustment gives more weight to inside shots for centers and outside shots for point guards, with linear gradations in between for the other positions. Both of those metrics are repeated for when the defender is within four feet. (Four feet was chosen because of how stats.NBA.com breaks down shots into very tight, tight, open, and wide open.) The average distance was adjusted by zone because shots on the perimeter generally have a greater distance from the defender. There is a key difference between the shots defended stats and the rim protection stats: shots defended is based on a shooter’s usual FG% from a zone, while the rim protection stats are based on the league average. For an example, after the sketch below I’ve provided a list of the top players by points saved (total, not per possession) and the worst.
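Since I haven’t published the exact weights, here is a hypothetical reconstruction of the position adjustment; the specific weight values are invented for illustration, but the linear interpolation between a point guard profile and a center profile is the idea.

```python
import numpy as np

# Hypothetical zone weights: guards judged more on the perimeter, centers
# more at the rim, with linear gradations for positions in between.
ZONES = ["0-4 ft", "4-10 ft", "10 ft-3pt line", "3pt (<30 ft)"]
PG_WEIGHTS = np.array([0.7, 0.9, 1.1, 1.3])   # invented, for illustration
C_WEIGHTS = np.array([1.3, 1.1, 0.9, 0.7])    # invented, for illustration

def zone_weights(position):
    """Linearly interpolate zone weights between PG (1) and C (5)."""
    t = (position - 1) / 4.0
    return (1 - t) * PG_WEIGHTS + t * C_WEIGHTS

print(dict(zip(ZONES, zone_weights(3).tolist())))  # small forward: all 1.0
```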

Ranked by points saved via shots defended against (min. 800 MP):
156.3 Roy Hibbert
147.9 Robin Lopez
126.4 Serge Ibaka
110.9 Tim Duncan
104.9 Joakim Noah
104.2 DeAndre Jordan
93.6 Andrew Bogut
89.5 Kevin Durant
80.6 Draymond Green
78.7 David West
78.3 Taj Gibson
77.9 Spencer Hawes
74.4 Marcin Gortat
71.4 Timofey Mozgov
70.2 Anthony Davis
70.2 Dwight Howard

Ranked by points saved, in ascending order, via shots defended against (min. 800 MP):
-118.5 Evan Turner
-104.6 Monta Ellis
-95.5 Richard Jefferson
-90.5 Jose Calderon
-83.1 Brandon Knight
-78.0 Ben McLemore
-77.7 Ray Allen
-77.6 Thaddeus Young
-75.6 Ricky Rubio
-73.2 Luke Ridnour
-72.7 Mo Williams
-71.3 Arron Afflalo

(Yes, Rubio is an odd result, but the four factors method suggests he’s a valuable defender because of his turnover creation, not his shot defense. Young was stuck on a terrible Philadelphia team and probably shouldn’t get most of the blame. The bottom of the list is mainly guards like Ramon Sessions and young guys like Tim Hardaway Jr. Zach Randolph is one of the few big men ranked near the bottom — his defensive reputation before Memphis was abhorrent, so it’s not surprising.)

Based on the initial models, points saved could be a worthwhile addition to NBA statistics, though its signal was weaker than the rim protection version’s. However, this is only one season of data, so any sign of significance, even after heavy testing, is encouraging. The version where the defender was closer didn’t actually yield better results so far, and the appropriate distance cutoff still needs to be found. I thought the average distance to the defender would pick up on the disciplined defenders and punish the gamblers — and it did. I’m wary of giving out the coefficients yet, but depending on the model the coefficient ranged from -0.65 to -0.86. What does that mean? For every extra foot of average distance, the individual costs his team about three-quarters of a point per 100 possessions. That’s nothing to ignore, because it adds up to about a couple of wins over the course of a season. This punished guys who were often further from their man, like Boozer and Jamal Crawford, while it helped guys who were usually in the thick of things, like Duncan.
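The arithmetic behind that “couple of wins” figure, assuming a typical pace of about 95 possessions per game and the common rule of thumb of roughly 30 points of season point differential per win:

```python
# Rough translation of the distance coefficient into wins. Pace (~95
# possessions per game) and the ~30-points-per-win rule of thumb are
# assumptions for illustration.
coef = -0.75                 # points per 100 possessions, per extra foot
possessions = 82 * 95        # a full season's possessions
points = coef / 100 * possessions
print(points, points / 30)   # about -58 points, or roughly -2 wins
```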

Contested rebounds:

DRB contest%
DRB FG contest%
DRB contest rebound%
DRB FG contest rebound%

The first variable is the percentage of a player’s defensive rebounds that were contested. The variables with “FG” look only at field-goal rebounds, ignoring the easier-to-grab free-throw rebounds. The last two variables are products of rebound rate (the percentage of available defensive rebounds collected) and contested rebound percentage. For an example of which players do well by DRB FG contest rebound%, I’ve provided the top players below (min. 800 MP): the guys who grabbed the most contested rebounds per 100 available defensive rebounds. (A sketch of how the four variables fit together follows the list.)

13.7 Omer Asik
13.2 Andre Drummond
12.2 Jeff Adrien
12.2 Kevin Love
11.9 DeAndre Jordan
11.7 Thomas Robinson
11.5 Enes Kanter
11.3 Andrew Bogut
11.2 Bismack Biyombo
11.1 Jordan Hill
11.0 Jonas Valanciunas
11.0 DeMarcus Cousins
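
Here is the sketch promised above of how the four contested-rebound variables fit together, using hypothetical per-player counts (the field names are mine, not SportVU’s):

```python
def contested_rebound_stats(p):
    # p: hypothetical season-long counts for one player
    drb_rate = p["drb"] / p["drb_available"]              # classic DRB%
    contest_pct = p["contested_drb"] / p["drb"]           # share of own boards contested
    fg_contest_pct = p["contested_fg_drb"] / p["fg_drb"]  # field-goal rebounds only
    return {
        "DRB contest%": contest_pct,
        "DRB FG contest%": fg_contest_pct,
        "DRB contest rebound%": drb_rate * contest_pct,
        "DRB FG contest rebound%": drb_rate * fg_contest_pct,
    }

# A big who grabs 25% of available boards and contests 55% of them lands
# around 13.7 contested boards per 100 chances, like the top of the list.
print(contested_rebound_stats({
    "drb": 500, "drb_available": 2000, "contested_drb": 275,
    "fg_drb": 440, "contested_fg_drb": 260,
}))
```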

The results here were fascinating. For reasons that are unclear to me right now, the percentage of contested rebounds was negatively correlated with defense. I assumed it would be the opposite, because a contested rebound is more about skill. I’m trying to unpack the implications: does what matters come down to running to a spot where no one else is? Do guys who don’t care about boxing out and helping their teammates rebound have higher contest percentages? This warrants further study, and even though I used a few different methods and tested other combinations, I’m not entirely confident about the results. I also tried a nonlinear specification for contested rebounds, where guys who rebounded a lot were penalized if they grabbed a high proportion of contested rebounds and non-rebounders were helped by high proportions; a sketch of that idea follows below. With more data, I’ll be able to suss out what’s really going on here. As an additional note, the field-goal contested percentages, which ignore free-throw rebounds, were more significant. Free-throw rebounds are actually so different that it’s surprising we lump them together with field-goal rebounds. They should be separated for the best results possible — yet that’s rare even in 2014.
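One way to encode that nonlinear idea, as a hypothetical feature construction rather than the specification I actually ran: center the rebound rate and interact it with contest percentage, so the same contest share pushes high- and low-volume rebounders in opposite directions once the regression puts a sign on it.

```python
import numpy as np

# Centered interaction between rebound rate and contest percentage.
drb_rate = np.array([0.28, 0.12])      # a high-volume big vs. a low-volume wing
contest_pct = np.array([0.60, 0.60])   # identical contest shares

interaction = (drb_rate - drb_rate.mean()) * contest_pct
print(interaction)  # [ 0.048 -0.048]

# With a negative coefficient in the model, the big is penalized for the
# same contest share that helps the wing, matching the pattern described.
```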

Here’s a list of the top players on defense for one Frankenstein model (min. 800 MP):
3.3 Andrew Bogut
3.2 Robin Lopez
3.1 Draymond Green
2.8 DeAndre Jordan
2.8 DeMarcus Cousins
2.8 Tim Duncan
2.7 Joakim Noah
2.6 Ricky Rubio
2.5 Kawhi Leonard
2.4 Tiago Splitter
2.3 Jimmy Butler
2.3 Derrick Favors
2.3 Chris Paul
2.2 Miles Plumlee
2.2 Dwight Howard
2.1 Omer Asik
2.1 Roy Hibbert
2.1 Nick Calathes
2.1 Tony Allen
2.1 Ian Mahinmi
2.0 Paul Millsap
2.0 Kevin Garnett
1.9 Bismack Biyombo
1.9 Paul George
1.8 Andre Iguodala
1.8 Kendrick Perkins
1.8 Eric Bledsoe
1.8 Marc Gasol

There are a few odd results in there — Robin Lopez is likely not the second-best defensive player in the league, but his rim protection stats were great, and Calathes is not better than Marc Gasol — but it’s a list populated by the league’s top defenders along with a bunch of other solid ones. And, of course, here’s a corresponding list of the worst defenders by this metric:

-3.3 Gary Neal
-3.0 Tim Hardaway Jr.
-2.7 Rodney Stuckey
-2.6 Elliot Williams
-2.5 Will Bynum
-2.5 Luke Ridnour
-2.5 Ramon Sessions
-2.5 Mike Scott
-2.4 Jose Juan Barea
-2.1 Brandon Knight
-2.1 Arron Afflalo
-2.0 Matthew Dellavedova
-2.0 Michael Beasley
-1.9 D.J. Augustin
-1.9 Greivis Vasquez
-1.9 Tony Parker
-1.8 Cartier Martin
-1.8 Dion Waiters
-1.8 Jeff Teague
-1.8 Louis Williams
-1.8 Jordan Crawford

Again, this is not perfect — Dellavedova is not a terrible defender, but overall the results pass the laugh test. The bottom of the list is filled with small guards and only an occasional big man. For what it’s worth, Tolliver, Bargnani, and Scola were some of the lowest rated frontcourt players.

Summary

The new SportVU stats are very interesting, and their applicability to defense is high. A simple rim protection stat using points saved (combining FGs defended against and FG% against) is powerful, while others, like average defender distance, points saved versus all shots, and contested defensive rebound percentage, are promising. One final stable model used rim protection points saved per 100 possessions, points saved per 100 possessions, average defender distance, and contested defensive rebound rate — all significant even with cross-validation and with the data integrated into a previous 14-year model.

Of course, those variables interact with the previous ones I used, with interesting results. There were few major changes … except for steals and blocks. Steals were about 75% more valuable. I assume this is because, with stats about shot defense and perhaps gambling in the model, we can value the steal in a better context. Forcing turnovers is a major part of a team’s defense. But blocks were completely destroyed. Even though we judge the best defenders by their block rates, blocks were negatively correlated with defense in several models. I won’t proclaim anything too severe about blocks now, but with better rim protection stats, perhaps we should reevaluate their usefulness. Blocking a shot is just one way to force an opponent to miss, and all other things being equal between two rim protectors, the one with the higher block rate is perhaps just taking himself out of position too much by going for blocks. (I used an interaction term in some models to account for fouls and blocks, and in that case blocks were closer to neutral instead of negative.) Also, please note that a blocked shot on its own isn’t bad, because it’s still a missed shot; it just might not be better than a normal forced miss. Regardless, this is a win for the new stats era, because we found a worthy replacement for a long-used stat that’s supposed to cover a notoriously hard area to judge: defense.

There’s still a lot of work to be done. The variables could be better constructed, and things like shot distance need more study. Contested rebounds are a fascinating area, and they could provide a better understanding of how individual rebounds translate to the team level. There’s also a lot these SportVU stats miss, like the type of team defense where you miss a rotation and an opponent scores on your teammate, not you. We should never lose sight of the fallibility of our tools and ourselves. Each new stat unveiled by SportVU should be greeted by the question, “Does this actually help me?” If we lose our skepticism, we lose our direction.
