I introduced Goals above eXpected (G-X) last night in a post about striker Didier Drogba. With that and a playing time chart for all outfield players last year, I was curious whether it would be possible to run a multivariate linear regression to determine which players contributed the most to Chelsea's offence. In truth, the idea struck me last night, woke me up, and deprived me of a lot of sleep, so I'm sort of obliged to write a post, even if the results aren't quite what I wanted.
The idea of a regression analysis is to find the best fit between a certain set of inputs and a given output. While it's most often done with just one variable, I thought it might be instructive to run one with our attacking players' combined playing time as the inputs and G-X as the output, in an attempt to figure out which players contributed most to Chelsea's outstanding 2009/2010, a season which saw them score 103 goals. Excel provides a handy package to do this for you automatically, with up to sixteen inputs. I used fifteen players: Frank Lampard, Nicolas Anelka, Didier Drogba, Florent Malouda, Michael Ballack, Ashley Cole, Branislav Ivanovic, Jon Obi Mikel, Joe Cole, Deco, Michael Essien, Paulo Ferreira, Salomon Kalou, Yuri Zhirkov, and Jose Bosingwa, neglecting the lesser-used players and the centre-halves, who typically don't have much influence on the attack anyway. The equation I fit the regression to is as follows:
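In generic form, a multivariate linear model of this kind can be written as below. This is a sketch rather than the exact expression used: the intercept term and the scaling of playing time to units of 90 minutes are assumptions, chosen so that each coefficient reads as G-X per 90 minutes, and the symbols are shorthand introduced here.

```latex
\[
(G - X)_m \;=\; \beta_0 \;+\; \sum_{i=1}^{15} \beta_i \, \frac{t_{i,m}}{90} \;+\; \varepsilon_m
\]
```

Here $(G - X)_m$ is the goals above expected in match $m$, $t_{i,m}$ is the number of minutes player $i$ spent on the field in that match, $\beta_i$ is the fitted coefficient for player $i$, $\beta_0$ is the assumed intercept, and $\varepsilon_m$ is whatever the fit leaves unexplained.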
Before I present the findings, I want to stress the deep flaws in this methodology. It's a top-down procedure, meaning that we have global data and are attempting to assign responsibility to different parts of the team without really knowing how things were achieved. Conclusions are therefore very dangerous, as the correlation found by the regression does not necessarily equal causation on the field. It also might not be able to deal very well with players who see a lot of playing time, as the idea of the analysis is to determine how the team did with and without a certain player; the sample size for some of the 'withouts' (I'm looking at you, Mr. Lampard) is fairly lacking, and I'm unclear what to do about that. There are some more minor flaws - I'm still not correcting for home and away games, nor am I taking into account sendings off, which will obviously skew the expected goals per game slightly. Still, these are minor problems compared to the big issues with regression analysis, and I deeply regret not having enough detailed data to attack things in a more logical way. That said, let's get on with it - this is a reflection of what the data can tell us and not anything else.
Figure 1: Results for multivariate regression analysis of Chelsea's attack.
The R² on this is a little over 0.6, for those curious. It's a pretty good result, and it means that roughly 60% of the variance in G-X can be explained by the time each of these fifteen players spent on the field last year. I'd like to reiterate once again that these numbers are not indicative of value; for example, Anelka was not worth -0.3 goals for every 90 minutes he played. However, for last season, Chelsea's overall offensive output tended to diminish when certain players saw time and increase when others were involved. Some of these numbers are clearly data artifacts; Frank Lampard's +1.85 G-X per 90 seems more a function of Chelsea's overall ability than his own, especially considering that he featured in 94.1% of Chelsea's Premier League minutes last season. Ferreira and Zhirkov are probably sample size flukes as well. Don't take too much away from the charts above, then, but despite not having any real predictive power they do paint an interesting image of Chelsea's efficacy in front of goal using certain player combinations.
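For anyone who would rather not use Excel, the same kind of fit and its R² can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the procedure behind the figure: the function name `fit_gx_regression`, the intercept term, and the tiny two-player data set are all made up for the example, and none of the numbers are Chelsea's.

```python
import numpy as np


def fit_gx_regression(minutes, gx):
    """Ordinary least squares fit of per-match G-X on playing time.

    minutes: (n_matches, n_players) array of minutes played per match.
    gx:      (n_matches,) array of goals above expected per match.
    Returns (intercept, per-90 coefficients, R^2).
    Assumes an intercept term and playing time in units of 90 minutes.
    """
    minutes = np.asarray(minutes, dtype=float)
    gx = np.asarray(gx, dtype=float)

    # Design matrix: a column of ones (intercept) plus time in 90-minute units.
    X = np.column_stack([np.ones(len(gx)), minutes / 90.0])

    # Least-squares solution of X @ beta ~= gx.
    beta, *_ = np.linalg.lstsq(X, gx, rcond=None)

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    residuals = gx - X @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((gx - gx.mean()) ** 2).sum())
    r_squared = 1.0 - ss_res / ss_tot

    return beta[0], beta[1:], r_squared


# Hypothetical data: four matches, two players (NOT real Chelsea figures).
example_minutes = np.array([[90, 0], [0, 90], [90, 90], [45, 90]])
# G-X built from known coefficients with no noise, so the fit should recover them.
example_gx = 0.1 + (example_minutes / 90.0) @ np.array([1.0, -0.5])

intercept, coefs, r2 = fit_gx_regression(example_minutes, example_gx)
print("intercept:", intercept)
print("per-90 coefficients:", coefs)
print("R^2:", r2)
```

With real match data the rows would be the 38 league games and the columns the fifteen players. A player who is on the field for nearly every minute produces a column that barely differs from the intercept column, which is one way to see why a figure like Lampard's is so hard to pin down.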
Not the most telling piece of analysis, then, but an interesting story. The above also helps to illustrate the limitations of the 'top-down' methodology that I described. Clearly, Didier Drogba is not a major problem on the offence - he just happened to be on the field when the team was (relatively) struggling, an important distinction to make. I'd love to attack this problem from a different angle, but until we get more data, this angle of attack and variations thereof are about the best we're going to get. These results are interesting rather than earth-shattering.