In part one of this post, we touched on whether or not football can be analysed and where we need to improve our current statistics. Unfortunately, the answer seems to be 'everywhere'. Although we like to think that we're entering an age in which football statistics carry great weight, they're currently lacking what the legendary sabermetrician Bill James might call 'the power of language'.
More prosaically, there's no mechanism by which we can figure out what any of the current statistics mean on the pitch. In order for football stats to have meaning rather than just be numbers, we need to figure out how to scale them accurately to what really matters - goals and wins. Since baseball's analysts have already done the heavy lifting, it's a good idea to take a peek at their most important techniques.
If you look at the past fifty years or so of serious baseball research and squint reasonably hard, two main themes begin to emerge, which are essentially opposing philosophies on how we should go about converting numbers to meaning. The first is a mostly statistical approach that actually resembles the methodology employed by those looking to decipher ancient languages. Maybe that language metaphor isn't so bad after all.
If we know the outputs (goals scored and conceded) as well as our numerical inputs, we can run what essentially amounts to a regression analysis and come up with a crib that scales each of our previously irrelevant numbers to what actually matters. In baseball-land, that's runs. In our case, it would be goals. This sort of analysis could tell you that -- and these numbers are made up -- 437 completed passes generally mean one goal, or that 7.4 interceptions are the equivalent of one goal-line clearance.
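To make the idea concrete, here is a minimal sketch of that kind of regression in Python. Everything in it is an assumption for illustration: the match data is synthetic, the weights (one goal per 437 passes, 0.05 goals per interception) are invented, and the function name `fit_two_weights` is hypothetical. It fits goals against just two inputs with ordinary least squares and no intercept; a real analysis would use many more inputs and real match data.

```python
# Made-up "true" weights used only to generate the synthetic data below.
TRUE_W_PASS = 1 / 437   # goals per completed pass (invented)
TRUE_W_INT = 0.05       # goals per interception (invented)

# Synthetic matches: (completed passes, interceptions, goals).
# Goals are computed from the made-up weights, so they are not whole numbers.
MATCHES = [
    (p, i, TRUE_W_PASS * p + TRUE_W_INT * i)
    for p, i in [(437, 10), (300, 20), (520, 5), (410, 15)]
]

def fit_two_weights(rows):
    """Least-squares fit of goals ~ w_pass * passes + w_int * interceptions."""
    sxx = sum(p * p for p, i, g in rows)
    sxy = sum(p * i for p, i, g in rows)
    syy = sum(i * i for p, i, g in rows)
    sxg = sum(p * g for p, i, g in rows)
    syg = sum(i * g for p, i, g in rows)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("inputs are collinear; weights cannot be separated")
    w_pass = (sxg * syy - sxy * syg) / det
    w_int = (sxx * syg - sxy * sxg) / det
    return w_pass, w_int

w_pass, w_int = fit_two_weights(MATCHES)
print(f"passes per goal: {1 / w_pass:.1f}")
print(f"goals per interception: {w_int:.3f}")
```

Because the goals were generated from the weights, the fit simply hands them back. With real data the fit would be noisy, and it would still only give one number per statistic.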
Such an approach, however, runs into several major problems. There's the obvious, classical problem with statistics in that a correlation between two events doesn't necessarily mean that there's a causal relationship. But probably more important is that this method cannot account for the fact that the data is far too coarse to make much sense of.
Take, for example, the pass. If we claim that 437 completed passes means one goal, we're essentially saying that a pass from Sergio Ramos to Iker Casillas is functionally identical to Xavi's defence-splitter for Jordi Alba against Italy. That's obviously absurd, and splitting up passes into through balls, long passes etc merely mitigates the problem, rather than solving it. The same is true for more or less everything else -- a tackle on the halfway line is far less important than a last-ditch, goal-saving one.
Such flaws would be acceptable, if only barely, if there weren't a more sensible means of building our language bridge. Fortunately, unlike linguists studying long-dead tongues, we have the advantage of knowing exactly how modern sports work. This provides us with another attack vector, because if you're aware of the internal logic behind what you're studying, you can build from there rather than just looking at inputs and outputs and guessing.
This is very easy to do with baseball, which is a relatively simple game. You need seven pieces of data to describe the state of a game: the number of runs scored by each team, the inning, the number of outs, the baserunners, the number of balls, and the number of strikes. Some of these are less important than others -- the ball and strike counts don't matter that much -- so we can safely ignore them, and if all you care about is finding run expectancy* you can discard the score and the inning as well.
*Remember, we're just trying to find a way of mapping events to runs.
In baseball, this condenses to a three by eight matrix which describes your possible game states. It's an easy task to then mine the thousands of baseball games that take place each year and determine how many runs tend to score from each game state. The states themselves aren't of any real value, of course. What we care about is the transitions between them; what each event actually does to the overall game.
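As a rough sketch of that mining step, the Python below averages the runs scored after each (outs, baserunners) state to fill in a three by eight table. The observations are invented, and the names `OBSERVATIONS` and `run_expectancy` and the bitmask encoding of the bases are assumptions made for the example, not anything from a real dataset or library.

```python
from collections import defaultdict

# Hypothetical observations: (outs, bases, runs scored from this point
# until the end of the half-inning). `bases` is a 3-bit mask:
# 1 = runner on first, 2 = second, 4 = third, so 0..7 covers all eight cases.
OBSERVATIONS = [
    (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1),
    (1, 1, 0), (1, 1, 2),
    (2, 7, 0), (2, 7, 3), (2, 7, 0),
]

def run_expectancy(observations):
    """Return a 3 x 8 table of average runs scored from each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for outs, bases, runs_after in observations:
        totals[(outs, bases)] += runs_after
        counts[(outs, bases)] += 1
    # None marks states that never appear in the observations.
    return [
        [totals[(o, b)] / counts[(o, b)] if counts[(o, b)] else None
         for b in range(8)]
        for o in range(3)
    ]

table = run_expectancy(OBSERVATIONS)
print(table[0][0])  # no outs, bases empty
print(table[2][7])  # two outs, bases loaded
```

With only nine invented observations most of the table stays empty. The point is just the shape of the calculation: group by state, then average what happened next.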
This turns out to be fairly straightforward. Every time the system moves from one state to the next, the run value changes. This is easy to visualise using a mathematical system called a Markov chain. I've drawn up a sample chain for a simple three-state system below:
This isn't the only fun thing you can do with Markov chains in sports -- it doesn't even take advantage of their most useful property -- but it does make what we're doing very obvious. Every event in our three-state system can be represented by one of the six transition values shown, and the average transition value is the overall value of that event.
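The averaging step can be sketched in a few lines of Python. The state values, the event log, and the names (`STATE_VALUE`, `LOG`, `event_values`) are all hypothetical, chosen only to show the arithmetic for a toy three-state system. Note one simplification: in baseball, any runs that score on the play itself are usually added to the change in expectancy, which this sketch leaves out.

```python
from collections import defaultdict

# Hypothetical expected-score value of each state in a toy three-state system.
STATE_VALUE = {"A": 0.1, "B": 0.5, "C": 1.2}

# Hypothetical log of (event, state before, state after).
LOG = [
    ("event_x", "A", "B"),
    ("event_x", "A", "B"),
    ("event_x", "B", "C"),
    ("event_x", "B", "A"),
    ("event_y", "C", "A"),
    ("event_y", "B", "A"),
]

def transition_value(before, after):
    """Change in expected score caused by moving between two states."""
    return STATE_VALUE[after] - STATE_VALUE[before]

def event_values(log):
    """Average transition value for each type of event."""
    grouped = defaultdict(list)
    for event, before, after in log:
        grouped[event].append(transition_value(before, after))
    return {event: sum(vals) / len(vals) for event, vals in grouped.items()}

for event, value in event_values(LOG).items():
    print(f"{event}: {value:+.3f}")
```

Three states give six possible moves between different states, and each event's overall value is just the mean of the moves it actually produced.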
When you do this for baseball's twenty-four state system, you come up with what's known in sabermetric circles as 'linear weights'. These are the heart and soul of modern baseball analytics and what enable analysts to translate baseball statistics from essentially meaningless, disassociated numbers into runs and wins*. That we can reasonably hope to measure an excellent catch on the same spectrum as a home run is incredible. In terms of really understanding baseball, linear weights, which were fleshed out by Pete Palmer in 1984, are probably the most important advancement in the past century.
*The run-win (or goal-win) conversion is a fascinating subject on its own, but I'll save that story for another time.
How can we apply the lessons learned in baseball to finding football's statistical cipher? It's obviously not a matter of applying exactly the same technique, but baseball does provide us with some signposts. While linear weights are unlikely to be useful, since they end up having exactly the same granularity issues we encountered with a regression analysis, the focus on the transitions between states is fertile ground for developing our cipher.
Let's focus on that next.