In the first two parts of this series, we've discussed what the obvious next step forward in football analytics is - finding a map which allows us to translate events on the field into goals. Without said map, football statistics lack context, and that means that they ultimately lack meaning.
In part two we also took a look at the main mechanisms by which baseball sabermetricians have created their map, so now it's time to determine how we might adapt that technique to football. We need to figure out a way of breaking up the sport into a series of independent 'game states', determine what each state is worth in terms of goals scored and conceded, and measure the goal values of the transitions between said states. Easier said than done, right?
Developing a coherent idea of game state in football is not exactly trivial. It's not a stop-start game like baseball, where every single event comes off a set piece. It's not immediately obvious how to construct our game states, but the always excellent Sarah Rudd has made a fairly good fist of things in using the data we currently have available. Unfortunately, her effort is unlikely to be good enough.
The most obvious flaw is that while an effort is made by StatDNA to provide some data about the defence, the overall positioning of the two sides is not included in said data. As far as I know, it's not actually collected anywhere. This is a major problem, since any sensible analysis of the sport will recognise that the positioning of the defence (as well as the structure of the attack) plays a vital role in how any given situation will develop. A three on three fast break is an entirely different proposition than trying to break down a massed defence, even if the ball is in exactly the same place.
The zones Ms. Rudd uses probably aren't precise enough either. Gridding a football pitch is essentially an arbitrary exercise, and you're forced to balance between getting more accurate results with a finer mesh and the computational ease of fewer zones. Finding the correct number and placement of zones will be a major component of coming up with a reasonable set of game states. Still, it's a fine initial effort based on what's currently available, and any advancement in the field should be based on Sarah's work.
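To make the mesh trade-off concrete, here is a minimal sketch of how a ball position might be mapped to a zone. Everything in it is an assumption for illustration: the function name `zone_index`, the 105 x 68 metre pitch, and the grid sizes. It is not the grid Sarah Rudd or StatDNA actually use.

```python
def zone_index(x, y, n_cols=6, n_rows=4, pitch_length=105.0, pitch_width=68.0):
    """Map a ball position in metres to a zone number on an n_cols x n_rows grid.

    Hypothetical helper: pitch dimensions and grid size are assumptions.
    x runs along the length of the pitch, y across its width.
    """
    # Clamp to the grid so a ball exactly on the far goal line or touchline
    # lands in the last zone rather than one past it.
    col = max(0, min(int(x / pitch_length * n_cols), n_cols - 1))
    row = max(0, min(int(y / pitch_width * n_rows), n_rows - 1))
    return row * n_cols + col


# The same spot on the pitch under a coarse mesh and a finer one.
coarse_zone = zone_index(52.5, 34.0)                        # 24 zones in total
fine_zone = zone_index(52.5, 34.0, n_cols=12, n_rows=8)     # 96 zones in total
```

Doubling the resolution in both directions quadruples the number of zones, and every one of those zones needs enough observed events to give a trustworthy value. That is the balance described above.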
Assuming we work out game state, are we done? Can we use the same techniques that have proven so successful in baseball? Not exactly.
The non-discrete nature of football doesn't just make determining game state more difficult. It creates another problem: it becomes very difficult to determine what goal expectancy actually means. Is it the chance that a goal is scored immediately? During the possession? The answer is not immediately obvious. Here's an example of what I mean:
If the player in possession takes passing option one, there's a reasonable chance 25 yards from goal. It's certainly a better opportunity to score than from the current position. It's also a better instantaneous chance than passing option two, because shooting from that angle becomes virtually impossible. If all we cared about was the immediate increase in goal probability, option one is the better call.
But in an actual game, it might be better to go the other way. Although the player at the end of pass number two won't score, he's reasonably likely to get an assist with a back-post cross to the unmarked forward. It'd be a better play than taking a long-range shot.
One mechanism for solving this problem, as mentioned earlier, is the idea that goal expectancy should measure the likelihood of a team scoring during their possession. You'd have to assess the likelihood of losing possession as well. It's an elegant solution, breaking up football into a set of discrete sequences. Is it the right way to do things? I'm not sure.
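For the curious, the possession-based idea can be sketched as a small chain of states in which a possession ends either in a goal or in a turnover. The state names and every probability below are invented purely for illustration, and `possession_goal_expectancy` is a hypothetical helper, not anyone's published model.

```python
def possession_goal_expectancy(transitions, n_iter=500):
    """Chance of scoring before losing the ball, for each game state.

    transitions maps each state to {next_state: probability}. The labels
    "goal" and "turnover" end the possession. Assumption: every row sums to 1.
    """
    value = {state: 0.0 for state in transitions}
    for _ in range(n_iter):
        # A state is worth the probability-weighted worth of wherever play
        # goes next: a goal is worth 1, a turnover is worth 0.
        value = {
            state: sum(
                p * (1.0 if nxt == "goal" else value.get(nxt, 0.0))
                for nxt, p in row.items()
            )
            for state, row in transitions.items()
        }
    return value


# Invented numbers, for illustration only.
transitions = {
    "deep": {"deep": 0.3, "midfield": 0.5, "turnover": 0.2},
    "midfield": {"deep": 0.2, "midfield": 0.2, "final_third": 0.4, "turnover": 0.2},
    "final_third": {"midfield": 0.2, "final_third": 0.2, "goal": 0.1, "turnover": 0.5},
}
expectancy = possession_goal_expectancy(transitions)
```

Note what the sketch leaves out: a turnover is simply worth zero, wherever it happens and whatever shape the team is in when it happens.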
Teams like to play on the counterattack, waiting for their opposition to over-commit before bursting forward into the space left behind. The shape of the team in possession as well as the location of a possible turnover has a direct impact on how likely the defending team is to have a successful counterattack. There's no way for a goal expectancy plus turnover chance model to account for this.
I prefer a slightly less elegant solution: rather than treating the goal expectancy in each state as a scalar, we can look at it as a time-dependent function. Each state would have a function associated with it for each team, telling us how likely that team is to score over the next minute or so, and we could then collapse the two functions back down into a single number. Doing this would allow us to properly account for the fact that possession transitions play such an important role in football.
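Here is one rough way the collapsing step might look, assuming each team's function is a scoring rate in goals per second and that we collapse by adding up our rate minus theirs over a 60-second window. The function `collapse`, the choice of a simple difference, and all of the rate curves are assumptions of mine, not a worked-out method.

```python
def collapse(own_rate, opp_rate, horizon=60.0, step=1.0):
    """Net expected goals over the next `horizon` seconds.

    own_rate and opp_rate are functions of time (seconds from now) giving
    a scoring rate in goals per second. Uses the trapezoid rule.
    """
    n_steps = int(round(horizon / step))
    total = 0.0
    for i in range(n_steps):
        t0, t1 = i * step, (i + 1) * step
        net0 = own_rate(t0) - opp_rate(t0)
        net1 = own_rate(t1) - opp_rate(t1)
        total += 0.5 * (net0 + net1) * step
    return total


# Invented curves. A cautious attack: modest threat, little risk.
cautious = collapse(lambda t: 0.0010, lambda t: 0.0002)

# An over-committed attack: slightly more threat, but the opposition's
# scoring rate jumps after ten seconds to stand in for the counterattack.
overcommitted = collapse(
    lambda t: 0.0012,
    lambda t: 0.0002 if t < 10 else 0.0015,
)
```

With these made-up numbers the over-committed state comes out worse than the cautious one despite offering the better immediate chance, which is exactly the sort of thing a scalar tied to a single possession can't express.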
And then you could just look at how each on-field event changes the game state and weight them based on that. A defence-splitting pass from deep suddenly becomes valued properly. Beating the last man to go one-on-one with the goalkeeper matters more than a meaningless dribble in the centre of the pitch. We'd be able to see every single event on a football pitch in its proper context at long last.
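In code, the weighting is nothing more than a subtraction once the state values exist. The sketch below assumes we already have a value for each state. The state names, the values, and the `event_value` helper are all hypothetical.

```python
def event_value(state_values, before, after):
    """Credit for an event: the value of the state it produced minus the
    value of the state it started from."""
    return state_values[after] - state_values[before]


# Invented state values, in goals.
state_values = {
    "deep_possession": 0.01,
    "midfield_possession": 0.02,
    "one_on_one": 0.35,
}

# A defence-splitting pass from deep that puts a forward through on goal.
through_ball = event_value(state_values, "deep_possession", "one_on_one")

# A dribble in the centre of the pitch that leaves the game state unchanged.
midfield_dribble = event_value(state_values, "midfield_possession", "midfield_possession")
```

Under these assumed values the through ball earns a large positive credit and the dribble earns nothing, and a misplaced pass that moved play into a worse state would earn a negative one.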
That's my hope, at least. It's questionable whether the logic will work out at all, and there are plenty of questions with regard to implementation that will take significant time and brainpower to resolve. Not having all the answers is fine, but we need to at least try to ask the right questions, and this, I think, is where we need to try to take analysis.
Wouldn't attempting this be the equivalent of flying before we're able to walk? Yes. There's plenty of intermediate work to be done before we even get close to turning the numbers of football statistics into a coherent language. But in an age where we appear to be more interested in congratulating ourselves on how much we know, it's important to remember that we haven't even gotten started yet.
James' cipher needs cracking. Get to work, football fans.