There's a school of thought that says football is essentially invulnerable to statistical analysis of any sort, that the game is too free-flowing for the methods employed in baseball or American Football to be of any use whatsoever. My feeling is that this only focuses on the trivial elements of statistical analysis rather than the deeper, more general areas. Baseball in particular has provided us with an excellent road map of what we should be looking for, and although the challenges are different (and much more complex), the end goals are still the same - as is the general map for getting to that goal.
So, it's early days yet for football analysis, and I don't think these questions have been answered properly: What, exactly, do we propose to measure? What are our game states?
I think that phrase will have taken some readers by surprise, so I'll try to expand on exactly what I mean. In baseball, the game state is normally taken as the fundamental unit of the game, and most analysis is based on a thorough understanding of the effects of a change in game state. In other words, the game itself becomes a system moving from state to state due to the actions of the players. The most common use of this sort of analysis is to model exactly how scoring probability changes after any given event, which is extremely useful. Baseball has a remarkably simple set of possible states (three bases, each either empty or occupied, and zero, one or two outs, for 24 combinations), which is why it lends itself so easily to number-crunching. Football? Not so much.
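To make the baseball case concrete, here is a minimal Python sketch that enumerates those base-out states. The tuple layout and the names are purely illustrative assumptions; only the counting itself (three bases, each empty or occupied, with zero, one or two outs) is standard.

```python
from itertools import product

# Each base is either empty (False) or occupied (True).
BASE_CONFIGS = list(product([False, True], repeat=3))
OUTS = [0, 1, 2]

# A state is (first, second, third, outs) - an illustrative layout only.
BASE_OUT_STATES = [bases + (outs,) for bases in BASE_CONFIGS for outs in OUTS]

print(len(BASE_OUT_STATES))  # 24
```

A table that small can be filled in from historical play-by-play data with room to spare, which is exactly the luxury football doesn't have.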
What are we measuring in football, anyway?
We don't care how far a player has dribbled - we care about the run's impact on scoring. We don't care about whether wingers have completed four crosses out of eight or six - we care about how many goals those two extra crosses might be worth. The only way to turn the numbers we have available into any sort of coherent language is through football's underlying (and as yet unknown) game state.
Incidentally, we do this already when we watch the games. We know that some situations are inherently more dangerous, otherwise we could not cheer in anticipation or freeze up in fear when something happens on the field. To me, the obvious delta in the probability of a goal being scored between, say, Carles Puyol being on the ball in his own half vs. Lionel Messi bursting onto the ball in the box is sufficient evidence that there's something to look for here. There are definite game states for football, even if they're not obviously discretised, and we're going to have to find them in order to figure anything out.
Let's talk about possible techniques, since I don't have anything more than vague answers at this point. Clearly, the team in possession is a key element (disregarding the many shades of grey inherent in that term), so let's put that as a definite in our game state definition. Ball position is obviously of some value, and we can break up the playing surface into arbitrary units to get as precise or as loose a measure of position as we like. Finer meshes will lead to more accuracy, but also pose certain problems - we'll get to those later.
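As a rough illustration of the meshing idea, the sketch below maps a ball position in yards to a grid cell. Everything specific here is an assumption made for the example: the 110 x 70 yard pitch, the 6-yard cell size, the coordinate origin at one corner flag, and the function names.

```python
import math

# Assumed dimensions for illustration; real pitches vary in size.
PITCH_LENGTH_YD = 110.0
PITCH_WIDTH_YD = 70.0
CELL_YD = 6.0


def grid_shape(length=PITCH_LENGTH_YD, width=PITCH_WIDTH_YD, cell=CELL_YD):
    """Return (columns, rows) needed to cover the pitch with square cells."""
    return math.ceil(length / cell), math.ceil(width / cell)


def cell_index(x, y, length=PITCH_LENGTH_YD, width=PITCH_WIDTH_YD, cell=CELL_YD):
    """Map a ball position (yards from one corner) to a (column, row) cell."""
    if not (0 <= x <= length and 0 <= y <= width):
        raise ValueError("ball position is off the pitch")
    cols, rows = grid_shape(length, width, cell)
    # Clamp so that a ball exactly on the far touchline or goal line
    # falls into the last cell rather than one past it.
    col = min(int(x // cell), cols - 1)
    row = min(int(y // cell), rows - 1)
    return col, row


print(grid_shape())          # (19, 12)
print(cell_index(13.0, 7.0)) # (2, 1)
```

Shrinking `cell` gives a finer mesh, at the cost of spreading the same historical data over many more cells.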
Now, does anyone think that possession and ball position alone can get us to where we want to go?
I don't either. At the simplest level, football is an obstacle course, wherein a team of eleven tries to bypass another team in order to place an object into a goal. Anybody who's played at any level understands the importance of space in football - not just in terms of distance from the goal, but the spatial relationships between players. Player movement, which we currently don't have very good data for, is something that I think is key to understanding our game state.
My proposal would therefore be to map player positions, chart who has the ball, and where it is. That's your game state, at any given time. Or at least, that's what my proposal would be if I had an infinite amount of data to look at, which I don't. Since player position is so variable and we need to track through historical data to determine the value of each game state, we're going to need to take shortcuts in order to make calculation feasible. What shortcuts? I have a couple:
- Ignore the team in possession's positioning. This actually lets you incorporate the positions of attackers as part of a skill, so it's not even really that much of a shortcut. Essentially, we assume that the movement of an attacking side is average - not so bad, since we then compare the results to average anyway.
- Ignore the defending team's actual position; instead count the number of players between ball and goal. This is more controversial, but has one major thing going for it - if a team plays in a manner reasonably close to optimum, the defensive positioning won't change that much when the ball is in a certain spot with a certain number of defenders.
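With both shortcuts applied, a game state boils down to three things: who has the ball, which grid cell it is in, and how many defenders are goal-side of it. The sketch below is one possible encoding, not a definitive one. It assumes the attacking team plays toward increasing x, treats a defender as "between ball and goal" simply if their x-coordinate is greater than the ball's, and uses a 6-yard cell; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GameState:
    team_in_possession: str      # e.g. "home" or "away"
    ball_cell: Tuple[int, int]   # (column, row) of the grid cell holding the ball
    defenders_goal_side: int     # defenders between the ball and their own goal


def count_defenders_goal_side(ball_x: float, defender_xs: Iterable[float]) -> int:
    """Count defenders further up the pitch than the ball.

    Assumption: the attack runs toward increasing x, so a defender is
    'between ball and goal' when their x is greater than the ball's.
    """
    return sum(1 for x in defender_xs if x > ball_x)


def make_state(team, ball_xy, defender_xys, cell_yd=6.0):
    """Reduce raw positions (in yards) to the simplified game state."""
    ball_x, ball_y = ball_xy
    ball_cell = (int(ball_x // cell_yd), int(ball_y // cell_yd))
    goal_side = count_defenders_goal_side(ball_x, (x for x, _ in defender_xys))
    return GameState(team, ball_cell, goal_side)


# Made-up positions, purely to show the shape of the output.
state = make_state("home", (60.0, 35.0), [(70.0, 30.0), (50.0, 40.0), (100.0, 35.0)])
print(state)
```

Because the state is frozen and hashable, it can be used directly as a dictionary key when tallying historical outcomes. A stricter definition of "between ball and goal" (a cone toward the goal mouth, say) would slot into `count_defenders_goal_side` without changing anything else.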
These mechanisms allow us to bring down our number of states from several orders of magnitude higher than the number of atoms in the universe to a slightly more palatable figure (the exact number would depend on how tight your field meshing is, but for a 6x6 yard grid I'm back-of-the-enveloping around 5000 game states, and I think we could drag that number down by exploiting some clever meshing tricks).
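The arithmetic behind that estimate isn't spelled out, but one set of assumptions that lands in the same neighbourhood is sketched below: a 110 x 70 yard pitch, 6-yard square cells, two possible teams in possession, and between zero and ten outfield defenders goal-side of the ball (goalkeeper ignored). All of those inputs, and the function name, are assumptions made for the example.

```python
import math


def count_states(length_yd, width_yd, cell_yd, defender_counts, teams=2):
    """Number of distinct (possession, ball cell, defender count) states."""
    cols = math.ceil(length_yd / cell_yd)
    rows = math.ceil(width_yd / cell_yd)
    return cols * rows * defender_counts * teams


# 19 columns x 12 rows x 11 defender counts (0..10) x 2 teams
print(count_states(110, 70, 6, defender_counts=11))  # 5016
```

Halving the cell size to 3 yards roughly quadruples the total, which is the trade-off between accuracy and data hunger in miniature.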
Would we have enough data to then assign accurate values to our states? It's an open question, but we'd certainly have good information in heavily trafficked areas of the pitch, and we'd finally be able to start giving real answers to questions using real data, which I suspect is a jolly good idea. We'd suddenly have the ability to know exactly how many goals a clever crossfield pass was worth rather than just saying that Paul Scholes can pass well. We'd be able to measure exactly what it is a player does to help his team win rather than simply admiring the artistic elements of his play. We'd be able to figure out how much money to pay these guys, too.
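In code, the idea might look something like the sketch below: estimate each state's value as the share of times a goal followed it, then value an action as the change in value between the state before and the state after. This is a minimal sketch under stated assumptions, not a worked-out method. The state labels and the observations are toy values invented to show the mechanics, "a goal followed" is left deliberately loose, and the function names are hypothetical.

```python
from collections import defaultdict


def estimate_state_values(observations, min_samples=1):
    """Estimate P(goal) per state from (state, goal_followed) pairs.

    States seen fewer than `min_samples` times are left out, since a
    value based on a handful of visits would be mostly noise.
    """
    tallies = defaultdict(lambda: [0, 0])  # state -> [goals, visits]
    for state, goal_followed in observations:
        tallies[state][0] += int(goal_followed)
        tallies[state][1] += 1
    return {
        state: goals / visits
        for state, (goals, visits) in tallies.items()
        if visits >= min_samples
    }


def action_value(values, state_before, state_after):
    """Goals added by an action that moves play from one state to another."""
    return values[state_after] - values[state_before]


# Toy observations, NOT real data: 1 goal from 8 visits vs. 4 goals from 8.
toy = (
    [("own_half", True)] + [("own_half", False)] * 7
    + [("edge_of_box", True)] * 4 + [("edge_of_box", False)] * 4
)
values = estimate_state_values(toy)
print(action_value(values, "own_half", "edge_of_box"))  # 0.375
```

The `min_samples` cutoff is where the open question about data bites: raise it and the lightly trafficked parts of the pitch simply drop out of the table.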
Now, that's down the line, and pondering those questions overlong is tantamount to walking before we can crawl. But you're never going to be able to answer them without a solid grasp on football's game state.
What stands in our way? Data (especially player positioning), and a lot of hard work. That doesn't sound so impossible, now, does it?