Exploring Event and Tracking Data using Metrica Sports Open Data Part I
This will be a series of articles exploring the Metrica Sports Open Data available here: https://github.com/metrica-sports/sample-data. The dataset includes two real association football matches, each with both event data and tracking data. The teams and the players are unknown (and I believe each match involves different teams). The goal is to explore analytic and visual techniques that could be applied to event-based data of any kind, and tracking data of any kind.
The first part demonstrates the development of passing network visualizations using Tableau Public. I’ll present two different styles, each with its own advantages and disadvantages. I’ve explored a third type, Sankey diagrams, elsewhere (https://github.com/davidlamb/SpatialSoccerAnalysis/blob/master/projects/8_VisualizingPassNetworks.md). I really like these for showing flows, which is essentially what passes are, but they are not easy to reproduce in Tableau.
The two visualizations being explored are heat maps and passing network graphs.
Metrica’s event dataset required some preprocessing in a few different ways. First, unlike other freely available event data (e.g. StatsBomb), we do not know if the pass event was successful or lost. This is a fairly basic metric that might be used to evaluate a player’s ability. Second, other properties of the pass, such as its angle or direction and its length, have to be calculated. Finally, in order to produce a passing graph, it is useful to create a bi-directional key or id field that groups the passer with the recipient. What I mean by this is that regardless of whether the pass went from Player 1 to Player 11 or from Player 11 to Player 1, the pair receives the same key.
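The derived pass properties and the bi-directional key can be sketched in a few lines of plain Python. This is only an illustration, not the code from my library: the function names are made up for this example, and it assumes each pass has start and end coordinates in the same units.

```python
import math

def pass_length(x0, y0, x1, y1):
    """Straight-line distance between the start and end of a pass."""
    return math.hypot(x1 - x0, y1 - y0)

def pass_angle(x0, y0, x1, y1):
    """Direction of the pass in degrees, measured from the positive x axis
    (values fall between -180 and 180)."""
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

def pair_key(passer, recipient):
    """Bi-directional key: sorting the two names first means that
    A -> B and B -> A produce the same string."""
    return "|".join(sorted([passer, recipient]))

# Hypothetical player labels, for illustration only.
key_a = pair_key("Player1", "Player11")
key_b = pair_key("Player11", "Player1")
```

Sorting the names before joining them is what makes the key independent of direction, so grouping on it combines the passes in both directions between a pair.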
A few caveats: I’ve set up my own library to quickly load StatsBomb, Wyscout, and Metrica examples into a GeoPandas GeoDataFrame. As part of this process I use the dimensions from StatsBomb as the basis for my pitch coordinates. This means all three are translated to a field that runs from 0 to 120 in the x direction and 0 to 80 in the y direction. In some ways this is just a legacy of the fact that I started with StatsBomb data and wanted to make other event data comparable. I also use a brute-force iterative method to determine whether the pass was successful, failed, or a shot, based on the following event, so it may not be the most accurate approach.
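To make those two steps concrete, here is a minimal sketch. It rests on several assumptions: that the raw Metrica coordinates run from 0 to 1 on both axes, that events are dictionaries with "team" and "type" keys in chronological order, and that the type names are "PASS" and "SHOT". The labelling rule is one plausible reading of "based on the following event", not necessarily the rule used in my library.

```python
PITCH_LENGTH, PITCH_WIDTH = 120, 80  # StatsBomb-style dimensions

def to_statsbomb(x, y):
    """Scale coordinates assumed to lie in [0, 1] onto a 120 x 80 pitch."""
    return x * PITCH_LENGTH, y * PITCH_WIDTH

def label_passes(events):
    """Brute-force pass labelling: look at the event that follows each pass.

    Returns one label per event (None for events that are not passes).
    """
    labels = []
    for i, event in enumerate(events):
        if event["type"] != "PASS":
            labels.append(None)
            continue
        if i + 1 == len(events):
            labels.append("unknown")      # nothing follows the final event
            continue
        following = events[i + 1]
        if following["team"] != event["team"]:
            labels.append("failed")       # the other team acts next
        elif following["type"] == "SHOT":
            labels.append("shot")         # the pass led directly to a shot
        else:
            labels.append("successful")   # the same team kept the ball
    return labels

# Made-up event sequence, for illustration only.
sample_events = [
    {"team": "Home", "type": "PASS"},
    {"team": "Home", "type": "PASS"},
    {"team": "Home", "type": "SHOT"},
    {"team": "Away", "type": "PASS"},
    {"team": "Home", "type": "PASS"},
]
sample_labels = label_passes(sample_events)
```

A rule like this will mislabel some passes, for example when a foul or a ball out of play sits between the pass and the next action, which is why the result should be treated as approximate.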
You can find the work here: https://colab.research.google.com/drive/1XYp8O81XMRwRpi0DPCRWanK2pbcFnZkH?usp=sharing
A Heatmap Approach
The heatmap, or really a matrix, approach is similar to the data structure you might create for a social network graph. That is, the node labels run along the x and y axes, and the cells hold the number of pass attempts between the players in that direction. For example, Player 1 on the home team passed 2 times to Player 2, but Player 2 only passed 1 time to Player 1. This style is very good for showing which pass target was most frequent for specific players. The color reflects the number of passes, and the label is added for clarity. You can then filter these by different aspects if you want. In my example, I created a new categorical field derived from the length of the pass for short, medium, and long passes. These in turn were taken from three different percentile cutoffs.
My opinion on this style is that it can provide some useful information, and it seems more appropriate for filtering based on specific criteria. It can be a little difficult to tell who the top pairings were, and you lose the aspect of where the players are on the field.
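Outside Tableau, the same matrix can be sketched with the Python standard library. The pass list below is made up to match the Player 1 and Player 2 example, and the tercile cut points from statistics.quantiles are an assumption for illustration; they may not match the percentile cutoffs used in the workbook.

```python
from collections import Counter
import statistics

# (passer, recipient) pairs; made-up data for illustration.
passes = [
    ("Player1", "Player2"),
    ("Player1", "Player2"),
    ("Player2", "Player1"),
    ("Player2", "Player7"),
]

# Directed counts: (A, B) and (B, A) are separate cells.
counts = Counter(passes)
players = sorted({name for pair in passes for name in pair})
# Rows are passers, columns are recipients.
matrix = [[counts[(a, b)] for b in players] for a in players]

def length_category(length, short_max, medium_max):
    """Bucket a pass length into short, medium, or long."""
    if length <= short_max:
        return "short"
    if length <= medium_max:
        return "medium"
    return "long"

# Made-up pass lengths; quantiles(n=3) returns the two tercile cut points.
lengths = [5, 8, 12, 20, 35, 50]
short_max, medium_max = statistics.quantiles(lengths, n=3)
categories = [length_category(v, short_max, medium_max) for v in lengths]
```

Because the counts are keyed on the ordered pair, the matrix is not symmetric, which is exactly what lets the heatmap show who passes to whom rather than just who is connected.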
A Network Approach
Another approach is to create a graph layout. In Tableau, this is easiest with an undirected graph, meaning you combine passes to and from different players regardless of who received or initiated the pass. This style requires more calculated fields in Tableau than the heatmap, but you could set up a workflow to make it easy enough. I created three calculated fields for the average location of each player. This was complicated by the fact that the event data from Metrica Sports was not always oriented in the same direction of play from left to right. To make this more flexible, I created some parameters for which team started in that direction, then shifted the x and y coordinates before calculating the average. Finally, a new calculated field needed to be created for the size of the “edges” in the graph.
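The logic behind those calculated fields can be sketched in Python as follows. This is not the Tableau implementation: the parameter name, the dictionary keys, and the choice to average pass start locations are all assumptions for illustration, and the coordinates are taken to be on the 120 x 80 pitch with teams switching ends at half time.

```python
from collections import Counter, defaultdict

PITCH_LENGTH, PITCH_WIDTH = 120, 80
# Hypothetical parameter: the team attacking left to right in the first half.
LEFT_TO_RIGHT_FIRST_HALF = "Home"

def orient(team, period, x, y):
    """Flip coordinates so the team always plays left to right."""
    attacks_left_to_right = (team == LEFT_TO_RIGHT_FIRST_HALF) == (period == 1)
    if attacks_left_to_right:
        return x, y
    # Rotate the pitch by 180 degrees.
    return PITCH_LENGTH - x, PITCH_WIDTH - y

def build_network(passes):
    """Return average passer locations (nodes) and undirected pass counts (edges)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # x total, y total, count
    edges = Counter()
    for p in passes:
        x, y = orient(p["team"], p["period"], p["x"], p["y"])
        total = sums[p["from"]]
        total[0] += x
        total[1] += y
        total[2] += 1
        # Sorting the pair makes the edge independent of pass direction.
        edges[tuple(sorted((p["from"], p["to"])))] += 1
    nodes = {name: (sx / n, sy / n) for name, (sx, sy, n) in sums.items()}
    return nodes, edges

# Made-up passes for one team, for illustration only.
sample_passes = [
    {"team": "Home", "period": 1, "from": "Player2", "to": "Player7", "x": 30, "y": 60},
    {"team": "Home", "period": 2, "from": "Player7", "to": "Player2", "x": 70, "y": 10},
    {"team": "Home", "period": 2, "from": "Player2", "to": "Player7", "x": 80, "y": 30},
]
nodes, edges = build_network(sample_passes)
```

The edge count is what would drive the line thickness between two players, and the node coordinates place each player at their average position once every pass has been oriented the same way.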
I think the main advantage of this visualization is the passing becomes combined with the average position of the player. So we can still see a lot of passes between Player 2 and Player 7, but now we know that it tends to be along the right side of the pitch. We do lose the direction component, although I think with some extra preprocessing this could be changed.
The Tableau workbook is available to download so you can see how each visualization was built.