Exploring Event and Tracking Data using Metrica Sports Open Data, Part IV: Average Positions (Density-Based)
This will be a series of articles exploring the Metrica Sports Open data available here: https://github.com/metrica-sports/sample-data. This dataset includes two real association football matches using both event-based data and tracking data. The two teams are unknown (and I believe each match involves different teams), and the players are unknown. The goal is to explore analytic and visual techniques that could be applied to event-based data of any kind, and tracking data of any kind.
In the previous parts, average position was examined as a single value such as the mean center. This was slightly expanded by using standard distance to show the range of the distribution of point events. Another way to demonstrate the space a player tends to occupy is through what is more commonly known as a heatmap. Heatmaps are really two-dimensional kernel density estimations, where we estimate the intensity of a process over a field. The field is represented by a grid of cells. This allows us to see where a player spent the majority of their time on the pitch, using either event data or tracking data.
The colab notebook is available here: https://colab.research.google.com/drive/1l-fQIHi3TlEbEWM31WnC_wKn5jzcqJ9p?usp=sharing
For a given grid of cells over the pitch, the density is found by placing a kernel over the center of each cell and counting the points (events or tracked positions) within a distance of that cell. The kernel represents a distance decay function, so events further from the cell exert less influence than those closer to it.
There are many different kernel functions, but the quartic function is very common when working with this type of spatial data. In practice, the choice of kernel function has little effect on the overall appearance of the smoothed surface.
The distance is a cutoff: points beyond it are not counted toward a cell's density value. This cutoff is called the bandwidth. Generally, larger bandwidths produce smoother results, while smaller bandwidths produce spikes (if you were to visualize the surface in three dimensions). The spikes look like isolated dense areas as opposed to blobs.
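The mechanics above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the one in the Colab notebook: the pitch dimensions (105 x 68 m), the 1 m cell size, and the quartic normalising constant are all assumptions for the example.

```python
import numpy as np

def quartic_kde(points, xmax=105.0, ymax=68.0, cell=1.0, bandwidth=10.0):
    """Estimate density over a grid of cells using a quartic kernel.

    points: (n, 2) array of event or tracking coordinates.
    Pitch size, cell size, and bandwidth defaults are illustrative.
    """
    xs = np.arange(cell / 2, xmax, cell)  # grid-cell centre x coordinates
    ys = np.arange(cell / 2, ymax, cell)  # grid-cell centre y coordinates
    gx, gy = np.meshgrid(xs, ys)
    density = np.zeros_like(gx)
    for px, py in points:
        d = np.hypot(gx - px, gy - py)    # distance from every cell to the point
        # quartic distance decay; zero beyond the bandwidth cutoff
        density += np.where(d < bandwidth, (1 - (d / bandwidth) ** 2) ** 2, 0.0)
    # scale by the quartic kernel's normalising constant, 3 / (pi * h^2)
    return xs, ys, density * 3 / (np.pi * bandwidth ** 2)
```

Swapping the quartic weight for another kernel only changes the `np.where(...)` line, which is why the kernel choice matters so much less than the bandwidth.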
The choice of bandwidth is key, and there are ways to find an optimal bandwidth. Cross-validation methods exist, but they require multiple passes over the surface. There is also an ad hoc method that relies on the standard distance and mean center to calculate a reasonable bandwidth.
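One common ad hoc rule (a variant is used as the default search radius in some GIS packages) combines the standard distance with the median distance to the mean center. A sketch, with the constants treated as conventions rather than anything principled:

```python
import numpy as np

def adhoc_bandwidth(points):
    """Ad hoc bandwidth rule (illustrative, as used in some GIS tools):
    h = 0.9 * min(SD, sqrt(1/ln(n)) * Dm) * n**-0.2
    where SD is the standard distance and Dm is the median distance
    of the points from their mean center.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    center = pts.mean(axis=0)                # mean center
    dists = np.hypot(*(pts - center).T)      # distance of each point to the center
    sd = np.sqrt((dists ** 2).sum() / n)     # standard distance
    dm = np.median(dists)                    # median distance to the center
    return 0.9 * min(sd, np.sqrt(1 / np.log(n)) * dm) * n ** -0.2
```

A tighter cluster of points yields a smaller bandwidth, which matches the intuition that concentrated activity needs less smoothing.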
Tableau has a simple process for creating a density plot of two-dimensional data. You basically produce the scatter plot for the x and y coordinates, and switch the marks to Density. You control the bandwidth by adjusting the size. It produces a reasonable plot and is probably easy to incorporate in a Tableau workflow as needed.
However, in my opinion it would be nice to control the bandwidth and understand the size of the grid cells. In Tableau Public, I cannot even tell what number the bandwidth is.
Precomputed Kernel Density Estimation
The density value may be precomputed before loading into Tableau or any other visual analysis platform. This gives you full control over how the density surface is calculated. This requires calculating a grid to overlay the pitch, then calculating the estimate for each grid cell.
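Getting a precomputed grid into Tableau is mostly a reshaping exercise: flatten the grid into one row per cell with its x, y, and density value. A sketch with placeholder density values (the grid dimensions and file name are illustrative):

```python
import numpy as np
import pandas as pd

# Grid-cell centres for a 105 x 68 m pitch at 1 m resolution (illustrative)
xs = np.arange(0.5, 105, 1.0)
ys = np.arange(0.5, 68, 1.0)
# Placeholder density surface; in practice this comes from your KDE step
density = np.random.rand(len(ys), len(xs))

gx, gy = np.meshgrid(xs, ys)
long_form = pd.DataFrame({
    "x": gx.ravel(),          # one row per grid cell
    "y": gy.ravel(),
    "density": density.ravel(),
})
long_form.to_csv("kde_grid.csv", index=False)
```

Tableau (or any other tool) can then treat `x` and `y` as dimensions and `density` as the measure driving size or color.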
Unfortunately, there is no straightforward way to visualize this grid in Tableau. One approach I do like is to use circle symbols scaled and colored by the density value. I think this gives an interesting look while also revealing the density of the locations.
Alternatively, you can apply the density mark to the precomputed density values, which effectively smooths the circles, though it produces a lot of overlap.
Finally, the heatmap/table approach can reproduce the grid of the KDE.
Kernel Density for Tracking Data
Tracking data can reveal more about a player's off-the-ball positioning. However, there are arguments against applying traditional two-dimensional KDE techniques to this type of data. The main problem is that the points are not independent of each other: a player's location at point 2 depends on where they were at point 1. This falls in the area of "spatial autocorrelation."
One option for handling this is to introduce some spatial dependence structure into the calculation of distances. In the KDE-DT approach, you generate a Delaunay triangulation of the points and convert it to a network. The distance between a cell and the known points is then calculated along this network: for a given cell, find the nearest tracking point, calculate the network distance from that point to all other points within the bandwidth, and use those distances in the kernel just as you would for two-dimensional KDE.
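Those steps can be sketched for a single cell using SciPy's Delaunay triangulation and a NetworkX shortest-path search. This is my reading of the approach, not a reference implementation, and the quartic kernel and function name are my own choices:

```python
import numpy as np
import networkx as nx
from scipy.spatial import Delaunay, cKDTree

def kde_dt_cell(cell_xy, points, bandwidth=10.0):
    """Sketch of one KDE-DT cell evaluation: distances between tracking
    points are measured along a Delaunay-triangulation network rather
    than as straight lines."""
    pts = np.asarray(points, dtype=float)
    tri = Delaunay(pts)
    g = nx.Graph()
    # add every triangle edge, weighted by its Euclidean length
    for simplex in tri.simplices:
        for i, j in [(0, 1), (1, 2), (0, 2)]:
            a, b = int(simplex[i]), int(simplex[j])
            g.add_edge(a, b, weight=float(np.hypot(*(pts[a] - pts[b]))))
    # the nearest tracking point to the cell centre is the network entry point
    nearest = int(cKDTree(pts).query(cell_xy)[1])
    # network distances to every point reachable within the bandwidth
    dists = nx.single_source_dijkstra_path_length(
        g, nearest, cutoff=bandwidth, weight="weight")
    # quartic kernel applied to the network distances
    return sum((1 - (d / bandwidth) ** 2) ** 2 for d in dists.values())
```

Repeating this for every cell gives the full surface, which is also why the method is slow: a Dijkstra search per cell adds up quickly.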
In the Colab notebook, I sample every 8th tracking point to reduce some of the processing time. Creating a network for each player and calculating the distance to points along the network graph can be time-consuming.
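The sampling itself is a one-liner with pandas slicing (the column names here are illustrative, not the Metrica schema):

```python
import pandas as pd

# Illustrative tracking frame: keep every 8th row to thin the data
tracking = pd.DataFrame({"frame": range(100),
                         "x": range(100),
                         "y": range(100)})
sampled = tracking.iloc[::8]  # rows 0, 8, 16, ...
```

Thinning like this trades temporal resolution for speed, which matters most in the KDE-DT step where each extra point grows the network.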
This may produce strikingly different views of a player's movement during the match. Consider player 4, for example. Below is the Tableau density plot for all of their event data. Most of their events are located along the edges of the pitch.
Compare this to their raw trajectory data (their tracking data plotted as individual points). This spaghetti plot shows movement all over the pitch, with two larger areas reflecting the two periods of the match.
The traditional KDE approach in Tableau is not much cleaner, but there are a few darker blue spots that indicate a lot of activity.
Finally, the KDE-DT reveals two areas of the pitch that are very distinct from the event-based data. Much more time was spent in these areas, perhaps moving further to the outside to receive or distribute the ball. Player 4 also appears to have stayed on the outside of the box before moving to the lower corner for actual on-the-ball action. Comparing these two datasets reveals a lot about that player's activities.
An alternative approach is to include only points within a spatial bandwidth (distance) and a temporal bandwidth (time-distance) of each other. This places greater emphasis on events that happen close together in time. It could be used in at least two different ways. For one, you might create densities at regular intervals (every 10 minutes) and use a temporal bandwidth of 5 minutes to get a sense of average positioning over each ten-minute window.
Second, you might use a timestamped event as your base time and find positioning within plus or minus 10 seconds of that event (or just the 10 seconds preceding it). This is the first time we have attempted to incorporate any sort of player movement into positioning. For example, below are all the players' positions in the 10 seconds prior to a shot event (the red circle). The lighter the density, the further back in time those points are. While some are difficult to read, you can see the player in the middle moved backwards prior to the shot. The goalkeeper on the left moved very little in those 10 seconds, but was standing outside the box.
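The spatio-temporal weighting can be sketched as a product of two kernels: one over distance from the cell, one over time from the event. Everything here (both bandwidths, the quartic form for each kernel, the function name) is an assumption for illustration:

```python
import numpy as np

def spatiotemporal_weight(points, times, event_time, cell_xy,
                          h_space=10.0, h_time=10.0):
    """Sketch of a spatio-temporal density contribution for one cell:
    only points within both the spatial bandwidth of the cell and the
    temporal bandwidth of the event contribute, and points closer in
    time are weighted more heavily."""
    pts = np.asarray(points, dtype=float)
    t = np.asarray(times, dtype=float)
    ds = np.hypot(*(pts - np.asarray(cell_xy)).T)  # distance to the cell
    dt = np.abs(t - event_time)                    # time-distance to the event
    ws = np.where(ds < h_space, (1 - (ds / h_space) ** 2) ** 2, 0.0)
    wt = np.where(dt < h_time, (1 - (dt / h_time) ** 2) ** 2, 0.0)
    return float((ws * wt).sum())                  # joint spatio-temporal weight
```

The temporal kernel is what makes earlier positions fade out in the visual: a point 9 seconds before the shot contributes almost nothing, while one at the moment of the shot contributes fully.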