Exploring Event and Tracking Data using Metrica Sports Open Data Part II -> Social Network Analysis

Mar 11, 2021

Introduction

This will be a series of articles exploring the Metrica Sports Open data available here: https://github.com/metrica-sports/sample-data. This dataset includes two real association football matches using both event-based data and tracking data. The two teams are unknown (and I believe each match involves different teams), and the players are unknown. The goal is to explore analytic and visual techniques that could be applied to event-based data of any kind, and tracking data of any kind.

In this part, I’ll expand on the passing network visualization discussed in Part I. While these are nice, eye-catching visualizations, they can be hard to parse and draw conclusions from. Filtering by different pass properties (length, angle, and outcome) is one way to make sense of the network. Another option is to use a metric that summarizes the network in a single value.

These passing networks are essentially graph structures, which are often used in social network analysis. Graphs are data structures made of nodes and edges (or links) between the nodes. In this case, each player is a node, and an edge is created between two players if they ever pass to each other. The edge is weighted by the number of passes between the players.
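As a minimal sketch of that structure, the snippet below builds an undirected, weighted graph with NetworkX. The pass list is made up for illustration (it is not taken from the Metrica data), and the way pairs are counted is just one reasonable approach, not necessarily what the notebook does.

```python
from collections import Counter

import networkx as nx

# Hypothetical pass events as (passer, receiver) pairs; not real match data.
passes = [
    ("Player1", "Player2"),
    ("Player2", "Player1"),
    ("Player2", "Player3"),
    ("Player2", "Player3"),
]

# Count passes per unordered pair, because an undirected graph ignores direction.
pair_counts = Counter(tuple(sorted(pair)) for pair in passes)

G = nx.Graph()
for (a, b), n_passes in pair_counts.items():
    # The edge weight is the number of passes exchanged between the two players.
    G.add_edge(a, b, weight=n_passes)

print(list(G.edges(data="weight")))
```

Sorting each pair before counting means a pass from Player1 to Player2 and a pass from Player2 to Player1 land on the same edge.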

Graphs may be undirected or directed (DiGraph). An undirected graph connects players if they passed to each other in either direction, whereas a directed graph records which player was the source of the pass and which was the target. A goalkeeper, for example, may distribute passes and never receive one. The DiGraph can capture this, and the undirected Graph cannot.
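The goalkeeper case can be illustrated with a small sketch. The passes and the "GK" label below are invented for the example; the point is only that the directed graph keeps the in/out distinction that the undirected version loses.

```python
import networkx as nx

# Hypothetical passes: the goalkeeper ("GK") distributes but never receives.
passes = [
    ("GK", "Player2"),
    ("GK", "Player3"),
    ("Player2", "Player3"),
    ("Player3", "Player2"),
]

D = nx.DiGraph()
for src, dst in passes:
    if D.has_edge(src, dst):
        D[src][dst]["weight"] += 1  # repeated pass in the same direction
    else:
        D.add_edge(src, dst, weight=1)

# Collapse direction to get the undirected view of the same network.
U = D.to_undirected()

gk_in = D.in_degree("GK")        # teammates who passed to the goalkeeper
gk_out = D.out_degree("GK")      # teammates the goalkeeper passed to
gk_undirected = U.degree("GK")   # connections with direction ignored

print(gk_in, gk_out, gk_undirected)
```

In the directed graph the goalkeeper has no incoming edges, while the undirected graph only shows that the goalkeeper is connected to two teammates.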

Network Centrality

To derive a single measure from the passing graph we can use the concept of centrality. In a social network made up of actors and the links between them, centrality describes how central each actor is in the network. Other variants of centrality look at how important the different actors are in the network. In this case, the actors are the players, and we are interested in the most central player in the team’s passing system.

The simplest measure of centrality is degree centrality, which just counts the number of edges connected to a node. If a DiGraph is used, you can split it into in-degree centrality (the number of teammates who passed to a player) and out-degree centrality (the number of teammates a player passed to).
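Here is a small sketch of in-degree and out-degree centrality using NetworkX's built-in functions. The four-player graph is hypothetical. Note that NetworkX normalizes each count by the number of other nodes (n - 1), so the values fall between 0 and 1.

```python
import networkx as nx

# Hypothetical directed passing graph: an edge A -> B means A passed to B at least once.
D = nx.DiGraph()
D.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "A"), ("C", "A")])

# Fraction of teammates who passed to each player.
in_c = nx.in_degree_centrality(D)
# Fraction of teammates each player passed to.
out_c = nx.out_degree_centrality(D)

for player in sorted(D.nodes):
    print(player, round(in_c[player], 2), round(out_c[player], 2))
```

Player A passes to all three teammates, so A's out-degree centrality is 1.0, while D never passes and scores 0.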

Eigenvector centrality expands on degree centrality to consider not just a node’s number of edges, but also which nodes it is connected to. If two players have the same degree centrality, but one of them is connected to more central players, that player will have a higher score.
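To make the "same degree, different neighbors" idea concrete, here is a toy sketch. The graph and node names are invented: P1 and P2 each have two connections, but P1's are to the busiest nodes and P2's are to peripheral ones.

```python
import networkx as nx

# Hypothetical undirected network.
G = nx.Graph()
G.add_edges_from([
    ("H", "a"), ("H", "b"), ("H", "c"), ("a", "b"),  # H is the hub of the core
    ("P1", "H"), ("P1", "a"),                        # P1 links to well-connected nodes
    ("c", "x"), ("P2", "x"), ("P2", "y"),            # P2 links to peripheral nodes
])

scores = nx.eigenvector_centrality(G, max_iter=1000)

print("degree:", G.degree("P1"), G.degree("P2"))
print("eigenvector:", round(scores["P1"], 3), round(scores["P2"], 3))
```

Both players have degree 2, yet P1 receives the higher eigenvector score because its neighbors are themselves central.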

Finally, a more sophisticated version of centrality is betweenness centrality. This relies on the concept of the shortest path, which is the route along edges with the smallest total distance from an origin node to a target node. Betweenness centrality looks for nodes that are important in the flow of a network. According to Borgatti et al. (2018), it’s “a measure of the number of times a given node falls along the shortest paths between two other nodes” (p. 332). In other words, it looks for nodes that act as a bridge between other nodes. From the Wikipedia page on centrality, “It was introduced as a measure for quantifying the control of a human on the communication between other humans in a social network.” I think this is similar to how one might think about a passing network: the degree to which a player controls the “communication” (i.e., the ball) between other players on a team.
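The bridge idea can be sketched with a toy graph: two tightly connected groups of three that can only reach each other through a single node, here called "M". The graph is hypothetical and unweighted, so every edge counts as one step.

```python
import networkx as nx

# Two hypothetical triangles joined only through the bridging node "M".
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),  # first group
    ("D", "E"), ("E", "F"), ("D", "F"),  # second group
    ("C", "M"), ("M", "D"),              # M bridges the two groups
])

# Normalized by default: the share of node pairs whose shortest path passes through each node.
bc = nx.betweenness_centrality(G)

for node, score in sorted(bc.items(), key=lambda item: -item[1]):
    print(node, round(score, 2))
```

M only has two connections, but it sits on every shortest path between the two groups, so it gets the highest betweenness score. A, B, E, and F bridge nothing and score 0.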

Python Example

The graph data structures are built in Python using NetworkX. You can find a working notebook here: https://colab.research.google.com/drive/1TLUZh4i7MnTgO_6sH0JmJG3et5QDeuf9?usp=sharing

When building the graph, you need to think about what the edge weight represents. If you are building a graph to analyze shortest paths, it doesn’t make sense to use the number of passes directly as the weight. Shortest-path algorithms treat the weight as a distance, so two players who pass to each other a lot would have a high weight and therefore appear to be far apart, and shortest paths would tend to avoid them. Instead, you can use the inverse of the weight (1 divided by the number of passes), which gives frequent passing pairs a shorter distance. Weight is appropriate for eigenvector centrality, and distance is appropriate for betweenness centrality.
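The sketch below shows one way to store both attributes on each edge. The pass counts are hypothetical, and the attribute name "distance" is my own choice for illustration rather than anything NetworkX requires.

```python
import networkx as nx

# Hypothetical pass counts: A and C both exchange many passes with B,
# but only rarely pass directly to each other.
D = nx.DiGraph()
D.add_weighted_edges_from([
    ("A", "B", 10), ("B", "A", 10),
    ("B", "C", 10), ("C", "B", 10),
    ("A", "C", 1), ("C", "A", 1),
])

# Store the inverse of the pass count as a separate "distance" attribute.
for u, v, data in D.edges(data=True):
    data["distance"] = 1 / data["weight"]

# Eigenvector centrality uses the raw pass counts as connection strength.
eig = nx.eigenvector_centrality(D, weight="weight", max_iter=1000)

# Betweenness uses the inverse, so frequent passing pairs count as "close".
btw_distance = nx.betweenness_centrality(D, weight="distance")

# For comparison: treating raw pass counts as distances hides B's role.
btw_raw = nx.betweenness_centrality(D, weight="weight")

print(btw_distance["B"], btw_raw["B"])
```

With the inverse distance, the route from A to C through B (0.1 + 0.1) is shorter than the direct edge (1.0), so B is recognized as the go-between. With raw counts used as distances, the rarely used direct edge looks shortest and B's betweenness drops to 0.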

Degree Centrality

Examining the Home team from the first match based solely on degree centrality places Player7 as the most connected player, with Player12 in second place. On average, Player12 played in the center of the field and Player7 along the right side. Player7 is actually top in both in-degree and out-degree centrality, meaning they received passes from and passed to most of their teammates. However, second place in out-degree centrality went to Player2, suggesting they connected with a lot of their teammates from the back, whereas Player8 received passes from a lot of their teammates on the front left side of the field.

Social Network Analysis of Passing Networks — Home Team

It’s helpful to compare these to the Away team to see some of the value that might be derived from summarizing by a single number rather than just the network visualization. For example, Player21 is the highest in terms of out-degree centrality, and their average position is more in the center.

Social Network Analysis of Passing Networks — Away Team

Eigenvector Centrality

With Eigenvector centrality we can bring in additional information by weighting each edge by the number of passes to different players using a DiGraph.

While Player7 was the most connected player on the Home team, Player12 had the highest eigenvector centrality, meaning they were connected to some of the more central players in the network. Player7 was still high, coming in second. These two played a clear role in passing, with the highest proportion of their passes going to each other. Player8, though, has a very similar eigenvector centrality score to Player7, and might also be considered to play a key role.

While the Home team has a sort of three-part passing group, the Away team is highly reliant on Player21, whose eigenvector score is about 1.5 times that of second-place Player19 (who also occupied the center). It seems (maybe a little naively on my part) that the Home team spread the ball across the field vertically, whereas the Away team relied on moving the ball into the center of the field.

These of course are static snapshots and don’t consider how passing changed over the course of the game.

Betweenness Centrality

For the Home team, betweenness centrality shows a similar pattern, with Player12 at the top and Player7 second. Yet Player2 is actually in third. To me this suggests that, considering the number of passes and connections, a lot of the activity went through Player12 and Player7, and perhaps also started in the back with Player2 and Player4.

In contrast, the Away team’s top betweenness players were Player21, Player17, and Player19. All three, on average, were located in the middle of the field. Again, perhaps a little naively, it appears that the Away team relied heavily on passing through the center rather than along the sidelines. And again, keep in mind that this is computed over the entire match, as an average of sorts.

Check

How might we check some of these interpretations? In the Colab notebook, I adapted my brute-force approach for determining the success or failure of a pass so that it also tracks pass sequences. That is, each series of successful passes in which the team maintained possession was given a unique id, sort of a possession id. It’s not the most elegant solution, but it produces a table of the number of times a player appeared in a passing sequence.
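The notebook has the actual implementation. The snippet below is only a rough sketch of the idea, using a simplified, hypothetical event format of (team, passer, receiver, completed) tuples. It assumes a new sequence starts whenever the team in possession changes or the previous pass failed.

```python
from collections import Counter

# Hypothetical, simplified events in match order: (team, passer, receiver, completed).
events = [
    ("Home", "Player4", "Player6", True),
    ("Home", "Player6", "Player7", True),
    ("Home", "Player7", "Player12", False),  # failed pass ends the sequence
    ("Away", "Player21", "Player17", True),
    ("Home", "Player4", "Player7", True),
]


def label_sequences(events):
    """Attach a sequence id to each pass."""
    labeled = []
    seq_id = 0
    prev_team, prev_completed = None, False
    for team, passer, receiver, completed in events:
        # Start a new sequence if possession changed or the last pass failed.
        if team != prev_team or not prev_completed:
            seq_id += 1
        labeled.append((seq_id, team, passer, receiver, completed))
        prev_team, prev_completed = team, completed
    return labeled


def count_appearances(labeled):
    """Count how many distinct sequences each player took part in."""
    players_by_seq = {}
    for seq_id, _team, passer, receiver, completed in labeled:
        players = players_by_seq.setdefault(seq_id, set())
        players.add(passer)
        if completed:  # the receiver only took part if the pass arrived
            players.add(receiver)
    counts = Counter()
    for players in players_by_seq.values():
        counts.update(players)
    return counts


labeled = label_sequences(events)
appearances = count_appearances(labeled)
print(appearances.most_common(3))
```

Counting each player once per sequence, rather than once per pass, keeps a long back-and-forth between two players from inflating their totals.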

The top 3 Home players that appeared in the most passing sequences were Player4, Player6, and Player7. The top 3 that appeared in passing sequences lasting more than 3 passes were Player7, Player12, and Player4.

So Player4 appears in a number of sequences, but is less central to the passing network. The same goes for Player6.

In contrast, Player21, Player17, and Player16 showed up the most times in the Away team’s pass sequences. This also supports the idea that the Away team’s play was concentrated in the center of the field.

Conclusion

I feel that in any type of analysis, if you detect a pattern using one method (metric or measure), it’s always worth looking at the problem from different angles. So looking at the frequency of passes between players using a visualization is useful, but summarizing that information with a single metric adds a little more information.

It may be simplistic of me, given how much happens on the field, but seeing how the Home team was able to spread the ball across the field between Player7, Player12, Player4, and Player6, it doesn’t necessarily surprise me that they won 3 to 0.

Now, I can’t infer any type of causal relationship here. Was the Home team better at passing across the field or did Away’s defense force them to the sidelines? Did Away spend too much time in the center, or did Home’s defense let them in too much?