NHL play-by-play files contain individual events (e.g., shots, hits) that occur during an NHL game. These files represent a timeline of observed actions, including their locations on the ice, their times of occurrence, and the player(s) and team(s) involved (Fig 1).
Fig 1. Screen shot of exemplar NHL play-by-play file for game 20189 of the 2018-2019 season, New York Rangers vs Anaheim Ducks.
These data are generated on-the-fly by NHL employees who document relevant events in near real-time while watching an NHL game. Because hockey is a fluid sport in which many events can occur in a short period of time, and because these events are documented by human beings, these files sometimes contain errors and invalid entries.
This leads us to a topic that’s not often discussed: how reliable are the data in the NHL play-by-play files?
“No measurement is exact.” - BIPM (2008)
Before we dive into the play-by-play data, it’s important to understand metrology.
The science of measurement, which is known as metrology (BIPM 2008), is inherent in many aspects of our everyday lives. Basic actions such as weighing fruit at a grocery store, measuring a piece of wood before cutting it, or checking the temperature of a Thanksgiving turkey are all metrological actions. Metrology is an important field of science that is often overlooked. Disregarding its concepts often leads to poor data quality and misinformed data-driven decisions. While weighing apples at the grocery store to the nearest quarter ounce is rarely necessary, inaccurate measurements of objects, concepts, time, etc., can be detrimental when they are used to inform high-level data-driven decisions:
“NASA lost its $125-million Mars Climate Orbiter because spacecraft engineers failed to convert from English to metric measurements when exchanging vital data before the craft was launched.” – Holtz (1999).
Although the above quote represents an extreme scenario, it gives you an idea of how measurement uncertainty can cause major problems. It also exemplifies a component of measurement uncertainty that is harder to quantify: human error. We are not computers. We’re subjective by nature, and our minds and eyes do a great job of fooling us from time to time. As noted earlier, NHL play-by-play data are a compilation of events documented on-the-fly by humans. These data are used to generate player stats, which in turn inform players’ future salaries and bonuses, predictive models, fantasy lineups, discussions on sports radio, etc. These data are obviously not as important as the measurements of the Mars Climate Orbiter, but they matter a great deal within the hockey world. As such, it’s important to ensure that the data contained within these files are accurate and reliable.
On this page we present our quality assurance / quality control (QA/QC) framework and identify common errors found within these data. We provide guidance on how we ‘clean’ these files so that our subsequent exploratory analyses and predictive models are more reliable.
Our Automated Processing Framework
We developed an automated process that downloads and processes NHL play-by-play files. Below is an overview of the framework.
For a single NHL game, two different NHL play-by-play (.JSON) files are obtained. The first file, which contains events, players on the ice, event players, event team, period time, period, elapsed seconds, etc., is obtained using the nhlscrapr package (Thomas and Ventura 2017). The second file is obtained using ancillary scripts written in the R programming language. It repeats some of the information found in file 1 (period, period time, events, and event team) and adds information not found in file 1, such as detailed event descriptions, stoppage indicators, and x-y event locations. Up through the 2016-2017 season, both play-by-play files could be obtained using nhlscrapr. However, at the start of the 2017-2018 season, the second play-by-play file was moved to a different API link and was slightly reformatted.
After both datasets are downloaded, the two play-by-plays are saved to a repository. Next, the data are run through a suite of R functions that manipulate, clean, and quality-check them. Essentially, we start with two raw datasets and end with a single, clean dataset. We’ve incorporated several quality assurance checkpoints along the framework to ensure that inadvertent errors are not propagated into any of the data (Fig 2).
Fig 2. Icetistics automated QA/QC framework.
Our nominal QA procedure identifies discrepancies in events. When this happens, a flag is raised, and further analysis is required to determine whether the discrepancy is valid (e.g., example 1 in Results) or invalid (e.g., example 2 in Results). If the event is deemed valid it is retained. If it is considered invalid it is removed from the dataset and archived in a separate file.
Novel Data-types and Algorithms
Within our framework a handful of novel algorithms and data-types exist. These include:
- Automated flagging of dataset discrepancies and potentially invalid events (discussed above)
- Assignment of event team for offsides and icing events
- Quantification of defensive and offensive rebounds, as well as goaltender ‘cough-up’ rates
- Assignment of natural puck turnovers and puck possession teams
We also include previously researched, non-mainstream data-types, such as shot quality, e.g., Ryder (2004) and Krzywicki (2010), that we feel are of high importance. It is our hope that our algorithms and new data-types, coupled with previously investigated data-types, will provide a valuable resource to the NHL community.
Invalid Entries and File Discrepancies
At the time of this writing, we’ve analyzed 785 NHL games across the 2017-2018 and 2018-2019 seasons. We’ve identified a total of 257,417 events within the entire dataset, with 941 flagged events that are either i) invalid or ii) documented in only one of the two play-by-play files of a single NHL game. In terms of percentage, this equates to just 0.37% of the data being flagged (roughly 1.2 flagged events per game). This is a remarkably low error rate! Of the 257,417 events investigated, only 139 were removed from the dataset (Fig 3).
Fig 3. a) Boxplot of Percentage of data flagged per game; b) Breakdown of absolute flagged and removed events.
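The rates quoted above follow directly from the reported counts. The short Python sketch below simply redoes that arithmetic; the variable names are ours and are for illustration only, and the only inputs are the totals stated in the text.

```python
# Counts as reported in the text.
total_events = 257_417
flagged_events = 941
removed_events = 139
games = 785

# Share of all events that were flagged, as a percentage.
flagged_pct = 100 * flagged_events / total_events

# Average number of flagged events per game.
flagged_per_game = flagged_events / games

# Share of all events, and of flagged events, that were ultimately removed.
removed_pct = 100 * removed_events / total_events
removed_share_of_flagged = 100 * removed_events / flagged_events

print(f"flagged: {flagged_pct:.2f}% of events, {flagged_per_game:.2f} per game")
print(f"removed: {removed_pct:.3f}% of events, "
      f"{removed_share_of_flagged:.1f}% of flagged events")
```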
As you can see from Figure 3, roughly 75% of the flagged events are ‘stoppage’ events, e.g., OFFSIDE and PENALTY (PENL). This means that, for a given NHL game, these events were most likely documented in one of the play-by-play files and not the other.
One of the most notable findings is that only GIVEAWAY and TAKEAWAY events were considered invalid and removed from the dataset. This may indicate that human error and subjectivity are revealing themselves within the play-by-play datasets.
During the second game of the 2017-2018 season, Carl Hagelin was credited with a missed shot at 17:22 of the third period, then charged with a giveaway one second later at 17:23. At 17:24, his teammate, Sidney Crosby, was credited with a shot on net (Fig 4). For these events to have occurred as documented, Hagelin would have had to miss his shot at 17:22, recover his own rebound, and hold onto the puck long enough for possession to be awarded. He would then have had to turn the puck over to a member of the opposing team (i.e., the GIVEAWAY). Finally, Crosby would have had to regain the puck from that opposing player and shoot it on net. That’s a lot of action in a span of 2 seconds. Is there a way to confirm whether all, none, or a subset of these three documented events truly happened?
By using the NHL Archives, we can re-watch this (and almost any) game. What happens is this: at 17:22 of the third period, Carl Hagelin misses a shot on net, the puck bounces off the end-boards, and Sidney Crosby then collects the puck and shoots it on net at 17:24. The giveaway charged to Hagelin at 17:23 is invalid. This entry was flagged and removed from the final play-by-play file.
Fig 4. Carl Hagelin incorrectly awarded a Giveaway at 17:23 in the 3rd period of game 2 of the 2017-2018 season
It would be painstakingly slow to manually verify all instances like this. As such, we created a novel algorithm that considers the information below to flag and remove invalid events in an automated fashion.
- the elapsed time between neighboring events,
- the types of neighboring events, and
- the team committing these events
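To make the three criteria above concrete, here is a minimal Python sketch of one way such a check could look. It is not our production algorithm (which is written in R and is not reproduced here). The `Event` fields, the event codes, the function name, and the 2-second threshold are all assumptions chosen for illustration. The rule it applies is deliberately narrow: a giveaway or takeaway is flagged when it is sandwiched between two shot attempts by the same team that happen too close together for the puck to have changed hands twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    period: int
    seconds: int   # seconds elapsed within the period
    kind: str      # event code, e.g. "MISS", "GIVE", "TAKE", "SHOT"
    team: str      # team credited with the event


TURNOVERS = {"GIVE", "TAKE"}          # assumed codes for giveaway / takeaway
ATTEMPTS = {"SHOT", "MISS", "GOAL"}   # assumed codes for shot attempts


def flag_suspect_turnovers(events, max_gap=2):
    """Return indices of GIVE/TAKE events that look implausible.

    `events` must be in chronological order. `max_gap` is the largest
    time span (in seconds) between the two neighbors for which a double
    change of possession is treated as implausible (an assumed value).
    """
    flagged = []
    for i in range(1, len(events) - 1):
        prev, cur, nxt = events[i - 1], events[i], events[i + 1]
        if cur.kind not in TURNOVERS:
            continue
        if not (prev.period == cur.period == nxt.period):
            continue
        # Criterion 1: elapsed time between the neighboring events.
        if nxt.seconds - prev.seconds > max_gap:
            continue
        # Criterion 2: both neighbors are shot attempts.
        if prev.kind not in ATTEMPTS or nxt.kind not in ATTEMPTS:
            continue
        # Criterion 3: both attempts were made by the same team ...
        if prev.team != nxt.team:
            continue
        # ... and the turnover implies that team lost the puck in between:
        # a giveaway by the shooting team, or a takeaway by its opponent.
        if cur.kind == "GIVE" and cur.team == prev.team:
            flagged.append(i)
        elif cur.kind == "TAKE" and cur.team != prev.team:
            flagged.append(i)
    return flagged


# The Hagelin sequence described above (17:22 = 1042 seconds into period 3).
hagelin = [
    Event(3, 1042, "MISS", "PIT"),
    Event(3, 1043, "GIVE", "PIT"),
    Event(3, 1044, "SHOT", "PIT"),
]
suspect = flag_suspect_turnovers(hagelin)
print(suspect)
```

With these assumptions the giveaway at index 1 is flagged. A real implementation would need more cases (for example, blocked shots or events that span a period boundary), and flagged events would still need review before removal.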
Several discrepancies regarding stoppages in play (e.g., offside, icing, puck out of play) exist between the two play-by-play files for a single game. For example, the first play-by-play dataset might show an event sequence such as: SHOT, MISS, HIT, FACEOFF. Notice that no ‘stoppage’ event is documented before the FACEOFF. However, the same sequence in the second play-by-play dataset includes a ‘stoppage’ event, e.g., ICING, between the HIT and the subsequent FACEOFF. In these instances, it would be incorrect to remove the ICING event, even though it represents a discrepancy between the two datasets. Rather, it is important to ensure that events like this are retained when merging or manipulating the datasets. Figure 5 shows an example of one of the two OFFSIDE discrepancies identified between the two play-by-play datasets for a game during the 2017-2018 season.
Fig 5. Left: Quality Assurance output comparing events in play-by-plays #1 and #2 for a single NHL game. Right: truncated play-by-play files #1 and #2 showing the valid offside in PBP #2 and the lack of the offside event in PBP #1. This offside play, although a noted discrepancy between files, was deemed valid and propagated to the merged play-by-play dataset.
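The comparison described above can be sketched in a few lines of Python. This is an illustration only, not our R implementation: events are reduced to hypothetical `(period, seconds, kind)` tuples, the timestamps are made up, and the function names are ours. `find_discrepancies` reports events that appear in only one file, and `merge_union` keeps every event from both files so that a valid stoppage found in only one of them is retained.

```python
from collections import Counter


def find_discrepancies(pbp1, pbp2):
    """Return (only_in_1, only_in_2) for two lists of event tuples.

    Counters are used instead of sets so that genuinely repeated events
    (same period, time, and type) are not collapsed into one.
    """
    c1, c2 = Counter(pbp1), Counter(pbp2)
    only_in_1 = sorted((c1 - c2).elements())
    only_in_2 = sorted((c2 - c1).elements())
    return only_in_1, only_in_2


def merge_union(pbp1, pbp2):
    """Merge two play-by-plays, keeping events found in either file."""
    _, only_in_2 = find_discrepancies(pbp1, pbp2)
    combined = list(pbp1) + only_in_2
    # Sort by period and time. At equal timestamps a FACEOFF is placed
    # last, so a stoppage precedes the faceoff that follows it. The sort
    # is stable, so file 1's original order is otherwise preserved.
    return sorted(combined, key=lambda e: (e[0], e[1], e[2] == "FACEOFF"))


# Hypothetical sequence mirroring the SHOT, MISS, HIT, FACEOFF example.
pbp1 = [(1, 300, "SHOT"), (1, 305, "MISS"), (1, 312, "HIT"), (1, 318, "FACEOFF")]
pbp2 = [(1, 300, "SHOT"), (1, 305, "MISS"), (1, 312, "HIT"),
        (1, 318, "ICING"), (1, 318, "FACEOFF")]

only_in_1, only_in_2 = find_discrepancies(pbp1, pbp2)
merged = merge_union(pbp1, pbp2)
print(only_in_2)
print([kind for _, _, kind in merged])
```

Matching on exact timestamps is a simplifying assumption. If the two files record the same event a second apart, a tolerance-based match would be needed instead.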
Assignment of Icing and Offside Teams
In the nearly 800 games that we’ve analyzed, a total of 7,172 icing and 4,732 offside events were identified. Collectively, these account for about 4.6% of all events documented within the play-by-play datasets. In the nominal play-by-play datasets, an event team is not assigned to these events, which leaves information gaps within the dataset. Using our novel algorithm, we have assigned event teams to approximately 97% of the icing and offside events. We could not confidently assign a team to the remaining 3%. We feel our ability to fill these known information gaps is important because it provides another metric (in addition to giveaways, takeaways, and penalties) to measure a team’s discipline. It also aids in the development of accurate estimates of natural puck turnovers and temporal estimates of puck possession.
Puck Possession and Natural Turnovers
Our proprietary algorithms estimate that a total of 92,615 natural turnovers (puck possession changes) occurred within the analyzed dataset (n = 785 games). This equates to roughly 118 natural turnovers per game. We define natural turnovers as possession changes that occur during live play, i.e., they exclude FACEOFF wins/losses. We provide an in-depth analysis of natural puck turnovers, the quantification of direct (temporal) estimates of puck possession, and the importance of this novel statistic in our article here.
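The definition above can be illustrated with a small Python sketch. It is not the proprietary algorithm. It assumes each event has already been given a possession team (a hypothetical field) and only shows the counting rule: a change of possession counts as a natural turnover unless it was produced by a faceoff.

```python
def count_natural_turnovers(events):
    """Count possession changes that occur during live play.

    `events` is a chronological list of (kind, possession_team) tuples.
    `possession_team` is a hypothetical, pre-assigned field. It may be
    None when possession could not be determined.
    """
    turnovers = 0
    current = None
    for kind, team in events:
        if kind == "FACEOFF":
            # A faceoff sets possession but is not a natural turnover.
            current = team
            continue
        if team is None:
            continue  # unknown possession: leave the current state alone
        if current is not None and team != current:
            turnovers += 1
        current = team
    return turnovers


# Made-up sequence: NYR wins the draw, ANA gains the puck in live play,
# NYR wins the next draw (not counted), then ANA gains the puck again.
sample = [
    ("FACEOFF", "NYR"),
    ("SHOT", "NYR"),
    ("SHOT", "ANA"),
    ("FACEOFF", "NYR"),
    ("MISS", "NYR"),
    ("SHOT", "ANA"),
]
sample_turnovers = count_natural_turnovers(sample)

# Per-game average from the totals reported in the text.
turnovers_per_game = 92_615 / 785
print(sample_turnovers, round(turnovers_per_game, 1))
```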
Our automated framework provides important insight into the play-by-play datasets produced by the NHL. Our nominal QA procedure identifies discrepancies among events between datasets and flags and removes invalid entries. Our novel data-types can help inform previously uninvestigated topics, such as direct (temporal) estimates of puck possession. It should be noted that we are unable to quantify the validity of undocumented events that truly occurred. In other words, we can’t investigate plays that may have happened but were never documented in the first place.
Holtz RL (1999, Oct 01) Mars Probe Lost to Simple Math Error. Retrieved from: http://articles.latimes.com/1999/oct/01/news/mn-17288
International Bureau of Weights and Measures (BIPM) (2008) Evaluation of measurement data – guide to the expression of uncertainty in measurement
Krzywicki K (2010) NHL Shot Quality 2009-10: a look at shot angles and rebounds
Ryder A (2004) Shot Quality: a methodology for the study of the quality of a hockey team’s shots allowed
Thomas AC and Ventura SL (2017) nhlscrapr: Compiling the NHL Real Time Scoring System Database for easy use in R. R package version 1.8.1. https://CRAN.R-project.org/package=nhlscrapr