ETL Pipeline for NFL Game Data

Project Background

The purpose of this project was to take NFL game data from the now deprecated NFL API extract the details, conduct required data transformations and store the data in AWS Redshift. I was going to use this project to enable me to further analyze NFL game data; however, the NFL has since restricted access to this dataset. As such it is impossible to replicate this project.

The end goal of this project is to have a database where it is possible to conduct complex aggregate based queries against NFL data. To do this I used game data available from the NFL API, but as I stated earlier this API is no longer available to unregistered users.

This project used several technologies:

Apache Airflow for the workflow management
AWS Lambda to fetch data, extract and transform data
AWS s3 for JSON storage
AWS Redshift used as the analytical database
Python was used as the primary programming language

The GitHub for this project is located here: NFL-Data-Pipeline

Project Datasets

The NFL datasets used in this picture come from the NFL Teams, Schedule, Coaches, Players, and Games come from the NFL endpoint.

Teams: The team data by year use to be found at https://www.nfl.com/feeds-rs/teams/.json. For example the data for all NFL teams for the year 2000 use to be found here https://www.nfl.com/feeds-rs/teams/2000.json. The following data is provided from this endpoint:

Season: Season of the record
Team ID: A unique ID for each team in the NFL.
Team Abbreviation: Commonly used team abbreviation for example the Denver Broncos is "DEN".
City/State: How the team is identified by location for example the Denver Broncos is "Denver".
Full Name: The full name of the team.
Nickname: Commonly nickname for the team.
Team Type: Used to annotate if the team is a regular season team or the "Pro Bowl Team"
Conference Abbreviation, ID, and Full Name
Division Abbreviation, ID, and Full Name
Year Founded
Stadium name
Ticket Phone Number
Team URL
Team Ticket URL

Schedule: The full schedule for the entire year was posted at https://www.nfl.com/feeds-rs/schedules/.json. For example the 2019 season use to be found here https://www.nfl.com/feeds-rs/schedules/2019.json. The only changes that are made during the year are for the post season game locations. The following data is provided from this endpoint:

Season:
Season Type: This annotates if the is a preseason, regular season or post season game.
Week:
Game ID: A unique ID used for each game.
Game Key: A unique ID used for each game
Game Time Eastern:
Game Time Local:
ISO TIME:
Home Team ID:
Visitor Team ID:
Home Team Abbr:
Visitor Team Abbr:
Home Display Name:
Visitor Display Name:
Home Nick Name:
Visitor Nick Name:
Game Type:
Week Name Abbrievation:
Week Name:
Visitor Team, Season, Team ID, City/State, Full Name, Nickname, Team Type, Conference Abbr, Division Abbrievation:
Home Team, Season, Team ID, City/State, Full Name, Nickname, Team Type, Conference Abbr, Division Abbrievation:
Site ID, Site City, Site Full Name, Site State, Roof type, and network channel:

Coaches : Information for each coach for each team by year use to be found at https://www.nfl.com/feeds-rs/coach/byTeam//.json. For example the Denver Broncos coach information for 2019 could be found here: https://www.nfl.com/feeds-rs/coach/byTeam/1400/2019.json. This contains basic biographical information for each coach. The following data is provided:

NFLID: This is a unique ID used for coaches and players.
Status: This indicates if the coach or player is active, retired, or free agent.
Display Name:
First Name:
Last Name:
ESBID: This is a unique ID used for coaches and players. This number is also used to generate the url for the picture used.
Birthdate:
Home Town:
Is Deceased:
Current Status: This indicates if the coach or player is active, retired, or free agent.
College ID:
College Name:

Players: Player information was found at https://www.nfl.com/feeds-rs/playerStats/.json. The player information for Von Miller could found at https://www.nfl.com/feeds-rs/playerStats/2495202.json. While this site provides some statistical information for the player this project will only use the following:

NFLID: This is a unique ID used for coaches and players.
Status: This indicates if the coach or player is active, retired, or free agent.
Display Name:
First Name:
Last Name:
ESBID: This is a unique ID used for coaches and players.
GSISIS: This is a unique ID used for coaches and players.
Middle Name:
Suffix:
Birth Date:
Home Town:
College ID:
College Name:
Position Group:
Position:
Jersey Number:
Height:
Weight:
Team ID:

Games: The site to get detailed game information was found at https://www.nfl.com/feeds-rs/boxscorePbp/.json. An example for this data source can be seen at: https://www.nfl.com/feeds-rs/boxscorePbp/2016020700.json. This data source was primarily to fill the fact table. This json document is deeply nested. As part of the data preprocessing the JSON document will be flattened to be placed in the staging table. The field from this data source which will be used are:

Season:
Season Type: This indicates if the game is Preseason, Regular Season, or Post Season.
Week:
Game Id: This is a unique identifier for each game.
Game Key: This is a unique identifier for each game.
Game Date:
ISO Time:
Home Team ID: This the unique id for the home team
Visitor Team ID: This is the unique id for the visitor team.
Game Type: This is the common name for the game, it can be Preseason, Regular Season, or even SB for Superbowl.
Phase: This indicates the current phase of the game. Unless the game is currently in session this should read "FINAL"
Visitor Team Score Total:
Visitor Team Score Q1:
Visitor Team Score Q2:
Visitor Team Score Q3:
Visitor Team Score Q4:
Visitor Team Score OT:
Home Team Score Total:
Home Team Score Q1:
Home Team Score Q2:
Home Team Score Q3:
Home Team Score Q4:
Home Team Score OT:
Drive Sequence: This is a unique identifier for the drive for this specific game.
Play ID: This is a unique identifier for this play for this drive sequence.
Scoring: This is a Boolean value for this play resulted in a scoring drive.
Scoring Team ID: This will indicate the ID of the scoring team if the play resulted in a score.
Possession Team ID: This indicates the ID of the team that currently on offense.
Play Type: This is a categorical variable which indicates the play type. Most common forms are Pass, Rush, and Kick-off.
Quarter:
Down:
Yards to Go:
First Down or Touch Down:
Play Stat Sequence: As a play can have multiple participants this provides a unique value for each portion of a play.
Stat ID: This is the Stat ID of the play. This can be cross referenced in the stat_codes.csv, which will be included as a dimension table.
NFLID: This the unique ID of the player involved with this specific portion o f the play.
Yards: This is the number of yards this player is responsible for. If they player had no yards the value will be zero.
Play Description: This is the number of yards this player is responsible for. If they player had no yards the value will be zero.
Play Video: If the play contains a highlight video there will be a direct link to the video here, this is an example.

STAT ID: This is a CSV containing the STAT ID codes and descriptions for each stat used in a game. The CSV can be seen here STAT ID. The fields are described below.

STAT ID: This is the unique identifier for each stat.
Name: This is the common name for the stat.
Comment: This is a detailed description of the stat.

Project Tools

This project incorporated tools from both the Udacity course as well as other AWS offerings. The tools used included:

Python
AWS S3
AWS Lambda
AWS Redshift
Apache Airflow

Python: Python was used to access the NFL endpoints and conduct the data transformations to load the data into Redshift staging tables as well as the fact and dimension tables. Python code was used on a local computer to initiate the data ingest and transformation, in AWS Lambda to access multiple endpoints simultaneously, and in Apache Airflow.

AWS s3: s3 cloud storage was used as a temporary storage location for the team data json, NFL schedule json, coach data json, player data json, and game data json files.

AWS Lambda: Lambda was used for a majority of the data transformation. Lambda is serverless in nature and can be configured to run python code. Lambda works very well on small pieces of code which can be fully executed in less than fifteen minutes. As the transformations for the NFL data was simple, but there were many files to transform Lambda was used to parallelize these data transformations. When a json file was loaded into an s3 bucket the appropriate Lambda would execute.

AWS Redshift: Redshift was used as the analytical database. Data was loaded into staging tables from s3 and then sql queries were used to move the data into the fact and dimension tables.

AWS Apache Airflow: Airflow was used to monitor and manage some of the data transformation workflows for this project.

Final Database Schema

The data model used is a simple star data model. This data model works well as it is easy to query with quick speeds. The table has not been taken to 3NF, while this increases redundancy it will result in faster return times on queries. Once the data has been preprocessed and placed in the correct s3 folder it can be copied into the the staging table. From the staging tables it will be inserted into the correct fact and dimension tables.

The mapping for S3 folder to staging table to fact or dim table is as follows:

s3://nfl-cap/coaches » coaches_staging » coaches_dim

s3://nfl-cap/gameinfo » gameinfo_staging » game_dim

s3://nfl-cap/schedules_details » schedules_details_staging » game_dim

s3://nfl-cap/gameplays » gameplays_staging » play_fact

s3://nfl-cap/player_info » player_info_staging » players_dim

s3://nfl-cap/statids/ » statid_codes_staging » statid_dim

s3://nfl-cap/teams » team_staging » team_dim

Staging Tables

There are seven staging tables for this project. They can be seen in this PDF

Coaches Staging Table

Field	Data Type
coach_id	varchar
season	int
week	varchar
display_name	varchar
first_name	varchar
last_name	varchar
esbid	varchar
status	varchar
birthdate	date
hometown	varchar
collge	varchar
team_id	varchar
isdeceased	varchar
pic_url	varchar

Game Info Staging Table

Field	Data Type
game_id	varchar
season	int
season_type	varchar
week	varchar
game_key	varchar
game_date	date
game_time_iso	timestamp
vis_points_total	int
vis_points_q1	int
vis_points_q2	int
vis_points_q3	int
vis_points_q4	int
vis_points_ot	int
home_points_total	int
home_points_q1	int
home_points_q2	int
home_points_q3	int
home_points_q4	int
home_points_ot	int
win_team	varchar
lose_team	varchar
site_id	varchar
site_city	varchar
site_full_name	varchar
site_state	varchar
roof_type	varchar
phase	varchar

Game Plays Staging Table

Field	Data Type
game_id	varchar
week	varchar
drive_seq	varchar
play_id	varchar
play_stat_id	varchar
season	int
home_team	varchar
def_team	varchar
off_team	varchar
vis_team	varchar
penalty	varchar
scoring	varchar
scoring_team	varchar
play_type	varchar
quarter	varchar
down	int
yard_to_go	numeric
first_down	varchar
play_descript varchar
play_vid	varchar
stat_id	varchar
yards	int
player_id	varchar
player_team	varchar

Player Info Staging Table

Field	Data Type
nfl_id	varchar
esb_id	varchar
gsis_id	varchar
status	varchar
display_name	varchar
first_name	varchar
last_name	varchar
middle_name	varchar
suffix	varchar
birth_date	date
home_town	varchar
college_id	varchar
college_name	varchar
position_group	varchar
position	varchar
jersey_number	varchar
height	int
weight	int
current_team	varchar
player_pic_url	varchar

Statid Code Staging Table

Field	Data Type
stat_id	varchar
name	varchar
comment	varchar

Schedule Details Staging Table

Field	Data Type
game_id	varchar
season	int
season_type	varchar
week	varchar
game_key	varchar
home_id	varchar
vis_id	varchar
game_type	varchar
week_name_abbr	varchar
week_name	varchar

Team Staging Table

Field	Data Type
team_id	varchar
season	int
abbr	varchar
citystate	varchar
full_name	varchar
nick	varchar
team_type	varchar
conference_abbr	varchar
division_abbr	varchar
year_found	int
stadium_name	varchar

Fact and Dim Tables

There are five dimension tables and 1 fact table. They can be seen in this PDF

Coaches Dim Table

Field	Data Type	Key
coach_id	varchar	PRIMARY
season	int	PRIMARY
week	varchar	PRIMARY
display_name	varchar
full_name	varchar
last_name	varchar
esbid	varchar
birthdate	date
hometown	varchar
college	int
team_id	varchar
isdeceased	varchar
pic_url	varchar

Players Dim Table

Field	Data Type	Key
nflid	varchar	PRIMARY
esbid	varchar
gsisid	varchar
status	varchar
display_name	varchar
first_name	varchar
last_name	varchar
middle_name	date
suffix	varchar
birthdate	date
hometown	varchar
college_id	varchar
college	varchar
position_group	varchar
position	varchar
jersey_number	varchar
height	int
weight	int
current_team	varchar
player_pic_url	varchar

Game Dim Table

Field	Data Type	Key
game_id	varchar	PRIMARY
season	int
season_type	varchar
week	varchar
game_key	varchar
game_date	date
game_time_iso	timestamp
vis_points_total	int
vis_points_q1	int
vis_points_q2	int
vis_points_q3	int
vis_points_q4	int
vis_points_ot	int
home_points_total	int
home_points_q1	int
home_points_q2	int
home_points_q3	int
home_points_q4	int
home_points_ot	int
win_team	varchar
lose_team	varchar
site_id	varchar
site_city	varchar
site_full_name	varchar
site_state	varchar
roof_type	varchar
game_phase	varchar
week_name_abbr	varchar
week_name	varchar
game_type	varchar
home_id	varchar
away_id	varchar

Team Dim Table

Field	Data Type	Key
team_id	varchar	PRIMARY
season	int
abbr	varchar
citystate	varchar
full_name	varchar
nick	varchar
team_type	varchar
conference_abbr	varchar
division_abbr	varchar
year_found	int
stadium_name	varchar

Stat ID Dim Table

Field	Data Type	Key
stat_id	varchar	PRIMARY
name	int
comment	varchar

Play Fact Table

Field	Data Type	KEY
guid	sequence	PRIMARY
game_id	varchar
week	varchar
drive_seq	varchar
play_id	varchar
play_stat_id	varchar
season	int
home_team	varchar
def_team	varchar
off_team	varchar
vis_team	varchar
penalty	varchar
scoring	varchar
scoring_team	varchar
play_type	varchar
quarter	varchar
down	int
yard_to_go	numeric
first_down	varchar
play_descript	varchar
play_vid	varchar
stat_id	varchar
yards	int
player_id	varchar
player_team	varchar

Project Process

There are two primary problems with this dataset. The first is gathering the data and the second is the primary dataset called games is a heavily nested json document and will require preprocessing before it can be loaded into a staging table. To gather a all the required json files the following process was used.

Data Preprocess

Step 1

The data gathering process is initiated with a python script where the user enters in the season of the NFL data they want to enter into the database. The script code can be viewed in the access_raw_data Jupyter notebook. This script will execute four function functions

The script will go to https://www.nfl.com/feeds-rs/schedules/.json and do the following:
- Save each Game ID as a separate JSON in s3
- Save Game detailed information in s3
For each team in the NFL schedule for that season the script will go to https://www.nfl.com/feeds-rs/teams/.json and get detailed information for each team for the season selected. Each team will be saved as a separate json in s3.
For each team in the NFL schedule the script will go to https://www.nfl.com/feeds-rs/coach/byTeam//.json and save a json file for each coach for each season. The json file is saved by team, season, week. If the script is run on a weekly basis it will be able to log coaching changes that occur during the regular season.

Step 2

The time to process a single season, deconstruct each game file save each play stat and player json file could easily take over an hour. By using AWS Lambda this time can be reduced to minutes. For step 2 for each gameid saved in s3 will start an AWS process which will download the complete game json and save it in another s3 bucket. The only downside to this is each s3 prefix can only have one lambda trigger. Also Lambda functions can run for a maximum of 15 minutes, so it is a best practice to keep Lambda functions simple. It takes less than 30 seconds for Lambda to open each json file and download the game file and save it in the correct s3 folder. The code for this Lambda function is saved in the lambda_functions folder under the name gameids.py.

Step 3

For each full game saved in s3, Lambda will launch and deconstruct each game saving each play stat as a separate json file in s3. This function will execute a second task of saving each player from the game as a single json document in a separate json document. The code for this function can be viewed in the lambda_functions folder under the name gamesheet_decon.py.

Step 4

For each Player ID saved Lambda will save the Player json file into an s3 bucket in a format that is ready to load into the Redshift staging table.

Step 5

Once the initial data transformation from has taken place and the de-nested json files are in the staging s3 buckets, the process can be started in Apache Airflow. The data will be copied from s3 into the Redshift staging tables.

Step 6

Once the data has been moved from each bucket to the appropriate staging table, the staging tables will begin to populate the fact and dimension tables.

Step 7

After all the staging tables have populated the fact and dimension tables a data quality check is conducted, and Airlow will stop the process.

DAG Chart

Lessons Learned/Final Thoughts

The goal of this project was to build a database from NFL endpoints. This database will allow people to conduct queries on players, teams, and coaches performances. This can be used by NFL fans as well as those who play fantasy football to help plan their team and track potential points. The primary tools for this project were Python, AWS Lambda, AWS s3, AWS Redshift, and Airflow. AWS Lambda was used to help preprocess data that could take hours if done sequentially, by using Lambda this data preprocessing was reduced to minutes. AWS s3 storage was used for the json files as the storage is cheap and easily connects to AWS Lambda and Redshift. AWS Redshift was used to store the data. By using a column store distributed database, data should be readily available and returned quickly. By using Airflow the ability to schedule the transfer of the data from S3 to staging tables and finally to the dimension and fact tables.

For me this project is bittersweet. After working with the NFL data for about a year, I finally found a way to easily move the data into a database which allowed for aggregate analysis only for the NFL to stop public access to the endpoints. This project allowed me to incorporate all the tools I had used in several other projects and was a great exercise.

Share on

Twitter Facebook LinkedIn