Data Lakes with Spark

Project Background

This project is similar to the PostgreSQL-Data-Modeling Project and Data Warehouse Project in that the company Sparkify is rapid gowning and need a data system which can support the analytical needs of their staff. The Sparkify team started with PostgreSQL and then moved to Redshift and now wants to use Spark.

The GitHub for this project is located here: Data-Lake-With-AWS

Datasets

As with the PostgreSQL-Data-Modeling Project and Data Warehouse Project there are two primary datasets, the Song Dataset and the Log Dataset

Song Dataset

This is located at s3://udacity-dend/song_data. A sample of the data is below:

{“num_songs”: 1, “artist_id”: “ARJIE2Y1187B994AB7”, “artist_latitude”: null, “artist_longitude”: null, “artist_location”: “”, “artist_name”: “Line Renaud”, “song_id”: “SOUPIRU12A6D4FA1E1”, “title”: “Der Kleine Dompfaff”, “duration”: 152.92036, “year”: 0}

Log Dataset

This is located at s3://udacity-dend/log_data. A sample of the data is below:

{“artist”:”Pavement”, “auth”:”Logged In”, “firstName”:”Sylvie”, “gender”, “F”, “itemInSession”:0, “lastName”:”Cruz”, “length”:99.16036, “level”:”free”, “location”:”Klamath Falls, OR”, “method”:”PUT”, “page”:”NextSong”, “registration”:”1.541078e+12”, “sessionId”:345, “song”:”Mercy:The Laundromat”, “status”:200, “ts”:1541990258796, “userAgent”:”Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4…)”, “userId”:10}

Final Database Schema

There are 5 tables in the database. This design focuses on the songplay table which houses the most important information for the analytics team. The fact tables of time, users, songs, and artists help to provide context and additional details for the dimension songplay table.

The use tables are the songplay_fact, time_dim, user_dim, song_dim, and artist_dim tables. These tables are in the

The time table which contains:

Field	Data Type
start_time	int
hour	int
day	int
week	int
month	int
year	int
weekday	int

The users table which contains:

Field	Data Type
user_id	int
first_name	varchar
last_name	varchar
gender	varchar
level	varchar

The songs table which contains:

Field	Data Type
song_id	varchar
title	varchar
artist_id	varchar
artist_name	varchar
year	int
duration	float

The artists table which contains:

Field	Data Type
artist_id	varchar
name	varchar
location	varchar
latitude	varchar
longitude	varchar

The songplay table which contains:

Field	Data Type
songplay_id	int
start_time	int
user_id	int
song_id	varchar
artist_id	varchar
session_id	int
location	varchar
user_agent	varchar

Lessons Learned/Final Thoughts

Of all the lessons in the nano-degree this was the one I was the most excited for and the one that I was the most let down by. The lesson discussed the high level concepts of Spark but the actual details of working with Spark were limited. In fact most of the time a Spark view was created to work with the data in an SQL like environment. I will say since completing the nano-degree Udacity has added a new section with a focus on setting up Spark clusters on AWS and debugging in Spark. I have not watched these but hopefully they add more value to this portion of the course.

Overall I did not learn much nor was I impressed with this section.

Share on

Twitter Facebook LinkedIn