Project Background
This project is similar to the PostgreSQL-Data-Modeling Project and Data Warehouse Project in that the company Sparkify is rapid gowning and need a data system which can support the analytical needs of their staff. The Sparkify team started with PostgreSQL and then moved to Redshift and now wants to use Spark.
The GitHub for this project is located here: Data-Lake-With-AWS
Datasets
As with the PostgreSQL-Data-Modeling Project and Data Warehouse Project there are two primary datasets, the Song Dataset and the Log Dataset
Song Dataset
This is located at s3://udacity-dend/song_data. A sample of the data is below:
{“num_songs”: 1, “artist_id”: “ARJIE2Y1187B994AB7”, “artist_latitude”: null, “artist_longitude”: null, “artist_location”: “”, “artist_name”: “Line Renaud”, “song_id”: “SOUPIRU12A6D4FA1E1”, “title”: “Der Kleine Dompfaff”, “duration”: 152.92036, “year”: 0}
Log Dataset
This is located at s3://udacity-dend/log_data. A sample of the data is below:
{“artist”:”Pavement”, “auth”:”Logged In”, “firstName”:”Sylvie”, “gender”, “F”, “itemInSession”:0, “lastName”:”Cruz”, “length”:99.16036, “level”:”free”, “location”:”Klamath Falls, OR”, “method”:”PUT”, “page”:”NextSong”, “registration”:”1.541078e+12”, “sessionId”:345, “song”:”Mercy:The Laundromat”, “status”:200, “ts”:1541990258796, “userAgent”:”Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4…)”, “userId”:10}
Final Database Schema
There are 5 tables in the database. This design focuses on the songplay table which houses the most important information for the analytics team. The fact tables of time, users, songs, and artists help to provide context and additional details for the dimension songplay table.
The use tables are the songplay_fact, time_dim, user_dim, song_dim, and artist_dim tables. These tables are in the
The time table which contains:
Field | Data Type |
---|---|
start_time | int |
hour | int |
day | int |
week | int |
month | int |
year | int |
weekday | int |
The users table which contains:
Field | Data Type |
---|---|
user_id | int |
first_name | varchar |
last_name | varchar |
gender | varchar |
level | varchar |
The songs table which contains:
Field | Data Type |
---|---|
song_id | varchar |
title | varchar |
artist_id | varchar |
artist_name | varchar |
year | int |
duration | float |
The artists table which contains:
Field | Data Type |
---|---|
artist_id | varchar |
name | varchar |
location | varchar |
latitude | varchar |
longitude | varchar |
The songplay table which contains:
Field | Data Type |
---|---|
songplay_id | int |
start_time | int |
user_id | int |
song_id | varchar |
artist_id | varchar |
session_id | int |
location | varchar |
user_agent | varchar |
Lessons Learned/Final Thoughts
Of all the lessons in the nano-degree this was the one I was the most excited for and the one that I was the most let down by. The lesson discussed the high level concepts of Spark but the actual details of working with Spark were limited. In fact most of the time a Spark view was created to work with the data in an SQL like environment. I will say since completing the nano-degree Udacity has added a new section with a focus on setting up Spark clusters on AWS and debugging in Spark. I have not watched these but hopefully they add more value to this portion of the course.
Overall I did not learn much nor was I impressed with this section.