Project Background
This is the introductory project and uses a traditional SQL database. For this project I designed, built, and populated an analytical database from user log and song data files.
The GitHub repository for this project is located here: PostgreSQL-Data-Modeling
Project Datasets
There are two primary datasets for this project: the Song Dataset, a subset of the Million Song Dataset, and the Log Dataset, which contains simulated log data for the “Sparkify” company.
Song Dataset
Each song is stored in a separate JSON file. An example of the data reads as:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The data schema for this JSON is:
Field | Data Type |
---|---|
num_songs | INT |
artist_id | VARCHAR |
artist_latitude | REAL |
artist_longitude | REAL |
artist_location | VARCHAR |
artist_name | VARCHAR |
song_id | VARCHAR |
title | VARCHAR |
duration | NUMERIC |
year | INT |
Log Dataset
In the Log Dataset each file contains one day’s worth of user logs. This log information is stored in JSON Lines format: each line is a valid JSON object, but the file as a whole is not valid JSON and cannot be parsed as one. To avoid errors while reading these files into Python it is recommended to use pandas with the following format:
```python
import pandas as pd

# filepath is the path to a single day's log file (one JSON object per line)
df = pd.read_json(filepath, lines=True)
```
A sample of this log data in a dataframe is shown below:
The data schema for the log dataset is:
Field | Data Type |
---|---|
artist | VARCHAR |
auth | VARCHAR |
firstName | VARCHAR |
gender | VARCHAR |
itemInSession | INT |
lastName | VARCHAR |
length | REAL |
level | VARCHAR |
location | VARCHAR |
method | VARCHAR |
page | VARCHAR |
registration | NUMERIC |
sessionId | INT |
song | VARCHAR |
status | INT |
ts | TIMESTAMP |
userAgent | VARCHAR |
userId | INT |
Final Database Schema
To enable fast returns on the analytic queries, the project database design uses a star schema. With a star schema the database will likely not be fully 3NF, but the tradeoff is that queries return faster. As the project dataset is small and the entries are inserted via a Python script, the redundancy in the database has little impact on actual operation.
The star schema database has five tables: songplays, users, songs, artists, and time. The database ERD is shown below:
The time table contains:
Field | Data Type | Key |
---|---|---|
start_time | INT | Primary |
hour | INT | |
day | INT | |
week | INT | |
month | INT | |
year | INT | |
weekday | INT |
The users table contains:
Field | Data Type | Key |
---|---|---|
user_id | VARCHAR | Primary |
first_name | VARCHAR | |
last_name | VARCHAR | |
gender | VARCHAR | |
level | VARCHAR |
The songs table contains:
Field | Data Type | Key |
---|---|---|
song_id | VARCHAR | Primary |
title | VARCHAR | |
artist_id | VARCHAR | Foreign Key |
year | INT | |
duration | NUMERIC |
The artists table contains:
Field | Data Type | Key |
---|---|---|
artist_id | VARCHAR | Primary |
name | VARCHAR | |
location | VARCHAR | |
latitude | REAL | |
longitude | REAL
The songplays table contains:
Field | Data Type | Key |
---|---|---|
songplay_id | INT | Primary |
start_time | INT | Foreign Key |
user_id | VARCHAR | Foreign Key |
song_id | VARCHAR | Foreign Key |
artist_id | VARCHAR | Foreign Key |
session_id | INT | |
location | VARCHAR | |
user_agent | VARCHAR |
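
To make the key relationships concrete, below is a minimal DDL sketch for the songplays table. It assumes the four dimension tables above already exist with the primary keys listed, and it assumes songplay_id is auto-generated with SERIAL (the table above lists it simply as INT):

```sql
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,                      -- assumed auto-generated surrogate key
    start_time  INT     REFERENCES time (start_time),    -- foreign key into the time dimension
    user_id     VARCHAR REFERENCES users (user_id),      -- foreign key into users
    song_id     VARCHAR REFERENCES songs (song_id),      -- foreign key into songs
    artist_id   VARCHAR REFERENCES artists (artist_id),  -- foreign key into artists
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
```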
Project Process
This project required creating the tables along with the queries that insert the JSON data into the appropriate tables.
- A DROP TABLE query was written for each table. This ensures any existing copy of the table is deleted from the database before the table is re-created, which is important if any changes were made to the table design.
- A CREATE TABLE query was written for each table, dictating the column data types and any constraints such as primary and foreign keys.
- Data was inserted into the tables with INSERT queries; a sketch of the full cycle is shown below.
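
As a minimal sketch of this drop/create/insert cycle, here it is for the users table. The %s markers are psycopg2 parameter placeholders that the Python script fills from each log record, and the ON CONFLICT clause is an assumption added to keep repeated loads from violating the primary key:

```sql
-- Drop first so any change to the table design takes effect on a re-run.
DROP TABLE IF EXISTS users;

-- Re-create the table with its column types and primary-key constraint.
CREATE TABLE IF NOT EXISTS users (
    user_id    VARCHAR PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);

-- One row per log record; %s placeholders are filled by the Python script
-- via psycopg2. The conflict clause lets a user's level (free/paid) be
-- updated on later loads instead of raising a duplicate-key error.
INSERT INTO users (user_id, first_name, last_name, gender, level)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
```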
Data Test
The following queries were run to ensure the data had been inserted into the tables correctly.
Top Five Users
This query returned the five users with the highest number of songplay records.
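
The exact query is not reproduced here, but as a sketch, one way to write it, assuming "instances" means the count of songplay records per user:

```sql
-- Count songplay records per user and keep the five largest counts.
SELECT user_id, COUNT(*) AS plays
FROM songplays
GROUP BY user_id
ORDER BY plays DESC
LIMIT 5;
```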
Bottom Ten Locations
This query returned the ten locations that have the lowest number of logged users.
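
Again as a hedged sketch, one way to write this, assuming "logged users" means distinct user_ids per location:

```sql
-- Count distinct users per location and keep the ten smallest counts.
SELECT location, COUNT(DISTINCT user_id) AS logged_users
FROM songplays
GROUP BY location
ORDER BY logged_users ASC
LIMIT 10;
```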
Lessons Learned/Final Thoughts
I have worked with SQL databases for many years, and part of my Master’s program involved an in-depth class on relational theory and Oracle SQL databases, so the SQL basics of this class were not difficult. Perhaps the biggest thing I learned was to be willing to break away from third normal form when moving data from a record-keeping system to an analytics-focused system. While I am still not entirely comfortable with that, the speed benefits are hard to argue against. For me the key would be an automated process to ensure values keep some type of consistency.
Overall, this was a fun project and helped brush off the dust on my SQL skills.