A big part of doing data analyses is simply getting the data. Sometimes this is super easy and takes minutes. Other times it can be a bit more complicated. This is especially true if data are stored in multiple data sets, as often the case in data science projects or large-scale epidemiological studies. Luckily, the Structured Query Language (SQL) makes this task much easier and less error prone.
PostgreSQL (Postgres for short) is a powerful, open source object-relational database system that uses SQL and can interface with many other programming languages, including Python. This makes it great for the data science workflow because it allows analysts to go from start to finish using one language and can be worked into a production environment. In my experience, SQL is less popular in academia, however, with “big data” getting more attention, I suspect this will change.
Although Postgres has been in the limelight for a while, more and more momentum is building and large firms such as Amazon Web Services and Microsoft Azure now offer Postgres integration. I primarily use Postgres to access and retrieve data, but Postgres (and others such MySQL) can do much more.
In this post, I will provide a basic example of how to connect to a Postgres database, query data, and do a basic count and relative frequency of a variable, all within Python. I assume that you have already downloaded Postgres and set it up. We will be using the DVD rental database, which you can download here.
The basic question I want to answer is “What was the most rented movie?” The DVD rental database is fictional, so the movie titles and rental frequencies are not real. But its layout is similar to many other real-world databases, so it is a useful example to work with.
Let’s get started!
# psycopg2 allows us to access and work with Postgress from Python import psycopg2 # Pandas and numpy for data wrangling and analysis import pandas as pd import numpy as np # Connect to the dvdrental database on your computer. # Please use whatever password you created for Postgres. conn = psycopg2.connect( host='localhost', dbname='dvdrental', user='postgres', password='') # Open a cursor to perform database operations cur = conn.cursor() # The "sql = " is the first part that allows us to store the SQL query as a dataframe # SELECT is selecting the variables we want to keep from each SQL tables. FROM lists the original table for the join. # The first INNER JOIN matches on "film_id" in both the film and inventory tables. This is stored as a temporary table. # The second INNER JOIN matches on "inventory_id" in both the temporary table and the rental table. # Note: INNER JOIN only keeps observations that match. sql = """ SELECT film.title, inventory.inventory_id, inventory.film_id, rental.rental_id FROM film INNER JOIN inventory ON film.film_id = inventory.film_id INNER JOIN rental ON inventory.inventory_id = rental.inventory_id ORDER BY inventory_id; """ # This creates a dataframe from the SQL query. We do this so we can work with it in Pandas. rentalData = pd.read_sql_query(sql, conn) # Close the connection and cursor to the Postgres database. conn.close() cur.close() # Total number of rows (rentals) print(len(rentalData.index)) # To answer the main question of this analyis, we will do a simple count and relative percentage # Let's first get a count of how many times each movie was rented. We do this with Panda's crosstab titleCount = pd.crosstab(index=rentalData['title'], columns="value") titleCount = titleCount.sort_values(by='value', ascending=False) # Now we will get the relative frequency of how many times each movie was rented as a share of all movie rentals rentalFreq = pd.DataFrame((titleCount / titleCount.sum() * 100)) # Sort the results rentalFreq = rentalFreq.sort_values(by='count', ascending=False) # We see the top 3 results are: # Bucket Brotherhood 0.21192 # Rocketeer Mother 0.20568 # Forward Temple 0.19945
Image credit: http://initd.org/psycopg/