Mode Analytics: Python Tutorial (3)

Lesson 3 covers DataFrames and selecting data using pandas, a popular data manipulation library in Python. This is an exciting tutorial because it’s my first opportunity to use Python with real data. The Mode tutorial for lesson 3 can be found here. A web traffic data set from Watsi is already included in the report linked with tutorial 3.

Although I’ve found Mode’s Python tutorials quite easy to follow so far, this one begins with some SQL, a query language I hope to learn in the future but have not yet been exposed to (Mode also has a series of SQL tutorials). The SQL query in the report produces the data. Once I learn more about SQL, I hope to come back to this tutorial to fully grasp how the data is being accessed. For now, I’ve moved forward without completely understanding how the SQL query provides the data. Below is a quick summary of lesson 3:

Pandas DataFrame: A table of data with rows and named columns. Data must be in a DataFrame in order to use DataFrame methods (see Tutorial 2).

Series: A one-dimensional, list-like object with an index. Each column of a DataFrame is a Series.

Aliases: Convention seems to dictate that libraries should be given aliases so they can be more easily called in your code. For instance, pd as an alias for pandas.

Method .fillna(''): Replaces missing values with empty strings, making text processing easier.

Index: Identifies rows in a DataFrame.

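Method .ix[]: Selects a specific row, with the index number in brackets. Note that .ix[] was deprecated and then removed in pandas 1.0; in newer versions, .loc[] (selects by index label) and .iloc[] (selects by position) take its place.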
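The terms above can be tried out on a small hand-made table. This is only a minimal sketch: the column names mirror the `url`, `title`, and `referrer` columns used in the lesson, but the rows and the `sample` variable are invented for illustration and are not the Watsi data. I'm also assuming a recent pandas version, so the row selection uses `.loc[]` and `.iloc[]` rather than `.ix[]`.

```python
import pandas as pd  # 'pd' is the conventional alias for pandas

# Made-up rows for illustration only (not the Watsi data set)
sample = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    'title': ['Page A', None, 'Page C'],
    'referrer': [None, 'https://example.org', None],
})

# Missing values become empty strings, which is easier for text processing
sample = sample.fillna('')

# Selecting one column returns a Series
titles = sample['title']
print(type(titles).__name__)  # Series

# The default index labels the rows 0, 1, 2
print(list(sample.index))  # [0, 1, 2]

# .loc[] selects a row by index label, .iloc[] by position
row_by_label = sample.loc[1]
row_by_position = sample.iloc[1]
print(row_by_label['referrer'])  # https://example.org
```

With the default index, labels and positions line up, so `.loc[1]` and `.iloc[1]` return the same row here. They would differ if the index were something other than 0, 1, 2, and so on.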

This tutorial highlights the importance of gaining context for any data set you’re using. In order to properly understand and analyze data, you should be clear on where the data is coming from, what the variables are, and what any analyses could mean. I like that this section was included in the tutorial because coming from a scientific background, I know that understanding the data is absolutely necessary to properly interpret results.

Below is my code for lesson 3.

# Python Notebook - Python Tutorial: Lesson 3

datasets[0].head(n=5) # Code to access data produced by SQL query (not written by me)

# Prepping a DataFrame
import pandas as pd

data = datasets[0] # This creates a variable for the SQL query results
data = data.fillna('') # Replaces missing values with empty strings

# Selecting columns in a DataFrame
data['url'] # select url column in the DataFrame

# Selecting rows in a DataFrame
data[:3] # Selects the first 3 rows in a DataFrame
data[4:7] # Selects index 4 up to (not including) index 7
data[4997:] # Selects from index 4997 on

# Selecting specific rows
data.loc[1] # Selects row of index 1 (the tutorial uses .ix[1], which was removed in pandas 1.0)

# Selecting specific rows and columns
data['title'][:3] # Selects column 'title', first 3 rows
data[:3]['title'] # Selecting row then column also works

# Practice problem: Select records from rows 10 to 15 in the 'referrer' column.
data['referrer'][10:15] # Selects index 10 up to (not including) index 15
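One thing worth double-checking in the practice problem is where the slice stops. The sketch below uses a made-up 20-row DataFrame called `toy` (not the Watsi data) to show that `[10:15]` returns index 10 through 14. It also shows `.loc[]`, which I'm bringing in as an alternative and which, unlike bracket slicing, includes both endpoints.

```python
import pandas as pd

# Made-up data for illustration: 20 rows with the default 0..19 index
toy = pd.DataFrame({'referrer': ['site-' + str(i) for i in range(20)]})

# Bracket slicing stops before the end value, so this gives index 10 to 14
by_position = toy['referrer'][10:15]
print(list(by_position.index))  # [10, 11, 12, 13, 14]

# .loc[] slices by label and includes both endpoints: index 10 to 15
by_label = toy.loc[10:15, 'referrer']
print(list(by_label.index))  # [10, 11, 12, 13, 14, 15]
```

So if "rows 10 to 15" is meant to include row 15, the bracket slice would need to be `[10:16]`.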

 
