Mode Analytics: Python Tutorial (6)

Lesson 6 is the final and most complicated lesson using the Watsi data set. It covers deriving new columns and defining functions. The Mode tutorial for lesson 6 can be found here. Below is a summary of lesson 6.

Mode describes creating a new column in a DataFrame as being like creating a new key-value pair in a dictionary (see lesson 1 for more information on dictionaries). Columns can be created by assigning values to a new column name.

data['new_column'] = value
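As a minimal sketch of the assignment above, assuming pandas is installed: the small DataFrame and the 'row_number' column are made up for illustration and are not the Watsi data. The single-argument print(...) call runs under both Python 2 and Python 3.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's DataFrame
data = pd.DataFrame({'platform': ['iPhone', 'Desktop', 'Android']})

# Assigning a single value fills every row of the new column
data['new_column'] = 2

# Assigning a list with one item per row gives each row its own value
data['row_number'] = [1, 2, 3]

print(data)
```

Assigning to a column name that already exists overwrites that column instead of adding a new one.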

Operators: Operators are special symbols that perform arithmetic or logical computation.

in (operator): The ‘in’ operator evaluates whether a value exists in a list (or another container; with a string, it tests for a substring). This is a type of membership operator.
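A short sketch of the membership test, using a shortened version of the platform list and a made-up domain string. The variable names are only for illustration.

```python
mobile = ['iPhone', 'Android', 'iPad']

in_list = 'iPad' in mobile            # True: 'iPad' is an item in the list
not_in_list = 'Desktop' in mobile     # False: 'Desktop' is not in the list
in_string = '.org' in 'example.org'   # True: with strings, 'in' looks for a substring

print(in_list)
print(not_in_list)
print(in_string)
```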

Control statements: A control statement determines whether another statement will be executed.

if (control statement): The condition in an ‘if’ statement is evaluated as a boolean value (True or False; see lessons 1 and 5). If it is true, the indented block that follows is executed. If it is false, that block is skipped.

else and elif (control statements): If the ‘if’ condition is false, the program can still take a different action. If the ‘if’ condition is false and an ‘elif’ (short for else if) condition is true, the ‘elif’ block executes. If the ‘if’ condition and all ‘elif’ conditions are false, the ‘else’ block executes.
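The branching order can be sketched as follows. The lists are shortened and the 'title' and 'label' variables are hypothetical. Only the first branch whose condition is true runs.

```python
mobile = ['iPhone', 'Android', 'iPad']
great_movies = ['Psycho', 'The Birds']

title = 'Psycho'

if title in mobile:            # checked first
    label = 'A Mobile Platform'
elif title in great_movies:    # checked only if the 'if' condition was false
    label = 'Great Movie'
else:                          # runs only if every condition above was false
    label = 'Deny'

print(label)
```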

Functions: See lesson 2 for definition. The keyword ‘def’ signals that the following block of code is a function.

Parameter: A temporary variable name given when the function is defined.

def function_name(parameter):

Argument: The actual value passed in for a parameter when the function is called.


return (statement): When a return statement is executed, the function stops and hands its result back to the caller, where it can be stored in a variable or used in a different function.
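To make the difference between a parameter, an argument, and a returned value concrete, here is a small sketch. Both function names are hypothetical.

```python
def add_exclamation(text):      # 'text' is the parameter
    return text + '!'           # the result is handed back to the caller

def print_only(text):
    print(text)                 # displays the text but returns nothing

shout = add_exclamation('Confirm')   # 'Confirm' is the argument
nothing = print_only('Deny')         # no return statement, so this is None

print(shout)
print(nothing)
```

A function that only prints cannot feed its result into a variable or another function, which is why the later examples switch from print to return.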

Mode offers 3 best practice tips for writing functions:

  1. Functions should only do one logical thing (keep it simple).
  2. Indent with spaces (customary 4 spaces).
  3. Keep testing your function.
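One possible way to follow the third tip is to call the function once for each branch and compare the result with what you expect. The sketch below uses Python's built-in assert statement for that comparison. This is my own choice rather than something the tips prescribe, and the 'classify_platform' function is a hypothetical example.

```python
mobile = ['iPhone', 'Android', 'iPad']

def classify_platform(platform):
    if platform in mobile:
        return 'Mobile'
    elif platform == 'Desktop':
        return 'Desktop'
    else:
        return 'Not Known'

# One check per branch; a failed check raises an AssertionError
assert classify_platform('iPad') == 'Mobile'
assert classify_platform('Desktop') == 'Desktop'
assert classify_platform('Kindle') == 'Not Known'

print('all checks passed')
```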

Method .apply(): Called on a column of your DataFrame, this method runs a function on each value in that column and returns the results as a new Series.
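A minimal sketch of .apply() on a single column, assuming pandas is installed. The three-row DataFrame and the 'has_referrer' function are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the empty string stands for a missing referrer
data = pd.DataFrame({'referrer_domain': ['example.org', 'example.com', '']})

def has_referrer(domain):
    # Called once per value in the column
    return domain != ''

# Pass the function itself (no parentheses); the result is a new Series
data['has_referrer'] = data['referrer_domain'].apply(has_referrer)

print(data)
```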

Below is my code for lesson 6.

# Python Notebook - Python Tutorial: Lesson 6

datasets[0].head(n=5) # Code to access data produced by SQL query (not written by me)

# Prepping a DataFrame
import pandas as pd

data = datasets[0] # This creates a variable for the SQL query results
data = data.fillna('') # Replaces missing values with empty strings


# Deriving new columns
data['platform'].value_counts() # counts values in 'platform', provides information about distribution

# Creating a new column
data['new'] = 2 # Creates a new column with all rows assigned '2'
data[:3] # View first 3 rows of the DataFrame

# Overwriting values in a column
data['new'] = 'overwritten' # Changes value assigned to all rows of column 'new' from '2' to 'overwritten'
data[:3] # View first 3 rows of the DataFrame

# Python functions
mobile = ['iPhone', 'Android', 'iPad', 'Opera Mini', 'IEMobile', 'BlackBerry']

# in operator
print 'iPad' in mobile
print 'Desktop' in mobile
print 'Blair Witch Project' in mobile

# if control statement
if 'iPad' in mobile:
  print 'Confirm'

# else and elif statements
great_movies = ['Psycho', 'Blair Witch Project','The Birds', 'Silence of the Lambs']

if 'Blair Witch Project' in mobile:
  print 'A Mobile Platform'
elif 'Blair Witch Project' in great_movies:
  print 'Great Movie'
else:
  print 'Deny'

# Defining functions
movie = ['Psycho', 'Blair Witch Project','The Birds', 'Silence of the Lambs', 'Anaconda', 'Killer Klowns','The Notebook']
def is_great_movie(movie): # defines "is_great_movie" function, which takes a parameter called "movie"
    if movie in great_movies:
        print 'This is a great movie'
    else:
        print 'This is not a great movie'
is_great_movie('Killer Klowns')

# return statements
def is_great_movie(movie): # defines "is_great_movie" function, which takes a parameter called "movie"
    if movie in great_movies:
        return 'This is a great movie'
    elif movie == 'The Notebook':
        return 'This is a great romance movie'
    else:
        return 'This is not a great movie'
is_great_movie('The Notebook') # The output is printed because it is the only output

is_psycho_great = is_great_movie('Psycho')
print is_psycho_great

# Testing your code (best practices)
def is_great_movie(movie_name): # defines "is_great_movie" function, which takes a parameter called "movie_name"
    if movie_name in great_movies:
        print 'This is a great movie'
    elif movie_name in movie:
        print 'This is not a great movie'
    else:
        return 'Unknown'

is_great_movie('Silence of the Lambs')
is_great_movie('Killer Klowns')
is_great_movie('Breaking Bad')

# Applying functions to DataFrames
def filter_desktop_mobile(platform):
    if platform in mobile:
        return 'Mobile'
    elif platform == 'Desktop':
        return 'Desktop'
    else:
        return 'Not Known'
data['platform'].apply(filter_desktop_mobile) # Applies function 'filter_desktop_mobile' only to column 'platform'
data['platform_type'] = data['platform'].apply(filter_desktop_mobile) # Stores series in new column 'platform_type'

# Selecting multiple columns
data[['platform','platform_type']][12:16] # Selects 'platform' and 'platform_type' columns for rows 12-15 (does not include 16) 
data['platform_type'].value_counts().plot(kind='bar') # Counts the values and plots them as a bar chart

# Practice Problem: Store the length of each row's referrer value in a new column. Hint: We used a method to measure length in a previous lesson.
def referrer_length(referrer): # defines function 'referrer_length'
  return len(referrer) # returns the number of characters in the referrer string
data['len_referrer'] = data['referrer'].apply(referrer_length) # stores the Series produced by 'referrer_length' in a new column
data[:3] # View first 3 rows of the DataFrame to check 

# Practice Problem: Create a derived column from referrer_domain that filters domain types of 'organization' (for '.org') and 'company' (for '.com'), labeling any others as 'other'. Then plot a bar chart of their relative frequency. Hint: Use the in keyword creatively.
def org_com(refer_domain):
  if '.org' in refer_domain:
    return 'organization'
  elif '.com' in refer_domain:
    return 'company'
  else:
    return 'other'
data['org_com_oth'] = data['referrer_domain'].apply(org_com) # Stores the Series produced by 'org_com' in a new column

# Bonus: select the records that were not referred from, and plot their relative frequency. Hint: Think about what values are not equal to.
filter_watsi = data[data['referrer_domain'] != ''] # creates new DataFrame keeping only rows with a non-empty referrer_domain


