Lesson 5 covers filtering data with boolean indexes. The Mode tutorial can be found here. Lesson 5 uses the Watsi data set that was used in lessons 3 and 4. It expands on the practice of web analytics and the techniques that can be used to understand web usage. Below is a brief summary of lesson 5.
Segmentation: breaking data down into subsections (e.g., breaking down page view data by referrer).
Boolean indexes can be used to filter data. A boolean index is a Series of True/False values, one per row. Once the index is created, passing it to the DataFrame in square brackets selects only the rows where the index is True. See below:
filtered_data = data[data['column_name'] == 'row_object']
Once the filtered DataFrame is created, it can be used to further examine questions about that specific subset of data.
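As a minimal sketch of the pattern above, the following splits the filter into its two steps. The column names mirror the Watsi data, but the rows are invented for illustration and the variable names are my own.

```python
import pandas as pd

# Hypothetical sample rows; not the real Watsi data
sample = pd.DataFrame({
    'title': ['Home', 'Donate', 'Home', 'About'],
    'referrer_domain': ['google.com', 'reddit.com', 'reddit.com', 'google.com'],
})

# Step 1: the comparison returns a boolean Series, one True/False per row
reddit_mask = sample['referrer_domain'] == 'reddit.com'

# Step 2: passing the mask in square brackets keeps only the True rows
reddit_rows = sample[reddit_mask]

print(reddit_mask.tolist())           # [False, True, True, False]
print(reddit_rows['title'].tolist())  # ['Donate', 'Home']
```

The filtered DataFrame keeps the original row labels (1 and 2 here), which makes it easy to trace rows back to the full data.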
Method .str.contains(): This method returns a boolean index indicating whether each row's value contains a specific string. It is case sensitive by default; pass case=False, as in .str.contains('text', case=False), to ignore case.
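A small sketch of the case behavior, using made-up referrer strings rather than the Watsi data. It also shows the na parameter, which controls what is returned for missing values; this is probably why the notebook fills missing values with empty strings first, though that is my reading rather than something the tutorial states.

```python
import pandas as pd

# Hypothetical referrer values, including one missing entry
referrers = pd.Series([
    'https://example.com/Medical-help',
    'https://example.com/medical',
    'https://example.com/other',
    None,
])

# Default: case sensitive; na=False treats missing values as "no match"
case_sensitive = referrers.str.contains('medical', na=False)

# case=False ignores upper/lower case differences
case_insensitive = referrers.str.contains('medical', case=False, na=False)

print(case_sensitive.tolist())    # [False, True, False, False]
print(case_insensitive.tolist())  # [True, True, False, False]
```

Note that the pattern is treated as a regular expression by default; passing regex=False makes it a literal substring match.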
Method .tolist(): This method converts a pandas Series to a Python list.
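A short illustration with invented values. One practical reason to use it, as far as I can tell, is that a printed Series may truncate long strings, while a list shows each value in full.

```python
import pandas as pd

# Hypothetical user agent strings
agents = pd.Series(['Mozilla/5.0 (Windows Phone)', 'Mozilla/5.0 (Macintosh)'])

# Convert the Series to a plain Python list
agent_list = agents.tolist()

print(type(agent_list).__name__)  # list
print(agent_list)
```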
Query string: A query string is the part of a URL that follows the “?” and is where search terms are stored. It can be used to understand what search terms people used to find specific webpages. In the below URL, “crowd funding for medical treatment” was searched.
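To show how search terms can be pulled out of a query string, here is a sketch using Python's standard urllib.parse module. The URL and the parameter name q are hypothetical stand-ins of the same shape as a search referrer; they are not taken from the Watsi data.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical search referrer; the real data may use other parameter names
url = 'https://www.google.com/search?q=crowd+funding+for+medical+treatment'

# Everything after the "?" is the query string
query = urlparse(url).query

# parse_qs returns a dict mapping each parameter to a list of values
params = parse_qs(query)

# Take the first value of "q", or an empty string if it is absent
search_terms = params.get('q', [''])[0]

print(search_terms)  # crowd funding for medical treatment
```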
My code for lesson 5 is below.
# Python Notebook - Python Tutorial: Lesson 5
datasets.head(n=5) # Code to access data produced by SQL query (not written by me)
# Prepping a DataFrame
import pandas as pd
data = datasets # This creates a variable for the SQL query results
data = data.fillna('') # Replaces missing values with empty strings
# Filtering data with boolean indexing
data['title']== 'Watsi | Fund medical treatments for people around the world' # Boolean index for views on homepage
homepage_index = (data['title'] == 'Watsi | Fund medical treatments for people around the world') # Assign index a variable name
watsi_homepage = data[homepage_index] # Selects only the rows where homepage_index is True
watsi_homepage['referrer'].value_counts()[:15] # Returns top 15 referrers to watsi homepage
# Problem: the output above is messy (many different links from Google)
watsi_homepage['referrer_domain'].value_counts()[:15] # combines referrers from same domain
# Practice problem: Select all the pageviews originating from the Reddit domain, and see where traffic is landing within Watsi.
reddit_index = data['referrer_domain'] == 'reddit.com'
watsi_reddit = data[reddit_index] # Selects only the rows where reddit_index is True
watsi_reddit['title'].value_counts() # Returns list of where traffic is landing from reddit.com
# The index and selection steps (reddit_index and watsi_reddit) can be combined into one line instead:
watsi_reddit = data[data['referrer_domain'] == 'reddit.com']
# Partial matching text with .str.contains()
medical_referrer_index = data['referrer'].str.contains('medical')
medical_referrals = data[medical_referrer_index]
medical_referrals['referrer'].tolist() # Returns a list instead of a pandas Series
# Practice problem: Find the records with a referrer link containing "crowdfund"
crowdfund_index = data['referrer'].str.contains('crowdfund') # Boolean index: True where referrer contains "crowdfund"
data[crowdfund_index]['referrer'].tolist() # Filters rows with the index, selects the 'referrer' column, and returns a list
# Practice problem: Find the users who visited the site on a Windows Phone using `user_agent`. Output the full string values.
windows_index = data['user_agent'].str.contains('IEMobile') # Boolean index: True where user_agent contains "IEMobile"
data[windows_index]['user_agent'].tolist() # Filters rows with the index, selects the 'user_agent' column, and returns a list