Lesson 3 covers data frames and selecting data using Pandas, a popular data manipulation library in Python. This is an exciting tutorial because it’s my first opportunity to use Python with real data. The Mode tutorial for lesson 3 can be found here. A web traffic data set from Watsi is already included in the report linked with tutorial 3.
Although so far I’ve found Mode’s Python tutorials to be quite easy to follow, this tutorial begins with some use of SQL, a programming language I hope to learn in the future, but have not yet been exposed to (Mode also has a series of SQL tutorials). The SQL query in the report produces the data. Once I learn more about SQL, I hope to come back to this tutorial to fully grasp how the data is being accessed. With that being said, I’ve moved forward in the tutorial without complete comprehension of how SQL provided the data. Below is a quick summary of lesson 3:
Pandas DataFrame: A table of data with rows and multiple columns. Data must be in a DataFrame in order to use DataFrame methods on it (see Tutorial 2).
Series: A column of a DataFrame or a list-like object.
Aliases: Convention seems to dictate that libraries should be given aliases so they can be more easily called in your code. For instance, pd as an alias for pandas.
Method .fillna(''): Replaces missing values with empty strings, making text processing easier.
Index: Identifies rows in a DataFrame.
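Method .ix: Selects a specific row, with the index number in brackets. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; in current versions, .loc (label-based) and .iloc (position-based) do the same job.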
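To tie the terms above together, here is a minimal sketch using a tiny made-up table rather than the Watsi data from the report. The column names and values are hypothetical stand-ins, and the row selection uses .loc and .iloc, which is my assumption about what works in a current pandas install rather than what the tutorial itself shows.

```python
import pandas as pd  # "pd" is the conventional alias for pandas

# Hypothetical stand-in data; the real lesson uses Watsi web traffic data.
data = pd.DataFrame({
    "referrer": ["https://example.com", None, "https://example.org"],
    "title": ["Home", "Donate", None],
})

# Selecting a single column of a DataFrame gives back a Series.
titles = data["title"]

# fillna('') replaces missing values with empty strings.
cleaned = data.fillna("")

# With no index specified, rows are labeled 0, 1, 2.
row_by_label = cleaned.loc[1]      # select the row whose index label is 1
row_by_position = cleaned.iloc[1]  # select the second row by position

print(row_by_label)
```

With the default index the label and the position happen to match, so both selections return the same row here. They would differ if the DataFrame had been sorted or filtered first.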
This tutorial highlights the importance of gaining context for any data set you’re using. In order to properly understand and analyze data, you should be clear on where the data is coming from, what the variables are, and what any analyses could mean. I like that this section was included in the tutorial because, coming from a scientific background, I know that understanding the data is absolutely necessary to properly interpret results.
Below is my code for lesson 3.