5 DataFrames in pandas
In this chapter we’ll introduce DataFrames and the pandas package.
In this session we’ll work in depth with data frames, the workhorse of data analysis in Python. For a quick reference, check out the official quick reference card here.
# Our environment
import pandas as pd
import numpy as np
5.0.1 Pandas DataFrames, Data Containers part IV
We’ve already seen a couple of DataFrames in the workshop, like the plant_growth and chick_weights data sets. We can make a DataFrame by importing a file using the functions we saw in the first two case studies. But we can also make them by hand, using a dictionary of lists, or a bunch of separate lists. Let’s take a look at how they’re constructed from scratch. We’re going to build a small play DataFrame which will help us get familiar with this data container.
5.1 From dictionaries to DataFrames
First, we can make some lists as usual, using []:

foo1 = [True, False, False, True, True, False]
foo2 = ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"]
foo3 = [13, 88, 1233, 55, 233, 18]
You should be familiar with the different types by now:
type(foo1)
#> <class 'list'>
type(foo2)
#> <class 'list'>
type(foo3)
#> <class 'list'>
Let’s make a data frame from scratch by combining the different lists in a dictionary. The keys are used as column labels and the rows are labelled by integers. You can pass a dictionary directly to the pd.DataFrame() function.

foo_df = pd.DataFrame({'healthy': foo1, 'tissue': foo2, 'quantity': foo3})
Our data set:
foo_df
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
Exercise 5.1 Using the dictionary you made in the last chapter, distDict, convert it to a DataFrame.
5.2 From lists to DataFrames
We can also convert lists to a DataFrame by creating a list of lists.
# names
list_names = ['healthy', 'tissue', 'quantity']

# columns are a list of lists
list_cols = [foo1, foo2, foo3]
We saw zip() earlier when we created a dictionary from lists. Here, we’ll use it to construct a list of tuples, called zip_list.

zip_list = list(zip(list_names, list_cols))
print(zip_list)
#> [('healthy', [True, False, False, True, True, False]), ('tissue', ['Liver', 'Brain', 'Testes', 'Muscle', 'Intestine', 'Heart']), ('quantity', [13, 88, 1233, 55, 233, 18])]
zip_dict = dict(zip_list)
zip_df = pd.DataFrame(zip_dict)
print(zip_df)
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
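If you don’t need the intermediate objects, the zip, dict, and DataFrame steps can be chained into a single expression. A minimal sketch, re-declaring the lists so it runs on its own:

```python
import pandas as pd

# The same lists as above
list_names = ['healthy', 'tissue', 'quantity']
foo1 = [True, False, False, True, True, False]
foo2 = ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"]
foo3 = [13, 88, 1233, 55, 233, 18]
list_cols = [foo1, foo2, foo3]

# zip, dict and DataFrame construction in one go
zip_df = pd.DataFrame(dict(zip(list_names, list_cols)))
print(zip_df.shape)  # (6, 3)
```

The result is identical to building zip_list and zip_dict explicitly; the intermediate names are only useful if you want to inspect the steps.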
Exercise 5.2 (List to DataFrame) Return to the two lists containing the cities and their distances from Berlin. Use the following list of names to create a data frame with two columns: city and dist.
list_names = ['cities', 'dist']
5.3 Accessing columns by name
We’ll get into accessing information by position and query in the next section. For now, we just need to be familiar with accessing information using names. For indexing, as we’ll see in more detail later, we need to use []. Don’t confuse this with the [] we used to make a list. These [] follow directly after a variable name and are used to subset it. We use the following notation:

foo_df['healthy']
#> 0 True
#> 1 False
#> 2 False
#> 3 True
#> 4 True
#> 5 False
#> Name: healthy, dtype: bool
Notice that this looks quite different from what our DataFrame reported. That’s because this is an individual column of a data frame and is referred to as a Series. This is basically a list, as we’ve seen previously, but more accurately it’s a 1-dimensional NumPy array, in that it can only hold a single type. The dtype is reported when we print the Series to the screen. Contrast that with:

foo_df[['healthy']]
#> healthy
#> 0 True
#> 1 False
#> 2 False
#> 3 True
#> 4 True
#> 5 False
In the second case we get a DataFrame.
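You can confirm this distinction directly with type(). A quick sketch, using a shortened reconstruction of foo_df:

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False],
                       'tissue': ['Liver', 'Brain'],
                       'quantity': [13, 88]})

print(type(foo_df['healthy']))    # single brackets -> Series
print(type(foo_df[['healthy']]))  # double brackets -> DataFrame
```

The double brackets are really a list inside the subsetting brackets, which is why the result keeps the DataFrame structure.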
Alternatively, we could have also accessed the column by name using the . notation:

foo_df.healthy
#> 0 True
#> 1 False
#> 2 False
#> 3 True
#> 4 True
#> 5 False
#> Name: healthy, dtype: bool
Exercise 5.3 What’s the difference between these two methods?
Exercise 5.4 Select both the quantity and tissue columns.
5.4 Broadcasting
DataFrames are extremely functional. As an example, let’s see how we would add new columns.
foo_df['new'] = 0

foo_df
#> healthy tissue quantity new
#> 0 True Liver 13 0
#> 1 False Brain 88 0
#> 2 False Testes 1233 0
#> 3 True Muscle 55 0
#> 4 True Intestine 233 0
#> 5 False Heart 18 0
There was no previous column 'new' in the DataFrame, but pandas knows that we want to create a new column and populate it with the value 0.
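Broadcasting isn’t limited to a constant: an element-wise expression on an existing column is broadcast the same way. A small sketch, using a shortened version of foo_df:

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False, False],
                       'tissue': ['Liver', 'Brain', 'Testes'],
                       'quantity': [13, 88, 1233]})

# A scalar is broadcast to every row...
foo_df['new'] = 0

# ...and so is an element-wise calculation on an existing column
foo_df['double'] = foo_df['quantity'] * 2
print(foo_df['double'].tolist())  # [26, 176, 2466]
```

No loop is needed; the multiplication is applied to every element of the quantity column at once.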
We can even use this when creating a completely new DataFrame.
results = pd.DataFrame({'ID': 'A', 'value': range(10)})

results
#> ID value
#> 0 A 0
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 A 4
#> 5 A 5
#> 6 A 6
#> 7 A 7
#> 8 A 8
#> 9 A 9
This is great, since we don’t need any of the complicated looping that lists and dictionaries require (we’ll get to looping later on).
We can also drop columns by name using the drop() method. Remember, axis 1 refers to columns.
foo_df = foo_df.drop('new', axis = 1)

foo_df
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
5.5 Attributes, part II
In the last chapter, we saw functions and methods and a brief introduction to attributes. Let’s go a bit deeper into attributes. Everything in Python is an object, and almost all of these objects have attributes associated with them. For example, shape is an attribute of our data frame. It tells us how many rows and columns we have. We don’t need to use () when accessing attributes:

foo_df.shape
#> (6, 3)
Another attribute is dtypes:

foo_df.dtypes
#> healthy bool
#> tissue object
#> quantity int64
#> dtype: object
Attributes are not functions, which are called on objects, nor are they methods, which are functions called by an object. So this doesn’t work:
foo_df.shape()
Although .info()
is a method, it’s not exactly what we want to see.
foo_df.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 6 entries, 0 to 5
#> Data columns (total 3 columns):
#> # Column Non-Null Count Dtype
#> --- ------ -------------- -----
#> 0 healthy 6 non-null bool
#> 1 tissue 6 non-null object
#> 2 quantity 6 non-null int64
#> dtypes: bool(1), int64(1), object(1)
#> memory usage: 230.0+ bytes
You can change the row and column names by just assigning a list to the .columns and .index attributes. Note that you have to explicitly make a copy of the DataFrame, or else you’ll have a link to the original.
Exercise 5.5 Make a copy of foo_df, called foo_df_2. Access the .columns and .index attributes and change them to ['A', 'B', 'C'] and ['H', 'I', 'J', 'K', 'L', 'M'], respectively.
# Change col and row names
foo_df_2 = foo_df.copy()
foo_df_2.columns = ['A', 'B', 'C']
foo_df_2.index = ['H', 'I', 'J', 'K', 'L', 'M']
print(foo_df_2) # The copy
#> A B C
#> H True Liver 13
#> I False Brain 88
#> J False Testes 1233
#> K True Muscle 55
#> L True Intestine 233
#> M False Heart 18
print(foo_df) # The original
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
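To see why the explicit .copy() matters, here is a short sketch contrasting plain assignment, which is just a link to the same object, with a true copy. The names alias, original, and independent are only illustrative:

```python
import pandas as pd

original = pd.DataFrame({'healthy': [True, False],
                         'tissue': ['Liver', 'Brain'],
                         'quantity': [13, 88]})

# Plain assignment does NOT copy: both names point to the same object
alias = original
alias.columns = ['A', 'B', 'C']
print(original.columns.tolist())  # the "original" changed too!

# .copy() gives an independent DataFrame
original.columns = ['healthy', 'tissue', 'quantity']  # restore the names
independent = original.copy()
independent.columns = ['X', 'Y', 'Z']
print(original.columns.tolist())  # unchanged this time
```

Renaming columns through the alias changed the original as well; only the copy can be modified safely.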
5.6 Change data types
Remember that a common problem you’ll encounter is the wrong data type. We saw that for lists we can just use the appropriate function, e.g. str(), to convert a type. For DataFrames, we can change types using the .astype() method called on a column:

df['A'] = df['A'].astype(str)
df.info()
Type object means string. Access built-in string methods with the str accessor, described in the next section below.
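The snippet above assumes an existing df; as a self-contained sketch of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
print(df['A'].dtype)  # int64

# Convert the column to strings with .astype()
df['A'] = df['A'].astype(str)
print(df['A'].dtype)  # object, i.e. strings
```

After the conversion, the numbers are stored as the strings '1', '2', '3', and the str accessor methods become available on the column.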
5.7 Accessors
These are some subtler points and we won’t cover them in the workshop. They’re here for your reference.
The String Accessor, e.g. strip removes leading and trailing whitespace
df = pd.DataFrame({'name': ['Daniel ', ' Eric', ' Julia ']})
df['name_strip'] = df['name'].str.strip()
df
Categorical Accessor, for defining categorical variables
df = pd.DataFrame({'name': ['Daniel', 'Eric', 'Julia'],
                   'gender': ['Male', 'Male', 'Female']})
df['gender_cat'] = df['gender'].astype('category')
See the categories by calling the cat accessor and the categories attribute on the column. Use the codes attribute to get the codes.

df['gender_cat'].cat.categories
df.gender_cat.cat.codes
Date Accessor, use the to_datetime() function

df = pd.DataFrame({'name': ['Rosaline Franklin', 'William Gosset'],
                   'born': ['1920-07-25', '1876-06-13']})
df['born_dt'] = pd.to_datetime(df['born'])
Similar to strings and categorical values, you can access date components with the dt accessor:

df['born_dt'].dt.day
df['born_dt'].dt.month
df['born_dt'].dt.year
5.8 Exercise
Exercise 5.6 To get familiar with basic functions and methods of Python data frames, try to solve the following problems using the classic mtcars data set. You can find this data set in your data directory.
The data has the following structure:
Variable | Description
---------|------------
mpg      | Miles/(US) gallon
cyl      | Number of cylinders
disp     | Displacement (cu.in.)
hp       | Gross horsepower
drat     | Rear axle ratio
wt       | Weight (1000 lbs)
qsec     | 1/4 mile time
vs       | Engine (0 = V-shaped, 1 = straight)
am       | Transmission (0 = automatic, 1 = manual)
gear     | Number of forward gears
Complete the following tasks:
- Import the data set and assign it to the variable mtcars.
- Calculate the correlation between mpg and wt and test if it is significant.
- Visualize the relationship in an XY scatter plot.
- Convert weight from pounds to kg.
5.9 Further Pandas
5.9.1 Missing data
- NaN missing values come from NumPy
- For reference, np.NaN, np.NAN, and np.nan are all the same as R’s NA value
- Check missing with pd.isnull
- Check non-missing with pd.notnull
- pd.isnull is an alias for pd.isna
df = pd.DataFrame({
    'name': ["John Smith", "Jane Doe", "Mary Johnson"],
    'treatment_a': [None, 16, 3],
    'treatment_b': [2, 11, 1]})

df
#> name treatment_a treatment_b
#> 0 John Smith NaN 2
#> 1 Jane Doe 16.0 11
#> 2 Mary Johnson 3.0 1
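Before filling anything in, it’s often useful to check which values are missing with pd.isnull (or the equivalent .isnull() method):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["John Smith", "Jane Doe", "Mary Johnson"],
    'treatment_a': [None, 16, 3],
    'treatment_b': [2, 11, 1]})

# Element-wise check: True where a value is missing
print(pd.isnull(df['treatment_a']))

# The same check as a method, plus a count of missing values
print(df['treatment_a'].isnull().sum())  # 1
```

Summing a boolean Series counts the True values, which gives a quick tally of the missing entries per column.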
How do we replace the NaN with the mean of the column?

a_mean = df['treatment_a'].mean()
a_mean
Use the fillna() method:

df['a_fill'] = df['treatment_a'].fillna(a_mean)
df
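If several columns contain missing values, fillna() also accepts a Series of per-column fill values, so each numeric column can be filled with its own mean in one call. A sketch, assuming the same df as above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["John Smith", "Jane Doe", "Mary Johnson"],
    'treatment_a': [None, 16, 3],
    'treatment_b': [2, 11, 1]})

# Fill every numeric column with that column's own mean in one call
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled['treatment_a'].tolist())  # [9.5, 16.0, 3.0]
```

The fill values are matched to columns by name, so treatment_a gets its own mean (9.5) and treatment_b is untouched since it has no missing values.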
5.9.2 Applying custom functions
NumPy arrays (i.e. matrices) are beautiful since we know that every value has the same type. We can use some built-in and custom functions to take advantage of this.
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})
df.apply(np.mean, axis=0) # Column-wise
df.apply(np.mean, axis=1) # Row-wise
Here’s an example of applying custom functions to DataFrames:
df = pd.DataFrame({'A': [1000, 10000, 100000]})

# Create array of DataFrame values: np_vals
np_vals = df.values

# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

# Create new DataFrame by passing df to np.log10()
df_log10 = np.log10(df)

# Print original and new data containers
[print(x, 'has type', type(eval(x))) for x in ['np_vals', 'np_vals_log10', 'df', 'df_log10']]
[#> np_vals has type <class 'numpy.ndarray'>
#> np_vals_log10 has type <class 'numpy.ndarray'>
#> df has type <class 'pandas.core.frame.DataFrame'>
#> df_log10 has type <class 'pandas.core.frame.DataFrame'>
#> [None, None, None, None]
That last example is completely new and looks pretty scary at first glance. It’s called a list comprehension and we’ll see it in more detail later on in the workshop. This should give you an idea of how all the different components come together in clear and simple instructions.