5 DataFrames in pandas
In this chapter we’ll introduce DataFrames and the pandas package.
In this session we’ll work in depth with data frames, the workhorse of data analysis in Python. For a quick reference, check out the official quick reference card here.
# Our environment
import pandas as pd
import numpy as np
5.0.1 Pandas DataFrames, Data Containers part IV
We’ve already seen a couple of DataFrames in the workshop, like the plant_growth and chick_weights data sets. We can make a DataFrame by importing a file using the functions we saw in the first two case studies. But we can also make them by hand, using a dictionary of lists, or a bunch of separate lists. Let’s take a look at how they’re constructed from scratch. We’re going to build a small play DataFrame which will help us get familiar with this data container.
5.1 From dictionaries to DataFrames
First, we can make some lists as usual, using []:

foo1 = [True, False, False, True, True, False]
foo2 = ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"]
foo3 = [13, 88, 1233, 55, 233, 18]
You should be familiar with the different types by now:
type(foo1)
#> <class 'list'>
type(foo2)
#> <class 'list'>
type(foo3)
#> <class 'list'>
Let’s make a data frame from scratch by combining the different lists in a dictionary. The keys are used as column labels and the rows are labelled by integers. You can pass a dictionary directly to the pd.DataFrame() function.

foo_df = pd.DataFrame({'healthy': foo1, 'tissue': foo2, 'quantity': foo3})
Our data set:
foo_df
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
Exercise 5.1 Using the dictionary you made in the last chapter, distDict, convert it to a DataFrame.
5.2 From lists to DataFrames
We can also convert lists to a DataFrame by creating a list of lists.
# names
list_names = ['healthy', 'tissue', 'quantity']

# columns are a list of lists
list_cols = [foo1, foo2, foo3]
We saw zip() earlier when we created a dictionary from lists. Here, we’ll use it to construct a list of tuples, called zip_list.

zip_list = list(zip(list_names, list_cols))
print(zip_list)
#> [('healthy', [True, False, False, True, True, False]), ('tissue', ['Liver', 'Brain', 'Testes', 'Muscle', 'Intestine', 'Heart']), ('quantity', [13, 88, 1233, 55, 233, 18])]
zip_dict = dict(zip_list)
zip_df = pd.DataFrame(zip_dict)
print(zip_df)
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
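If you don’t need the intermediate objects, the zip, dict, and DataFrame steps can be chained into a single expression. A minimal sketch, re-declaring the lists so it runs on its own:

```python
import pandas as pd

# The same lists as above
list_names = ['healthy', 'tissue', 'quantity']
foo1 = [True, False, False, True, True, False]
foo2 = ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"]
foo3 = [13, 88, 1233, 55, 233, 18]
list_cols = [foo1, foo2, foo3]

# zip, dict and DataFrame construction in one go
zip_df = pd.DataFrame(dict(zip(list_names, list_cols)))
print(zip_df.shape)  # (6, 3)
```

The result is identical to building zip_list and zip_dict explicitly; the intermediate names are only useful if you want to inspect the steps.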
Exercise 5.2 (List to DataFrame) Return to the two lists containing the cities and their distances from Berlin. Use the following list of names to create a data frame with two columns: city and dist.
list_names = ['cities', 'dist']
5.3 Accessing columns by name
We’ll get into accessing information by position and query in the next section. For now, we just need to be familiar with accessing information using names. For indexing, as we’ll see in more detail later, we need to use []. Don’t confuse this with the [] we used to make a list. These [] follow directly after a variable name and are used to subset it. We use the following notation:

foo_df['healthy']
#> 0 True
#> 1 False
#> 2 False
#> 3 True
#> 4 True
#> 5 False
#> Name: healthy, dtype: bool
Notice that this looks quite different from what our DataFrame reported. That’s because this is an individual column of a data frame and is referred to as a Series. This is basically a list, as we’ve seen previously, but more accurately it’s a 1-dimensional NumPy array, in that it can only hold a single type. The dtype is reported when we print the Series to the screen. Contrast that with:

foo_df[['healthy']]
#> healthy
#> 0 True
#> 1 False
#> 2 False
#> 3 True
#> 4 True
#> 5 False
In the second case we get a DataFrame.
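You can confirm this distinction directly with type(). A quick sketch, using a shortened reconstruction of foo_df:

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False],
                       'tissue': ['Liver', 'Brain'],
                       'quantity': [13, 88]})

print(type(foo_df['healthy']))    # single brackets -> Series
print(type(foo_df[['healthy']]))  # double brackets -> DataFrame
```

The double brackets are really a list inside the subsetting brackets, which is why the result keeps the DataFrame structure.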
Alternatively, we could have also accessed the column by name using the . notation:

foo_df.healthy
#> 0 True
#> 1 False
#> 2 False
#> 3 True
#> 4 True
#> 5 False
#> Name: healthy, dtype: bool
Exercise 5.3 What’s the difference between these two methods?
Exercise 5.4 Select both the quantity and tissue columns.
5.4 Broadcasting
DataFrames are extremely functional. As an example, let’s see how we would add new columns.
foo_df['new'] = 0

foo_df
#> healthy tissue quantity new
#> 0 True Liver 13 0
#> 1 False Brain 88 0
#> 2 False Testes 1233 0
#> 3 True Muscle 55 0
#> 4 True Intestine 233 0
#> 5 False Heart 18 0
There was no previous column 'new' in the DataFrame, but pandas knows that we want to create a new column and populate it with the value 0.
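Broadcasting isn’t limited to a constant: an element-wise expression on an existing column is broadcast the same way. A small sketch, using a shortened version of foo_df:

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False, False],
                       'tissue': ['Liver', 'Brain', 'Testes'],
                       'quantity': [13, 88, 1233]})

# A scalar is broadcast to every row...
foo_df['new'] = 0

# ...and so is an element-wise calculation on an existing column
foo_df['double'] = foo_df['quantity'] * 2
print(foo_df['double'].tolist())  # [26, 176, 2466]
```

No loop is needed; the multiplication is applied to every element of the quantity column at once.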
We can even use this when creating a completely new DataFrame.
results = pd.DataFrame({'ID': 'A', 'value': range(10)})

results
#> ID value
#> 0 A 0
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 A 4
#> 5 A 5
#> 6 A 6
#> 7 A 7
#> 8 A 8
#> 9 A 9
This is great, since we don’t need any of the complicated looping that lists and dictionaries require (we’ll get to looping later on).
We can also drop columns by name using the drop() method. Remember, axis 1 refers to columns.
foo_df = foo_df.drop('new', axis = 1)

foo_df
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
5.5 Attributes, part II
In the last chapter, we saw functions and methods and a brief introduction to attributes. Let’s go a bit deeper into attributes. Everything in Python is an object, and almost all of these objects have attributes associated with them. For example, shape is an attribute of our data frame. It tells us how many rows and columns we have. We don’t need to use () when accessing attributes:

foo_df.shape
#> (6, 3)
Another attribute is dtypes:

foo_df.dtypes
#> healthy bool
#> tissue object
#> quantity int64
#> dtype: object
Attributes are not functions, which are called on objects, nor are they methods, which are functions called by an object. So this doesn’t work:
foo_df.shape()
Although .info()
is a method, it’s not exactly what we want to see.
foo_df.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 6 entries, 0 to 5
#> Data columns (total 3 columns):
#> # Column Non-Null Count Dtype
#> --- ------ -------------- -----
#> 0 healthy 6 non-null bool
#> 1 tissue 6 non-null object
#> 2 quantity 6 non-null int64
#> dtypes: bool(1), int64(1), object(1)
#> memory usage: 230.0+ bytes
You can change the row and column names by just assigning a list to the .columns and .index attributes. Note that you have to explicitly make a copy of the DataFrame, or else you’ll have a link to the original.
Exercise 5.5 Make a copy of foo_df, called foo_df_2. Access the .columns and .index attributes and change them to ['A', 'B', 'C'] and ['H', 'I', 'J', 'K', 'L', 'M'], respectively.
# Change col and row names
foo_df_2 = foo_df.copy()
foo_df_2.columns = ['A', 'B', 'C']
foo_df_2.index = ['H', 'I', 'J', 'K', 'L', 'M']
print(foo_df_2) # The copy
#> A B C
#> H True Liver 13
#> I False Brain 88
#> J False Testes 1233
#> K True Muscle 55
#> L True Intestine 233
#> M False Heart 18
print(foo_df) # The original
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
#> 2 False Testes 1233
#> 3 True Muscle 55
#> 4 True Intestine 233
#> 5 False Heart 18
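To see why the explicit .copy() matters, here is a short sketch contrasting plain assignment, which is just a link to the same object, with a true copy. The names alias, original, and independent are only illustrative:

```python
import pandas as pd

original = pd.DataFrame({'healthy': [True, False],
                         'tissue': ['Liver', 'Brain'],
                         'quantity': [13, 88]})

# Plain assignment does NOT copy: both names point to the same object
alias = original
alias.columns = ['A', 'B', 'C']
print(original.columns.tolist())  # the "original" changed too!

# .copy() gives an independent DataFrame
original.columns = ['healthy', 'tissue', 'quantity']  # restore the names
independent = original.copy()
independent.columns = ['X', 'Y', 'Z']
print(original.columns.tolist())  # unchanged this time
```

Renaming columns through the alias changed the original as well; only the copy can be modified safely.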
5.6 Change data types
Remember that a common problem you’ll encounter is the wrong data type. We saw that for lists we can just use the appropriate function, e.g. str(), to convert a type. For DataFrames, we can change types using the .astype() method called on a column:

df['A'] = df['A'].astype(str)
df.info()
Type object means string. Access built-in string methods with the str accessor, described in the next section below.
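The snippet above assumes an existing df; as a self-contained sketch of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
print(df['A'].dtype)  # int64

# Convert the column to strings with .astype()
df['A'] = df['A'].astype(str)
print(df['A'].dtype)  # object, i.e. strings
```

After the conversion, the numbers are stored as the strings '1', '2', '3', and the str accessor methods become available on the column.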
5.7 Accessors
These are some subtler points and we won’t cover them in the workshop. They’re here for your reference.
The String Accessor, e.g. strip removes leading and trailing whitespace
df = pd.DataFrame({'name': ['Daniel ', ' Eric', ' Julia ']})
df['name_strip'] = df['name'].str.strip()
df
Categorical Accessor, for defining categorical variables
df = pd.DataFrame({'name': ['Daniel', 'Eric', 'Julia'],
                   'gender': ['Male', 'Male', 'Female']})
df['gender_cat'] = df['gender'].astype('category')
See the categories by calling the cat accessor and the categories attribute on the column. Use the codes attribute to get the codes.

df['gender_cat'].cat.categories
df.gender_cat.cat.codes
Date Accessor, use the to_datetime() function

df = pd.DataFrame({'name': ['Rosaline Franklin', 'William Gosset'],
                   'born': ['1920-07-25', '1876-06-13']})
df['born_dt'] = pd.to_datetime(df['born'])
Similar to strings and categorical values, you can access date components with the dt accessor:

df['born_dt'].dt.day
df['born_dt'].dt.month
df['born_dt'].dt.year
5.8 Exercise
Exercise 5.6 To get familiar with basic functions and methods of Python data frames, try to solve the following problems using the classic mtcars data set. You can find this data set in your data directory.
The data has the following structure:
Variable | Description
---------|------------
mpg      | Miles/(US) gallon
cyl      | Number of cylinders
disp     | Displacement (cu.in.)
hp       | Gross horsepower
drat     | Rear axle ratio
wt       | Weight (1000 lbs)
qsec     | 1/4 mile time
vs       | Engine (0 = V-shaped, 1 = straight)
am       | Transmission (0 = automatic, 1 = manual)
gear     | Number of forward gears
Complete the following tasks:
- Import the data set and assign it to the variable mtcars.
- Calculate the correlation between mpg and wt and test if it is significant.
- Visualize the relationship in an XY scatter plot.
- Convert weight from pounds to kg.
5.9 Further Pandas
5.9.1 Missing data
- NaN missing values come from NumPy
- For reference, np.NaN, np.NAN, and np.nan are all the same as R’s NA value
- Check missing with pd.isnull
- Check non-missing with pd.notnull
- pd.isnull is an alias for pd.isna
df = pd.DataFrame({
    'name': ["John Smith", "Jane Doe", "Mary Johnson"],
    'treatment_a': [None, 16, 3],
    'treatment_b': [2, 11, 1]})

df
#> name treatment_a treatment_b
#> 0 John Smith NaN 2
#> 1 Jane Doe 16.0 11
#> 2 Mary Johnson 3.0 1
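Before filling anything in, it’s often useful to check which values are missing with pd.isnull (or the equivalent .isnull() method):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["John Smith", "Jane Doe", "Mary Johnson"],
    'treatment_a': [None, 16, 3],
    'treatment_b': [2, 11, 1]})

# Element-wise check: True where a value is missing
print(pd.isnull(df['treatment_a']))

# The same check as a method, plus a count of missing values
print(df['treatment_a'].isnull().sum())  # 1
```

Summing a boolean Series counts the True values, which gives a quick tally of the missing entries per column.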
How do we replace the NaN with the mean of the column?

a_mean = df['treatment_a'].mean()
a_mean
Use the fillna() method:

df['a_fill'] = df['treatment_a'].fillna(a_mean)
df
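If several columns contain missing values, fillna() also accepts a Series of per-column fill values, so each numeric column can be filled with its own mean in one call. A sketch, assuming the same df as above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["John Smith", "Jane Doe", "Mary Johnson"],
    'treatment_a': [None, 16, 3],
    'treatment_b': [2, 11, 1]})

# Fill every numeric column with that column's own mean in one call
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled['treatment_a'].tolist())  # [9.5, 16.0, 3.0]
```

The fill values are matched to columns by name, so treatment_a gets its own mean (9.5) and treatment_b is untouched since it has no missing values.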
5.9.2 Applying custom functions
NumPy arrays (i.e. matrices) are beautiful since we know that every value has the same type. We can use some built-in and custom functions to take advantage of this.
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})
df.apply(np.mean, axis=0) # Column-wise
df.apply(np.mean, axis=1) # Row-wise
Here’s an example of applying custom functions to DataFrames:
df = pd.DataFrame({'A': [1000, 10000, 100000]})

# Create array of DataFrame values: np_vals
np_vals = df.values

# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

# Create new DataFrame by passing df to np.log10()
df_log10 = np.log10(df)

# Print original and new data containers
[print(x, 'has type', type(eval(x))) for x in ['np_vals', 'np_vals_log10', 'df', 'df_log10']]
[#> np_vals has type <class 'numpy.ndarray'>
#> np_vals_log10 has type <class 'numpy.ndarray'>
#> df has type <class 'pandas.core.frame.DataFrame'>
#> df_log10 has type <class 'pandas.core.frame.DataFrame'>
#> [None, None, None, None]
That last example is completely new and looks pretty scary at first glance. It’s called a list comprehension and we’ll see it in more detail later on in the workshop. This should give you an idea of how all the different components come together in clear and simple instructions.