Chapter 2 Getting Started: Plant Growth Case Study
Our learning objectives in this session…
- Import data
- Performing descriptive and inferential statistics
- Plotting functions
2.1 Set up your environment
Using the environment you created in the last chapter, continue you script by following along with the following commands. First, import the packages used in this case study. Remember, you’ll need to install them using the terminal first.
# import an entire library
import math # Functions beyond the basic maths
# Import an entire library and give it an alias
import pandas as pd # For DataFrame and handling
import numpy as np # Array and numerical processing
import matplotlib.pyplot as plt # Low level plotting
import seaborn as sns # High level Plotting
import statsmodels.api as sm # Modeling, e.g. ANOVA
# Import only specific modules from a library
# we'll use this for the t-test function
from scipy import stats
# Import only specific functions from a library
# ols is for ordinary least squares
from statsmodels.formula.api import ols
2.2 load dataset
Read in the data set using:
= pd.read_csv('data/plant_growth.csv') plant_growth
2.3 Examine the data
It’s always good practice to look at our data before we start working on it.
# The object
plant_growth.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 30 entries, 0 to 29
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 weight 30 non-null float64
## 1 group 30 non-null object
## dtypes: float64(1), object(1)
## memory usage: 608.0+ bytes
plant_growth.shape
## (30, 2)
plant_growth.columns
## Index(['weight', 'group'], dtype='object')
# Summaries
plant_growth.describe()
## weight
## count 30.000000
## mean 5.073000
## std 0.701192
## min 3.590000
## 25% 4.550000
## 50% 5.155000
## 75% 5.530000
## max 6.310000
'group'].value_counts() plant_growth[
## ctrl 10
## trt1 10
## trt2 10
## Name: group, dtype: int64
# explore first dataset rows
plant_growth.head()
## weight group
## 0 4.17 ctrl
## 1 5.58 ctrl
## 2 5.18 ctrl
## 3 6.11 ctrl
## 4 4.50 ctrl
2.4 Descriptive statistics
Let’s take a look at some first steps in descriptive statistics.
# count group members
'group'].value_counts()
plant_growth[
# get average weight
## ctrl 10
## trt1 10
## trt2 10
## Name: group, dtype: int64
'weight'])
np.mean(plant_growth[
# summary statistics
## 5.0729999999999995
'group']).describe() plant_growth.groupby([
## weight
## count mean std min 25% 50% 75% max
## group
## ctrl 10.0 5.032 0.583091 4.17 4.5500 5.155 5.2925 6.11
## trt1 10.0 4.661 0.793676 3.59 4.2075 4.550 4.8700 6.03
## trt2 10.0 5.526 0.442573 4.92 5.2675 5.435 5.7350 6.31
There are always many ways to get to the same solution, but we’d tend to prefer the simplest.
= plant_growth.groupby(['group']).mean()
df1 = plant_growth.groupby(['group']).std()
df2 = pd.concat([df1, df2], axis=1) final_df
Can you tell what the difference is between the following two commands? Why is this important?
# Produces Pandas Series
'group')['weight'].mean() plant_growth.groupby(
## group
## ctrl 5.032
## trt1 4.661
## trt2 5.526
## Name: weight, dtype: float64
# Produces Pandas DataFrame
'group')[['weight']].mean() plant_growth.groupby(
## weight
## group
## ctrl 5.032
## trt1 4.661
## trt2 5.526
This is one easy and specific way:
'group']).agg({'weight':['mean','std']}) plant_growth.groupby([
## weight
## mean std
## group
## ctrl 5.032 0.583091
## trt1 4.661 0.793676
## trt2 5.526 0.442573
2.5 Plotting
Alright, let’s take a look at some data visualization of weight described by group. Here, we have a box plot:
='group', y='weight', data=plant_growth) sns.boxplot(x
Just the points:
="group", y="weight", data=plant_growth) sns.catplot(x
## <seaborn.axisgrid.FacetGrid object at 0x7f97e81a3550>
and just the means with their standard deviations:
="group", y="weight", data=plant_growth, join=False)
sns.pointplot(x# alternatively:
# sns.catplot(x="group", y="weight", data=plant_growth, kind="point")
2.6 Inferential Statistics
We discussed a couple different tests in inferential statistics that we’d like to perform on our data. Let’s take a look.
2.6.1 T-test
There is a specific t-test function, which we’ll explore later on, but in this case, we have a specific set-up in that we have three groups and we’re interested in two specific two-group comparisons. We can accomplish this by establishing a linear model.
# fit a linear model
# specify model
= ols("weight ~ group", plant_growth)
model
# fit model
= model.fit() results
We can get the coefficients of the model directly:
# extract coefficients
results.params.Intercept
## 5.032000000000002
"group[T.trt1]"] results.params[
## -0.3709999999999988
"group[T.trt2]"] results.params[
## 0.49400000000000066
Can you imagine what these number refer to?
Finally, let’s take a look at a summary of our model:
# Explore model results
results.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: weight R-squared: 0.264
## Model: OLS Adj. R-squared: 0.210
## Method: Least Squares F-statistic: 4.846
## Date: Wed, 28 Oct 2020 Prob (F-statistic): 0.0159
## Time: 13:30:09 Log-Likelihood: -26.810
## No. Observations: 30 AIC: 59.62
## Df Residuals: 27 BIC: 63.82
## Df Model: 2
## Covariance Type: nonrobust
## =================================================================================
## coef std err t P>|t| [0.025 0.975]
## ---------------------------------------------------------------------------------
## Intercept 5.0320 0.197 25.527 0.000 4.628 5.436
## group[T.trt1] -0.3710 0.279 -1.331 0.194 -0.943 0.201
## group[T.trt2] 0.4940 0.279 1.772 0.088 -0.078 1.066
## ==============================================================================
## Omnibus: 1.835 Durbin-Watson: 2.704
## Prob(Omnibus): 0.400 Jarque-Bera (JB): 1.406
## Skew: 0.524 Prob(JB): 0.495
## Kurtosis: 2.835 Cond. No. 3.73
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """
2.6.2 ANOVA
Alright, let’s wrap this up with our one-way ANOVA. Notice that we’re using our model, results
, that we fitted above.
# ANOVA
# compute anova
= sm.stats.anova_lm(results, typ=2)
aov_table
# explore anova results
aov_table
## sum_sq df F PR(>F)
## group 3.76634 2.0 4.846088 0.01591
## Residual 10.49209 27.0 NaN NaN
## sum_sq df F PR(>F)
## group 3.76634 2.0 4.846088 0.01591
## Residual 10.49209 27.0 NaN NaN
What conclusions can you draw from you analysis? Did using different treatment affect the growth of our plants?
2.6.3 Exercises
Exercise 2.1 In the data folder you’ll find another file called “Chick Weights.txt.”
Here you’ll find the results from an experiment in which newly hatched chicks were randomly allocated into six groups. Each group was given a different feed supplement. After six weeks, their weights in grams were measured.
Using the commands we’ve learned so far complete the following tasks:
- Calculate the mean and standard deviation for each group.
- Calculate the number of chicks in each group.
- Calculate a within-group z-score.
- Produce a strip chart showing each chick as an individual data point
- Calculate a 1-way ANOVA.
- Calculate Tukey’s post-hoc test (i.e. p-values for all pair-wise t-tests)
You’ll have to find the function to perform Tukey’s Honestly Significant differences (hsd) post-how test. The results will be as follows:
## Multiple Comparison of Means - Tukey HSD, FWER=0.05
## ==============================================================
## group1 group2 meandiff p-adj lower upper reject
## --------------------------------------------------------------
## casein horsebean -163.3833 0.001 -232.3445 -94.4222 True
## casein linseed -104.8333 0.001 -170.5852 -39.0814 True
## casein meatmeal -46.6742 0.3324 -113.9039 20.5554 False
## casein soybean -77.1548 0.0084 -140.5149 -13.7947 True
## casein sunflower 5.3333 0.9 -60.4186 71.0852 False
## horsebean linseed 58.55 0.1412 -10.4112 127.5112 False
## horsebean meatmeal 116.7091 0.001 46.3375 187.0806 True
## horsebean soybean 86.2286 0.0042 19.544 152.9132 True
## horsebean sunflower 168.7167 0.001 99.7555 237.6778 True
## linseed meatmeal 58.1591 0.1274 -9.0705 125.3887 False
## linseed soybean 27.6786 0.7679 -35.6815 91.0387 False
## linseed sunflower 110.1667 0.001 44.4148 175.9186 True
## meatmeal soybean -30.4805 0.7133 -95.3729 34.4118 False
## meatmeal sunflower 52.0076 0.2207 -15.2221 119.2372 False
## soybean sunflower 82.4881 0.0039 19.128 145.8482 True
## --------------------------------------------------------------
2.7 Wrap-up
Congratulations! So far in our journey we’ve seen a complete walk-through of two case studies. We use fairly small data sets and saw the process from importing data all the way to making some meaningful conclusions. Now that we have an overview, we’re going to zoom in on the details of what we’ve been doing in the next chapter.