Chapter 2 Getting Started: Plant Growth Case Study

Our learning objectives in this session…

Import data
Performing descriptive and inferential statistics
Plotting functions

2.1 Set up your environment

Using the environment you created in the last chapter, continue you script by following along with the following commands. First, import the packages used in this case study. Remember, you’ll need to install them using the terminal first.

# import an entire library
import math # Functions beyond the basic maths

# Import an entire library and give it an alias
import pandas as pd # For DataFrame and handling
import numpy as np # Array and numerical processing
import matplotlib.pyplot as plt # Low level plotting
import seaborn as sns # High level Plotting
import statsmodels.api as sm # Modeling, e.g. ANOVA

# Import only specific modules from a library
# we'll use this for the t-test function
from scipy import stats

# Import only specific functions from a library 
# ols is for ordinary least squares
from statsmodels.formula.api import ols

2.2 load dataset

Read in the data set using:

plant_growth = pd.read_csv('data/plant_growth.csv')

2.3 Examine the data

It’s always good practice to look at our data before we start working on it.

# The object
plant_growth.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 30 entries, 0 to 29
## Data columns (total 2 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   weight  30 non-null     float64
##  1   group   30 non-null     object 
## dtypes: float64(1), object(1)
## memory usage: 608.0+ bytes

plant_growth.shape

## (30, 2)

plant_growth.columns

## Index(['weight', 'group'], dtype='object')

# Summaries
plant_growth.describe()

##           weight
## count  30.000000
## mean    5.073000
## std     0.701192
## min     3.590000
## 25%     4.550000
## 50%     5.155000
## 75%     5.530000
## max     6.310000

plant_growth['group'].value_counts()

## ctrl    10
## trt1    10
## trt2    10
## Name: group, dtype: int64

# explore first dataset rows
plant_growth.head()

##    weight group
## 0    4.17  ctrl
## 1    5.58  ctrl
## 2    5.18  ctrl
## 3    6.11  ctrl
## 4    4.50  ctrl

2.4 Descriptive statistics

Let’s take a look at some first steps in descriptive statistics.

# count group members
plant_growth['group'].value_counts()

# get average weight

## ctrl    10
## trt1    10
## trt2    10
## Name: group, dtype: int64

np.mean(plant_growth['weight'])

# summary statistics

## 5.0729999999999995

plant_growth.groupby(['group']).describe()

##       weight                                                    
##        count   mean       std   min     25%    50%     75%   max
## group                                                           
## ctrl    10.0  5.032  0.583091  4.17  4.5500  5.155  5.2925  6.11
## trt1    10.0  4.661  0.793676  3.59  4.2075  4.550  4.8700  6.03
## trt2    10.0  5.526  0.442573  4.92  5.2675  5.435  5.7350  6.31

There are always many ways to get to the same solution, but we’d tend to prefer the simplest.

df1 = plant_growth.groupby(['group']).mean()
df2 = plant_growth.groupby(['group']).std()
final_df = pd.concat([df1, df2], axis=1)

Can you tell what the difference is between the following two commands? Why is this important?

# Produces Pandas Series
plant_growth.groupby('group')['weight'].mean()

## group
## ctrl    5.032
## trt1    4.661
## trt2    5.526
## Name: weight, dtype: float64

# Produces Pandas DataFrame
plant_growth.groupby('group')[['weight']].mean()

##        weight
## group        
## ctrl    5.032
## trt1    4.661
## trt2    5.526

This is one easy and specific way:

plant_growth.groupby(['group']).agg({'weight':['mean','std']})

##       weight          
##         mean       std
## group                 
## ctrl   5.032  0.583091
## trt1   4.661  0.793676
## trt2   5.526  0.442573

2.5 Plotting

Alright, let’s take a look at some data visualization of weight described by group. Here, we have a box plot:

sns.boxplot(x='group', y='weight', data=plant_growth)

Just the points:

sns.catplot(x="group", y="weight", data=plant_growth)

## <seaborn.axisgrid.FacetGrid object at 0x7f97e81a3550>

and just the means with their standard deviations:

sns.pointplot(x="group", y="weight", data=plant_growth, join=False)
# alternatively:
# sns.catplot(x="group", y="weight", data=plant_growth, kind="point")

2.6 Inferential Statistics

We discussed a couple different tests in inferential statistics that we’d like to perform on our data. Let’s take a look.

2.6.1 T-test

There is a specific t-test function, which we’ll explore later on, but in this case, we have a specific set-up in that we have three groups and we’re interested in two specific two-group comparisons. We can accomplish this by establishing a linear model.

# fit a linear model
# specify model
model = ols("weight ~ group", plant_growth)

# fit model
results = model.fit()

We can get the coefficients of the model directly:

# extract coefficients
results.params.Intercept

## 5.032000000000002

results.params["group[T.trt1]"]

## -0.3709999999999988

results.params["group[T.trt2]"]

## 0.49400000000000066

Can you imagine what these number refer to?

Finally, let’s take a look at a summary of our model:

# Explore model results
results.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                 weight   R-squared:                       0.264
## Model:                            OLS   Adj. R-squared:                  0.210
## Method:                 Least Squares   F-statistic:                     4.846
## Date:                Wed, 28 Oct 2020   Prob (F-statistic):             0.0159
## Time:                        13:30:09   Log-Likelihood:                -26.810
## No. Observations:                  30   AIC:                             59.62
## Df Residuals:                      27   BIC:                             63.82
## Df Model:                           2                                         
## Covariance Type:            nonrobust                                         
## =================================================================================
##                     coef    std err          t      P>|t|      [0.025      0.975]
## ---------------------------------------------------------------------------------
## Intercept         5.0320      0.197     25.527      0.000       4.628       5.436
## group[T.trt1]    -0.3710      0.279     -1.331      0.194      -0.943       0.201
## group[T.trt2]     0.4940      0.279      1.772      0.088      -0.078       1.066
## ==============================================================================
## Omnibus:                        1.835   Durbin-Watson:                   2.704
## Prob(Omnibus):                  0.400   Jarque-Bera (JB):                1.406
## Skew:                           0.524   Prob(JB):                        0.495
## Kurtosis:                       2.835   Cond. No.                         3.73
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """

2.6.2 ANOVA

Alright, let’s wrap this up with our one-way ANOVA. Notice that we’re using our model, results, that we fitted above.

# ANOVA
# compute anova
aov_table = sm.stats.anova_lm(results, typ=2)

# explore anova results
aov_table

##             sum_sq    df         F   PR(>F)
## group      3.76634   2.0  4.846088  0.01591
## Residual  10.49209  27.0       NaN      NaN

##             sum_sq    df         F   PR(>F)
## group      3.76634   2.0  4.846088  0.01591
## Residual  10.49209  27.0       NaN      NaN

What conclusions can you draw from you analysis? Did using different treatment affect the growth of our plants?

2.6.3 Exercises

Exercise 2.1 In the data folder you’ll find another file called “Chick Weights.txt.”

Here you’ll find the results from an experiment in which newly hatched chicks were randomly allocated into six groups. Each group was given a different feed supplement. After six weeks, their weights in grams were measured.

Using the commands we’ve learned so far complete the following tasks:

Calculate the mean and standard deviation for each group.
Calculate the number of chicks in each group.
Calculate a within-group z-score.
Produce a strip chart showing each chick as an individual data point
Calculate a 1-way ANOVA.
Calculate Tukey’s post-hoc test (i.e. p-values for all pair-wise t-tests)

You’ll have to find the function to perform Tukey’s Honestly Significant differences (hsd) post-how test. The results will be as follows:

##      Multiple Comparison of Means - Tukey HSD, FWER=0.05      
## ==============================================================
##   group1    group2   meandiff p-adj    lower    upper   reject
## --------------------------------------------------------------
##    casein horsebean -163.3833  0.001 -232.3445 -94.4222   True
##    casein   linseed -104.8333  0.001 -170.5852 -39.0814   True
##    casein  meatmeal  -46.6742 0.3324 -113.9039  20.5554  False
##    casein   soybean  -77.1548 0.0084 -140.5149 -13.7947   True
##    casein sunflower    5.3333    0.9  -60.4186  71.0852  False
## horsebean   linseed     58.55 0.1412  -10.4112 127.5112  False
## horsebean  meatmeal  116.7091  0.001   46.3375 187.0806   True
## horsebean   soybean   86.2286 0.0042    19.544 152.9132   True
## horsebean sunflower  168.7167  0.001   99.7555 237.6778   True
##   linseed  meatmeal   58.1591 0.1274   -9.0705 125.3887  False
##   linseed   soybean   27.6786 0.7679  -35.6815  91.0387  False
##   linseed sunflower  110.1667  0.001   44.4148 175.9186   True
##  meatmeal   soybean  -30.4805 0.7133  -95.3729  34.4118  False
##  meatmeal sunflower   52.0076 0.2207  -15.2221 119.2372  False
##   soybean sunflower   82.4881 0.0039    19.128 145.8482   True
## --------------------------------------------------------------

2.7 Wrap-up

Congratulations! So far in our journey we’ve seen a complete walk-through of two case studies. We use fairly small data sets and saw the process from importing data all the way to making some meaningful conclusions. Now that we have an overview, we’re going to zoom in on the details of what we’ve been doing in the next chapter.