Chapter 7 Case-study: Diamonds
In this chapter we’ll revisit a data set we saw in the R section of the course.
Now that we’re familiar with some core concepts in Python and working with DataFrames, let’s work on a larger data set!
The diamonds.csv file contains data on over 50,000 diamonds. The data set actually comes from the ggplot2 package.
Exercise 7.1 (Import and Examine) If you choose to use the Google Colab platform, you can find the commands to upload a file in the plant growth case study script.
If you are using VS Code locally, please set up a new directory and a new virtual environment for this case study, as described in class.
Import the comma-separated values file called diamonds.csv, found inside the data folder, and save it as an object called jems.
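A minimal sketch of the import step, assuming pandas is installed. So that it runs without the real file, it reads a few made-up rows from an in-memory string; the column names and their order are assumptions based on the variable table in this chapter. With the real file you would pass the path data/diamonds.csv instead.

```python
import io

import pandas as pd

# Stand-in for data/diamonds.csv: three made-up rows, invented for illustration.
# The column order is an assumption; check it against your copy of the file.
csv_text = """carat,cut,color,clarity,depth,table,price,x,y,z
0.30,Ideal,E,SI1,61.0,56,500,4.30,4.32,2.63
0.50,Good,J,VS2,63.0,58,1200,5.05,5.08,3.19
1.00,Premium,D,IF,60.0,59,9000,6.45,6.40,3.86
"""

jems = pd.read_csv(io.StringIO(csv_text))
# With the real file: jems = pd.read_csv("data/diamonds.csv")

# First look at the data: column types, then the first rows.
jems.info()
print(jems.head())
```

jems.info() reports the type of each column and jems.head() shows the first rows, which is usually enough to confirm that the import worked.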
The data set has the following variables:
Variable | Description |
---|---|
price | Price in US dollars ($326 – $18,823) |
carat | Weight of the diamond (0.2 – 5.01) |
cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | Diamond colour, from D (best) to J (worst) |
clarity | A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
x | Length in mm (0 – 10.74) |
y | Width in mm (0 – 58.9) |
z | Depth in mm (0 – 31.8) |
depth | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
table | Width of top of diamond relative to widest point (43–95) |
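The depth formula in the table can be tried out on a couple of rows. This is a sketch under two assumptions: the measurements below are made up, and the factor of 100 is added here because the stored values (43–79) look like percentages rather than fractions.

```python
import pandas as pd

# Two made-up diamonds (x, y, z in mm), invented for illustration.
toy = pd.DataFrame({"x": [4.0, 6.0], "y": [4.0, 6.2], "z": [2.4, 3.8]})

# depth = z / mean(x, y) = 2 * z / (x + y), scaled to a percentage (assumption).
toy["depth_calc"] = 100 * 2 * toy["z"] / (toy["x"] + toy["y"])

print(toy)
```

On the real data the recomputed values may differ slightly from the depth column, since the recorded measurements are rounded.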
Let’s begin exploring our data by looking at some basic plots and doing some transformations.
7.1 Exercises for Plotting, transforming and EDA
After you have familiarized yourself with the jems data set, proceed with the following exercises.
Exercise 7.4 (Summarizing proportions) What proportion of the whole is made up of each category of clarity?
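One possible approach to proportions, sketched on a tiny made-up clarity column: value_counts with normalize=True returns the share of each category instead of the raw count.

```python
import pandas as pd

# Made-up clarity values, invented for illustration.
jems = pd.DataFrame({"clarity": ["SI1", "SI1", "VS2", "IF"]})

# normalize=True divides each count by the total number of rows.
clarity_prop = jems["clarity"].value_counts(normalize=True)

print(clarity_prop)
```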
Exercise 7.5 (Find specific diamond prices)
- What is the cheapest diamond price overall?
- What is the range of diamond prices?
- What is the average diamond price in each category of cut and color?
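These three questions map onto min, max and a grouped mean. The sketch below uses four made-up rows; the variable names cheapest, price_range and mean_by_group are illustrative, not part of the course material.

```python
import pandas as pd

# Four made-up diamonds, invented for illustration.
jems = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Good"],
    "color": ["E", "E", "J", "E"],
    "price": [300, 500, 400, 1000],
})

# Cheapest price overall.
cheapest = jems["price"].min()

# Range, taken here as the difference between the largest and smallest price.
price_range = jems["price"].max() - jems["price"].min()

# Mean price for every combination of cut and color.
mean_by_group = jems.groupby(["cut", "color"])["price"].mean()

print(cheapest, price_range)
print(mean_by_group)
```

The grouped result is a Series with a two-level index, so a single combination can be looked up with mean_by_group.loc[("Ideal", "E")].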
Exercise 7.6 (Basic plotting) Make a scatter plot that shows the diamond price described by carat.
Your plot should look like this:
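A sketch of one way to draw such a scatter plot, assuming matplotlib is available. The course may use a different plotting library, and the five rows here are made up, so the result will not match the figure for the full data set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Five made-up diamonds, invented for illustration.
jems = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "price": [500, 1200, 2500, 5000, 9000],
})

fig, ax = plt.subplots()

# Price (y) described by carat (x); alpha helps when many points overlap.
ax.scatter(jems["carat"], jems["price"], alpha=0.5)
ax.set_xlabel("carat")
ax.set_ylabel("price (US dollars)")

# In a script, call plt.show() to display the figure; notebooks render it automatically.
```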
Exercise 7.7 (Applying transformations)
Apply a log10 transformation to both the price and carat and store these as new columns in the DataFrame: price_log10 and carat_log10.
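A minimal sketch of the transformation, assuming NumPy is available; np.log10 works element-wise on a pandas column. The values are made up and chosen so the logarithms are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Made-up values, invented for illustration (powers of ten give round logs).
jems = pd.DataFrame({
    "carat": [0.1, 1.0, 10.0],
    "price": [100, 1000, 10000],
})

# Store the transformed values as new columns next to the originals.
jems["price_log10"] = np.log10(jems["price"])
jems["carat_log10"] = np.log10(jems["carat"])

print(jems)
```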
Exercise 7.8 (Basic plotting) Redraw the scatterplot using the transformed values.
Exercise 7.9 (Viewing models) Define a linear model that describes the relationship shown in the plot.
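One simple way to fit a straight line is np.polyfit with degree 1. This is a sketch, not necessarily the method used in the course, which may rely on a statistics package instead. The log-transformed values below are made up and lie exactly on a line, so the fitted coefficients are easy to verify.

```python
import numpy as np
import pandas as pd

# Made-up log10 values lying exactly on price_log10 = 3.0 + 1.5 * carat_log10.
jems = pd.DataFrame({
    "carat_log10": [-0.5, 0.0, 0.5],
    "price_log10": [2.25, 3.0, 3.75],
})

# Degree-1 least-squares fit; polyfit returns the highest power first,
# so the result is (slope, intercept).
slope, intercept = np.polyfit(jems["carat_log10"], jems["price_log10"], 1)

print(f"price_log10 = {intercept:.2f} + {slope:.2f} * carat_log10")
```

On the log-log scale the slope can be read as an exponent: a slope of 1.5 means price grows roughly as carat to the power 1.5.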
7.1.1 Saving a data frame to a file
Exercise 7.10 (Export data) Refer to the following table and save your data with transformed values on your computer.
To save a Python object outside of the environment you’ll write it to a file using the pandas function to_csv(file).
Function | Description |
---|---|
df.to_csv(file) | Save the data frame df as file. |
df.to_csv(file, sep='\t') | Delimit fields with a tab. |
df.to_csv(file, sep='\t', encoding='utf-8') | Use a specific encoding. |
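A sketch of the export step, using a temporary directory so it runs anywhere. The file name diamonds_log10.csv is hypothetical, the rows are made up, and index=False is an addition not shown in the table: it keeps the row index out of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows, invented for illustration.
jems = pd.DataFrame({
    "carat": [0.5, 1.0],
    "price": [1200, 5000],
    "cut": ["Good", "Ideal"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; on your machine, use a path in your project folder.
    out_path = Path(tmp_dir) / "diamonds_log10.csv"

    # index=False leaves the row index out of the file (not in the table above).
    jems.to_csv(out_path, index=False)

    # Read the file back to confirm that it was written as expected.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

For a tab-delimited file, pass sep='\t' to both to_csv and read_csv.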
7.2 Wrap-up
In this case study we worked with a familiar data set, applying all the core lessons we’ve covered so far.