Chapter 7 Case-study: Diamonds

In this chapter we’ll revisit a data set we saw in the R section of the course.

Now that we’re familiar with some core concepts in Python and working with DataFrames, let’s work on a larger data set!

The diamonds.csv file contains data on over 50,000 diamonds. The data set comes from the ggplot2 package.

Exercise 7.1 (Import and Examine) If you choose to use the Google Colab platform, you can find the commands to upload a file in the plant growth case study script.

If you are using VS Code locally, please set up a new directory and a new virtual environment for this case study, as described in class.

Import the comma-separated values file called diamonds.csv, found inside the data folder, and save it as an object called jems.
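One possible way to approach the import is sketched below. It assumes the file sits at data/diamonds.csv relative to your working directory; the helper name load_diamonds is hypothetical and only there to keep the example tidy.

```python
from pathlib import Path

import pandas as pd


def load_diamonds(csv_path):
    """Read a comma-separated file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(csv_path)


# Assumed location: the exercise says the file is inside a folder called "data".
csv_path = Path("data") / "diamonds.csv"

if csv_path.exists():
    jems = load_diamonds(csv_path)
    print(jems.shape)   # (number of rows, number of columns)
    print(jems.head())  # first five rows
else:
    print(f"{csv_path} not found; adjust the path to match your setup")
```

On Google Colab the path will differ depending on where the uploaded file ends up, so treat the path above as a placeholder.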

The data set has the following variables:

| Variable | Description |
|----------|-------------|
| price | Price in US dollars ($326 – $18,823) |
| carat | Weight of the diamond (0.2 – 5.01) |
| cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color | Diamond colour, from D (best) to J (worst) |
| clarity | A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| x | Length in mm (0 – 10.74) |
| y | Width in mm (0 – 58.9) |
| z | Depth in mm (0 – 31.8) |
| depth | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
| table | Width of top of diamond relative to widest point (43–95) |

Exercise 7.2 (Examine structure) Can you examine the structure of the DataFrame? What type of data is contained in each column?
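As a starting point, the sketch below shows a few common pandas calls for inspecting a DataFrame. It uses a tiny stand-in for jems with made-up values so that it runs on its own; with the real data you would call the same methods on your imported object.

```python
import pandas as pd

# Tiny stand-in for jems; the values are made up for illustration only.
jems = pd.DataFrame({
    "price": [326, 327, 334],
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Good"],
})

jems.info()                  # column names, non-null counts and dtypes
column_types = jems.dtypes   # one dtype per column, as a Series
print(column_types)
print(jems.describe())       # summary statistics for the numeric columns
```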

Let’s begin exploring our data by looking at some basic plots and doing some transformations.

7.1 Exercises for Plotting, transforming and EDA

After you have familiarized yourself with the jems data set, proceed with the following exercises.

Exercise 7.3 (Counting individual groups)

- How many diamonds with a clarity of category “IF” are present in the data set?
- What fraction of the total do they represent?
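One way to count a single category is with a boolean comparison, sketched here on a small made-up clarity column (the eight values are illustrative, not taken from diamonds.csv). Summing a boolean Series counts the True values, and its mean gives the fraction.

```python
import pandas as pd

# Made-up clarity values for illustration only.
jems = pd.DataFrame(
    {"clarity": ["IF", "SI1", "VS2", "IF", "SI2", "VS1", "SI1", "VVS2"]}
)

is_if = jems["clarity"] == "IF"   # boolean Series, True where clarity is "IF"
n_if = int(is_if.sum())           # each True counts as 1
frac_if = is_if.mean()            # mean of booleans = fraction that are True

print(n_if, frac_if)
```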

Exercise 7.4 (Summarizing proportions) What proportion of the whole is made up of each category of clarity?
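A possible approach is value_counts with normalize=True, which returns proportions instead of counts. The sketch below again uses a small made-up clarity column rather than the real data.

```python
import pandas as pd

# Made-up clarity values for illustration only.
jems = pd.DataFrame(
    {"clarity": ["IF", "SI1", "VS2", "IF", "SI2", "VS1", "SI1", "VVS2"]}
)

# normalize=True divides each count by the total number of rows.
clarity_props = jems["clarity"].value_counts(normalize=True)
print(clarity_props)
```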

Exercise 7.5 (Find specific diamonds prices)

- What is the cheapest diamond price overall?
- What is the range of diamond prices?
- What is the average diamond price in each category of cut and color?
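These questions can be approached with min, max and a grouped mean. The sketch below uses five made-up rows; it reports the range both as the (min, max) pair and as their difference, since either reading of "range" is reasonable.

```python
import pandas as pd

# Made-up rows for illustration only.
jems = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Premium", "Premium", "Good"],
    "color": ["E", "E", "E", "J", "J"],
    "price": [300, 500, 400, 1000, 700],
})

cheapest = jems["price"].min()
most_expensive = jems["price"].max()
price_span = most_expensive - cheapest   # range as a single number

# Mean price for every combination of cut and color that occurs in the data.
mean_by_group = jems.groupby(["cut", "color"])["price"].mean()

print(cheapest, most_expensive, price_span)
print(mean_by_group)
```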

Exercise 7.6 (Basic plotting) Make a scatter plot that shows the diamond price described by carat.

Your plot should look like this:

Exercise 7.7 (Applying transformations) Apply a log10 transformation to both the price and carat and store these as new columns in the DataFrame: price_log10 and carat_log10.
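A minimal sketch of the transformation, assuming NumPy is available alongside pandas. The three rows are made up and chosen as powers of ten so the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Made-up values, chosen as powers of ten for easy checking.
jems = pd.DataFrame({
    "price": [100, 1000, 10000],
    "carat": [0.1, 1.0, 10.0],
})

# np.log10 works element-wise on a Series and returns a Series.
jems["price_log10"] = np.log10(jems["price"])
jems["carat_log10"] = np.log10(jems["carat"])

print(jems)
```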

Exercise 7.8 (Basic plotting) Redraw the scatterplot using the transformed values.
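One way to draw the scatter plot is with matplotlib, which is an assumption here since the exercise does not name a plotting library. The sketch uses a few made-up rows and computes the transformed columns itself so that it runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up values for illustration only.
jems = pd.DataFrame({
    "price": [326, 2757, 5000, 18823],
    "carat": [0.2, 0.75, 1.0, 2.29],
})
jems["price_log10"] = np.log10(jems["price"])
jems["carat_log10"] = np.log10(jems["carat"])

fig, ax = plt.subplots()
ax.scatter(jems["carat_log10"], jems["price_log10"], alpha=0.5)
ax.set_xlabel("carat_log10")
ax.set_ylabel("price_log10")
# In a script, call plt.show() to open the window;
# notebooks usually display the figure automatically.
```

With over 50,000 points, a low alpha value helps reveal where the points overlap.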

Exercise 7.9 (Viewing models) Define a linear model that describes the relationship shown in the plot.
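The exercise does not say which modelling tool to use. As a stand-in, the sketch below fits a straight line with numpy.polyfit, which is a plain least-squares fit and not necessarily the approach used in class. The data are made up to follow price = 1000 * carat^2 exactly, so the fitted slope and intercept on the log10 scale should come out close to 2 and 3.

```python
import numpy as np
import pandas as pd

# Made-up data that follow price = 1000 * carat**2 exactly.
jems = pd.DataFrame({
    "carat": [0.5, 1.0, 2.0, 4.0],
    "price": [250, 1000, 4000, 16000],
})
jems["carat_log10"] = np.log10(jems["carat"])
jems["price_log10"] = np.log10(jems["price"])

# Degree-1 polynomial fit: returns [slope, intercept], highest power first.
slope, intercept = np.polyfit(jems["carat_log10"], jems["price_log10"], deg=1)

print(f"price_log10 = {intercept:.3f} + {slope:.3f} * carat_log10")
```

A dedicated statistics package would additionally report standard errors and goodness-of-fit measures, which polyfit does not provide by default.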

7.1.1 Saving a data frame to a file

Exercise 7.10 (Export data) Refer to the following table and save your data with transformed values on your computer.

To keep a DataFrame beyond the current Python session, write it to a file using the pandas DataFrame method to_csv(file).

| Function | Description |
|----------|-------------|
| df.to_csv(file) | Save the data frame df as file. |
| df.to_csv(file, sep='\t') | Delimit fields with a tab. |
| df.to_csv(file, sep='\t', encoding='utf-8') | Use a specific encoding. |
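The calls in the table can be tried out as sketched below. The example writes to a temporary directory so it leaves nothing behind; the file name jems_transformed.csv and the rounded values are made up, and index=False is an extra option (not listed in the table) that leaves the row index out of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up values for illustration only.
jems = pd.DataFrame({
    "price": [326, 327],
    "price_log10": [2.513, 2.515],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; pick your own location on your computer.
    out_path = Path(tmp_dir) / "jems_transformed.csv"
    jems.to_csv(out_path, index=False)   # index=False skips the row index
    reloaded = pd.read_csv(out_path)     # read it back to check the result

print(reloaded)
```

For your own data, replace the temporary directory with a path inside your project folder so the file persists.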

7.2 Wrap-up

In this chapter we worked through a case study with familiar data, applying the core lessons we’ve covered so far.