8 Case-study: Diamonds
Now that we’re familiar with some core concepts in Python, including pd.DataFrame, let’s work on a larger data set. We’ll use the diamonds data set, which we already got familiar with in R. It’s available in the /00-data/diamonds.csv file.
- Exercise 8.1 (Import and Examine)
  - Begin a new Jupyter Notebook in VS Code or another platform as discussed in class.
  - Import the /00-data/diamonds.csv file as a pd.DataFrame.
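As a starting point for the import step, here is a minimal sketch using pd.read_csv. To keep it runnable anywhere, it reads a tiny illustrative sample from a string instead of the real file. The column names are taken from the variable table in this chapter, but the real file's column order, and whether it carries an extra index column, are assumptions you should check yourself.

```python
import io

import pandas as pd

# A tiny illustrative sample laid out like diamonds.csv.
# Assumption: the real file's column order may differ.
csv_text = """carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
"""

# pd.read_csv accepts a file path or any file-like object.
# With the course data you would pass the path to diamonds.csv instead.
diamonds = pd.read_csv(io.StringIO(csv_text))

print(diamonds.shape)   # (number of rows, number of columns)
print(diamonds.head())  # the first few rows
```

If the real file turns out to contain an unnamed index column, the index_col parameter of pd.read_csv is one way to handle it.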
The data set has the following variables:
Variable | Description |
---|---|
price | Price in US dollars ($326 – $18,823) |
carat | Weight of the diamond (0.2 – 5.01) |
cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | Diamond colour, from D (best) to J (worst) |
clarity | A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
x | Length in mm (0 – 10.74) |
y | Width in mm (0 – 58.9) |
z | Depth in mm (0 – 31.8) |
depth | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
table | Width of top of diamond relative to widest point (43–95) |
Exercise 8.2 (Examine structure) Examine the structure of the data set. What type of data is contained in each column?
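A few pandas tools are commonly used for this kind of inspection. The sketch below applies them to a small made-up data frame (the values are illustrative, not taken from the real file). The same calls should work on the full diamonds data frame.

```python
import pandas as pd

# Small made-up stand-in for the diamonds data frame (illustrative values).
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21, 0.31],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 335],
})

diamonds.info()             # column names, non-null counts and dtypes
print(diamonds.dtypes)      # one dtype per column
print(diamonds.describe())  # summary statistics for the numeric columns
```

Numeric columns typically show up as integer or float dtypes, while text columns such as cut are stored as strings (reported as object in many pandas versions).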
Let’s begin exploring our data by looking at some basic plots and doing some transformations.
8.1 Exercises for Plotting, transforming and EDA
After you have familiarized yourself with the diamonds data set, proceed with the following exercises.
- Exercise 8.3 (Counting individual groups)
  - How many diamonds with a clarity of category “IF” are present in the data set?
  - What fraction of the total do they represent?
- Exercise 8.4 (Summarizing proportions)
  - What proportion of the whole is made up of each category of clarity?
- Exercise 8.5 (Find specific diamond prices)
  - What is the cheapest diamond price overall?
  - What is the range of diamond prices?
  - What is the average diamond price in each category of cut and color?
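The three exercises above rely on a handful of pandas idioms: boolean masks, value_counts, and groupby. Here is one possible sketch of those idioms on a small made-up data frame. The values are invented for illustration, so the numbers will not match the real data set.

```python
import pandas as pd

# Made-up stand-in for the diamonds data frame (illustrative values only).
diamonds = pd.DataFrame({
    "clarity": ["IF", "SI1", "SI1", "VS2", "IF", "SI2", "VS2", "SI1"],
    "cut": ["Ideal", "Ideal", "Good", "Good", "Ideal", "Fair", "Ideal", "Good"],
    "color": ["D", "E", "E", "D", "D", "J", "E", "E"],
    "price": [1000, 400, 500, 700, 1200, 300, 800, 600],
})

# Counting one group: a boolean mask sums to the number of True values.
n_if = (diamonds["clarity"] == "IF").sum()
frac_if = n_if / len(diamonds)

# Proportions of every category at once.
clarity_props = diamonds["clarity"].value_counts(normalize=True)

# Cheapest price and the price range as a (min, max) pair.
cheapest = diamonds["price"].min()
price_range = (diamonds["price"].min(), diamonds["price"].max())

# Average price for each combination of cut and color.
mean_by_group = diamonds.groupby(["cut", "color"])["price"].mean()

print(n_if, frac_if)
print(clarity_props)
print(cheapest, price_range)
print(mean_by_group)
```

The groupby result is indexed by (cut, color) pairs. Calling reset_index() on it turns it back into an ordinary data frame if that is easier to read.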
Exercise 8.6 (Basic plotting) Make a scatter plot that shows the diamond price described by carat.
Your plot should look like this:
Exercise 8.7 (Applying transformations) Apply a log10 transformation to both the price and carat and store these as new columns in the DataFrame: price_log10 and carat_log10.
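A minimal sketch of this kind of transformation, assuming NumPy is available and using a made-up three-row data frame: np.log10 works element-wise on a column, and assigning to a new column name adds it to the data frame.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the diamonds data frame (illustrative values only).
diamonds = pd.DataFrame({
    "carat": [0.1, 1.0, 2.0],
    "price": [100, 1000, 10000],
})

# np.log10 is applied element-wise; the results are stored as new columns.
diamonds["price_log10"] = np.log10(diamonds["price"])
diamonds["carat_log10"] = np.log10(diamonds["carat"])

print(diamonds)
```

According to the variable table, price and carat are both strictly positive, so the logarithm is defined for every row.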
Exercise 8.8 (Basic plotting) Redraw the scatterplot using the transformed values.
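One way to draw both scatter plots is with matplotlib, sketched below on made-up data. Using matplotlib here is an assumption: the plotting library used in class may differ, and the values are illustrative. The left panel uses the raw values and the right panel uses log10-transformed values computed on the fly.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the diamonds data frame (illustrative values only).
diamonds = pd.DataFrame({
    "carat": [0.25, 0.5, 1.0, 1.5, 2.0],
    "price": [400, 1500, 5000, 9000, 15000],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: price described by carat, raw values.
axes[0].scatter(diamonds["carat"], diamonds["price"], alpha=0.5)
axes[0].set_xlabel("carat")
axes[0].set_ylabel("price")

# Right panel: the same relationship after a log10 transformation.
axes[1].scatter(np.log10(diamonds["carat"]), np.log10(diamonds["price"]), alpha=0.5)
axes[1].set_xlabel("carat_log10")
axes[1].set_ylabel("price_log10")

fig.tight_layout()
```

In a Jupyter Notebook the figure is normally displayed when the cell finishes. In a plain script you would call plt.show() or fig.savefig(...).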
Exercise 8.9 (Viewing models) Define a linear model that describes the relationship shown in the plot.
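There are several ways to fit a straight line in Python. The sketch below uses np.polyfit with degree 1, which performs an ordinary least-squares fit. This is a stand-in chosen for brevity, not necessarily the method intended for the course (statsmodels or scipy are common alternatives). The data are synthetic and lie exactly on a line, so the recovered coefficients are easy to check.

```python
import numpy as np
import pandas as pd

# Synthetic data lying exactly on the line y = 3.5 + 1.7 * x (illustrative only).
carat_log10 = np.array([-0.6, -0.3, 0.0, 0.3, 0.6])
diamonds = pd.DataFrame({
    "carat_log10": carat_log10,
    "price_log10": 3.5 + 1.7 * carat_log10,
})

# A degree-1 polynomial fit returns the coefficients as (slope, intercept).
slope, intercept = np.polyfit(diamonds["carat_log10"], diamonds["price_log10"], deg=1)

# Fitted values from the model, stored alongside the data.
diamonds["price_log10_fit"] = intercept + slope * diamonds["carat_log10"]

print(f"price_log10 = {intercept:.2f} + {slope:.2f} * carat_log10")
```

With real data the points will scatter around the line, and the slope describes how price_log10 changes per unit of carat_log10.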
8.1.1 Saving a data frame to a file
Exercise 8.10 (Export data) Refer to the following table and save your data with transformed values on your computer.
To save a data frame outside of the Python environment, write it to a file using the pandas DataFrame method to_csv(file).
Function | Description |
---|---|
df.to_csv(file) | Save the data frame df as file. |
df.to_csv(file, sep='\t') | To delimit by a tab. |
df.to_csv(file, sep='\t', encoding='utf-8') | Uses a specific encoding. |
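The calls in the table can be tried out as follows. This is a sketch on a made-up data frame that writes into a temporary directory. The file names are hypothetical, and index=False is an optional extra (not listed in the table) that leaves the row index out of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the transformed diamonds data frame (illustrative values).
diamonds = pd.DataFrame({
    "carat": [0.5, 1.0],
    "price": [1500, 5000],
    "cut": ["Ideal", "Good"],
})

# Hypothetical output locations inside a temporary directory.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "diamonds_transformed.csv"
tsv_path = out_dir / "diamonds_transformed.tsv"

# Comma-separated output; index=False omits the row index column.
diamonds.to_csv(csv_path, index=False)

# Tab-separated output with an explicit encoding.
diamonds.to_csv(tsv_path, sep="\t", index=False, encoding="utf-8")

# Reading the file back is a quick check that the export worked.
restored = pd.read_csv(tsv_path, sep="\t")
print(restored)
```

For the exercise itself, you would replace the temporary directory with a folder of your choice on your computer.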