Homework 1.2: Palmer penguins and split-apply-combine (30 pts)
The Palmer penguins data set is a nice data set with which to practice various data science skills. For this exercise, we will use a subset of it, which you can download here: https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv. The data set consists of measurements of three different species of penguins acquired at the Palmer Station in Antarctica. The measurements were made between 2007 and 2009 by Kristen Gorman.
a) Take a look at the CSV file containing the data set. Is it in tidy format? Why or why not?
b) You can convert the CSV file to a “tall” format using the bebi103.utils.unpivot_csv() function. You can do that with the following function call, where path_to_penguins is a string containing the path to the penguins_subset.csv file.
bebi103.utils.unpivot_csv(
    path_to_penguins,
    "penguins_tall.csv",
    n_header_rows=2,
    header_names=["species", "quantity"],
    comment_prefix="#",
    retain_row_index=True,
    row_index_name='penguin_id',
)
After running that function, load the data set stored in the penguins_tall.csv file and store it in a variable named df_tall. Is this a tidy data set?
c) Perform the following operations on the data frame you loaded to generate a new DataFrame. You do not need to worry about what these operations do (that is next week's topic; just do them to answer this question):
df = (
    df_tall
    .pivot(
        index=['penguin_id', 'species'], on='quantity', values='value'
    )
    .select(pl.exclude('penguin_id'))
)
Is the resulting data frame df tidy? Why or why not?
d) Using the data frame you created in part (c), slice out all of the bill lengths for Gentoo penguins.
e) Make a new data frame containing the mean measured bill depth, bill length, body mass in kg, and flipper length for each species. You can use millimeters for all length measurements.
f) Make a scatter plot of bill length versus flipper length with the glyphs colored by species.