E2. To be completed after lesson 10

Data set download


[2]:
import os

import pandas as pd

# data_path is assumed to point to the directory holding the downloaded data set

Exercise 2.1

In this lesson exercise, we again work with a subset of the Palmer penguins data set. I load it and view the first few rows below.

[3]:
df = pd.read_csv(os.path.join(data_path, "penguins_subset.csv"), header=[0, 1])

df.head()
[3]:
Gentoo Adelie Chinstrap
bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g
0 16.3 48.4 220.0 5400.0 18.5 36.8 193.0 3500.0 18.3 47.6 195.0 3850.0
1 15.8 46.3 215.0 5050.0 16.9 37.0 185.0 3000.0 16.7 42.5 187.0 3350.0
2 14.2 47.5 209.0 4600.0 19.5 42.0 200.0 4050.0 16.6 40.9 187.0 3200.0
3 15.7 48.7 208.0 5350.0 18.3 42.7 196.0 4075.0 20.0 52.8 205.0 4550.0
4 14.1 48.7 210.0 4450.0 18.0 35.7 202.0 3550.0 18.7 45.4 188.0 3525.0
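To see what `header=[0, 1]` does when loading, here is a minimal sketch on a small in-memory CSV. The CSV text is a hypothetical two-species excerpt that assumes the same two-row header layout as `penguins_subset.csv`, with values copied from the first rows shown above; it is not the real file.

```python
import io

import pandas as pd

# Hypothetical excerpt: the first header row holds the species, the second
# holds the measured quantity (assumed layout of penguins_subset.csv).
csv_text = """Gentoo,Gentoo,Adelie,Adelie
bill_depth_mm,body_mass_g,bill_depth_mm,body_mass_g
16.3,5400.0,18.5,3500.0
15.8,5050.0,16.9,3000.0
"""

# header=[0, 1] tells pandas to build the column labels from both header rows
toy = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# The columns form a two-level MultiIndex; each label is a (species, quantity) tuple
print(toy.columns.nlevels)
print(toy.columns.tolist())

# Indexing with the outer label returns the sub-frame for one species
gentoo = toy["Gentoo"]
print(gentoo)
```

A single column is addressed with the full tuple, for example `toy[("Adelie", "body_mass_g")]`.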

Explain in words what each of the following code cells does as we work toward tidying this data frame. For each cell, I show the top of the data frame.

[4]:
df.columns.names = ['species', 'quantity']

df.head()
[4]:
species Gentoo Adelie Chinstrap
quantity bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g
0 16.3 48.4 220.0 5400.0 18.5 36.8 193.0 3500.0 18.3 47.6 195.0 3850.0
1 15.8 46.3 215.0 5050.0 16.9 37.0 185.0 3000.0 16.7 42.5 187.0 3350.0
2 14.2 47.5 209.0 4600.0 19.5 42.0 200.0 4050.0 16.6 40.9 187.0 3200.0
3 15.7 48.7 208.0 5350.0 18.3 42.7 196.0 4075.0 20.0 52.8 205.0 4550.0
4 14.1 48.7 210.0 4450.0 18.0 35.7 202.0 3550.0 18.7 45.4 188.0 3525.0
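A small sketch of what naming the column levels makes possible. The toy frame below is an assumption built by hand from the first rows above; once the levels have names, they can be referred to by name instead of by position.

```python
import pandas as pd

# Hand-built toy frame with two-level columns (illustrative, not the real data set)
cols = pd.MultiIndex.from_product(
    [["Gentoo", "Adelie"], ["bill_depth_mm", "body_mass_g"]]
)
toy = pd.DataFrame(
    [[16.3, 5400.0, 18.5, 3500.0], [15.8, 5050.0, 16.9, 3000.0]],
    columns=cols,
)

# Before naming, both levels are anonymous
print(toy.columns.names)

# Give each level of the column MultiIndex a name
toy.columns.names = ["species", "quantity"]

# Named levels can now be looked up by name rather than by integer position
species_labels = toy.columns.get_level_values("species")
print(species_labels.unique())
```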
[5]:
df = df.stack(level='species')

df.head()
[5]:
quantity bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
species
0 Adelie 18.5 36.8 3500.0 193.0
Chinstrap 18.3 47.6 3850.0 195.0
Gentoo 16.3 48.4 5400.0 220.0
1 Adelie 16.9 37.0 3000.0 185.0
Chinstrap 16.7 42.5 3350.0 187.0
[6]:
df = df.reset_index(level='species')

df.head()
[ ]:
df = df.reset_index(drop=True)

df.head()
[ ]:
df.columns.name = None

df.head()
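One way to check your explanations is to inspect the row index and column names after each step. The sketch below does this on a hand-built toy frame (an assumption, not the real data set) and compares stacking each of the two column levels. It suggests why `reset_index(level=...)` only works for a level that currently lives in the row index.

```python
import pandas as pd

# Hand-built toy frame with named two-level columns (illustrative only)
cols = pd.MultiIndex.from_product(
    [["Gentoo", "Adelie"], ["bill_depth_mm", "body_mass_g"]],
    names=["species", "quantity"],
)
toy = pd.DataFrame(
    [[16.3, 5400.0, 18.5, 3500.0], [15.8, 5050.0, 16.9, 3000.0]],
    columns=cols,
)

# Stacking moves the chosen column level into the row index
by_species = toy.stack(level="species")
by_quantity = toy.stack(level="quantity")

print(by_species.index.names, by_species.columns.name)
print(by_quantity.index.names, by_quantity.columns.name)

# 'species' is in the row index of by_species, so it can become a column
as_column = by_species.reset_index(level="species")
print(as_column.columns.tolist())

# 'species' is NOT in the row index of by_quantity, so the same call fails
try:
    by_quantity.reset_index(level="species")
    failed = False
except KeyError:
    failed = True
print(failed)
```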

Exercise 2.2

What is the difference between merging and concatenating data frames?
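As a starting point for thinking about this question, here is a minimal sketch that applies both operations to small hand-made frames. The `penguin_id` column and all three frames are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical frames; penguin_id is an invented key column
measurements = pd.DataFrame(
    {"penguin_id": [1, 2, 3], "body_mass_g": [5400.0, 3500.0, 3850.0]}
)
species = pd.DataFrame(
    {"penguin_id": [1, 2, 3], "species": ["Gentoo", "Adelie", "Chinstrap"]}
)
more_measurements = pd.DataFrame({"penguin_id": [4], "body_mass_g": [5050.0]})

# Concatenation: frames with the same columns are placed one after the other
stacked = pd.concat([measurements, more_measurements], ignore_index=True)

# Merging: rows from two frames are matched on the values of a shared key
merged = pd.merge(measurements, species, on="penguin_id")

print(stacked)
print(merged)
```

Comparing the shapes of `stacked` and `merged` with those of the inputs is a useful way to frame your answer.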

Exercise 2.3

Describe the difference between categorical and quantitative variables. How are they fundamentally different in the way we plot them?
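To make the distinction concrete before plotting anything, the sketch below builds a small tidy frame from values shown in Exercise 2.1 and treats `species` as categorical and `body_mass_g` as quantitative. It only uses pandas and is meant as an illustration, not as a prescribed workflow.

```python
import pandas as pd

# Small tidy frame assembled by hand from the first rows of the data set
toy = pd.DataFrame(
    {
        "species": ["Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap"],
        "body_mass_g": [5400.0, 5050.0, 3500.0, 3000.0, 3850.0],
    }
)

# A categorical variable takes values from a finite set of labels
toy["species"] = toy["species"].astype("category")
print(toy["species"].cat.categories.tolist())

# Sensible operations on a categorical variable: counting and grouping
counts = toy["species"].value_counts()

# Sensible operations on a quantitative variable: arithmetic such as a mean,
# here computed separately within each category
mean_mass = toy.groupby("species", observed=True)["body_mass_g"].mean()

print(counts)
print(mean_mass)
```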

Exercise 2.4

Give pros and cons for using a histogram for display of repeated measurements. Then give pros and cons for using an ECDF.
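When weighing the two displays, it can help to compute both by hand. The sketch below uses NumPy on the five Gentoo body masses shown in Exercise 2.1. The `ecdf` helper is a hypothetical function written for this illustration, not part of any library used in the lessons.

```python
import numpy as np

# Gentoo body masses (g) from the first five rows shown in Exercise 2.1
body_mass_g = np.array([5400.0, 5050.0, 4600.0, 5350.0, 4450.0])


def ecdf(data):
    """Return sorted data and the fraction of points less than or equal to each value."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# The ECDF has no adjustable parameters and shows every measurement
x, y = ecdf(body_mass_g)
print(x, y)

# A histogram depends on the binning; two choices give two different summaries
counts_coarse, edges_coarse = np.histogram(body_mass_g, bins=2)
counts_fine, edges_fine = np.histogram(body_mass_g, bins=5)
print(counts_coarse, edges_coarse)
print(counts_fine, edges_fine)
```

Changing `bins` and rerunning shows how much of the histogram's appearance is a choice made by the plotter.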

Exercise 2.5

Write down any questions or points of confusion that you have.

Computing environment

[ ]:
%load_ext watermark
%watermark -v -p pandas,jupyterlab