Making a data frame wide

Data set download


[2]:
import os

import polars as pl

import bebi103

import iqplot

import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting

bokeh.io.output_notebook()

# Path to the data directory (matches the ../data/ path used with head below)
data_path = "../data"
Loading BokehJS ...

We have seen how unpivoting a data frame can bring it to tidy format, but a tall format is often not the only tidy option nor the easiest to work with. As usual, this is best seen by example, and we will use a subset of the Palmer penguins data set, which you can download here: https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv, to demonstrate. The data set consists of measurements of three different species of penguins acquired at the Palmer Station in Antarctica. The measurements were made between 2007 and 2009 by Kristen Gorman.

As we work toward getting the data in a useful tidy format, we will learn some additional wrangling techniques.

Exploring the data set

First, let’s take a look at the data set stored in the CSV file.

[3]:
!head ../data/penguins_subset.csv
Gentoo,Gentoo,Gentoo,Gentoo,Adelie,Adelie,Adelie,Adelie,Chinstrap,Chinstrap,Chinstrap,Chinstrap
bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_g,bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_g,bill_depth_mm,bill_length_mm,flipper_length_mm,body_mass_g
16.3,48.4,220.0,5400.0,18.5,36.8,193.0,3500.0,18.3,47.6,195.0,3850.0
15.8,46.3,215.0,5050.0,16.9,37.0,185.0,3000.0,16.7,42.5,187.0,3350.0
14.2,47.5,209.0,4600.0,19.5,42.0,200.0,4050.0,16.6,40.9,187.0,3200.0
15.7,48.7,208.0,5350.0,18.3,42.7,196.0,4075.0,20.0,52.8,205.0,4550.0
14.1,48.7,210.0,4450.0,18.0,35.7,202.0,3550.0,18.7,45.4,188.0,3525.0
15.0,49.6,216.0,4750.0,19.1,39.8,184.0,4650.0,18.2,49.6,193.0,3775.0
15.7,49.3,217.0,5850.0,18.4,40.8,195.0,3900.0,17.5,48.5,191.0,3400.0
15.2,49.2,221.0,6300.0,18.4,36.6,184.0,3475.0,18.2,49.2,195.0,4400.0

We see that we have two header rows. The first gives the species and the second the quantity that is being measured. Apparently, then, each row of data has information for three different penguins, one from each species. This is not a tidy data set!
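To see the hierarchical header structure programmatically, we can read the first few rows with Python's csv module. This is a sketch using an inline stand-in for the first lines of the file (with fewer columns, so it runs without the file on disk):

```python
import csv
import io

# Inline stand-in for the first three lines of penguins_subset.csv
csv_text = (
    "Gentoo,Gentoo,Adelie,Adelie\n"
    "bill_depth_mm,bill_length_mm,bill_depth_mm,bill_length_mm\n"
    "16.3,48.4,18.5,36.8\n"
)

# The first two rows are both headers: species, then measured quantity
rows = list(csv.reader(io.StringIO(csv_text)))
species_header, quantity_header = rows[0], rows[1]
print(species_header)   # species for each column
print(quantity_header)  # quantity measured for each column
```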

Reading the data set into a Polars data frame

We start by naively reading this data set using pl.read_csv().

[4]:
df = pl.read_csv(os.path.join(data_path, "penguins_subset.csv"))
df.head()
[4]:
shape: (5, 12)
Gentoo | Gentoo_duplicated_0 | Gentoo_duplicated_1 | Gentoo_duplicated_2 | Adelie | Adelie_duplicated_0 | Adelie_duplicated_1 | Adelie_duplicated_2 | Chinstrap | Chinstrap_duplicated_0 | Chinstrap_duplicated_1 | Chinstrap_duplicated_2
str | str | str | str | str | str | str | str | str | str | str | str
"bill_depth_mm" | "bill_length_mm" | "flipper_length_mm" | "body_mass_g" | "bill_depth_mm" | "bill_length_mm" | "flipper_length_mm" | "body_mass_g" | "bill_depth_mm" | "bill_length_mm" | "flipper_length_mm" | "body_mass_g"
"16.3" | "48.4" | "220.0" | "5400.0" | "18.5" | "36.8" | "193.0" | "3500.0" | "18.3" | "47.6" | "195.0" | "3850.0"
"15.8" | "46.3" | "215.0" | "5050.0" | "16.9" | "37.0" | "185.0" | "3000.0" | "16.7" | "42.5" | "187.0" | "3350.0"
"14.2" | "47.5" | "209.0" | "4600.0" | "19.5" | "42.0" | "200.0" | "4050.0" | "16.6" | "40.9" | "187.0" | "3200.0"
"15.7" | "48.7" | "208.0" | "5350.0" | "18.3" | "42.7" | "196.0" | "4075.0" | "20.0" | "52.8" | "205.0" | "4550.0"

Ooof! This is nasty. Polars only allows a single header row, so the duplicated species names in the first row get mangled into unique column names (e.g., Gentoo_duplicated_0), the second header row gets read in as data, and as a result every column's data type is inferred to be a string. We cannot load this data set directly in a usable form.

This is a fairly common occurrence with human-made tabular data. Researchers will have hierarchical column headings. In this case, the first header row is species and the second is the quantity that is being measured for the respective species. (Note that I did this intentionally for instructional purposes; this is not what the original penguins data set had.)

To convert this type of structure into a tidy format, we can perform an unpivoting operation where each of the levels of the hierarchical column headings become rows. Polars will not do this for you, since it forbids hierarchical column headings (hierarchical indexing is not a good idea, in my opinion, so I see why Polars forbids it). I therefore wrote a function for the bebi103 package that takes a CSV file, possibly with a hierarchical index, and unpivots it to give a new CSV file. We can use this on the penguins data set and then load in the result.

[5]:
bebi103.utils.unpivot_csv(
    os.path.join(data_path, "penguins_subset.csv"),
    os.path.join(data_path, "penguins_tall.csv"),
    header_names=['species', 'quantity'],
    retain_row_index=True,
    row_index_name='penguin_id',
    force_overwrite=True
)

df = pl.read_csv(os.path.join(data_path, "penguins_tall.csv"))

df.head(10)
[5]:
shape: (10, 4)
penguin_id | species | quantity | value
i64 | str | str | f64
0 | "Gentoo" | "bill_depth_mm" | 16.3
0 | "Gentoo" | "bill_length_mm" | 48.4
0 | "Gentoo" | "flipper_length_mm" | 220.0
0 | "Gentoo" | "body_mass_g" | 5400.0
0 | "Adelie" | "bill_depth_mm" | 18.5
0 | "Adelie" | "bill_length_mm" | 36.8
0 | "Adelie" | "flipper_length_mm" | 193.0
0 | "Adelie" | "body_mass_g" | 3500.0
0 | "Chinstrap" | "bill_depth_mm" | 18.3
0 | "Chinstrap" | "bill_length_mm" | 47.6

Notice that we retained the row index in the column 'penguin_id' to make sure that we did not lose track of which penguin each measurement was associated with. Note also that a penguin is uniquely identified only by the combination of penguin ID and species. E.g., the Adelie penguin with penguin ID 0 has four measurements associated with it.

Pivoting from tall to wide format

Looking at the unpivoted data frame above, it is very tall. Each row has only a single value, that is, a single measurement, and every other column in the row is metadata associated with that measurement, specifically which penguin/species it belongs to and which quantity was measured.

While this tall format is tidy, we can imagine using a wide format, in which each row is a specific penguin and each column is a quantity measured on that penguin. This is also a tidy format. In some cases, a tall format is more convenient, and in others a wide format is more convenient.

To convert from a tall to a wide format, we perform a pivot. A pivot operation classifies columns of a tall data frame in three ways.

  • The column(s) we pivot on get converted into the column headings of the pivoted data frame.

  • The value column’s entries get populated under the new column headings determined by the on column.

  • The index column(s) are essentially along for the ride; they are retained as columns. If the index columns do not uniquely identify rows, then you need to specify an aggregating operation that will be applied to the values associated with rows that have like entries in the index column(s).

Let us now perform the pivot. We want 'quantity' as the “on” column, since its entries should become the new column headings. The value column is 'value', and the 'penguin_id' and 'species' columns are the index columns. Together they uniquely identify a penguin, so we do not need to provide an aggregating function.

[6]:
df = df.pivot(
    on='quantity',
    index=['penguin_id', 'species'],
    values='value'
)

# Take a look
df.head()
[6]:
shape: (5, 6)
penguin_id | species | bill_depth_mm | bill_length_mm | flipper_length_mm | body_mass_g
i64 | str | f64 | f64 | f64 | f64
0 | "Gentoo" | 16.3 | 48.4 | 220.0 | 5400.0
0 | "Adelie" | 18.5 | 36.8 | 193.0 | 3500.0
0 | "Chinstrap" | 18.3 | 47.6 | 195.0 | 3850.0
1 | "Gentoo" | 15.8 | 46.3 | 215.0 | 5050.0
1 | "Adelie" | 16.9 | 37.0 | 185.0 | 3000.0

Excellent! We now have a wide, but still tidy, data frame.

The 'penguin_id' column is dispensable, but we will keep it for now, since we will demonstrate a pivot operation from this wide data frame in a moment.

A couple of plots for fun

Now that we’ve done all this work and our data set is tidy, let’s make a plot for fun. First, we’ll plot the ECDFs of the bill lengths.

[7]:
bokeh.io.show(
    iqplot.ecdf(
        data=df,
        cats='species',
        q='bill_length_mm',
        x_axis_label='bill length (mm)',
        frame_width=400,
    )
)

We can also plot bill length versus flipper length to see if we can see a difference among the species. It is also useful to have a hover tool that shows bill depth and body mass.

[8]:
# Create figure
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="bill length (mm)",
    y_axis_label="flipper length (mm)",
    toolbar_location="above",
    tooltips=[('bill depth', '@bill_depth_mm'), ('body mass', '@body_mass_g')]
)

# Build legend as we populate glyphs
legend_items = []
for color, ((species,), sub_df) in zip(bokeh.palettes.Category10_3, df.group_by("species")):
    glyph = p.scatter(
        source=sub_df.to_dict(), x="bill_length_mm", y="flipper_length_mm", color=color
    )
    legend_items.append((species, [glyph]))

# Place legend
legend = bokeh.models.Legend(items=legend_items, location="center")
p.add_layout(legend, "right")
p.legend.click_policy = "hide"

bokeh.io.show(p)

An important note about tidiness

It is important to note that there is more than one way to make a data set tidy. In the example of the Palmer penguin data set, we saw two legitimate ways of making the data frame tidy. In our preferred wide version, each row corresponded to a measurement of a single penguin, which had several variables associated with it. In the tall version, each row corresponded to a single feature of a penguin.

To demonstrate that the tall version is workable, but more cumbersome, we can make the same plots as above. First, we will unpivot the data frame to make it tall. We need to specify the index columns for the unpivot because we want the 'penguin_id' and 'species' columns to remain in the tall data frame (which we did not have to do in the previous lesson).

[9]:
df = df.unpivot(index=["penguin_id", "species"])

df.head()
[9]:
shape: (5, 4)
penguin_id | species | variable | value
i64 | str | str | f64
0 | "Gentoo" | "bill_depth_mm" | 16.3
0 | "Adelie" | "bill_depth_mm" | 18.5
0 | "Chinstrap" | "bill_depth_mm" | 18.3
1 | "Gentoo" | "bill_depth_mm" | 15.8
1 | "Adelie" | "bill_depth_mm" | 16.9

Plotting the ECDFs is not really a problem with this form of the data frame. We just need to use a filter context to pull out the bill length rows.

[10]:
bokeh.io.show(
    iqplot.ecdf(
        data=df.filter(pl.col('variable') == 'bill_length_mm'),
        q="value",
        cats="species",
        frame_width=400,
        x_axis_label="bill length (mm)",
    )
)

Making the scatter plot, however, is much more difficult and involves a lot of filtering by hand.

[11]:
# Set up figure
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="bill length (mm)",
    y_axis_label="flipper length (mm)",
    toolbar_location="above",
)

# Expressions for filtering
bill_length = pl.col('variable') == "bill_length_mm"
flipper_length = pl.col('variable') == "flipper_length_mm"

# Build legend as we populate glyphs
legend_items = []
for color, ((species,), sub_df) in zip(bokeh.palettes.Category10_3, df.group_by('species')):
    # Slice out bill and flipper lengths for species
    bill = sub_df.filter(bill_length).get_column('value').to_numpy()
    flipper = sub_df.filter(flipper_length).get_column('value').to_numpy()

    # Populate glyph
    glyph = p.scatter(bill, flipper, color=color)
    legend_items.append((species, [glyph]))

# Build and place legend
legend = bokeh.models.Legend(items=legend_items, location="center")
p.add_layout(legend, "right")
p.legend.click_policy = "hide"

bokeh.io.show(p)

This works fine, but it is more cumbersome. Importantly, we could not use a single column data source for the plot, so we lose the easy ability to display more data, like bill depth and body mass, upon hover. The moral of the story is that you should tidy your data, but you should think carefully about in what way you want your data to be tidy.

Computing environment

[12]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

numpy     : 1.26.4
pandas    : 2.2.2
bokeh     : 3.4.1
iqplot    : 0.3.7
jupyterlab: 4.0.13