Reshaping data frames by stacking and unstacking

Data set download

[2]:

import pandas as pd

import iqplot

import bokeh.io
bokeh.io.output_notebook()

Loading BokehJS ...

We have seen how melting a data frame can bring it to tidy format, but this is not always the best option. Stacking and unstacking operations can be very useful for shaping data frames. As usual, this is best seen by example, and we will use a subset of the Palmer penguins data set, which you can download here: https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv. The data set consists of measurements of three different species of penguins acquired at the Palmer Station in Antarctica. The measurements were made between 2007 and 2009 by Kristen Gorman.

Melting the data set too soon

We start by loading in the data set, bearing in mind that for this particular format of it the header rows are rows zero and one.

[3]:

df = pd.read_csv(os.path.join(data_path, "penguins_subset.csv"), header=[0, 1])
df.head()

[3]:

	Gentoo				Adelie				Chinstrap
	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g
0	16.3	48.4	220.0	5400.0	18.5	36.8	193.0	3500.0	18.3	47.6	195.0	3850.0
1	15.8	46.3	215.0	5050.0	16.9	37.0	185.0	3000.0	16.7	42.5	187.0	3350.0
2	14.2	47.5	209.0	4600.0	19.5	42.0	200.0	4050.0	16.6	40.9	187.0	3200.0
3	15.7	48.7	208.0	5350.0	18.3	42.7	196.0	4075.0	20.0	52.8	205.0	4550.0
4	14.1	48.7	210.0	4450.0	18.0	35.7	202.0	3550.0	18.7	45.4	188.0	3525.0

We have a multiindex for the column names. This data set is not tidy because each row corresponds to observations of three different penguins. To tidy it, we could go ahead and melt it.

To ease melting and further analysis, I will do what I always do when I have a multiindex; I will name the levels of the index.

[4]:

df.columns.names = ["species", "property"]

df.head()

[4]:

species	Gentoo				Adelie				Chinstrap
property	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g	bill_depth_mm	bill_length_mm	flipper_length_mm	body_mass_g
0	16.3	48.4	220.0	5400.0	18.5	36.8	193.0	3500.0	18.3	47.6	195.0	3850.0
1	15.8	46.3	215.0	5050.0	16.9	37.0	185.0	3000.0	16.7	42.5	187.0	3350.0
2	14.2	47.5	209.0	4600.0	19.5	42.0	200.0	4050.0	16.6	40.9	187.0	3200.0
3	15.7	48.7	208.0	5350.0	18.3	42.7	196.0	4075.0	20.0	52.8	205.0	4550.0
4	14.1	48.7	210.0	4450.0	18.0	35.7	202.0	3550.0	18.7	45.4	188.0	3525.0

We are now ready for a melt. Because we have names for the levels of our multiindex, the melt function will automatically name the columns of the resulting melted data frame.

[5]:

df_melted = df.melt()

df_melted.head()

[5]:

	species	property	value
0	Gentoo	bill_depth_mm	16.3
1	Gentoo	bill_depth_mm	15.8
2	Gentoo	bill_depth_mm	14.2
3	Gentoo	bill_depth_mm	15.7
4	Gentoo	bill_depth_mm	14.1

This seems like a tidy data frame, and it is, but we have lost information! Specifically, we lost the which individual penguin each measurements belongs to. A total of 204 penguins were measured (68 for each species), and four properties of each were measured. The melted data frame has 204 × 4 = 816 rows. This is no good, since we want to know which penguin each set of four measurements belongs to.

So, we want a tidy data frame that has five columns, each column containing a variable for one set of measurements. The variables are the beak depth, beak length, flipper length, and weight, and which species the penguin being measured is.

Stacking

Stacking is procedure that takes a wide data frame and makes it narrower by converting the names at a level of a columnar multiindex into an index in the data frame. In this case, we want to make the species, as given in the top level of the column muliindex, and make it an index. Let’s perform the stacking operation and see what we get.

[6]:

df = df.stack(level="species")

df.head()

[6]:

	property	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
	species
0	Adelie	18.5	36.8	3500.0	193.0
	Chinstrap	18.3	47.6	3850.0	195.0
	Gentoo	16.3	48.4	5400.0	220.0
1	Adelie	16.9	37.0	3000.0	185.0
1	Chinstrap	16.7	42.5	3350.0	187.0

We now have a multiindex for the data frame index. The outer index is nameless and the inner index carries the name “species”. We no longer have a multiindex for the column names, but have the sets of columns we like.

This structure of the data frame makes sense, but it would be easier to understand if we swapped the levels of the index.

[7]:

df = df.swaplevel(axis="index")

df.head()

[7]:

	property	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
species
Adelie	0	18.5	36.8	3500.0	193.0
Chinstrap	0	18.3	47.6	3850.0	195.0
Gentoo	0	16.3	48.4	5400.0	220.0
Adelie	1	16.9	37.0	3000.0	185.0
Chinstrap	1	16.7	42.5	3350.0	187.0

Now, the second index is the measurement number for a given species. It becomes easier to understand if we provide a name for the inner index and sort by species.

[8]:

df.index.names = ["species", "penguin number"]
df = df.sort_index(level="species")

df.head()

[8]:

	property	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
species	penguin number
Adelie	0	18.5	36.8	3500.0	193.0
	1	16.9	37.0	3000.0	185.0
	2	19.5	42.0	4050.0	200.0
	3	18.3	42.7	4075.0	196.0
	4	18.0	35.7	3550.0	202.0

However, if we want to make a plot out of this, we need to convert the row index to columns of the data frame, which we can do with the reset_index() method.

[9]:

df = df.reset_index()

df.head()

[9]:

property	species	penguin number	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
0	Adelie	0	18.5	36.8	3500.0	193.0
1	Adelie	1	16.9	37.0	3000.0	185.0
2	Adelie	2	19.5	42.0	4050.0	200.0
3	Adelie	3	18.3	42.7	4075.0	196.0
4	Adelie	4	18.0	35.7	3550.0	202.0

Finally, we do not need the column index to be named.

[10]:

df.columns.name = None

df.head()

[10]:

	species	penguin number	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
0	Adelie	0	18.5	36.8	3500.0	193.0
1	Adelie	1	16.9	37.0	3000.0	185.0
2	Adelie	2	19.5	42.0	4050.0	200.0
3	Adelie	3	18.3	42.7	4075.0	196.0
4	Adelie	4	18.0	35.7	3550.0	202.0

Beautiful! A tidy data frame! (Note that we can delete the 'penguin number' column if we like because it is irrelevant. In most situations, I would delete it, but we will use it for illustrative purposes later in this lesson.)

Reshaping by unstacking

Sometimes, our data set is tidy, contains all of the information we need, but is not in a convenient format. As an example, I will create a tidy data frame for the penguin data where each row is a single measurement of a single feature of a penguin. This is what we had before when we melted the data frame too soon, but if we melt the data frame now, specifying 'penguin number' and 'species' as ID variables, we get a tidy data frame that still has all of the information in the data set.

[11]:

df = df.melt(id_vars=["penguin number", "species"])

df.head()

[11]:

	penguin number	species	variable	value
0	0	Adelie	bill_depth_mm	18.5
1	1	Adelie	bill_depth_mm	16.9
2	2	Adelie	bill_depth_mm	19.5
3	3	Adelie	bill_depth_mm	18.3
4	4	Adelie	bill_depth_mm	18.0

This format is useful, but working with it can be more painful than in the previous format where each row corresponded to a measurement of a given penguin as opposed to one measurement of one property of a penguin. The present data frame is in a sense too tall. We would like to widen it, or unstack it.

To do this unstacking operation, we need to do two steps:

Set the index of the data frame to be a multiindex based on all columns that are necessary to have a unique index for each row. In this case, that would be 'species' and 'penguin number'. Additionally, use the column whose entries you want to become column names upon unstacking as part of the multiindex. In this case, this is 'variable'.
Perform an unstacking operation with the level being the level of the multiindex you want to become the column names (in our case 'variable').

Let’s do it!

[12]:

# Set index for necessary unique identifiers and column with desired column names
df = df.set_index(["species", "penguin number", "variable"])

# Unstack
df = df.unstack(level="variable")

df.head()

[12]:

		value
	variable	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
species	penguin number
Adelie	0	18.5	36.8	3500.0	193.0
	1	16.9	37.0	3000.0	185.0
	2	19.5	42.0	4050.0	200.0
	3	18.3	42.7	4075.0	196.0
	4	18.0	35.7	3550.0	202.0

This is close to the shape we want. We have multiindexes for both the rows and columns. To flatten the multiindexed column names, we could use df.columns.to_flat_index(), but this converts the multiindex to a single index comprised of tuples. So, the column names would be something like ('value', 'bill_depth_mm'). Instead, we just want the inner level of the multiindex, which has a name 'variable'. We can set the columns by getting the values of the indices at this level.

[13]:

df.columns = df.columns.get_level_values(level="variable")

df.head()

[13]:

	variable	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
species	penguin number
Adelie	0	18.5	36.8	3500.0	193.0
	1	16.9	37.0	3000.0	185.0
	2	19.5	42.0	4050.0	200.0
	3	18.3	42.7	4075.0	196.0
	4	18.0	35.7	3550.0	202.0

Now all we are left to do is reset the index to bring the species and penguin number entries from indexes to columns in the data frame.

[14]:

df = df.reset_index()

df.head()

[14]:

variable	species	penguin number	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
0	Adelie	0	18.5	36.8	3500.0	193.0
1	Adelie	1	16.9	37.0	3000.0	185.0
2	Adelie	2	19.5	42.0	4050.0	200.0
3	Adelie	3	18.3	42.7	4075.0	196.0
4	Adelie	4	18.0	35.7	3550.0	202.0

A couple of plots for fun

Now that we’ve done all this work and our data set is tidy, let’s make a plot for fun. First, we’ll plot the ECDFs of the bill lengths.

[15]:

bokeh.io.show(
    iqplot.ecdf(
        data=df,
        cats='species',
        q='bill_length_mm',
        x_axis_label='bill length (mm)',
        frame_width=400,
    )
)

We can also plot bill length versus flipper length to see if we can see a difference among the species.

[16]:

# Create figure
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="bill length (mm)",
    y_axis_label="flipper length (mm)",
    toolbar_location="above",
)

# Build legend as we populate glyphs
legend_items = []
for color, (species, g) in zip(bokeh.palettes.Category10_3, df.groupby("species")):
    glyph = p.circle(source=g, x="bill_length_mm", y="flipper_length_mm", color=color)
    legend_items.append((species, [glyph]))

# Place legend
legend = bokeh.models.Legend(items=legend_items, location="center")
p.add_layout(legend, "right")
p.legend.click_policy = "hide"

bokeh.io.show(p)

An important note about tidiness

It is important to note that there is more than one way to make a data set tidy. In the example of the Palmer penguin data set, we saw two legitimate ways of making the data frame tidy. In our preferred version, each row corresponded to a measurement of a single penguin, which had several variables associated with it. In another version, each row corresponded to a single feature of a penguin.

To demonstrate that this latter version is workable, but more cumbersome, we can make the same plots as above. First, we’ll melt it again.

[17]:

df = df.melt(id_vars=["penguin number", "species"])

df.head()

[17]:

	penguin number	species	variable	value
0	0	Adelie	bill_depth_mm	18.5
1	1	Adelie	bill_depth_mm	16.9
2	2	Adelie	bill_depth_mm	19.5
3	3	Adelie	bill_depth_mm	18.3
4	4	Adelie	bill_depth_mm	18.0

Plotting the ECDFs is not really a problem with this form of the data frame. We just need to use Boolean indexing to pull out the bill length rows.

[18]:

bokeh.io.show(
    iqplot.ecdf(
        data=df.loc[df["variable"] == "bill_length_mm", :],
        q="value",
        cats="species",
        frame_width=400,
        x_axis_label="bill length (mm)",
    )
)

Making the scatter plot, however, is much more difficult and involves a lot of Boolean indexing by hand.

[19]:

# Set up figure
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="bill length (mm)",
    y_axis_label="flipper length (mm)",
    toolbar_location="above",
)

# Build legend as we populate glyphs
legend_items = []
for color, species in zip(bokeh.palettes.Category10_3, df["species"].unique()):
    # Which species
    species_inds = df["species"] == species

    # Slice out bill and flipper lengths for species
    bill_length = df.loc[(df["variable"] == "bill_length_mm") & species_inds, "value"]
    flipper_length = df.loc[(df["variable"] == "flipper_length_mm") & species_inds, "value"]

    # Populate glyph
    glyph = p.circle(bill_length, flipper_length, color=color)
    legend_items.append((species, [glyph]))

# Build and place legend
legend = bokeh.models.Legend(items=legend_items, location="center")
p.add_layout(legend, "right")
p.legend.click_policy = "hide"

bokeh.io.show(p)

This works fine, but is more cumbersome and therefore prone to error because we cannot use a groupby operation. The moral of the story is that you should tidy your data, but you should think carefully about in what way your data are tidy.

Computing environment

[20]:

%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,iqplot,jupyterlab

Python implementation: CPython
Python version       : 3.9.13
IPython version      : 8.4.0

numpy     : 1.21.5
pandas    : 1.4.3
bokeh     : 2.4.3
iqplot    : 0.3.2
jupyterlab: 3.4.4