Making a data frame tall

Data set download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
[2]:
import numpy as np
import polars as pl

import iqplot

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

In this part of the lesson, we will perform some wrangling on a data set that involves:

  • Using Polars’s read_excel() function to read in data from an MS Excel spread sheet

  • Parsing column names to extract useful metadata en route to tidy data

  • Performing an unpivot operation to efficiently tidy a data set.

The data set

We will use a data set from Angela Stathopoulos’s lab, acquired to study morphogen profiles in developing fruit fly embryos. The original paper is Reeves, Trisnadi, et al., Dorsal-ventral gene expression in the Drosophila embryo reflects the dynamics and precision of the Dorsal nuclear gradient, Dev. Cell., 22, 544-557, 2012, and the data set may be downloaded here: https://s3.amazonaws.com/bebi103.caltech.edu/data/Reeves2012_data.xlsx.

In this experiment, Reeves, Trisnadi, and coworkers measured expression levels of a fusion of Dorsal, a morphogen transcription factor important in determining the dorsal-ventral axis of the developing organism, and Venus, a yellow fluorescent protein along the dorsal/ventral- (DV) coordinate. They put this construct on the third chromosome, while the wild type dorsal is on the second. Instead of the wild type, they had a homozygous dorsal-null mutant on the second chromosome. The Dorsal-Venus construct rescues wild type behavior, so they could use this construct to study Dorsal gradients.

Dorsal shows higher expression on the ventral side of the organism, thus giving a gradient in expression from dorsal to ventral which can be ascertained by the spatial distribution of Venus fluorescence intensity.

This can be seen in the image below, which is a cross section of a fixed embryo with anti-Dorsal staining. The bottom of the image is the ventral side and the top is the dorsal side of the embryo. The DV coordinate system is defined by the yellow line. The image is adapted from the Reeves, Trisnadi, et al. paper.

Dorsal gradient

A quick note on nomenclature: Dorsal (capital D) is the name of the protein product of the gene dorsal (italicized). The dorsal (adjective) side of the embryo is its back. The ventral side is its belly. Dorsal is expressed more strongly on the ventral side of the developing embryo. This can be confusing.

To quantify the gradient, Reeves, Trisnadi, and coworkers had to first choose a metric for describing it. They chose to fit the measured profile of fluorescence intensity with a Gaussian peak (plus background) and use the standard deviation of that Gaussian as a metric for the width of the Dorsal gradient.

In this lesson, we will use the gradient widths as outputted from this procedure. The units of the widths are dimensionless, consistent with the coordinate system shown in the image above. I asked one of the authors for the data sets used in making the figures. She sent me a MS Excel file that had a separate sheet for each of several figures in the paper that I asked about. We will focus on the data used for Fig. 1F of the paper. In this figure, the authors seek to demonstrate that live imaging with their Venus-Dorsal construct gives a Dorsal gradient of similar width as would be obtained by fixing wild type cells and doing Dorsal antibody staining (the gold standard). These wild type embryos were analyzed as whole mounts and also as cross sections. They also tried anti-Dorsal staining and anti-Venus staining in the Venus-Dorsal construct. Finally, they also measured gradient widths of a GFP-Dorsal construct that fails to complete development.

Loading in an Excel sheet

Generally, you should store your data sets in portable formats, like CSV, JSON, XML, HDF5, OME-TIFF, etc., and not proprietary formats. Nonetheless, software like Microsoft Excel is widely used, and you will often receive data sets in this format. Fortunately, Polars can read Excel files, provided they are from fairly recent versions of Excel.

To read in this data set, we will use pl.read_excel(). Importantly, because an Excel document may have many sheets, we need to specify the sheet name we want, in this case 'Fig 1F'.

[3]:
df = pl.read_excel(os.path.join(data_path, "Reeves2012_data.xlsx"), sheet_name="Fig 1F")

df.head()
[3]:
shape: (5, 8)
wt wholemountswt cross-sectionsanti-Dorsal dl1/+dl-venus/+anti-gfp dl1/+dl-venus/+Venus (live) dl1/+dl-venus/+anti-Dorsal dl1/+dl-GFP/+anti-gfp dl1/+dl-GFP/+ GFP (live) dl1/+dl-GFP/+
f64f64f64f64f64f64f64f64
0.12880.13270.14820.16320.16660.22480.23890.2412
0.15540.14570.15030.16710.17530.18910.20350.1942
0.13060.14470.15770.17040.17050.17050.19430.2186
0.14130.12820.17110.1779null0.17350.20.2104
0.15570.14870.13420.1483null0.21350.2560.2463

The data frame is not tidy. Each entry corresponds to one observation, not each row. The column headings contain important metadata, the genotype (wt, dl1/+dl-venus/+, or dl1/+dl-GFP/+) and the method (wholemounts, cross-sections, anti-Dorsal, anti-gfp, Venus (live), and GFP (live)).

The data set has other issues we need to clean up. The column 'anti-gfp dl1/+dl-venus/+' is mislabeled; it should be 'anti-Venus dl1/+dl-venus/+'. We would also like to clean up the genotypes, putting in a semicolon to separate the chromosomes. The wild type columns have the genotype first ('wt') followed by the method, whereas the other columns have the method first, followed by genotype.

Parsing the column names

We will start our process of tidying this data set by changing the column names. They are pretty messy, so this is best done by hand in this case. We will rename the columns with strings where the genotype comes first, followed by the method for measuring the gradient width, separated by an underscore.

[4]:
col_names = {
    'wt wholemounts': 'WT_whole mount',
    'wt cross-sections': 'WT_cross-section',
    'anti-Dorsal dl1/+dl-venus/+': 'dl1/+ ; dl-venus/+_anti-Dorsal',
    'anti-gfp dl1/+dl-venus/+': 'dl1/+ ; dl-venus/+_anti-Venus',
    'Venus (live)  dl1/+dl-venus/+': 'dl1/+ ; dl-venus/+_Venus (live)',
    'anti-Dorsal  dl1/+dl-GFP/+': 'dl1/+ ; dl-gfp/+_anti-Dorsal',
    'anti-gfp  dl1/+dl-GFP/+ ': 'dl1/+ ; dl-gfp/+_anti-GFP',
    'GFP (live)  dl1/+dl-GFP/+': 'dl1/+ ; dl-gfp/+_GFP (live)'
}

df = df.rename(col_names)

df.head()
[4]:
shape: (5, 8)
WT_whole mountWT_cross-sectiondl1/+ ; dl-venus/+_anti-Dorsaldl1/+ ; dl-venus/+_anti-Venusdl1/+ ; dl-venus/+_Venus (live)dl1/+ ; dl-gfp/+_anti-Dorsaldl1/+ ; dl-gfp/+_anti-GFPdl1/+ ; dl-gfp/+_GFP (live)
f64f64f64f64f64f64f64f64
0.12880.13270.14820.16320.16660.22480.23890.2412
0.15540.14570.15030.16710.17530.18910.20350.1942
0.13060.14470.15770.17040.17050.17050.19430.2186
0.14130.12820.17110.1779null0.17350.20.2104
0.15570.14870.13420.1483null0.21350.2560.2463

Unpivot the data frame

When we unpivot the data frame, the data within it, called values, become a single column. The column names, called variables also populate a new column. So, to unpivot it, we need to specify what we want to call the values and what we want to call the variable. The unpivot() method does the rest!

[5]:
df = df.unpivot(
    variable_name='genotype_method',
    value_name='gradient width'
).drop_nulls()

# Take a look
df.head()
[5]:
shape: (5, 2)
genotype_methodgradient width
strf64
"WT_whole mount"0.1288
"WT_whole mount"0.1554
"WT_whole mount"0.1306
"WT_whole mount"0.1413
"WT_whole mount"0.1557

Nice! We now have a tidy data frame. Note that we also dropped the null values, since the nulls from the original columns come along for the ride when unpivoting.

Note that df.unpivot() has other options. For example, you can specify columns that do not comprise data, but should still be included in the unpivoted data frame using the id_vars keyword argument. That does not apply to this data frame, but comes up often. As a final comment, note that unpivoting is sometimes called melting, as it is when using Pandas.

Splitting the genotype_method column

Now our goal is to convert the 'genotype_method' column into a two columns, one encoding genotype and the other the method. To do this, we first use Polars’s string methods to split the entries in the column at the underscore. This gives a series of list data types. We then convert the list to a struct, where the fields are the column labels we want when we split the column into two, in this case 'genotype' and 'method'. Finally, we can unnest the column with our structs.

[6]:
df = df.with_columns(
    pl.col('genotype_method')
    .str.split('_')
    .list.to_struct(fields=['genotype', 'method'])
).unnest('genotype_method')

# Take a look
df.head()
[6]:
shape: (5, 3)
genotypemethodgradient width
strstrf64
"WT""whole mount"0.1288
"WT""whole mount"0.1554
"WT""whole mount"0.1306
"WT""whole mount"0.1413
"WT""whole mount"0.1557

Looking at the data frame above, it is very tall. Each row only has a single value, that is, a single measurement, and every other column in the row is metadata associated with that measurement, specifically which penguin/species and which quantity was being measured. Unpivoting operations generally make data frames taller, with more rows before unpivoting.

Using the data frame

Let’s make a plot like Fig. 1F of the Reeves, Trisnadi, et al. paper, but not with boxes, rather as a strip plot.

[7]:
p = iqplot.strip(
    data=df,
    q='gradient width',
    cats=['genotype', 'method'],
    color_column='genotype',
    spread="jitter",
    frame_height=350,
    frame_width=450,
)

bokeh.io.show(p)

Computing environment

[8]:
%load_ext watermark
%watermark -v -p polars,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

polars    : 1.6.0
bokeh     : 3.4.1
iqplot    : 0.3.7
jupyterlab: 4.0.13