Merging and concatenating data frames
[2]:
import numpy as np
import polars as pl
import iqplot
import bokeh.io
bokeh.io.output_notebook()
It often happens that experiments consist of multiple data files that need to be brought together into a single data frame to work with in exploratory data analysis and subsequent analyses. Through its concatenation and merging capabilities, Polars provides powerful tools for handling this sort of data.
The frog tongue strike data set
As usual, we will work with a real data set to learn about concatenation and merging of data frames. The data set we will use comes from a fun paper about the adhesive properties of frog tongues. The reference is Kleinteich and Gorb, Tongue adhesion in the horned frog Ceratophrys sp., Sci. Rep., 4, 5225, 2014. You might also want to check out a New York Times feature on the paper here.
In this paper, the authors investigated various properties of the adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high speed cameras to capture the strike and record relevant data.
To get an idea of the experimental set up, you can check out this movie, kindly sent to me by Thomas Kleinteich. If video does not play in your browser, you may download it here.
The data files
I pulled data files from the Kleinteich and Gorb paper. You can download the data files here: https://s3.amazonaws.com/bebi103.caltech.edu/data/frog_strikes.zip.
There are four files, one for each of the four frogs, labeled with IDs I, II, III, and IV, that were studied. To see the format of the files, we can look at the content of the file for frog I. You can use
head -n 20 ../data/frog_strikes_I.csv
from the command line. Here is the content of the first data file.
# These data are from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# Frog ID: I
# Age: adult
# Snout-vent-length (SVL): 63 mm
# Body weight: 63.1 g
# Species: Ceratophrys cranwelli crossed with Ceratophrys cornuta
date,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
2013_02_26,3,1205,46,1.95,-785,884,1.27,-0.290,387,70,0.82,3117,-2030
2013_02_26,4,2527,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
2013_03_01,1,1745,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239
2013_03_01,2,1556,41,2.51,-455,1025,0.74,-0.170,330,158,0.52,4718,-1381
2013_03_01,3,493,36,0.80,-974,499,1.57,-0.423,245,216,0.12,2012,-3975
2013_03_01,4,2276,31,3.68,-592,969,0.96,-0.176,341,106,0.69,6676,-1737
2013_03_05,1,556,43,0.90,-512,835,0.83,-0.285,359,110,0.69,1550,-1427
2013_03_05,2,1928,46,3.11,-804,508,1.30,-0.285,246,178,0.28,7832,-3266
2013_03_05,3,2641,50,4.27,-690,491,1.12,-0.239,269,224,0.17,9824,-2568
2013_03_05,4,1897,41,3.06,-462,839,0.75,-0.328,266,176,0.34,7122,-1733
2013_03_12,1,1891,40,3.06,-766,1069,1.24,-0.380,408,33,0.92,4638,-1879
2013_03_12,2,1545,48,2.50,-715,649,1.15,-0.298,141,112,0.21,10947,-5064
2013_03_12,3,1307,29,2.11,-613,1845,0.99,-0.768,455,92,0.80,2874,-1348
2013_03_12,4,1692,31,2.73,-677,917,1.09,-0.457,186,129,0.31,9089,-3636
2013_03_12,5,1543,38,2.49,-528,750,0.85,-0.353,153,148,0.03,10095,-3453
2013_03_15,1,1282,31,2.07,-452,785,0.73,-0.253,290,105,0.64,4419,-1557
2013_03_15,2,775,34,1.25,-430,837,0.70,-0.276,257,124,0.52,3019,-1677
2013_03_15,3,2032,60,3.28,-652,486,1.05,-0.257,147,134,0.09,13784,-4425
2013_03_15,4,1240,34,2.00,-692,906,1.12,-0.317,364,260,0.28,3406,-1901
2013_03_15,5,473,40,0.76,-536,1218,0.87,-0.382,259,168,0.35,1830,-2073
The first lines all begin with #
signs, signifying that they are comments. They do give important information about the frog, though.
The first line after the comments are the headers, giving the column names for the data frame we will load.
Concatenating data frames
We would like to have all of the data frames be together in one data frame so we can conveniently do things like make plots comparing the four frogs. Let’s read in the data sets and make a list of data frames.
[3]:
# On a local machine, we would do this: fnames = glob.glob('../data/frog_strikes_*.csv')
# But for Colab compatibility, we will do it by hand
fnames = [
os.path.join(data_path, f"frog_strikes_{frog_id}.csv")
for frog_id in ["I", "II", "III", "IV"]
]
dfs = [pl.read_csv(f, comment_prefix="#") for f in fnames]
# Take a look at first data frame
dfs[0].head()
[3]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | f64 | i64 | i64 | f64 | f64 | i64 | i64 | f64 | i64 | i64 |
"2013_02_26" | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.29 | 387 | 70 | 0.82 | 3117 | -2030 |
"2013_02_26" | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 |
"2013_03_01" | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 |
"2013_03_01" | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.17 | 330 | 158 | 0.52 | 4718 | -1381 |
"2013_03_01" | 3 | 493 | 36 | 0.8 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 |
We have successfully loaded in all of the data frames. They all have the same columns (as given by the CSV files). So, we wish to tape the data frames together vertically. We can use the pl.concat()
function to do this.
Before we do that, though, we might notice a problem. We will not have information to tell us which frog is which. We might therefore like to add a column to each data frame that has the frog ID, and then concatenate them. We can parse the ID of the frog from the file name, as we can see by looking at the file names.
[4]:
fnames
[4]:
['../data/frog_strikes_I.csv',
'../data/frog_strikes_II.csv',
'../data/frog_strikes_III.csv',
'../data/frog_strikes_IV.csv']
So, for each data frame/file name pair, we extract the Roman numeral and add a column to the data frame containing the frog ID. To do this, we use a Polars literal, accessible with pl.lit()
, which means that we want to insert a specific value (in this case, "I"
, "II"
, "III"
, or "IV"
) into a data frame as a column.
[5]:
for i, f in enumerate(fnames):
frog_id = f[f.rfind('_')+1:f.rfind('.')]
dfs[i] = dfs[i].with_columns(pl.lit(frog_id).alias('ID'))
# Take a look
dfs[0].head()
[5]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | f64 | i64 | i64 | f64 | f64 | i64 | i64 | f64 | i64 | i64 | str |
"2013_02_26" | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.29 | 387 | 70 | 0.82 | 3117 | -2030 | "I" |
"2013_02_26" | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 | "I" |
"2013_03_01" | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 | "I" |
"2013_03_01" | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.17 | 330 | 158 | 0.52 | 4718 | -1381 | "I" |
"2013_03_01" | 3 | 493 | 36 | 0.8 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 | "I" |
Good! Now all data frames have an 'ID'
column, and we can concatenate. The pl.concat()
function takes as input a list of data frames to be concatenated and stacks them on top of each other.
[6]:
# Concatenate data frames
df = pl.concat(dfs)
# Make sure we got them all
print(
"Number of rows:", len(df), "\nUnique IDs:", df.get_column("ID").unique().to_list()
)
Number of rows: 80
Unique IDs: ['IV', 'III', 'I', 'II']
Check!
Creating a DataFrame from scratch
Looking back at the headers of the original data files, we see that there is information present in the header that we would like to have in our data frame. For example, it would be nice to know if each strike came from an adult or juvenile. Or what the snout-vent length was. Working toward the goal of including this in our data frame, we will first construct a new data frame containing information about each frog.
Data frames from dictionaries
One way do create this new data frame is to first construct a dictionary with the respective fields. Since these data sets are small, we can look at the files and make the dictionary by hand.
[7]:
data_dict = {
"ID": ["I", "II", "III", "IV"],
"age": ["adult", "adult", "juvenile", "juvenile"],
"SVL (mm)": [63, 70, 28, 31],
"body weight (g)": [63.1, 72.7, 12.7, 12.7],
"species": ["cross", "cross", "cranwelli", "cranwelli"],
}
Now that we have this dictionary, we can convert it into a DataFrame
by instantiating a pl.DataFrame
class with it, using the data
kwarg.
[8]:
# Make it into a DataFrame
df_frog_info = pl.DataFrame(data=data_dict)
# Take a look
df_frog_info
[8]:
ID | age | SVL (mm) | body weight (g) | species |
---|---|---|---|---|
str | str | i64 | f64 | str |
"I" | "adult" | 63 | 63.1 | "cross" |
"II" | "adult" | 70 | 72.7 | "cross" |
"III" | "juvenile" | 28 | 12.7 | "cranwelli" |
"IV" | "juvenile" | 31 | 12.7 | "cranwelli" |
Nice!
Data frames from numpy arrays
Sometimes the data sets are not small enough to construct a dictionary by hand. Oftentimes, we have a two-dimensional array of data that we want to make into a DataFrame
. As an example, let’s say we have a Numpy array where the first column is snout vent length and the second is weight.
[9]:
data = np.array([[63, 70, 28, 31], [63.1, 72.7, 12.7, 12.7]]).transpose()
# Verify that it's what we think it is
data
[9]:
array([[63. , 63.1],
[70. , 72.7],
[28. , 12.7],
[31. , 12.7]])
To make this into a DataFrame
, we again create pl.DataFrame
instance, but this time we also specify the schema
keyword argument to label the columns.
[10]:
df_demo = pl.DataFrame(data=data, schema=["SVL (mm)", "weight (g)"])
# Take a look
df_demo
[10]:
SVL (mm) | weight (g) |
---|---|
f64 | f64 |
63.0 | 63.1 |
70.0 | 72.7 |
28.0 | 12.7 |
31.0 | 12.7 |
That also works. Generally, any two-dimensional Numpy array can be converted into a DataFrame
in this way. You just need to supply column names.
Programmatically creating a data frame
Hand-entering data should be minimized. The information about each frog were hand-entered once by the experimenter. We should not hand-enter them again. We therefore should parse the comment lines of input files to get the pertinent information.
Note, though, that in the case of a single experiment with only four data sets, hand entering might be faster and indeed less error prone than doing it programmatically. We should definitely do it programmatically if we have a large number of data files or will ever do an experiment with the same file format again.
So, let’s programmatically parse the files. We start by writing a function to parse the metadata from a single file. Recall that the comment lines look like this:
# These data are from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# Frog ID: I
# Age: adult
# Snout-vent-length (SVL): 63 mm
# Body weight: 63.1 g
# Species: Ceratophrys cranwelli crossed with Ceratophrys cornuta
(The function below will not work with Colab because open()
does not work for files specified by a URL.)
[11]:
def parse_frog_metadata(fname):
with open(fname, 'r') as f:
# Citation line, ignore.
f.readline()
# Frog ID
line = f.readline()
frog_id = line[line.find(':')+1:].strip()
# Age
line = f.readline()
age = line[line.find(':')+1:].strip()
# SVL, assume units given as mm
line = f.readline()
svl = line[line.find(':')+1:line.rfind(' ')].strip()
# Body weight, assume units given as g
line = f.readline()
body_weight = line[line.find(':')+1:line.rfind(' ')].strip()
# Species (either cranwelli or cross)
line = f.readline()
species = line[line.find(':')+1:].strip()
if 'cross' in species:
species = 'cross'
else:
species = 'cranwelli'
return frog_id, age, svl, body_weight, species
Let’s take it for a spin.
[12]:
parse_frog_metadata(os.path.join(data_path, 'frog_strikes_I.csv'))
[12]:
('I', 'adult', '63', '63.1', 'cross')
Looks good! Now we can create a list of tuples to use as data for making a data frame.
[13]:
data = [parse_frog_metadata(f) for f in fnames]
# Take a look
data
[13]:
[('I', 'adult', '63', '63.1', 'cross'),
('II', 'adult', '70', '72.7', 'cross'),
('III', 'juvenile', '28', '12.7', 'cranwelli'),
('IV', 'juvenile', '31', '12.7', 'cranwelli')]
We now input this list of tuples, plus the column names, into pl.DataFrame()
, and we’ve got our data frame. We do have to specify that this list of tuples is row-oriented, so Polars knows that each tuple is a row and not a column.
[14]:
df_frog_info = pl.DataFrame(
data=data,
schema=["ID", "age", "SVL (mm)", "body weight (g)", "species"],
orient='row',
)
# Take a look
df_frog_info
[14]:
ID | age | SVL (mm) | body weight (g) | species |
---|---|---|---|---|
str | str | str | str | str |
"I" | "adult" | "63" | "63.1" | "cross" |
"II" | "adult" | "70" | "72.7" | "cross" |
"III" | "juvenile" | "28" | "12.7" | "cranwelli" |
"IV" | "juvenile" | "31" | "12.7" | "cranwelli" |
Joining DataFrames
We want to add the information about the frogs into our main data frame, df
, that we have been working with. Specifically, for each row of the data frame, we also want to include the frog’s age, snout-vent length, body weight, and species. So, we want to take the data frame with all of the information about the tongue strikes and combine it with the data frame containing information about each frog. This combining of data frames is a join operation. In this case, we join on the 'ID'
column, since the value of the that column in each data frame indicates the frog we are talking about.
To perform a join operation we use the df.join()
method (or df.join_asof()
method for approximate matches). Its default join strategy is an inner join, in which the entries of a given row are included in the joined data frame if and only if the entry 'ID'
column of the respective frames match. You can read more about available join strategies in the documentation.
[15]:
df = df.join(df_frog_info, on='ID')
# Take a look
df.head()
[15]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | ID | age | SVL (mm) | body weight (g) | species |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | f64 | i64 | i64 | f64 | f64 | i64 | i64 | f64 | i64 | i64 | str | str | str | str | str |
"2013_02_26" | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.29 | 387 | 70 | 0.82 | 3117 | -2030 | "I" | "adult" | "63" | "63.1" | "cross" |
"2013_02_26" | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 | "I" | "adult" | "63" | "63.1" | "cross" |
"2013_03_01" | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 | "I" | "adult" | "63" | "63.1" | "cross" |
"2013_03_01" | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.17 | 330 | 158 | 0.52 | 4718 | -1381 | "I" | "adult" | "63" | "63.1" | "cross" |
"2013_03_01" | 3 | 493 | 36 | 0.8 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 | "I" | "adult" | "63" | "63.1" | "cross" |
Note that the entries for the added columns were repeated appropriately, e.g., body weight column had 63 for every row corresponding to frog I.
At long last, a plot!
While the purpose of this part of the lesson was to learn how to concatenate and merge data frames, going through all of that wrangling effort would somehow be unsatisfying if we we didn’t generate a plot. Let’s compare the impact force on a per-mass basis for each frog.
[16]:
p = iqplot.strip(
df,
q="impact force / body weight",
cats="ID",
color_column="age",
spread="jitter",
x_axis_label="impact force / body weight (mN/g)",
y_axis_label="frog ID"
)
bokeh.io.show(p)
Apparently Frog III consistently packs a powerful punch, er…. tongue, for its body weight.
Computing environment
[17]:
%load_ext watermark
%watermark -v -p numpy,polars,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version : 3.12.4
IPython version : 8.25.0
numpy : 1.26.4
polars : 1.6.0
bokeh : 3.4.1
iqplot : 0.3.7
jupyterlab: 4.0.13