Merging and concatenating data frames

Data set download
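
The code below refers to the location of the data files through a data_path variable. As a minimal setup sketch, assuming the data set linked below has been downloaded and unpacked into a ../data directory relative to this notebook, we can set

# Directory containing the unpacked frog_strikes_*.csv files; adjust as needed
data_path = "../data"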


[2]:
import glob
import os

import numpy as np
import pandas as pd

import iqplot

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

It often happens that experiments consist of multiple data files that need to be brought together into a single data frame to work with in exploratory data analysis and subsequent analyses. Through its concatenation and merging capabilities, Pandas provides powerful tools for handling this sort of data.

The frog tongue strike data set

As usual, we will work with a real data set to learn about concatenation and merging of data frames. The data set we will use comes from a fun paper about the adhesive properties of frog tongues. The reference is Kleinteich and Gorb, Tongue adhesion in the horned frog Ceratophrys sp., Sci. Rep., 4, 5225, 2014. You might also want to check out a New York Times feature on the paper here.

In this paper, the authors investigated various properties of the adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high speed cameras to capture the strike and record relevant data.

To get an idea of the experimental setup, you can check out this movie, kindly sent to me by Thomas Kleinteich. If the video does not play in your browser, you may download it here.

The data files

I pulled data files from the Kleinteich and Gorb paper. You can download the data files here: https://s3.amazonaws.com/bebi103.caltech.edu/data/frog_strikes.zip.

There are four files, one for each of the four frogs that were studied, labeled with IDs I, II, III, and IV. To see the format of the files, we can look at the content of the file for frog I. You can use

head -n 20 ../data/frog_strikes_I.csv

from the command line. Here is the content of the first data file.

# These data are from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# Frog ID: I
# Age: adult
# Snout-vent-length (SVL): 63 mm
# Body weight: 63.1 g
# Species: Ceratophrys cranwelli crossed with Ceratophrys cornuta
date,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
2013_02_26,3,1205,46,1.95,-785,884,1.27,-0.290,387,70,0.82,3117,-2030
2013_02_26,4,2527,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
2013_03_01,1,1745,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239
2013_03_01,2,1556,41,2.51,-455,1025,0.74,-0.170,330,158,0.52,4718,-1381
2013_03_01,3,493,36,0.80,-974,499,1.57,-0.423,245,216,0.12,2012,-3975
2013_03_01,4,2276,31,3.68,-592,969,0.96,-0.176,341,106,0.69,6676,-1737
2013_03_05,1,556,43,0.90,-512,835,0.83,-0.285,359,110,0.69,1550,-1427
2013_03_05,2,1928,46,3.11,-804,508,1.30,-0.285,246,178,0.28,7832,-3266
2013_03_05,3,2641,50,4.27,-690,491,1.12,-0.239,269,224,0.17,9824,-2568
2013_03_05,4,1897,41,3.06,-462,839,0.75,-0.328,266,176,0.34,7122,-1733
2013_03_12,1,1891,40,3.06,-766,1069,1.24,-0.380,408,33,0.92,4638,-1879
2013_03_12,2,1545,48,2.50,-715,649,1.15,-0.298,141,112,0.21,10947,-5064
2013_03_12,3,1307,29,2.11,-613,1845,0.99,-0.768,455,92,0.80,2874,-1348
2013_03_12,4,1692,31,2.73,-677,917,1.09,-0.457,186,129,0.31,9089,-3636
2013_03_12,5,1543,38,2.49,-528,750,0.85,-0.353,153,148,0.03,10095,-3453
2013_03_15,1,1282,31,2.07,-452,785,0.73,-0.253,290,105,0.64,4419,-1557
2013_03_15,2,775,34,1.25,-430,837,0.70,-0.276,257,124,0.52,3019,-1677
2013_03_15,3,2032,60,3.28,-652,486,1.05,-0.257,147,134,0.09,13784,-4425
2013_03_15,4,1240,34,2.00,-692,906,1.12,-0.317,364,260,0.28,3406,-1901
2013_03_15,5,473,40,0.76,-536,1218,0.87,-0.382,259,168,0.35,1830,-2073

The first lines all begin with # signs, signifying that they are comments. They do give important information about the frog, though.

The first line after the comments is the header, giving the column names for the data frame we will load.

Concatenating data frames

We would like to have all of the data frames be together in one data frame so we can conveniently do things like make plots comparing the four frogs. Let’s read in the data sets and make a list of data frames.

[3]:
# On a local machine, we would do this: fnames = glob.glob('../data/frog_strikes_*.csv')
# But for Colab compatibility, we will do it by hand
fnames = [
    os.path.join(data_path, f"frog_strikes_{frog_id}.csv")
    for frog_id in ["I", "II", "III", "IV"]
]

dfs = [pd.read_csv(f, comment="#") for f in fnames]

# Take a look at first data frame
dfs[0].head()
[3]:
date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975

We have successfully loaded in all of the data frames. They all have the same columns (as given by the CSV files) and they all have the same indexes (range indexes that were applied by default when loading from the CSV files). We do not really care about the indexes. So, we wish to tape the data frames together vertically. We can use the pd.concat() function to do this.

Before we do that, though, we might notice a problem. We will not have information to tell us which frog is which. We might therefore like to add a column to each data frame that has the frog ID, and then concatenate them. We can parse the ID of the frog from the file name, as we can see by looking at the file names.

[4]:
fnames
[4]:
['../data/frog_strikes_I.csv',
 '../data/frog_strikes_II.csv',
 '../data/frog_strikes_III.csv',
 '../data/frog_strikes_IV.csv']

So, for each data frame/file name pair, we extract the Roman numeral and add a column to the data frame containing the frog ID.

[5]:
for i, f in enumerate(fnames):
    # Extract the Roman numeral between the last underscore and the file extension
    frog_id = f[f.rfind('_')+1:f.rfind('.')]
    dfs[i]['ID'] = frog_id

# Take a look
dfs[0].head()
[5]:
date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa) ID
0 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030 I
1 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695 I
2 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239 I
3 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381 I
4 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975 I

Good! Now all data frames have an 'ID' column, and we can concatenate. The pd.concat() function takes as input a list of data frames to be concatenated. Since we do not care about the index, we can use the ignore_index=True kwarg.

[6]:
df = pd.concat(dfs, ignore_index=True)

# Make sure we got them all
print('Number of rows:', len(df), '\nUnique IDs:', df['ID'].unique())
Number of rows: 80
Unique IDs: ['I' 'II' 'III' 'IV']

Check!

More advanced concatenation

When we concatenated, we updated each data frame with a fresh column. The pd.concat() function can handle some of this for you. If we pass a dictionary of data frames instead of a list, it uses the dictionary keys to label each concatenated data frame via a multiindex. First, we’ll read in the data frames as a dictionary of data frames instead of a list.

[7]:
# Make dictionary of data frames
dfs = {
    f[f.rfind("_") + 1 : f.rfind(".")]: pd.read_csv(f, comment="#")
    for f in fnames
}

# Verify that keys are in fact IDs
dfs.keys()
[7]:
dict_keys(['I', 'II', 'III', 'IV'])

Now, if we call pd.concat() with dictionary input, we get a new data frame with a multiindex.

[8]:
df = pd.concat(dfs)

# Take a look
df.head()
[8]:
date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
I 0 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975

We have a multiindex for the rows, with the high level index being the ID and the low level index being the original index of the data frame that was concatenated. It is useful to give these indexes names so we can conveniently refer to them. We can do that by setting the df.index.names property as

df.index.names = ['ID', 'original index']

We can instead specify a names kwarg when we call pd.concat(). This kwarg specifies the names of the resulting multiindex from the concatenation.

[9]:
df = pd.concat(dfs, names=['ID', 'original index'])

# Take a look
df.head()
[9]:
date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
ID original index
I 0 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975

We conveniently have labeled indexes, and we can now make ID a column in the data frame using the reset_index() method.

[10]:
df = df.reset_index()

df.head()
[10]:
ID original index date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 I 0 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 I 1 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 I 2 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 I 3 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 I 4 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975

We now have a default range index for the data frame that we do not care about. Because the original index was not informative either, we can delete that column if we like, but it is not really a burden to have an unused column lying around in a data set this small. Nonetheless, let’s blow it away.

[11]:
del df['original index']

df.head()
[11]:
ID date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 I 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 I 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 I 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 I 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 I 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975

We now have a nice, tidy data frame!
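
As an aside, instead of del, we could have used the drop() method, which returns a new data frame rather than modifying one in place; the equivalent call is

# Equivalent to `del df['original index']`, but returns a new data frame
df = df.drop(columns="original index")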

Creating a DataFrame from scratch

Looking back at the headers of the original data files, we see that there is information present in the header that we would like to have in our data frame. For example, it would be nice to know if each strike came from an adult or juvenile. Or what the snout-vent length was. Working toward the goal of including this in our data frame, we will first construct a new data frame containing information about each frog.

Data frames from dictionaries

One way to create this new data frame is to first construct a dictionary with the respective fields. Since these data sets are small, we can look at the files and make the dictionary by hand.

[12]:
data_dict = {
    "ID": ["I", "II", "III", "IV"],
    "age": ["adult", "adult", "juvenile", "juvenile"],
    "SVL (mm)": [63, 70, 28, 31],
    "body weight (g)": [63.1, 72.7, 12.7, 12.7],
    "species": ["cross", "cross", "cranwelli", "cranwelli"],
}

Now that we have this dictionary, we can convert it into a DataFrame by instantiating a pd.DataFrame class with it, using the data kwarg.

[13]:
# Make it into a DataFrame
df_frog_info = pd.DataFrame(data=data_dict)

# Take a look
df_frog_info
[13]:
ID age SVL (mm) body weight (g) species
0 I adult 63 63.1 cross
1 II adult 70 72.7 cross
2 III juvenile 28 12.7 cranwelli
3 IV juvenile 31 12.7 cranwelli

Nice!

Data frames from numpy arrays

Sometimes the data sets are not small enough to construct a dictionary by hand. Oftentimes, we have a two-dimensional array of data that we want to make into a DataFrame. As an example, let’s say we have a Numpy array where the first column is snout-vent length and the second is body weight.

[14]:
data = np.array([[63, 70, 28, 31], [63.1, 72.7, 12.7, 12.7]]).transpose()

# Verify that it's what we think it is
data
[14]:
array([[63. , 63.1],
       [70. , 72.7],
       [28. , 12.7],
       [31. , 12.7]])

To make this into a DataFrame, we again create a pd.DataFrame instance, but this time we also specify the columns keyword argument.

[15]:
df_demo = pd.DataFrame(data=data, columns=["SVL (mm)", "weight (g)"])

# Take a look
df_demo
[15]:
SVL (mm) weight (g)
0 63.0 63.1
1 70.0 72.7
2 28.0 12.7
3 31.0 12.7

That also works. Generally, any two-dimensional Numpy array can be converted into a DataFrame in this way. You just need to supply column names.
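
As a quick sketch, we could also supply row labels via the index keyword argument of pd.DataFrame (the labels here are purely illustrative):

# Same array, but with frog IDs as row labels instead of a range index
pd.DataFrame(
    data=data,
    columns=["SVL (mm)", "weight (g)"],
    index=["I", "II", "III", "IV"],
)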

Programmatically creating a data frame

Hand-entering data should be minimized. The information about each frog was hand-entered once by the experimenter; we should not hand-enter it again. We should therefore parse the comment lines of the input files to get the pertinent information.

Note, though, that in the case of a single experiment with only four data sets, hand entering might be faster and indeed less error prone than doing it programmatically. We should definitely do it programmatically if we have a large number of data files or will ever do an experiment with the same file format again.

So, let’s programmatically parse the files. We start by writing a function to parse the metadata from a single file. Recall that the comment lines look like this:

# These data are from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# Frog ID: I
# Age: adult
# Snout-vent-length (SVL): 63 mm
# Body weight: 63.1 g
# Species: Ceratophrys cranwelli crossed with Ceratophrys cornuta

(The function below will not work with Colab because open() does not work for files specified by a URL.)

[16]:
def parse_frog_metadata(fname):
    with open(fname, 'r') as f:
        # Citation line, ignore.
        f.readline()

        # Frog ID
        line = f.readline()
        frog_id = line[line.find(':')+1:].strip()

        # Age
        line = f.readline()
        age = line[line.find(':')+1:].strip()

        # SVL, assume units given as mm
        line = f.readline()
        svl = line[line.find(':')+1:line.rfind(' ')].strip()

        # Body weight, assume units given as g
        line = f.readline()
        body_weight = line[line.find(':')+1:line.rfind(' ')].strip()

        # Species (either cranwelli or cross)
        line = f.readline()
        species = line[line.find(':')+1:].strip()
        if 'cross' in species:
            species = 'cross'
        else:
            species = 'cranwelli'

    return frog_id, age, svl, body_weight, species

Let’s take it for a spin.

[17]:
parse_frog_metadata(os.path.join(data_path, 'frog_strikes_I.csv'))
[17]:
('I', 'adult', '63', '63.1', 'cross')

Looks good! Now we can create a list of tuples to use as data for making a data frame.

[18]:
data = [parse_frog_metadata(f) for f in fnames]

# Take a look
data
[18]:
[('I', 'adult', '63', '63.1', 'cross'),
 ('II', 'adult', '70', '72.7', 'cross'),
 ('III', 'juvenile', '28', '12.7', 'cranwelli'),
 ('IV', 'juvenile', '31', '12.7', 'cranwelli')]

We now input this list of tuples, plus the column names, into pd.DataFrame(), and we’ve got our data frame.

[19]:
df_frog_info = pd.DataFrame(
    data=data,
    columns=["ID", "age", "SVL (mm)", "body weight (g)", "species"]
)

# Take a look
df_frog_info
[19]:
ID age SVL (mm) body weight (g) species
0 I adult 63 63.1 cross
1 II adult 70 72.7 cross
2 III juvenile 28 12.7 cranwelli
3 IV juvenile 31 12.7 cranwelli

Merging DataFrames

Our ultimate goal is to add the information about the frogs into our main data frame, df, that we have been working with.

Brute force merge

We can do it using tools we have already learned. For each row in the DataFrame, we can add the relevant value in each column. Because this will not be the final way I recommend doing this, I will do these operations on a copy of df using the copy() method.

[20]:
# Make a copy of df
df_copy = df.copy()

# Build each column
for col in df_frog_info.columns[df_frog_info.columns != 'ID']:
    # Make a new column with empty values
    df_copy[col] = np.empty(len(df_copy))

    # Add in each entry, row by row
    for i, r in df_copy.iterrows():
        ind = df_frog_info['ID'] == r['ID']
        df_copy.loc[i, col] = df_frog_info.loc[ind, col].iloc[0]

# Take a look at the updated DataFrame
df_copy.head()
[20]:
ID date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa) age SVL (mm) body weight (g) species
0 I 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030 adult 63 63.1 cross
1 I 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695 adult 63 63.1 cross
2 I 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239 adult 63 63.1 cross
3 I 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381 adult 63 63.1 cross
4 I 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975 adult 63 63.1 cross

Note that I used the iterrows() method of the df_copy data frame. This iterator gives an index (which I called i) and a row of a data frame (which I called r). This method, and the analogous one for iterating over columns, items(), can be useful.
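
As a quick, purely illustrative sketch of items(), which yields (column name, Series) pairs:

# Loop over (column name, Series) pairs and report each column's dtype
for col_name, col in df_copy.items():
    print(col_name, col.dtype)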

But this approach seems rather clunky.

Using pd.merge()

A much better way to do it is to use Pandas’s built-in merge() method. Called with all the default keyword arguments, this method finds common columns between two DataFrames (in this case, there is just one, the ID column) and then uses those columns to merge them, filling in values that match in the common columns. This is exactly what we want.

[21]:
df = df.merge(df_frog_info)

# Check it out!
df.head()
[21]:
ID date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa) age SVL (mm) body weight (g) species
0 I 2013_02_26 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030 adult 63 63.1 cross
1 I 2013_02_26 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695 adult 63 63.1 cross
2 I 2013_03_01 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239 adult 63 63.1 cross
3 I 2013_03_01 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381 adult 63 63.1 cross
4 I 2013_03_01 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975 adult 63 63.1 cross

Note that the entries for the added columns were repeated appropriately; e.g., the body weight (g) column has 63.1 for every row corresponding to frog I.
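
If you want to convince yourself of this, one quick check (a sketch, not part of the original analysis) is to look at the unique values of an added column for each frog:

# Each frog ID should map to exactly one body weight
df.groupby("ID")["body weight (g)"].unique()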

I think this example of merging DataFrames highlights the power of using them in your data analysis. Note also that there are plenty of options for how merges are done, and you should consult the Pandas documentation.
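
For example, the call df.merge(df_frog_info) above is equivalent to explicitly specifying the key column and join type with the on and how keyword arguments; a sketch of the explicit form:

# Explicit form of the default merge: join on 'ID', keeping only rows with matching IDs
df.merge(df_frog_info, on="ID", how="inner")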

This example also brings up an important point. When you have to perform operations on data frames, you can often “brute force” it with loops, etc. But if what you are trying to do seems like something a data analyst would frequently encounter, there is a good chance it’s already built-in to Pandas, and you should ask Google how to do it.

At long last, a plot!

While the purpose of this part of the lesson was to learn how to concatenate and merge data frames, going through all of that wrangling effort would somehow be unsatisfying if we didn’t generate a plot. Let’s compare the impact force on a per-mass basis for each frog.

[22]:
p = iqplot.strip(
    df,
    q="impact force / body weight",
    cats="ID",
    color_column="age",
    spread="jitter",
    x_axis_label="impact force / body weight (mN/g)",
    y_axis_label="frog ID"
)

bokeh.io.show(p)

Apparently Frog III consistently packs a powerful punch, er…. tongue.

Computing environment

[23]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

numpy     : 1.24.3
pandas    : 2.0.3
bokeh     : 3.2.1
iqplot    : 0.3.5
jupyterlab: 4.0.6