Tidy data and split-apply-combine

Data set download


[2]:
import os

import numpy as np
import polars as pl
import polars.selectors as cs

We have dipped our toe into Polars to see its power. In this lesson, we will continue to harness the power of Polars to pull out subsets of data we are interested in, and of vital importance, will introduce the concept of tidy data. I suspect this will be a demarcation in your life. You will have the times in your life before tidy data and after. Welcome to your bright tomorrow.

Tidy data

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data frames follow the rules:

  1. Each variable is a column.

  2. Each observation is a row.

  3. Each type of observation has its own separate data frame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.
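
To make the distinction concrete, here is a minimal sketch (with made-up column names and values, not the data set of this lesson) of converting an untidy, wide-format table into tidy format using Polars’s unpivot() method.

# Hypothetical untidy table: one row per subject, one column per trial
df_untidy = pl.DataFrame(
    {
        "subject": ["A", "B"],
        "trial 1": [72.5, 90.0],
        "trial 2": [87.5, 62.5],
    }
)

# Tidy version: each row is a single observation (subject, trial, score)
df_tidy = df_untidy.unpivot(index="subject", variable_name="trial", value_name="score")

In the tidy version, 'subject', 'trial', and 'score' are each a column, and each row records exactly one measurement.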

You may raise some objections about tidy data. Here are a few, and my responses.

Objection: Looking at a table of tidy data is ugly. It is not intuitively organized. I would almost never display a tidy data table in a publication.

Response: Correct! Having tabular data in a format that is easy to read as a human studying a table is a very different thing than having it in a format that is easy to explore and work with using a computer. As Daniel Chen put it, “There are data formats that are better for reporting and data formats that are better for analysis.” We are using the tidy data frames for analysis, not reporting (though we will see in the coming lessons that having the data in a tidy format makes making plots much easier, and plots are a key medium for reporting.)

Objection: Isn’t it better to sometimes have data arranged in other ways? Say in a matrix?

Response: This is certainly true for things like images, or raster-style data in general. It makes more sense to organize an image in a 2D matrix than to have it organized as a data frame with three columns (row in image, column in image, intensity of pixel), where each row corresponds to a single pixel. For an image, indexing it by row and column is always unambiguous: my_image[i, j] means the pixel at row i and column j.

For other data, though, the matrix layout suffers from the fact that there may be more than one way to construct a matrix. If you know a data frame is tidy, you already know its structure. You need only to ask what the columns are, and then you immediately know how to access data using Boolean indexing. In other formats, you might have to read and write extensive comments to understand the structure of the data. Of course, you can read and write comments, but it opens the door for the possibility of misinterpretation or mistakes.
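
As a reminder of what this kind of access looks like (using column names from the data set we load later in this lesson), a Boolean-indexing one-liner on a tidy data frame might read:

# All records for female participants who scored at least 90% correct
df.filter((pl.col("gender") == "f") & (pl.col("percent correct") >= 90))

The structure of the query mirrors the structure of the question; no knowledge of the layout beyond the column names is required.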

Objection: But what about time series? Clearly, that can be in matrix format. One column is time, and then subsequent columns are observations made at that time.

Response: Yes, that is true. But then the matrix-style layout described could be considered tidy, since each row is a single observation (a time point) that has many facets.
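
For instance, a small hypothetical time series laid out this way is still tidy: each row is one time point with several measured facets.

# Hypothetical time series; each row is one observation (a time point)
df_time_series = pl.DataFrame(
    {
        "time (s)": [0.0, 1.0, 2.0],
        "temperature (C)": [22.1, 22.3, 22.2],
        "pressure (kPa)": [101.3, 101.2, 101.4],
    }
)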

Objection: Isn’t this an inefficient use of memory? There tend to be lots of repeated entries in tidy data frames.

Response: Yes, there are more efficient ways of storing and accessing data. But for data sets that are not “big data,” this is seldom a real issue. The extra expense in memory, as well as the extra expense in access, are small prices to pay for the simplicity and speed of the human user in accessing the data.

Objection: Once it’s tidy, we pretty much have to use Boolean indexing to get what we want, and that can be slower than other methods of accessing data. What about performance?

Response: See the previous response. Speed of access really only becomes a problem with big, high-throughput data sets. In those cases, there are often many things you need to be clever about beyond organization of your data.

Conclusion: I really think that tidying a data set allows for fluid exploration. We will focus on tidy data sets going forward. Bringing untidy data into tidy format can be an annoying challenge that often involves lots of file I/O and text parsing operations. As such, it is wise to arrange your data in tidy format starting at acquisition. If your instrumentation prevents you from doing so, you should develop functions and scripts to parse its output and convert it into tidy format.

The data set

We will again use the data set from the Beattie, et al. paper on facial matching under sleep deprivation. Let’s load in the original data set and add the column on insomnia as we did in a previous part of this lesson.

[3]:
# data_path is assumed to point to the directory containing the downloaded data set
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = pl.read_csv(fname, null_values="*")
df = df.with_columns((pl.col('sci') <= 16).alias("insomnia"))

# Take a look
df.head()
[3]:
shape: (5, 16)
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | insomnia
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | true
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | true
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | true
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | true
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | true

This data set is in tidy format. Each row represents a single test on a single participant. The aspects of that person’s test are given in each column. We already saw the power of having the data in this format when we did Boolean indexing. Now, we will see how this format allows us to easily do an operation we do again and again with data sets: split-apply-combine.

Split-apply-combine

Let’s say we want to compute the median percent correct face matchings for subjects with insomnia and the median percent correct face matchings for those without. Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?

  1. Split the data set up according to the 'insomnia' field, i.e., split it up so we have a separate data set for the two classes of subjects, those with insomnia and those without.

  2. Apply a median function to the percent correct in these split data sets.

  3. Combine the results of these calculations on the split data sets into a new, summary data set that contains the two classes (insomniac and not) and the median for each.

We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.

Note that if the data are tidy, this procedure makes a lot of sense. Choose the column or columns you want to use to split by. All rows with like entries in the splitting column(s) are then grouped into a new data set. You can then apply any function you want to these new data sets and combine the results into a new data frame.
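
To make the three steps concrete, here is a minimal sketch of doing them by hand with filtering and a dictionary. This is for illustration only; Polars lets us express all three steps in a single operation, described next.

# 1. Split: one sub-data frame per value of 'insomnia'
split = {
    insomnia: df.filter(pl.col("insomnia") == insomnia)
    for insomnia in (True, False)
}

# 2. Apply: compute the median percent correct within each sub-data frame
medians = {
    insomnia: sub_df["percent correct"].median()
    for insomnia, sub_df in split.items()
}

# 3. Combine: assemble the results into a new summary data frame
summary = pl.DataFrame(
    {
        "insomnia": list(medians.keys()),
        "median percent correct": list(medians.values()),
    }
)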

Polars’s split-apply-combine operations are performed in a group by/aggregation context, invoked with df.group_by().agg(). You can think of group_by() as the splitting part. The apply-combine part is done by passing expressions into agg(). The result is a data frame with as many rows as there are groups that we split the data frame into.

Median percent correct

Let’s go ahead and do our first split-apply-combine on this tidy data set. First, we will split the data set up by insomnia condition and then apply a median function.

[4]:
df.group_by('insomnia').median()
[4]:
shape: (2, 16)
insomnia | participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
bool | f64 | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
true | 46.0 | null | 39.0 | 90.0 | 75.0 | 75.0 | 76.5 | 72.0 | 71.0 | 68.5 | 77.0 | 65.0 | 14.0 | 9.0 | 7.0
false | 54.0 | null | 36.0 | 90.0 | 80.0 | 85.0 | 74.5 | 55.5 | 71.5 | 59.0 | 75.0 | 59.25 | 26.0 | 4.0 | 6.0

Note that we used the median() method of the GroupBy object. This computes the median of all columns. It is equivalent to the following, which uses agg() with the expression pl.col('*').median().

[5]:
df.group_by('insomnia').agg(pl.col('*').median())
[5]:
shape: (2, 16)
insomnia | participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
bool | f64 | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
true | 46.0 | null | 39.0 | 90.0 | 75.0 | 75.0 | 76.5 | 72.0 | 71.0 | 68.5 | 77.0 | 65.0 | 14.0 | 9.0 | 7.0
false | 54.0 | null | 36.0 | 90.0 | 80.0 | 85.0 | 74.5 | 55.5 | 71.5 | 59.0 | 75.0 | 59.25 | 26.0 | 4.0 | 6.0

The median doesn’t make sense for gender. If we only wanted to compute medians of quantities for which it makes sense to do so, we could use selectors (below, we also exclude the participant number, for which a median is not meaningful).

[6]:
df.group_by('insomnia').agg(cs.numeric().exclude('participant number').median())
[6]:
shape: (2, 14)
insomnia | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
bool | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
true | 39.0 | 90.0 | 75.0 | 75.0 | 76.5 | 72.0 | 71.0 | 68.5 | 77.0 | 65.0 | 14.0 | 9.0 | 7.0
false | 36.0 | 90.0 | 80.0 | 85.0 | 74.5 | 55.5 | 71.5 | 59.0 | 75.0 | 59.25 | 26.0 | 4.0 | 6.0

If instead we only wanted the median of the percent correct and confidence when correct, we could do the following.

[7]:
(
    df
    .group_by('insomnia')
    .agg(
        pl.col('percent correct', 'confidence when correct').median(),
    )
)
[7]:
shape: (2, 3)
insomnia | percent correct | confidence when correct
bool | f64 | f64
false | 85.0 | 75.0
true | 75.0 | 77.0

We can also use multiple columns in our group_by() operation. For example, we may wish to look at four groups: male insomniacs, female insomniacs, male non-insomniacs, and female non-insomniacs. To do this, we simply pass a list of columns into df.group_by().

[8]:
df.group_by(["gender", "insomnia"]).median()
[8]:
shape: (4, 16)
gender | insomnia | participant number | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
str | bool | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
"f" | true | 46.0 | 39.0 | 80.0 | 75.0 | 72.5 | 76.5 | 73.75 | 71.0 | 68.5 | 77.0 | 70.5 | 14.0 | 9.0 | 7.0
"f" | false | 58.0 | 36.0 | 85.0 | 80.0 | 85.0 | 74.0 | 55.0 | 70.5 | 60.0 | 74.0 | 58.75 | 26.0 | 4.0 | 7.0
"m" | false | 41.0 | 38.5 | 90.0 | 80.0 | 82.5 | 76.0 | 57.75 | 74.25 | 54.75 | 76.25 | 59.25 | 29.0 | 3.0 | 6.0
"m" | true | 55.5 | 37.0 | 95.0 | 82.5 | 83.75 | 83.75 | 55.5 | 75.75 | 73.25 | 81.25 | 62.5 | 14.0 | 9.0 | 8.0

Window functions

In the above examples, we split the data frame into groups and applied an aggregating function that gave us back a data frame with as many rows as groups. But what if we do not want to aggregate? For example, say we want to compute the ranking of each participant according to percent correct within each group of insomniacs and normal sleepers. First, let’s see what happens if we use a group by/aggregation context. When we do the grouping, we will retain the order of the entries so that the series of ranks that we acquire match the original order in the data frame. We will also use the 'ordinal' method for ranking, which gives a distinct rank to each entry even in the event of ties.

[9]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .rank(method='ordinal')
    )
)
[9]:
shape: (2, 2)
insomnia | percent correct
bool | list[u32]
true | [11, 21, … 10]
false | [13, 35, … 7]

We see that we indeed get a data frame with two rows, one for each group. The 'percent correct' column has a new data type, a Polars list (not the same as a Python list). If we want to convert each entry in the list into a new row of the data frame, we can use the explode() method of a data frame.

[10]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .rank(method='ordinal')
    )
    .explode('percent correct')
)
[10]:
shape: (102, 2)
insomnia | percent correct
bool | u32
true | 11
true | 21
true | 23
true | 19
true | 3
… | …
false | 29
false | 57
false | 20
false | 12
false | 7

There are a few annoyances with doing this, most importantly that we have to make a column with a list data type and then explode it.

Here is where window functions come into play. A window function operates only on a subset of the data, ignoring everything outside of that subset. Since we are applying a function (in our example, a rank function) to each subset (group) of the data, we can think of a window function as a group by followed by a calculation whose output has the same number of rows as the input data frame.

Window functions are implemented in Polars expressions using the over() method. The argument(s) of over() specify which columns define the groups. The expression is then evaluated individually for each group. To add a column with rankings based on percent correct within each group (insomniac or regular sleeper), we can do the following.

[11]:
(
    df
    .with_columns(
        pl.col('percent correct')
        .rank(method='ordinal')
        .over('insomnia')
        .alias('percent correct ranked within insomnia groups')
    )
)
[11]:
shape: (102, 17)
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | insomnia | percent correct ranked within insomnia groups
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool | u32
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | true | 11
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | true | 21
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | true | 23
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | true | 19
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | true | 3
…
97 | "f" | 23 | 70 | 85 | 77.5 | 77.0 | 66.5 | 77.0 | 77.5 | 77.0 | 74.0 | 20 | 8 | 10 | false | 29
98 | "f" | 70 | 90 | 85 | 87.5 | 65.5 | 85.5 | 87.0 | 80.0 | 74.0 | 80.0 | 19 | 8 | 7 | false | 57
99 | "f" | 24 | 70 | 80 | 75.0 | 61.5 | 81.0 | 70.0 | 61.0 | 65.0 | 81.0 | 31 | 2 | 15 | false | 20
102 | "f" | 40 | 75 | 65 | 70.0 | 53.0 | 37.0 | 84.0 | 52.0 | 81.0 | 51.0 | 22 | 4 | 7 | false | 12
103 | "f" | 33 | 85 | 40 | 62.5 | 80.0 | 27.0 | 31.0 | 82.5 | 81.0 | 73.0 | 24 | 5 | 7 | false | 7

Window functions and group by operations are similar. Consider computing the median percent correct for each group, insomniacs and normal sleepers. Using a group by/aggregation context, this is accomplished as follows.

[12]:
(
    df
    .group_by('insomnia')
    .agg(
        pl.col('percent correct')
        .median()
        .alias('median percent correct')
    )
)
[12]:
shape: (2, 2)
insomnia | median percent correct
bool | f64
true | 75.0
false | 85.0

We can achieve the same result using a window function within a select context.

[13]:
(
    df
    .select(
        pl.col('insomnia'),
        pl.col('percent correct')
        .median().over('insomnia')
        .alias('median percent correct')
    )
    .unique('median percent correct')
)
[13]:
shape: (2, 2)
insomnia | median percent correct
bool | f64
true | 75.0
false | 85.0

A group by/aggregation context is preferred in this case. As a loose rule of thumb, use a group by/aggregation context when you want a resulting data frame with the number of rows being equal to the number of groups. Use a window function in a selection context when you want a resulting data frame with the same number of rows as your input data frame.

Custom aggregation and window functions

Let’s say we want to compute the coefficient of variation (CoV, the standard deviation divided by the mean) of a column within each group of the data frame. There is no built-in function in Polars to do this, but we can construct an expression that does it.

[14]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        (pl.col('percent correct').std(ddof=0) / pl.col('percent correct').mean())
        .alias('coeff of var percent correct')
    )
)
[14]:
shape: (2, 2)
insomnia | coeff of var percent correct
bool | f64
true | 0.171856
false | 0.138785

In this case, the coefficient of variation is not a complicated calculation, so it fits neatly on one line. We could instead write a function that generates a Polars expression to compute the coefficient of variation of a column.

[15]:
def coeff_of_var_pl(col, ddof=0):
    """Return Polars expression for coefficient of variation
    for a column or columns of data."""
    return col.std(ddof=ddof) / col.mean()


(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        coeff_of_var_pl(pl.col('percent correct'))
        .alias('coeff of var percent correct')
    )
)
[15]:
shape: (2, 2)
insomnia | coeff of var percent correct
bool | f64
true | 0.171856
false | 0.138785

Alternatively (though not preferably, as discussed below), we could write our own Python function to compute the coefficient of variation using Numpy. Say we had such a function lying around that takes as input a Numpy array and returns the coefficient of variation as a scalar.

[16]:
def coeff_of_var(data: np.ndarray) -> float:
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

We can use this function as an aggregating function using the map_elements() method of a Polars expression. The argument of map_elements() is a function that takes a series as input and returns either a scalar or a series. We therefore need to pass in a function that converts the series to a Numpy array for use in the coeff_of_var() function, which we can accomplish with a lambda function.

[17]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .map_elements(lambda s: coeff_of_var(s.to_numpy()), return_dtype=float, returns_scalar=True)
    )
)
[17]:
shape: (2, 2)
insomnia | percent correct
bool | f64
true | 0.171856
false | 0.138785

Note that it is important to supply the data type of the return of the function you are mapping to ensure that Polars correctly assigns data types in the resulting data frame.

If at all possible, use Polars expressions for custom aggregations and window functions. They are cleaner to implement and far more performant, as illustrated in the timed calculations below.

[18]:
%%timeit
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        coeff_of_var_pl(pl.col('percent correct'))
        .alias('coeff of var percent correct')
    )
)
53.2 μs ± 2.49 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
[19]:
%%timeit
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .map_elements(lambda s: coeff_of_var(s.to_numpy()), return_dtype=float, returns_scalar=True)
    )
)
219 μs ± 3.12 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Polars Structs and expressions using multiple columns

As discussed earlier, Polars expressions strictly take a series as input and return a series as output. What if we want to perform a calculation that requires two columns? This is where the Polars Struct data type comes in. As a simple example, if we cast our whole data frame into a series, we get a series with a struct data type.

[20]:
struct_series = pl.Series(df)

# Take a look
struct_series
[20]:
shape: (102,)
struct[16]
{8,"f",39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2,true}
{16,"m",42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7,true}
{18,"f",31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3,true}
{22,"f",35,100,75,87.5,89.5,null,71.0,80.0,88.0,80.0,13,8,20,true}
{27,"f",74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12,true}
{97,"f",23,70,85,77.5,77.0,66.5,77.0,77.5,77.0,74.0,20,8,10,false}
{98,"f",70,90,85,87.5,65.5,85.5,87.0,80.0,74.0,80.0,19,8,7,false}
{99,"f",24,70,80,75.0,61.5,81.0,70.0,61.0,65.0,81.0,31,2,15,false}
{102,"f",40,75,65,70.0,53.0,37.0,84.0,52.0,81.0,51.0,22,4,7,false}
{103,"f",33,85,40,62.5,80.0,27.0,31.0,82.5,81.0,73.0,24,5,7,false}

Each entry is an entire row of the data frame. The label of each column is still present, as can be seen by considering one entry in the series of structs.

[21]:
struct_series[0]
[21]:
{'participant number': 8,
 'gender': 'f',
 'age': 39,
 'correct hit percentage': 65,
 'correct reject percentage': 80,
 'percent correct': 72.5,
 'confidence when correct hit': 91.0,
 'confidence incorrect hit': 90.0,
 'confidence correct reject': 93.0,
 'confidence incorrect reject': 83.5,
 'confidence when correct': 93.0,
 'confidence when incorrect': 90.0,
 'sci': 9,
 'psqi': 13,
 'ess': 2,
 'insomnia': True}

The labels for each entry in a struct (the keys in the dictionary displayed above) are called fields, and we can get a list of the fields for the structs in a series.

[22]:
struct_series.struct.fields
[22]:
['participant number',
 'gender',
 'age',
 'correct hit percentage',
 'correct reject percentage',
 'percent correct',
 'confidence when correct hit',
 'confidence incorrect hit',
 'confidence correct reject',
 'confidence incorrect reject',
 'confidence when correct',
 'confidence when incorrect',
 'sci',
 'psqi',
 'ess',
 'insomnia']

The struct_series.struct.field() method allows us to pull out the values of a given field from all entries.

[23]:
struct_series.struct.field('percent correct')
[23]:
shape: (102,)
percent correct
f64
72.5
90.0
92.5
87.5
62.5
77.5
87.5
75.0
70.0
62.5

The following two operations give the same result.

pl.Series(df).struct.field('percent correct')
df.get_column('percent correct')
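
If you want to convince yourself of this, a quick sanity check is to compare the two series element by element with Series.equals().

# Verify the two approaches give the same series
pl.Series(df).struct.field('percent correct').equals(df.get_column('percent correct'))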

So, if we need to compute with multiple columns within a single expression, we can convert whatever columns we need into a series of data type struct. We can then unpack what we need using the struct_series.struct.field() method.

As an example, say we have a function that computes the bivariate (a.k.a. Pearson) correlation coefficient between two data sets given as Numpy arrays.

[24]:
def bivariate_corr(x: np.ndarray, y: np.ndarray) -> float:
    """Compute bivariate correlation coefficient for `x` and `y`.
    Ignores NaNs."""
    # Masked arrays to handle NaNs
    x = np.ma.masked_invalid(x)
    y = np.ma.masked_invalid(y)
    mask = ~x.mask & ~y.mask

    return np.corrcoef(x[mask], y[mask])[0, 1]

Now we want to compute the bivariate correlation coefficient for confidence when correct and confidence when incorrect for insomniacs and for normal sleepers. In our call to agg(), we use pl.struct to generate a series of data type struct containing the columns we need for the calculation of the correlation coefficient. We then use map_elements() to use the above function to do the calculation.

[25]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.struct(['confidence when correct', 'confidence when incorrect'])
        .map_elements(
            lambda s:
            bivariate_corr(
                s.struct.field('confidence when correct').to_numpy(),
                s.struct.field('confidence when incorrect').to_numpy()
            ),
           return_dtype=float, returns_scalar=True
        )
        .alias('bivariate correlation')
    )
)
[25]:
shape: (2, 2)
insomnia | bivariate correlation
bool | f64
true | 0.590435
false | 0.552045

While this example is instructive to demonstrate how to write your own functions to operate on data frames, as is often the case, Polars has a built-in function that computes the bivariate correlation coefficient.

[26]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(pl.corr('confidence when correct', 'confidence when incorrect'))
)
[26]:
shape: (2, 2)
insomnia | confidence when correct
bool | f64
true | 0.590435
false | 0.552045

Looping over a GroupBy object

We sometimes want to loop over the groups of a GroupBy object. This often comes up in plotting applications, as we will see in future lessons. As an example, I will compute the median percent correct for females and males, insomniacs and not (which we already computed above using group_by() and median()). In the cell below, note that I used the maintain_order=True kwarg to ensure that the order in which I evaluate the groups is the order in which they first appear in the data frame. This can result in a performance hit for larger data sets, since it restricts how Polars can parallelize the calculation, but it is useful for maintaining a set order in plotting applications in particular.

[27]:
for group_name, sub_df in df.group_by(['gender', 'insomnia'], maintain_order=True):
    print(group_name, ": ", sub_df["percent correct"].median())
('f', True) :  72.5
('m', True) :  83.75
('f', False) :  85.0
('m', False) :  82.5

By using the GroupBy object as an iterator, it yields the name of the group (which I assigned as group_name) and the corresponding sub-data frame (which I assigned sub_df). Note that the group name is always given as a tuple, even when only grouping by a single column.

[28]:
for (group_name,), sub_df in df.group_by('gender', maintain_order=True):
    print(group_name, ": ", sub_df["percent correct"].median())
f :  83.75
m :  83.75

Computing environment

[29]:
%load_ext watermark
%watermark -v -p numpy,polars,jupyterlab
Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

numpy     : 1.26.4
polars    : 1.8.1
jupyterlab: 4.2.5