Tidy data and split-apply-combine

Data set download


[2]:
import os

import numpy as np
import polars as pl
import polars.selectors as cs

We have dipped our toe into Polars to see its power. In this lesson, we will continue to harness the power of Polars to pull out subsets of data we are interested in, and of vital importance, will introduce the concept of tidy data. I suspect this will be a demarcation in your life. You will have the times in your life before tidy data and after. Welcome to your bright tomorrow.

Tidy data

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data frames follow the rules:

  1. Each variable is a column.

  2. Each observation is a row.

  3. Each type of observation has its own separate data frame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.
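
To make the distinction concrete, here is a minimal sketch (with made-up column names and values, not the data set of this lesson) of converting an untidy, wide-format table into tidy format using Polars’s unpivot() method.

# Hypothetical untidy table: one row per subject, one column per trial
df_untidy = pl.DataFrame(
    {
        "subject": ["A", "B"],
        "trial 1": [72.5, 90.0],
        "trial 2": [87.5, 62.5],
    }
)

# Tidy version: each row is a single observation (subject, trial, score)
df_tidy = df_untidy.unpivot(index="subject", variable_name="trial", value_name="score")

In the tidy version, 'subject', 'trial', and 'score' are each a column, and each row records exactly one measurement.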

You may raise some objections about tidy data. Here are a few, and my responses.

Objection: Looking at a table of tidy data is ugly. It is not intuitively organized. I would almost never display a tidy data table in a publication.

Response: Correct! Having tabular data in a format that is easy to read as a human studying a table is a very different thing than having it in a format that is easy to explore and work with using a computer. As Daniel Chen put it, “There are data formats that are better for reporting and data formats that are better for analysis.” We are using the tidy data frames for analysis, not reporting (though we will see in the coming lessons that having the data in a tidy format makes making plots much easier, and plots are a key medium for reporting.)

Objection: Isn’t it better to sometimes have data arranged in other ways? Say in a matrix?

Response: This is certainly true for things like images, or raster-style data in general. It makes more sense to organize an image in a 2D matrix than to have it organized as a data frame with three columns (row in image, column in image, intensity of pixel), where each row corresponds to a single pixel. For an image, indexing it by row and column is always unambiguous: my_image[i, j] means the pixel at row i and column j.

For other data, though, the matrix layout suffers from the fact that there may be more than one way to construct a matrix. If you know a data frame is tidy, you already know its structure. You need only to ask what the columns are, and then you immediately know how to access data using Boolean indexing. In other formats, you might have to read and write extensive comments to understand the structure of the data. Of course, you can read and write comments, but it opens the door for the possibility of misinterpretation or mistakes.
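
As a reminder of what this kind of access looks like (using column names from the data set we load later in this lesson), a Boolean-indexing one-liner on a tidy data frame might read:

# All records for female participants who scored at least 90% correct
df.filter((pl.col("gender") == "f") & (pl.col("percent correct") >= 90))

The structure of the query mirrors the structure of the question; no knowledge of the layout beyond the column names is required.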

Objection: But what about time series? Clearly, that can be in matrix format. One column is time, and then subsequent columns are observations made at that time.

Response: Yes, that is true. But then the matrix-style layout described could be considered tidy, since each row is a single observation (a time point) that has many facets.
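
For instance, a small hypothetical time series laid out this way is still tidy: each row is one time point with several measured facets.

# Hypothetical time series; each row is one observation (a time point)
df_time_series = pl.DataFrame(
    {
        "time (s)": [0.0, 1.0, 2.0],
        "temperature (C)": [22.1, 22.3, 22.2],
        "pressure (kPa)": [101.3, 101.2, 101.4],
    }
)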

Objection: Isn’t this an inefficient use of memory? There tend to be lots of repeated entries in tidy data frames.

Response: Yes, there are more efficient ways of storing and accessing data. But for data sets that are not “big data,” this is seldom a real issue. The extra expense in memory, as well as the extra expense in access, are small prices to pay for the simplicity and speed of the human user in accessing the data.

Objection: Once it’s tidy, we pretty much have to use Boolean indexing to get what we want, and that can be slower than other methods of accessing data. What about performance?

Response: See the previous response. Speed of access really only becomes a problem with big, high-throughput data sets. In those cases, there are often many things you need to be clever about beyond organization of your data.

Conclusion: I really think that tidying a data set allows for fluid exploration. We will focus on tidy data sets going forward. Bringing untidy data into tidy format can be an annoying challenge that often involves lots of file I/O and text parsing operations. As such, it is wise to arrange your data in tidy format starting at acquisition. If your instrumentation prevents you from doing so, you should develop functions and scripts to parse its output and convert it into tidy format.

The data set

We will again use the data set from the Beattie, et al. paper on facial matching under sleep deprivation. Let’s load in the original data set and add the column on insomnia as we did in a previous part of this lesson.

[3]:
# data_path is assumed to point to the directory containing the downloaded data set
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = pl.read_csv(fname, null_values="*")
df = df.with_columns((pl.col('sci') <= 16).alias("insomnia"))

# Take a look
df.head()
[3]:
shape: (5, 16)
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | insomnia
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | true
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | true
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | true
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | true
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | true

This data set is in tidy format. Each row represents a single test on a single participant. The aspects of that person’s test are given in each column. We already saw the power of having the data in this format when we did Boolean indexing. Now, we will see how this format allows us to easily do an operation we do again and again with data sets: split-apply-combine.

Split-apply-combine

Let’s say we want to compute the median percent correct face matchings for subjects with insomnia and the median percent correct face matchings for those without. Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?

  1. Split the data set up according to the 'insomnia' field, i.e., split it up so we have a separate data set for the two classes of subjects, those with insomnia and those without.

  2. Apply a median function to the percent correct in these split data sets.

  3. Combine the results of these calculations on the split data sets into a new, summary data set that contains the two classes (insomniac and not) and the median for each.

We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.

Note that if the data are tidy, this procedure makes a lot of sense. Choose the column or columns you want to use to split by. All rows with like entries in the splitting column(s) are then grouped into a new data set. You can then apply any function you want to these new data sets and combine the results into a new data frame.
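
To make the three steps concrete, here is a minimal sketch of doing them by hand with filtering and a dictionary. This is for illustration only; Polars lets us express all three steps in a single operation, described next.

# 1. Split: one sub-data frame per value of 'insomnia'
split = {
    insomnia: df.filter(pl.col("insomnia") == insomnia)
    for insomnia in (True, False)
}

# 2. Apply: compute the median percent correct within each sub-data frame
medians = {
    insomnia: sub_df["percent correct"].median()
    for insomnia, sub_df in split.items()
}

# 3. Combine: assemble the results into a new summary data frame
summary = pl.DataFrame(
    {
        "insomnia": list(medians.keys()),
        "median percent correct": list(medians.values()),
    }
)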

Polars’s split-apply-combine operations are performed in a group by/aggregation context, invoked with df.group_by().agg(). You can think of group_by() as the splitting part. The apply-combine part is done by passing expressions into agg(). The result is a data frame with as many rows as there are groups that we split the data frame into.

Median percent correct

Let’s go ahead and do our first split-apply-combine on this tidy data set. First, we will split the data set up by insomnia condition and then apply a median function.

[4]:
df.group_by('insomnia').median()
[4]:
shape: (2, 16)
insomnia | participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
bool | f64 | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
true | 46.0 | null | 39.0 | 90.0 | 75.0 | 75.0 | 76.5 | 72.0 | 71.0 | 68.5 | 77.0 | 65.0 | 14.0 | 9.0 | 7.0
false | 54.0 | null | 36.0 | 90.0 | 80.0 | 85.0 | 74.5 | 55.5 | 71.5 | 59.0 | 75.0 | 59.25 | 26.0 | 4.0 | 6.0

Note that we used the median() method of the GroupBy object. This computes the median of all columns. It is equivalent to the following, which uses agg() with the expression pl.col('*').median().

[5]:
df.group_by('insomnia').agg(pl.col('*').median())
[5]:
shape: (2, 16)
insomnia | participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
bool | f64 | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
true | 46.0 | null | 39.0 | 90.0 | 75.0 | 75.0 | 76.5 | 72.0 | 71.0 | 68.5 | 77.0 | 65.0 | 14.0 | 9.0 | 7.0
false | 54.0 | null | 36.0 | 90.0 | 80.0 | 85.0 | 74.5 | 55.5 | 71.5 | 59.0 | 75.0 | 59.25 | 26.0 | 4.0 | 6.0

The median doesn’t make sense for gender. If we only wanted to compute medians of quantities for which it makes sense to do so, we could use selectors (below, we also exclude the participant number, for which a median is not meaningful).

[6]:
df.group_by('insomnia').agg(cs.numeric().exclude('participant number').median())
[6]:
shape: (2, 14)
insomnia | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
bool | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
true | 39.0 | 90.0 | 75.0 | 75.0 | 76.5 | 72.0 | 71.0 | 68.5 | 77.0 | 65.0 | 14.0 | 9.0 | 7.0
false | 36.0 | 90.0 | 80.0 | 85.0 | 74.5 | 55.5 | 71.5 | 59.0 | 75.0 | 59.25 | 26.0 | 4.0 | 6.0

If instead we only wanted the median of the percent correct and confidence when correct, we could do the following.

[7]:
(
    df
    .group_by('insomnia')
    .agg(
        pl.col('percent correct', 'confidence when correct').median(),
    )
)
[7]:
shape: (2, 3)
insomnia | percent correct | confidence when correct
bool | f64 | f64
false | 85.0 | 75.0
true | 75.0 | 77.0

We can also use multiple columns in our group_by() operation. For example, we may wish to look at four groups: male insomniacs, female insomniacs, male non-insomniacs, and female non-insomniacs. To do this, we simply pass a list of columns into df.group_by().

[8]:
df.group_by(["gender", "insomnia"]).median()
[8]:
shape: (4, 16)
gender | insomnia | participant number | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess
str | bool | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64
"f" | true | 46.0 | 39.0 | 80.0 | 75.0 | 72.5 | 76.5 | 73.75 | 71.0 | 68.5 | 77.0 | 70.5 | 14.0 | 9.0 | 7.0
"f" | false | 58.0 | 36.0 | 85.0 | 80.0 | 85.0 | 74.0 | 55.0 | 70.5 | 60.0 | 74.0 | 58.75 | 26.0 | 4.0 | 7.0
"m" | false | 41.0 | 38.5 | 90.0 | 80.0 | 82.5 | 76.0 | 57.75 | 74.25 | 54.75 | 76.25 | 59.25 | 29.0 | 3.0 | 6.0
"m" | true | 55.5 | 37.0 | 95.0 | 82.5 | 83.75 | 83.75 | 55.5 | 75.75 | 73.25 | 81.25 | 62.5 | 14.0 | 9.0 | 8.0

Window functions

In the above examples, we split the data frame into groups and applied an aggregating function that gave us back a data frame with as many rows as groups. But what if we do not want to aggregate? For example, say we want to compute the ranking of each participant according to percent correct within each group of insomniacs and normal sleepers. First, let’s see what happens if we use a group by/aggregation context. When we do the grouping, we will retain the order of the entries so that the series of ranks that we acquire match the original order in the data frame. We will also use the 'ordinal' method for ranking, which gives a distinct rank to each entry even in the event of ties.

[9]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .rank(method='ordinal')
    )
)
[9]:
shape: (2, 2)
insomnia | percent correct
bool | list[u32]
true | [11, 21, … 10]
false | [13, 35, … 7]

We see that we indeed get a data frame with two rows, one for each group. The 'percent correct' column has a new data type, a Polars list (not the same as a Python list). If we want to convert each entry in the list into a new row of the data frame, we can use the explode() method of a data frame.

[10]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .rank(method='ordinal')
    )
    .explode('percent correct')
)
[10]:
shape: (102, 2)
insomnia | percent correct
bool | u32
true | 11
true | 21
true | 23
true | 19
true | 3
… | …
false | 29
false | 57
false | 20
false | 12
false | 7

There are a few annoyances with doing this, most importantly that we have to make a column with a list data type and then explode it.

Here is where window functions come into play. A window function operates only on a subset of the data, ignoring everything outside of that subset. Since we are applying a function (in our example, a rank function) to each subset (group) of the data, we can think of a window function as a group by followed by a calculation whose output has the same number of rows as the input data frame.

Window functions are implemented in Polars expressions using the over() method. The argument(s) of over() specify which columns define the groups. The expression is then evaluated individually for each group. To add a column with rankings based on percent correct within each group (insomniac or regular sleeper), we can do the following.

[11]:
(
    df
    .with_columns(
        pl.col('percent correct')
        .rank(method='ordinal')
        .over('insomnia')
        .alias('percent correct ranked within insomnia groups')
    )
)
[11]:
shape: (102, 17)
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | insomnia | percent correct ranked within insomnia groups
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool | u32
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | true | 11
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | true | 21
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | true | 23
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | true | 19
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | true | 3
…
97 | "f" | 23 | 70 | 85 | 77.5 | 77.0 | 66.5 | 77.0 | 77.5 | 77.0 | 74.0 | 20 | 8 | 10 | false | 29
98 | "f" | 70 | 90 | 85 | 87.5 | 65.5 | 85.5 | 87.0 | 80.0 | 74.0 | 80.0 | 19 | 8 | 7 | false | 57
99 | "f" | 24 | 70 | 80 | 75.0 | 61.5 | 81.0 | 70.0 | 61.0 | 65.0 | 81.0 | 31 | 2 | 15 | false | 20
102 | "f" | 40 | 75 | 65 | 70.0 | 53.0 | 37.0 | 84.0 | 52.0 | 81.0 | 51.0 | 22 | 4 | 7 | false | 12
103 | "f" | 33 | 85 | 40 | 62.5 | 80.0 | 27.0 | 31.0 | 82.5 | 81.0 | 73.0 | 24 | 5 | 7 | false | 7

Window functions and group by operations are similar. Consider computing the median percent correct for each group, insomniacs and normal sleepers. Using a group by/aggregation context, this is accomplished as follows.

[12]:
(
    df
    .group_by('insomnia')
    .agg(
        pl.col('percent correct')
        .median()
        .alias('median percent correct')
    )
)
[12]:
shape: (2, 2)
insomnia | median percent correct
bool | f64
true | 75.0
false | 85.0

We can achieve the same result using a window function within a select context.

[13]:
(
    df
    .select(
        pl.col('insomnia'),
        pl.col('percent correct')
        .median().over('insomnia')
        .alias('median percent correct')
    )
    .unique('median percent correct')
)
[13]:
shape: (2, 2)
insomnia | median percent correct
bool | f64
true | 75.0
false | 85.0

A group by/aggregation context is preferred in this case. As a loose rule of thumb, use a group by/aggregation context when you want a resulting data frame with the number of rows being equal to the number of groups. Use a window function in a selection context when you want a resulting data frame with the same number of rows as your input data frame.

Custom aggregation and window functions

Let’s say we want to compute the coefficient of variation (CoV, the standard deviation divided by the mean) of a column within each group of the data frame. There is no built-in function in Polars to do this, but we can construct an expression that does it.

[14]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        (pl.col('percent correct').std(ddof=0) / pl.col('percent correct').mean())
        .alias('coeff of var percent correct')
    )
)
[14]:
shape: (2, 2)
insomnia | coeff of var percent correct
bool | f64
true | 0.171856
false | 0.138785

In this case, the coefficient of variation is not a complicated calculation, so it fits neatly on one line. We could instead write a function that generates a Polars expression to compute the coefficient of variation of a column.

[15]:
def coeff_of_var_pl(col, ddof=0):
    """Return Polars expression for coefficient of variation
    for a column or columns of data."""
    return col.std(ddof=ddof) / col.mean()


(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        coeff_of_var_pl(pl.col('percent correct'))
        .alias('coeff of var percent correct')
    )
)
[15]:
shape: (2, 2)
insomnia | coeff of var percent correct
bool | f64
true | 0.171856
false | 0.138785

Alternatively (though not preferably, as discussed below), we could write our own Python function to compute the coefficient of variation using Numpy. Say we had such a function lying around that takes as input a Numpy array and returns the coefficient of variation as a scalar.

[16]:
def coeff_of_var(data: np.ndarray) -> float:
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

We can use this function as an aggregating function using the map_elements() method of a Polars expression. The argument of map_elements() is a function that takes a series as input and returns either a scalar or a series. We therefore need to pass in a function that converts the series to a Numpy array for use in the coeff_of_var() function, which we can accomplish with a lambda function.

[17]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .map_elements(lambda s: coeff_of_var(s.to_numpy()), return_dtype=float, returns_scalar=True)
    )
)
[17]:
shape: (2, 2)
insomnia | percent correct
bool | f64
true | 0.171856
false | 0.138785

Note that it is important to supply the data type of the return of the function you are mapping to ensure that Polars correctly assigns data types in the resulting data frame.

If at all possible, use Polars expressions for custom aggregations and window functions. They are cleaner to implement and far more performant, as illustrated in the timed calculations below.

[18]:
%%timeit
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        coeff_of_var_pl(pl.col('percent correct'))
        .alias('coeff of var percent correct')
    )
)
53.2 μs ± 2.49 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
[19]:
%%timeit
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.col('percent correct')
        .map_elements(lambda s: coeff_of_var(s.to_numpy()), return_dtype=float, returns_scalar=True)
    )
)
219 μs ± 3.12 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Polars Structs and expressions using multiple columns

As discussed earlier, Polars expressions strictly take a series as input and return a series as output. What if we want to perform a calculation that requires two columns? This is where the Polars Struct data type comes in. As a simple example, if we cast our whole data frame into a series, we get a series with a struct data type.

[20]:
struct_series = pl.Series(df)

# Take a look
struct_series
[20]:
shape: (102,)
struct[16]
{8,"f",39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2,true}
{16,"m",42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7,true}
{18,"f",31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3,true}
{22,"f",35,100,75,87.5,89.5,null,71.0,80.0,88.0,80.0,13,8,20,true}
{27,"f",74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12,true}
{97,"f",23,70,85,77.5,77.0,66.5,77.0,77.5,77.0,74.0,20,8,10,false}
{98,"f",70,90,85,87.5,65.5,85.5,87.0,80.0,74.0,80.0,19,8,7,false}
{99,"f",24,70,80,75.0,61.5,81.0,70.0,61.0,65.0,81.0,31,2,15,false}
{102,"f",40,75,65,70.0,53.0,37.0,84.0,52.0,81.0,51.0,22,4,7,false}
{103,"f",33,85,40,62.5,80.0,27.0,31.0,82.5,81.0,73.0,24,5,7,false}

Each entry is an entire row of the data frame. The label of each column is still present, as can be seen by considering one entry in the series of structs.

[21]:
struct_series[0]
[21]:
{'participant number': 8,
 'gender': 'f',
 'age': 39,
 'correct hit percentage': 65,
 'correct reject percentage': 80,
 'percent correct': 72.5,
 'confidence when correct hit': 91.0,
 'confidence incorrect hit': 90.0,
 'confidence correct reject': 93.0,
 'confidence incorrect reject': 83.5,
 'confidence when correct': 93.0,
 'confidence when incorrect': 90.0,
 'sci': 9,
 'psqi': 13,
 'ess': 2,
 'insomnia': True}

The labels for each entry in a struct (the keys in the dictionary displayed above) are called fields, and we can get a list of the fields for the structs in a series.

[22]:
struct_series.struct.fields
[22]:
['participant number',
 'gender',
 'age',
 'correct hit percentage',
 'correct reject percentage',
 'percent correct',
 'confidence when correct hit',
 'confidence incorrect hit',
 'confidence correct reject',
 'confidence incorrect reject',
 'confidence when correct',
 'confidence when incorrect',
 'sci',
 'psqi',
 'ess',
 'insomnia']

The struct_series.struct.field() method allows us to pull out the values of a given field from all entries.

[23]:
struct_series.struct.field('percent correct')
[23]:
shape: (102,)
percent correct
f64
72.5
90.0
92.5
87.5
62.5
77.5
87.5
75.0
70.0
62.5

The following two operations give the same result.

pl.Series(df).struct.field('percent correct')
df.get_column('percent correct')
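
If you want to convince yourself of this, a quick sanity check is to compare the two series element by element with Series.equals().

# Verify the two approaches give the same series
pl.Series(df).struct.field('percent correct').equals(df.get_column('percent correct'))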

So, if we need to compute with multiple columns within a single expression, we can convert whatever columns we need into a series of data type struct. We can then unpack what we need using the struct_series.struct.field() method.

As an example, say we have a function that computes the bivariate (a.k.a. Pearson) correlation coefficient between two data sets given as Numpy arrays.

[24]:
def bivariate_corr(x: np.ndarray, y: np.ndarray) -> float:
    """Compute bivariate correlation coefficient for `x` and `y`.
    Ignores NaNs."""
    # Masked arrays to handle NaNs
    x = np.ma.masked_invalid(x)
    y = np.ma.masked_invalid(y)
    mask = ~x.mask & ~y.mask

    return np.corrcoef(x[mask], y[mask])[0, 1]

Now we want to compute the bivariate correlation coefficient for confidence when correct and confidence when incorrect for insomniacs and for normal sleepers. In our call to agg(), we use pl.struct to generate a series of data type struct containing the columns we need for the calculation of the correlation coefficient. We then use map_elements() to use the above function to do the calculation.

[25]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(
        pl.struct(['confidence when correct', 'confidence when incorrect'])
        .map_elements(
            lambda s:
            bivariate_corr(
                s.struct.field('confidence when correct').to_numpy(),
                s.struct.field('confidence when incorrect').to_numpy()
            ),
           return_dtype=float, returns_scalar=True
        )
        .alias('bivariate correlation')
    )
)
[25]:
shape: (2, 2)
insomnia | bivariate correlation
bool | f64
true | 0.590435
false | 0.552045

While this example is instructive to demonstrate how to write your own functions to operate on data frames, as is often the case, Polars has a built-in function that computes the bivariate correlation coefficient.

[26]:
(
    df
    .group_by('insomnia', maintain_order=True)
    .agg(pl.corr('confidence when correct', 'confidence when incorrect'))
)
[26]:
shape: (2, 2)
insomnia | confidence when correct
bool | f64
true | 0.590435
false | 0.552045

Looping over a GroupBy object

We sometimes want to loop over the groups of a GroupBy object. This often comes up in plotting applications, as we will see in future lessons. As an example, I will compute the median percent correct for females and males, insomniacs and not (which we already computed above using group_by() and median()). In the cell below, note that I used the maintain_order=True kwarg to ensure that the order in which I evaluate the groups is the order in which they first appear in the data frame. This can result in a performance hit for larger data sets, since it restricts how Polars can parallelize the calculation, but it is useful for maintaining a set order in plotting applications in particular.

[27]:
for group_name, sub_df in df.group_by(['gender', 'insomnia'], maintain_order=True):
    print(group_name, ": ", sub_df["percent correct"].median())
('f', True) :  72.5
('m', True) :  83.75
('f', False) :  85.0
('m', False) :  82.5

By using the GroupBy object as an iterator, it yields the name of the group (which I assigned as group_name) and the corresponding sub-data frame (which I assigned sub_df). Note that the group name is always given as a tuple, even when only grouping by a single column.

[28]:
for (group_name,), sub_df in df.group_by('gender', maintain_order=True):
    print(group_name, ": ", sub_df["percent correct"].median())
f :  83.75
m :  83.75

Computing environment

[29]:
%load_ext watermark
%watermark -v -p numpy,polars,jupyterlab
Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

numpy     : 1.26.4
polars    : 1.8.1
jupyterlab: 4.2.5