Introduction to data frames
[2]:
import os

import numpy as np
import polars as pl
import polars.selectors as cs
Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. Once the data are in a usable form, you are empowered to rapidly make graphics and perform statistical inference. Tidy data is an important format, and we will discuss it in subsequent sections of this lesson. In an ideal world, data sets would be stored in tidy format, ready to use. In practice, data come in lots of formats, and you may have to spend much of your time wrangling them into a usable form. Wrangling is the topic of the next lesson; for now, all data sets will be in tidy format from the get-go.
The data set
We will explore using data frames with a real data set, published in Beattie, et al., Perceptual impairment in face identification with poor sleep, Royal Society Open Science, 3, 160321, 2016. In this paper, researchers used the Glasgow Face Matching Test (GFMT) to investigate how sleep deprivation affects a subject’s ability to match faces, as well as the confidence the subject has in those matches. Briefly, the test works by having subjects look at a pair of faces. Two such pairs are shown below.
The top two pictures are of the same person; the bottom two pictures are of different people. For each pair of faces, the subject takes as much time as they need and then says whether or not the two pictures show the same person. The subject then rates their confidence in the choice.
In this study, subjects also took surveys to determine properties about their sleep. The Sleep Condition Indicator (SCI) is a measure of insomnia disorder over the past month (scores of 16 and below indicate insomnia). The Pittsburgh Sleep Quality Index (PSQI) quantifies how well a subject sleeps in terms of interruptions, latency, etc. A higher score indicates poorer sleep. The Epworth Sleepiness Scale (ESS) assesses daytime drowsiness.
The data set can be downloaded here. The contents of this file were adapted from the Excel file posted on the public Dryad repository. (Take note: if you want other people to use and explore your data, make it publicly available.) I’ll say it more boldly.
If at all possible, share your data freely.
The data file is a CSV file, where CSV stands for comma-separated values. This is a text file that is easily read into data structures in many programming languages. You should generally store your data in such a format: not necessarily CSV, but a format that is open, has a well-defined specification, and is readable in many contexts. Excel files do not meet these criteria. Neither do .mat files. There are other good ways to store data, such as JSON, but we will almost exclusively use CSV files in this class.
Let’s take a look at the CSV file. We will use the command line program head to look at the first ten lines of the file.
[3]:
# data_path is assumed to be defined earlier, pointing to the directory holding the data sets
fname = os.path.join(data_path, "gfmt_sleep.csv")

# This will not work in Colab because the file is not local
!head {fname}
participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
8,f,39,65,80,72.5,91,90,93,83.5,93,90,9,13,2
16,m,42,90,90,90,75.5,55.5,70.5,50,75,50,4,11,7
18,f,31,90,95,92.5,89.5,90,86,81,89,88,10,9,3
22,f,35,100,75,87.5,89.5,*,71,80,88,80,13,8,20
27,f,74,60,65,62.5,68.5,49,61,49,65,49,13,9,12
28,f,61,80,20,50,71,63,31,72.5,64.5,70.5,15,14,2
30,m,32,90,75,82.5,67,56.5,66,65,66,64,16,9,3
33,m,62,45,90,67.5,54,37,65,81.5,62,61,14,9,9
34,f,33,80,100,90,70.5,76.5,64.5,*,68,76.5,14,12,10
The first line contains the headers for each column: participant number, gender, age, etc. The data follow. There are two important things to note here. First, notice that the gender column has string data (m or f), while the rest of the data are numeric. Second, note that there are some missing data, denoted by the *s in the file.
Given the file I/O skills you recently learned, you could write some functions to parse this file and extract the data you want. You can imagine that this might be kind of painful. However, if the file format is nice and clean, like we more or less have here, we can use pre-built tools to read in the data from the file and put it in a convenient data structure. Those structures are data frames.
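To appreciate what these pre-built tools buy us, here is a minimal sketch of what hand-rolled parsing might look like, assuming we only want a single numeric column and that missing entries are marked with *. (The parse_column() helper is hypothetical, purely for illustration; we will not use it.)

# Hypothetical hand-rolled parser for a single numeric column (sketch only)
def parse_column(fname, colname):
    """Extract one column of a CSV file as a list of floats (None where '*')."""
    with open(fname, "r") as f:
        # The first line holds the column headers
        headers = f.readline().strip().split(",")
        idx = headers.index(colname)

        column = []
        for line in f:
            value = line.strip().split(",")[idx]
            column.append(None if value == "*" else float(value))

    return column

# First five entries of the 'percent correct' column
parse_column(fname, "percent correct")[:5]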
Data frames
Though we will use the word “data frame” over and over again, a precise definition of a data frame is actually a bit elusive. Our working definition is that a data frame is a representation of two-dimensional tabular data where each column has a label. We will restrict ourselves to the case where each column has a specific data type (e.g., strings, ints, floats, or even lists).
One can think of a data frame as a collection of labeled columns, each one called a series. (A series may be thought of as a single column of data.) Alternatively, it is sometimes convenient to think of a data frame as a collection of rows, where each entry in the row is labeled with the column heading.
For more reading on the history of data frames and an attempt (in my opinion a very good attempt) at clearly defining them see section 4 of this paper by Petersohn, et al.
Pandas
Pandas is one of the most widely used tools in the Python ecosystem for handling data, and it is worth knowing about. We will, however, not be using Pandas, but instead will use Polars. I prefer Polars because, in my opinion, its API is cleaner; it has the added benefit of generally being much faster than Pandas.
Loading the data
Using polars.read_csv() to read in data
We have imported Polars with the alias pl, as is customary. We will use pl.read_csv() to load the data set. The data are stored in a data frame (data type DataFrame), which is one of the data types that makes Polars so convenient for data analysis. Data frames offer mixed data types across columns, including incomplete columns, and convenient slicing, among many, many other convenient features. We will use the data frame to look at the data, at the same time demonstrating some of the power of data frames. They are like spreadsheets, only a lot better.
[4]:
df = pl.read_csv(fname, null_values="*")
Notice that we used the kwarg null_values="*" to specify that entries marked with a * are missing. The resulting data frame is populated with null wherever this character is present in the file.
If you check out the doc string for pl.read_csv(), you will see there are lots of options for reading in the data.
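As a hedged sketch of a few of those options (the values here are illustrative; aside from null_values, we do not need any of them for this data set):

# A sketch of some pl.read_csv() options (illustrative values)
df_demo = pl.read_csv(
    fname,
    null_values="*",   # strings to interpret as missing data
    separator=",",     # field delimiter (comma is the default)
    has_header=True,   # first row holds column names (the default)
    n_rows=10,         # read only the first ten rows
)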
Exploring the DataFrame
Let’s jump right in and look at the contents of the DataFrame. We can look at the first several rows using the df.head() method.
[5]:
# Look at the contents (first 5 rows)
df.head()
[5]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
We see that the column headings were automatically assigned, as were the data types of the columns, where i64, f64, and str respectively denote integers, floats, and strings. Also note (in row 3) that the missing data are denoted as null.
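If you want the column names and data types programmatically rather than reading them off the display, a quick check:

# Column names as a list of strings
print(df.columns)

# The schema maps each column name to its data type
df.schema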
Indexing data frames
The data frame is a convenient data structure for many reasons that will become clear as we start exploring. Let’s start by looking at how data frames are indexed. The rows in Polars data frames are indexed by integers, starting with zero as usual for Python. So, the first row of the data frame may be accessed as follows.
[6]:
df[0]
[6]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
In practice you will almost never use row indices, but rather Boolean indexing, which is accomplished using Polars’s filter() method.
Because row indices in Polars data frames are always integers and column indices are not allowed to be integers (they must be strings), columns may be accessed with the same square-bracket indexing without ambiguity. If you index with a string, Polars knows you are asking for a column.
[7]:
df['percent correct']
[7]:
percent correct |
---|
f64 |
72.5 |
90.0 |
92.5 |
87.5 |
62.5 |
… |
77.5 |
87.5 |
75.0 |
70.0 |
62.5 |
For accessing a single column, I prefer the get_column() method.
[8]:
df.get_column('percent correct')
[8]:
percent correct |
---|
f64 |
72.5 |
90.0 |
92.5 |
87.5 |
62.5 |
… |
77.5 |
87.5 |
75.0 |
70.0 |
62.5 |
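Note that both df['percent correct'] and df.get_column('percent correct') return a Series, not a single-column data frame. A quick check:

# Confirm that a single-column lookup gives a Series
type(df.get_column('percent correct'))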
Boolean indexing of data frames
Let’s say I wanted the record for participant number 42. I can use Boolean indexing to specify the row. This is accomplished using the filter() method.
[9]:
df.filter(pl.col("participant number") == 42)
[9]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
42 | "m" | 29 | 100 | 70 | 85.0 | 75.0 | null | 64.5 | 43.0 | 74.0 | 43.0 | 32 | 1 | 6 |
The argument of the filter() method is an expression, pl.col('participant number') == 42, which gives the rows (in this case, one row) for which the value of the 'participant number' column is 42.
If I just wanted the percent correct, I can first filter to get the row I want, then extract the 'percent correct' column, and then use the item() method to extract the scalar value.
[10]:
(
df
.filter(pl.col('participant number') == 42)
.get_column('percent correct')
.item()
)
[10]:
85.0
Note how I expressed this code snippet stylistically. I am doing method chaining, and putting each method on its own line improves readability.
Now, let’s pull out all records of females under the age of 21. We can again use Boolean indexing, but we need to use an & operator, taken to mean logical AND. We did not cover this bitwise operator before, but the syntax is self-explanatory in the example below. Note that it is important that each Boolean operation is in parentheses because of the precedence of the operators involved. The other bitwise operators you may wish to use for Boolean indexing in data frames are | for OR and ~ for NOT, sketched after the next two examples.
[11]:
df.filter((pl.col('gender') == 'f') & (pl.col('age') < 21))
[11]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
3 | "f" | 16 | 70 | 80 | 75.0 | 70.0 | 57.0 | 54.0 | 53.0 | 57.0 | 54.5 | 23 | 1 | 3 |
5 | "f" | 18 | 90 | 100 | 95.0 | 76.5 | 83.0 | 80.0 | null | 80.0 | 83.0 | 21 | 7 | 5 |
58 | "f" | 16 | 85 | 85 | 85.0 | 55.0 | 30.0 | 50.0 | 40.0 | 52.5 | 35.0 | 29 | 2 | 11 |
72 | "f" | 18 | 80 | 75 | 77.5 | 67.5 | 51.5 | 66.0 | 57.0 | 67.0 | 53.0 | 29 | 4 | 6 |
85 | "f" | 18 | 85 | 85 | 85.0 | 93.0 | 92.0 | 91.0 | 89.0 | 91.5 | 91.0 | 25 | 4 | 21 |
We can do something even more complicated, like pull out all females under 30 who got more than 85% of the face matching tasks correct.
[12]:
df.filter(
(pl.col('gender') == 'f')
& (pl.col('age') < 30)
& (pl.col('percent correct') > 85.0)
)
[12]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
93 | "f" | 28 | 100 | 75 | 87.5 | 89.5 | null | 67.0 | 60.0 | 80.0 | 60.0 | 16 | 7 | 4 |
5 | "f" | 18 | 90 | 100 | 95.0 | 76.5 | 83.0 | 80.0 | null | 80.0 | 83.0 | 21 | 7 | 5 |
6 | "f" | 28 | 95 | 80 | 87.5 | 100.0 | 85.0 | 94.0 | 61.0 | 99.0 | 65.0 | 19 | 7 | 12 |
10 | "f" | 25 | 100 | 100 | 100.0 | 90.0 | null | 85.0 | null | 90.0 | null | 17 | 10 | 11 |
44 | "f" | 21 | 85 | 90 | 87.5 | 66.0 | 29.0 | 70.0 | 29.0 | 67.0 | 29.0 | 26 | 7 | 18 |
48 | "f" | 23 | 90 | 85 | 87.5 | 67.0 | 47.0 | 69.0 | 40.0 | 67.0 | 40.0 | 18 | 6 | 8 |
51 | "f" | 24 | 85 | 95 | 90.0 | 97.0 | 41.0 | 74.0 | 73.0 | 83.0 | 55.5 | 29 | 1 | 7 |
67 | "f" | 25 | 100 | 100 | 100.0 | 61.5 | null | 58.5 | null | 60.5 | null | 28 | 8 | 9 |
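The | and ~ operators work the same way. As a minimal sketch (not part of our analysis of this data set):

# All subjects either under 18 or over 65 (logical OR)
df.filter((pl.col('age') < 18) | (pl.col('age') > 65))

# All subjects who are not female (logical NOT)
df.filter(~(pl.col('gender') == 'f'))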
Of particular interest in this exercise in Boolean indexing is that we never had to write a loop. To produce our indices explicitly, we could have done the following.
[13]:
# Initialize list of Boolean indices
inds = []

# Iterate over the rows of the DataFrame to check if the row should be included
for row in df.iter_rows(named=True):
    inds.append(
        row["age"] < 30
        and row["gender"] == "f"
        and row["percent correct"] > 85
    )
# Look at inds
print(inds)
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, True, True, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, True, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
If we wanted, we could use this inds list of Trues and Falses to filter our values.
[14]:
df.filter(inds)
[14]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
93 | "f" | 28 | 100 | 75 | 87.5 | 89.5 | null | 67.0 | 60.0 | 80.0 | 60.0 | 16 | 7 | 4 |
5 | "f" | 18 | 90 | 100 | 95.0 | 76.5 | 83.0 | 80.0 | null | 80.0 | 83.0 | 21 | 7 | 5 |
6 | "f" | 28 | 95 | 80 | 87.5 | 100.0 | 85.0 | 94.0 | 61.0 | 99.0 | 65.0 | 19 | 7 | 12 |
10 | "f" | 25 | 100 | 100 | 100.0 | 90.0 | null | 85.0 | null | 90.0 | null | 17 | 10 | 11 |
44 | "f" | 21 | 85 | 90 | 87.5 | 66.0 | 29.0 | 70.0 | 29.0 | 67.0 | 29.0 | 26 | 7 | 18 |
48 | "f" | 23 | 90 | 85 | 87.5 | 67.0 | 47.0 | 69.0 | 40.0 | 67.0 | 40.0 | 18 | 6 | 8 |
51 | "f" | 24 | 85 | 95 | 90.0 | 97.0 | 41.0 | 74.0 | 73.0 | 83.0 | 55.5 | 29 | 1 | 7 |
67 | "f" | 25 | 100 | 100 | 100.0 | 61.5 | null | 58.5 | null | 60.5 | null | 28 | 8 | 9 |
This feature, where the looping is done automatically on Polars objects like data frames, is very powerful and saves us writing lots of lines of code. This example also showed how to use the iter_rows() method of a data frame. It is actually rare that you will need to do that, and you should generally avoid it, since it is also slow.
Contexts and expressions
We will now be a bit more formal in discussing how to work with Polars data frames. Specifically, Polars features the concepts of expressions and contexts.
The filter context
As an example, let us consider our above task of filtering the data frame to extract females under the age of 21. The syntax was
df.filter((pl.col('gender') == 'f') & (pl.col('age') < 21))
Consider first pl.col('gender') == 'f'. This is an example of an expression. An expression consists of a calculation or transformation that can be applied to a series and returns a series. In this case, we are taking a column called 'gender' and evaluating whether each element in that column is 'f'. Indeed, if we ask the Python interpreter to tell us the type of the above expression, it is a Polars Expr.
[15]:
type(pl.col('gender') == 'f')
[15]:
polars.expr.expr.Expr
Similarly, pl.col('age') < 21 is also an expression, as is the result when we apply the & bitwise operator.
[16]:
type((pl.col('gender') == 'f') & (pl.col('age') < 21))
[16]:
polars.expr.expr.Expr
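Expressions are not limited to Boolean comparisons; arithmetic on columns yields an expression, too. A small sketch:

# Arithmetic on a column also produces an expression
type(pl.col('percent correct') / 100)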
So, an expression says what we want to do to data. But now we ask: in what way, i.e., in what context, do we want to use the result of the expression? One way we may wish to use the above expression is to filter the rows in a data frame. The filter context is established by df.filter(). The argument of df.filter() is an expression (or expressions) that evaluates to Booleans. That is how we got our result; in the context of filtering, the expression is evaluated and only entries where the expression gives True are retained.
[17]:
df.filter((pl.col('gender') == 'f') & (pl.col('age') < 21))
[17]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
3 | "f" | 16 | 70 | 80 | 75.0 | 70.0 | 57.0 | 54.0 | 53.0 | 57.0 | 54.5 | 23 | 1 | 3 |
5 | "f" | 18 | 90 | 100 | 95.0 | 76.5 | 83.0 | 80.0 | null | 80.0 | 83.0 | 21 | 7 | 5 |
58 | "f" | 16 | 85 | 85 | 85.0 | 55.0 | 30.0 | 50.0 | 40.0 | 52.5 | 35.0 | 29 | 2 | 11 |
72 | "f" | 18 | 80 | 75 | 77.5 | 67.5 | 51.5 | 66.0 | 57.0 | 67.0 | 53.0 | 29 | 4 | 6 |
85 | "f" | 18 | 85 | 85 | 85.0 | 93.0 | 92.0 | 91.0 | 89.0 | 91.5 | 91.0 | 25 | 4 | 21 |
The selection context
The simplest way we can use an expression is simply to evaluate the expression and give its result as a new data frame. This is the selection context, in which we get the output of the expression. It can be invoked with df.select().
[18]:
df.select((pl.col('gender') == 'f') & (pl.col('age') < 21))
[18]:
gender |
---|
bool |
false |
false |
false |
false |
false |
… |
false |
false |
false |
false |
false |
Note that this is a data frame and not a series; it is a data frame containing one column. In this case, the column is named after the first column used in our Boolean expression. We can adjust the column label by applying the alias() method, which does a renaming transformation.
[19]:
f_under_21 = (pl.col('gender') == 'f') & (pl.col('age') < 21)
df.select(f_under_21.alias('female under 21'))
[19]:
female under 21 |
---|
bool |
false |
false |
false |
false |
false |
… |
false |
false |
false |
false |
false |
If we wanted a series instead of a new data frame, we can apply the get_column() method to the data frame returned by df.select().
[20]:
# Result of expression as a series
df.select(f_under_21.alias('female under 21')).get_column('female under 21')
[20]:
female under 21 |
---|
bool |
false |
false |
false |
false |
false |
… |
false |
false |
false |
false |
false |
We can also select with multiple expressions. For example, let’s say we additionally wanted to compute the ratio of confidence when correct to confidence when incorrect. First, we can make an expression for that.
[21]:
conf_ratio = pl.col('confidence when correct') / pl.col('confidence when incorrect')
Now, we can select that as well as the 'female under 21' column.
[22]:
df.select(
f_under_21.alias('female under 21'),
conf_ratio.alias('confidence ratio')
)
[22]:
female under 21 | confidence ratio |
---|---|
bool | f64 |
false | 1.033333 |
false | 1.5 |
false | 1.011364 |
false | 1.1 |
false | 1.326531 |
… | … |
false | 1.040541 |
false | 0.925 |
false | 0.802469 |
false | 1.588235 |
false | 1.109589 |
Notice that df.select() returns a new data frame containing only the columns given by the expressions; the original data frame is discarded. If we want the results of the expressions to instead be added to the data frame (keeping all of its original columns), we use the df.with_columns() method. This is still a selection context; the output is just different, comprising the original data frame with added columns. (In the output of the cell below, you will find the new columns at the far right of the data frame.)
[23]:
df.with_columns(
f_under_21.alias('female under 21'),
conf_ratio.alias('confidence ratio')
)
[23]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | female under 21 | confidence ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool | f64 |
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | false | 1.033333 |
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | false | 1.5 |
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | false | 1.011364 |
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | false | 1.1 |
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | false | 1.326531 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
97 | "f" | 23 | 70 | 85 | 77.5 | 77.0 | 66.5 | 77.0 | 77.5 | 77.0 | 74.0 | 20 | 8 | 10 | false | 1.040541 |
98 | "f" | 70 | 90 | 85 | 87.5 | 65.5 | 85.5 | 87.0 | 80.0 | 74.0 | 80.0 | 19 | 8 | 7 | false | 0.925 |
99 | "f" | 24 | 70 | 80 | 75.0 | 61.5 | 81.0 | 70.0 | 61.0 | 65.0 | 81.0 | 31 | 2 | 15 | false | 0.802469 |
102 | "f" | 40 | 75 | 65 | 70.0 | 53.0 | 37.0 | 84.0 | 52.0 | 81.0 | 51.0 | 22 | 4 | 7 | false | 1.588235 |
103 | "f" | 33 | 85 | 40 | 62.5 | 80.0 | 27.0 | 31.0 | 82.5 | 81.0 | 73.0 | 24 | 5 | 7 | false | 1.109589 |
Finally, we will do something we’ll want to use going forward. Recall that a subject is said to suffer from insomnia if they have an SCI of 16 or below. We might like to add a column to the data frame that specifies whether or not the subject suffers from insomnia, which we do in the code cell below. Note that until now, we have not updated our data frame. To actually update the data frame, we need an assignment operation, df = ....
[24]:
# Add a column with `True` for insomnia
df = df.with_columns((pl.col('sci') <= 16).alias('insomnia'))
# Take a look (.head() gives first five rows)
df.head()
[24]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | insomnia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool |
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | true |
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | true |
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | true |
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | true |
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | true |
Polars selectors
We have seen that we can choose which columns we want to work with in expressions using pl.col(). Thus far, we have used a single string as an argument, but pl.col() is more capable than that. For example, to select three columns of interest, we can do the following.
[25]:
df.select(pl.col('age', 'gender', 'percent correct')).head()
[25]:
age | gender | percent correct |
---|---|---|
i64 | str | f64 |
39 | "f" | 72.5 |
42 | "m" | 90.0 |
31 | "f" | 92.5 |
35 | "f" | 87.5 |
74 | "f" | 62.5 |
We can also pass regular expressions into pl.col(). If we want all columns, for example, we can use pl.col('*'). To get all columns containing the string 'confidence', we can use pl.col('^.*confidence.*$'), as sketched below.
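A quick sketch of that regular-expression selection (note that Polars only treats the string as a regex when it is anchored with ^ and $):

# Select all columns whose names contain 'confidence' using a regex
df.select(pl.col('^.*confidence.*$')).head()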
Personally, I always struggle with regular expressions. Fortunately, Polars has powerful selectors which help specify which columns are of interest. In addition to facilitating selection based on the names of the columns, selectors allow selection based on the data type of the column as well (actually, so does pl.col(), but it is simplified with selectors).
We have to import the selectors separately, which we have done in the top cell of this notebook via import polars.selectors as cs. The cs alias is suggested by the Polars developers.
As an example, we can select all columns that have a column heading containing the string 'confidence'.
[26]:
df.select(cs.contains('confidence')).head()
[26]:
confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect |
---|---|---|---|---|---|
f64 | f64 | f64 | f64 | f64 | f64 |
91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 |
75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 |
89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 |
89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 |
68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 |
Or, perhaps we want to exclude all columns that are not indicators of performance on a test (participant number, age, and gender). We can specifically exclude columns with cs.exclude().
[27]:
df.select(cs.exclude('gender', 'age', 'participant number')).head()
[27]:
correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | insomnia |
---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool |
65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | true |
90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | true |
90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | true |
100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | true |
60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | true |
What if we want only columns with a float data type?
[28]:
df.select(cs.float()).head()
[28]:
percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect |
---|---|---|---|---|---|---|
f64 | f64 | f64 | f64 | f64 | f64 | f64 |
72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 |
90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 |
92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 |
87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 |
62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 |
Note that the sleep measures were omitted because they are integer data types. We could select everything that is numeric if we want to include those.
[29]:
df.select(cs.numeric()).head()
[29]:
participant number | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
8 | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
16 | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
18 | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
22 | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
27 | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
Unfortunately, this gave us participant number and age again. We could exclude those explicitly by chaining methods.
[30]:
df.select(cs.numeric().exclude('age', 'participant number')).head()
[30]:
correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 |
65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
Selectors also allow for set algebra. As a (contrived) example, let’s say we want all columns that are not of string data type and that have spaces in the column heading, and that we also want to exclude the participant number.
[31]:
df.select((~cs.string() & cs.contains(' ')).exclude('participant number')).head()
[31]:
correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect |
---|---|---|---|---|---|---|---|---|
i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 |
90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 |
90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 |
100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 |
60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 |
Notice that we used the complement operator ~ and the intersection operator &. Selectors also support the union operator | and the difference operator -, sketched below.
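A minimal sketch of the other two operators, using column names from this data set:

# Union: columns whose names contain 'sci' or 'psqi'
df.select(cs.contains('sci') | cs.contains('psqi')).head()

# Difference: all numeric columns except the integer-typed ones
df.select(cs.numeric() - cs.integer()).head()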
To close our discussion on selectors, we note that selectors are their own data type:
[32]:
type(cs.string())
[32]:
polars.selectors._selector_proxy_
Selectors can be converted to expressions with the as_expr() method, so you can continue computing with them.
[33]:
type(cs.string().as_expr())
[33]:
polars.expr.expr.Expr
Computing summary statistics
To demonstrate a use of what we have learned so far, we can compute the mean percent correct for insomniacs and normal sleepers. We can filter the data frame according to the insomnia column, select the percent correct column, and compute the mean.
To put the two means together in a new data frame, we make a dictionary and convert it to a data frame, which is one of the ways to make a Polars data frame. E.g.,
[34]:
pl.DataFrame(dict(a=[1,2,3], b=[4.5, 5.5, 6.5], c=['one', 'two', 'three']))
[34]:
a | b | c |
---|---|---|
i64 | f64 | str |
1 | 4.5 | "one" |
2 | 5.5 | "two" |
3 | 6.5 | "three" |
Now to make our data frame of means for insomniacs and normal sleepers.
[35]:
pl.DataFrame(
{
'insomniacs': df.filter(pl.col('insomnia')).get_column('percent correct').mean(),
'normal sleepers': df.filter(~pl.col('insomnia')).get_column('percent correct').mean()
}
)
[35]:
insomniacs | normal sleepers |
---|---|
f64 | f64 |
76.1 | 81.461039 |
Notice that I used the ~ operator, which inverts Booleans, to get the normal sleepers from the insomnia column. It changes all Trues to Falses and vice versa; in this case, it functions like NOT.
It appears as though normal sleepers score better than insomniacs. We will learn techniques to more quantitatively assess that claim when we learn about statistical inference.
As we will soon see, what we have done is a split-apply-combine operation, for which there are more elegant and efficient methods using Polars. You will probably never use the approach in the above code cell again.
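As a preview, here is a minimal sketch of the same computation using Polars’s group_by() (we will cover split-apply-combine properly later):

# Mean percent correct, split by the Boolean insomnia column
df.group_by('insomnia').agg(pl.col('percent correct').mean())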
We will do a lot more computing with Polars data frames as the course goes on. For a nifty demonstration in this lesson, we can quickly compute summary statistics about each column of a data frame using its describe() method.
[36]:
df.describe()
[36]:
statistic | participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | insomnia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"count" | 102.0 | "102" | 102.0 | 102.0 | 102.0 | 102.0 | 102.0 | 84.0 | 102.0 | 93.0 | 102.0 | 99.0 | 102.0 | 102.0 | 102.0 | 102.0 |
"null_count" | 0.0 | "0" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18.0 | 0.0 | 9.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"mean" | 52.04902 | null | 37.921569 | 83.088235 | 77.205882 | 80.147059 | 74.990196 | 58.565476 | 71.137255 | 61.22043 | 74.642157 | 61.979798 | 22.245098 | 5.27451 | 7.294118 | 0.245098 |
"std" | 30.020909 | null | 14.02945 | 15.09121 | 17.569854 | 12.047881 | 14.165916 | 19.560653 | 14.987479 | 17.671283 | 13.619725 | 15.92167 | 7.547128 | 3.404007 | 4.426715 | null |
"min" | 1.0 | "f" | 16.0 | 35.0 | 20.0 | 40.0 | 29.5 | 7.0 | 19.0 | 17.0 | 24.0 | 24.5 | 0.0 | 0.0 | 0.0 | 0.0 |
"25%" | 26.0 | null | 26.0 | 75.0 | 70.0 | 72.5 | 66.0 | 47.0 | 64.5 | 50.0 | 66.0 | 51.0 | 17.0 | 3.0 | 4.0 | null |
"50%" | 53.0 | null | 37.0 | 90.0 | 80.0 | 85.0 | 75.0 | 56.5 | 71.5 | 61.0 | 76.0 | 61.5 | 24.0 | 5.0 | 7.0 | null |
"75%" | 78.0 | null | 45.0 | 95.0 | 90.0 | 87.5 | 87.0 | 73.0 | 80.0 | 74.0 | 82.5 | 73.0 | 29.0 | 7.0 | 10.0 | null |
"max" | 103.0 | "m" | 74.0 | 100.0 | 100.0 | 100.0 | 100.0 | 92.0 | 100.0 | 100.0 | 100.0 | 100.0 | 32.0 | 15.0 | 21.0 | 1.0 |
This gives us a data frame with summary statistics.
Outputting a new CSV file
Now that we have added the insomnia column, we might like to save our data frame as a new CSV file that we can reload later. We use df.write_csv() for this.
[37]:
df.write_csv("gfmt_sleep_with_insomnia.csv")
Let’s take a look at what this file looks like.
[38]:
!head gfmt_sleep_with_insomnia.csv
participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess,insomnia
8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2,true
16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7,true
18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3,true
22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20,true
27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12,true
28,f,61,80,20,50.0,71.0,63.0,31.0,72.5,64.5,70.5,15,14,2,true
30,m,32,90,75,82.5,67.0,56.5,66.0,65.0,66.0,64.0,16,9,3,true
33,m,62,45,90,67.5,54.0,37.0,65.0,81.5,62.0,61.0,14,9,9,true
34,f,33,80,100,90.0,70.5,76.5,64.5,,68.0,76.5,14,12,10,true
Very nice. Notice that by default Polars leaves an empty field for null values, and we do not need the null_values kwarg when we load in this CSV file.
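A quick check of the round trip (the empty fields are parsed as null by default):

# Reload the file we just wrote; empty fields come back as null
pl.read_csv("gfmt_sleep_with_insomnia.csv").head()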
Renaming columns
You may be annoyed with the rather lengthy syntax of accessing columns by name and wish to shorten the names. Actually, you probably do not want to do this. Explicit is better than implicit! Furthermore, high-level plotting libraries, as we will soon see, often automatically use column names for axis labels. So, let’s instead lengthen a column name. Say we keep forgetting what ESS stands for and want to rename the ess column to “Epworth Sleepiness Scale.”
Data frames have a nice rename method to do this. To rename the columns, we provide a dictionary where the keys are current column names and the corresponding values are the names we wish to change them to. While we are at it, we will choose descriptive names for all three of the sleep quality indices.
[39]:
# Make a dictionary to rename columns
rename_dict = {
"ess": "Epworth Sleepiness Scale",
"sci": "Sleep Condition Indicator",
"psqi": "Pittsburgh Sleep Quality Index",
}
# Rename the columns
df = df.rename(rename_dict)
df.head()
[39]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | Sleep Condition Indicator | Pittsburgh Sleep Quality Index | Epworth Sleepiness Scale | insomnia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool |
8 | "f" | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 | true |
16 | "m" | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 | true |
18 | "f" | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 | true |
22 | "f" | 35 | 100 | 75 | 87.5 | 89.5 | null | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | true |
27 | "f" | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 | true |
A note on indexing and speed
As we have seen, Boolean indexing is very convenient and fits nicely into a logical framework that allows us to extract data according to criteria we want. The trade-off is speed. Slicing by Boolean indexing is essentially doing a reverse lookup in a dictionary: we have to loop through all values to find keys that match. This is much slower than direct indexing. Compare the difference in speed for looking up the percent correct for participant number 42 by Boolean indexing versus direct indexing (participant 42 is in row 54).
[40]:
print("Boolean indexing:")
%timeit df.filter(pl.col('participant number') == 42)['percent correct'].item()
print("\nDirect indexing:")
%timeit df[54, 'percent correct']
Boolean indexing:
250 μs ± 3.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Direct indexing:
885 ns ± 3.75 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
The speed difference is stark, spanning over two orders of magnitude. For larger data sets, or for analyses that require repeated indexing, this speed consideration may be important. However, Polars performs optimizations and takes full advantage of parallelization, which accelerates Boolean indexing.
Computing environment
[41]:
%load_ext watermark
%watermark -v -p numpy,polars,jupyterlab
Python implementation: CPython
Python version : 3.12.5
IPython version : 8.27.0
numpy : 1.26.4
polars : 1.8.1
jupyterlab: 4.2.5