Polars for Pandas users
[2]:
import numpy as np
import polars as pl
import polars.selectors as cs
import pandas as pd
As of September 2024, Pandas is by far the most widely used data frame package for Python. We are using Polars primarily because, in my opinion, its API is more intuitive and therefore easier for beginners and experts alike to use. It is also faster, sometimes much faster. It is nonetheless important to know about Pandas and how to use it, because many of your colleagues use it, as do many packages you may use.
Therefore, in this part of the lesson, I discuss how to convert a Polars data frame to Pandas, and vice versa. I also provide syntax for doing common tasks in Polars and Pandas. It is also worth reading the section of the Polars user guide comparing Pandas and Polars.
A sample data frame
For ease of discussion and comparison, we will use a simple data frame that has two categorical columns, 'c1' and 'c2'; two quantitative columns of floats, 'q1' and 'q2'; and a column of integer values, 'i1'. It also has an identity column, 'id', a unique identifier for each row that is useful when converting the data frame to tall format. Note that 'q1' has a null value and 'q2' has a NaN value.
[3]:
data = dict(
id=list(range(1, 9)),
c1=['a']*4 + ['b']*4,
c2=['c', 'd'] * 4,
q1=[1.1, 2.2, 3.1, None, 2.9, 1.7, 3.0, 7.3],
q2=[4.5, 2.3, np.nan, 1.1, 7.8, 2.3, 1.1, 0.8],
i1=[5, 3, 0, 2, 4, 3, 4, 1],
)
df = pl.DataFrame(data)
# Take a look
df
[3]:
id | c1 | c2 | q1 | q2 | i1 |
---|---|---|---|---|---|
i64 | str | str | f64 | f64 | i64 |
1 | "a" | "c" | 1.1 | 4.5 | 5 |
2 | "a" | "d" | 2.2 | 2.3 | 3 |
3 | "a" | "c" | 3.1 | NaN | 0 |
4 | "a" | "d" | null | 1.1 | 2 |
5 | "b" | "c" | 2.9 | 7.8 | 4 |
6 | "b" | "d" | 1.7 | 2.3 | 3 |
7 | "b" | "c" | 3.0 | 1.1 | 4 |
8 | "b" | "d" | 7.3 | 0.8 | 1 |
From Polars to Pandas and from Pandas to Polars
If you have a Polars data frame, you can directly convert it to a Pandas data frame using the to_pandas() method. Let’s do that for our data frame.
[4]:
df.to_pandas()
[4]:
id | c1 | c2 | q1 | q2 | i1 | |
---|---|---|---|---|---|---|
0 | 1 | a | c | 1.1 | 4.5 | 5 |
1 | 2 | a | d | 2.2 | 2.3 | 3 |
2 | 3 | a | c | 3.1 | NaN | 0 |
3 | 4 | a | d | NaN | 1.1 | 2 |
4 | 5 | b | c | 2.9 | 7.8 | 4 |
5 | 6 | b | d | 1.7 | 2.3 | 3 |
6 | 7 | b | c | 3.0 | 1.1 | 4 |
7 | 8 | b | d | 7.3 | 0.8 | 1 |
Note that the null value became a NaN. All missing data in Pandas are NaN. (Well, not really. You can have an object data type for a column that permits None values. However, when Pandas reads in data with missing entries, it assigns them to be NaN by default.)
Note also that Pandas has an index displayed on the left side of the data frame. In general, we will not use Pandas indexes.
Similarly, if you have a data frame in Pandas, you can convert it to a Polars data frame using the pl.from_pandas() function.
[5]:
pl.from_pandas(pd.DataFrame(data))
[5]:
id | c1 | c2 | q1 | q2 | i1 |
---|---|---|---|---|---|
i64 | str | str | f64 | f64 | i64 |
1 | "a" | "c" | 1.1 | 4.5 | 5 |
2 | "a" | "d" | 2.2 | 2.3 | 3 |
3 | "a" | "c" | 3.1 | null | 0 |
4 | "a" | "d" | null | 1.1 | 2 |
5 | "b" | "c" | 2.9 | 7.8 | 4 |
6 | "b" | "d" | 1.7 | 2.3 | 3 |
7 | "b" | "c" | 3.0 | 1.1 | 4 |
8 | "b" | "d" | 7.3 | 0.8 | 1 |
Pandas and Polars for common tasks
Below is a table listing common data frame tasks and how to do them using Pandas and Polars.
Description | Pandas | Polars |
---|---|---|
Convert dictionary to df | | |
Make 2D Numpy array into df | | |
Read from CSV | | |
Lazily read CSV | — | |
Read from Excel | | |
Read from JSON | | |
Read from HDF5 | | — |
Write CSV | | |
Rename columns | | |
Get column | | |
Get column | | |
Get columns | | |
Get columns containing floats | | |
Get row | | |
Get row | | |
Get row where | | |
Sub df with rows where | | |
Sub df where | | |
Iterate over columns of df | | |
Iterate over rows of df | | |
Group by single column | | |
Group by maintaining order | | |
Group by multiple columns | | |
Iterate over groups | | |
Iterate over nested groups | | |
Group by and apply mean¹ | | |
Group by and apply median to one column | | |
Group by and apply mean to two columns | | |
Group by and apply custom func to col² | | |
Group by and apply custom func to 2 cols³ | | |
Group by and rank within each group | | |
Convert to tall format | | |
Pivot tall result above | | |
Select columns with string in name | | |
Add column of zeros to data frame | | |
Add a Numpy array as column | | |
Multiply two columns; make new column | | |
Apply a function to each row making new col⁴ | | |
Drop rows with missing data | | |
Sort according to a column | | |
Inner join two data frames⁵ | | |
Concatenate data frames vertically | | |
Concatenate data frames horizontally | | |
Footnotes
1. Note that in Pandas, NaNs are omitted from calculations like means. In Polars, NaNs are included, and the result will be NaN. However, nulls are not included.
2. For Pandas, the function my_fun must take an array_like data type (list, Numpy array, Pandas Series, etc.) as input. For Polars, the function my_fun must take a Polars Series as input. It is wise to specify the data type of the output of the function (shown as float in the above example, but it can be whatever type my_fun returns). A Pandas example: my_fun = lambda x: np.sum(np.sin(x)). A Polars example: my_fun = lambda s: s.exp().sum().
3. For Pandas, the function must take a Pandas DataFrame as an argument. For Polars, it must take a Polars Series with a struct data type. A Pandas example: my_fun = lambda df: (np.sin(df['q1']) * df['q2']).sum(). A Polars example: my_fun = lambda s: (s.struct.field('q1').sin() * s.struct.field('q2')).sum().
4. For Pandas, my_fun must take as its argument a Pandas Series with an index containing the names of the columns of the original data frame. For Polars, my_fun must take as its argument a dictionary with keys given by the names of the columns of the original data frame. The functions may then have the same syntax (though possibly with different type hints). An example: my_fun = lambda r: r['i1'] * np.sin(r['q2']). However, note that in Polars, a null value is treated as None, which means you cannot apply a function to it, multiply by it, etc.
5. For Polars, the on kwarg for df.join() is required. With Pandas, the columns to join on are inferred from the column names the two data frames have in common.
Hierarchical indexes
Pandas supports hierarchical indexes, called MultiIndexes. These are not supported by Polars, and Polars will not read a CSV file with hierarchical indexes. If you have a data set in a CSV file with hierarchical indexes, you can convert it to a CSV file in tall format, in which the MultiIndex has been converted to columns, using the bebi103.utils.unpivot_csv() function. This operation is akin to a df.melt() operation on a data frame with a hierarchical index. You can then read the converted CSV file into Polars and begin working with it.
Computing environment
[6]:
%load_ext watermark
%watermark -v -p numpy,pandas,polars,jupyterlab
Python implementation: CPython
Python version : 3.12.4
IPython version : 8.25.0
numpy : 1.26.4
pandas : 2.2.2
polars : 1.6.0
jupyterlab: 4.0.13