Polars for Pandas users


[2]:
import numpy as np

import polars as pl
import polars.selectors as cs

import pandas as pd

As of September 2024, Pandas is by far the most widely used data frame package for Python. We are using Polars primarily because, in my opinion, its API is more intuitive and therefore easier for beginners and experts alike to use. It is also faster, sometimes much faster. It is nonetheless important to know Pandas and how to use it, because many of your colleagues use it, and many packages you may use do, too.

Therefore, in this part of the lesson, I discuss how to convert a Polars data frame to a Pandas data frame, and vice versa. I also provide syntax for doing common tasks in both Polars and Pandas. The section of the Polars user guide comparing Polars and Pandas is also worth reading.

A sample data frame

For ease of discussion and comparison, we will use a simple data frame that has two categorical columns, 'c1' and 'c2'; two quantitative columns of floats, 'q1' and 'q2'; and a column of integer values, 'i1'. It also has an 'id' column, a unique identifier for each row that is useful when converting the data frame to tall format. Note that 'q1' has a null value and 'q2' has a NaN value.

[3]:
data = dict(
    id=list(range(1, 9)),
    c1=['a']*4 + ['b']*4,
    c2=['c', 'd'] * 4,
    q1=[1.1, 2.2, 3.1, None, 2.9, 1.7, 3.0, 7.3],
    q2=[4.5, 2.3, np.nan, 1.1, 7.8, 2.3, 1.1, 0.8],
    i1=[5, 3, 0, 2, 4, 3, 4, 1],
)

df = pl.DataFrame(data)

# Take a look
df
[3]:
shape: (8, 6)
┌─────┬─────┬─────┬──────┬──────┬─────┐
│ id  ┆ c1  ┆ c2  ┆ q1   ┆ q2   ┆ i1  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ --- │
│ i64 ┆ str ┆ str ┆ f64  ┆ f64  ┆ i64 │
╞═════╪═════╪═════╪══════╪══════╪═════╡
│ 1   ┆ "a" ┆ "c" ┆ 1.1  ┆ 4.5  ┆ 5   │
│ 2   ┆ "a" ┆ "d" ┆ 2.2  ┆ 2.3  ┆ 3   │
│ 3   ┆ "a" ┆ "c" ┆ 3.1  ┆ NaN  ┆ 0   │
│ 4   ┆ "a" ┆ "d" ┆ null ┆ 1.1  ┆ 2   │
│ 5   ┆ "b" ┆ "c" ┆ 2.9  ┆ 7.8  ┆ 4   │
│ 6   ┆ "b" ┆ "d" ┆ 1.7  ┆ 2.3  ┆ 3   │
│ 7   ┆ "b" ┆ "c" ┆ 3.0  ┆ 1.1  ┆ 4   │
│ 8   ┆ "b" ┆ "d" ┆ 7.3  ┆ 0.8  ┆ 1   │
└─────┴─────┴─────┴──────┴──────┴─────┘

From Polars to Pandas and from Pandas to Polars

If you have a Polars data frame, you can directly convert it to a Pandas data frame using the to_pandas() method. Let’s do that for our data frame.

[4]:
df.to_pandas()
[4]:
id c1 c2 q1 q2 i1
0 1 a c 1.1 4.5 5
1 2 a d 2.2 2.3 3
2 3 a c 3.1 NaN 0
3 4 a d NaN 1.1 2
4 5 b c 2.9 7.8 4
5 6 b d 1.7 2.3 3
6 7 b c 3.0 1.1 4
7 8 b d 7.3 0.8 1

Note that the null value became a NaN. All missing data in Pandas are NaN. (Well, not quite: a column with an object data type permits None values. However, when Pandas reads in data with missing entries, it assigns them NaN by default.)

Note also that Pandas has an index displayed on the left side of the data frame. In general, we will not use Pandas indexes.

Similarly, if you have a data frame in Pandas, you can convert it to a Polars data frame using the pl.from_pandas() function.

[5]:
pl.from_pandas(pd.DataFrame(data))
[5]:
shape: (8, 6)
┌─────┬─────┬─────┬──────┬──────┬─────┐
│ id  ┆ c1  ┆ c2  ┆ q1   ┆ q2   ┆ i1  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ --- │
│ i64 ┆ str ┆ str ┆ f64  ┆ f64  ┆ i64 │
╞═════╪═════╪═════╪══════╪══════╪═════╡
│ 1   ┆ "a" ┆ "c" ┆ 1.1  ┆ 4.5  ┆ 5   │
│ 2   ┆ "a" ┆ "d" ┆ 2.2  ┆ 2.3  ┆ 3   │
│ 3   ┆ "a" ┆ "c" ┆ 3.1  ┆ null ┆ 0   │
│ 4   ┆ "a" ┆ "d" ┆ null ┆ 1.1  ┆ 2   │
│ 5   ┆ "b" ┆ "c" ┆ 2.9  ┆ 7.8  ┆ 4   │
│ 6   ┆ "b" ┆ "d" ┆ 1.7  ┆ 2.3  ┆ 3   │
│ 7   ┆ "b" ┆ "c" ┆ 3.0  ┆ 1.1  ┆ 4   │
│ 8   ┆ "b" ┆ "d" ┆ 7.3  ┆ 0.8  ┆ 1   │
└─────┴─────┴─────┴──────┴──────┴─────┘

Pandas and Polars for common tasks

Below is a table listing common data frame tasks done using Pandas and Polars.

| Description | Pandas | Polars |
|---|---|---|
| Convert dictionary to df | pd.DataFrame(data) | pl.DataFrame(data) |
| Make 2D Numpy array into df | pd.DataFrame(my_ar, columns=['col1', 'col2', 'col3']) | pl.DataFrame(my_ar, schema=['col1', 'col2', 'col3'], orient='row') |
| Read from CSV | pd.read_csv(file_name) | pl.read_csv(file_name) |
| Lazily read CSV | — | pl.scan_csv(file_name) |
| Read from Excel | pd.read_excel(file_name) | pl.read_excel(file_name) |
| Read from JSON | pd.read_json(file_name) | pl.read_json(file_name) |
| Read from HDF5 | pd.read_hdf(file_name) | — |
| Write CSV | df.to_csv(fname, index=False) | df.write_csv(fname) |
| Rename columns | df.rename(columns={'c1': 'cat1', 'c2': 'cat2'}) | df.rename({'c1': 'cat1', 'c2': 'cat2'}) |
| Get column 'q1' as series | df['q1'] or df.loc[:, 'q1'] | df['q1'] or df.get_column('q1') |
| Get column 'q1' as data frame | df[['q1']] or df.loc[:, ['q1']] | df.select('q1') |
| Get columns 'c1' and 'q2' | df[['c1', 'q2']] or df.loc[:, ['c1', 'q2']] | df.select('c1', 'q2') |
| Get columns containing floats | df.select_dtypes(float) | df.select(cs.float()) |
| Get row 4 | df.loc[4, :] | df.row(4) or df.row(4, named=True) |
| Get row 4 as data frame | df.loc[[4], :] | df[4] |
| Get row where i1 is 2 | df.loc[df['i1']==2, :] | df.row(by_predicate=pl.col('i1')==2) or df.filter(pl.col('i1')==2) |
| Sub-df with rows where c1 is 'a' | df.loc[df['c1']=='a', :] | df.filter(pl.col('c1')=='a') |
| Sub-df where c1 is 'a' and c2 is 'd' | df.loc[(df['c1']=='a') & (df['c2']=='d'), :] | df.filter((pl.col('c1')=='a') & (pl.col('c2')=='d')) |
| Iterate over columns of df | for col, s in df.items() | for s in df |
| Iterate over rows of df | for row_ind, row in df.iterrows() | for r in df.iter_rows(named=True) |
| Group by single column | df.groupby('c1') | df.group_by('c1') |
| Group by maintaining order | df.groupby('c1', sort=False) | df.group_by('c1', maintain_order=True) |
| Group by multiple columns | df.groupby(['c1', 'c2']) | df.group_by(['c1', 'c2']) |
| Iterate over groups | for group, subdf in df.groupby('c1') | for (group,), subdf in df.group_by('c1') |
| Iterate over nested groups | for (g1, g2), subdf in df.groupby(['c1', 'c2']) | for (g1, g2), subdf in df.group_by(['c1', 'c2']) |
| Group by and apply mean¹ | df.groupby('c1').mean(numeric_only=True) | df.group_by('c1').mean() |
| Group by and apply median to one column | df.groupby('c1')['q1'].median() | df.group_by('c1').agg(pl.col('q1').median()) |
| Group by and apply mean to two columns | df.groupby('c1')[['q1', 'q2']].mean() | df.group_by('c1').agg(pl.col('q1', 'q2').mean()) |
| Group by and apply custom func to col² | df.groupby('c1')['q1'].apply(my_fun) | df.group_by('c1').agg(pl.col('q1').map_elements(my_fun, return_dtype=float)) |
| Group by and apply custom func to 2 cols³ | df.groupby('c1')[['q1', 'q2']].apply(my_fun) | df.group_by('c1').agg(pl.struct(['q1', 'q2']).map_elements(my_fun, return_dtype=float)) |
| Group by and rank within each group | df.groupby('c1')['q1'].rank() | df.select(pl.col('q1').rank().over('c1')) |
| Convert to tall format | df.melt(value_name='value', var_name='var', id_vars='id') | df.unpivot(value_name='value', variable_name='var', index='id') |
| Pivot tall result above | df_tall.pivot(columns='var', index='id').reset_index() | df_tall.pivot(on='var', index='id') |
| Select columns with string in name | df.filter(regex='q') or df[df.columns[df.columns.str.contains('q')]] | df.select(cs.contains('q')) |
| Add column of zeros to data frame | df['new_col'] = 0 or df.assign(new_col=0) | df.with_columns(pl.lit(0).alias('new_col')) |
| Add a Numpy array as column | df['new_col'] = my_array or df.assign(new_col=my_array) | df.with_columns(pl.Series(my_array).alias('new_col')) |
| Multiply two columns; make new column | df['q1q2'] = df['q1'] * df['q2'] or df.assign(q1q2=df['q1'] * df['q2']) | df.with_columns((pl.col('q1') * pl.col('q2')).alias('q1q2')) |
| Apply a function to each row making new col⁴ | df.assign(new_col=my_fun) | df.with_columns(pl.struct(pl.all()).map_elements(my_fun, return_dtype=float).alias('new_col')) |
| Drop rows with missing data | df.dropna() | df.drop_nulls() |
| Sort according to a column | df.sort_values(by='i1') | df.sort('i1') |
| Inner join two data frames⁵ | pd.merge(df, df2, on=shared_columns) | df.join(df2, on=shared_columns) |
| Concatenate data frames vertically | pd.concat((df, df2)) | pl.concat((df, df2), how='diagonal') |
| Concatenate data frames horizontally | pd.concat((df, df2), axis=1) | pl.concat((df, df2), how='horizontal') |

Footnotes

  1. Note that in Pandas, NaNs are omitted from calculations like means. In Polars, NaNs are included, and the result will be NaN. However, nulls are not included.

  2. For Pandas, the function my_fun must take an array-like data type (list, Numpy array, Pandas Series, etc.) as input. For Polars, the function my_fun must take a Polars Series as input. It is wise to specify the data type of the output of the function (shown as float in the above example, but it can be whatever type my_fun returns). A Pandas example: my_fun = lambda x: np.sum(np.sin(x)). The equivalent Polars example: my_fun = lambda s: s.sin().sum().

  3. For Pandas, the function must take a Pandas DataFrame as an argument. For Polars, it must take a Polars Series with a struct data type. A Pandas example: my_fun = lambda sub_df: (np.sin(sub_df['q1']) * sub_df['q2']).sum(). A Polars example: my_fun = lambda s: (s.struct.field('q1').sin() * s.struct.field('q2')).sum().

  4. For Pandas, my_fun must take as its argument a Pandas Series with an index containing the names of the columns of the original data frame. For Polars, my_fun must take as its argument a dictionary with keys given by the names of the columns of the original data frame. The functions may then have the same syntax (though possibly with different type hints). An example: my_fun = lambda r: r['i1'] * np.sin(r['q2']). However, note that in Polars, a null value is treated as None, which means you cannot apply a function to it, multiply by it, etc.

  5. For Polars, the on kwarg for df.join() is required. With Pandas, the columns to join on are inferred from columns with shared names.

Hierarchical indexes

Pandas supports hierarchical indexes, called MultiIndexes. These are not supported by Polars. Polars will not read a CSV file with hierarchical indexes. If you have a data set in a CSV file with hierarchical indexes, you can convert it to a CSV file in tall format, with the MultiIndex converted to columns, using the bebi103.utils.unpivot_csv() function. This operation is akin to a df.melt() operation on a data frame with a hierarchical index. You can then read the converted CSV file into Polars and begin working with it.

Computing environment

[6]:
%load_ext watermark
%watermark -v -p numpy,pandas,polars,jupyterlab
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

numpy     : 1.26.4
pandas    : 2.2.2
polars    : 1.6.0
jupyterlab: 4.0.13