Polars for Pandas users


[2]:
import numpy as np

import polars as pl
import polars.selectors as cs

import pandas as pd

As of September 2024, Pandas is by far the most widely used data frame package for Python. We are using Polars primarily because, in my opinion, its API is more intuitive and therefore easier for beginners and experts alike to use. It is also faster, sometimes much faster. It is nonetheless important to know Pandas and how to use it, because many of your colleagues use it, and many packages you may use do, too.

Therefore, in this part of the lesson, I discuss how to convert a Polars data frame to a Pandas data frame, and vice versa. I also provide syntax for doing common tasks in both Polars and Pandas. The section of the Polars user guide comparing Polars and Pandas is also worth reading.

A sample data frame

For ease of discussion and comparison, we will use a simple data frame that has two categorical columns, 'c1' and 'c2'; two quantitative columns of floats, 'q1' and 'q2'; and a column of integer values, 'i1'. It also has an 'id' column, a unique identifier for each row that is useful when converting the data frame to tall format. Note that 'q1' has a null value and 'q2' has a NaN value.

[3]:
data = dict(
    id=list(range(1, 9)),
    c1=['a']*4 + ['b']*4,
    c2=['c', 'd'] * 4,
    q1=[1.1, 2.2, 3.1, None, 2.9, 1.7, 3.0, 7.3],
    q2=[4.5, 2.3, np.nan, 1.1, 7.8, 2.3, 1.1, 0.8],
    i1=[5, 3, 0, 2, 4, 3, 4, 1],
)

df = pl.DataFrame(data)

# Take a look
df
[3]:
shape: (8, 6)
┌─────┬─────┬─────┬──────┬──────┬─────┐
│ id  ┆ c1  ┆ c2  ┆ q1   ┆ q2   ┆ i1  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ --- │
│ i64 ┆ str ┆ str ┆ f64  ┆ f64  ┆ i64 │
╞═════╪═════╪═════╪══════╪══════╪═════╡
│ 1   ┆ "a" ┆ "c" ┆ 1.1  ┆ 4.5  ┆ 5   │
│ 2   ┆ "a" ┆ "d" ┆ 2.2  ┆ 2.3  ┆ 3   │
│ 3   ┆ "a" ┆ "c" ┆ 3.1  ┆ NaN  ┆ 0   │
│ 4   ┆ "a" ┆ "d" ┆ null ┆ 1.1  ┆ 2   │
│ 5   ┆ "b" ┆ "c" ┆ 2.9  ┆ 7.8  ┆ 4   │
│ 6   ┆ "b" ┆ "d" ┆ 1.7  ┆ 2.3  ┆ 3   │
│ 7   ┆ "b" ┆ "c" ┆ 3.0  ┆ 1.1  ┆ 4   │
│ 8   ┆ "b" ┆ "d" ┆ 7.3  ┆ 0.8  ┆ 1   │
└─────┴─────┴─────┴──────┴──────┴─────┘

From Polars to Pandas and from Pandas to Polars

If you have a Polars data frame, you can directly convert it to a Pandas data frame using the to_pandas() method. Let’s do that for our data frame.

[4]:
df.to_pandas()
[4]:
id c1 c2 q1 q2 i1
0 1 a c 1.1 4.5 5
1 2 a d 2.2 2.3 3
2 3 a c 3.1 NaN 0
3 4 a d NaN 1.1 2
4 5 b c 2.9 7.8 4
5 6 b d 1.7 2.3 3
6 7 b c 3.0 1.1 4
7 8 b d 7.3 0.8 1

Note that the null value became a NaN. All missing data in Pandas are NaN. (Well, not quite: a column with an object data type permits None values. However, when Pandas reads in data with missing entries, it assigns them NaN by default.)

Note also that Pandas has an index displayed on the left side of the data frame. In general, we will not use Pandas indexes.

Similarly, if you have a data frame in Pandas, you can convert it to a Polars data frame using the pl.from_pandas() function.

[5]:
pl.from_pandas(pd.DataFrame(data))
[5]:
shape: (8, 6)
┌─────┬─────┬─────┬──────┬──────┬─────┐
│ id  ┆ c1  ┆ c2  ┆ q1   ┆ q2   ┆ i1  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ --- │
│ i64 ┆ str ┆ str ┆ f64  ┆ f64  ┆ i64 │
╞═════╪═════╪═════╪══════╪══════╪═════╡
│ 1   ┆ "a" ┆ "c" ┆ 1.1  ┆ 4.5  ┆ 5   │
│ 2   ┆ "a" ┆ "d" ┆ 2.2  ┆ 2.3  ┆ 3   │
│ 3   ┆ "a" ┆ "c" ┆ 3.1  ┆ null ┆ 0   │
│ 4   ┆ "a" ┆ "d" ┆ null ┆ 1.1  ┆ 2   │
│ 5   ┆ "b" ┆ "c" ┆ 2.9  ┆ 7.8  ┆ 4   │
│ 6   ┆ "b" ┆ "d" ┆ 1.7  ┆ 2.3  ┆ 3   │
│ 7   ┆ "b" ┆ "c" ┆ 3.0  ┆ 1.1  ┆ 4   │
│ 8   ┆ "b" ┆ "d" ┆ 7.3  ┆ 0.8  ┆ 1   │
└─────┴─────┴─────┴──────┴──────┴─────┘

Pandas and Polars for common tasks

Below is a table listing common data frame tasks done using Pandas and Polars.

| Description | Pandas | Polars |
|---|---|---|
| Convert dictionary to df | pd.DataFrame(data) | pl.DataFrame(data) |
| Make 2D Numpy array into df | pd.DataFrame(my_ar, columns=['col1', 'col2', 'col3']) | pl.DataFrame(my_ar, schema=['col1', 'col2', 'col3'], orient='row') |
| Read from CSV | pd.read_csv(file_name) | pl.read_csv(file_name) |
| Lazily read CSV | — | pl.scan_csv(file_name) |
| Read from Excel | pd.read_excel(file_name) | pl.read_excel(file_name) |
| Read from JSON | pd.read_json(file_name) | pl.read_json(file_name) |
| Read from HDF5 | pd.read_hdf(file_name) | — |
| Write CSV | df.to_csv(fname, index=False) | df.write_csv(fname) |
| Rename columns | df.rename(columns={'c1': 'cat1', 'c2': 'cat2'}) | df.rename({'c1': 'cat1', 'c2': 'cat2'}) |
| Get column 'q1' as series | df['q1'] or df.loc[:, 'q1'] | df['q1'] or df.get_column('q1') |
| Get column 'q1' as data frame | df[['q1']] or df.loc[:, ['q1']] | df.select('q1') |
| Get columns 'c1' and 'q2' | df[['c1', 'q2']] or df.loc[:, ['c1', 'q2']] | df.select('c1', 'q2') |
| Get columns containing floats | df.select_dtypes(float) | df.select(cs.float()) |
| Get row 4 | df.loc[4, :] | df.row(4) or df.row(4, named=True) |
| Get row 4 as data frame | df.loc[[4], :] | df[4] |
| Get row where i1 is 2 | df.loc[df['i1']==2, :] | df.row(by_predicate=pl.col('i1')==2) or df.filter(pl.col('i1')==2) |
| Sub-df with rows where c1 is 'a' | df.loc[df['c1']=='a', :] | df.filter(pl.col('c1')=='a') |
| Sub-df where c1 is 'a' and c2 is 'd' | df.loc[(df['c1']=='a') & (df['c2']=='d'), :] | df.filter((pl.col('c1')=='a') & (pl.col('c2')=='d')) |
| Iterate over columns of df | for col, s in df.items() | for s in df |
| Iterate over rows of df | for row_ind, row in df.iterrows() | for r in df.iter_rows(named=True) |
| Group by single column | df.groupby('c1') | df.group_by('c1') |
| Group by maintaining order | df.groupby('c1', sort=False) | df.group_by('c1', maintain_order=True) |
| Group by multiple columns | df.groupby(['c1', 'c2']) | df.group_by(['c1', 'c2']) |
| Iterate over groups | for group, subdf in df.groupby('c1') | for (group,), subdf in df.group_by('c1') |
| Iterate over nested groups | for (g1, g2), subdf in df.groupby(['c1', 'c2']) | for (g1, g2), subdf in df.group_by(['c1', 'c2']) |
| Group by and apply mean¹ | df.groupby('c1').mean(numeric_only=True) | df.group_by('c1').mean() |
| Group by and apply median to one column | df.groupby('c1')['q1'].median() | df.group_by('c1').agg(pl.col('q1').median()) |
| Group by and apply mean to two columns | df.groupby('c1')[['q1', 'q2']].mean() | df.group_by('c1').agg(pl.col('q1', 'q2').mean()) |
| Group by and apply custom func to col² | df.groupby('c1')['q1'].apply(my_fun) | df.group_by('c1').agg(pl.col('q1').map_elements(my_fun, return_dtype=float)) |
| Group by and apply custom func to 2 cols³ | df.groupby('c1')[['q1', 'q2']].apply(my_fun) | df.group_by('c1').agg(pl.struct(['q1', 'q2']).map_elements(my_fun, return_dtype=float)) |
| Group by and rank within each group | df.groupby('c1')['q1'].rank() | df.select(pl.col('q1').rank().over('c1')) |
| Convert to tall format | df.melt(value_name='value', var_name='var', id_vars='id') | df.unpivot(value_name='value', variable_name='var', index='id') |
| Pivot tall result above | df_tall.pivot(columns='var', index='id').reset_index() | df_tall.pivot(on='var', index='id') |
| Select columns with string in name | df.filter(regex='q') or df[df.columns[df.columns.str.contains('q')]] | df.select(cs.contains('q')) |
| Add column of zeros to data frame | df['new_col'] = 0 or df.assign(new_col=0) | df.with_columns(pl.lit(0).alias('new_col')) |
| Add a Numpy array as column | df['new_col'] = my_array or df.assign(new_col=my_array) | df.with_columns(pl.Series(my_array).alias('new_col')) |
| Multiply two columns; make new column | df['q1q2'] = df['q1'] * df['q2'] or df.assign(q1q2=df['q1'] * df['q2']) | df.with_columns((pl.col('q1') * pl.col('q2')).alias('q1q2')) |
| Apply a function to each row making new col⁴ | df.assign(new_col=my_fun) | df.with_columns(pl.struct(pl.all()).map_elements(my_fun, return_dtype=float).alias('new_col')) |
| Drop rows with missing data | df.dropna() | df.drop_nulls() |
| Sort according to a column | df.sort_values(by='i1') | df.sort('i1') |
| Inner join two data frames⁵ | pd.merge(df, df2, on=shared_columns) | df.join(df2, on=shared_columns) |
| Concatenate data frames vertically | pd.concat((df, df2)) | pl.concat((df, df2), how='diagonal') |
| Concatenate data frames horizontally | pd.concat((df, df2), axis=1) | pl.concat((df, df2), how='horizontal') |

Footnotes

  1. Note that in Pandas, NaNs are omitted from calculations like means. In Polars, NaNs are included, and the result will be NaN. However, nulls are not included.

  2. For Pandas, the function my_fun must take an array-like data type (list, Numpy array, Pandas Series, etc.) as input. For Polars, the function my_fun must take a Polars Series as input. It is wise to specify the data type of the output of the function (shown as float in the above example, but it can be whatever type my_fun returns). A Pandas example: my_fun = lambda x: np.sum(np.sin(x)). The equivalent Polars example: my_fun = lambda s: s.sin().sum().

  3. For Pandas, the function must take a Pandas DataFrame as an argument. For Polars, it must take a Polars Series with a struct data type. A Pandas example: my_fun = lambda sub_df: (np.sin(sub_df['q1']) * sub_df['q2']).sum(). A Polars example: my_fun = lambda s: (s.struct.field('q1').sin() * s.struct.field('q2')).sum().

  4. For Pandas, my_fun must take as its argument a Pandas Series with an index containing the names of the columns of the original data frame. For Polars, my_fun must take as its argument a dictionary with keys given by the names of the columns of the original data frame. The functions may then have the same syntax (though possibly with different type hints). An example: my_fun = lambda r: r['i1'] * np.sin(r['q2']). However, note that in Polars, a null value is treated as None, which means you cannot apply a function to it, multiply by it, etc.

  5. For Polars, the on kwarg for df.join() is required. With Pandas, the columns to join on are inferred from columns with shared names.

Hierarchical indexes

Pandas supports hierarchical indexes, called MultiIndexes. These are not supported by Polars. Polars will not read a CSV file with hierarchical indexes. If you have a data set in a CSV file with hierarchical indexes, you can convert it to a CSV file in tall format, with the MultiIndex converted to columns, using the bebi103.utils.unpivot_csv() function. This operation is akin to a df.melt() operation on a data frame with a hierarchical index. You can then read the converted CSV file into Polars and begin working with it.

Computing environment

[6]:
%load_ext watermark
%watermark -v -p numpy,pandas,polars,jupyterlab
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

numpy     : 1.26.4
pandas    : 2.2.2
polars    : 1.6.0
jupyterlab: 4.0.13