iqplot

Data set download


[2]:
import os

import polars as pl

import iqplot

import bokeh.io
bokeh.io.output_notebook()

I developed iqplot, which generates Bokeh plots for data sets where exactly one variable is quantitative and all other variables of interest, if any, are categorical. This is where the name comes from; the first two letters of the package name are meant to indicate one (Roman numeral I) quantitative (Q) variable. Data sets that contain a single quantitative variable (and possibly several categorical variables) abound in the biological sciences.

There are seven types of plots that iqplot can generate. As you will see, all of these modes of plotting are meant to give a picture of how the quantitative measurements are distributed.

  • Box plots: iqplot.box()

  • Strip plots: iqplot.strip()

  • Spike plots: iqplot.spike()

  • Strip-box plots (strip and box plots overlaid): iqplot.stripbox()

  • Histograms: iqplot.histogram()

  • Strip-histogram plots (strip and histograms overlaid): iqplot.striphistogram()

  • ECDFs: iqplot.ecdf()

The first seven arguments are the same for all plots. They are:

  • data: A tidy data frame

  • q: The column of the data frame to be treated as the quantitative variable.

  • cats: A list of columns in the data frame that are to be considered as categorical variables in the plot. If None, a single box, strip, histogram, or ECDF is plotted.

  • q_axis: The axis, 'x' or 'y', along which the quantitative variable varies. The default is 'x'.

  • palette: A list of hex colors to use for coloring the markers for each category. By default, it uses the Glasbey Category 10 color palette from colorcet.

  • order: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order of the inputted data frame is used.

  • p: If specified, the bokeh.plotting.Figure object to use for the plot. If not specified, a new figure is created.

If data is given as a NumPy array, it is the only required argument. If data is given as a data frame (Pandas or, as in this lesson, Polars), q must also be supplied. All other arguments are optional and have reasonably set defaults. Any extra kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.

With this in mind, we will put iqplot to use on a facial identification data set to demonstrate how we can make each of the seven kinds of plots. First, as we have become accustomed to doing, we’ll load in the data set.

[3]:
# data_path is assumed to have been defined earlier as the directory holding the data set
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = (
    pl.read_csv(fname, null_values="*")
    .with_columns(
        insomnia := (pl.col('sci') <= 16).alias('insomnia'),
        insomnia
            .replace_strict({True: 'insomniac', False: 'normal'}, return_dtype=pl.String)
            .alias('sleeper'),
        pl.col('gender')
            .replace_strict(dict(f='female', m='male'), return_dtype=pl.String)
            .alias('gender'),
    )
)

All seven plots

We now make plots of the percent correct for insomniacs and normal sleepers so you can see how the syntax works.

Box plot

[4]:
p = iqplot.box(
    data=df,
    q="percent correct",
    cats="sleeper",
)

bokeh.io.show(p)

Strip plot

For this plot, I will add jitter, which is passed using the spread='jitter' kwarg.

[5]:
p = iqplot.strip(
    data=df,
    q="percent correct",
    cats="sleeper",
    spread="jitter",
)

bokeh.io.show(p)
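
To make the idea of jitter concrete, here is a minimal sketch in plain Python, independent of iqplot. It assumes that jitter means adding a small uniform random offset along the categorical axis so that identical measurements no longer sit on top of each other. The function name jitter_positions, the width, the seed, and the toy scores are all hypothetical, and iqplot's own jittering may differ in detail.

```python
import random


def jitter_positions(values, center=0.0, width=0.3, seed=3252):
    """Pair each value with a randomly offset position on the categorical axis.

    Offsets are uniform on [-width, width] around `center`. This is an
    illustrative assumption, not iqplot's actual implementation.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [(q, center + rng.uniform(-width, width)) for q in values]


# Hypothetical percent-correct scores with repeated values
scores = [80.0, 80.0, 80.0, 92.5, 92.5]
points = jitter_positions(scores)

for q, offset in points:
    print(f"q = {q:5.1f}, position on categorical axis = {offset:+.3f}")
```

The quantitative values are untouched; only the position along the categorical axis is perturbed, so the jitter carries no information of its own.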

Spike plot

Spike plots show the count of each value measured. They are only effective if the measurements take on discrete values.

[6]:
p = iqplot.spike(
    data=df,
    q="percent correct",
    cats="sleeper",
)

bokeh.io.show(p)

Note that when the spikes are in the “stacked” arrangement as above (as is default), the count values cannot be ascertained, only compared. We can display the counts using the arrangement="overlay" keyword argument.

[7]:
p = iqplot.spike(
    data=df,
    q="percent correct",
    cats="sleeper",
    arrangement="overlay",
    style="spike-dot",
)

bokeh.io.show(p)
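
As a rough illustration of what a spike plot displays, the counts can be tallied by hand. The sketch below uses only the Python standard library and hypothetical scores; it is not how iqplot computes its spikes internally, but each (value, count) pair plays the role of one spike's position and height.

```python
from collections import Counter

# Hypothetical percent-correct scores; note the repeated discrete values
scores = [72.5, 80.0, 80.0, 85.0, 85.0, 85.0, 90.0]

# Count how many times each distinct value was measured
counts = Counter(scores)

# Each (value, count) pair corresponds to one spike: its position and height
spikes = sorted(counts.items())

for value, count in spikes:
    print(f"{value:5.1f}: {'|' * count} ({count})")
```

With continuous measurements, nearly every value would be unique and every count would be one, which is why spike plots are only effective for discrete values.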

Strip-box plot

For a strip-box plot, a strip plot and box plot are overlaid with reasonable defaults for the box plot to enable visualization.

[8]:
p = iqplot.stripbox(
    data=df,
    q="percent correct",
    cats="sleeper",
    spread="jitter",
)

bokeh.io.show(p)

Histogram

For histograms, the number of bins is automatically chosen using the Freedman-Diaconis rule.

[9]:
p = iqplot.histogram(
    data=df,
    q="percent correct",
    cats="sleeper",
)

bokeh.io.show(p)
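
For reference, the textbook Freedman-Diaconis rule sets the bin width to h = 2 IQR / n^(1/3), where IQR is the interquartile range and n is the number of measurements. The sketch below applies that formula to hypothetical scores using only the standard library. The helper name freedman_diaconis_bins is made up for illustration, and iqplot's implementation may differ in details such as how quartiles are interpolated or how the bin count is rounded.

```python
import math
import statistics


def freedman_diaconis_bins(data):
    """Return (bin_width, n_bins) from the textbook Freedman-Diaconis rule."""
    n = len(data)

    # Quartiles with linear interpolation between order statistics
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Bin width h = 2 * IQR / n^(1/3)
    h = 2 * iqr / n ** (1 / 3)

    # Enough bins of width h to cover the range of the data
    n_bins = math.ceil((max(data) - min(data)) / h)

    return h, n_bins


# Hypothetical percent-correct scores
scores = [50, 60, 65, 70, 72.5, 75, 80, 85, 90, 100]
h, n_bins = freedman_diaconis_bins(scores)
print(f"bin width = {h:.2f}, number of bins = {n_bins}")
```

Note that this sketch does not handle data with zero interquartile range, for which the formula gives a bin width of zero.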

Strip-histogram

Strip plots may also be combined with histograms. By default, the histograms are normalized and mirrored, similar to a violin plot.

[10]:
p = iqplot.striphistogram(
    data=df,
    q="percent correct",
    cats="sleeper",
    spread="swarm",
)

bokeh.io.show(p)

ECDF

[11]:
p = iqplot.ecdf(
    data=df,
    q="percent correct",
    cats="sleeper",
)

bokeh.io.show(p)

Note that the ECDFs show a clear difference: the distribution for insomniacs is shifted leftward, toward lower percent correct. Of the seven plots above, the ECDF reveals this most clearly.
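
To see what a leftward shift means, recall that the ECDF evaluated at x is the fraction of measurements less than or equal to x. The following sketch computes ECDF values by hand for two small, hypothetical groups of scores; it is meant only to illustrate the definition, not to reproduce iqplot's internals.

```python
def ecdf(data):
    """Return sorted values x and ECDF heights y, with y[i] = (i + 1) / n."""
    x = sorted(data)
    n = len(x)
    y = [(i + 1) / n for i in range(n)]
    return x, y


# Hypothetical percent-correct scores for two groups
insomniacs = [65.0, 70.0, 77.5, 90.0]
normal = [75.0, 80.0, 85.0, 92.5]

x_ins, y_ins = ecdf(insomniacs)
x_norm, y_norm = ecdf(normal)

# Fraction of each group scoring 77.5 or below
frac_ins = sum(q <= 77.5 for q in insomniacs) / len(insomniacs)
frac_norm = sum(q <= 77.5 for q in normal) / len(normal)
print(frac_ins, frac_norm)
```

In this toy example, a larger fraction of the hypothetical insomniacs lies at or below a given score, which is what an ECDF shifted to the left looks like.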

Customization with iqplot

You may have noticed in the above plots that I occasionally used keyword arguments beyond the seven arguments that are present in all of the function signatures. For example, I used spread='jitter' in the strip plot and arrangement='overlay' in the second spike plot, while in the ECDF above I relied on the default, which is style='staircase'. These are plot-type-specific kwargs that enable customization.

You can find out what kwargs are available for each function by reading their doc strings, e.g., with

iqplot.box?

or by reading the documentation. Any kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.

In our examples of customization, we will use both gender and sleeper status as categorical variables.

Customizing box plots

We can make vertical box plots using the q_axis kwarg.

[12]:
p = iqplot.box(
    data=df,
    q="percent correct",
    cats=["gender", "sleeper"],
    q_axis="y",
)

bokeh.io.show(p)

We can independently specify properties of the marks using box_kwargs, whisker_kwargs, median_kwargs, and outlier_kwargs. For example, say we wanted our colors to be Betancourt red, and that we wanted the outliers to also be that color and use diamond glyphs.

[13]:
p = iqplot.box(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    q_axis='y',
    whisker_caps=True,
    outlier_marker='diamond',
    box_kwargs=dict(fill_color='#7C0000'),
    whisker_kwargs=dict(line_color='#7C0000', line_width=2),
)

bokeh.io.show(p)

Customizing strip plots

To help alleviate the overlap problem, we can make a strip plot with dash markers and add some transparency.

[14]:
p = iqplot.strip(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    marker='dash',
    marker_kwargs=dict(alpha=0.5)
)

bokeh.io.show(p)

I prefer jittering or swarming to this. Below, I add hover tools that give more information about the respective data points in a swarmed strip plot.

[15]:
p = iqplot.strip(
    data=df,
    q="percent correct",
    cats=["gender", "sleeper"],
    spread="swarm", # use spread='jitter' to add random jitter
    tooltips=[("age", "@{age}"), ("participant number", "@{participant number}")],
)

bokeh.io.show(p)
/Users/bois/Dropbox/git/iqplot/iqplot/cat.py:1804: UserWarning: 1 data points exceed maximum height. Consider using spread='jitter' or increasing the frame height.
  warnings.warn(

Note the warning that a data point exceeds the maximum spread allowed in a swarm plot such that data points overlap. For this reason, I often prefer jitter plots to swarm plots. In this particular case, because of the discrete values taken by the percent correct, a swarm plot is more effective at showing all data.

Customizing histograms

We can plot normalized histograms using the density kwarg, and we’ll make the plot a little wider to support the legend.

[16]:
# Plot the histogram
p = iqplot.histogram(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    density=True,
    frame_width=525,
)

bokeh.io.show(p)
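
Normalization matters here because the categories contain different numbers of participants, so raw counts are not directly comparable. The sketch below assumes the usual convention for a density histogram, namely that each count is divided by the total number of measurements times the bin width so that the bar areas sum to one. The counts and bin width are hypothetical, and the helper normalize_counts is not part of iqplot.

```python
def normalize_counts(counts, bin_width):
    """Convert raw bin counts to densities so the bar areas sum to one."""
    n = sum(counts)
    return [c / (n * bin_width) for c in counts]


# Hypothetical counts in four bins of width 10
counts = [2, 5, 9, 4]
bin_width = 10

density = normalize_counts(counts, bin_width)

# Total area under the normalized histogram
area = sum(d * bin_width for d in density)
print(density, area)
```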

Customizing ECDFs

Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the kind='colored' kwarg. Note that if we do this, we can only have the “dots” style ECDF, not the formal staircase.

[17]:
p = iqplot.ecdf(
    data=df,
    q="percent correct",
    cats=["gender", "sleeper"],
    kind="colored",
)

bokeh.io.show(p)
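
One way to think about a colored ECDF is that all measurements are pooled into a single ECDF and each point simply remembers which category it came from. The sketch below illustrates that idea with hypothetical scores in plain Python; the function colored_ecdf is made up for this illustration and is not iqplot's implementation.

```python
def colored_ecdf(data_by_cat):
    """Pool all measurements into one ECDF, keeping each point's category.

    Returns a list of (value, ecdf_height, category) tuples.
    """
    # Pool every measurement, remembering which category it came from
    pooled = sorted(
        (q, cat) for cat, values in data_by_cat.items() for q in values
    )
    n = len(pooled)

    # One ECDF for all data; the category only determines the dot color
    return [(q, (i + 1) / n, cat) for i, (q, cat) in enumerate(pooled)]


# Hypothetical percent-correct scores
data_by_cat = {
    "insomniac": [65.0, 77.5],
    "normal": [75.0, 85.0, 92.5],
}

points = colored_ecdf(data_by_cat)
for q, y, cat in points:
    print(f"{q:5.1f}  {y:.1f}  {cat}")
```

Because the color is attached to individual points rather than to a whole curve, dots are the natural way to draw this kind of ECDF.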

In general, for customization, you should check the documentation to see what is available.

Computing environment

[18]:
%load_ext watermark
%watermark -v -p polars,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

polars    : 1.6.0
bokeh     : 3.4.1
iqplot    : 0.3.7
jupyterlab: 4.0.13