8. Data file formats

Dataset download


[2]:
import os

import polars as pl
import polars.selectors as cs
import numpy as np

import xlrd
import fcsparser
import h5py
import pynwb

import iqplot

import bebi103

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

Data sets, biological and otherwise, come in a large variety of formats, acquired using an even larger variety of instruments. The ability to read and parse data sets in different formats is a key part of data wrangling.

There is no real prescribed procedure to handle various data file formats, so I instead give a few tips here.

  1. If the file format is proprietary, obtain documentation from the instrument manufacturer or whoever produced the file format. If the documentation is not available, call the company to obtain assistance. If no information is available, avoid using the format; don’t guess.

  2. If you are using a proprietary format, try to find a way to convert the data set to an open format. Depending on what field you are working in, there may be automated converters available. For example, you can convert a lot of proprietary microscopy formats to the open OME-TIFF format using Bio-Formats.

  3. For open file formats, carefully read the documentation of the file format. This may require some Googling.

  4. If you need to read files in a given format into Pandas or Polars data frames, NumPy arrays, or other familiar Python classes, there are often packages available to do so, and these are usually described in the documentation of the file format.

  5. When reading in data, make sure you get it all. Metadata are data that provide information about other data, presumably the “main” content of a data set. (I often do not make this distinction; metadata are data!) Be sure you know how to access the metadata within a given file format.

These tips are not immediately helpful for any given file format. To demonstrate how to handle various file formats, I will give a few examples. Ultimately, though, it is your responsibility as an analyst of data to learn and understand the file format to make sure you can completely, accurately, and conveniently access a data set. I argue further that as a data producer, you are responsible for storing and sharing your data in an open, easily accessible, well-documented file format.

With that, I proceed to the examples.

Reading data from MS Excel

You will often find collaborators, and sometimes instruments, provide data as Microsoft Excel documents. Prior to 2007, the Excel format was proprietary, but now it is open. This has allowed Python developers to write parsers of Excel files.

Polars can open Excel files with the pl.read_excel() function. Provided the Excel file is simple (e.g., does not have multiple, weirdly formatted sheets), it more or less operates as expected. You may have to install an Excel reader dependency, though, such as xlsx2csv or calamine.
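
For a simple, modern .xlsx file, usage looks something like the following sketch (the file and sheet names here are placeholders, and one of the reader engines, such as calamine, must be installed):

# Read the first sheet of an Excel file into a Polars data frame
df = pl.read_excel("my_measurements.xlsx", engine="calamine")

# A specific sheet may be selected by name
df_plate2 = pl.read_excel("my_measurements.xlsx", sheet_name="plate_2")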

However, sometimes we need to open older versions of Excel files. For this, we can use xlrd, which should be used only for reading old Excel files with .xls suffixes. As an example of reading from an old format Excel file, we will open a data set from David Lack. The data set, available from Dryad and downloadable here, contains Lack’s morphological measurements of finches taken from the Galápagos in 1905 and 1906 and stored at the California Academy of Sciences. These measurements were the basis for his famous book Darwin’s Finches.

[3]:
# Open Excel workbook
book = xlrd.open_workbook(os.path.join(data_path, "morphLack.xls"))

# Extract first sheet
sheet = book.sheet_by_index(0)

# First row is the headers
header = sheet.row_values(0)

# Loop through all columns and pull out values, omitting header
col_vals = [sheet.col_values(i)[1:] for i in range(sheet.ncols)]

# Instantiate data frame from columns, using the header as column names
df = pl.DataFrame(col_vals, schema=header, strict=False)

# Any empty strings are null
df = df.select(pl.col('*').replace('', None))

# Columns ending with L or W are lengths and widths and therefore floats
# Use strict=False when casting to convert uncastables to null
df = df.with_columns(
    (cs.ends_with('L') | cs.ends_with('W')).cast(float, strict=False)
)

# Take a look
df.head()
[3]:
shape: (5, 20)

Columns (and dtypes): Source, IslandID, TaxonOrig, GenusL69, SpeciesL69, SubspL69, SpeciesID, SubspID, Sex, Plumage (str); BodyL, WingL, TailL, BeakW (f64); BeakH (str); LBeakL, UBeakL, N-UBkL, TarsusL, MToeL (f64)

The first five rows are male, black-plumaged Geospiza magnirostris specimens with island ID "Wlf_Wnm", with wing lengths between 84.0 and 88.0 and nulls in most of the other measurement columns.

There are several other Excel readers available, as described at http://www.python-excel.org. For more fine-grained control of reading in the data from Excel spreadsheets, you might want to use xlrd for older documents, as we have done here, and openpyxl for newer documents.
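
If you do reach for openpyxl for fine-grained control of a modern .xlsx document, a minimal sketch (with placeholder file and sheet names) might look like this:

import openpyxl

# Open the workbook; data_only=True gives cached cell values rather than formulas
wb = openpyxl.load_workbook("my_measurements.xlsx", read_only=True, data_only=True)

# Grab a worksheet by name and pull its rows out as tuples of values
ws = wb["Sheet1"]
rows = [row for row in ws.iter_rows(values_only=True)]

wb.close()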

Example: Flow cytometry data

Flow cytometry is a technique in which cells are run single file through a fluid channel and subjected to illumination by a laser. The front and side scatters of the light are measured, as is the fluorescence intensity emitted if the cells contain fluorophores.

The technique is many decades old. Most often, flow cytometry data are stored in flow cytometry standard (FCS) format. These are binary files that cannot be directly read in using Pandas or Polars. As such, there are a few options we can pursue for reading FCS files and getting them into a format to work with.

  1. Use commercial software like FlowJo to open the file, and use it for analysis or to save the data as CSV.

  2. Study the FCS specification and write our own parser to open FCS files and store the content in a data frame.

  3. Look for packages that provide parsers.

For almost all file formats, option (3) is preferred. A cursory Google search reveals several Python-based packages for opening FCS files and storing them in convenient formats (and also performing analysis).

There is quite a variety in these packages, particularly in terms of their scope. Several aim to facilitate standard analyses of flow cytometry data. fcsparser, conversely, simply aims to read in an FCS file so that you can use its contents in a Pandas data frame (which we can easily convert to Polars with pl.from_pandas()). This is the functionality we desire here.

As an example, we will use this FCS data file, which was part of this paper by Razo-Mejia, et al. (The complete set of FCS files used in that paper are available here.) Given its simple scope, providing a tool to read FCS files into Pandas data frames, usage of fcsparser is simple and may be ascertained from its equally simple documentation.

[4]:
# Read in data
metadata, df = fcsparser.parse(
    os.path.join(data_path, "20160804_wt_O2_HG104_0uMIPTG.fcs")
)

# Convert to Polars
df = pl.from_pandas(df)

# Take a look
df.head()
[4]:
shape: (5, 13)

Columns (all f32): HDR-T, FSC-A, FSC-H, FSC-W, SSC-A, SSC-H, SSC-W, FITC-A, FITC-H, FITC-W, APC-Cy7-A, APC-Cy7-H, APC-Cy7-W

The five rows shown contain the scattering and fluorescence measurements for the first five recorded events.

Very nice! Now we can unleash all of our Python-based analysis tools on flow cytometry data sets, and we will soon when we use this data set in some overplotting examples.
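
Note also that the first return value of fcsparser.parse() is the metadata, an ordinary Python dictionary of the keyword/value pairs stored in the FCS file. The specific keys vary by instrument and file, but we can inspect them like any dictionary, e.g.,

# Peek at a few metadata entries (the keys present depend on the instrument and file)
for key in list(metadata)[:10]:
    print(key, ":", metadata[key])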

We can also save the data as a CSV file for future use using the write_csv() method,

df.write_csv('20160804_wt_O2_HG104_0uMIPTG.csv')

thereby eliminating the need for a special parser down the road.

Note, though, that you should never delete the raw data file that came from the data source, even if its format is inconvenient.

Example: Electrophysiological recordings of neurons in the human brain

As an example of a file format specific to a given field, we will open up an NWB (Neurodata Without Borders) file. NWB was created as a standard for neuroscience data so that they may be shared across research groups. This is stated clearly in their description of the data standard:

Neurodata Without Borders (NWB) is a data standard for neurophysiology, providing neuroscientists with a common standard to share, archive, use, and build common analysis tools for neurophysiology data. NWB is designed to store a variety of neurophysiology data, including data from intracellular and extracellular electrophysiology experiments, data from optical physiology experiments, and tracking and stimulus data. The project includes not only the NWB format, but also a broad range of software for data standardization and application programming interfaces (APIs) for reading and writing the data as well as high-value data sets that have been translated into the NWB data standard.

NWB files are stored in HDF5 format with a specific schema for their structure (with a more detailed schema here). Therefore, we can use an HDF5 parser to explore NWB data sets.

HDF stands for “hierarchical data format,” meaning that a single HDF5 file can contain a file system-like structure of data. Each entry is either a dataset, which contains an array of data, or a group, which contains datasets and other groups. You can think of groups like directories in a file system. Both datasets and groups can have arbitrary attributes, which are (possibly free-form) metadata. Importantly, HDF5 files are highly optimized for large data sets. When working with them, you do not need to store the data in RAM, but can rapidly access portions of the data from disk as needed.
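
In h5py (which we use below), attributes are exposed through the attrs property of a group or dataset, which also behaves like a dictionary. As a minimal sketch, assuming a hypothetical HDF5 file and dataset path,

# Open a (hypothetical) HDF5 file and inspect the attributes of one of its datasets
with h5py.File("my_data.h5", "r") as f:
    dset = f["some_group/some_dataset"]

    # .attrs behaves like a dictionary of metadata attached to the dataset
    for key, value in dset.attrs.items():
        print(key, ":", value)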

To see how this works, let’s consider an example. We will work with the file sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb, available here. It may also be directly downloaded from the DANDI Archive by clicking here. The data set comes from the lab of Ueli Rutishauser at Cedars-Sinai. It was featured in a publication describing NWB by Chandravadia, et al., originally coming from the paper by Faraut, et al. highlighting the data set on declarative memory encoding and recognition in humans. The data set includes recordings from single neurons from the human amygdala and hippocampus. The subjects were shown a set of images. They were then shown another set of images, some of which they had already seen. Neuronal spike times were recorded during this process. We will look at a portion of this data set measured from electrodes in the right amygdala and left hippocampus of a single subject, a 31-year-old male. The entire data set is freely available from the DANDI archive.

As we work through this example, note that there are many file formats across fields, and this serves as an example of one and how you might navigate the format to pull out what you need for your analysis. I note that prior to writing this section of the lesson, I had never opened an NWB file nor used h5py much at all. I used a skill you will hopefully master: asking the internet for help. I show this and the other examples to illustrate that you will encounter file formats you may not be familiar with, and you need to be patient, searching and reading carefully to learn how to use appropriate tools to parse them.

Parsing the file with h5py

The h5py package enables opening and navigating HDF5 files. It takes advantage of HDF5’s structure to enable access of the data on disk without copying the whole thing into RAM. We have imported h5py above and can use it to open the file.

[5]:
with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    print(f)
<HDF5 file "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb" (mode r)>

Note that we have used h5py.File() to instantiate a File instance to get access to the data in the file. I am using context management here to keep out of trouble when reading and writing files, just as we did when we learned about file I/O in an earlier lesson.

After opening the file, I printed it; it is an HDF5 File instance. These behave much like dictionaries, and I can see the keys of the dictionary using the keys() method.

[6]:
with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    print(f.keys())
<KeysViewHDF5 ['acquisition', 'analysis', 'file_create_date', 'general', 'identifier', 'intervals', 'processing', 'session_description', 'session_start_time', 'stimulus', 'timestamps_reference_time', 'units']>

We see that the HDF file contains groups acquisition, analysis, etc. It would be useful to crawl the whole hierarchical structure of the file, something like ls -R on the command line. This is possible using the visititems() method. It takes a single argument, which is a function you want applied to each item in the hierarchy of the file. To view what we have, the print() function will work fine.

[7]:
with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    f.visititems(print)
acquisition <HDF5 group "/acquisition" (2 members)>
acquisition/events <HDF5 group "/acquisition/events" (2 members)>
acquisition/events/data <HDF5 dataset "data": shape (754,), type "|O">
acquisition/events/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
acquisition/experiment_ids <HDF5 group "/acquisition/experiment_ids" (2 members)>
acquisition/experiment_ids/data <HDF5 dataset "data": shape (754,), type "<f8">
acquisition/experiment_ids/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
analysis <HDF5 group "/analysis" (0 members)>
file_create_date <HDF5 dataset "file_create_date": shape (1,), type "|O">
general <HDF5 group "/general" (9 members)>
general/data_collection <HDF5 dataset "data_collection": shape (), type "|O">
general/devices <HDF5 group "/general/devices" (1 members)>
general/devices/Neuralynx-cheetah <HDF5 group "/general/devices/Neuralynx-cheetah" (0 members)>
general/experiment_description <HDF5 dataset "experiment_description": shape (), type "|O">
general/extracellular_ephys <HDF5 group "/general/extracellular_ephys" (11 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-1 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-1" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-10 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-10" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-11 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-11" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-13 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-13" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-15 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-15" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-3 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-3" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-4 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-4" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-5 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-5" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-6 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-6" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-7 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-7" (1 members)>
general/extracellular_ephys/electrodes <HDF5 group "/general/extracellular_ephys/electrodes" (10 members)>
general/extracellular_ephys/electrodes/filtering <HDF5 dataset "filtering": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group <HDF5 dataset "group": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group_name <HDF5 dataset "group_name": shape (10,), type "|O">
general/extracellular_ephys/electrodes/id <HDF5 dataset "id": shape (10,), type "<i4">
general/extracellular_ephys/electrodes/imp <HDF5 dataset "imp": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/location <HDF5 dataset "location": shape (10,), type "|O">
general/extracellular_ephys/electrodes/origChannel <HDF5 dataset "origChannel": shape (10,), type "<u2">
general/extracellular_ephys/electrodes/x <HDF5 dataset "x": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/y <HDF5 dataset "y": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/z <HDF5 dataset "z": shape (10,), type "<f8">
general/institution <HDF5 dataset "institution": shape (), type "|O">
general/keywords <HDF5 dataset "keywords": shape (7,), type "|O">
general/lab <HDF5 dataset "lab": shape (), type "|O">
general/related_publications <HDF5 dataset "related_publications": shape (1,), type "|O">
general/subject <HDF5 group "/general/subject" (5 members)>
general/subject/age <HDF5 dataset "age": shape (), type "|O">
general/subject/description <HDF5 dataset "description": shape (), type "|O">
general/subject/sex <HDF5 dataset "sex": shape (), type "|O">
general/subject/species <HDF5 dataset "species": shape (), type "|O">
general/subject/subject_id <HDF5 dataset "subject_id": shape (), type "|O">
identifier <HDF5 dataset "identifier": shape (), type "|O">
intervals <HDF5 group "/intervals" (1 members)>
intervals/trials <HDF5 group "/intervals/trials" (14 members)>
intervals/trials/category_name <HDF5 dataset "category_name": shape (150,), type "|O">
intervals/trials/delay1_time <HDF5 dataset "delay1_time": shape (150,), type "<f8">
intervals/trials/delay2_time <HDF5 dataset "delay2_time": shape (150,), type "<f8">
intervals/trials/external_image_file <HDF5 dataset "external_image_file": shape (150,), type "|O">
intervals/trials/id <HDF5 dataset "id": shape (150,), type "<i4">
intervals/trials/new_old_labels_recog <HDF5 dataset "new_old_labels_recog": shape (150,), type "|O">
intervals/trials/response_time <HDF5 dataset "response_time": shape (150,), type "<f8">
intervals/trials/response_value <HDF5 dataset "response_value": shape (150,), type "<f8">
intervals/trials/start_time <HDF5 dataset "start_time": shape (150,), type "<f8">
intervals/trials/stimCategory <HDF5 dataset "stimCategory": shape (150,), type "|u1">
intervals/trials/stim_off_time <HDF5 dataset "stim_off_time": shape (150,), type "<f8">
intervals/trials/stim_on_time <HDF5 dataset "stim_on_time": shape (150,), type "<f8">
intervals/trials/stim_phase <HDF5 dataset "stim_phase": shape (150,), type "|O">
intervals/trials/stop_time <HDF5 dataset "stop_time": shape (150,), type "<f8">
processing <HDF5 group "/processing" (0 members)>
session_description <HDF5 dataset "session_description": shape (), type "|O">
session_start_time <HDF5 dataset "session_start_time": shape (), type "|O">
stimulus <HDF5 group "/stimulus" (2 members)>
stimulus/presentation <HDF5 group "/stimulus/presentation" (1 members)>
stimulus/presentation/StimulusPresentation <HDF5 group "/stimulus/presentation/StimulusPresentation" (7 members)>
stimulus/presentation/StimulusPresentation/data <HDF5 dataset "data": shape (150, 400, 300, 3), type "|u1">
stimulus/presentation/StimulusPresentation/dimension <HDF5 dataset "dimension": shape (3,), type "<i4">
stimulus/presentation/StimulusPresentation/distance <HDF5 dataset "distance": shape (), type "<f8">
stimulus/presentation/StimulusPresentation/field_of_view <HDF5 dataset "field_of_view": shape (3,), type "<f8">
stimulus/presentation/StimulusPresentation/format <HDF5 dataset "format": shape (), type "|O">
stimulus/presentation/StimulusPresentation/orientation <HDF5 dataset "orientation": shape (), type "|O">
stimulus/presentation/StimulusPresentation/timestamps <HDF5 dataset "timestamps": shape (150,), type "<f8">
stimulus/templates <HDF5 group "/stimulus/templates" (0 members)>
timestamps_reference_time <HDF5 dataset "timestamps_reference_time": shape (), type "|O">
units <HDF5 group "/units" (11 members)>
units/IsolationDist <HDF5 dataset "IsolationDist": shape (33,), type "<f8">
units/SNR <HDF5 dataset "SNR": shape (33,), type "<f8">
units/electrodes <HDF5 dataset "electrodes": shape (33,), type "<i4">
units/electrodes_index <HDF5 dataset "electrodes_index": shape (33,), type "<i4">
units/id <HDF5 dataset "id": shape (33,), type "<i4">
units/origClusterID <HDF5 dataset "origClusterID": shape (33,), type "<i4">
units/spike_times <HDF5 dataset "spike_times": shape (118870,), type "<f8">
units/spike_times_index <HDF5 dataset "spike_times_index": shape (33,), type "<i4">
units/waveform_mean_encoding <HDF5 dataset "waveform_mean_encoding": shape (33, 256), type "<f8">
units/waveform_mean_recognition <HDF5 dataset "waveform_mean_recognition": shape (33, 256), type "<f8">
units/waveform_mean_sampling_rate <HDF5 dataset "waveform_mean_sampling_rate": shape (33, 1), type "<f8">

To access a group or dataset from the File instance, we index like a dictionary with the “path” to the object of interest. So, for example, say we want to find the time points when stimuli were presented to the subject; we can index as follows. (Going forward, we will not use context management just to avoid code clutter, and will close the File instance when we are done. Generally, though, you should use context management, as when doing file I/O.)

[8]:
f = h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
)

f["intervals/trials/stim_on_time"]
[8]:
<HDF5 dataset "stim_on_time": shape (150,), type "<f8">

Note that we see this as a dataset consisting of 150 64-bit floating point numbers (f8). The data set has not yet been accessed, but it is available for use, almost exactly like a NumPy array. Importantly, datasets support much of NumPy’s fancy indexing and slicing. If we do want direct access of the entire array as a NumPy array, we index with an empty tuple [()] or [tuple()].

[9]:
f['intervals/trials/stim_on_time'][()]
[9]:
array([3018.31409 , 3023.236694, 3027.561148, 3031.890272, 3036.014586,
       3039.860668, 3043.831013, 3048.447768, 3053.086304, 3058.577719,
       3062.238007, 3067.642116, 3072.621551, 3077.625471, 3082.789226,
       3088.027263, 3092.089399, 3096.196786, 3100.801929, 3105.771749,
       3110.665814, 3114.849509, 3118.918742, 3123.272534, 3128.122392,
       3132.590985, 3137.111894, 3141.603956, 3147.717728, 3153.470202,
       3158.333761, 3168.646004, 3175.345146, 3179.651875, 3184.451598,
       3188.370026, 3192.887064, 3197.331327, 3201.374939, 3205.493293,
       3209.305061, 3213.320073, 3217.695492, 3222.133979, 3226.03333 ,
       3230.202833, 3234.002927, 3237.959018, 3241.884543, 3245.480595,
       3956.716701, 3963.177978, 3969.633632, 3974.25094 , 3978.264324,
       3984.13252 , 3989.92923 , 3995.180906, 4002.849509, 4008.402487,
       4012.75376 , 4017.317062, 4022.795421, 4031.293864, 4037.318977,
       4043.365779, 4047.332315, 4054.664626, 4061.639448, 4067.666958,
       4072.004561, 4084.804356, 4089.888424, 4093.922298, 4097.611186,
       4101.601407, 4116.682899, 4120.547506, 4124.680114, 4130.163327,
       4137.020737, 4144.22169 , 4148.946118, 4153.538451, 4157.395439,
       4161.793468, 4166.075529, 4175.888019, 4179.973042, 4184.478775,
       4188.875299, 4194.887817, 4198.663827, 4203.638532, 4207.17382 ,
       4214.310752, 4217.899585, 4221.665581, 4225.69116 , 4234.420955,
       4238.493628, 4242.376759, 4248.550496, 4253.654993, 4257.915795,
       4261.888014, 4267.130351, 4271.468937, 4277.210474, 4281.230217,
       4287.272472, 4291.701405, 4295.630678, 4299.617304, 4304.125188,
       4308.165789, 4312.056017, 4317.048938, 4321.668581, 4326.359709,
       4332.466845, 4339.217012, 4344.433944, 4348.149283, 4352.545253,
       4357.203112, 4361.00293 , 4365.944918, 4371.058385, 4375.244998,
       4378.87601 , 4384.071039, 4387.848954, 4392.15089 , 4396.037738,
       4401.323729, 4404.87579 , 4412.951187, 4417.785962, 4422.133795,
       4429.22354 , 4434.198583, 4440.647387, 4445.09426 , 4449.167026,
       4453.384943, 4458.401212, 4464.093628, 4472.946794, 4478.351456])
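
Because the data are read lazily from disk, we can also slice out just a portion of a dataset without loading the entire array into memory, e.g., the first five stimulus-on times:

# Only the requested slice is read from disk
f['intervals/trials/stim_on_time'][:5]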

We can use Bokeh’s image visualization to see what the stimulus images were. For example, let’s look at image number 125. (For convenience, we will use the imshow() function of the bebi103 package.)

[10]:
# Pull out stimulus image number 125
stim_im = f['stimulus/presentation/StimulusPresentation/data'][125][()]

# View the image
bokeh.io.show(bebi103.image.imshow(stim_im))

Now, to do something interesting with the data, we can look at the number of spikes recorded from all electrodes while the subject viewed a stimulus, comparing training stimuli (shown before the subject has to answer whether or not he has seen the stimulus before) with novel (not seen before) and familiar (seen before) stimuli.

First, we pull out the information about the type of image.

[11]:
stim_type = f['intervals/trials/new_old_labels_recog'][()]

# Take a look
stim_type
[11]:
array([b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'0', b'1', b'0', b'1', b'1',
       b'1', b'0', b'1', b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'0',
       b'1', b'0', b'1', b'1', b'1', b'0', b'1', b'0', b'0', b'1', b'0',
       b'1', b'1', b'0', b'0', b'1', b'0', b'0', b'0', b'1', b'0', b'1',
       b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'0', b'1', b'1', b'0',
       b'0', b'1', b'0', b'1', b'0', b'1', b'1', b'0', b'1', b'0', b'0',
       b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'1', b'1', b'1', b'0',
       b'0', b'0', b'1', b'1', b'0', b'0', b'0', b'0', b'1', b'1', b'0',
       b'0', b'0', b'0', b'0', b'1', b'1', b'0', b'1', b'1', b'1', b'0',
       b'0', b'0', b'0', b'1', b'1', b'1', b'0'], dtype=object)

Here, NA means it was a training image, 0 means that the subject had not seen the image before, and 1 means that they had. Note also that these are all byte strings, which we should convert to strings to play nicely with Bokeh. We might as well make them more descriptive in the meantime.

[12]:
rename = {b'NA': "training", b"1": "familiar", b"0": "novel"}
stim_type = np.array([rename[st] for st in stim_type])

t_on = f['intervals/trials/stim_on_time']
t_off = f['intervals/trials/stim_off_time']

Finally, we pull out the records of the spikes.

[13]:
spike_times = f['units/spike_times'][()]

Now, we can compute the spike rate (the number of observed spikes divided by the time the stimulus was present) for each stimulus.

[14]:
spike_rate = np.empty_like(t_on)
for i, (t1, t2) in enumerate(zip(t_on, t_off)):
    spike_rate[i] = np.sum((spike_times > t1) & (spike_times < t2)) / (t2 - t1)

We can toss these data into a data frame for convenient plotting, using the descriptive stimulus type labels we made above.

[15]:
df = pl.DataFrame(
    data={
        "stimulus type": stim_type,
        "spike rate (Hz)": spike_rate,
        "stimulus index": np.arange(len(stim_type)),
    }
)

Now we can make a plot!

[16]:
p = iqplot.ecdf(
    df,
    q='spike rate (Hz)',
    cats='stimulus type',
    tooltips=[('stim ind', '@{stimulus index}')]
)

bokeh.io.show(p)

In looking at the ECDF, there seems to be a higher spiking rate during training than not, and no real difference between presentation of novel versus familiar stimuli. By hovering over the point with the highest spike rate, we see that it is image 78. Let’s take a look!

[17]:
# Pull out stimulus 78
stim_im = f['stimulus/presentation/StimulusPresentation/data'][78][()]

# View the image
bokeh.io.show(bebi103.image.imshow(stim_im))

Oooo! A scary shark! It has scared us into remembering to close the file.

[18]:
f.close()

A domain-specific tool

While directly using h5py works just fine, as we have seen, the neuroscience community has built its own package for handling NWB files, pynwb. You can install it using

pip install pynwb

But you should do it in a separate environment (it is not in the bebi103 environment). Instead of using dictionary-like indexing, pynwb enables access to data using functions specific to neuroscience data sets. For example, we can conveniently pull out the spike times of all spikes measured from a single neuron, say neuron with index 10. I deduced the syntax for doing this by reading the pynwb docs.

[19]:
# Open the file
raw_io = pynwb.NWBHDF5IO(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"),
    "r"
)

# Read the content
nwb_in = raw_io.read()

# Pull out spike times of neuron with index 10
spike_times = nwb_in.units.get_unit_spike_times(10)

For many data types, domain specific tools exist and can be quite useful. I tend to only use them for parsing, slicing, accessing, and organizing data, but prefer to use my own bespoke methods for analysis.

Use conda environments!

Domain-specific packages, especially those for opening proprietary formats, commonly have restrictive dependencies, often requiring quite old versions of other packages. As a result, I advise creating a new conda environment for domain-specific file loading and converting tasks. You can use that environment to convert the file format into something more easily readable in your main analysis environment.
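
For example, one might spin up a throwaway environment for an NWB-reading task with something like (the environment name and Python version here are arbitrary choices)

conda create -n nwb-io python=3.11 pip
conda activate nwb-io
pip install pynwb

and then export whatever you need (e.g., as CSV) for use in your main analysis environment.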

Computing environment