8. Data file formats

Dataset download


[2]:
import os

import pandas as pd
import numpy as np

import h5py
import fcsparser
import pynwb

import iqplot

import bebi103

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

Data sets, biological and otherwise, come in a large variety of formats, being acquired using an even larger variety of instruments. The ability to read and parse data sets in different formats is a key part of data wrangling.

There is no real prescribed procedure to handle various data file formats, so I instead give a few tips here.

  1. If the file format is proprietary, obtain documentation from the instrument manufacturer or whoever produced the file format. If the documentation is not available, call the company to obtain assistance. If no information is available, avoid using the format; don’t guess.

  2. If you are using a proprietary format, try to find a way to convert the data set to an open format. Depending on what field you are working in, there may be automated converters available. For example, you can convert a lot of proprietary microscopy formats to the open OME-TIFF format using Bio-Formats.

  3. For open file formats, carefully read the documentation of the file format. This may require some Googling.

  4. If you need to read file formats into Pandas data frames, NumPy arrays, or other familiar Python classes, there are often packages available to do so, and these are usually described in the documentation of the file format.

  5. When reading in data, make sure you get it all. Metadata are data that provide information about other data, presumably the “main” content of a data set. (I often do not make this distinction; metadata are data!) Be sure you know how to access the metadata within a given file format.

These tips are not immediately helpful for any given file format. To demonstrate how to handle various file formats, I will give a few examples. Ultimately, though, it is your responsibility as an analyst of data to learn and understand the file format to make sure you can completely, accurately, and conveniently access a data set. I argue further that as a data producer, you are responsible for storing and sharing your data in an open, easily accessible, well-documented file format.

With that, I proceed to the examples.

Reading data from MS Excel

You will often find collaborators, and sometimes instruments, provide data as Microsoft Excel documents. Prior to 2007, the Excel format was proprietary, but now it is open. This has allowed Python developers to write parsers of Excel files.

As an example of reading from an Excel file, we will open a data set from David Lack. The data set, available from Dryad and downloadable here, contains Lack’s morphological measurements of finches taken from the Galápagos in 1905 and 1906 and stored in the California Academy of Sciences. These measurements were the basis for his famous book Darwin’s Finches.

Reading in the data is as simple as using pd.read_excel(). For Excel documents with multiple sheets or more complicated layouts, be sure to read the docs. As with pd.read_csv(), there are lots of useful kwargs.

[3]:
# Read the data in
df = pd.read_excel(os.path.join(data_path, "morphLack.xls"))

# Take a look
df.head()
[3]:
Source IslandID TaxonOrig GenusL69 SpeciesL69 SubspL69 SpeciesID SubspID Sex Plumage BodyL WingL TailL BeakW BeakH LBeakL UBeakL N-UBkL TarsusL MToeL
0 Lack CAS Wlf_Wnm Geospiza magnirostris Geospiza magnirostris NaN Geo.mag Geo.mag M Black NaN 86.0 NaN NaN 20.7 NaN 22.8 15.7 NaN NaN
1 Lack CAS Wlf_Wnm Geospiza magnirostris Geospiza magnirostris NaN Geo.mag Geo.mag M Black NaN 86.0 NaN NaN 19.5 NaN 24.3 16.6 NaN NaN
2 Lack CAS Wlf_Wnm Geospiza magnirostris Geospiza magnirostris NaN Geo.mag Geo.mag M Black NaN 88.0 NaN NaN NaN NaN 23.1 15.1 NaN NaN
3 Lack CAS Wlf_Wnm Geospiza magnirostris Geospiza magnirostris NaN Geo.mag Geo.mag M Black NaN 84.0 NaN NaN 20.6 NaN 22.4 15.3 NaN NaN
4 Lack CAS Wlf_Wnm Geospiza magnirostris Geospiza magnirostris NaN Geo.mag Geo.mag M Black NaN 86.0 NaN NaN 21.4 NaN 23.1 15.4 NaN NaN

Aside from Pandas, there are several other Excel readers available, as described at http://www.python-excel.org. For more fine-grained control of reading in the data from Excel spreadsheets, you might want to use xlrd for older documents (the Lack data set is one such document, though Pandas handles it) and openpyxl for newer documents.
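To give a feel for working with openpyxl directly, here is a minimal sketch. The workbook is built in memory for the example (the sheet name and values are made up); loading an .xlsx file from disk works the same way, passing a path to openpyxl.load_workbook().

```python
import io

import openpyxl

# Build a tiny workbook in memory (hypothetical sheet name and values)
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "measurements"
ws.append(["species", "beak_length_mm"])
ws.append(["Geospiza magnirostris", 22.8])

# Round-trip through an in-memory buffer instead of a file on disk
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

# Load and access cells row by row
wb2 = openpyxl.load_workbook(buf)
ws2 = wb2["measurements"]
print([cell.value for cell in ws2[2]])  # second row of the sheet
```

This level of control is useful when a spreadsheet has merged cells, formulas, or formatting that pd.read_excel() does not expose.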

Example: Flow cytometry data

Flow cytometry is a technique in which cells are run single file through a fluid channel and subjected to illumination by a laser. The front and side scatters of the light are measured, as is the fluorescence intensity emitted if the cells contain fluorophores.

The technique is many decades old. Most often, flow cytometry data are stored in flow cytometry standard (FCS) format. These are binary files that cannot be directly read using Pandas. As such, there are a few options we can pursue for reading FCS files and getting them into a format to work with.

  1. Use commercial software like FlowJo to open the file, and use it for analysis or to save the data as CSV.

  2. Study the FCS specification and write our own parser to open FCS files and store the content in a data frame.

  3. Look for packages that provide parsers.

In almost all cases for file formats, option (3) is preferred. A cursory Google search reveals several Python-based packages for opening FCS files and storing them in convenient formats (and also performing analysis).

There is considerable variety among these packages, particularly in terms of their scope. Several aim to facilitate standard analyses of flow cytometry data. By contrast, fcsparser simply aims to read in an FCS file so that you can use its contents in a Pandas data frame. This is the functionality we desire here.

Use conda environments!

In order to use fcsparser, we need to install it. It is not included in the bebi103 environment because it requires earlier versions of Pandas as a dependency. This is fairly common in domain-specific packages; they often have restrictive dependencies. I advise creating a new conda environment for domain-specific tasks. That is what I have done in this notebook.

As an example, we will use this FCS file, which was part of this paper by Razo-Mejia, et al. (The complete set of FCS files used in that paper is available here.) Given its simple scope of providing a tool to read FCS files into data frames, usage of fcsparser is simple and may be ascertained from its equally simple documentation.

[4]:
metadata, df = fcsparser.parse(
    os.path.join(data_path, "20160804_wt_O2_HG104_0uMIPTG.fcs")
)

# Take a look
df.head()
[4]:
HDR-T FSC-A FSC-H FSC-W SSC-A SSC-H SSC-W FITC-A FITC-H FITC-W APC-Cy7-A APC-Cy7-H APC-Cy7-W
0 0.418674 6537.148438 6417.625000 133513.125000 24118.714844 22670.142578 139447.218750 11319.865234 6816.254883 217673.406250 55.798954 255.540833 28620.398438
1 2.563462 6402.215820 5969.625000 140570.171875 23689.554688 22014.142578 141047.390625 1464.151367 5320.254883 36071.437500 74.539612 247.540833 39468.460938
2 4.921260 5871.125000 5518.852539 139438.421875 16957.433594 17344.511719 128146.859375 5013.330078 7328.779785 89661.203125 -31.788519 229.903214 -18123.212891
3 5.450112 6928.865723 8729.474609 104036.078125 13665.240234 11657.869141 153641.312500 879.165771 6997.653320 16467.523438 118.226028 362.191162 42784.375000
4 9.570750 11081.580078 6218.314453 233581.765625 43528.683594 22722.318359 251091.968750 2271.960693 9731.527344 30600.585938 20.693352 210.486893 12885.928711

Very nice! Now we can unleash all of our Python-based analysis tools on flow cytometry data sets, which we will do soon when we use this data set in some overplotting examples.

We can also save the data as a CSV file for future use using the to_csv() method,

df.to_csv('20160804_wt_O2_HG104_0uMIPTG.csv', index=False)

thereby eliminating the need for a special parser down the road.

Note, though, that you should never delete the raw data file that came from the data source, even if its format is inconvenient.

Example: Electrophysiological recordings of neurons in the human brain

As an example of a file format specific to a given field, we will open up an NWB (Neurodata Without Borders) file. NWB was created as a standard for neuroscience data so that they may be shared across research groups. This is stated clearly in their description of the data standard:

Neurodata Without Borders (NWB) is a data standard for neurophysiology, providing neuroscientists with a common standard to share, archive, use, and build common analysis tools for neurophysiology data. NWB is designed to store a variety of neurophysiology data, including data from intracellular and extracellular electrophysiology experiments, data from optical physiology experiments, and tracking and stimulus data. The project includes not only the NWB format, but also a broad range of software for data standardization and application programming interfaces (APIs) for reading and writing the data as well as high-value data sets that have been translated into the NWB data standard.

NWB files are stored in HDF5 format with a specific schema for their structure (with a more detailed schema here). Therefore, we can use an HDF5 parser to explore NWB data sets.

HDF stands for “hierarchical data format,” meaning that a single HDF file can contain a file system-like structure of data. Each entry is either a dataset, which contains an array of data, or a group, which contains datasets and other groups. You can think of groups like directories in a file system. Both datasets and groups can have arbitrary attributes, which are (possibly free-form) metadata. Importantly, HDF5 files are highly optimized for large data sets. When working with them, you do not need to store all of the data in RAM, but can rapidly access portions of the data from disk as needed.
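To get a feel for this structure before opening a real file, here is a minimal sketch that builds a tiny HDF5 file with h5py and reads part of it back. All names and values are made up, and the "core" driver with backing_store=False keeps the file entirely in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file; driver="core" with backing_store=False never touches disk
f = h5py.File("example.h5", "w", driver="core", backing_store=False)

# A group is like a directory; a dataset holds an array
grp = f.create_group("recordings")
dset = grp.create_dataset("voltage", data=np.linspace(0, 1, 11))

# Attributes attach metadata directly to groups and datasets
dset.attrs["units"] = "mV"

# Index like a nested dictionary; slicing reads only the requested portion
print(f["recordings/voltage"][:3])            # first three entries only
print(f["recordings/voltage"].attrs["units"])  # the metadata we stored

f.close()
```

The same dictionary-style indexing and slicing is what we will use on the NWB file below.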

To see how this works, let’s consider an example. We will work with the file sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb, available here. It may also be directly downloaded from the DANDI Archive by clicking here. The data set comes from the lab of Ueli Rutishauser at Cedars-Sinai. It was featured in a publication describing NWB by Chandravadia, et al., originally coming from the paper by Faraut, et al. highlighting the data set on declarative memory encoding and recognition in humans. The data set includes recordings from single neurons from the human amygdala and hippocampus. The subjects were shown a set of images. They were then shown another set of images, some of which they had already seen. Neuronal spike times were recorded during this process. We will look at a portion of this data set measured from electrodes in the right amygdala and left hippocampus of a single subject, a 31-year-old male. The entire data set is freely available from the DANDI archive.

As we work through this example, note that there are many file formats across fields, and this serves as an example of one and how you might navigate the format to pull out what you need for your analysis. I note that prior to writing this section of the lesson, I had never opened an NWB file nor used h5py much at all. I used a skill you will hopefully master: asking the internet for help. I show this and the other examples to illustrate that you will encounter file formats you may not be familiar with, and you need to be patient and search and read carefully how to use appropriate tools to parse them.

Parsing the file with h5py

The h5py package enables opening and navigating HDF5 files. It takes advantage of HDF5’s structure to enable access of the data on disk without copying the whole thing into RAM. We have imported h5py above and can use it to open the file.

[5]:
with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    print(f)
<HDF5 file "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb" (mode r)>

Note that we have used h5py.File() to instantiate a File instance to get access to the data in the file. I am using context management here to keep me out of trouble when reading and writing files, just as we did when we learned about file I/O in an earlier lesson.

After opening the file, I printed it; it is an HDF5 File instance. File instances behave much like dictionaries, and I can see the keys using the keys() method.

[6]:
with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    print(f.keys())
<KeysViewHDF5 ['acquisition', 'analysis', 'file_create_date', 'general', 'identifier', 'intervals', 'processing', 'session_description', 'session_start_time', 'stimulus', 'timestamps_reference_time', 'units']>

We see that the HDF file contains groups acquisition, analysis, etc. It would be useful to crawl the whole hierarchical structure of the file, something like ls -R on the command line. This is possible using the visititems() method. It takes a single argument, which is a function you want applied to each item in the hierarchy of the file. To view what we have, the print() function will work fine.

[7]:
with h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
    f.visititems(print)
acquisition <HDF5 group "/acquisition" (2 members)>
acquisition/events <HDF5 group "/acquisition/events" (2 members)>
acquisition/events/data <HDF5 dataset "data": shape (754,), type "|O">
acquisition/events/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
acquisition/experiment_ids <HDF5 group "/acquisition/experiment_ids" (2 members)>
acquisition/experiment_ids/data <HDF5 dataset "data": shape (754,), type "<f8">
acquisition/experiment_ids/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
analysis <HDF5 group "/analysis" (0 members)>
file_create_date <HDF5 dataset "file_create_date": shape (1,), type "|O">
general <HDF5 group "/general" (9 members)>
general/data_collection <HDF5 dataset "data_collection": shape (), type "|O">
general/devices <HDF5 group "/general/devices" (1 members)>
general/devices/Neuralynx-cheetah <HDF5 group "/general/devices/Neuralynx-cheetah" (0 members)>
general/experiment_description <HDF5 dataset "experiment_description": shape (), type "|O">
general/extracellular_ephys <HDF5 group "/general/extracellular_ephys" (11 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-1 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-1" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-10 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-10" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-11 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-11" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-13 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-13" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-15 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-15" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-3 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-3" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-4 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-4" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-5 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-5" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-6 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-6" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-7 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-7" (1 members)>
general/extracellular_ephys/electrodes <HDF5 group "/general/extracellular_ephys/electrodes" (10 members)>
general/extracellular_ephys/electrodes/filtering <HDF5 dataset "filtering": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group <HDF5 dataset "group": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group_name <HDF5 dataset "group_name": shape (10,), type "|O">
general/extracellular_ephys/electrodes/id <HDF5 dataset "id": shape (10,), type "<i4">
general/extracellular_ephys/electrodes/imp <HDF5 dataset "imp": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/location <HDF5 dataset "location": shape (10,), type "|O">
general/extracellular_ephys/electrodes/origChannel <HDF5 dataset "origChannel": shape (10,), type "<u2">
general/extracellular_ephys/electrodes/x <HDF5 dataset "x": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/y <HDF5 dataset "y": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/z <HDF5 dataset "z": shape (10,), type "<f8">
general/institution <HDF5 dataset "institution": shape (), type "|O">
general/keywords <HDF5 dataset "keywords": shape (7,), type "|O">
general/lab <HDF5 dataset "lab": shape (), type "|O">
general/related_publications <HDF5 dataset "related_publications": shape (1,), type "|O">
general/subject <HDF5 group "/general/subject" (5 members)>
general/subject/age <HDF5 dataset "age": shape (), type "|O">
general/subject/description <HDF5 dataset "description": shape (), type "|O">
general/subject/sex <HDF5 dataset "sex": shape (), type "|O">
general/subject/species <HDF5 dataset "species": shape (), type "|O">
general/subject/subject_id <HDF5 dataset "subject_id": shape (), type "|O">
identifier <HDF5 dataset "identifier": shape (), type "|O">
intervals <HDF5 group "/intervals" (1 members)>
intervals/trials <HDF5 group "/intervals/trials" (14 members)>
intervals/trials/category_name <HDF5 dataset "category_name": shape (150,), type "|O">
intervals/trials/delay1_time <HDF5 dataset "delay1_time": shape (150,), type "<f8">
intervals/trials/delay2_time <HDF5 dataset "delay2_time": shape (150,), type "<f8">
intervals/trials/external_image_file <HDF5 dataset "external_image_file": shape (150,), type "|O">
intervals/trials/id <HDF5 dataset "id": shape (150,), type "<i4">
intervals/trials/new_old_labels_recog <HDF5 dataset "new_old_labels_recog": shape (150,), type "|O">
intervals/trials/response_time <HDF5 dataset "response_time": shape (150,), type "<f8">
intervals/trials/response_value <HDF5 dataset "response_value": shape (150,), type "<f8">
intervals/trials/start_time <HDF5 dataset "start_time": shape (150,), type "<f8">
intervals/trials/stimCategory <HDF5 dataset "stimCategory": shape (150,), type "|u1">
intervals/trials/stim_off_time <HDF5 dataset "stim_off_time": shape (150,), type "<f8">
intervals/trials/stim_on_time <HDF5 dataset "stim_on_time": shape (150,), type "<f8">
intervals/trials/stim_phase <HDF5 dataset "stim_phase": shape (150,), type "|O">
intervals/trials/stop_time <HDF5 dataset "stop_time": shape (150,), type "<f8">
processing <HDF5 group "/processing" (0 members)>
session_description <HDF5 dataset "session_description": shape (), type "|O">
session_start_time <HDF5 dataset "session_start_time": shape (), type "|O">
stimulus <HDF5 group "/stimulus" (2 members)>
stimulus/presentation <HDF5 group "/stimulus/presentation" (1 members)>
stimulus/presentation/StimulusPresentation <HDF5 group "/stimulus/presentation/StimulusPresentation" (7 members)>
stimulus/presentation/StimulusPresentation/data <HDF5 dataset "data": shape (150, 400, 300, 3), type "|u1">
stimulus/presentation/StimulusPresentation/dimension <HDF5 dataset "dimension": shape (3,), type "<i4">
stimulus/presentation/StimulusPresentation/distance <HDF5 dataset "distance": shape (), type "<f8">
stimulus/presentation/StimulusPresentation/field_of_view <HDF5 dataset "field_of_view": shape (3,), type "<f8">
stimulus/presentation/StimulusPresentation/format <HDF5 dataset "format": shape (), type "|O">
stimulus/presentation/StimulusPresentation/orientation <HDF5 dataset "orientation": shape (), type "|O">
stimulus/presentation/StimulusPresentation/timestamps <HDF5 dataset "timestamps": shape (150,), type "<f8">
stimulus/templates <HDF5 group "/stimulus/templates" (0 members)>
timestamps_reference_time <HDF5 dataset "timestamps_reference_time": shape (), type "|O">
units <HDF5 group "/units" (11 members)>
units/IsolationDist <HDF5 dataset "IsolationDist": shape (33,), type "<f8">
units/SNR <HDF5 dataset "SNR": shape (33,), type "<f8">
units/electrodes <HDF5 dataset "electrodes": shape (33,), type "<i4">
units/electrodes_index <HDF5 dataset "electrodes_index": shape (33,), type "<i4">
units/id <HDF5 dataset "id": shape (33,), type "<i4">
units/origClusterID <HDF5 dataset "origClusterID": shape (33,), type "<i4">
units/spike_times <HDF5 dataset "spike_times": shape (118870,), type "<f8">
units/spike_times_index <HDF5 dataset "spike_times_index": shape (33,), type "<i4">
units/waveform_mean_encoding <HDF5 dataset "waveform_mean_encoding": shape (33, 256), type "<f8">
units/waveform_mean_recognition <HDF5 dataset "waveform_mean_recognition": shape (33, 256), type "<f8">
units/waveform_mean_sampling_rate <HDF5 dataset "waveform_mean_sampling_rate": shape (33, 1), type "<f8">

To access a group or dataset from the File instance, we index like a dictionary with the “path” to the object of interest. So, for example, if we want to find the time points when stimuli were presented to the subject, we can index as follows. (Going forward, we will not use context management just to avoid code clutter, and will close the File instance when we are done. Generally, though, you should use context management, as when doing file I/O.)

[8]:
f = h5py.File(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
)

f["intervals/trials/stim_on_time"]
[8]:
<HDF5 dataset "stim_on_time": shape (150,), type "<f8">

Note that we see this as a dataset consisting of 150 64-bit floating point numbers (f8). The data set has not yet been read into memory, but it is available for use, almost exactly like a NumPy array. Importantly, datasets support much of NumPy’s fancy indexing and slicing. If we do want direct access of the entire array as a NumPy array, we index with an empty tuple, [()] or [tuple()].

[9]:
f['intervals/trials/stim_on_time'][()]
[9]:
array([3018.31409 , 3023.236694, 3027.561148, 3031.890272, 3036.014586,
       3039.860668, 3043.831013, 3048.447768, 3053.086304, 3058.577719,
       3062.238007, 3067.642116, 3072.621551, 3077.625471, 3082.789226,
       3088.027263, 3092.089399, 3096.196786, 3100.801929, 3105.771749,
       3110.665814, 3114.849509, 3118.918742, 3123.272534, 3128.122392,
       3132.590985, 3137.111894, 3141.603956, 3147.717728, 3153.470202,
       3158.333761, 3168.646004, 3175.345146, 3179.651875, 3184.451598,
       3188.370026, 3192.887064, 3197.331327, 3201.374939, 3205.493293,
       3209.305061, 3213.320073, 3217.695492, 3222.133979, 3226.03333 ,
       3230.202833, 3234.002927, 3237.959018, 3241.884543, 3245.480595,
       3956.716701, 3963.177978, 3969.633632, 3974.25094 , 3978.264324,
       3984.13252 , 3989.92923 , 3995.180906, 4002.849509, 4008.402487,
       4012.75376 , 4017.317062, 4022.795421, 4031.293864, 4037.318977,
       4043.365779, 4047.332315, 4054.664626, 4061.639448, 4067.666958,
       4072.004561, 4084.804356, 4089.888424, 4093.922298, 4097.611186,
       4101.601407, 4116.682899, 4120.547506, 4124.680114, 4130.163327,
       4137.020737, 4144.22169 , 4148.946118, 4153.538451, 4157.395439,
       4161.793468, 4166.075529, 4175.888019, 4179.973042, 4184.478775,
       4188.875299, 4194.887817, 4198.663827, 4203.638532, 4207.17382 ,
       4214.310752, 4217.899585, 4221.665581, 4225.69116 , 4234.420955,
       4238.493628, 4242.376759, 4248.550496, 4253.654993, 4257.915795,
       4261.888014, 4267.130351, 4271.468937, 4277.210474, 4281.230217,
       4287.272472, 4291.701405, 4295.630678, 4299.617304, 4304.125188,
       4308.165789, 4312.056017, 4317.048938, 4321.668581, 4326.359709,
       4332.466845, 4339.217012, 4344.433944, 4348.149283, 4352.545253,
       4357.203112, 4361.00293 , 4365.944918, 4371.058385, 4375.244998,
       4378.87601 , 4384.071039, 4387.848954, 4392.15089 , 4396.037738,
       4401.323729, 4404.87579 , 4412.951187, 4417.785962, 4422.133795,
       4429.22354 , 4434.198583, 4440.647387, 4445.09426 , 4449.167026,
       4453.384943, 4458.401212, 4464.093628, 4472.946794, 4478.351456])

We can use Bokeh’s image visualization to see what the stimulus images were. For example, let’s look at image number 125. (For convenience, we will use the imshow() function of the bebi103 package.)

[10]:
# Pull out stimulus 125
stim_im = f['stimulus/presentation/StimulusPresentation/data'][125]

# View the image
bokeh.io.show(bebi103.image.imshow(stim_im))

Now, to do something interesting with the data, we can look at the number of spikes recorded from all electrodes while the subject viewed a stimulus, comparing stimuli shown during training (before the subject has to answer whether or not he has seen the stimulus before) to stimuli that are novel (not seen before) or familiar (seen before).

First, we pull out the information about the type of image.

[11]:
stim_type = f['intervals/trials/new_old_labels_recog'][()]

# Take a look
stim_type
[11]:
array([b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA', b'NA',
       b'NA', b'NA', b'NA', b'NA', b'NA', b'0', b'1', b'0', b'1', b'1',
       b'1', b'0', b'1', b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'0',
       b'1', b'0', b'1', b'1', b'1', b'0', b'1', b'0', b'0', b'1', b'0',
       b'1', b'1', b'0', b'0', b'1', b'0', b'0', b'0', b'1', b'0', b'1',
       b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'0', b'1', b'1', b'0',
       b'0', b'1', b'0', b'1', b'0', b'1', b'1', b'0', b'1', b'0', b'0',
       b'1', b'1', b'1', b'0', b'0', b'1', b'0', b'1', b'1', b'1', b'0',
       b'0', b'0', b'1', b'1', b'0', b'0', b'0', b'0', b'1', b'1', b'0',
       b'0', b'0', b'0', b'0', b'1', b'1', b'0', b'1', b'1', b'1', b'0',
       b'0', b'0', b'0', b'1', b'1', b'1', b'0'], dtype=object)

Here, NA means it was a training image, 0 means that the subject had not seen the image before, and 1 means that they had. Note also that these are all byte strings, which we should convert to strings to play nicely with Bokeh. We might as well make them more descriptive in the meantime.

[12]:
rename = {b"NA": "training", b"1": "familiar", b"0": "novel"}
stim_type = np.array([rename[st] for st in stim_type])

t_on = f['intervals/trials/stim_on_time']
t_off = f['intervals/trials/stim_off_time']

Finally, we pull out the records of the spikes.

[13]:
spike_times = f['units/spike_times'][()]

Now, we can compute the spike rate (the number of observed spikes divided by the time the stimulus was present) for each stimulus.

[14]:
spike_rate = np.empty_like(t_on)
for i, (t1, t2) in enumerate(zip(t_on, t_off)):
    spike_rate[i] = np.sum((spike_times > t1) & (spike_times < t2)) / (t2 - t1)
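As an aside, because counting spikes in a window is a search in a sorted array, the loop above can be vectorized with np.searchsorted when the spike times are sorted. Here is a sketch with made-up spike and stimulus times:

```python
import numpy as np

# Made-up spike and stimulus times for illustration
rng = np.random.default_rng(3252)
spike_times = np.sort(rng.uniform(0, 100, size=1000))
t_on = np.array([10.0, 40.0, 70.0])
t_off = t_on + 5.0

# Loop version, as in the text
spike_rate_loop = np.empty_like(t_on)
for i, (t1, t2) in enumerate(zip(t_on, t_off)):
    spike_rate_loop[i] = np.sum((spike_times > t1) & (spike_times < t2)) / (t2 - t1)

# Vectorized version: count spikes strictly inside each window
counts = (
    np.searchsorted(spike_times, t_off, side="left")
    - np.searchsorted(spike_times, t_on, side="right")
)
spike_rate_vec = counts / (t_off - t_on)

print(np.allclose(spike_rate_loop, spike_rate_vec))  # True
```

For 150 stimuli the loop is perfectly fast; the vectorized form pays off when there are many windows or many spikes.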

We can toss these data into a data frame for convenient plotting.

[15]:
df = pd.DataFrame(
    data={
        "stimulus type": stim_type,
        "spike rate (Hz)": spike_rate,
        "stimulus index": np.arange(len(stim_type)),
    }
)

Now we can make a plot!

[16]:
p = iqplot.ecdf(
    df,
    q='spike rate (Hz)',
    cats='stimulus type',
    tooltips=[('stim ind', '@{stimulus index}')]
)

bokeh.io.show(p)

In looking at the ECDF, there seems to be a higher spiking rate during training than not, and no real difference between presentation of novel versus familiar stimuli. By hovering over the point with the highest spike rate, we see that it is image 78. Let’s take a look!

[17]:
# Pull out stimulus 78
stim_im = f['stimulus/presentation/StimulusPresentation/data'][78]

# View the image
bokeh.io.show(bebi103.image.imshow(stim_im))

Oooo! A scary shark! It has scared us into remembering to close the file.

[18]:
f.close()

A domain-specific tool

While directly using h5py works just fine, as we have seen, the neuroscience community has built its own package for handling NWB files, pynwb. You can install it using

pip install pynwb

But you should do it in a separate environment (it is not in the bebi103 environment). Instead of using dictionary-like indexing, pynwb enables access to data using functions specific to neuroscience data sets. For example, we can conveniently pull out the spike times of all spikes measured from a single neuron, say neuron with index 10. I deduced the syntax for doing this by reading the pynwb docs.

[19]:
# Open the file
raw_io = pynwb.NWBHDF5IO(
    os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
)

# Read the content
nwb_in = raw_io.read()

# Pull out spike times of neuron with index 10
spike_times = nwb_in.units.get_unit_spike_times(10)

For many data types, domain-specific tools exist and can be quite useful. I tend to only use them for parsing, slicing, accessing, and organizing data, but prefer to use my own bespoke methods for analysis.

Computing environment