Example Workflow
This tutorial walks through a hypothetical example of how to generate metadata for a cellular 2-photon imaging session. It does not yet cover session or rig metadata; those will be added in the future.
The example metadata here is intentionally simple. The names and values don’t perfectly align with aind-data-schema so as to show examples of mapping from local conventions to the schema.
As this example shows, creating these metadata JSON files often reveals that some important data were never tracked in the original metadata sources. This is common: it is usually information that a single person keeps track of implicitly in their head. This information must be entered somewhere, either by updating the data sources or by hard-coding values in the generation script. The latter is not advised, but it is what we do in this example to demonstrate the issue.
Identify metadata sources
In practice, key metadata is usually spread across many data sources, such as spreadsheets, databases, TIFF file headers, or even file names.
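When a file name is one of those sources, a small parser can turn it into structured values. The sketch below assumes a hypothetical naming convention, `<mouse_id>_<YYYY-MM-DD>_<HH-MM>.tif`, which is not part of this tutorial's data; `FILENAME_PATTERN` and `parse_session_filename` are illustrative names only.

```python
import re
from datetime import datetime

# Hypothetical naming convention (an assumption, not taken from the tutorial):
# <mouse_id>_<YYYY-MM-DD>_<HH-MM>.tif
FILENAME_PATTERN = re.compile(
    r"^(?P<mouse_id>\d+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}-\d{2})\.tif$"
)


def parse_session_filename(filename):
    """Return the mouse ID and start time encoded in a file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        # Fail loudly so that files breaking the convention are noticed.
        raise ValueError(f"unexpected file name: {filename!r}")
    start_time = datetime.strptime(
        f"{match['date']} {match['time']}", "%Y-%m-%d %H-%M"
    )
    return {"mouse_id": match["mouse_id"], "start_time": start_time}


example = parse_session_filename("3_2024-01-26_15-00.tif")
```

Raising an error on a non-matching name, rather than returning partial values, keeps silently malformed metadata out of the generated files.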
In this example, let’s say that our basic subject and surgical procedure metadata are stored in an Excel workbook with three sheets: mice, sessions, and procedures. Let’s say they look like this:
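mice:

| id | dam_id | sire_id | genotype | dob | sex |
|----|--------|---------|----------|-----|-----|
| 1 | | | Vip-IRES-Cre/wt | 9/22/2023 | F |
| 2 | | | Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP) | 9/15/2023 | M |
| 3 | 1 | 2 | Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 12/1/2023 | F |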
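procedures:

| mouse_id | injection_date | protocol | brain_area | virus_name | virus_titer | injection_volume | injection_coord | perfusion_date |
|----------|----------------|----------|------------|------------|-------------|------------------|-----------------|----------------|
| 3 | 1/2/2024 7:00 | injection-perfusion-v1 | VISp | AAV2-Flex-ChrimsonR | 2300000000 | 200 | 03.8,-0.87,-3.3,10 | 1/31/2024 10:22 |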
sessions:

| mouse_id | start_time | end_time |
|----------|------------|----------|
| 3 | 1/26/2024 15:00 | 1/26/2024 15:30 |
| 3 | 1/27/2024 15:00 | 1/27/2024 15:30 |
| 3 | 1/28/2024 15:00 | 1/28/2024 15:30 |
In this example you can see that we recorded three sessions from one mouse, which has a viral injection and a perfusion procedure. All mice are C57BL/6J, were bred locally, and were housed with a running wheel in their cage. Download example_workflow.xlsx and example_workflow.py to follow along.
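Because the three sheets refer to each other by mouse ID, it can help to check those references before generating any metadata. The following is a minimal sketch, not part of example_workflow.py: it rebuilds only the ID columns in memory instead of reading the workbook, and `missing_references` is a hypothetical helper name.

```python
import pandas as pd

# In-memory copies of the ID columns from the example sheets.
mice_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "dam_id": [float("nan"), float("nan"), 1],
        "sire_id": [float("nan"), float("nan"), 2],
    }
)
sessions_df = pd.DataFrame({"mouse_id": [3, 3, 3]})
procedures_df = pd.DataFrame({"mouse_id": [3]})


def missing_references(values, known_ids):
    """Return the sorted IDs in `values` that are absent from `known_ids`.

    Blank cells (NaN) are ignored, since founder mice have no recorded parents.
    """
    referenced = set(pd.Series(values).dropna().astype(int))
    return sorted(referenced - set(known_ids))


known = set(mice_df["id"])
problems = {
    "sessions.mouse_id": missing_references(sessions_df["mouse_id"], known),
    "procedures.mouse_id": missing_references(procedures_df["mouse_id"], known),
    "mice.dam_id": missing_references(mice_df["dam_id"], known),
    "mice.sire_id": missing_references(mice_df["sire_id"], known),
}
```

An empty list for every key means each referenced mouse has a row in the mice sheet. Any non-empty list names the IDs that would make a later lookup fail.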
Make data description
The data description schema contains basic administrative metadata: who collected the data, how it was funded, and so on.
import os

import pandas as pd
from aind_data_schema_models.modalities import Modality
from aind_data_schema_models.organizations import Organization
from aind_data_schema_models.pid_names import PIDName
from aind_data_schema_models.platforms import Platform

from aind_data_schema.core.data_description import Funding, RawDataDescription
from aind_data_schema.core.procedures import NanojectInjection, Perfusion, Procedures, Surgery, ViralMaterial
from aind_data_schema.core.subject import BreedingInfo, Housing, Species, Subject

sessions_df = pd.read_excel("example_workflow.xlsx", sheet_name="sessions")
mice_df = pd.read_excel("example_workflow.xlsx", sheet_name="mice")
procedures_df = pd.read_excel("example_workflow.xlsx", sheet_name="procedures")

# everything was done by one person, so it's not in the spreadsheet
experimenter = "Sam Student"

# in our spreadsheet, we stored sex as M/F instead of Male/Female
subject_sex_lookup = {
    "F": "Female",
    "M": "Male",
}

# everything is covered by the same IACUC protocol
iacuc_protocol = "2109"

# loop through all of the sessions
for session_idx, session in sessions_df.iterrows():
    # our data always contains planar optical physiology and behavior videos
    d = RawDataDescription(
        modality=[Modality.POPHYS, Modality.BEHAVIOR_VIDEOS],
        platform=Platform.BEHAVIOR,
        subject_id=str(session["mouse_id"]),
        creation_time=session["end_time"].to_pydatetime(),
        institution=Organization.OTHER,
        investigators=[PIDName(name="Some Investigator")],
        funding_source=[Funding(funder=Organization.NIMH)],
    )

    # we will store our json files in a directory named after the session
    os.makedirs(d.name, exist_ok=True)
    d.write_standard_file(output_directory=d.name)

    # look up the mouse used in this session
    mouse = mice_df[mice_df["id"] == session["mouse_id"]].iloc[0]
    dam = mice_df[mice_df["id"] == mouse["dam_id"]].iloc[0]
    sire = mice_df[mice_df["id"] == mouse["sire_id"]].iloc[0]

    # construct the subject
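The dam and sire lookups above work because mouse 3 has both parents in the sheet. Mice 1 and 2 have blank `dam_id` and `sire_id` cells, so the same `.iloc[0]` lookup would raise an `IndexError` if a session were ever recorded from one of them. A hedged sketch of a more defensive lookup follows; `find_mouse` is a hypothetical helper and the data frame is an in-memory stand-in for the mice sheet.

```python
import pandas as pd

# In-memory stand-in for the relevant columns of the mice sheet.
mice_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "dam_id": [float("nan"), float("nan"), 1],
        "genotype": [
            "Vip-IRES-Cre/wt",
            "Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP)",
            "Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
        ],
    }
)


def find_mouse(mice_df, mouse_id):
    """Return the row for `mouse_id`, or None if the ID is blank or not listed."""
    if pd.isna(mouse_id):
        return None
    rows = mice_df[mice_df["id"] == mouse_id]
    return None if rows.empty else rows.iloc[0]


pup = find_mouse(mice_df, 3)
dam = find_mouse(mice_df, pup["dam_id"])  # row for mouse 1
founder = find_mouse(mice_df, 1)
founder_dam = find_mouse(mice_df, founder["dam_id"])  # None: no dam recorded
```

Returning `None` leaves the decision about what to record for a missing parent to the calling code, instead of hiding it inside the lookup.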
Make subject
The subject metadata is a bit more complex. In this case, certain fields are required but we simply didn’t keep track of them. As a best practice, we acknowledge that this information is unavailable by recording it as "unknown".
    s = Subject(
        subject_id=str(mouse["id"]),
        species=Species.MUS_MUSCULUS,  # all subjects are mice
        sex=subject_sex_lookup.get(mouse["sex"]),
        date_of_birth=mouse["dob"],
        genotype=mouse["genotype"],
        breeding_info=BreedingInfo(
            maternal_id=str(dam["id"]),
            maternal_genotype=dam["genotype"],
            paternal_id=str(sire["id"]),
            paternal_genotype=sire["genotype"],
            breeding_group="unknown",  # not in spreadsheet
        ),
        housing=Housing(
            home_cage_enrichment=["Running wheel"],  # all subjects had a running wheel in their cage
            cage_id="unknown",  # not in spreadsheet
        ),
        background_strain="C57BL/6J",
        source=Organization.OTHER,
    )
    s.write_standard_file(output_directory=d.name)

    # look up the procedures performed in this session
    proc_row = procedures_df[procedures_df["mouse_id"] == mouse["id"]].iloc[0]

    # we stored the injection coordinates as a comma-delimited string: AP,ML,DV,angle
    coords = proc_row.injection_coord.split(",")
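Indexing the split string by position, as above, makes it easy to mix up AP and ML. One possible alternative, sketched here with hypothetical names (`InjectionCoordinate`, `parse_injection_coord`), is to parse the string once into named fields and to reject rows that do not have exactly four values.

```python
from typing import NamedTuple


class InjectionCoordinate(NamedTuple):
    """Injection coordinate in the spreadsheet's column order: AP, ML, DV, angle."""

    ap: float
    ml: float
    dv: float
    angle: float


def parse_injection_coord(text):
    """Parse an 'AP,ML,DV,angle' string into named float fields."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 comma-separated values, got {len(parts)}: {text!r}"
        )
    return InjectionCoordinate(*(float(part) for part in parts))


# The value stored in the example procedures sheet.
coord = parse_injection_coord("03.8,-0.87,-3.3,10")
```

With this helper, the fields read as `coord.ap`, `coord.ml`, `coord.dv`, and `coord.angle` instead of `coords[0]` through `coords[3]`.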
Make procedures
While it’s best practice to store each surgery as a separate record, in our example we instead have one row per mouse. The different procedures are stored in separate columns. This makes it harder to represent lists of procedures, but because our hypothetical protocol is always the same - one injection at one depth followed by a perfusion at a later date - we can get away with this simplification.
    protocol = str(proc_row["protocol"])

    p = Procedures(
        subject_id=str(mouse["id"]),
        subject_procedures=[
            Surgery(
                start_date=proc_row["injection_date"].to_pydatetime().date(),
                protocol_id=protocol,
                iacuc_protocol=iacuc_protocol,
                experimenter_full_name=experimenter,
                procedures=[
                    NanojectInjection(
                        protocol_id=protocol,
                        injection_materials=[
                            ViralMaterial(
                                material_type="Virus",
                                name=proc_row["virus_name"],
                                titer=proc_row["virus_titer"],
                            )
                        ],
                        targeted_structure=proc_row["brain_area"],
                        injection_coordinate_ml=float(coords[1]),
                        injection_coordinate_ap=float(coords[0]),
                        injection_angle=float(coords[3]),
                        # multiple injection volumes at different depths are allowed, but that's not happening here
                        injection_coordinate_depth=[float(coords[2])],
                        injection_volume=[float(proc_row["injection_volume"])],
                    )
                ],
            ),
            Surgery(
                start_date=proc_row["perfusion_date"].to_pydatetime().date(),
                experimenter_full_name=experimenter,
                iacuc_protocol=iacuc_protocol,
                protocol_id=protocol,
                procedures=[Perfusion(protocol_id=protocol, output_specimen_ids=["1"])],
            ),
        ],
    )
    p.write_standard_file(output_directory=d.name)
And there you have it. More metadata to come!