Example Workflow

This tutorial walks through a hypothetical example of how to generate metadata for a cellular 2-photon imaging session. It does not cover session or rig metadata for now; those will be added in the future.

The example metadata here is intentionally simple. The names and values don’t perfectly align with aind-data-schema so as to show examples of mapping from local conventions to the schema.

You will see through this example that creating these metadata JSON files reveals that some important data were not being tracked in the original metadata sources. This is common: such information is usually something a single person keeps track of implicitly in their head. It must be entered somewhere, either by updating the data sources or by hard-coding values in the generation script. The latter is not advised, but it is what we do in this example to demonstrate the issue.

Identify metadata sources

In practice, key metadata is usually spread across many data sources. These could be spreadsheets, databases, TIFF file headers, or even file names.

In this example, let’s say that our basic subject and surgical procedure metadata are stored in an Excel workbook with three sheets: mice, sessions, and procedures.

Let’s say they look like this:

mice:

id  dam_id  sire_id  genotype                                               dob         sex
1                    Vip-IRES-Cre/wt                                        9/22/2023   F
2                    Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP)  9/15/2023   M
3   1       2        Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt          12/1/2023   F

procedures:

mouse_id    injection_date  protocol                brain_area  virus_name           virus_titer  injection_volume  injection_coord     perfusion_date
3           1/2/2024 7:00   injection-perfusion-v1  VISp        AAV2-Flex-ChrimsonR  2300000000   200               03.8,-0.87,-3.3,10  1/31/2024 10:22

sessions:

mouse_id  start_time       end_time
3         1/26/2024 15:00  1/26/2024 15:30
3         1/27/2024 15:00  1/27/2024 15:30
3         1/28/2024 15:00  1/28/2024 15:30

In this example you can see that we recorded three sessions from one mouse, which has a viral injection and a perfusion procedure. All mice are C57BL/6J, were bred locally, and were housed with a running wheel in their cage. Download example_workflow.xlsx and example_workflow.py to follow along.
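If you would rather sanity-check the example data without opening the workbook, here is a minimal sketch that uses only the standard library. It assumes a CSV-style text copy of the sessions sheet shown above; the `SESSIONS_CSV` constant and the `load_sessions` helper are hypothetical and are not part of example_workflow.py, which reads the workbook with pandas instead.

```python
import csv
import io
from datetime import datetime

# Text copy of the "sessions" sheet shown above (an assumption made for
# illustration; the tutorial script reads example_workflow.xlsx instead).
SESSIONS_CSV = """mouse_id,start_time,end_time
3,1/26/2024 15:00,1/26/2024 15:30
3,1/27/2024 15:00,1/27/2024 15:30
3,1/28/2024 15:00,1/28/2024 15:30
"""

# Month/day/year with a 24-hour clock, matching the values in the sheet.
TIME_FORMAT = "%m/%d/%Y %H:%M"


def load_sessions(text):
    """Parse the CSV text into a list of dicts with datetime values."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "mouse_id": raw["mouse_id"],
                "start_time": datetime.strptime(raw["start_time"], TIME_FORMAT),
                "end_time": datetime.strptime(raw["end_time"], TIME_FORMAT),
            }
        )
    return rows


sessions = load_sessions(SESSIONS_CSV)
mouse_ids = {row["mouse_id"] for row in sessions}
durations = [row["end_time"] - row["start_time"] for row in sessions]
print(len(sessions), "sessions from mice", sorted(mouse_ids))
```

A check like this confirms the description above: every session row belongs to the same mouse, and each session lasts half an hour.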

Make data description

The data description schema contains basic administrative metadata: who collected the data, how it was funded, and so on.

import os

import pandas as pd
from aind_data_schema_models.modalities import Modality
from aind_data_schema_models.organizations import Organization
from aind_data_schema_models.pid_names import PIDName
from aind_data_schema_models.platforms import Platform

from aind_data_schema.core.data_description import Funding, RawDataDescription
from aind_data_schema.core.procedures import NanojectInjection, Perfusion, Procedures, Surgery, ViralMaterial
from aind_data_schema.core.subject import BreedingInfo, Housing, Species, Subject

sessions_df = pd.read_excel("example_workflow.xlsx", sheet_name="sessions")
mice_df = pd.read_excel("example_workflow.xlsx", sheet_name="mice")
procedures_df = pd.read_excel("example_workflow.xlsx", sheet_name="procedures")

# everything was done by one person, so it's not in the spreadsheet
experimenter = "Sam Student"

# in our spreadsheet, we stored sex as M/F instead of Male/Female
subject_sex_lookup = {
    "F": "Female",
    "M": "Male",
}

# everything is covered by the same IACUC protocol
iacuc_protocol = "2109"

# loop through all of the sessions
for session_idx, session in sessions_df.iterrows():
    # our data always contains planar optical physiology and behavior videos
    d = RawDataDescription(
        modality=[Modality.POPHYS, Modality.BEHAVIOR_VIDEOS],
        platform=Platform.BEHAVIOR,
        subject_id=str(session["mouse_id"]),
        creation_time=session["end_time"].to_pydatetime(),
        institution=Organization.OTHER,
        investigators=[PIDName(name="Some Investigator")],
        funding_source=[Funding(funder=Organization.NIMH)],
    )

    # we will store our json files in a directory named after the session
    os.makedirs(d.name, exist_ok=True)

    d.write_standard_file(output_directory=d.name)

    # look up the mouse used in this session
    mouse = mice_df[mice_df["id"] == session["mouse_id"]].iloc[0]
    dam = mice_df[mice_df["id"] == mouse["dam_id"]].iloc[0]
    sire = mice_df[mice_df["id"] == mouse["sire_id"]].iloc[0]

    # construct the subject
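Note that the dam and sire lookups above rely on every recorded mouse having both parents in the mice sheet. That holds for mouse 3, but mice 1 and 2 have empty dam_id and sire_id cells, so `.iloc[0]` would raise an `IndexError` if one of them ever appeared in the sessions sheet. The following is a minimal sketch of a more defensive lookup, written with plain dictionaries so it runs without pandas. The `MICE` list and the `find_mouse` and `find_parents` helpers are hypothetical, and `None` stands in for an empty cell (pandas would report it as NaN).

```python
from typing import Optional, Tuple

# Rows mirror the "mice" sheet above; None marks an empty dam_id/sire_id cell.
MICE = [
    {"id": 1, "dam_id": None, "sire_id": None, "genotype": "Vip-IRES-Cre/wt"},
    {
        "id": 2,
        "dam_id": None,
        "sire_id": None,
        "genotype": "Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP)",
    },
    {
        "id": 3,
        "dam_id": 1,
        "sire_id": 2,
        "genotype": "Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
    },
]


def find_mouse(mouse_id) -> Optional[dict]:
    """Return the row for mouse_id, or None if the id is empty or not found."""
    if mouse_id is None:
        return None
    for row in MICE:
        if row["id"] == mouse_id:
            return row
    return None


def find_parents(mouse_id) -> Tuple[Optional[dict], Optional[dict]]:
    """Return (dam, sire) rows; either may be None when it was not recorded."""
    mouse = find_mouse(mouse_id)
    if mouse is None:
        raise KeyError(f"mouse {mouse_id!r} is not in the mice sheet")
    return find_mouse(mouse["dam_id"]), find_mouse(mouse["sire_id"])


dam, sire = find_parents(3)
print(dam["genotype"], "x", sire["genotype"])
```

With a lookup like this, the script can decide explicitly what to write when a parent is missing, for example by recording the value as unknown, instead of failing partway through the loop.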

Make subject

The subject metadata is a bit more complex. In this case, certain fields are required but we simply didn’t keep track of them. As a best practice, we acknowledge that this information is unavailable by recording it as unknown.

    s = Subject(
        subject_id=str(mouse["id"]),
        species=Species.MUS_MUSCULUS,  # all subjects are mice
        sex=subject_sex_lookup.get(mouse["sex"]),
        date_of_birth=mouse["dob"],
        genotype=mouse["genotype"],
        breeding_info=BreedingInfo(
            maternal_id=str(dam["id"]),
            maternal_genotype=dam["genotype"],
            paternal_id=str(sire["id"]),
            paternal_genotype=sire["genotype"],
            breeding_group="unknown",  # not in spreadsheet
        ),
        housing=Housing(
            home_cage_enrichment=["Running wheel"],  # all subjects had a running wheel in their cage
            cage_id="unknown",  # not in spreadsheet
        ),
        background_strain="C57BL/6J",
        source=Organization.OTHER,
    )
    s.write_standard_file(output_directory=d.name)

    # look up the procedures performed on this mouse
    proc_row = procedures_df[procedures_df["mouse_id"] == mouse["id"]].iloc[0]

    # we stored the injection coordinates as a comma-delimited string: AP,ML,DV,angle
    coords = proc_row.injection_coord.split(",")
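Splitting the coordinate string by position works, but it is easy to mix up the indices later on. Here is a minimal sketch of an alternative that parses the string into named fields and checks that all four values are present. The `InjectionCoord` type and the `parse_injection_coord` helper are hypothetical names and are not part of aind-data-schema; the field order follows the AP,ML,DV,angle convention described in the comment above.

```python
from typing import NamedTuple


class InjectionCoord(NamedTuple):
    """Injection coordinates in the spreadsheet's order: AP, ML, DV, angle."""

    ap: float
    ml: float
    dv: float
    angle: float


def parse_injection_coord(raw: str) -> InjectionCoord:
    """Parse a comma-delimited 'AP,ML,DV,angle' string into named floats."""
    parts = [part.strip() for part in raw.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}: {raw!r}")
    return InjectionCoord(*(float(part) for part in parts))


# The value from the "procedures" sheet above.
coord = parse_injection_coord("03.8,-0.87,-3.3,10")
print(coord)
```

The named fields make the later mapping to the injection's AP, ML, depth, and angle arguments harder to get wrong than `coords[0]` through `coords[3]`.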

Make procedures

While it’s best practice to store each surgery as a separate record, in our example we instead have one row per mouse, with the different procedures stored in separate columns. This makes it harder to represent lists of procedures, but because our hypothetical protocol is always the same (one injection at one depth, followed by a perfusion at a later date), we can get away with this simplification.

    protocol = str(proc_row["protocol"])

    p = Procedures(
        subject_id=str(mouse["id"]),
        subject_procedures=[
            Surgery(
                start_date=proc_row["injection_date"].to_pydatetime().date(),
                protocol_id=protocol,
                iacuc_protocol=iacuc_protocol,
                experimenter_full_name=experimenter,
                procedures=[
                    NanojectInjection(
                        protocol_id=protocol,
                        injection_materials=[
                            ViralMaterial(
                                material_type="Virus",
                                name=proc_row["virus_name"],
                                titer=proc_row["virus_titer"],
                            )
                        ],
                        targeted_structure=proc_row["brain_area"],
                        injection_coordinate_ml=float(coords[1]),
                        injection_coordinate_ap=float(coords[0]),
                        injection_angle=float(coords[3]),
                        # multiple injection volumes at different depths are allowed, but that's not happening here
                        injection_coordinate_depth=[float(coords[2])],
                        injection_volume=[float(proc_row["injection_volume"])],
                    )
                ],
            ),
            Surgery(
                start_date=proc_row["perfusion_date"].to_pydatetime().date(),
                experimenter_full_name=experimenter,
                iacuc_protocol=iacuc_protocol,
                protocol_id=protocol,
                procedures=[Perfusion(protocol_id=protocol, output_specimen_ids=["1"])],
            ),
        ],
    )
    p.write_standard_file(output_directory=d.name)
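To illustrate the one-record-per-surgery layout mentioned at the start of this section, here is a minimal sketch that reshapes one wide row into two surgery records using only the standard library. The `WIDE_ROW` dictionary copies a few columns from the procedures sheet above; the `split_into_surgeries` helper and the record field names are hypothetical, chosen only for this illustration.

```python
from datetime import datetime

# A few columns copied from the single row of the "procedures" sheet above.
WIDE_ROW = {
    "mouse_id": 3,
    "protocol": "injection-perfusion-v1",
    "injection_date": "1/2/2024 7:00",
    "perfusion_date": "1/31/2024 10:22",
}

TIME_FORMAT = "%m/%d/%Y %H:%M"


def split_into_surgeries(row):
    """Turn one wide row into one record per surgery, ordered by date."""
    records = []
    for kind in ("injection", "perfusion"):
        when = datetime.strptime(row[f"{kind}_date"], TIME_FORMAT)
        records.append(
            {
                "mouse_id": row["mouse_id"],
                "procedure": kind,
                "date": when.date(),
                "protocol": row["protocol"],
            }
        )
    return sorted(records, key=lambda record: record["date"])


surgeries = split_into_surgeries(WIDE_ROW)
for record in surgeries:
    print(record["date"].isoformat(), record["procedure"])
```

In this long layout, a mouse with additional procedures simply gets more rows, so the generation script could build one Surgery per record instead of hard-coding two.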

And there you have it. More metadata to come!