Generating metadata

Why

We are gathering this metadata so that our assets will be FAIR (findable, accessible, interoperable, and re-usable).

After completing this tutorial you will be able to:

  • Build the data description, subject, and procedures metadata for your asset using Python code and pydantic.

  • Describe how a simple set of metadata for a few assets gets converted into individual metadata records.

  • Navigate the documentation.

Identify metadata sources

In practice, key metadata is usually spread across many data sources, such as spreadsheets, databases, TIFF file headers, or even file names.

In this example, let’s say that our basic subject and surgical procedure metadata are stored in an Excel workbook with three sheets: mice, procedures, and sessions.

The sheets look like this:

mice:

id  dam_id  sire_id  genotype                                               dob         sex
1                    Vip-IRES-Cre/wt                                        9/22/2023   F
2                    Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP)  9/15/2023   M
3   1       2        Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt          12/1/2023   F

procedures:

mouse_id    injection_date  protocol                brain_area  virus_name           virus_titer  injection_volume  injection_coord     perfusion_date
3           1/2/2024 7:00   injection-perfusion-v1  VISp        AAV2-Flex-ChrimsonR  2300000000   200               03.8,-0.87,-3.3,10  1/31/2024 10:22

sessions:

mouse_id  start_time       end_time
3         1/26/2024 15:00  1/26/2024 15:30
3         1/27/2024 15:00  1/27/2024 15:30
3         1/28/2024 15:00  1/28/2024 15:30

In this example you can see that we recorded three sessions from one mouse, which has a viral injection and a perfusion procedure. All mice are C57BL/6J, were bred locally, and were housed with a running wheel in their cage. Download example_workflow.xlsx and example_workflow.py to follow along.
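To make the relationship between the three sheets concrete, the sketch below follows each session row back to its mouse and that mouse's parents using plain Python dictionaries. The values are copied from the tables above. The variable names and the lookup_mouse helper are illustrative only; they are not part of aind-data-schema or of example_workflow.py.

```python
# Rows copied from the example tables; plain dicts stand in for the Excel sheets.
mice = [
    {"id": 1, "dam_id": None, "sire_id": None, "sex": "F",
     "genotype": "Vip-IRES-Cre/wt"},
    {"id": 2, "dam_id": None, "sire_id": None, "sex": "M",
     "genotype": "Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP)"},
    {"id": 3, "dam_id": 1, "sire_id": 2, "sex": "F",
     "genotype": "Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt"},
]
sessions = [
    {"mouse_id": 3, "start_time": "1/26/2024 15:00", "end_time": "1/26/2024 15:30"},
    {"mouse_id": 3, "start_time": "1/27/2024 15:00", "end_time": "1/27/2024 15:30"},
    {"mouse_id": 3, "start_time": "1/28/2024 15:00", "end_time": "1/28/2024 15:30"},
]


def lookup_mouse(mouse_id):
    """Return the single row of the mice table with this id (hypothetical helper)."""
    matches = [m for m in mice if m["id"] == mouse_id]
    if len(matches) != 1:
        raise ValueError(f"expected one mouse with id {mouse_id}, found {len(matches)}")
    return matches[0]


# One record per session: the session row plus everything reachable from it.
records = []
for session in sessions:
    mouse = lookup_mouse(session["mouse_id"])
    dam = lookup_mouse(mouse["dam_id"])
    sire = lookup_mouse(mouse["sire_id"])
    records.append(
        {
            "subject_id": str(mouse["id"]),
            "session_start": session["start_time"],
            "genotype": mouse["genotype"],
            "maternal_genotype": dam["genotype"],
            "paternal_genotype": sire["genotype"],
        }
    )

print(len(records), "records built")
```

Each session ends up with its own complete record even though the subject information is repeated across all three, which is the same shape the tutorial's final loop produces.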

Setup Python environment

First, we’ll set up the Python environment and define some shared variables.

import os
from typing import List
import pandas as pd
from datetime import datetime, date
from zoneinfo import ZoneInfo

from aind_data_schema_models.modalities import Modality
from aind_data_schema_models.organizations import Organization
from aind_data_schema_models.species import Strain
from aind_data_schema_models.units import VolumeUnit
from aind_data_schema_models.data_name_patterns import DataLevel
from aind_data_schema_models.brain_atlas import CCFv3

from aind_data_schema.components.coordinates import Rotation, Translation
from aind_data_schema.components.identifiers import Person
from aind_data_schema.components.injection_procedures import InjectionDynamics, InjectionProfile, ViralMaterial
from aind_data_schema.components.subject_procedures import BrainInjection, Perfusion
from aind_data_schema.components.subjects import BreedingInfo, Housing, MouseSubject, Species, Sex, HomeCageEnrichment
from aind_data_schema.components.coordinates import CoordinateSystemLibrary
from aind_data_schema.core.data_description import DataDescription, Funding
from aind_data_schema.core.procedures import Procedures, Surgery
from aind_data_schema.core.subject import Subject

sessions_df = pd.read_excel("example_workflow.xlsx", sheet_name="sessions")
mice_df = pd.read_excel("example_workflow.xlsx", sheet_name="mice")
procedures_df = pd.read_excel("example_workflow.xlsx", sheet_name="procedures")

# everything was done by one person, so it's not in the spreadsheet
experimenter = Person(name="Some experimenter")

# in our spreadsheet, we stored sex as M/F instead of Male/Female
subject_sex_lookup = {
    "F": Sex.FEMALE,
    "M": Sex.MALE,
}

# everything is covered by the same IACUC protocol
ethics_review_id = "2109"

How did we know which aind-data-schema classes to import?

Our general recommendation is to navigate the documentation starting from the core class you are working on. For the data description, go to the DataDescription page. The import for any object can be read from the URL of its page: core classes are found in the core subfolder, so here the import is from aind_data_schema.core.data_description import DataDescription.

One of the objects you’ll need to build is the Person. From the DataDescription page you can click through to the Person page (we recommend ctrl+click or command+click to open the link in a new tab). Again, read the URL to work out the import: in this case we’re in the components subfolder, in the file identifiers, so the import is from aind_data_schema.components.identifiers import Person. After importing the class and populating it in your Python code you can close the extra tab.

Let’s move on to build the actual data description now.

Data description

The data description schema contains basic administrative metadata: who collected the data, how it was funded, and so on. We’ll define a function to generate this and reuse it for each of the three sessions.


def generate_data_description(subject_id: str, creation_time: datetime) -> DataDescription:
    """Create the DataDescription object
    our data always contains planar optical physiology and behavior videos
    """
    return DataDescription(
        modalities=[Modality.POPHYS, Modality.BEHAVIOR_VIDEOS],
        subject_id=subject_id,
        creation_time=creation_time,
        institution=Organization.AIND,
        funding_source=[Funding(funder=Organization.NIMH)],
        investigators=[experimenter],
        data_level=DataLevel.RAW,
        project_name="Example workflow",
    )

A few of the fields in the data description required us to use enumerated variables, like DataLevel. Controlled vocabularies like this one standardize the metadata and make it easier for people to search across assets from different experiments. We also use controlled vocabularies that are linked to external registries, as with Organization.
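The benefit of an enumerated field is easiest to see with a toy example. The sketch below uses Python's standard enum module and a made-up ToyDataLevel class; it is not the real DataLevel from aind_data_schema_models, and its members are assumptions chosen for illustration. The point is only that free text outside the vocabulary is rejected instead of silently stored.

```python
from enum import Enum


class ToyDataLevel(Enum):
    """Toy controlled vocabulary (hypothetical; not the library's DataLevel)."""

    RAW = "raw"
    DERIVED = "derived"


def parse_level(text: str) -> ToyDataLevel:
    """Map free text onto the vocabulary, raising ValueError for anything else."""
    return ToyDataLevel(text.strip().lower())


# Different spellings of the same term collapse to one member.
print(parse_level(" Raw ") is parse_level("RAW"))

# A term outside the vocabulary is rejected rather than stored as-is.
try:
    parse_level("unprocessed")
except ValueError as err:
    print(f"rejected: {err}")
```

Because every record uses the same member for the same concept, a search for raw assets does not have to guess at spellings such as "Raw", "raw data", or "unprocessed".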

Subject

To create the subject metadata we’ll pull some information from the Excel spreadsheet and pass it to a function that returns the validated Subject object.

Some of the required metadata, like the cage_id, wasn’t available to us. We’ll put "unknown" in the metadata for that field. Never invent metadata!
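One way to apply this rule consistently is a small helper that substitutes "unknown" only when the source cell is actually empty. The sketch below is hypothetical: the name value_or_unknown is ours, and whether "unknown" is acceptable for a given field depends on the schema. It treats None, blank strings, and NaN (how an empty spreadsheet cell typically appears after loading with pandas) as missing.

```python
import math


def value_or_unknown(value) -> str:
    """Return the value as a string, or "unknown" if the source cell was empty."""
    if value is None:
        return "unknown"
    # Empty numeric cells usually arrive as NaN, which is a float.
    if isinstance(value, float) and math.isnan(value):
        return "unknown"
    text = str(value).strip()
    return text if text else "unknown"


print(value_or_unknown("C12"))
print(value_or_unknown(float("nan")))
print(value_or_unknown("   "))
```

A real value always passes through unchanged, so the helper can never overwrite recorded metadata with a placeholder.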


def generate_subject(
    subject_id: str,
    sex: Sex,
    date_of_birth: date,
    genotype: str,
    maternal_id: str,
    maternal_genotype: str,
    paternal_id: str,
    paternal_genotype: str,
) -> Subject:
    """Create the subject object"""
    return Subject(
        subject_id=subject_id,
        subject_details=MouseSubject(
            species=Species.HOUSE_MOUSE,
            sex=sex,
            date_of_birth=date_of_birth,
            genotype=genotype,
            breeding_info=BreedingInfo(
                maternal_id=maternal_id,
                maternal_genotype=maternal_genotype,
                paternal_id=paternal_id,
                paternal_genotype=paternal_genotype,
                breeding_group="unknown",  # not in spreadsheet
            ),
            housing=Housing(
                home_cage_enrichment=[HomeCageEnrichment.RUNNING_WHEEL],  # all subjects had a running wheel
                cage_id="unknown",  # not in spreadsheet
            ),
            strain=Strain.C57BL_6J,
            source=Organization.OTHER,
        ),
    )

Procedures

Next, we’ll write a function that constructs the Procedures object describing the two surgeries that were performed: a brain injection at a target depth and, later (after data acquisition), a perfusion.


def generate_procedures(
        subject_id: str,
        protocol: str,
        virus_name: str,
        virus_titer: int,
        coords: List[float],
        injection_volume: float,
        brain_area: str,
        injection_date: date,
        perfusion_date: date,
        experimenter: Person,
        ethics_review_id: str,
) -> Procedures:
    """Create the procedures object"""

    # Create the first surgery (brain injection)

    # We stored the injection coordinates as a comma-delimited string: AP, ML, depth (from surface), rotation angle
    # Note that the depth in the spreadsheet is inverted: it should be positive downward, so we flip its sign below
    # We don't know which axis was rotated around, so we'll assume this is the sagittal angle (around the AP axis)
    coord = [
        Translation(
            translation=[float(coords[0]), float(coords[1]), 0, -float(coords[2])],
        ),
        Rotation(
            angles=[float(coords[3]), 0, 0],
        )
    ]

    brain_injection = BrainInjection(
        protocol_id=protocol,
        coordinate_system_name=CoordinateSystemLibrary.BREGMA_ARID.name,
        injection_materials=[
            ViralMaterial(
                name=virus_name,
                titer=virus_titer,
            )
        ],
        targeted_structure=getattr(CCFv3, brain_area.upper()),
        coordinates=[coord],  # Note: this is a list, because we could have multiple depths
        dynamics=[
            InjectionDynamics(
                volume=injection_volume,
                volume_unit=VolumeUnit.NL,
                profile=InjectionProfile.BOLUS,
            )
        ],
    )

    brain_injection_surgery = Surgery(
        start_date=injection_date,
        protocol_id=protocol,
        ethics_review_id=ethics_review_id,
        experimenters=[experimenter],
        procedures=[
            brain_injection,
        ],
    )

    # Create the second surgery (perfusion)
    perfusion_surgery = Surgery(
        start_date=perfusion_date,
        experimenters=[experimenter],
        ethics_review_id=ethics_review_id,
        protocol_id=protocol,
        procedures=[Perfusion(protocol_id=protocol, output_specimen_ids=["1"])],
    )

    # Return the full Procedures object
    return Procedures(
        subject_id=subject_id,
        coordinate_system=CoordinateSystemLibrary.BREGMA_ARID,
        subject_procedures=[
            brain_injection_surgery,
            perfusion_surgery,
        ],
    )

This is a good point to discuss coordinate systems. To know the position of an object or procedure across experiments we need to record position, rotation, and scale information in a standardized system. This step in metadata creation can be a bit intimidating, but know that we’ve created tools to help simplify it! The two things to think about are:

  • What was the origin that you used as a reference coordinate? For many animal experiments it’s probably bregma on the skull.

  • How did you go from the origin to your target coordinate? For many animal experiments you likely used a stereotax and should know the exact anterior-posterior, left-to-right (or medial-lateral), and superior-to-inferior position you went to, plus the depth you moved down along the injection axis. Make sure to also note any rotation you performed and which axis you rotated around.

For most mouse experiments like the one here, the coordinate system used had the origin at Bregma and the axes pointing anterior, right, and inferior (or ventral), plus a depth coordinate. To make your life easier you can import this coordinate system from the library so that you don’t have to worry about constructing it yourself.

from aind_data_schema.components.coordinates import CoordinateSystemLibrary

coordinate_system = CoordinateSystemLibrary.BREGMA_ARID
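As a worked example of the two questions above, the sketch below parses the injection_coord string from the procedures sheet into the four translation values and three rotation angles that generate_procedures() passes to Translation and Rotation. It makes the same assumptions as the tutorial code: the string is ordered AP, ML, depth, rotation angle; the depth was recorded with the opposite sign; the third (superior-inferior) axis is left at 0; and the angle is placed in the first rotation slot. The helper name parse_injection_coord is hypothetical, and the sketch uses plain lists in place of the schema classes.

```python
from typing import List, Tuple


def parse_injection_coord(text: str) -> Tuple[List[float], List[float]]:
    """Split "AP,ML,depth,angle" into translation and rotation lists."""
    ap, ml, depth, angle = (float(part) for part in text.split(","))
    # Axis order follows the anterior, right, inferior, depth convention.
    # The spreadsheet depth is negative, so flip it to be positive downward.
    translation = [ap, ml, 0.0, -depth]
    # Only one rotation was recorded; the other two angles are zero.
    rotation = [angle, 0.0, 0.0]
    return translation, rotation


translation, rotation = parse_injection_coord("03.8,-0.87,-3.3,10")
print(translation)
print(rotation)
```

Unpacking into exactly four names means a malformed cell with too few or too many values raises a ValueError immediately instead of producing a shifted coordinate.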

Generating metadata

Finally, we’re ready to generate all the metadata files. We’ll loop over the sessions listed in the Excel spreadsheet and use our functions to build the JSON files. The write_standard_file() function will take care of writing the files to disk.


# loop through all of the sessions
for _, row in sessions_df.iterrows():
    # Pull information from the session row
    subject_id = row["mouse_id"]
    start_time = row["start_time"].to_pydatetime()
    end_time = row["end_time"].to_pydatetime()

    # If there's no timezone information, add the pacific timezone
    pacific_tz = ZoneInfo("America/Los_Angeles")
    if start_time.tzinfo is None:
        start_time = start_time.replace(tzinfo=pacific_tz)
    if end_time.tzinfo is None:
        end_time = end_time.replace(tzinfo=pacific_tz)

    # Build the data_description
    data_description = generate_data_description(str(subject_id), end_time)

    # Get the mouse data for this session
    mouse_df_row = mice_df[mice_df["id"] == subject_id].iloc[0]  # Takes the first matching row
    sex = subject_sex_lookup[mouse_df_row["sex"]]  # Raises KeyError on an unexpected value
    genotype = mouse_df_row["genotype"]
    dob = mouse_df_row["dob"].to_pydatetime().date()
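    # Blank cells make pandas load these columns as floats; convert so str() gives "1", not "1.0"
    dam_id = int(mouse_df_row["dam_id"])
    sire_id = int(mouse_df_row["sire_id"])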

    # Get the full dam and sire information
    dam_row = mice_df[mice_df["id"] == dam_id].iloc[0]
    dam_genotype = dam_row["genotype"]
    sire_row = mice_df[mice_df["id"] == sire_id].iloc[0]
    sire_genotype = sire_row["genotype"]

    # Build the subject
    subject = generate_subject(
        subject_id=str(subject_id),
        sex=sex,
        date_of_birth=dob,
        genotype=genotype,
        maternal_id=str(dam_id),
        maternal_genotype=dam_genotype,
        paternal_id=str(sire_id),
        paternal_genotype=sire_genotype,
    )

    # Get the procedures information
    proc_row = procedures_df[procedures_df["mouse_id"] == subject_id].iloc[0]

    # First surgery
    injection_date = proc_row["injection_date"].to_pydatetime()
    protocol = str(proc_row["protocol"])
    brain_area = proc_row["brain_area"]
    virus_name = proc_row["virus_name"]
    virus_titer = proc_row["virus_titer"]
    injection_volume = proc_row["injection_volume"]
    # we stored the injection coordinates as a comma-delimited string: AP,ML,depth,rotation angle
    coords = proc_row.injection_coord.split(",")

    # Second surgery
    perfusion_date = proc_row["perfusion_date"].to_pydatetime()

    procedures = generate_procedures(
        subject_id=str(subject_id),
        protocol=protocol,
        virus_name=virus_name,
        virus_titer=virus_titer,
        coords=[float(coord) for coord in coords],
        injection_volume=injection_volume,
        brain_area=brain_area,
        injection_date=injection_date.date(),
        perfusion_date=perfusion_date.date(),
        experimenter=experimenter,
        ethics_review_id=ethics_review_id,
    )

    # we will store our json files in a directory named after the session
    os.makedirs(data_description.name, exist_ok=True)

    # Save the metadata files
    data_description.write_standard_file(output_directory=data_description.name)
    subject.write_standard_file(output_directory=data_description.name)
    procedures.write_standard_file(output_directory=data_description.name)

Instrument, Acquisition, and other metadata

The remaining metadata files needed for an experimental data asset (Instrument and Acquisition) follow the same pattern: extract the relevant information from a data source, transform it into the schema, and use the write_standard_file() function to construct the output JSON file that will be kept alongside your data asset.
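The extract, transform, write pattern can be sketched without the library at all. The example below uses a hypothetical ToyInstrument dataclass and the standard json module in place of the real Instrument schema and write_standard_file(). The field names, the source keys, and the output file name are made up for illustration and do not reflect the actual schema.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ToyInstrument:
    """Stand-in for a schema class (hypothetical fields)."""

    instrument_id: str
    location: str


def extract(source_row: dict) -> dict:
    """Pull only the relevant values out of the data source."""
    return {"instrument_id": source_row["rig"], "location": source_row["room"]}


def transform(fields: dict) -> ToyInstrument:
    """Map the extracted values onto the schema, refusing empty identifiers."""
    if not fields["instrument_id"]:
        raise ValueError("instrument_id must not be empty")
    return ToyInstrument(**fields)


def write(record: ToyInstrument, output_directory: str) -> str:
    """Write the record as JSON next to the data and return the file path."""
    os.makedirs(output_directory, exist_ok=True)
    path = os.path.join(output_directory, "toy_instrument.json")
    with open(path, "w") as f:
        json.dump(asdict(record), f, indent=2)
    return path


# Run the three steps against a temporary directory and read the result back.
with tempfile.TemporaryDirectory() as tmp:
    path = write(transform(extract({"rig": "rig-1", "room": "101"})), tmp)
    with open(path) as f:
        loaded = json.load(f)

print(loaded)
```

With the real library, the schema classes take the place of the transform and write steps: constructing the object validates it, and write_standard_file() serializes it, as in the loop earlier in this tutorial.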

During processing and analysis, you will also generate metadata files for Processing and QualityControl, and possibly a Model.