Generating metadata¶
Why¶
We are gathering this metadata so that our assets will be FAIR (findable, accessible, interoperable, and re-usable).
After completing this tutorial you will be able to:
Build the data description, subject, and procedures metadata for your asset using Python code and pydantic.
Describe how a simple set of metadata for a few assets gets converted into individual metadata records.
Navigate the documentation.
Identify metadata sources¶
In practice, key metadata is usually spread across many data sources. These could be spreadsheets, databases, TIFF file headers, or even file names.
In this example, let’s say that our basic subject and surgical procedure metadata are stored in an Excel workbook with three sheets: mice, sessions, and procedures.
Let’s say they look like this:
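mice:
| id | dam_id | sire_id | genotype | dob | sex |
|----|--------|---------|----------|-----|-----|
| 1 | | | Vip-IRES-Cre/wt | 9/22/2023 | F |
| 2 | | | Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP) | 9/15/2023 | M |
| 3 | 1 | 2 | Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | 12/1/2023 | F |
procedures:
| mouse_id | injection_date | protocol | brain_area | virus_name | virus_titer | injection_volume | injection_coord | perfusion_date |
|----------|----------------|----------|------------|------------|-------------|------------------|-----------------|----------------|
| 3 | 1/2/2024 7:00 | injection-perfusion-v1 | VISp | AAV2-Flex-ChrimsonR | 2300000000 | 200 | 03.8,-0.87,-3.3,10 | 1/31/2024 10:22 |
sessions:
| mouse_id | start_time | end_time |
|----------|------------|----------|
| 3 | 1/26/2024 15:00 | 1/26/2024 15:30 |
| 3 | 1/27/2024 15:00 | 1/27/2024 15:30 |
| 3 | 1/28/2024 15:00 | 1/28/2024 15:30 |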
In this example you can see that we recorded three sessions from one mouse, which has a viral injection and a perfusion procedure. All mice are C57BL/6J, were bred locally, and were housed with a running wheel in their cage. Download example_workflow.xlsx and example_workflow.py to follow along.
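To make the relationships between the sheets concrete, here is a minimal sketch that mirrors the mice and sessions tables as plain Python records and resolves the parents of mouse 3. The records are a stand-in for the workbook, not how the tutorial loads it, and the helper name `parents_of` is hypothetical.

```python
# Stand-in for the "mice" sheet; None marks an empty cell.
mice = [
    {"id": 1, "dam_id": None, "sire_id": None,
     "genotype": "Vip-IRES-Cre/wt", "sex": "F"},
    {"id": 2, "dam_id": None, "sire_id": None,
     "genotype": "Ai32(RCL-ChR2(H134R)_EYFP)/Ai32(RCL-ChR2(H134R)_EYFP)", "sex": "M"},
    {"id": 3, "dam_id": 1, "sire_id": 2,
     "genotype": "Vip-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt", "sex": "F"},
]

# Stand-in for the "sessions" sheet (only the mouse_id column is needed here).
sessions = [{"mouse_id": 3}, {"mouse_id": 3}, {"mouse_id": 3}]

# Index the mice by id so that dam_id and sire_id can be followed like foreign keys.
mice_by_id = {mouse["id"]: mouse for mouse in mice}


def parents_of(mouse_id: int) -> tuple:
    """Return (dam, sire) records, with None where a parent is not recorded."""
    mouse = mice_by_id[mouse_id]
    return mice_by_id.get(mouse["dam_id"]), mice_by_id.get(mouse["sire_id"])


dam, sire = parents_of(3)

# Count sessions per mouse to confirm which subjects actually need metadata.
session_counts = {}
for session in sessions:
    mouse_id = session["mouse_id"]
    session_counts[mouse_id] = session_counts.get(mouse_id, 0) + 1
```

Only mouse 3 has sessions, but its subject record still depends on the rows for mice 1 and 2, because the parental genotypes come from there.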
Setup Python environment¶
First, we’ll set up the Python environment and define some shared variables.
import os
from typing import List
import pandas as pd
from datetime import datetime, date
from zoneinfo import ZoneInfo
from aind_data_schema_models.modalities import Modality
from aind_data_schema_models.organizations import Organization
from aind_data_schema_models.species import Strain
from aind_data_schema_models.units import VolumeUnit
from aind_data_schema_models.data_name_patterns import DataLevel
from aind_data_schema_models.brain_atlas import CCFv3
from aind_data_schema.components.coordinates import Rotation, Translation
from aind_data_schema.components.identifiers import Person
from aind_data_schema.components.injection_procedures import InjectionDynamics, InjectionProfile, ViralMaterial
from aind_data_schema.components.subject_procedures import BrainInjection, Perfusion
from aind_data_schema.components.subjects import BreedingInfo, Housing, MouseSubject, Species, Sex, HomeCageEnrichment
from aind_data_schema.components.coordinates import CoordinateSystemLibrary
from aind_data_schema.core.data_description import DataDescription, Funding
from aind_data_schema.core.procedures import Procedures, Surgery
from aind_data_schema.core.subject import Subject
sessions_df = pd.read_excel("example_workflow.xlsx", sheet_name="sessions")
mice_df = pd.read_excel("example_workflow.xlsx", sheet_name="mice")
procedures_df = pd.read_excel("example_workflow.xlsx", sheet_name="procedures")
# everything was done by one person, so it's not in the spreadsheet
experimenter = Person(name="Some experimenter")
# in our spreadsheet, we stored sex as M/F instead of Male/Female
subject_sex_lookup = {
    "F": Sex.FEMALE,
    "M": Sex.MALE,
}
# everything is covered by the same IACUC protocol
ethics_review_id = "2109"
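A lookup table like the one above works only while the spreadsheet contains exactly the codes it expects. The following is a minimal sketch of a stricter variant that fails loudly on anything else. The `Sex` class here is a stand-in so the example runs without aind-data-schema installed, and `parse_sex` is a hypothetical helper, not part of the library.

```python
from enum import Enum


class Sex(str, Enum):
    """Stand-in for the schema's Sex enum so this sketch runs on its own."""

    FEMALE = "Female"
    MALE = "Male"


SEX_LOOKUP = {"F": Sex.FEMALE, "M": Sex.MALE}


def parse_sex(code: str) -> Sex:
    """Translate a spreadsheet code into the enum, rejecting unexpected values."""
    key = str(code).strip().upper()
    if key not in SEX_LOOKUP:
        # Raising is safer than guessing: a typo should never become metadata.
        raise ValueError(f"Unrecognized sex code: {code!r}")
    return SEX_LOOKUP[key]
```

Raising an error instead of falling back to a default keeps a data-entry mistake from being recorded as a plausible-looking value.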
How did we know which aind-data-schema classes to import?¶
Our general recommendation is to navigate the documentation starting from the core class you are working on. For the data description, go to the DataDescription page. The import path for any object can be read from the URL of its page: core classes live in the core subfolder, so here the import is from aind_data_schema.core.data_description import DataDescription.
One of the objects you’ll need to build is a Person. From the DataDescription page you can click through to the Person page (we recommend ctrl+click or command+click to open the link in a new tab). Again, read the URL to find the import path: in this case the class is in the identifiers file of the components subfolder, so the import is from aind_data_schema.components.identifiers import Person. After importing the class and populating it in your Python code, you can close the extra tab.
Let’s move on to build the actual data description now.
Data description¶
The data description schema contains basic administrative metadata: who collected the data, how it was funded, and so on. We’ll define a function to generate this, and re-use it for each of the three sessions.
def generate_data_description(subject_id: str, creation_time: datetime) -> DataDescription:
    """Create the DataDescription object

    our data always contains planar optical physiology and behavior videos
    """
    return DataDescription(
        modalities=[Modality.POPHYS, Modality.BEHAVIOR_VIDEOS],
        subject_id=subject_id,
        creation_time=creation_time,
        institution=Organization.AIND,
        funding_source=[Funding(funder=Organization.NIMH)],
        investigators=[experimenter],
        data_level=DataLevel.RAW,
        project_name="Example workflow",
    )
A few of the fields in the data description required us to use enumerated variables, like DataLevel. Controlled vocabularies like this one standardize the metadata and make it easier for people to search across assets from different experiments. We also use controlled vocabularies that are linked to external registries, such as Organization.
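To illustrate why a controlled vocabulary helps with search, here is a minimal sketch using a stand-in enum. The `DataLevel` class below is a hypothetical two-member imitation defined only for this example (the real one lives in aind_data_schema_models), and `coerce_level` is an illustrative helper.

```python
from enum import Enum


class DataLevel(str, Enum):
    """Hypothetical stand-in for the real DataLevel enum."""

    RAW = "raw"
    DERIVED = "derived"


def coerce_level(text: str) -> DataLevel:
    """Accept only exact vocabulary terms; anything else raises ValueError."""
    return DataLevel(text)


# Three ways someone might type the same idea into a free-text field.
free_text = ["raw", "Raw", "RAW data"]

accepted, rejected = [], []
for entry in free_text:
    try:
        accepted.append(coerce_level(entry))
    except ValueError:
        rejected.append(entry)
```

With free text, a search for "raw" would miss two of the three entries. With the enum, the variants are rejected when the metadata is created, so every stored record uses the same term.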
Subject¶
To create the subject metadata we’ll pull some information from the Excel spreadsheet and pass it to a function that returns the validated Subject object.
Some of the required metadata, like the cage_id, wasn’t available to us. We’ll put "unknown" in the metadata for any such field. Never invent metadata!
def generate_subject(
    subject_id: str,
    sex: Sex,
    date_of_birth: date,
    genotype: str,
    maternal_id: str,
    maternal_genotype: str,
    paternal_id: str,
    paternal_genotype: str,
) -> Subject:
    """Create the subject object"""
    return Subject(
        subject_id=subject_id,
        subject_details=MouseSubject(
            species=Species.HOUSE_MOUSE,
            sex=sex,
            date_of_birth=date_of_birth,
            genotype=genotype,
            breeding_info=BreedingInfo(
                maternal_id=maternal_id,
                maternal_genotype=maternal_genotype,
                paternal_id=paternal_id,
                paternal_genotype=paternal_genotype,
                breeding_group="unknown",  # not in spreadsheet
            ),
            housing=Housing(
                home_cage_enrichment=[HomeCageEnrichment.RUNNING_WHEEL],  # all subjects had a running wheel
                cage_id="unknown",  # not in spreadsheet
            ),
            strain=Strain.C57BL_6J,
            source=Organization.OTHER,
        ),
    )
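The rule about writing "unknown" rather than inventing a value can be applied consistently with a small helper. This is a minimal sketch, assuming empty spreadsheet cells arrive as `None`, as NaN (which is how pandas represents missing numeric cells), or as blank strings. The name `text_or_unknown` is hypothetical.

```python
import math
from typing import Any

UNKNOWN = "unknown"


def text_or_unknown(value: Any) -> str:
    """Return the value as a string, or "unknown" when the cell was empty."""
    if value is None:
        return UNKNOWN
    if isinstance(value, float) and math.isnan(value):
        return UNKNOWN
    text = str(value).strip()
    return text if text else UNKNOWN


cage_id = text_or_unknown(float("nan"))   # empty cell read from a spreadsheet
recorded = text_or_unknown("cage-12")     # a value that was actually recorded
```

Routing every optional field through one function makes the placeholder spelling uniform, which keeps missing values easy to find later.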
Procedures¶
Next, we’ll write a function that constructs the Procedures object describing the two surgeries that were performed: a brain injection at a target depth and, later (after data acquisition), a perfusion.
def generate_procedures(
    subject_id: str,
    protocol: str,
    virus_name: str,
    virus_titer: int,
    coords: List[float],
    injection_volume: float,
    brain_area: str,
    injection_date: date,
    perfusion_date: date,
    experimenter: Person,
    ethics_review_id: str,
) -> Procedures:
    """Create the procedures object"""
    # Create the first surgery (brain injection)
    # we stored the injection coordinates as a comma-delimited string: AP, ML, Depth (from surface), Rotation angle
    # The spreadsheet stores depth as a negative number; flip the sign so that depth is positive downward
    # We don't know which axis was rotated around, so we'll assume this is sagittal angle (around the AP axis)
    coord = [
        Translation(
            translation=[float(coords[0]), float(coords[1]), 0, -float(coords[2])],
        ),
        Rotation(
            angles=[float(coords[3]), 0, 0],
        ),
    ]
    brain_injection = BrainInjection(
        protocol_id=protocol,
        coordinate_system_name=CoordinateSystemLibrary.BREGMA_ARID.name,
        injection_materials=[
            ViralMaterial(
                name=virus_name,
                titer=virus_titer,
            )
        ],
        targeted_structure=getattr(CCFv3, brain_area.upper()),
        coordinates=[coord],  # Note: this is a list, because we could have multiple depths
        dynamics=[
            InjectionDynamics(
                volume=injection_volume,
                volume_unit=VolumeUnit.NL,
                profile=InjectionProfile.BOLUS,
            )
        ],
    )
    brain_injection_surgery = Surgery(
        start_date=injection_date,
        protocol_id=protocol,
        ethics_review_id=ethics_review_id,
        experimenters=[experimenter],
        procedures=[
            brain_injection,
        ],
    )
    # Create the second surgery (perfusion)
    perfusion_surgery = Surgery(
        start_date=perfusion_date,
        experimenters=[experimenter],
        ethics_review_id=ethics_review_id,
        protocol_id=protocol,
        procedures=[Perfusion(protocol_id=protocol, output_specimen_ids=["1"])],
    )
    # Return the full Procedures object
    return Procedures(
        subject_id=subject_id,
        coordinate_system=CoordinateSystemLibrary.BREGMA_ARID,
        subject_procedures=[
            brain_injection_surgery,
            perfusion_surgery,
        ],
    )
This is also the point at which we need to discuss coordinate systems. To compare the position of an object or procedure across experiments, we need to record position, rotation, and scale information in a standardized system. This step in metadata creation can be a bit intimidating, but we’ve created tools to help simplify it! The two things to think about are:
What was the origin that you used as a reference coordinate? For many animal experiments it’s probably bregma on the skull.
How did you go from the origin to your target coordinate? For many animal experiments you likely used a stereotax and should know the exact anterior-posterior, left-to-right (or medial-lateral), and superior-to-inferior position you went to, plus the depth you moved down along the injection axis. Make sure to also note any rotation you performed and which axis you rotated around.
For most mouse experiments like the one here, the coordinate system used had the origin at Bregma and the axes pointing anterior, right, and inferior (or ventral), plus a depth coordinate. To make your life easier you can import this coordinate system from the library so that you don’t have to worry about constructing it yourself.
from aind_data_schema.components.coordinates import CoordinateSystemLibrary
coordinate_system = CoordinateSystemLibrary.BREGMA_ARID
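As a worked example of going from the origin to the target, the sketch below parses the injection_coord string from the procedures sheet into plain lists ordered along the anterior, right, inferior, and depth axes. It is a minimal illustration using only built-in types, not the schema's Translation and Rotation classes. The helper name `parse_injection_coord` is hypothetical, and placing the angle in the first rotation slot is the same assumption the tutorial function makes.

```python
from typing import List, Tuple


def parse_injection_coord(raw: str) -> Tuple[List[float], List[float]]:
    """Split "AP,ML,depth,angle" into translation and rotation lists.

    The spreadsheet stores depth as a negative number, so its sign is
    flipped to make depth positive downward. The inferior axis is left
    at 0 because the target is reached along the depth axis instead.
    """
    parts = [float(piece) for piece in raw.split(",")]
    if len(parts) != 4:
        raise ValueError(f"Expected 4 comma-separated values, got {len(parts)}: {raw!r}")
    ap, ml, depth, angle = parts
    translation = [ap, ml, 0.0, -depth]  # anterior, right, inferior, depth
    rotation = [angle, 0.0, 0.0]         # assumed: angle belongs to the first axis
    return translation, rotation


translation, rotation = parse_injection_coord("03.8,-0.87,-3.3,10")
# translation is [3.8, -0.87, 0.0, 3.3] and rotation is [10.0, 0.0, 0.0]
```

Checking the number of fields up front catches a malformed cell before it turns into a coordinate that merely looks valid.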
Generating metadata¶
Finally, we’re ready to generate all the metadata files. We’ll loop over the sessions listed in the Excel spreadsheet and use our functions to build the JSON files. The write_standard_file() function will take care of writing the files to disk.
# loop through all of the sessions
for _, row in sessions_df.iterrows():
    # Pull information from the session row
    subject_id = row["mouse_id"]
    start_time = row["start_time"].to_pydatetime()
    end_time = row["end_time"].to_pydatetime()
    # If there's no timezone information, add the Pacific timezone
    pacific_tz = ZoneInfo("America/Los_Angeles")
    if start_time.tzinfo is None:
        start_time = start_time.replace(tzinfo=pacific_tz)
    if end_time.tzinfo is None:
        end_time = end_time.replace(tzinfo=pacific_tz)
    # Build the data_description
    data_description = generate_data_description(str(subject_id), end_time)
    # Get the mouse data for this session
    mouse_df_row = mice_df[mice_df["id"] == subject_id].iloc[0]  # Takes the first matching row
    # Translate the M/F code using the lookup table defined during setup
    sex = subject_sex_lookup[mouse_df_row["sex"]]
    genotype = mouse_df_row["genotype"]
    dob = mouse_df_row["dob"].to_pydatetime().date()
    # pandas reads these columns as floats because some cells are empty;
    # cast back to int so the IDs are written as "1" rather than "1.0"
    dam_id = int(mouse_df_row["dam_id"])
    sire_id = int(mouse_df_row["sire_id"])
    # Get the full dam and sire information
    dam_row = mice_df[mice_df["id"] == dam_id].iloc[0]
    dam_genotype = dam_row["genotype"]
    sire_row = mice_df[mice_df["id"] == sire_id].iloc[0]
    sire_genotype = sire_row["genotype"]
    # Build the subject
    subject = generate_subject(
        subject_id=str(subject_id),
        sex=sex,
        date_of_birth=dob,
        genotype=genotype,
        maternal_id=str(dam_id),
        maternal_genotype=dam_genotype,
        paternal_id=str(sire_id),
        paternal_genotype=sire_genotype,
    )
    # Get the procedures information
    proc_row = procedures_df[procedures_df["mouse_id"] == subject_id].iloc[0]
    # First surgery
    injection_date = proc_row["injection_date"].to_pydatetime()
    protocol = str(proc_row["protocol"])
    brain_area = proc_row["brain_area"]
    virus_name = proc_row["virus_name"]
    virus_titer = proc_row["virus_titer"]
    injection_volume = proc_row["injection_volume"]
    # we stored the injection coordinates as a comma-delimited string: AP, ML, Depth (from surface), Rotation angle
    coords = proc_row["injection_coord"].split(",")
    # Second surgery
    perfusion_date = proc_row["perfusion_date"].to_pydatetime()
    procedures = generate_procedures(
        subject_id=str(subject_id),
        protocol=protocol,
        virus_name=virus_name,
        virus_titer=virus_titer,
        coords=[float(coord) for coord in coords],
        injection_volume=injection_volume,
        brain_area=brain_area,
        injection_date=injection_date.date(),
        perfusion_date=perfusion_date.date(),
        experimenter=experimenter,
        ethics_review_id=ethics_review_id,
    )
    # we will store our json files in a directory named after the session
    os.makedirs(data_description.name, exist_ok=True)
    # Save the metadata files
    data_description.write_standard_file(output_directory=data_description.name)
    subject.write_standard_file(output_directory=data_description.name)
    procedures.write_standard_file(output_directory=data_description.name)
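The timezone step in the loop is worth isolating, because spreadsheet timestamps carry no timezone of their own. The sketch below shows the same "attach a timezone only if one is missing" logic as a reusable function. To stay dependency-free it uses a fixed UTC-8 offset as a stand-in; real code should keep using ZoneInfo("America/Los_Angeles") as in the loop, because a fixed offset does not follow daylight saving time. The name `ensure_timezone` is hypothetical.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def ensure_timezone(value: datetime, default_tz: tzinfo) -> datetime:
    """Attach default_tz to a naive datetime; leave aware datetimes untouched."""
    if value.tzinfo is None:
        # replace() labels the existing wall-clock time; it does not shift it.
        return value.replace(tzinfo=default_tz)
    return value


# Stand-in for Pacific Standard Time (valid for the January dates in this example).
pst = timezone(timedelta(hours=-8), "PST")

naive_start = datetime(2024, 1, 26, 15, 0)        # as read from the sessions sheet
aware_start = ensure_timezone(naive_start, pst)   # 2024-01-26T15:00:00-08:00
```

Using replace() is appropriate here because the recorded clock time is already local; astimezone() would instead convert from whatever timezone the machine running the script happens to be in.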
Instrument, Acquisition, and other metadata¶
The remaining metadata files needed for an experimental data asset (Instrument and Acquisition) follow the same pattern: extract the relevant information from a data source, transform it into the schema, and use the write_standard_file() function to construct the output JSON file that will be kept alongside your data asset.
During processing and analysis, you will also generate metadata files for Processing and QualityControl, and possibly a Model.
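The extract, transform, and write pattern described above can be sketched without any schema classes. Everything in this example is a stand-in: `SessionRecord` is a hypothetical dataclass rather than a real schema object, and `write_json` only imitates the role that write_standard_file() plays in the tutorial.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class SessionRecord:
    """Hypothetical stand-in for a schema object."""

    subject_id: str
    start_time: str
    end_time: str


def extract(row: dict) -> dict:
    """Extract: pick the fields we need from one raw source row."""
    return {key: row[key] for key in ("mouse_id", "start_time", "end_time")}


def transform(fields: dict) -> SessionRecord:
    """Transform: rename and convert fields to match the target structure."""
    return SessionRecord(
        subject_id=str(fields["mouse_id"]),
        start_time=fields["start_time"],
        end_time=fields["end_time"],
    )


def write_json(record: SessionRecord, output_directory: Path) -> Path:
    """Write: serialize the record into the directory that holds the data."""
    output_directory.mkdir(parents=True, exist_ok=True)
    path = output_directory / "session_record.json"
    path.write_text(json.dumps(asdict(record), indent=2))
    return path


row = {
    "mouse_id": 3,
    "start_time": "2024-01-26T15:00:00-08:00",
    "end_time": "2024-01-26T15:30:00-08:00",
    "notes": "not part of the target structure",
}

with tempfile.TemporaryDirectory() as tmp:
    written = write_json(transform(extract(row)), Path(tmp) / "example_session")
    loaded = json.loads(written.read_text())
```

Keeping the three steps in separate functions means a new data source only requires a new extract step, while the transform and write steps stay the same.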