Data description


The data_description.json file tracks administrative information about a data asset, including affiliated researchers/organizations, projects, data modalities, dates of collection, and more.

Uniqueness

Every data asset is uniquely identified by its DataDescription.name field, which combines the subject_id and the acquisition session_end_time. To group data assets together, use DataDescription.tags (List[str]). Tags should be shared across the assets within an experiment. Do not repeat information in tags that already exists elsewhere in the metadata; for example, modalities should never be included in tags.
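As a rough illustration of both ideas, the sketch below composes a name from a subject ID and a timezone-aware timestamp, then groups assets by shared tag. It uses only the standard library. The `build_name` and `group_by_tag` helpers and the `<subject_id>_<YYYY-MM-DD>_<HH-MM-SS>` layout are assumptions made for illustration, not the schema package's API; in practice the name can be left blank so that it is generated for you.

```python
from collections import defaultdict
from datetime import datetime, timezone


def build_name(subject_id: str, timestamp: datetime) -> str:
    """Hypothetical helper: join a subject ID and a timestamp into one name."""
    if timestamp.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Assumed layout: <subject_id>_<YYYY-MM-DD>_<HH-MM-SS>
    return f"{subject_id}_{timestamp.strftime('%Y-%m-%d_%H-%M-%S')}"


def group_by_tag(assets: list[dict]) -> dict[str, list[str]]:
    """Hypothetical helper: map each tag to the names of the assets carrying it."""
    groups = defaultdict(list)
    for asset in assets:
        for tag in asset.get("tags") or []:
            groups[tag].append(asset["name"])
    return dict(groups)


# Two sessions from the same subject, sharing one experiment-level tag
t1 = datetime(2022, 2, 21, 16, 30, 1, tzinfo=timezone.utc)
t2 = datetime(2022, 2, 22, 9, 15, 0, tzinfo=timezone.utc)
assets = [
    {"name": build_name("123456", t1), "tags": ["Pilot data"]},
    {"name": build_name("123456", t2), "tags": ["Pilot data"]},
]
groups = group_by_tag(assets)
print(groups)
```

Because the timestamp differs between sessions, the two names stay distinct even though the subject and the tag are the same.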

Example

"""example data description"""

import argparse
from datetime import datetime, timezone

from aind_data_schema_models.modalities import Modality
from aind_data_schema_models.organizations import Organization
from aind_data_schema_models.data_name_patterns import DataLevel

from aind_data_schema.core.data_description import Funding, DataDescription
from aind_data_schema.components.identifiers import Person

d = DataDescription(
    modalities=[Modality.ECEPHYS, Modality.BEHAVIOR_VIDEOS],
    subject_id="123456",
    creation_time=datetime(2022, 2, 21, 16, 30, 1, tzinfo=timezone.utc),
    institution=Organization.AIND,
    investigators=[Person(name="Daniel Birman", registry_identifier="0000-0003-3748-6289")],
    funding_source=[Funding(funder=Organization.AI)],
    project_name="Example project",
    data_level=DataLevel.RAW,
    tags=["Pilot data"],
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", default=None, help="Output directory for generated JSON file")
    args = parser.parse_args()

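    # Round-trip through JSON to confirm the serialized form validates against the schema
    serialized = d.model_dump_json()
    deserialized = DataDescription.model_validate_json(serialized)
    # Write the data_description.json file to the requested output directory
    deserialized.write_standard_file(output_directory=args.output_dir)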
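Once the file is written, downstream tools can read it without importing the schema package. The sketch below is standard-library only and writes a small stand-in file first so that it is self-contained. The stand-in holds only a few of the DataDescription fields, and its exact serialized layout (including the `Z` suffix on the timestamp) is an assumption; a real file will contain more fields.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

# Stand-in for a generated file; a real one carries many more fields.
stand_in = {
    "subject_id": "123456",
    "creation_time": "2022-02-21T16:30:01Z",
    "project_name": "Example project",
    "tags": ["Pilot data"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_description.json"
    path.write_text(json.dumps(stand_in, indent=2))
    loaded = json.loads(path.read_text())

# Older Python versions do not parse a trailing "Z", so normalize it first.
created = datetime.fromisoformat(loaded["creation_time"].replace("Z", "+00:00"))
print(loaded["subject_id"], created.isoformat(), loaded["tags"])
```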

Core file

DataDescription

Description of a logical collection of data files

Field	Type	Title (Description)
`license`	`License`	License
`subject_id`	`Optional[str]`	Subject ID (Unique identifier for the subject of data acquisition)
`creation_time`	`datetime (timezone-aware)`	Creation Time (Time that data files were created, used to uniquely identify the data)
`tags`	`Optional[List[str]]`	Tags (Descriptive strings to help categorize and search for data)
`name`	`Optional[str]`	Data asset name (When left blank, a name will be generated based on subject_id and creation_time. Conventionally also used as the name of the data folder.)
`institution`	`Organization`	Institution (An established society, corporation, foundation or other organization that collected this data)
`funding_source`	`List[Funding]`	Funding source (Funding source. If internal funding, select ‘Allen Institute’)
`data_level`	`DataLevel`	Data Level (Level of processing that data has undergone)
`group`	`Optional[Group]`	Group (A short name for the group of individuals that collected this data)
`investigators`	`List[Person]`	Investigators (Full name(s) of key investigators (e.g. PI, lead scientist, contact person))
`project_name`	`str`	Project Name (A name for a set of coordinated activities intended to achieve one or more objectives.)
`restrictions`	`Optional[str]`	Restrictions (Detail any restrictions on publishing or sharing these data)
`modalities`	`List[Modality]`	Modalities (A short name for the specific manner, characteristic, pattern of application, or the employment of any technology or formal procedure to generate data for a study)
`source_data`	`Optional[List[str]]`	Source data (For derived assets, list the source data asset names used to create this data)
`data_summary`	`Optional[str]`	Data summary (Semantic summary of experimental goal)
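Because `creation_time` must be timezone-aware, a naive `datetime` is worth catching before constructing the model. A minimal standard-library check might look like the following; `is_timezone_aware` is a hypothetical helper, not part of the schema package.

```python
from datetime import datetime, timezone


def is_timezone_aware(value: datetime) -> bool:
    """Return True when the datetime carries a usable UTC offset."""
    return value.tzinfo is not None and value.utcoffset() is not None


naive = datetime(2022, 2, 21, 16, 30, 1)
aware = datetime(2022, 2, 21, 16, 30, 1, tzinfo=timezone.utc)

print(is_timezone_aware(naive))  # False
print(is_timezone_aware(aware))  # True

# If the system's local zone is the intended one, astimezone() attaches it to a naive value.
localized = naive.astimezone()
print(is_timezone_aware(localized))  # True
```

Passing an explicit `tzinfo`, as the example on this page does with `timezone.utc`, avoids depending on the local zone of the machine that runs the script.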

Model definitions

Funding

Description of funding sources

Field	Type	Title (Description)
`funder`	`Organization`	Funder
`grant_number`	`Optional[str]`	Grant number
`fundee`	`Optional[List[Person]]`	Fundee (Person(s) funded by this mechanism)