Processing

The processing.json file captures the data processing and analysis steps that have been carried out, mostly for derived data assets. It tracks what code was used for each step, when it was run, what the inputs and outputs were, and what parameters were set. This includes steps such as spike sorting, image alignment, and cell segmentation, as well as manual annotation, quality control, and data analysis.

Append to the processing file at each subsequent stage of processing or analysis.
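
As a rough illustration of appending a stage, the sketch below works on a plain dictionary shaped like the Processing fields documented on this page (`data_processes`, `dependency_graph`). The helper `append_step` and the step strings are hypothetical and are not part of aind-data-schema; in practice the record would be built and validated with the `Processing` model.

```python
from copy import deepcopy


def append_step(record: dict, step: dict, inputs: list[str]) -> dict:
    """Return a copy of `record` with `step` appended (hypothetical helper)."""
    # Fall back to the process type when no explicit name is given.
    name = step.get("name") or step["process_type"]
    graph = record.get("dependency_graph", {})
    if name in graph:
        raise ValueError(f"step name already used: {name!r}")
    missing = [i for i in inputs if i not in graph]
    if missing:
        raise ValueError(f"unknown input steps: {missing}")
    updated = deepcopy(record)  # leave the earlier record untouched
    updated.setdefault("data_processes", []).append({**step, "name": name})
    updated.setdefault("dependency_graph", {})[name] = list(inputs)
    return updated


# Illustrative strings only; real values come from the schema's enums.
existing = {
    "data_processes": [
        {"name": "Image tile fusing", "process_type": "Image tile fusing"}
    ],
    "dependency_graph": {"Image tile fusing": []},
}
later = append_step(
    existing,
    {"process_type": "Analysis", "stage": "Analysis"},
    inputs=["Image tile fusing"],
)
print(later["dependency_graph"])
```

Returning a copy rather than mutating in place is a design choice of this sketch: the earlier record stays available for comparison.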

Example

"""example processing"""

from datetime import datetime, timezone

from aind_data_schema.components.identifiers import Person, Code
from aind_data_schema.core.processing import (
    DataProcess,
    Processing,
    ProcessName,
    ProcessStage,
    ResourceTimestamped,
    ResourceUsage,
)
from aind_data_schema_models.units import MemoryUnit
from aind_data_schema_models.system_architecture import OperatingSystem, CPUArchitecture

# If a timezone isn't specified, the timezone of the computer running this
# script will be used as default
t = datetime(2022, 11, 22, 8, 43, 00, tzinfo=timezone.utc)


cpu_usage_list = [
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=75.5),
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=80.0),
]

gpu_usage_list = [
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=60.0),
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=65.5),
]

ram_usage_list = [
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=70.0),
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=72.5),
]

# Defined for illustration only; this list is not passed to ResourceUsage below
file_io_usage_list = [
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=5.5),
    ResourceTimestamped(timestamp=datetime(2024, 9, 13, tzinfo=timezone.utc), usage=6.0),
]

example_code = Code(
    url="https://github.com/abcd",
    version="0.1",
    parameters={"size": 7},
)

p = Processing.create_with_sequential_process_graph(
    pipelines=[
        Code(
            name="Imaging processing pipeline",
            url="https://url/for/pipeline",
            version="0.1.1",
        ),
    ],
    data_processes=[
        DataProcess(
            process_type=ProcessName.IMAGE_TILE_FUSING,
            experimenters=[Person(name="Dr. Dan")],
            stage=ProcessStage.PROCESSING,
            start_date_time=t,
            end_date_time=t,
            output_path="/path/to/outputs",
            pipeline_name="Imaging processing pipeline",
            code=example_code.model_copy(
                update=dict(
                    parameters={"size": 7},
                )
            ),
            resources=ResourceUsage(
                os=OperatingSystem.UBUNTU_20_04,
                architecture=CPUArchitecture.X86_64,
                cpu="Intel Core i7",
                cpu_cores=8,
                gpu="NVIDIA GeForce RTX 3080",
                system_memory=32.0,
                system_memory_unit=MemoryUnit.GB,
                ram=16.0,
                ram_unit=MemoryUnit.GB,
                cpu_usage=cpu_usage_list,
                gpu_usage=gpu_usage_list,
                ram_usage=ram_usage_list,
            ),
        ),
        DataProcess(
            process_type=ProcessName.FILE_FORMAT_CONVERSION,
            pipeline_name="Imaging processing pipeline",
            experimenters=[Person(name="Dr. Dan")],
            stage=ProcessStage.PROCESSING,
            start_date_time=t,
            end_date_time=t,
            output_path="/path/to/outputs",
            code=example_code.model_copy(
                update=dict(
                    parameters={"u": 7, "z": True},
                )
            ),
        ),
        DataProcess(
            process_type=ProcessName.IMAGE_DESTRIPING,
            pipeline_name="Imaging processing pipeline",
            experimenters=[Person(name="Dr. Dan")],
            stage=ProcessStage.PROCESSING,
            start_date_time=t,
            end_date_time=t,
            output_path="/path/to/output",
            code=example_code.model_copy(
                update=dict(
                    parameters={"a": 2, "b": -2},
                )
            ),
        ),
        DataProcess(
            stage=ProcessStage.ANALYSIS,
            experimenters=[Person(name="Some Analyzer")],
            process_type=ProcessName.ANALYSIS,
            start_date_time=t,
            end_date_time=t,
            output_path="/path/to/outputs",
            code=example_code.model_copy(
                update=dict(
                    parameters={"size": 7},
                )
            ),
        ),
        DataProcess(
            name="Analysis 2",
            stage=ProcessStage.ANALYSIS,
            experimenters=[Person(name="Some Analyzer")],
            process_type=ProcessName.ANALYSIS,
            start_date_time=t,
            end_date_time=t,
            output_path="/path/to/outputs",
            code=example_code.model_copy(
                update=dict(
                    parameters={"u": 7, "z": True},
                )
            ),
        ),
    ],
)

if __name__ == "__main__":
    serialized = p.model_dump_json()
    deserialized = Processing.model_validate_json(serialized)
    p.write_standard_file()
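
Once a record has been serialized, it can be inspected with the standard library alone. The sketch below computes how long each step ran from its `start_date_time` and `end_date_time`. The `SAMPLE` text is a hand-written stand-in with hypothetical step names, not actual output of the example above, and `step_durations` is a hypothetical helper.

```python
import json
from datetime import datetime

# Hand-written stand-in for a serialized record (not real library output).
SAMPLE = """
{
  "data_processes": [
    {"name": "Step A", "process_type": "Step A",
     "start_date_time": "2022-11-22T08:43:00+00:00",
     "end_date_time": "2022-11-22T09:13:00+00:00"},
    {"name": "Step B", "process_type": "Step B",
     "start_date_time": "2022-11-22T09:13:00Z",
     "end_date_time": "2022-11-22T09:14:30Z"}
  ]
}
"""


def parse_timestamp(text: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat only accepts 'Z' from Python 3.11 onward.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def step_durations(json_text: str) -> dict[str, float]:
    """Map each step name to its run time in seconds."""
    record = json.loads(json_text)
    durations = {}
    for step in record["data_processes"]:
        start = parse_timestamp(step["start_date_time"])
        end = parse_timestamp(step["end_date_time"])
        durations[step["name"]] = (end - start).total_seconds()
    return durations


print(step_durations(SAMPLE))
```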

Core file

Processing

Description of all processes run on data

Field	Type	Description
`data_processes`	List[DataProcess]
`pipelines`	Optional[List[Code]]	For processing done with pipelines, list the repositories here. Pipelines must use the name field and be referenced in the pipeline_name field of a DataProcess.
`notes`	`Optional[str]`
`dependency_graph`	`Dict[str, List[str]]`	Directed graph of processing step dependencies. Each key is a process name, and the value is a list of process names that are inputs to that process.
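
To make the shape of `dependency_graph` concrete, here is a minimal sketch that builds a strictly sequential graph and derives a valid execution order from it. It assumes "sequential" means each step takes the previous step as its only input. The helper names are hypothetical and this is not the library's implementation of `create_with_sequential_process_graph`.

```python
from graphlib import CycleError, TopologicalSorter


def sequential_graph(names: list[str]) -> dict[str, list[str]]:
    """Map each step name to its inputs, chaining steps in list order."""
    graph: dict[str, list[str]] = {}
    previous = None
    for name in names:
        # The first step has no inputs; every later step depends on the one before.
        graph[name] = [] if previous is None else [previous]
        previous = name
    return graph


def execution_order(graph: dict[str, list[str]]) -> list[str]:
    """Return the steps so that every step appears after its inputs.

    TopologicalSorter expects node -> predecessors, which matches the
    documented layout (process name -> names of its input processes).
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(graph).static_order())


graph = sequential_graph(["fuse", "convert", "destripe"])
print(graph)
print(execution_order(graph))
```

Because the values list inputs rather than outputs, the dictionary can be passed to `TopologicalSorter` unchanged.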

Model definitions

DataProcess

Description of a single processing step

Field	Type	Description
`process_type`	ProcessName
`name`	`str`	Unique name of the processing step. If not provided, the type will be used as the name.
`stage`	ProcessStage
`code`	Code	Code used for processing
`experimenters`	List[Person]	People responsible for processing
`pipeline_name`	`Optional[str]`	Pipeline names must exist in Processing.pipelines
`start_date_time`	`datetime (timezone-aware)`
`end_date_time`	`datetime (timezone-aware)`
`output_path`	`Optional[AssetPath]`	Path to processing outputs, if stored.
`output_parameters`	`dict`	Output parameters
`notes`	`Optional[str]`
`resources`	Optional[ResourceUsage]
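
Two of the constraints in this table lend themselves to a quick check: step names should be unique (falling back to the process type when no name is given), and every `pipeline_name` should match a named entry in `Processing.pipelines`. The sketch below runs both checks on a plain dictionary. `check_processing` is a hypothetical helper and the strings are illustrative; the schema's own validators may behave differently.

```python
def check_processing(record: dict) -> list[str]:
    """Return a list of human-readable problems found in `record`."""
    problems = []
    pipeline_names = {p.get("name") for p in record.get("pipelines") or []}
    seen = set()
    for step in record.get("data_processes", []):
        # An unnamed step takes its process type as its name.
        name = step.get("name") or step["process_type"]
        if name in seen:
            problems.append(f"duplicate step name: {name}")
        seen.add(name)
        pipeline = step.get("pipeline_name")
        if pipeline is not None and pipeline not in pipeline_names:
            problems.append(f"{name}: unknown pipeline {pipeline!r}")
    return problems


record = {
    "pipelines": [{"name": "Imaging processing pipeline"}],
    "data_processes": [
        {"process_type": "Image destriping",
         "pipeline_name": "Imaging processing pipeline"},
        {"process_type": "Analysis"},
        {"process_type": "Analysis"},  # same default name as the step above
        {"process_type": "Analysis", "name": "Analysis 2",
         "pipeline_name": "Other pipeline"},
    ],
}
for problem in check_processing(record):
    print(problem)
```

This mirrors why the second analysis step in the example sets `name="Analysis 2"`: without it, both analysis steps would default to the same name.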

ProcessStage

Stages of processing

Name	Value
`PROCESSING`	`Processing`
`ANALYSIS`	`Analysis`

ResourceTimestamped

Description of resource usage at a moment in time

Field	Type	Description
`timestamp`	`datetime (timezone-aware)`
`usage`	`float`
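
A series of timestamped samples is typically reduced to a few summary numbers. The sketch below uses a hypothetical `Sample` dataclass as a stand-in for `ResourceTimestamped` (same two fields, with an explicit timezone-awareness check) and a hypothetical `summarize` helper. It assumes a non-empty list of samples.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for ResourceTimestamped."""

    timestamp: datetime
    usage: float

    def __post_init__(self):
        # Reject naive datetimes: the documented field is timezone-aware.
        if self.timestamp.utcoffset() is None:
            raise ValueError("timestamp must be timezone-aware")


def summarize(samples: list[Sample]) -> dict[str, float]:
    """Return mean usage, peak usage, and the time span covered in seconds."""
    ordered = sorted(samples, key=lambda s: s.timestamp)
    values = [s.usage for s in ordered]
    span = ordered[-1].timestamp - ordered[0].timestamp
    return {
        "mean": mean(values),
        "peak": max(values),
        "span_seconds": span.total_seconds(),
    }


start = datetime(2024, 9, 13, tzinfo=timezone.utc)
cpu_samples = [
    Sample(timestamp=start, usage=75.5),
    Sample(timestamp=start + timedelta(minutes=5), usage=80.0),
]
print(summarize(cpu_samples))
```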

ResourceUsage

Description of resources used by a process

Field	Type	Description
`os`	`str`
`architecture`	`str`
`cpu`	`Optional[str]`
`cpu_cores`	`Optional[int]`
`gpu`	`Optional[str]`
`system_memory`	`Optional[float]`
`system_memory_unit`	Optional[MemoryUnit]
`ram`	`Optional[float]`
`ram_unit`	Optional[MemoryUnit]
`cpu_usage`	Optional[List[ResourceTimestamped]]
`gpu_usage`	Optional[List[ResourceTimestamped]]
`ram_usage`	Optional[List[ResourceTimestamped]]
`usage_unit`	`str`