Data Configuration

Before starting OntoAnno, open `configs/demo.yaml` and fill in the fields for your dataset. This file is the minimal template for a typical first run.

If you want optional features such as reference-label comparison, PDF evidence, or precomputed marker genes, use configs/demo_optional.yaml as the example template.

Required Fields

For a first run, you usually only need to edit these fields:

project:
  name: MyProject
  work_dir: /work/MyProject

inputs:
  seurat_rds: /data/my_project/my_dataset.rds

annotation:
  species: human
  tissue_name: human pancreatic tumor
  parent_res:
    - 0.1
    - 0.3

Field	What it controls	Example
`project.name`	A short name for this OntoAnno run. Use letters, numbers, or underscores.	`PDAC_sn`
`project.work_dir`	Where OntoAnno saves memory, intermediate files, reviewed labels, and reports.	`/work/PDAC_sn`
`inputs.seurat_rds`	The Seurat `.rds` file to annotate.	`/data/pdac/pdac_sn.rds`
`annotation.species`	The species used for ontology and marker evidence lookup.	`human` or `mouse`
`annotation.tissue_name`	A biological description of the dataset tissue or disease context.	`human pancreatic tumor` or `mouse brain cortex`
`annotation.parent_res`	Clustering resolutions OntoAnno will test for parent annotation.	`0.1`, `0.3`, `0.5`

Optional Fields

These fields are shown in configs/demo_optional.yaml. Change them only when you want to use the specific feature or intentionally change pipeline behavior.

Optional Input Sources

Field	What it controls	Example / options
`inputs.reference_labels_csv`	Optional labels from another method. These labels do not change OntoAnno's annotation; they are read only when `evaluation.enabled` is `true`.	`/data/project/labels.csv` or `null`. The CSV should include a cell ID/barcode column and a label column such as `celltype`.
`inputs.pdf_dir`	Folder of literature PDFs for marker evidence extraction. Empty or `null` skips PDF extraction.	`/data/project/pdfs` or `null`. Put `.pdf` files directly in this folder.
`inputs.marker_genes_dir`	Folder of existing cluster marker files, used to skip marker detection and start directly from annotation. The Seurat object must already contain matching `cluster_res.<resolution>` metadata columns.	`/data/project/marker_genes` or `null`
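
As a rough sketch, the optional inputs could sit next to `inputs.seurat_rds` as shown here. The nesting is inferred from the dotted field names and the required-fields template, and the paths are the placeholder examples from the table, not real locations.

```yaml
inputs:
  seurat_rds: /data/my_project/my_dataset.rds
  # Labels from another method; read only when evaluation.enabled is true.
  reference_labels_csv: /data/project/labels.csv
  # Folder with .pdf files placed directly inside; null skips PDF extraction.
  pdf_dir: /data/project/pdfs
  # null here means marker detection is not skipped.
  marker_genes_dir: null
```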

Annotation Behavior

Field	What it controls	Example / options
`annotation.preprocess`	Whether OntoAnno should run Seurat normalization, variable feature selection, scaling, PCA, UMAP, and clustering before annotation. Set `false` only for an already processed object with matching cluster metadata.	`true` or `false`
`annotation.sub_res`	Resolutions used only when you ask OntoAnno to subcluster a parent cell type.	`0.1`, `0.2`
`annotation.min_cell_count`	Minimum number of cells required before running subclustering for a parent cell type.	`3000`
`annotation.n_runs_parent`	Number of repeated LLM calls for parent annotation. Larger values cost more and take longer but can stabilize labels.	`3` for testing, `10` for final runs
`annotation.n_runs_sub`	Number of repeated LLM calls for subcluster annotation. Ignored if no subclustering is requested.	`3` for testing, `10` for final runs
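
A minimal sketch of these settings for a quick test run, assuming they belong under the same `annotation:` key as `species`, `tissue_name`, and `parent_res` (merge them into that existing block rather than adding a second `annotation:` key). Writing `sub_res` as a list is an assumption made by analogy with `parent_res`.

```yaml
annotation:
  preprocess: true        # false only for an already processed object
  sub_res:                # assumed list form, mirroring parent_res
    - 0.1
    - 0.2
  min_cell_count: 3000    # parent types below this size are not subclustered
  n_runs_parent: 3        # table suggests 10 for final runs
  n_runs_sub: 3           # ignored if no subclustering is requested
```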

Ontology And Review Policy

Field	What it controls	Example / options
`policy.ontology`	Whether RAG review should prefer ontology-mapped candidate labels when possible.	`true` or `false`
`policy.granularity`	Label specificity during review. This does not change clustering resolution.	`coarse`, `balanced`, or `fine`
`policy.review_tie`	Whether tied label decisions should go to human review.	`true` or `false`
`policy.review_nomatch`	Whether unknown or poorly matched labels should go to human review.	`true` or `false`
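
For illustration, a `policy` block using the options listed above might look like the following. The values are picked from the allowed options as an example and are not necessarily the defaults in `configs/demo_optional.yaml`.

```yaml
policy:
  ontology: true           # prefer ontology-mapped candidate labels
  granularity: balanced    # coarse, balanced, or fine
  review_tie: true         # send tied decisions to human review
  review_nomatch: true     # send unknown or poorly matched labels to review
```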

LLM And PDF Models

Field	What it controls	Example / options
`llm.annotation.model`	Model used for main annotation and review. Changing it can change labels, cost, and runtime.	`gpt-5`
`llm.pdfmarkers.model`	Model used only when `inputs.pdf_dir` contains PDFs.	`gpt-5-nano`
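
Assuming the dotted names map to nested keys in the same way as the required fields, the two model settings would be written roughly like this, using the example model names from the table:

```yaml
llm:
  annotation:
    model: gpt-5         # main annotation and review
  pdfmarkers:
    model: gpt-5-nano    # used only when inputs.pdf_dir contains PDFs
```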

Subclustering

Field	What it controls	Example / options
`alignment.celltypes_to_subcluster`	Parent cell types to split into finer subclusters. Keep `null` to skip preset subclustering; you can also ask the agent later.	`null` or `["macrophage", "T cell"]`
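
A short sketch of presetting subclustering for two parent cell types. The block-style list below is standard YAML for the inline form `["macrophage", "T cell"]` given in the table; the nesting under `alignment:` is inferred from the dotted field name.

```yaml
alignment:
  celltypes_to_subcluster:
    - macrophage
    - T cell
```

Setting the field to `null` instead skips preset subclustering.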

Evaluation And Report

Field	What it controls	Example / options
`evaluation.enabled`	Whether to compare OntoAnno labels against `inputs.reference_labels_csv`. A CSV path alone does not enable evaluation.	`true` or `false`
`evaluation.manual_col`	Column name in `reference_labels_csv` that contains the known labels.	`celltype` or `SingleR_labels`
`report.format`	Final report format. `html` is the safest default.	`html` or `pdf`
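
Because a CSV path alone does not enable evaluation, a comparison run needs both the input path and the `evaluation` switch. The fragment below is a sketch that assumes the nesting implied by the dotted field names; the path and column name are the placeholder examples from the tables.

```yaml
inputs:
  reference_labels_csv: /data/project/labels.csv
evaluation:
  enabled: true            # required; the CSV path by itself is not enough
  manual_col: celltype     # column in the CSV holding the known labels
report:
  format: html
```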

Advanced Fields Not Shown In demo_optional.yaml

Most users should not edit these. They are still supported by the config loader when needed.

Field	When it matters	Notes
`inputs.annotation_output_dir`	Import an existing GPTAnno parent annotation output for review/RAG/report workflows.	Advanced import path. It is not a replacement for `inputs.seurat_rds` in a normal full run.
`llm.annotation.api_url`	Use a custom OpenAI-compatible gateway.	Keep the default `null` to use OpenAI directly.
`llm.annotation.system_prompt`	Override the model’s system instruction.	Can change annotation behavior; leave unset unless you have a specific reason to override it.
`alignment.user_restrict_to` / `alignment.manual_resolution_map`	Force advanced subcluster label or resolution choices.	Developer or expert use only.
`evaluation.baselines`	Run external baseline comparison commands.	Advanced. May execute shell commands.
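
As one hedged example of an advanced override, pointing the annotation model at a custom gateway might look like this. The URL is a hypothetical placeholder, and the nesting is inferred from the dotted field name rather than taken from a shipped template.

```yaml
llm:
  annotation:
    # Hypothetical OpenAI-compatible gateway; leave as null for normal OpenAI.
    api_url: https://llm-gateway.example.com/v1
```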