This library implements a novel method for mapping MaveDB scoreset data to GA4GH Variation Representation Specification (VRS) objects, enhancing interoperability for genomic medicine applications. See Arbesfeld et al. (2023) for a preprint edition of the mapping manuscript, or download the resulting mappings directly.
- Universal Transcript Archive (UTA): see README for setup instructions. Users with access to Docker on their local devices can use the available Docker image; otherwise, start a relatively recent (version 14+) PostgreSQL instance and add data from the available database dump.
- SeqRepo: see README for setup instructions. The SeqRepo data directory must be writeable; see specific instructions here for more.
- Gene Normalizer: see documentation for data setup instructions.
- blat: Must be available on the local PATH and executable by the user. Otherwise, its location can be set manually with the BLAT_BIN_PATH env var. See the UCSC Genome Browser FAQ for download instructions.
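Before a first run, it can help to confirm which blat binary would be picked up. The sketch below mirrors the rule stated in the blat prerequisite (an explicit BLAT_BIN_PATH, otherwise a PATH lookup). It is not the library's own lookup code, and `find_blat` is a hypothetical helper name.

```python
import os
import shutil


def find_blat(env=None):
    """Return the path of a usable blat binary, or None if none is found.

    Hypothetical helper: it only illustrates the documented rule
    (BLAT_BIN_PATH override first, then PATH).
    """
    env = os.environ if env is None else env
    override = env.get("BLAT_BIN_PATH")
    if override:
        # An explicit override wins, but only if it names an executable file.
        usable = os.path.isfile(override) and os.access(override, os.X_OK)
        return override if usable else None
    # No override: search the directories listed in PATH.
    return shutil.which("blat", path=env.get("PATH"))


if __name__ == "__main__":
    print(find_blat() or "blat not found; set BLAT_BIN_PATH or extend PATH")
```

Passing `env` explicitly keeps the check easy to exercise without touching the real environment.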
Install from PyPI:
python3 -m pip install dcd-mapping
Use the dcd-map command with a scoreset URN, e.g.
$ dcd-map urn:mavedb:00000083-c-1
Output is saved in the format <URN>_mapping_results_<ISO datetime>.json in the directory specified by the environment variable MAVEDB_STORAGE_DIR, or ~/.local/share/dcd-mapping by default.
Use dcd-map --help to see other available options.
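Scripts that post-process results often need to find the newest output file for a scoreset. The following is a minimal sketch based only on the naming pattern and default directory described above. It assumes the URN appears verbatim in the file name and that the ISO timestamps share one format, so they sort lexicographically. `storage_dir` and `latest_mapping_file` are hypothetical helpers, not part of dcd-mapping.

```python
import os
from pathlib import Path


def storage_dir(env=None):
    """Resolve the output directory: MAVEDB_STORAGE_DIR, else the default."""
    env = os.environ if env is None else env
    configured = env.get("MAVEDB_STORAGE_DIR")
    if configured:
        return Path(configured)
    return Path.home() / ".local" / "share" / "dcd-mapping"


def latest_mapping_file(urn, directory):
    """Return the newest '<URN>_mapping_results_<ISO datetime>.json', or None."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    prefix = f"{urn}_mapping_results_"
    candidates = [
        p for p in directory.iterdir()
        if p.name.startswith(prefix) and p.suffix == ".json"
    ]
    # Assumption: same-format ISO timestamps sort by name in time order.
    return max(candidates, key=lambda p: p.name, default=None)


if __name__ == "__main__":
    print(latest_mapping_file("urn:mavedb:00000083-c-1", storage_dir()))
```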
Each mapping run produces a single JSON document conforming to schema.json (the JSON Schema serialization of ScoresetMapping). Top-level keys:
- metadata: the verbatim MaveDB API scoreset response, stored unchanged so no upstream fields are lost.
- mapped_date: ISO 8601 UTC timestamp of when this run completed.
- reference_sequences: per-target reference sequence info per annotation layer.
- mapped_scores: flat list of per-variant ScoreAnnotation records (see below).
- target_mappings: per-(target, alignment_level) provenance and alignment QC rows. The MaveDB API consumes these as target_gene_mappings and uses them to attribute every mapped_score back to the alignment that produced it.
- error_message: populated only when the run failed before producing scores.
metadata holds the verbatim MaveDB API scoreset response. It is stored unchanged so downstream consumers retain access to every upstream field (URN, title, description, target gene definitions, score-column metadata, etc.) without having to query MaveDB again.
reference_sequences is a dict[target_gene_name, TargetAnnotation] describing the reference sequences each target was mapped against, organized by annotation layer. Each TargetAnnotation carries:
- gene_info: hgnc_symbol plus the selection_method that picked it (transcript-derived, alignment-overlap-derived, variant-overlap-derived, or metadata fallback).
- layers: a dict[AnnotationLayer, {computed_reference_sequence, mapped_reference_sequence}] populated only for layers that actually produced mappings. computed_reference_sequence is the in-pipeline sequence (e.g. translated protein); mapped_reference_sequence lists the canonical accession(s) the variants were ultimately grounded in. Layers with no usable reference are pruned, not emitted as null.
This block is the human-readable "what was used as reference" view; programmatic auditing should use target_mappings instead.
mapped_scores is a flat list of per-variant ScoreAnnotation records, with one entry per (score_record, emitted annotation_layer) pair. Key fields:
- mavedb_id, score: identifier and numeric score copied from the MaveDB record.
- relation: fixed at "SO:is_homologous_to" while pre_mapped is populated.
- target_gene_identifier, alignment_level: composite key linking back to a target_mappings row (see below).
- pre_mapped, post_mapped: VRS variant objects in the target's coordinate frame and in the reference frame, respectively. Either may be null for failed mappings.
- vrs_version: VRS schema version used for this record.
- error_message: populated when post_mapped is null or when mapping succeeded with a caveat (e.g. RLE fallback, ambiguous reference allele).
- at_mismatched_locus, near_gap: per-variant audit flags, described below.
target_mappings holds per-(target, alignment_level) provenance and alignment QC rows. The MaveDB API consumes these as target_gene_mappings and uses them to attribute every mapped_score back to the alignment that produced it. (See schema.json TargetMapping for the wire format.)
error_message is populated only when the run failed before producing any scores; otherwise it is omitted. Per-variant errors live on mapped_scores[].error_message, not here.
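As a quick orientation to the top-level keys described above, the following sketch summarizes a loaded mapping document. It relies only on the key names listed in this section; `summarize` is a hypothetical helper, and the file name in the usage stub is a placeholder.

```python
import json


def summarize(doc):
    """Summarize a mapping document (a dict parsed from the output JSON)."""
    # error_message is only present when the run failed before producing scores.
    if doc.get("error_message"):
        return {"status": "failed", "error": doc["error_message"]}
    scores = doc.get("mapped_scores") or []
    # A record counts as mapped when post_mapped is not null.
    mapped = sum(1 for s in scores if s.get("post_mapped") is not None)
    return {
        "status": "ok",
        "targets": sorted(doc.get("reference_sequences") or {}),
        "records": len(scores),
        "mapped": mapped,
    }


if __name__ == "__main__":
    # Placeholder path: substitute a real output file from a mapping run.
    with open("mapping_results.json") as handle:
        print(summarize(json.load(handle)))
```

Per-variant failures stay visible through `records` minus `mapped`, since they are reported on each record, not at the top level.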
Each target_mappings row describes the alignment that one set of mapped variants is grounded in:
| Field | Notes |
|---|---|
| target_gene_identifier, alignment_level, preferred | Composite key. (target_gene_identifier, alignment_level) is unique per run. Exactly one row per target has preferred=True. |
| tool_name, tool_version, tool_parameters | Aligner provenance. tool_parameters.aligner is "blat" for sequence-based targets and "cdot_transcript_placement" for accession-based targets. |
| reference_accession, reference_sequence_id, vrs_version | Coordinate-frame and run provenance. |
| percent_identity, alignment_score, next_best_alignment_score, alignment_length, mismatch_count, gap_count | Aggregate QC for the winning HSP. alignment_score is the canonical PSL score (identities − mismatches − qNumInsert − tNumInsert). |
| alignment_string, alignment_metadata | Pairwise visualization plus a small structured payload (CIGAR, near_gap_window, at_mismatched_locus_evaluated). |
| total_variants, variants_mapped_cleanly, variants_with_mapping_warnings, variants_with_alignment_warnings, variants_failed | Per-row variant counts. variants_with_alignment_warnings counts variants whose reference position fell on a mismatched base or near a gap. |
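The alignment_score formula quoted in the table can be written out directly. This is a sketch of that arithmetic only; the function and parameter names are hypothetical, and the numbers in the usage stub are invented for illustration.

```python
def psl_score(identities, mismatches, q_num_insert, t_num_insert):
    """Score as stated in the table:
    identities - mismatches - qNumInsert - tNumInsert.
    """
    return identities - mismatches - q_num_insert - t_num_insert


if __name__ == "__main__":
    # Invented numbers: 950 identical bases, 10 mismatches,
    # 2 insert blocks in the query and 1 in the target.
    print(psl_score(950, 10, 2, 1))  # 950 - 10 - 2 - 1 = 937
```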
Each ScoreAnnotation is attributable to exactly one target_mappings row via the composite key (target_gene_identifier, alignment_level). The pipeline enforces this as a runtime invariant — orphaned scores raise RuntimeError rather than silently corrupting downstream joins.
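A downstream consumer can re-check the same join on a loaded document. The sketch below is not the pipeline's internal code; it assumes only the field names documented here, both helper names are hypothetical, and the alignment_level values in the usage stub are placeholders.

```python
from collections import Counter


def index_target_mappings(rows):
    """Index rows by (target_gene_identifier, alignment_level)."""
    index = {}
    for row in rows:
        key = (row["target_gene_identifier"], row["alignment_level"])
        if key in index:
            raise ValueError(f"duplicate target_mappings key: {key}")
        index[key] = row
    # Each target must have exactly one preferred row.
    preferred = Counter(
        row["target_gene_identifier"] for row in rows if row.get("preferred")
    )
    targets = {row["target_gene_identifier"] for row in rows}
    bad = sorted(t for t in targets if preferred[t] != 1)
    if bad:
        raise ValueError(f"targets without exactly one preferred row: {bad}")
    return index


def attribute_scores(scores, index):
    """Group mavedb_ids under their target_mappings key; fail on orphans."""
    grouped = {key: [] for key in index}
    for score in scores:
        key = (score["target_gene_identifier"], score["alignment_level"])
        if key not in index:
            raise RuntimeError(f"orphaned score {score['mavedb_id']}: {key}")
        grouped[key].append(score["mavedb_id"])
    return grouped


if __name__ == "__main__":
    rows = [
        {"target_gene_identifier": "T1", "alignment_level": "genomic", "preferred": True},
        {"target_gene_identifier": "T1", "alignment_level": "protein", "preferred": False},
    ]
    scores = [
        {"mavedb_id": "u#1", "target_gene_identifier": "T1", "alignment_level": "genomic"},
    ]
    print(attribute_scores(scores, index_target_mappings(rows)))
```

Raising on orphans, instead of skipping them, matches the stated intent of keeping downstream joins trustworthy.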
Per-variant locus flags:
- at_mismatched_locus: True when any base in the variant's reference span mismatches between the target sequence and the reference; False when evaluated and no mismatch was found; None when per-base sequence content was unavailable for that layer (see alignment_metadata.at_mismatched_locus_evaluated), or when the variant is a ReferenceLengthExpression allele (large deletions/duplications, always None/None).
- near_gap: True when the variant lies within alignment_metadata.near_gap_window reference bases of any alignment gap; None for layers without an alignment (e.g. cdna).
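Because both flags are tri-state (True, False, or None), a plain truthiness test would conflate "evaluated and clean" with "not evaluated". The sketch below shows one way a consumer might triage records. The three labels and the `locus_status` name are hypothetical and are not values emitted by the pipeline.

```python
def locus_status(record):
    """Classify a ScoreAnnotation-like dict by its two audit flags."""
    flags = (record.get("at_mismatched_locus"), record.get("near_gap"))
    # Any flag that is explicitly True marks the variant for review.
    if any(flag is True for flag in flags):
        return "flagged"
    # Both flags were evaluated and came back False.
    if all(flag is False for flag in flags):
        return "clean"
    # At least one flag is None (null in the JSON) and none is True.
    return "unevaluated"


if __name__ == "__main__":
    print(locus_status({"at_mismatched_locus": None, "near_gap": None}))
```

Comparing with `is True` and `is False` keeps None distinct from False, which matters when filtering variants for downstream analysis.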
Completely-failed variants (pre_mapped is None and no annotation layer was determined) are attributed to the target's preferred layer so the join invariant holds.
schema.json is checked in and consumed by downstream services (notably the MaveDB API). After any change to src/dcd_mapping/schemas.py that alters the public output contract, regenerate it:
python scripts/generate_schema.py
Commit the regenerated schema.json in the same change.
Notebooks for manuscript data analysis and figure generation are provided within notebooks/analysis. See notebooks/analysis/README.md for more information.
Following the installation instructions for CoolSeqTool and Gene Normalizer should take care of the external data dependencies.
Note that Gene Normalizer's pg dependency group must be installed to make use of the PostgreSQL-based backend:
python3 -m pip install 'gene-normalizer[pg]'
Clone the repo
git clone https://github.com/ave-dcd/dcd_mapping
cd dcd_mapping
Create and activate a virtual environment
python3 -m virtualenv venv
source venv/bin/activate
Install as editable and with developer dependencies
python3 -m pip install -e '.[dev,tests]'
Add pre-commit hooks
pre-commit install
Run tests with pytest
pytest