How ClinVar validates submissions
Validation during submission processing
ClinVar analyzes the content of submissions and validates selected elements. This analysis includes both automated checks and manual checks by curators. Some checks result in rejecting a submission; others allow the submission to proceed but return the questioned information to the submitter for review.
Variant
Variant definition
ClinVar validates variants with a precise location described by either an HGVS expression or chromosome coordinates. This validation primarily uses NCBI’s Variation Services, which are based on SPDI notation (PMID: 31738401); it is supplemented with validation of intronic variants described on RefSeq transcripts and with limited validation of variants that are large and/or have an imprecise location, such as CNVs. Variants that do not pass validation are not processed by ClinVar; they are returned to the submitter for correction or removal from the submission.
For HGVS expressions
- validate HGVS format
- validate content, including the reference sequence and reference allele
- ClinVar does have some older submitted records based on HGVS expressions that were not handled as rigorously. These may include variant descriptions that are truly valid that our code did not handle correctly, and variants that are truly invalid but that we accepted anyway. For those submissions the HGVS expression is reported as “non-validated”, rather than invalid.
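As a rough illustration of the format check listed above, the sketch below tests only the surface syntax of a simple substitution expression. The pattern and the function name `looks_like_hgvs_substitution` are illustrative assumptions, not ClinVar's code; the validation described here relies on NCBI's Variation Services and also checks the reference sequence and reference allele, which a regular expression cannot do.

```python
import re

# Hypothetical pattern covering only single-base substitutions:
# accession.version, coordinate type, position (optionally with an
# intronic offset), reference base, ">", alternate base.
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+)"   # e.g. NM_004006.2
    r":(?P<kind>[cgn])\."                   # coding, genomic, or non-coding
    r"(?P<position>\d+(?:[+-]\d+)?)"        # e.g. 357 or 357+1
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

def looks_like_hgvs_substitution(expression: str) -> bool:
    """Return True if the expression has the shape of a simple substitution."""
    return _SUBSTITUTION.match(expression) is not None
```

A format check of this kind can only reject malformed strings; an expression that passes it may still fail content validation.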
For chromosome coordinates
- validate assembly and chromosome
- validate that the chromosomal locations are within the range of the chromosome
- validate that outer start < inner start < start < stop < inner stop < outer stop
- validate that start and stop are consecutive nucleotides for insertions
- validate that the length of the reference allele matches the start/stop locations for deletions
- validate that the reference and alternate alleles are provided; otherwise, variant type is expected, to describe large deletions, duplications, etc.
- validate that for insertions, the reference allele is either “-” or an anchor nucleotide (VCF-style)
- validate that for deletions, the alternate allele is either “-” or an anchor nucleotide (VCF-style)
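Several of the coordinate checks above can be sketched as a small function. This is a minimal illustration under stated assumptions, not ClinVar's implementation: the function name, parameter names, and messages are hypothetical, the inner/outer position checks are omitted, and for deletions the sketch assumes that start and stop span the reference allele exactly as submitted.

```python
from typing import List, Optional

def check_coordinates(start: int, stop: int,
                      ref: Optional[str] = None, alt: Optional[str] = None,
                      variant_type: Optional[str] = None,
                      chromosome_length: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if start < 1 or (chromosome_length is not None and stop > chromosome_length):
        problems.append("location is outside the chromosome")
    if start > stop:
        problems.append("start is after stop")
    if not ref or not alt:
        # Without alleles, a variant type is expected (large deletions, etc.).
        if not variant_type:
            problems.append("alleles or a variant type are required")
        return problems
    # "-" style or VCF-style (single anchor nucleotide) allele descriptions.
    is_insertion = ref == "-" or (len(ref) == 1 and len(alt) > 1 and alt.startswith(ref))
    is_deletion = alt == "-" or (len(alt) == 1 and len(ref) > 1 and ref.startswith(alt))
    if is_insertion and stop != start + 1:
        problems.append("insertion: start and stop must be consecutive")
    if is_deletion and len(ref) != stop - start + 1:
        problems.append("deletion: reference allele length does not match start/stop")
    return problems
```

For example, `check_coordinates(100, 101, "-", "AT")` describes an insertion between two consecutive nucleotides and returns an empty list.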
For both HGVS expressions and chromosomal coordinates
- validate that the asserted reference allele matches the allele in the reference sequence
- validate that the alternate allele is an IUPAC base, or one of a number of non-standard abbreviations for retrotransposon insertions (insAlu, insLINE, insLINE1, insSINE)
- validate that the reference sequence is human
- validate that the reference sequence is valid
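The two allele checks above can be sketched as follows. The function names are hypothetical, the reference sequence is passed in as a plain string for illustration (a real check would retrieve it from the reference sequence record), and the sketch assumes that a multi-base alternate allele is acceptable when every character is an IUPAC nucleotide code.

```python
# IUPAC nucleotide codes, including the ambiguity codes.
IUPAC_BASES = set("ACGTURYSWKMBDHVN")

# Non-standard abbreviations named in the list above.
SPECIAL_INSERTIONS = {"insAlu", "insLINE", "insLINE1", "insSINE"}

def alternate_allele_is_valid(alt: str) -> bool:
    """Accept a special insertion token or a non-empty string of IUPAC codes."""
    if alt in SPECIAL_INSERTIONS:
        return True
    return bool(alt) and all(base in IUPAC_BASES for base in alt.upper())

def reference_allele_matches(reference_sequence: str, start: int, ref: str) -> bool:
    """Compare the asserted reference allele with the sequence at a 1-based start."""
    observed = reference_sequence[start - 1:start - 1 + len(ref)]
    return observed.upper() == ref.upper()
```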
Known issues
Our variant validation has a few known issues. They include some less frequently used HGVS standards that our code does not handle yet, as well as other issues for which we do not have a programmatic solution yet. We will continue to improve the code; please contact us at clinvar@ncbi.nlm.nih.gov if you have a submission that is blocked by one of these issues.
- The HGVS format using both a genomic reference sequence (NC or NG) and an NM transcript for intronic variants, e.g. NC_000023.10(NM_004006.2):c.357+1G>A, is valid but not handled by ClinVar’s validation.
- New versions of RefSeq NM transcripts may not be validated by ClinVar immediately.
- Suppressed RefSeq records are currently accepted, although they should return an error.
- Some HGVS expressions representing no change at intronic positions are not validated correctly.
- Some HGVS expressions representing microsatellites with uncertain ranges or with ambiguous nucleotides are not validated correctly.
- Some HGVS expressions representing inversions across intron-exon or exon-UTR boundaries are not validated correctly.
Examples
The examples that we use as test cases for our validation code are available on the ClinVar FTP site. They are organized into files with either valid or invalid cases, and with descriptions using HGVS expressions or chromosomal coordinates. The files using chromosomal coordinates are available in both GRCh37 and GRCh38 coordinates.
Consistency checking
- If an rs number is provided for a variant, ClinVar validates that the genomic location for that identifier is consistent with the variant description.
- Legacy descriptions are not validated.
- If a submission is updated, ClinVar stops the processing if a change in the variant definition is detected. The update is allowed to continue only if the submitter verifies that the previous definition was an error. ClinVar then assigns a new AlleleID and a new VariationID when it updates the version of the SCV accession.
Multiple variants
- If multiple variants are submitted together as a single record, e.g. as a haplotype or as a compound heterozygote, curators confirm the submitter's intent. This may result in splitting a submission. For example, two variants submitted together as a compound heterozygote and classified as pathogenic for an autosomal recessive disease should be split into two distinct submissions, each reporting an individual variant as pathogenic for the disease.
Condition
- If a variant is classified for a disease or phenotype with a database identifier, ClinVar validates:
  - that the identifier is valid for the database
  - that the identifier is for a disease or phenotype. For example, a MIM number for a disease is valid; a MIM number for a gene is not.
- If a variant is classified for more than one disease, ClinVar validates whether the classification is for several related diseases or for the combination of distinct diseases in the same individual.
- If a variant is classified for a set of HPO identifiers, ClinVar validates that the HPO identifiers represent a novel, unnamed disease, rather than a patient's specific phenotype.
Variant-condition
- ClinVar checks whether the submitter has already submitted a classification for the variant and condition. We are aware that the database contains some duplicate records that were submitted before checks were put in place. We are working with submitters to resolve the duplicates.
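The duplicate check described above amounts to looking for a repeated combination of submitter, variant, and condition. The sketch below is an illustration only; the tuple layout and the function name `find_duplicates` are assumptions and do not reflect how ClinVar stores its records.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record key: (submitter, variant, condition).
Key = Tuple[str, str, str]

def find_duplicates(records: Iterable[Key]) -> List[Key]:
    """Return each key that appears more than once, in sorted order."""
    counts = Counter(records)
    return sorted(key for key, n in counts.items() if n > 1)
```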
Evidence
ClinVar validates:
- PubMed IDs and other citation identifiers
- Sequence Ontology (SO) and Variation Ontology (VariO) terms used to describe functional consequence
- Genetic Testing Registry (GTR) test IDs provided as the test type used to detect the variant
- organization ID provided as testing laboratory
Miscellaneous
- For an update to an SCV accession, ClinVar verifies that the submitter is the owner of that record.
- If a batch of submissions includes only variants of "uncertain significance", curators confirm that the variants were determined to be uncertain as the result of a classification process. If the variants are uncertain because they were not classified, they are not in scope for ClinVar.
Classification
- ClinVar does not validate the classification of a variant.
  - Curation of classifications is performed by expert panels and professional societies, which provide practice guidelines and submit their curation results to ClinVar.
- ClinVar does not determine which classification is correct when submitters disagree.
  - ClinVar represents the classification provided by each submitter.
  - ClinVar calculates an aggregate classification based on submissions and indicates when there is conflict between submitters.
  - Submitters are encouraged to provide evidence for the classification so that users can understand why other submitters may disagree with the classification.
  - A curated classification from an expert panel or a practice guideline overrules any conflict from other submitters.
- ClinVar staff do not review criteria for classification used by submitters for appropriateness (assertion criteria).
  - The submitter may provide documentation of the categories used to classify variants and the criteria needed to categorize variants into each bin.
  - ClinVar staff may review this documentation to ensure that it describes categories and criteria, but they do not decide whether the categories and criteria are appropriate.
  - This documentation of assertion criteria is for users to evaluate how a classification was made and may help users understand why submitters disagree in their classification.
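The aggregation behavior described above (report a conflict when submitters disagree, but let a curated classification from an expert panel or practice guideline take precedence) can be sketched as follows. This is a simplified illustration, not ClinVar's algorithm: the `Submission` type, the `curated` flag, and the returned label strings are all hypothetical, and the real aggregation rules are more detailed than an equality comparison.

```python
from typing import List, NamedTuple

class Submission(NamedTuple):
    submitter: str
    classification: str
    curated: bool = False  # True for an expert panel or practice guideline

def aggregate(submissions: List[Submission]) -> str:
    """Summarize submissions, letting curated classifications take precedence."""
    curated = [s for s in submissions if s.curated]
    # If any curated submission exists, only curated submissions are considered.
    considered = curated or submissions
    distinct = {s.classification for s in considered}
    if not distinct:
        return "no classification"
    if len(distinct) == 1:
        return distinct.pop()
    return "conflicting classifications"
```

With two laboratories that disagree, the sketch returns the conflict label; adding a curated submission makes its classification the aggregate.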
Information reported after submission processing
After submission, a report is provided to the submitter based on checks done when the submitted data is integrated into the database. This report is for the submitter's information only, and it includes information such as:
- ClinVar processed a variant description that could not be validated
- the submitted HGVS expression uses a previous version of the reference sequence
- the classification is inconsistent with the allele frequency (e.g. a pathogenic variant with a high allele frequency)
- the classification was made for a novel gene-disease relationship
- the submitter has redundant submissions for the same variant-disease relationship (ClinVar checks for this issue proactively, but this check addresses historical issues with redundant records)
- the submitter’s classification conflicts with the classification from an expert panel or practice guideline
- the submitter’s classification differs from another submitted classification
- the disease for the classification is idiopathic
- the classification is “Pathogenic” but no disease was provided
- the classification changed but the date last evaluated did not change
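Two of the report items above are simple consistency checks on a single record and can be sketched as shown below. The function name, parameters, and message strings are hypothetical; the sketch assumes dates are passed as comparable strings and covers only the "Pathogenic without a disease" and "classification changed without a new evaluation date" items.

```python
from typing import List, Optional

def report_notes(classification: str,
                 condition: Optional[str],
                 previous_classification: Optional[str] = None,
                 date_last_evaluated: Optional[str] = None,
                 previous_date_last_evaluated: Optional[str] = None) -> List[str]:
    """Return informational notes for one record; an empty list means none apply."""
    notes = []
    if classification == "Pathogenic" and not condition:
        notes.append("classification is Pathogenic but no disease was provided")
    # Only compare against a previous version when one exists.
    if (previous_classification is not None
            and previous_classification != classification
            and date_last_evaluated == previous_date_last_evaluated):
        notes.append("classification changed but date last evaluated did not change")
    return notes
```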