NCBI Pathogen Detection Frequently Asked Questions (FAQ)
Where can I find more information on the NCBI Pathogen Detection System?
Our help page provides detailed documentation on how to use the NCBI Pathogen Detection System, including all of the ways to view the processed data in our browsers. Additional information on the organization of the FTP site and on analysis methods can be found on our FTP site in the ReadMe.txt and Methods.txt files.
What is the Pathogen Detection project?
The NCBI Pathogen Detection project is a centralized system that integrates sequence data and analysis for bacterial pathogens.
NCBI Pathogen Detection integrates bacterial and fungal pathogen genomic sequences from numerous ongoing surveillance and research efforts whose sources include food, environmental sources such as water or production facilities, and patient samples. Foodborne, hospital-acquired, and other potentially clinically infectious pathogens are included.
The system provides two major automated real-time analyses:
- It quickly clusters related pathogen genome sequences to identify potential transmission chains, helping public health scientists investigate disease outbreaks
- As part of the National Database of Antibiotic Resistant Organisms (NDARO), NCBI screens genomic sequences using AMRFinderPlus to identify the antimicrobial resistance, stress response, and virulence genes found in bacterial genomic sequences, which enables scientists to track the spread of resistance genes and to understand the relationships among antimicrobial resistance, stress response, and virulence.
A number of public health agencies and researchers in the US and internationally are collecting samples from clinical cases, from the environment, from food products, and from industrial production facilities to facilitate active, real-time surveillance of pathogens, including foodborne disease. Public health agencies and researchers sequence the samples and submit the data to NCBI, which analyzes the sequences and compares them to others in its database, including all genomes in GenBank, to identify closely related sequences. The aim is to identify closely or clonally related isolates to aid in outbreak investigation.
The NCBI Pathogen Detection Project also analyzes the assemblies in its database in real time for known antimicrobial resistance (AMR) genes and other genes of interest, and maintains software and databases to facilitate monitoring and research, including the National Database of Antibiotic Resistant Organisms.
What is a target/isolate?
A target is an assembled genome for one pathogen isolate. The assembly for a target may not yet be available in GenBank.
What are these Accession Numbers? What is a PDG?
The Pathogen Detection system is currently using three series of Accession Numbers.
- PDT# - isolate accession for each pathogen genome
- PDG# - taxgroup accessions for each organism group
- PDS# - SNP cluster accessions, which identify sets of isolates that have been determined by wgMLST or single-linkage clustering to be closely related, for the purposes of facilitating outbreak and traceback investigations. Note that not all isolates within a taxgroup will be genetically similar enough to another isolate to fall into a SNP cluster. In other words, there will be isolates in the taxgroup (PDG#) that do not have a PDS number assigned because they were not found to be sufficiently close to other isolates in our system to be clustered together.
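As a rough illustration of why an isolate can end up without a PDS accession, the sketch below applies single-linkage clustering to a handful of made-up isolates. The isolate labels, the pairwise SNP distances, and the 50-SNP cutoff are all assumptions chosen for illustration. This is not the pipeline's code, and Methods.txt remains the reference for what the pipeline actually does.

```python
from itertools import combinations

# Hypothetical pairwise SNP distances between four made-up isolates.
SNP_DISTANCES = {
    frozenset(("iso_A", "iso_B")): 10,
    frozenset(("iso_A", "iso_C")): 60,
    frozenset(("iso_A", "iso_D")): 200,
    frozenset(("iso_B", "iso_C")): 40,
    frozenset(("iso_B", "iso_D")): 210,
    frozenset(("iso_C", "iso_D")): 190,
}


def snp_distance(a, b):
    """Look up the (hypothetical) SNP distance between two isolates."""
    return SNP_DISTANCES[frozenset((a, b))]


def single_linkage_clusters(isolates, distance, threshold):
    """Group isolates so that each member is within `threshold` of at
    least one other member. Isolates with no close neighbor are left out."""
    parent = {iso: iso for iso in isolates}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(isolates, 2):
        if distance(a, b) <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for iso in isolates:
        groups.setdefault(find(iso), []).append(iso)
    # A group of one has no close neighbor, so it is not a cluster.
    return [sorted(members) for members in groups.values() if len(members) > 1]


isolates = ["iso_A", "iso_B", "iso_C", "iso_D"]
clusters = single_linkage_clusters(isolates, snp_distance, threshold=50)
clustered = {iso for cluster in clusters for iso in cluster}
unclustered = [iso for iso in isolates if iso not in clustered]
print(clusters)     # [['iso_A', 'iso_B', 'iso_C']]
print(unclustered)  # ['iso_D']
```

In this toy data iso_A and iso_C are 60 SNPs apart yet share a cluster, because each is within the cutoff of iso_B. iso_D is not within the cutoff of any other isolate, so it would carry a taxgroup accession but no cluster accession.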
These Accession Numbers are distinct from those associated with nucleotide records in GenBank or RefSeq. These Accession Numbers are available in the Pathogen Detection Isolates Browser, MicroBIGG-E, and in FTP files.
Each accession ends in a version suffix, e.g., PDG000000004.355. A larger version number is always more recent than a smaller version number.
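When comparing versions in a script, the version should be compared as a number rather than as text, since a plain string comparison would rank ".99" above ".355". The following is a minimal sketch, assuming accessions always have the form prefix, digits, a dot, and a numeric version as in the example above. The function names are hypothetical and not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class PDAccession(NamedTuple):
    prefix: str   # "PDT", "PDG", or "PDS"
    number: int   # numeric part before the dot
    version: int  # numeric part after the dot


# Assumed shape: PDT/PDG/PDS, digits, ".", digits.
_ACCESSION_RE = re.compile(r"^(PD[TGS])(\d+)\.(\d+)$")


def parse_accession(text):
    """Split a versioned accession such as 'PDG000000004.355' into parts."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned Pathogen Detection accession: {text!r}")
    prefix, number, version = match.groups()
    return PDAccession(prefix, int(number), int(version))


def newest(accessions):
    """Return the most recent version among versions of ONE accession."""
    parsed = [parse_accession(a) for a in accessions]
    if len({(p.prefix, p.number) for p in parsed}) != 1:
        raise ValueError("versions are only comparable within one accession")
    return max(accessions, key=lambda a: parse_accession(a).version)


print(parse_accession("PDG000000004.355"))
print(newest(["PDG000000004.99", "PDG000000004.355"]))  # PDG000000004.355
```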
Note that unlike many other NCBI resources the Pathogen Detection accessions are not archival (the data they refer to is not guaranteed to persist indefinitely) because of the rapid real-time reanalysis that happens in the Pathogen Detection System. For that reason we recommend you download the metadata associated with your analyses and save it locally. Check our HowTo Cite the Pathogen Detection Resource and the Data Contained Within PDF for info on how to get the metadata for web-based analyses.
Does the Pathogen Detection system use all genomes?
For each organism group the system incorporates ALL assembled genomes submitted to GenBank that are not flagged as anomalous in the Assembly database, and select sequences submitted to SRA by public health agencies and others. If you would like sequences you have submitted to SRA from the supported organisms to be included in the NCBI Pathogen Detection analysis please contact us at pd-help@ncbi.nlm.nih.gov.
What about other types of bacterial pathogens?
NCBI Pathogen Detection now analyzes over 50 taxa, and more are added as capacity allows. If you are interested in large-scale real-time analyses of pathogens that are not included in the Organism Groups we analyze, please contact us at pd-help@ncbi.nlm.nih.gov.
Where are the analysis results? How can I visualize the results?
Results can be viewed in the Isolates Browser and MicroBIGG-E, and downloaded by FTP. They are updated approximately every day if new data are submitted. Our system outputs SNP cluster trees in *.pdf, *.newick, and *.asn formats. The *.asn format can be opened in NCBI's Genome Workbench tool. The phylogenetic trees for the SNP clusters are available both on FTP and in the Isolates Browser. AMRFinderPlus results are available in the Microbial Browser for the Identification of Genetic and Genomic Elements (MicroBIGG-E).
More information can be found in our help documentation and FTP ReadMe file.
How are isolates clustered and what are SNP clusters?
See the Pipeline overview on our help page for a high level description of the pipeline including clustering, and the Methods.txt on our FTP site for more details.
How are SNP distances computed?
SNP distances are patristic distances on the maximum compatibility tree, that is, the sum of the branch lengths along the path connecting two isolates on the tree. See below and the paper (Cherry, J. 2017. A practical exact maximum compatibility algorithm for reconstruction of recent evolutionary history. BMC Bioinformatics: 2017 Feb 23;18(1):127. doi: 10.1186/s12859-017-1520-4) for more information on SNP distance calculation. For a description of the filtering of potential SNP sites that occurs before running compat, see the Methods.txt file on FTP.
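To make the term concrete: a patristic distance is the sum of branch lengths along the path between two tips of a tree. The sketch below computes it on a tiny hand-written tree. The tree, its node names, and its branch lengths (in SNPs) are invented for illustration; this is not how the pipeline stores or processes its trees.

```python
# Each node maps to (parent, branch length to parent). The root has no entry.
# Hypothetical tree: root -> n1 (1), n1 -> iso_A (2), n1 -> iso_B (0),
# root -> iso_C (3).
TREE = {
    "n1": ("root", 1),
    "iso_A": ("n1", 2),
    "iso_B": ("n1", 0),
    "iso_C": ("root", 3),
}


def distances_to_ancestors(tree, node):
    """Map the node and each of its ancestors to the distance from the node."""
    total = 0
    result = {node: 0}
    while node in tree:
        parent, length = tree[node]
        total += length
        result[parent] = total
        node = parent
    return result


def patristic_distance(tree, a, b):
    """Sum the branch lengths on the path between nodes a and b."""
    up_from_a = distances_to_ancestors(tree, a)
    node, up_from_b = b, 0
    # Climb from b until reaching a node that is also on a's path to the root.
    while node not in up_from_a:
        parent, length = tree[node]  # KeyError if a and b share no ancestor
        up_from_b += length
        node = parent
    return up_from_a[node] + up_from_b


print(patristic_distance(TREE, "iso_A", "iso_B"))  # 2
print(patristic_distance(TREE, "iso_A", "iso_C"))  # 6
```

Because the distance is read off the tree rather than counted directly between two sequences, it reflects only the sites that were kept after filtering and placed on the tree.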
I know there are some SNPs separating these two isolates but in your cluster you are showing them 0 SNPs apart, why?
We have several filtering steps that may be filtering out the SNPs that you see. SNP filtering is required in order to handle varying quality and error levels of sequence and assembly. See our Methods.txt file for information on SNP filtering. Individual SNPs and the filtering applied to them are indicated in the VCF file provided for each of the SNP clusters.
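One way to get an overview of a cluster's VCF file is to tally how many sites carry each filter value. The sketch below does this for the standard VCF FILTER column (the seventh tab-separated field). It assumes the filtering status is recorded there, which may not match how these particular files encode it, so check Methods.txt first. The sample records and the filter name "hypothetical_density" are made up.

```python
from collections import Counter


def tally_filters(vcf_lines):
    """Count records per value of the FILTER column (7th VCF field)."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # too few columns to contain a FILTER value
        # FILTER may hold several semicolon-separated codes.
        for name in fields[6].split(";"):
            counts[name] += 1
    return counts


# Made-up records for illustration only.
SAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
contig_1\t101\t.\tA\tG\t.\tPASS\t.
contig_1\t250\t.\tC\tT\t.\thypothetical_density\t.
contig_2\t77\t.\tG\tA\t.\tPASS\t.
"""

counts = tally_filters(SAMPLE_VCF.splitlines())
for name, n in sorted(counts.items()):
    print(name, n)
```

For a real file, the same function can be given an open file handle, for example `tally_filters(open(path))`, where `path` is wherever you saved the downloaded VCF.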
Why doesn't your cluster correspond to this published cluster? What are you doing differently?
Our clusters are based solely on automated analysis of isolate sequences. Published outbreaks generally consider other, epidemiological factors, as they should.
Can I add my own isolates to the cluster and recompute it?
If you submit your sequence and let us know, it will be included in the full analysis pipeline. See our How to submit page for more information on how to submit your data.
However, the Pathogen Detection system is a high-throughput, automated system and we are unable to customize SNP cluster contents, so if the pipeline does not identify your isolate as being sufficiently close to others in a cluster it will not be included in a cluster.
I know these isolates are near to the outbreak isolates, but they don't appear in the cluster
See the above answer about how isolates are clustered. There are some possible data issues that may cause isolates that are in fact closely related to appear too distantly related. While we have extensive filtering steps to minimize the chances of this happening, it can sometimes occur when the sequence data is contaminated or of poor quality.
I know these isolates are more than 50 SNPs away from the outbreak isolates, but is there any way I can get these included in the cluster to see exactly what the SNP distance is?
No. The Pathogen Detection system is a high-throughput, automated system and we are unable to customize SNP cluster contents.
What is the SNP threshold for an outbreak?
The number of SNPs that defines an outbreak is variable and depends on many factors. Outbreak determinations should be made in concert with epidemiological factors, and the number and identity of SNPs that identify an outbreak are dependent on these factors.
How do I know when new analysis results are released?
For each organism group, the pipeline produces new analysis results every day that new data has arrived. The Pathogen Detection home page shows the most recent results with Accession Numbers, counts of newly added and total isolates, and a link to each taxonomic group subset in the Isolates Browser.
The FTP directory contains links to the latest kmer analysis and SNP analysis for each organism.
Automatic emails can notify you in real time when there are updates for any Isolates Browser search. For more information see our Isolates Browser help for Automatic E-mail Notifications of New Data.
How do I submit data?
Information on how to submit can be found here.
How do I submit beta-lactamase, MCR, or Qnr alleles?
Information on how to submit for allele assignment can be found here.
How do I submit and find AST data?
To submit AST data.
To find all isolates with BioSample records in the isolates browser.
To find all BioSample records that have AST data.
How do I contact the Pathogen Detection team?
The Pathogen Detection team can be reached by email at pd-help@ncbi.nlm.nih.gov.
What are the AMR_genotypes and AST_phenotypes columns?
These two columns refer to antimicrobial resistance.
The AST_phenotypes column corresponds to the antibiogram submitted to the BioSample database, which captures phenotypic antibiotic susceptibility testing (MICs or disk diffusion). Only isolates that have an antibiogram submitted will show AST data. See the help document for the AST phenotypes column.
The AMR_genotypes column corresponds to antimicrobial resistance genes that have been identified by NCBI's AMRFinderPlus. This process is run on all isolates submitted by collaborators contributing data for the Pathogen Detection system, and is also run against all GenBank genomes that are incorporated into the Isolates Browser (although there may be a delay in the analysis). See also the help document on the AMR genotypes column. For more info about NCBI's efforts on antimicrobial resistance gene/protein identification, see Antimicrobial Resistance. For information on the software we developed to identify AMR genes, see AMRFinderPlus. For information on the reference set of genes and proteins, see the Reference Gene Catalog.
How are the phylogenetic trees reconstructed?
The SNP trees are calculated using a maximum compatibility algorithm that is efficient for closely related bacteria while dealing with input data quality issues. The algorithm is described in this paper (Cherry, J. 2017. A practical exact maximum compatibility algorithm for reconstruction of recent evolutionary history. BMC Bioinformatics: 2017 Feb 23;18(1):127. doi: 10.1186/s12859-017-1520-4). The VCF file output has a field that lists incompatible sites that were filtered by compat.
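For readers unfamiliar with the idea of compatibility, the sketch below shows the classic pairwise test for two-state sites: two sites can both fit a single tree without repeated or reversed mutations only if the isolates do not show all four allele combinations across the pair. This illustrates the concept only. It is not the compat algorithm from the paper above, it ignores missing data, and the site names and allele patterns are invented.

```python
from itertools import combinations


def sites_compatible(site_a, site_b):
    """Pairwise compatibility of two two-state sites.

    Each site is a sequence with one allele (0 or 1) per isolate, in the
    same isolate order. The sites conflict only if all four combinations
    (0,0), (0,1), (1,0), (1,1) are observed.
    """
    observed = set(zip(site_a, site_b))
    return len(observed) < 4


def incompatible_pairs(sites):
    """Return the name pairs of sites that fail the pairwise test."""
    names = sorted(sites)
    return [
        (p, q)
        for p, q in combinations(names, 2)
        if not sites_compatible(sites[p], sites[q])
    ]


# Hypothetical alleles for four isolates at three SNP sites.
SITES = {
    "site_1": [0, 0, 1, 1],
    "site_2": [0, 1, 0, 1],
    "site_3": [0, 0, 0, 1],
}

print(incompatible_pairs(SITES))  # [('site_1', 'site_2')]
```

Here site_1 and site_2 cannot both be explained by single mutations on one tree, so a compatibility-based method would have to set at least one of them aside, while site_3 fits with either.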
Can I download the software and run it myself?
Unfortunately, the complete current pipeline is tied very tightly to many NCBI internal resources and so it can't be run outside of our environment. We would like to make a distributable version, but we have not yet had the development capacity to do so. Several pieces of software developed by NCBI and used in the pipeline are available, including:
- AMR, stress resistance, and virulence gene identification software AMRFinderPlus Paper and Software
- The assemblers SKESA and SAUTE Paper and Software
- The maximum compatibility phylogenetic inference software "compat" Paper and Software
- The PGAP annotation system Paper and Software
Is your SNP pipeline published?
Not yet. We're working on it :-)
In the meantime, several of the components NCBI developed for the pipeline are published and open source (see above and the pipeline references).