Issues seen in the human genome assembly

Potential problems in the genome assembly need to be tracked. We use a centralized tracking system to store information about these issues. Issues can fall into the following categories:

Unknown: unclear what the problem is without further investigation.
Clone Problem: the issue is contained within a single clone.
Gap: the issue is associated with a known gap in the assembly.
Path Problem: the data supporting the issue suggests there is a problem with the tiling path.
Variation: the data supporting the issue suggests there is no error, but that there is extensive allelic variation within the region.
Missing sequence: sequence has been identified that does not map to the reference assembly.
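
As a minimal sketch of how these categories might be encoded in a tracking system, the enumeration below mirrors the list above. The names `IssueCategory` and `Issue`, and the field names, are illustrative assumptions and not the GRC's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class IssueCategory(Enum):
    """Issue categories named after the list above (hypothetical encoding)."""
    UNKNOWN = "Unknown"
    CLONE_PROBLEM = "Clone Problem"
    GAP = "Gap"
    PATH_PROBLEM = "Path Problem"
    VARIATION = "Variation"
    MISSING_SEQUENCE = "Missing sequence"


@dataclass
class Issue:
    """One tracked issue; the field names are assumptions for illustration."""
    summary: str
    category: IssueCategory = IssueCategory.UNKNOWN


# A new report starts as UNKNOWN and can be recategorized after investigation.
issue = Issue("TAS2R45 (GeneID: 259291) is missing from the reference assembly")
issue.category = IssueCategory.MISSING_SEQUENCE
```

Defaulting to `UNKNOWN` reflects the first category in the list: the nature of a problem is often unclear until it has been investigated.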

Below are some examples of issues that have been reported for the human genome.

Example 1: Sequence missing from the reference assembly
Example 2: Mismatch between transcript sequence and genome sequence
Example 3: Sequence difference between transcript sequence and genome sequence
Example 4: Mixed haplotypes

Example 1: Sequence missing from the reference assembly

Report: TAS2R45 (GeneID: 259291) is missing from the reference assembly. This gene is present in the HuRef assemblies on chromosome 12.

Approaches to this issue?

You can look up information about this gene in Entrez Gene. The report indicates that this gene is a member of a large family of taste receptors.
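
The same gene record can also be retrieved programmatically. The sketch below only builds a request URL for the NCBI E-utilities ESummary service and does not contact the network; the helper name `gene_summary_url` is a hypothetical name chosen for this example, and the GeneID comes from the report above.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESummary service.
EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def gene_summary_url(gene_id):
    """Build an ESummary request URL for one Entrez Gene record."""
    params = {"db": "gene", "id": str(gene_id), "retmode": "json"}
    return f"{EUTILS_ESUMMARY}?{urlencode(params)}"


url = gene_summary_url(259291)  # GeneID of TAS2R45, taken from the report
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) should return a JSON summary of the gene record; that step is left out here so the example stays self-contained.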

You can search for TAS2R45 in the NCBI Genome Data Viewer and compare the HuRef assembly to GRCh37 to see where you might expect to find this gene if it were present. How do these regions compare? Notably, in HuRef, TAS2R45 is located between two genes that are both annotated on a single GRCh37 component, AC018630.40. Because the flanking genes lie on one contiguous component, there is no assembly gap between them, so an assembly gap cannot explain why TAS2R45 is absent from GRCh37. So why is it missing?

gdv image of TAS2R45 region

Figure 1: Genome Data Viewer results page. The two genes flanking TAS2R45 in HuRef (circled) are present in GRCh37. There are no assembly gaps in GRCh37 located between these two genes.

Option 1: There is a problem with the assembly of the reference component AC018630.40

Option 2: TAS2R45 is a CNV locus, and not present in the haplotype contained on the clone used in the reference chromosome.

Aligning AC018630.40 to the HuRef component reveals extensive repetitive sequence in this genomic region and an alignment gap of several kb that encompasses the region containing TAS2R45.

dotmatrix view of GRCh37/HuRef alignment

Figure 2: Dot matrix view of alignment between GRCh37 and HuRef components. The alignment gap is circled.
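
To illustrate how a dot matrix exposes an alignment gap, here is a minimal sketch based on exact k-mer matches. The sequences are short invented strings, not real sequence from AC018630.40 or HuRef, and the function names and the word size `k=4` are assumptions made for this example.

```python
from collections import defaultdict


def dot_matrix(seq_a, seq_b, k=4):
    """Return (i, j) pairs where seq_a[i:i+k] equals seq_b[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        index[seq_b[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(seq_a) - k + 1)
            for j in index.get(seq_a[i:i + k], [])]


def unmatched_intervals(dots, length_b, k=4):
    """Half-open intervals of seq_b that no matching k-mer covers."""
    covered = [False] * length_b
    for _, j in dots:
        for pos in range(j, j + k):
            covered[pos] = True
    intervals, start = [], None
    for pos, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = pos
        elif is_covered and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, length_b))
    return intervals


# Toy data: the second sequence carries an extra segment the first one lacks.
LEFT, INSERT, RIGHT = "ACGTTGCAAGGC", "TTTTTTTT", "CATGGACTCCGA"
reference_like = LEFT + RIGHT
other_assembly = LEFT + INSERT + RIGHT

dots = dot_matrix(reference_like, other_assembly)
gaps = unmatched_intervals(dots, len(other_assembly))
print(gaps)
```

In real data from a repetitive region, many off-diagonal dots would also appear, which is part of what makes such regions hard to assemble and to interpret.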

The repetitive sequence in this region could have caused problems in the assembly of the component. However, there are no reported problems annotated on this component. A PubMed search returns PMID:18077390, which indicates that the TAS2R family is subject to copy number variation (CNV), although this gene is not specifically mentioned. The GRC determined that the absence of TAS2R45 from GRCh37 is due to structural variation, and the gene is now represented on an alternate locus (NW_003571050.1) in GRCh38.

Example 2: Mismatch between transcript sequence and genome sequence

Report: Human chromosome 2 BAC AC010896.15 has a substitution compared to all 12 human ESTs (DB301244.1, CD652139.1, DA182571.1, BF698724.1, AU132204.1, CN255513.1, CB854573.1, BI850279.1, CD643380.1, CB145381.1, DB508845.1, DB218901.1), all 5 human cDNAs (AK095620.1, BC021229.2, AB051511.1, AK092710.1, AK023242.1), and 1 Pongo cDNA (CR926492.1) that cover the base. The substitution is indicated here by the flanking dashes; the "C" is a "T" in all the transcripts:

cttactgagtacatgccccctttaatgttaa--c--atgacttggagtaatttctgaggtttactgacaaa

This is in the 3' UTR of gene OTTHUMG00000151931.
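
The dash notation used in the report can be unpacked mechanically. The sketch below splits the reported string into its flanks and the flagged base, then builds the genome and transcript versions of the snippet and locates the difference. The function name `parse_flagged_base` is hypothetical; the sequence and the C-to-T difference are taken from the report.

```python
import re

REPORTED = ("cttactgagtacatgccccctttaatgttaa--c--"
            "atgacttggagtaatttctgaggtttactgacaaa")


def parse_flagged_base(marked):
    """Split 'left--x--right' into (left flank, flagged base, right flank)."""
    match = re.fullmatch(r"([acgt]+)--([acgt])--([acgt]+)", marked.lower())
    if match is None:
        raise ValueError("expected exactly one base flanked by double dashes")
    return match.groups()


left, genome_base, right = parse_flagged_base(REPORTED)
genome_seq = left + genome_base + right
transcript_seq = left + "t" + right  # per the report, transcripts carry "T"

# 0-based offsets within this snippet where the two versions disagree.
mismatches = [i for i, (g, t) in enumerate(zip(genome_seq, transcript_seq))
              if g != t]
print(genome_base, mismatches)
```

The offsets are relative to the snippet only. Mapping them to chromosome coordinates requires aligning the snippet or a transcript to the assembly.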

Approaches to this issue?

The first step in evaluating this issue is to take one of the transcripts and align it to the genome assembly. Tools such as SPLIGN can be used for this task. In SPLIGN, enter BC021229.2 as a representative transcript and select 'Homo sapiens' from the "Whole Genome:" pull-down menu. Once the result is returned, can you find the mismatch?

Once the mismatch is confirmed, it is useful to compare the difference to a database of known variation, such as dbSNP. SNP data are not currently available from the SPLIGN page. Figure 3 shows a screenshot taken after running SPLIGN in Genome Workbench and comparing the alignment to the SNP data annotated on the genome. This mismatch corresponds to a known variation. Further investigation reveals that this is a validated SNP. In this case, no action will be taken on this report, as the difference appears to be a biological variant.

transcript alignment difference genome workbench view

Figure 3: View of SPLIGN alignment in Genome Workbench. The alignment is shown in purple at the bottom (BC021229.2). The mismatched base is shown in red. A red arrow points to the SNP feature. Independent investigation shows this SNP is validated (and not just a false positive).

Example 3: Sequence difference between transcript sequence and genome sequence

Report: The reference genome has three 1-nt insertions relative to NM_004628.3, in the CDS at nt 1414, 1486, and 2063. The first of these causes a frameshift and premature truncation, encoding 453 aa versus 940 aa for the full-length protein. NM_004628.3 is supported by 6 mRNAs and 27 ESTs; the genome sequence is not supported by any transcripts. The full-length protein is supported by homology (e.g. rat NP_001101344, zebrafish NP_001038675).
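
The effect of the three insertions on the reading frame can be followed with simple modular arithmetic. The sketch below tracks the cumulative frame offset downstream of each reported insertion; the function name `frame_offsets` is hypothetical, and the positions are the ones given in the report.

```python
def frame_offsets(insertion_positions, insertion_size=1):
    """Return (position, offset) pairs, where offset is the cumulative
    reading-frame shift (mod 3) downstream of each insertion."""
    offsets = []
    inserted = 0
    for position in sorted(insertion_positions):
        inserted += insertion_size
        offsets.append((position, inserted % 3))
    return offsets


offsets = frame_offsets([1414, 1486, 2063])
print(offsets)  # [(1414, 1), (1486, 2), (2063, 0)]
```

An offset of 0 means the original frame is restored. Here that only happens after the third insertion, so the sequence between the first and third insertions is read out of frame, consistent with the frameshift and premature truncation described in the report.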

Approaches to this issue?

The first step in evaluating this issue is to take one of the transcripts and align it to the genome assembly. Tools such as SPLIGN can be used for this task. In SPLIGN, enter NM_004628.3 as a representative transcript and select 'Homo sapiens' from the "Whole Genome:" pull-down menu. Once the result is returned, can you find the mismatch?

Once the mismatch is confirmed, it is useful to compare the difference to a database of known variation, such as dbSNP. SNP data are not currently available from the SPLIGN page. Figure 4 shows a screenshot taken after running SPLIGN in Genome Workbench and comparing the alignment to the SNP data annotated on the genome. This mismatch corresponds to a known variation. However, further investigation suggests that this is not a validated SNP (Figure 4). Resolving this issue requires experimental validation, such as PCR amplification of the source DNA used to make the genomic library. If the source DNA is unavailable, PCR across a panel of diverse individuals can help determine whether the SNP is present in the population.

transcript alignment difference genome workbench view

Figure 4: View of SPLIGN alignment in Genome Workbench. The alignment is shown in purple at the bottom (NM_004628.3). The mismatched base is shown in red. A red arrow points to the SNP feature. Independent investigation suggests this SNP is not validated.

Example 4: Mixed haplotypes

Report: User reports that the ABO gene in the reference assembly reflects a nonexistent haplotype for "Type O". Although the two clones used in the reference assembly come from the same library, they represent two different "Type O" haplotypes at the ABO locus.

AL158826.23 (RP11-430N14): This component contains the first 5 exons of ABO in the first 18,000 bp. I used the dbRBC SBT input tool to type this component. Using a haploid, gapped analysis, I get 0 mismatches to the following two types: ABO*O.02.14.1 and ABO*O.02.09.1?.

AL732364.9 (RP11-244N20): This component contains partial intron 3 through exon 7 of ABO in the last 5,000 bp. I used the dbRBC SBT input tool to type this component. Using a haploid, gapped analysis, I get 0 mismatches to the following type: ABO*O.01.01.1.

Approaches to this issue?

Analysis of known ABO alleles suggests that the allele represented in the GRCh37 reference assembly was a mixture of two different alleles. Although the two clones are from the same library, GRC analysis of the overlap suggested that their alignment has a higher degree of diversity than would be expected from clones that are part of the same haplotype. This can be reviewed on the OverlapView page for these 2 clones. Search for AL732364 on the chr. 9 TPFview page and you will see that this region has now been updated by the GRC: AL158826.23 has been replaced by AL772161.10. Clicking on the green ball shows you that the switch points have also been curated so that ABO is now entirely contained in AL772161.10.
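
One simple way to quantify diversity in a clone overlap is the mismatch rate across aligned columns. The sketch below is an illustration under stated assumptions: the aligned strings are invented toy data, the function name `mismatch_rate` is hypothetical, and no particular threshold is implied to be the one the GRC uses.

```python
def mismatch_rate(aligned_a, aligned_b):
    """Fraction of aligned, non-gap columns where the two sequences differ.

    Both inputs are rows of a pairwise alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns; only substitutions are counted
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no aligned columns to compare")
    return mismatches / compared


# Toy overlap: one substitution in ten aligned bases.
rate = mismatch_rate("ACGTACGTAC", "ACGTTCGTAC")
print(rate)
```

Overlaps between clones from the same haplotype are expected to be nearly identical, so a rate well above the sequencing error level hints that the clones derive from different haplotypes. What counts as "well above" depends on the data and is not specified here.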