NCBI News




A Decade of Data at NCBI

Integrated Approaches to Managing the Information Explosion


Over the past 10 years the management of biological information has truly come of age, becoming increasingly integrated into the scientific process. It is now almost impossible to think of an experimental strategy in biomedicine that does not involve some online foray into scientific databases. At the core of this shift is a huge data explosion, most notably in the amount of gene sequence and mapping information.

From its inception in November 1988, NCBI was charged with providing data access and analysis tools for molecular biology information. As its 10th anniversary year draws to a close, the horizon is a familiar one—a flood of data coming in many new forms. This year-by-year tour marks highlights from NCBI’s first decade.

Growth Chart


1988 to 1989: Getting Ready

The newly established Center set up offices within the National Library of Medicine at NIH with a small staff representing a core combination of expertise in molecular biology, mathematics, and computer science.

As a research and development agenda emerged, an uppermost priority was to design tools for analyzing the growing number of sequences in GenBank, at that time managed elsewhere at NIH. Also of prime importance was to develop a flexible and robust data model to form the backbone of all data collection and data access services to come.


1990: BLAST Off!

With the development of BLAST, now a household word for many biologists, NCBI offered a program to find DNA or protein sequence similarities quickly, while also providing a statistical measure of significance from which biological relevance can be inferred. Significantly faster than existing algorithms, BLAST soon became the tool of choice among molecular biology researchers, and it now supports more than 50,000 searches per day.


1991: Unified Entry via Entrez

A key objective of NCBI is to build integrated approaches to searching biological information. The Entrez retrieval system, first produced on CD-ROM, provided a unified interface, or entry point, for varied data types. For the first time, nucleotide sequences were linked with their protein translations. Links to “sequence neighbors” uncovered by BLAST searches were also featured, revealing previously unreported sequence similarities and even prompting some debate on what constituted a novel homology finding. Literature links facilitated follow-up on interesting relationships. Macromolecular structures, genome maps, a phylogenetic taxonomy, and even the whole of MEDLINE were later woven into the Entrez mesh.


1992: GenBank Moves to NCBI

In October 1992, NCBI assumed formal responsibility for GenBank, following a transition period during which NCBI developed the software and data infrastructure to maintain the database, enhance data quality, and provide Internet-based access.

Internet access began with e-mail servers for text searching and BLAST analysis and expanded to include client-server and Web access as the Internet evolved. GenBank has since grown from 100,000 to more than 4 million sequences.

At about this time, a technique for producing randomly initiated partial cDNA sequences known as Expressed Sequence Tags (ESTs) also came into prominent use. Following computational research that demonstrated the utility of ESTs for identifying genes in the sequence databases, NCBI created the dbEST database. A year later, a formal EST division was added to GenBank for this burgeoning class of data. EST sequences also laid the foundation for the later UniGene and GeneMap projects, which in turn contributed to progress on the Human Genome Project. Because of the utility of ESTs as short tags for whole genes, most human gene discoveries today rely heavily on EST approaches.


1993: Internet and 3-D Entrez

Network Entrez, a client-server version of the CD-ROM, brought Entrez to the Internet and paved the way for further data expansion. 3-D macromolecular structure data were also added to Entrez, enabling biologists to easily check whether the structures of proteins in the sequence databases had been determined experimentally. An outgrowth of molecular modeling research at NCBI, this enhancement is just one example of the important synergy between the research programs and the database development initiatives at NCBI. Later enhancements to Entrez would include the Cn3D structure viewer and the ability to link proteins based on structural similarity.


1994: Web Site Launched

NCBI launched its Web site (www.ncbi.nlm.nih.gov) in early 1994 with BLAST, Entrez, dbEST, and dbSTS. Recognizing the power of the Web to facilitate the level of data integration it envisioned for molecular biology, NCBI focused essentially all new development efforts on this new medium.

The dbSTS database was established in response to a growing body of new sequence data called Sequence Tagged Sites (STS), short sequences of known location in a genome used as essential markers for gene mapping and positional cloning. As the volume and importance of these sequences grew, they were consolidated into a new STS division of GenBank.

Electronic PCR (e-PCR), an STS-hunting tool that made it possible to mine dbSTS for the map location of a sequence, was developed a short time later. E-PCR simulates conventional PCR methods for identifying unique DNA landmarks by searching for STSs. Researchers use e-PCR to assign sequence database records to map positions, test primer feasibility, and integrate and anchor genetic maps and sequence data.
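
The idea behind e-PCR can be illustrated with a minimal sketch. This is not NCBI's implementation: the function names, the exact-match rule, and the `tolerance` parameter are assumptions made for illustration. The sketch reports an STS hit wherever the forward primer and the reverse complement of the reverse primer both occur on the same strand, spaced so that the implied product is close to the expected size.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_sts_hits(sequence, forward_primer, reverse_primer,
                  expected_size, tolerance=50):
    """Return (start, product_length) for each candidate STS location.

    Hypothetical helper: primers must match exactly, and the product
    length must be within `tolerance` bases of `expected_size`.
    """
    sequence = sequence.upper()
    fwd = forward_primer.upper()
    # The reverse primer anneals to the opposite strand, so on the
    # query strand we look for its reverse complement.
    rev_rc = reverse_complement(reverse_primer)

    hits = []
    start = sequence.find(fwd)
    while start != -1:
        # Look for every reverse-primer site downstream of this forward site.
        end = sequence.find(rev_rc, start + len(fwd))
        while end != -1:
            product_length = end + len(rev_rc) - start
            if abs(product_length - expected_size) <= tolerance:
                hits.append((start, product_length))
            end = sequence.find(rev_rc, end + 1)
        start = sequence.find(fwd, start + 1)
    return hits
```

A record that yields a hit for a mapped STS can then be assigned that STS's map position. A practical tool would also need to tolerate primer mismatches and search both strands, which this sketch omits.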


1995: Surge in Sequence and Mapping Data

What started as a trickle of sequences in the early 1980s was by now a torrent. BankIt offered a simple Web-based form for submitting DNA sequences to GenBank. Developed in response to growing interest in the Web, it quickly became the most popular submission tool.

As the Human Genome Project progressed, Entrez added a Genomes database to manage data on a genomic scale. Large-scale sequencing had by this time produced a number of completely sequenced genomes or chromosomes. The mapping initiatives had also generated many genetic and physical maps. The Genomes database provided effective ways to integrate disparate mapping and sequence data. A graphical viewer aided in the visualization of complete genomic data.

The Taxonomy Browser was developed following an NCBI initiative to create a consistent and comprehensive sequence-based taxonomy for the growing number of species represented in GenBank. Developed in collaboration with an international team of experts, this phylogenetic approach produced a classification that takes into account sequence similarities and more closely reflects evolutionary history than does classical taxonomy. Today the Taxonomy Browser provides access to more than 50,000 species in GenBank, with links into Entrez.


1996: Finding Genes Among the Sequences

Much of GenBank’s growth was due to the high volume and redundancy of EST sequences, which had also started to present problems in data presentation and analysis.

The UniGene database organized matching sequences into clusters, each representing one human gene. With more than 75,000 clusters today, representing more than 75% of all human genes, UniGene serves as an important springboard for gene hunting.

GeneMap ’96, the first transcript map of expressed human genes, was produced by an international radiation-hybrid-mapping consortium, which relied on UniGene as a central resource for identifying novel, non-redundant genome mapping candidates. Updated in 1998 and 1999, GeneMap integrates STS mapping data, sequence data, and UniGene clustering data and provides the mapping framework upon which to mount the complete sequence data. GeneMap ’99 charts locations of more than 30,000 human genes.

Sequin, a sequence submission tool, was also released in 1996 to handle the surge in sequence data. Sequin especially facilitates submitting large batches of sequences and sets of related sequences from phylogenetic, mutational, or population studies. Later addition of alignment capabilities furthered Sequin’s ability to sort out sequence relationships.


1997: PubMed and Proteins

PubMed expanded the literature component of Entrez to encompass the entire MEDLINE database and made it available for free over the Web, with links to full-text articles on Web sites of participating publishers. In the year following its official launch by Vice President Gore in June 1997, PubMed use increased from one million to 16 million searches per month. The success of PubMed has led to the increasing involvement of NCBI in projects related to electronic access to the scientific literature.

NCBI research in protein analysis spawned three major resources this year. The Entrez Structures database was enhanced with structure-based protein neighbors, often discovered to be homologs with similar biological functions. Structure neighbors are computed by the VAST (Vector Alignment Search Tool) algorithm, which identifies proteins that exhibit a combination of strong sequence and structure similarity. Visualization of aligned structures is supported by the Cn3D structure viewer.

Gapped BLAST and PSI-BLAST (Position Specific Iterated BLAST) increased both speed and sensitivity, and ushered in a new generation of sequence similarity search tools. PSI-BLAST facilitates profile-based searches, which are potentially much more sensitive to distant relationships than are the traditional pairwise similarity searches for which BLAST was originally tailored. PSI-BLAST can be used to help delineate diverse protein families and predict function for newly sequenced proteins.

The COGs (Clusters of Orthologous Groups) project takes a different approach to analyzing biological information. COGs organizes clusters of protein sequences from completely sequenced genomes of different species. Currently there are eight genomes in the scheme, spanning the major kingdoms of life. Analysis of COGs shows the molecular similarities and differences between species, which not only can provide clues about evolution, but also may help to identify protein families, predict new protein functions, and point to potential drug targets in pathogenic species.


1998: Billions of Bases to BLAST

GenBank surpassed the two-billion base pair mark in 1998. More than half of the data comes from a single organism, Homo sapiens, largely due to the Human Genome Project’s high-throughput sequencing centers. As only about 8% of the human genome sequence is currently considered “finished,” and recent predictions indicate that sequencing will proceed significantly more rapidly, the flow rate is set to increase. The HTGS (High Throughput Genomic Sequences) division of GenBank was established to organize these data as they are deposited in progressive stages of completion.

For protein analysis, PHI-BLAST (Pattern Hit Initiated BLAST) complemented the profile-based searching that was previously introduced with PSI-BLAST. PHI-BLAST incorporates hypotheses as to biological function of a query sequence and restricts the analysis to a set of protein sequences that are already known to contain a specific pattern or motif.
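
The restriction step described above can be sketched in a few lines. This illustrates only the filtering idea, not PHI-BLAST itself, which goes on to align the sequences it retains. The sketch assumes the motif is written in a small subset of PROSITE-style syntax; the function names and the example pattern are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a small subset of PROSITE-style syntax to a Python regex.

    Handled here (an assumption of this sketch, not the full syntax):
    '-' separators, 'x' wildcards, [ABC] alternatives, {ABC} exclusions,
    and (n) or (n,m) repeat counts.
    """
    regex = pattern.replace("-", "")
    # {P} means "any residue except P".
    regex = re.sub(r"\{([A-Z]+)\}", r"[^\1]", regex)
    # (2) or (2,4) become regex repeat counts.
    regex = re.sub(
        r"\((\d+)(?:,(\d+))?\)",
        lambda m: "{" + m.group(1) + ("," + m.group(2) if m.group(2) else "") + "}",
        regex,
    )
    # 'x' matches any residue.
    return regex.replace("x", ".")


def sequences_with_pattern(sequences, pattern):
    """Keep only the named protein sequences that contain the motif."""
    motif = re.compile(prosite_to_regex(pattern))
    return {name: seq for name, seq in sequences.items() if motif.search(seq)}
```

For example, with the made-up pattern `C-x(2)-C-{P}-H`, a sequence containing `CAACGH` is retained, while one containing `CAACPH` is dropped because proline is excluded at that position.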

As more than 20 complete microbial genomes and one multicellular organism, the worm, were placed in the public domain, customized BLAST services and enhancements to the Entrez Genomes database were implemented to organize, visualize, and analyze these data.

Collaborations with other NIH Institutes for disease-based services were also established. Examples include the Cancer Genome Anatomy Project for expression data from normal, precancerous, and cancerous cells; and specialized Web sites for analysis of genetic diversity in malaria and HIV.


1999: Focus on Human Genome

As the Human Genome Project nears completion, the research focus is turning from analysis of specific genes or regions to a whole-genome approach. NCBI has developed a suite of genomics resources to support comprehensive analysis of the human genome. New projects such as LocusLink, a hub for integrating key descriptors of genetic loci; RefSeq, a non-redundant set of reference sequences for known human genes; and dbSNP, a collection of data on human genetic variation, all contribute to this network of information. Online Mendelian Inheritance in Man (OMIM), the Johns Hopkins comprehensive database of human genetic disorders, supplements these resources.

The challenge for the next decade will be to keep pace with the flood of genome data, while also designing the tools and databases for the gene discoveries of the 21st century.

—JM, BR, DW




