
Pool R, Waddell K; National Research Council (US). Exploring Horizons for Domestic Animal Genomics: Workshop Summary. Washington (DC): National Academies Press (US); 2002.


5. Data Access

The final issue tackled by the participants was how best to work with the tremendous amount of data that will be generated by domestic animal genome projects. The data create a number of challenges, said Daniel Drell of the U.S. Department of Energy (DOE). “These have to do with the interoperability of data, the sharing of data in some cases, but, principally, organizing it in such a way that others can come along and add value to it in some efficient ways.” So far, he said, “the genome projects have been largely unsuccessful at dealing with many of these.”

APPROPRIATE TOOLS AND THE IMPORTANCE OF DATA ACCESS

One can frame the issue in terms of access to data, said Claire Fraser. “When it comes to data access,” she said, “there are two ways to think about it. One, are the data accessible in GenBank or someplace else? And the answer is yes. But individual sequence reads or assembled data are only so useful. What we really need in terms of data access, in order to empower all of the users that are interested in getting a hold of these data, are far better databases and tools to really exploit the information. And I think this is an area that so far has been more of an afterthought with these projects than it should have been.”
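The first kind of access Fraser mentions is already available programmatically. As a minimal sketch, a raw GenBank record can be pulled through Biopython's Entrez interface (the e-mail address below is a placeholder, and the accession is just an example record):

```python
# Minimal sketch: fetching one raw GenBank record with Biopython.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # placeholder; NCBI asks users to identify themselves

# Retrieve a single nucleotide record in GenBank format.
handle = Entrez.efetch(db="nucleotide", id="U49845",  # example accession
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id, len(record.seq), record.description)
```

As Fraser notes, this level of access only returns individual records; it is the downstream tools for exploiting the information, not the retrieval itself, that remain the bottleneck.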

The result, she said, is that some genomics researchers end up having easier access to the data than others. “We are seeing a bit of a genomics divide being created between those groups that are involved in generating the data and have been forced to build the tools in order to manipulate it, and the more typical user who doesn't necessarily have access to the same tools, (and) who doesn't have bioinformatics expertise at his or her university. I think that's one of the real problems that we need to address.”

The other problem, Fraser added, is that the various genome projects generally make no allowance for taking care of the data they generate once the project is finished. “For the most part, even for sequencing projects with bioinformatics support during the term of the project, that support ends when the sequence is completed. There's been no plan put in place for how to maintain and update all of this information.”

“That problem is going to get even worse as we begin to accumulate more data. There have been all sorts of models proposed, from letting people in the community who are passionately interested in an organism do it on an ad hoc basis, to having this done in a more centralized facility, to having this done in a distributed way but with clear rules for interoperability. I've even heard some people go so far as to suggest that perhaps we need to come up with some sort of tax on genome projects that goes to fund a bioinformatics trust managed by an inter-agency group responsible for maintaining these databases.”

Several participants pointed out that in order to maximize the value of the information generated by domestic animal genome projects, researchers and information technology specialists will have to pay more attention to data handling. In particular, programs need to be designed not only to maintain the data and make it accessible to any researcher who needs it but also to make sure the information can be integrated with new data and new understandings as they appear.

THE CHALLENGE OF SCALING UP IN RESPONSE TO INCREASES IN DATA

The biggest difficulty is the problem of scaling: A database must be designed so that it continues to work, and work well, when the amount of data in it is doubled or increased by a factor of ten or twenty. That will be a challenging job, Fraser noted.

“I'm not convinced,” she said, “that any of the existing databases that have been built so far to handle sequence information are robust enough to scale to the level that we know we are going to need in going forward.” The databases built to handle the sequence information are actually the easy part, she said. “We would like to begin to add in functional information, either directly or through links, to all of the existing gene and protein databases. When you start thinking about doing that, the challenge goes up by several orders of magnitude.”

Owen White, of The Institute for Genomic Research (TIGR), made a similar point. “The National Center for Biotechnology Information (NCBI) is doing a heroic job,” he said. “They are doing an amazing job managing sequence data and publication data. That's a specific data type, and they have a fighting chance of scaling up for just the raw sequence information.

“But there's another data type that a lot of us are familiar with, which is annotation. Annotation is kind of a generic term, but I usually mean identification of all the genes and trying to give functional assignments to those genes and trying to represent them well in a structured database. So if you've got 500 microbial genomes and people want to come in and work with the data, I would argue that we don't really have representation systems for that type of thing.”
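A hedged sketch of the kind of structured representation White has in mind, using Python's built-in sqlite3; the schema and field names are illustrative assumptions, not any particular center's design:

```python
# Sketch of a structured annotation store: each row ties a predicted gene
# to a functional assignment. Schema and example values are illustrative only.
import sqlite3

conn = sqlite3.connect("annotation.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS gene_annotation (
        genome      TEXT NOT NULL,   -- e.g. one of the 500 microbial genomes
        locus_tag   TEXT NOT NULL,   -- stable gene identifier
        start_pos   INTEGER,
        end_pos     INTEGER,
        strand      TEXT CHECK (strand IN ('+', '-')),
        product     TEXT,            -- the functional assignment, if any
        evidence    TEXT,            -- how the assignment was made
        PRIMARY KEY (genome, locus_tag)
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO gene_annotation VALUES (?,?,?,?,?,?,?)",
    ("E_coli_K12", "b0001", 190, 255, "+",
     "thr operon leader peptide", "sequence similarity"),
)
conn.commit()
```

The hard part White identifies is not storing rows like these but keeping such assignments consistent and queryable as hundreds of genomes, each annotated by different groups, accumulate.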

While the problem of scaling up the databases that hold basic information, such as sequences of base pairs, is challenging but seemingly solvable, no one yet has constructed databases that will be able to handle the amount of annotation that likely will proliferate in the years to come.

STRUCTURING GENOME DATABASES

Workshop participants had various perspectives on how a system of genome databases should be structured. White, for instance, offered a vision of large central repositories that would handle all the data of one particular type—say, information on how genes are expressed—for many different species. He warned that it would not be feasible to have one mega-center handle all different types of data for every type of organism, but he argued that if each center focused on one type of data, it would work quite well.

“There are a number of reasons why I think this is a much more attractive model,” he said. “Training becomes much easier, and there is reduced reinvention of the wheel. Once you instantiate those infrastructures, they are easy to apply to new organisms.”

Furthermore, he added, these data-specific centers should be able to expand easily enough to accommodate ever-growing amounts of data. “I think they are the only things that had a chance of scaling.” Suppose, he said, that some individual research center had developed a good way to represent expression information for the particular organism studied at that center. “Hopefully they generalize their services enough so they can apply them to another organism. Then if they instantiate what the standard operational procedures are, they develop a relatively good training program, and they have a robust representation system going on in the database. That's the hard part. That is the energy of activation, so to speak. Then adding another organism is actually much simpler.”
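One way to read that generalization step is sketched below as a hypothetical, organism-agnostic expression store: the representation and query logic are written once, so adding another organism means adding data rather than code. All names here are illustrative.

```python
# Hypothetical sketch of an organism-agnostic expression store:
# the engineering is done once; a new organism is just new rows.
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class ExpressionMeasurement:
    organism: str      # any species; the store does not special-case it
    gene_id: str
    condition: str     # experimental condition label
    level: float       # measured expression level

class ExpressionStore:
    def __init__(self):
        self._by_organism = defaultdict(list)

    def add(self, m: ExpressionMeasurement) -> None:
        self._by_organism[m.organism].append(m)

    def query(self, organism: str, gene_id: str):
        return [m for m in self._by_organism[organism] if m.gene_id == gene_id]

# Adding a second organism requires no new code paths:
store = ExpressionStore()
store.add(ExpressionMeasurement("Gallus gallus", "GH1", "fasted", 2.4))
store.add(ExpressionMeasurement("Bos taurus", "GH1", "fasted", 1.1))
print(store.query("Bos taurus", "GH1"))
```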

A member of the audience disagreed with White's suggestion, however. For him, it made more sense to keep smaller, individualized databases and develop standards so that the various databases could exchange information and work with each other almost as if they were a single database. “You don't have to bring things into gigantic warehouses or try to federate databases. You try to create a level of information that can be exchanged among databases. In part, this goes along the lines of the discussions about whether you sequence in a center only or distribute the work in order to create local communities of scientists and train graduate students. This is particularly true in bioinformatics. If you have only centers for collecting information, you develop no local skills and no local students to use that information.”

“Centers like NCBI do an extraordinary job of archiving low-level information,” he continued. “But in the plant community, for instance, there is an immense difference in the interests of, say, the cereal genomicists versus just the legume folks. The legume folks have a high interest in secondary metabolism, symbiosis, and nitrogen fixation. Those are all functions that fit within community exploration of data and creation of data models and data-mining mechanisms appropriate to those. But they don't map onto cereals, and if you try to force these into a one-size-fits-all model, you come down to a lowest common denominator of things that are done well.” In short, having different centers for different organisms allows each to specialize and take into account the areas of interest for that particular organism. It might make sense to accumulate certain types of information—generally the very basic, low-level information—in one, large central repository, but the higher-level information, with its sensitivity to the type of genome being considered, is better handled at individual centers.
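The exchange-level approach the speaker describes might look, in rough sketch, like each local database keeping its own internal schema while agreeing to emit records in one shared interchange shape; the field set below is an assumption for illustration only:

```python
# Sketch: two differently structured local databases export to one agreed
# interchange format, so records can be exchanged without merging the
# databases into a single warehouse. Field names are illustrative.
import json

def to_interchange(gene_id, organism, description, source_db):
    """Render one record in the (hypothetical) shared exchange shape."""
    return {
        "gene_id": gene_id,
        "organism": organism,
        "description": description,
        "source": source_db,   # provenance travels with the record
    }

# A legume database exporting a nitrogen-fixation gene...
legume_record = to_interchange("nifH", "Medicago truncatula",
                               "nitrogenase iron protein", "legume-db")
# ...and a cereal database exporting in the same shape.
cereal_record = to_interchange("Rht1", "Triticum aestivum",
                               "reduced height gene", "cereal-db")

print(json.dumps([legume_record, cereal_record], indent=2))
```

Under this model, each community keeps its own specialized data structures—for symbiosis or nitrogen fixation, say—while only the lowest common level of information is standardized for exchange.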

ALLOCATION OF RESOURCES FOR BIOINFORMATICS

No matter how the centers ultimately are organized, several participants expressed the view that more resources must be allocated toward bioinformatics if researchers are to be able to work with all the data that is being accumulated. “If you want a system,” White said, “that can dynamically manage data that's coming in from several projects in parallel and have version dates and a help desk and just a well-engineered system, we are talking about a completely different magnitude of budget that's required to do that.”

Copyright 2002 by the National Academy of Sciences. All rights reserved.
