Research Data in the Digital Age

National Academy of Sciences (US), National Academy of Engineering (US) and Institute of Medicine (US) Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

National Academy of Sciences (US), National Academy of Engineering (US) and Institute of Medicine (US) Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age. Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Washington (DC): National Academies Press (US); 2009.



1 Research Data in the Digital Age

In a 1965 article in Electronics Magazine, Gordon Moore, the cofounder of Intel, observed that the number of components on an integrated circuit per unit of cost was doubling on a regular basis, a period he later set at 2 years.¹ What came to be known as Moore’s law has become a defining property of the digital age.² For more than half a century, the power of computing available at a given cost has risen exponentially, which has increased computer power by many orders of magnitude. Today, the most powerful computers can perform more than a million billion operations per second. Storage devices can handle petabytes of information.³ Data can be transferred at rates of 10 gigabits (or 10 billion bits) per second (see Box 1-1 for a description of units of size for data). Sensors such as the charge-coupled devices used in modern cameras and telescopes can acquire data from billions of pixels simultaneously. Furthermore, in key areas of computing, Moore’s law continues to hold.⁴ Many measures of computing power continue to double every 1 to 2 years. As a result, the quantity of data being created and stored by businesses, government, scientific institutions, and individuals is growing rapidly. Figure 1-1 shows one consulting firm’s projection of how information and available storage will grow in the coming years.
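The phrase "many orders of magnitude" can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes an idealized, perfectly regular doubling every 2 years over a 50-year horizon, and the function names are hypothetical, not taken from the report.

```python
import math


def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if capacity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


def orders_of_magnitude(factor: float) -> float:
    """Express a growth factor as a power of ten."""
    return math.log10(factor)


# Assumption: a steady 2-year doubling period sustained for 50 years.
factor_50_years = growth_factor(50, 2.0)
print(f"growth factor: {factor_50_years:.3g}")
print(f"orders of magnitude: {orders_of_magnitude(factor_50_years):.1f}")
```

Under these assumptions the growth factor is 2 to the 25th power, or between seven and eight orders of magnitude. A shorter doubling period would give a much larger figure over the same horizon.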

BOX 1-1

Units of Size for Data. Bit: The fundamental unit of digital information, equivalent to a 1 or a 0, or to an electronic switch being on or off. Bit is short for binary digit. Byte: The information stored in eight bits. A byte can be used to store one (more...)

FIGURE 1-1

Projected global information creation and available storage. NOTE: One exabyte equals one billion gigabytes.

This exponential increase in computing power has had profound consequences for many aspects of modern society, including scientific, engineering, and medical research.⁵ Using digital technologies, researchers can measure, describe, and model phenomena much more comprehensively and in far greater detail than was possible in the past. They can detect and analyze the products of high-energy particle collisions to probe the underlying structure of matter. They can extract information about the functioning of nerve cells and construct models of neural processing. They can combine simultaneous measurements of atmospheric and oceanic conditions to predict the effects of pollutants on climates. They can extract patterns of health from extensive databases of genetic and medical records. Examples of the impact of digital technologies on research fields appear as sidebars throughout this report, and the number of such examples could be multiplied many times.

The advances in digital technologies have caused a massive increase in the quantity of data generated by research projects. The proposed Large Synoptic Survey Telescope is expected to gather 30 terabytes of data per night and more than 60 petabytes over its lifetime (see Box 1-2). Particle physics experiments conducted with the Large Hadron Collider at CERN (Figure 1-2) will generate 15 petabytes of data annually. Even relatively small-scale projects can generate immense quantities of data that can be valuable in multiple research fields. These quantities of data are much too large to examine by hand. Instead, computers must conduct the initial analysis of data before the processed and condensed results are reviewed by researchers.
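To give a sense of scale for the figures quoted above, the following sketch converts them into comparable quantities. It assumes decimal (SI) unit prefixes and a 365-day year; the variable names are hypothetical, and the results are rough averages rather than engineering estimates.

```python
# Assumption: decimal (SI) prefixes, so 1 TB = 10**12 bytes and 1 PB = 10**15 bytes.
TERABYTE = 10 ** 12
PETABYTE = 10 ** 15
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Large Synoptic Survey Telescope figures quoted in the text.
lsst_per_night = 30 * TERABYTE
lsst_lifetime = 60 * PETABYTE
nights_equivalent = lsst_lifetime / lsst_per_night

# Large Hadron Collider figure quoted in the text, averaged over a full year.
lhc_per_year = 15 * PETABYTE
lhc_average_rate = lhc_per_year / SECONDS_PER_YEAR  # bytes per second

print(f"60 PB equals {nights_equivalent:.0f} nights at 30 TB per night")
print(f"15 PB per year averages {lhc_average_rate / 10**6:.0f} MB per second")
```

The lifetime figure corresponds to 2,000 nights at the nightly rate, and the annual LHC volume to a sustained average of several hundred megabytes per second, which illustrates why such data cannot be examined by hand.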

BOX 1-2

Digital Data in Astronomy. As astronomical observatories have become more powerful, they also have become more data-intensive. Table 1-1 shows the trend in recent decades. The Sloan Digital Sky Survey (SDSS), for example, has delivered an unprecedented (more...)

FIGURE 1-2

LHC at CERN. SOURCE: © CERN. See http://cdsweb.cern.ch/record/42370.

However, the most consequential changes being fostered by digital technologies involve issues that range beyond the quantities of data generated.⁶ Today, researchers can access a rapidly expanding range of digital information from around the world almost instantaneously. They can use this information to analyze their results, as when biologists compare DNA sequences they have generated to sequences stored in worldwide databases. They can incorporate information from others with their own data to make discoveries that would otherwise have been impossible, as when epidemiologists combine census and economic data to analyze the prevalence of disease. They can analyze data produced by others to answer questions that could not have been anticipated by the data’s creators, as when astronomers use digital sky surveys to investigate newly recognized phenomena in distant galaxies. For some areas of science, engineering, and medical research in the digital age, carrying out laboratory experiments to corroborate or disprove hypotheses has given way to a process of hypothesis testing based on computational analysis and modeling.

The creation of inexpensive, complex sensors is contributing to the data explosion by enabling new research approaches in a variety of fields, particularly in the earth sciences. Projects such as the National Science Foundation’s Network for Earthquake Engineering Simulation and National Ecological Observatory Network, as well as the National Aeronautics and Space Administration’s Earth Observing System, depend heavily on sensor networks.

Digital technologies also are making possible a new kind of science that depends on simulations combined with experimentation and observation.⁷ Cosmologists can combine simulations of galactic dynamics with astronomical observations of distant galaxies to analyze the early evolution of the universe. Records of calls made with cell phones can be compared to mathematical models of social networks. Researchers can model the functions of cells, simulate the effects of modifying those functions, and then re-create these modifications in real cells to alter biological function and refine the original models. Large-scale simulations of natural phenomena can be as valuable as data drawn from observations of the natural world.

The advances in research enabled by high-performance computing and high-performance communications are contributing to a steady growth of collaborations and interdisciplinary projects. Digital communication technologies enable researchers to communicate and exchange data with colleagues around the world, creating electronic collaborations that can catalyze progress. By making it possible to address more complex and integrative questions, these technologies also catalyze interdisciplinary collaboration. As one indicator of this trend, consider the growth in the number of authors on research papers over time. Over the course of 40 years, according to a computerized analysis of millions of published science and engineering papers, the number of authors for papers in the sciences nearly doubled, from 1.9 to 3.5.⁸ In the environmental sciences, the fraction of papers with multiple authors rose from 25 percent to 82 percent; in economics, it rose from 9 percent to 52 percent.

Collaborations have also become more international. In 2003, 20 percent of all research publications had authors from more than one country, compared with 8 percent in 1988.⁹ Citations to literature produced outside the author’s home country rose from 42 percent of all citations in 1992 to 48 percent in 2003.

However, the most far-reaching effects of digital technologies are not evident in traditional measures of research collaboration. Researchers—and especially young researchers—are developing new ways to interact with each other and with the subjects they study.¹⁰ They exchange information in virtual communities, write and read blogs on research developments, and are pioneering new methods to conduct research and share their results. In the long run, these developments are likely to have a more profound effect on research than increases in the pace or scale of traditional practices. These developments can be difficult to foresee. For example, research in many fields is moving toward much more open and collaborative models that are both served and driven by technology, and this trend is likely to result in research environments very different from those that have prevailed in the past. Although our committee has not tried to predict the long-term outcomes of this process, ongoing changes can be expected to continue to transform how research is done and how researchers interact with each other.

The rapid spread of digital technologies also is transforming the relationship between researchers and the broader public that supports and expects to benefit from research. When research results that underlie important public policies are available electronically, they can be examined and questioned by any member of the public. Individuals interested in specific issues—whether the regulation of an environmental toxin or the development of therapies for a human disease—can monitor, comment on, and even shape ongoing research.

Similarly, digital technologies have profound implications for scientific, engineering, and medical education.¹¹ Students can have access to research information from instruments in distant locations.¹² Computer owners around the world can contribute to the solution of particular research problems by allowing their computers to become parts of distributed computational networks.¹³ Data from cutting-edge research are being made available on the Internet for use not only by the research community but by educators or anyone else interested in the subject.¹⁴ Members of the public are participating in research projects as varied as analyses of genetic variation and galactic structure.¹⁵ Although fascinating, the full consequences of changing technologies for scientific, engineering, and medical education or for direct public participation in research lie outside the scope of this report.

CHALLENGES POSED BY RESEARCH DATA IN A DIGITAL AGE

Rapid advances in computing and communication technologies have changed the professional responsibilities, interpersonal interactions, and daily practices of researchers. Many of these changes have strengthened the research enterprise, both by enabling researchers to ask new questions of nature and by providing new means of achieving research objectives. At the same time, some changes have raised important issues involving researchers, research institutions, sponsors, and journals.¹⁶ These issues are the focus of this report on the integrity, accessibility, and stewardship of research data.

As discussed in Chapter 2, although advances in digital technologies allow phenomena and objects to be described more comprehensively and accurately, they also can complicate the process of verifying the accuracy and validity of the data (see Box 1-3 for an example). Digital technologies require the translation of phenomena and objects into digital representations, which can introduce inaccuracies into the data. Digital data often undergo several layers of complex processing as they move from an instrument or sensor to the point of being reviewed by a researcher. If this processing is not properly done or is misunderstood, the results can be misleading. In some cases, researchers may intentionally or unintentionally distort data in a misguided attempt to emphasize particular features and downplay others. In the worst cases, researchers can falsify or fabricate data, thereby violating both the ethical and methodological standards of research integrity. Many of these considerations apply as well to data that are not generated or stored digitally, but digital technologies both expand and intensify the challenge of maintaining the integrity of data.

BOX 1-3

Digital Data in the Neurosciences. The neurosciences illustrate both the potential value of well-organized and accessible data and the variety of issues raised by the increased importance of data handling and data sharing. It is not surprising that the (more...)

Chapter 3 describes the challenges that researchers face in maintaining the traditional openness of research in a digital age. Electronic technologies provide researchers with many new ways of communicating data to others, but providing other researchers with access to large databases can be difficult and expensive. With smaller, heterogeneous databases, where quality control and documentation tend to be less formal, sociological and technological factors can restrict data sharing. Also, an increasing range of restrictions are being placed on research data as this information becomes more valuable for commercial uses, which can limit the distribution and utilization of data within and beyond the research community.

Even as more research data are being created, their value for future uses is increasing. Chapter 4 describes the need to preserve many research data for long-term use, even in situations where those uses cannot be currently envisioned. Digital storage technologies, application environments, and operating systems change every few years, which means that digital bits must continually be transferred from one storage platform and software environment to another if they are not to be lost. Digital data also need to be annotated in sufficient detail that future researchers, sometimes in fields well removed from those of the data’s original creators, can both use the data and understand their limitations. Maintaining data collections for long-term use thus requires continued investment and planning, which can compete with expenditures for ongoing research.
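One common safeguard when moving digital bits between storage platforms is to compare checksums before and after the transfer. The sketch below illustrates the idea with Python's standard library. It is not a procedure described in the report: the function names and file names are hypothetical, and a throwaway temporary file stands in for a real dataset.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_with_fixity_check(source: Path, destination: Path) -> str:
    """Copy a file and confirm that the copy is bit-for-bit identical."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise OSError(f"fixity check failed for {destination}")
    return actual


# Demonstration with a temporary file standing in for a dataset.
with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_platform.dat"
    new_copy = Path(workdir) / "new_platform.dat"
    old_copy.write_bytes(b"example research data\n" * 1000)
    checksum = migrate_with_fixity_check(old_copy, new_copy)
    print("migrated intact, sha256 =", checksum)
```

A check of this kind only confirms that the bits survived the move. It does not address whether future software can still interpret the file format, which is why annotation and continued planning remain necessary.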

DESCRIPTIONS OF TERMS USED IN THE REPORT

In describing issues as broad as those covered in this report, it is essential to have a clear understanding of the basic terms.

Research Data

Despite the importance of research data, there exists no standard or widely accepted definition of exactly what research data are. For the purposes of this report, we have treated data as information used in scientific, engineering, and medical research as inputs to generate research conclusions (see Box 1-4 for definitions from other reports). This usage encompasses a wide variety of information. It includes textual information, numeric information, instrumental readouts, equations, statistics, images (whether fixed or moving), diagrams, and audio recordings. It includes raw data, processed data, published data, and archived data. It includes the data generated by experiments, by models and simulations, and by observations of natural phenomena at specific times and locations. It includes data gathered specifically for research as well as information gathered for other purposes that is then used in research. It includes data stored on a wide variety of media, including magnetic and optical media.¹⁷

BOX 1-4

Definitions of “Research Data” from Other Reports. “Data are facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors.” “A reinterpretable representation of information (more...)

Though our concerns in this report lie largely with the application of digital technologies in research, our examination of the issues is not limited to digital data. Nor does this report address just those areas traditionally considered “science.” It applies to all efforts to derive new knowledge about the physical, biological, or social worlds and thus encompasses research in engineering and in all of the physical, biological, behavioral, and social sciences. The conclusions in the report generally apply to quantitative data. However, many of our conclusions also apply to qualitative data, though we have not focused on the issues unique to qualitative data. Also, this report does not address research in the humanities, which lies outside the committee’s charge and expertise.

The term “data” in this report excludes physical objects (including living organisms) and other materials used in research, such as biological reagents or the devices, instruments, or computers that generate experimental or observational data. In many cases, these physical objects can be described in written, numeric, or visual forms, and these descriptions constitute data. However, because materials are tangible whereas data are generally intangible, different issues surround their use, storage, and dissemination. Some of the observations and conclusions in this report apply to materials as well as to data, and on occasion we make this extension of our conclusions explicit. However, the treatment of materials in research introduces issues that are beyond the subject matter of this report.¹⁸

Finally, our definition excludes information that can be important in research but is not used to generate research conclusions, including interpretive statements, or matters of personal judgment, such as peer reviews, plans for future research, communications with colleagues, or personnel assessments. Of course, the line between research data and subjective judgments is sometimes difficult to draw, since subjective judgments can influence the structure ascribed to data. Nevertheless, a distinction exists, and we do not mean to imply that all of the information associated with research necessarily constitutes research data.

Metadata

As used in this report, the term “metadata” refers to descriptions of the content, context, and structure of information objects, including research data, at any level of aggregation (for example, a single data item, many items, or an entire database). According to the National Science Foundation report Cyberinfrastructure Vision for the 21st Century, metadata “summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.”¹⁹ Metadata make it easier for data users to find and utilize data, particularly if they are machine-readable.

Metadata are extremely diverse, ranging from written descriptions of instruments and software to the largely tacit knowledge on which the success of an investigation often depends. They are a critical part of the context needed to assess the integrity of data and use data accurately. Metadata are themselves data, since they consist of descriptive, factual information about data. Thus, conclusions about data in this report generally apply to metadata as well, although special considerations sometimes apply to metadata.

Until fairly recently, the term “metadata” was used primarily by the library community and by individual research communities.²⁰ As digital data have become more important in a variety of disciplines and fields, the scope and value of metadata have grown, leading to the development of metadata standards. Metadata standards represent an agreed set of terminologies, definitions, and values to be provided for data in a given field or community.²¹
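To illustrate what machine-readable metadata can look like, the sketch below builds a small record and serializes it as JSON. The field names loosely follow the five aspects named in the National Science Foundation quotation (content, context, structure, interrelationships, provenance), but the class, the field layout, and every value are made up for illustration and do not follow any actual metadata standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical record layout; not drawn from any real standard."""
    content: str
    context: str
    structure: dict
    interrelationships: list = field(default_factory=list)
    provenance: list = field(default_factory=list)


# All values below are invented examples.
record = MetadataRecord(
    content="Hourly air temperature readings (made-up example)",
    context="Hypothetical rooftop sensor, single site",
    structure={"timestamp": "ISO 8601 string", "temperature": "degrees Celsius"},
    interrelationships=["hypothetical-humidity-series"],
    provenance=["read from sensor", "converted from raw counts to degrees Celsius"],
)

# Serialize to JSON so that software, not only people, can read the record.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
restored = MetadataRecord(**json.loads(serialized))
print(serialized)
```

A community standard would fix the field names, the allowed values, and the units, so that records produced by different groups could be searched and compared automatically.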

Raw and Processed Data

Raw data directly from an instrument or data that have not been documented or processed usually are of little value to anyone except the individuals who generate or collect them. In many fields, capturing data that are “whole” or “perfect” may be difficult or impossible. Instruments may only partially and imperfectly record phenomena. Researchers may not even see the raw data on which their conclusions are based. In some cases, raw data may exist in a computer buffer for only a fraction of a second before they undergo processing. In other cases, raw data may be so voluminous that they cannot be examined in anything other than a processed or condensed form. However, raw data may need to be retained to validate research findings and, in some research fields, to support patent applications, investigate instances of research misconduct, or justify public policies.

Data used to draw conclusions, derive findings, and build models may undergo many changes as they are processed, distributed, and archived. They are analyzed, aggregated, and reformulated by researchers. Data often are organized into structures for long-term storage and access that require the expertise of professionals trained in the management and handling of large databases.

As soon as raw data are processed, the algorithms, computer programs, and other techniques used in that processing become crucial to their understanding. Many data cannot be properly interpreted or used without understanding the processing they have undergone, and it is generally impossible to judge the integrity of processed data without access to the metadata documenting how they were processed. In some cases, this processing may be so machine-dependent that the metadata must include either a thorough representation or a copy of the devices used to do the processing. Consequently, to judge the accuracy and validity of data, researchers, policy makers, and other users of data may need a thorough understanding of the tools and procedures used to analyze those data. In many cases, a high level of expertise is needed to use metadata in order to place data in context.
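One way to keep processing interpretable is to log each step, its parameters, and fingerprints of its input and output alongside the data. The sketch below is a minimal illustration under stated assumptions: the helper names, the two processing steps, and the sample values are all hypothetical, and real pipelines would record far more (software versions, hardware, operators).

```python
import hashlib
import json


def fingerprint(values: list) -> str:
    """Short SHA-256 fingerprint of a list of numbers."""
    encoded = json.dumps(values).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def apply_step(values: list, log: list, name: str, func, **params) -> list:
    """Run one processing step and append a description of it to the log."""
    result = func(values, **params)
    log.append({
        "step": name,
        "parameters": params,
        "input": fingerprint(values),
        "output": fingerprint(result),
    })
    return result


def subtract_baseline(values: list, baseline: int) -> list:
    return [value - baseline for value in values]


def clip(values: list, lower: int, upper: int) -> list:
    return [min(max(value, lower), upper) for value in values]


# Made-up raw readings, processed in two documented steps.
raw = [12, 15, 11, 40, 13]
processing_log = []
shifted = apply_step(raw, processing_log, "subtract_baseline",
                     subtract_baseline, baseline=10)
processed = apply_step(shifted, processing_log, "clip", clip, lower=0, upper=10)

print(processed)
print(json.dumps(processing_log, indent=2))
```

With such a log, a later user can see that the fourth value was clipped rather than measured at the upper limit, a distinction that would be invisible in the processed data alone.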

Given the relatively broad definitions of data and metadata that we have adopted in this report, a great many issues are obviously associated with the generation, use, dissemination, and preservation of research data in the digital age. In this report, however, we focus on three specific issues, which we describe using the terms integrity, accessibility, and stewardship.

Integrity

Integrity describes an uncompromising adherence to ethical values, strict honesty, and absolute avoidance of deception. Integrity also describes the state of being whole and complete, of being totally unimpaired. Thus, the word “integrity” has both an ethical meaning and a structural or methodological meaning. In this report we use the word “integrity” in both senses.

According to one definition, “being assured of data’s integrity means having confidence that the data are complete, verified, and remain unaltered.”²² This is possible only if researchers adhere to professional and ethical standards of their fields. In some research fields, these standards are written, but in many areas they exist as tacit knowledge that is passed from senior researchers to beginning researchers over the course of a research apprenticeship. These professional standards, in turn, describe the methods, procedures, and tools that researchers are expected to employ to minimize error and bias in their work. Consequently, integrity in research has both an individual and a communal meaning. Researchers maintain the integrity of research data by adhering to the professional standards of their fields.

Researchers are expected to describe their methods and tools to others in sufficient detail that the data can be checked and the results verified. Completely and accurately describing the conditions under which data are collected, characterizing the equipment used and its response, and recording anything that was done to the data thereafter are critical to ensuring data integrity. Thus, for experimental data, integrity implies that the data can be reproduced in a test or experiment that repeats the conditions of the original test or experiment. For observational data, data of high “quality” (a term that we sometimes will use as a synonym for data integrity) have been validated through comparison with data whose quality is known or by being generated with an instrument that has been adequately calibrated or tested.

Accessibility

In this report, accessibility refers to the availability of research data to researchers other than those who generated the data. Accessibility is a critical element of integrity, because data must be available to others in order for the validity of those data to be verified. However, in some cases an investigator may not be able to make data available to the public. For example, in private companies, data may need to be restricted for commercial reasons. In such cases, data are frequently made available within the company to evaluate their integrity.

In this report, the term “accessibility” generally implies public access as well as availability to other researchers upon request. Accessibility does not necessarily imply free access, because providing access to data entails financial costs that must be met. Also, access does not necessarily imply that researchers must provide inquirers with the training and expertise they would need to understand or use data. However, data should be accompanied by sufficient metadata for colleagues to assess the integrity of those data.

Stewardship

In the broadest possible sense, the term “utility” in the name of our committee refers to all of the various applications of research data. Both integrity and accessibility are critical elements of utility, because research data must have integrity and be broadly accessible to be effectively utilized.

However, our focus in this report is on a specific aspect of utility that we refer to as data stewardship—the long-term preservation of data so as to ensure their continued value, sometimes for unanticipated uses. Stewardship goes beyond simply making data accessible. It implies preserving data and metadata so that they can be used by researchers in the same field and in fields other than that of the data’s creators. It implies the active curation and preservation of data over extended periods, which generally requires moving data from one storage platform to another. The term “stewardship” embodies a conception of research in which data are both an end product of research and a vital component of the research infrastructure.

THE VARIETIES OF RESEARCH DATA

As the examples presented throughout this report illustrate, research data are so varied that they can be described in their entirety only in the most general terms. Different research fields have very different approaches to the treatment of research data. Even at the level of individual research groups, expectations and demands can vary greatly from one investigator to another. This tremendous variety within the research community complicates the task of arriving at conclusions that apply across all fields of research. Research fields are also characterized by diversity in the origins of data and by the size and other characteristics of data collections.

Diversity Across Disciplines

Data are gathered and analyzed in very different ways both among and within disciplines. The sidebars in this and other chapters describe some of the diversity among disciplines, but individual disciplines are also internally diverse. Data in physics, for example, range from small datasets generated by a “tabletop” experiment to the terabytes of data generated by an accelerator-based experiment. Databases in the social sciences may be freely available to all researchers in some fields and tightly restricted in other fields. Some fields within a discipline may have traditions of storing data for extended periods while others discard data relatively quickly. (In this report, “field” refers to an area of research smaller than a discipline. In many cases, a field can be roughly associated with the community of researchers who follow and publish articles in a relatively small collection of related journals, which analysts of science have referred to as “invisible colleges.”²³)

Furthermore, some of the most interesting and productive areas of research today involve researchers from multiple disciplines working together on complex, integrative problems.²⁴ In some cases, these areas of multidisciplinary research become so well defined that they evolve into research fields of their own, as in astrobiology. In other cases, researchers may come together to work on a multidisciplinary project and then disband once the project is over. In interdisciplinary research, different traditions of data treatment meet and sometimes clash, and new ways to gather, analyze, and store data may need to be developed to address novel challenges.

Diversity in Origins of Data

The practices for analyzing, disseminating, and storing research data vary greatly from field to field.²⁵ For example, in some fields, observational data can be re-created by other researchers, but in other fields observations are impossible or impractical to make a second time. In the latter cases, observational data may need to be carefully archived for future use, including uses that cannot currently be foreseen.

Data generated through computer simulations are increasingly important in a variety of fields.²⁶ Data generated entirely by computation can in principle be regenerated, assuming that enough is known about the hardware, software, and inputs used in the computation. However, each of these three components of a computation may be so complex or indeterminate that the computational data have some of the characteristics of observational data. Furthermore, many simulations involve random inputs, so that successive simulations will not be exactly the same. In some cases, sharing and preserving the models and software tools used to create a simulation will be more important for verifying and building upon research than sharing and preserving the data generated. In other cases, the data themselves have value and can represent such a large investment of resources that they may need to be preserved for subsequent use in the same way that unique observational data are preserved.
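The role of random inputs can be illustrated with a toy simulation. In the sketch below, a one-dimensional random walk stands in for a real model; the function name and seed values are arbitrary choices made for illustration. Recording the seed as part of the inputs lets the same run be regenerated, while a different seed gives a different trajectory.

```python
import random


def simulate_random_walk(steps: int, seed: int) -> list:
    """Toy simulation: positions of a one-dimensional random walk."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the inputs
    position = 0
    path = []
    for _ in range(steps):
        position += rng.choice((-1, 1))
        path.append(position)
    return path


run_a = simulate_random_walk(1000, seed=42)
run_b = simulate_random_walk(1000, seed=42)  # same seed: regenerates run_a
run_c = simulate_random_walk(1000, seed=7)   # different seed: a different run

print("same seed reproduces the run:", run_a == run_b)
print("different seed reproduces the run:", run_a == run_c)
```

Even here, regeneration depends on the software as well as the seed: a different version of the random number library may produce a different sequence from the same seed, which is one reason the software environment may need to be preserved along with the inputs.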

Data from experiments may be reproducible if a robust description of the experiment is available. In practice, however, it may not be possible to re-create the exact conditions of the experiment. An experimental apparatus also may be so costly to build or use that experiments can be conducted only once or over a limited time period. If so, long-term preservation of the data generated by the experiment may be essential for optimizing the experiment’s value.

Diversity in Types of Data Collections

In this report, we use the term “database” to refer to a collection of data that is organized to permit search, retrieval, processing, and reorganization of stored information. Databases include datasets, which are collections of similar or related data. We use the term “data collection” interchangeably with “database.”

In its report Long-Lived Data Collections: Enabling Research and Education in the 21st Century, the National Science Board divided data collections into three broad categories (Box 1-5).²⁷ “Research collections” are the products of one or more focused research projects and typically serve just the research group that generated the data. “Resource collections” serve a single science or engineering community and are generally intermediate in size and budget. “Reference collections” serve large segments of the research and education communities and are often supported by large budgets.

BOX 1-5

Three Types of Data Collections. The National Science Board (NSB) has organized data collections into the three categories described below. In addition, the NSB defined “collection” to refer “not only to stored data but also to (more...)

These categories may seem to correspond to small-scale, intermediate-sized, and large-scale research projects, but the National Science Board’s report shows that such an association can be misleading. Using digital technologies, relatively small-scale projects can generate immense quantities of data that become the basis for research in many related fields. Large-scale reference data collections may be the product of many small projects linked through digital networks. Conversely, large projects may produce focused data collections that serve a narrow research purpose and never become publicly available. Thus, distinguishing research data by the size of the group that generated those data is problematic, in part because of new capabilities created by digital technologies.

STRUCTURE OF THE REPORT

The remainder of this report is organized into three thematic chapters and a final summary chapter. Chapter 2 considers the integrity of data throughout their life cycle, from their collection to their disposal or preservation. Maintaining the integrity of research data is a fundamental obligation of researchers; achieving this objective in the digital age can be either easier or more difficult than in earlier times.

Chapter 3 considers the issues of accessing and sharing research data. The research enterprise is built on the precept that researchers will make the data on which publicly disseminated conclusions are based available to their colleagues so that others can verify and build on those data. Accessibility is vital for ensuring the integrity of research data and facilitating their future use.

Chapter 4 discusses the stewardship of research data, that is, their long-term preservation in databases for various future research uses and other applications. Preserving data collections can be expensive and difficult—so much so that it can compete with the conduct of research. Yet the loss of many kinds of research data also can incur substantial costs.

The final chapter reorganizes recommendations that have appeared earlier in the volume according to different actors within the research community rather than thematically. It also discusses how action can be motivated when responsibility for research integrity, accessibility, and stewardship is shared across the components of the research community. Each part of the research enterprise has much to gain or lose, depending on how research data are managed, and each has a role to play in ensuring the integrity, accessibility, and stewardship of research data.

Footnotes

1: Gordon E. Moore. 1965. “Cramming more components onto integrated circuits.” Electronics 38(19):114–117.
2: Michael S. Turner. 2007. “Scientific discovery in the Information Age.” Presentation at the De Lange Conference on Emerging Libraries: How Knowledge Will Be Accessed, Discovered, and Disseminated in the Age of Digital Information, March 6, Houston, TX. Available online at http://delange.rice.edu/VI/EL/Turner-DeLange-2007.pdf?action=details&event=921.
3: A petabyte represents a million billion characters, the equivalent of the text in one billion books.
4: Not all measures of computing power are increasing exponentially. For example, the transfer rate of data within computers from memory devices to the central processing unit is growing slowly and at a linear rate. Physical limitations on the power of single processors have constrained the continued general application of Moore’s law. However, new algorithms for processors and storage units linked in parallel may lead to resumed exponential increases in computing power in the future.
5: Alexander Szalay and Jim Gray. 2006. “Science in an exponential world.” Nature 440:413–414.
6: National Research Council. 2001. Issues for Science and Engineering Researchers in the Digital Age. Washington, DC: The National Academies Press.
7: The 2020 Science Group. 2006. Towards 2020 Science. Redmond, WA: Microsoft Corporation. Available at http://research.microsoft.com/en-us/um/cambridge/projects/towards2020science/downloads/T2020S_ReportA4.pdf.
8: Stefan Wuchty, Benjamin F. Jones, and Brian Uzzi. 2007. “The increasing dominance of teams in production of knowledge.” Science 316:1036–1039.
9: National Science Board. 2006. Science and Engineering Indicators 2006. Arlington, VA: National Science Foundation.
10: Carolyn Y. Johnson. 2008. “Out in the open: Some scientists sharing results.” The Boston Globe, August 21, p. A1.
11: National Research Council. 2002. Preparing for the Revolution: Information Technology and the Future of the Research University. Washington, DC: The National Academies Press.
12: An example is the Education and Outreach Project of the National Virtual Observatory (http://www.virtualobservatory.org).
13: An example is the SETI@home project (http://setiathome.berkeley.edu), which uses computer time provided by volunteers to analyze astronomical data for signs of intelligence.
14: Ryan Scranton, Andrew Connolly, Simon Krughoff, Jeremy Brewer, Alberto Conti, Carol Christian, Craig Sosin, Greg Coombe, and Paul Heckbert. 2007. “Sky in Google Earth: The next frontier in astronomical data discovery and visualization.” Available at http://arxiv.org/PS_cache/arxiv/pdf/0709/0709.0752v2.pdf.
15: For the analysis of genetic variation, see https://www3.nationalgeographic.com/genographic. For the analysis of galactic structure, see http://www.galaxyzoo.org.
16: National Research Council. 2001. Issues for Science and Engineering Researchers in the Digital Age. Washington, DC: National Academy Press.
17: As a point of comparison, the Office of Management and Budget defines research data as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This ‘recorded’ material excludes physical objects (e.g., laboratory samples).” See OMB Circular A-110 at http://www.whitehouse.gov/omb/circulars/a110/a110.html.
18: Issues related to sharing research materials in the life sciences have been addressed by a previous National Research Council report. See National Research Council. 2003. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences. Washington, DC: The National Academies Press.
19: NSF Cyberinfrastructure Council. 2005. NSF’s Cyberinfrastructure Vision for 21st Century Discovery. Arlington, VA: National Science Foundation.
20: Tony Gill, Anne J. Gilliland, Maureen Whalen, and Mary S. Woodley. 2008. Introduction to Metadata, Version 3.0. Los Angeles, CA: J. Paul Getty Trust. Available at www.getty.edu/research/conducting_research/standards/intrometadata/index.html.
21: U.S. Geological Survey, Coastal and Marine Biology InfoBank. USGS CMG “Formal Metadata” Definition. See walrus.wr.usgs.gov/infobank/programs/html/definition/fmeta.html. Accessed December 8, 2008.
22: University of Minnesota Research Data Management Online Workshop (www.research.umn.edu/datamgtq1/MDI_020.html).
23: Daryl E. Chubin. 1983. Sociology of Sciences: An Annotated Bibliography on Invisible Colleges, 1972–1981. New York: Garland.
24: National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2005. Facilitating Interdisciplinary Research. Washington, DC: The National Academies Press.
25: National Research Council. 1995. Preserving Scientific Data on the Physical Universe: A New Strategy for Archiving Our Nation’s Scientific Information. Washington, DC: National Academy Press.
26: Ghaleb Abdulla, Terence Critchlow, and William Arrighi. 2004. “Simulation data as data streams.” SIGMOD Record 33(1):89–94.
27: National Science Board. 2005. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA: National Science Foundation.

Bookshelf ID: NBK215259
