National Research Council (US) Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest. Proceedings of the Workshop on Promoting Access to Scientific and Technical Data for the Public Interest: An Assessment of Policy Options. Washington (DC): National Academies Press (US); 1999.
DR. SERAFIN: We are going to begin now with the scientific data panels, which will describe and discuss the salient characteristics of scientific and technical databases in four disciplines—geography, genomics, chemistry and chemical engineering, and meteorology—from the government, not-for-profit, and commercial perspectives. [NOTE: Prior to the workshop, the National Research Council study committee distributed a set of questions to the data panelists requesting detailed information on their respective data activities. The data panelists' prepared responses to these questions, which were distributed to the workshop participants, are included in these proceedings because they are more comprehensive than the transcribed text of the oral workshop presentations. See Box 3.1 for a list of questions to the data panelists.]
The moderator of the first panel, which focuses on geographic data, is Harlan Onsrud, professor at the University of Maine.
GEOGRAPHIC DATA PANEL
MR. ONSRUD: My name is, again, Harlan Onsrud, with the Department of Spatial Information Science and Engineering at the University of Maine, which is also affiliated with the National Center for Geographic Information and Analysis.
We will have two speakers today, since James Brunt, from the Long-Term Ecological Research Network Office at the University of New Mexico, is unable to join us. Our first speaker is Barbara Ryan, associate director for operations for the U.S. Geological Survey (USGS). Barbara will highlight her agency's experience in the creation, sharing, and handling of geographic data, as well as some of the other data the agency collects. The USGS, of course, is both a major creator and a major user of geographic data, so both perspectives are represented.
Government Data Activity
Barbara Ryan, U.S. Geological Survey
Response to Committee Questions
Provide a description of your organization and database-related operations. The U.S. Geological Survey (USGS) and its information assets provide a gateway to the Earth. Sound stewardship of the nation's land, natural, and biological resources requires up-to-date, and often up-to-the-minute, information on how these vital resources are being used, as well as an understanding of how possible changes in use might impact the national economy, the environment, and the quality of life for all Americans. A core responsibility of the federal government is to enhance and protect the quality of life for its citizens, and the USGS provides the scientific underpinning for sound stewardship decisions that have an impact in each community, but that also extend beyond state boundaries and benefit the nation as a whole. With scientific information from the USGS, policy makers can foresee possible impacts of their decisions on America's economy, on the environment, and on the lives of the citizens they represent. With an interdisciplinary mix of nearly 10,000 scientists and support staff, including geologists, biologists, hydrologists, cartographers, and computer scientists, at work in every state and in cooperation with over 2,000 local, state, and other federal organizations, the USGS is uniquely positioned to serve the science needs of the communities, the states, and the federal government by describing processes that occur in, on, and around the Earth.
1a. What is the primary purpose of your organization? The USGS serves the nation by providing reliable scientific information to (1) describe and understand the Earth; (2) minimize loss of life and property from natural disasters; (3) manage its water, biological, energy, and mineral resources; and (4) enhance and protect the quality of life. It is the primary science agency of the Department of the Interior. The USGS carries out its research and activities at the global, national, regional, state, and local levels. Because the USGS encompasses numerous natural science disciplines, the bureau can bring both physical and biological science to bear on natural resource management problems. The aggregation of this information provides a national perspective on the landscape of the country, from understanding processes deep beneath the Earth's surface to preserving habitat for threatened and endangered species.
A sampling of current USGS programs includes the following:
- Biological activities, such as the cooperative biological research units, the Gap analysis program, biomonitoring of environmental status and trends, and the Species at Risk program;
- Geologic activities, such as the Energy and Mineral Resource Assessment, National Cooperative Geologic Mapping, landscape and coastal assessment, and geologic hazards assessments;
- Mapping activities, such as the mapping cooperative partnerships; the business partner product distribution program; cooperative research and development agreement partnerships with Microsoft (TerraServer), Environmental Systems Research Institute, LizardTech, and Now What; the National Atlas of the United States of America; the Center for Integration of Natural Disaster Information; the National Geographic Research program; and the National Satellite Land Remote Sensing Data Archive; and
- Water resource activities, such as the Federal-State Cooperative Water Resources Program, the National Water Quality Assessment Program, the Water Resources Research Act Grant Program, the ground-water resources program, the toxic substances hydrology program, and the national water resources research program.
1b. What are the main incentives for your database activities (both economic and other)? As a science agency, a fundamental part of the USGS mission is the collection, quality assurance, storage (archiving), and dissemination of basic natural science data that are reliable and have continuity over time and space. Embodied in its mission is also a commitment to make USGS data and information more accessible to more people.
Other important incentives are as follows:
- Meet a growing number of requirements and support a wide array of constituents by using rapidly advancing technology.
- Provide updated and revised graphic topographic maps and ensure that the nation has access to the best available geospatial information in formats and on media best suited to customer needs.
- Use creativity in cooperation and coordination, seeking matching dollars from other government agencies and the private sector through many different kinds of partnerships and consortia of customers.
- Ensure timely presentation of scientific information and effective use of this information by decision makers.
- Ensure that products are published in digital format, have consistent data standards, and are available through the National Spatial Data Infrastructure (NSDI).
- Provide searchable indexes to access USGS projects.
- Provide reliable, impartial, and timely information that is needed to understand the nation's natural resources.
- Establish a network of distributed databases and information sources on natural resources directed toward the needs and responsibilities of Interior resource management bureaus.
2a. What are your data sources and how do you obtain data from them? The USGS Geospatial Data Clearinghouse provides information about USGS geospatial or spatially referenced data holdings. The agency is an active participant in the NSDI. The USGS NSDI node encompasses a distributed set of sites organized on the basis of the USGS's four principal data themes—biological resource information, geological information, national mapping information, and water resources information. (See <http://nsdi.usgs.gov/nsdi/> for additional information.)
For biologic data, the USGS works cooperatively with many government agencies; nongovernmental institutions, including academia, the private sector, and museums; and international organizations to share data and information. At this time, the National Biological Information Infrastructure (NBII) is based on a fully distributed, World Wide Web-based architecture in which the provider sites both supply data and information for the NBII and serve those data and information themselves. As the infrastructure develops and matures, it may be possible to create a central server site, allowing provider sites to concentrate on their primary functions apart from making their data publicly available; the centralized server node would then handle virtually all of the additional mechanics required for making the data accessible. This second model is under consideration for future implementation.
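To make the two models concrete, here is a minimal sketch, in Python, of how a search in the distributed model might fan out to provider nodes and aggregate whatever they return. The endpoints, query parameter, and JSON response format are hypothetical illustrations, not the NBII's actual interface.

    import json
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Hypothetical provider-node search endpoints; in the distributed model,
    # each provider serves its own data and metadata.
    PROVIDER_NODES = [
        "http://node1.example.org/nbii/search",
        "http://node2.example.org/nbii/search",
    ]

    def query_node(base_url, term):
        try:
            with urlopen(f"{base_url}?q={term}", timeout=10) as resp:
                return json.load(resp)  # assume each node returns a JSON list
        except OSError:
            return []  # an unreachable node simply drops out of the results

    def federated_search(term):
        # Fan the query out to every provider node in parallel, then merge.
        with ThreadPoolExecutor() as pool:
            per_node = pool.map(lambda url: query_node(url, term), PROVIDER_NODES)
        return [record for records in per_node for record in records]

In the centralized model, by contrast, the provider sites would push or replicate their holdings to one server node, and federated_search would reduce to a single local query.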
Geographic and cartographic data are obtained primarily from state and local government mapping and Geographic Information System agencies, other federal agencies, and partnerships and relationships with the private sector. These data are obtained mainly through cooperative agreements or innovative partnerships. The USGS has relationships with both the National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA) for archiving satellite data.
Sources of geologic data include the USGS; state geological surveys and academic institutions through the National Cooperative Geologic Mapping Program; academic institutions that operate, through cooperative agreements with the USGS, regional earthquake monitoring networks; and international partners (academic institutions or foreign government agencies) that operate nodes of the Global Seismographic Network through agreements with the USGS.
The National Water Information System (NWIS) is the primary corporate database for USGS water information. NWIS receives data from a variety of sources, including field instruments reporting through various telemetry channels, field computers, laboratory instruments, and direct input from investigators.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers?
- Funding and integration of common data requirements among several partners. There are many issues here related to content, format, accuracy, etc. It is best to look for common ground and minimum specifications that work for all parties.
- Merging data that have different content specifications into one database. Work toward a specification that ensures a minimum level of content and find partners that are willing to provide data to that minimum level.
- Copyright problems when working with private-sector organizations. To deal with this problem, look for data-exchange opportunities or consider degrading the copyrighted data (e.g., generalizing or reducing resolution) so that the published product no longer reproduces the protected content.
- Great variety of data types located in many legacy systems and formats; lack of common data models. The USGS is dealing with these barriers by working to develop common data models and migrating priority legacy data sets to make them more widely available.
- When dealing with real-time data, absence of data due to problems with the reliability of system components and erroneous readings resulting from damaged or malfunctioning system components. The USGS deals with these problems through vigorous quality control procedures and the use of hardened or redundant components.
- Building partnerships that bring a broad array of organization types together for a unified purpose. This task is important but difficult, because issues and challenges arise from the diverse needs of such organizations. Each type of organization must be engaged in a manner that benefits it in order to enable its participation. One method that has been effective in meeting this challenge is dialog and demonstration, that is, participating actively in groups where the largest number of partner and potential-partner organizations can be reached to deliver information about the status and progress of the partnerships. In addition, one-on-one dialog and technical support can be maintained as needed with new partner organizations to assist them in complying with the requirements for participation. Monetary support is sometimes provided to organizations with key data sources.
3. What are the main cost drivers of your database operations? Cost drivers for USGS information products can be grouped into two categories.
- Data collection and management costs, including interpretation, maintenance, administration, archiving, and analysis; software enhancement; hardware upgrades; hardened and/or redundant systems; World Wide Web page development and maintenance; searchable online clearinghouses; controlled vocabularies; data discovery, retrieval, and access tools; assessment and documentation of user requirements; partnerships with key non-USGS sources of data (such as state government agencies, academic scientists, or natural history museums) to assist in their efforts to document and serve important data sets and information products; and support of trained staff to prepare high-quality metadata documentation of data sets and information products. These cost drivers are funded by congressional appropriations and cooperative funds.
- Reproduction and distribution costs, with the primary cost drivers being customer service, order taking, accounting, and order fulfillment. These cost drivers are funded by congressional appropriations for legislatively required distributions; all other distributions are funded through cost-reimbursement fees.
Cost drivers for reproduction-related costs for maps, map products, and digital data are inspection of the press-ready combined negatives, press plate production, press setup and press plate calibration, production supplies, quality control, equipment amortization, equipment maintenance, space and utilities, and shipment to the main USGS distribution facility.
Cost drivers for distribution-related costs for maps, map products, and digital data are receiving and processing the shipments received from the USGS printing operation into inventory, inventory management and quality control, processing orders from operational databases, customer service, order taking, accounting, order fulfillment, packaging, postage, distribution supplies, order closeout, and marketing.
If the maps or map products are in digital format, costs are similar to those for graphic maps, with the exception of media costs, research, and order staging. Equipment amortization and maintenance costs for digital-format production equipment are somewhat higher. Text products have additional costs of editing and Government Printing Office contract overrides, as well as higher unit costs due to limited demand and small production lot sizes.
4a. Describe the main products you distribute/sell. The USGS products, information, and services are based on or support natural science data and include the following formats: publications (professional papers, circulars, and general interest), both in electronic and hard copy forms; fact sheets; digital data; maps (including geologic, hydrologic, and topographic); analytical studies; technical assistance; tangible technology; new processes and procedures; emergency assistance; predictive modeling and analysis; environmental assessments and reports; water-resource assessments; biological assessments; biological status and trends reports; satellite imagery; and aerial photography.
Information products disseminated by the USGS are grouped into four general categories: (1) maps and map products, (2) text products, (3) scientific data, and (4) remotely sensed imagery. These products are made available in various formats, including paper, plastic, film, and digital. The 1:24,000-scale standard topographic quadrangle maps (topoquads) on paper are probably the best-known USGS product and are distributed most widely. In fiscal year (FY) 1998, the USGS disseminated approximately 3.1 million 1:24,000-scale topoquad sheets and approximately 4.3 million topoquad sheets across all available scales. The USGS also disseminates information generated by other federal agencies, e.g., the National Imagery and Mapping Agency of the Department of Defense, the United States Forest Service of the Department of Agriculture, other Department of the Interior bureaus, the U.S. Customs Service of the Department of the Treasury, etc.
The USGS holds databases across many subject areas including biological information, climate, natural hazards, minerals, ecosystems, coastal and marine geology, energy, geography, real-time streamflow discharge, water-use, groundwater, and water-quality data.
4b. What are the main issues in developing those products?
- Trying to produce national data sets from many regional data sets that do not have common standards and may be incomplete.
- Producing printed products as part of cooperative agreements with state and local agencies and other federal agencies through a distributed production process that decentralizes the approval, preparation, and distribution activities.
- Migrating toward more electronic publishing and distribution of products, toward an as yet undetermined end point. The USGS is still dealing with various issues in the print world and the evolving technology of electronic publishing. In addition, the costs of getting to that end point, coupled with level or decreasing funding for production and printing, are dynamic issues.
- Evaluating the potential effect of distribution. Due to the nature of its scientific focus, USGS research sometimes results in data and information about threatened or endangered species. While the agency has no security restrictions or limitations on distribution of these publications, it does find it necessary to evaluate the potential effect of publication on the resource being studied. For example, it is an unfortunate fact that publication of endangered species data sometimes results in further harm to the species at the hands of those who wish to possess rare commodities.
4c. Are you the only source of all or some of your data products? If not, please describe the competition you have for your data products and services. The USGS is not the only source of many of its data products, although it produces some specific research products that can be found only at the USGS. The National Water Information System is a unique national database providing consistent, reliable, long-term water information. However, many private-sector concerns, state governments, and academic institutions gather information similar to that collected by the USGS. The USGS strives to develop multiuse information products on a national level. Its competitors, both public and private, develop information products with a specific customer in mind that meet certain demand-level projections. The USGS strives to work cooperatively with many organizations to collect, coordinate, and share data and information, e.g., the Incorporated Research Institutions for Seismology data center, state geological surveys, and state geographic information system groups. Often the greatest value of USGS database activities is derived from the federation of partners the USGS strives to create, and the agency is not in a competitive role with regard to the other producers. The national coverage provided by the USGS ensures consistent management of all U.S. land, water, and natural resources for the betterment of all.
5a. What methods/formats do you use in disseminating your products? The tangible products (inventoried items) of the USGS and custom products produced on demand are disseminated out of the USGS Denver warehouse via the mail and over the counter. The format is mostly paper, although USGS products come in a wide range of flat maps, folded maps, books, etc. Some inventoried items are on CD-ROM.
Digital products are produced on demand (primarily) and distributed via the mail, over the counter, through retail business partners, and over the Internet. Formats vary widely, but the USGS is trying to standardize on the Spatial Data Transfer Standard, the native (archive) format, other nonproprietary formats such as GeoTIFF, and sometimes proprietary formats like ARC/INFO. A variety of media are offered, including CD-R, CD-ROM, 8-mm tape, 3480 cartridge, and digital linear tape.
Much of the USGS digital data and information are distributed over the Internet. The rapid movement of the Web from a novelty to mainstream distribution mechanism has presented the USGS with challenges unthought of just five years ago. The biggest challenge has been to organize, integrate, and present, in a sensible manner, the broad range of data and information types that characterize USGS products.
The Web medium has made USGS products visible to a vast and varied clientele, ranging from the traditional USGS customer base among scientists and policy makers to hobbyists and the K-12 education community. These new audiences have their own unique needs and abilities to digest and use USGS products, which has placed great pressure on the agency to create multiple views and tailored extracts of its Web products and services. For example, genealogists are now a major nonscientific user group for the online USGS Geographic Names Information System, and whitewater recreationists are heavy users of the USGS online real-time stream flow data.
5b. What are the most significant problems you confront in disseminating your data? A fundamental goal of the USGS is to maximize the dissemination of information products to the broadest possible audience given the constraint of recovering costs associated with reproduction and distribution. Fees for USGS information products are therefore based on reproduction and distribution costs and not on the value of the product provided. These fees pursue full recovery of costs, including indirect costs such as depreciation of equipment. USGS information products are in the public domain, carry no copyrights, and may be used and shared freely.
The public policy rationale for charging no more than the cost of reproduction and distribution for information products is that the taxpayer has already expended resources to create the data. The costs associated with reproduction and distribution to specific customers represent the incremental or additional cost that the USGS incurs to disseminate the information products to these customers.
The most significant problem with digital data is that every order is customized. This causes problems in ordering the correct data type and format for the customer. It also creates bottlenecks within the production processes, sometimes resulting in delays in distribution. Due to file size, distribution over the Internet is limited by bandwidth, both on the USGS end and the customer end. The Web “pipeline” is presently inadequate to efficiently deliver some USGS products, such as remotely sensed satellite imagery.
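The bandwidth constraint is easy to quantify. The back-of-the-envelope Python sketch below, with hypothetical file and link sizes typical of the period, shows why remotely sensed imagery strained the Web "pipeline":

    def transfer_hours(size_megabytes: float, link_kilobits_per_sec: float) -> float:
        """Hours needed to move a file of the given size over a link of the given speed."""
        bits = size_megabytes * 8e6
        return bits / (link_kilobits_per_sec * 1e3) / 3600

    # A 300-MB satellite scene over a 56-kbps modem connection:
    print(f"{transfer_hours(300, 56):.1f} hours")  # about 11.9 hours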
Another goal is to provide customers with data, information, and products in the format they most need, in a timely manner, and at a level of information that is appropriate to the intended audience. In addition, the proprietary nature of information that is collected as part of some cooperative agreements presents a problem in the broad release of information. The current, inconsistent pattern of electronic publishing (some products are available on the Web; some are not) is based not on an established policy but rather on ad hoc decisions. The support of printed products and their distribution is also a significant problem in addressing cost-recovery mandates and in long-term funding of free products. The USGS is striving to find more cost-effective means to disseminate a large variety of distinct products that may each have a relatively small or specialized customer base.
6a. Who are your principal customers (categories/types)? Because the USGS mission encompasses a broad range of natural science studies, issues, and interests, the agency serves many different customers. It defines its customers as anyone who uses USGS information, services, and products or as anyone who works with USGS to produce and deliver these. Its customers include the engineer who uses USGS data to revise building codes, the resource manager who uses USGS information to make critical resource and land management decisions at the state and local levels, the water manager who uses the data and information from USGS research and investigations and data collection in fulfilling his or her responsibilities to manage the nation's water resources, and the hiker who uses USGS topographic maps. These customers also include Congress; state and local agencies; federal government agencies such as the Forest Service, NOAA, the Department of Energy, Environmental Protection Agency, U.S. Army Corps of Engineers, NASA, and the Federal Aviation Administration; land and resource management bureaus of the Department of the Interior (Bureau of Land Management, National Park Service, Minerals Management Service, Bureau of Reclamation, Fish and Wildlife Service, and Bureau of Indian Affairs); the science community; elected officials at the state and local levels; other state, local, and tribal authorities; federal, state, and local emergency management agencies (Federal Emergency Management Agency, state offices of emergency services); producers and users of mineral and energy commodities; nongovernment organizations (e.g., insurance sector, structural engineering industry, not-for-profit natural resource interest groups); the news media; the private sector; citizens; universities and schools; representatives of other countries; and other USGS employees (internal customers).
6b. What terms and conditions do you place on access to and use of your data? USGS data are in the public domain and are not subject to copyright protection. Copyright is considered to be a barrier to use of data as a public good.
Although not a term or condition per se, the fact that streamflow information is being served in real time on the Internet requires the statement that they are provisional data, subject to quality assurance and quality control.
6c. Do you provide differential terms for certain categories of customers? The USGS provides a volume discount pricing structure for registered business partners, federal agencies, and non-profit organizations that is different from the prices offered to the general public.
7a. What are the principal sources of funding for your database activities? The principal sources of funding for USGS database activities are congressional appropriations, interagency cooperative agreements (other federal agencies, and state and local agencies), and joint funding arrangements for geospatial data collection, analysis, and interpretation. Reproducing and distributing copies of USGS archival information is funded by congressional appropriations for legislatively required distributions, and through fees established to recover costs associated with reproduction and distribution to all others.
A mix of legislation and executive direction authorizes and requires the USGS to charge for the dissemination of information products to customers both within and outside the federal government. The USGS is required to recover the full costs associated with the reproduction and dissemination of information products. Three fundamental concepts describe the philosophy that underlies USGS pricing policy: (1) the goal of the USGS pricing policy is to maximize the dissemination of information products to the broadest possible audience given the constraint of recovering the cost of reproduction and distribution; (2) prices should be based on costs, not on the value of the product provided; and (3) prices should pursue the full recovery of costs, including indirect costs such as depreciation of equipment.
7b. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customers, etc.)? The USGS pricing structures are based on algorithms designed to track estimates of the actual costs of reproduction and distribution. Whenever possible, products are grouped by like type and are priced accordingly. Since reproduction and distribution costs are similar regardless of customer, the USGS pricing structures are applied equally. Projected targets for reimbursable revenues from the sale of USGS information products, coupled with congressional appropriations and cooperative funding, are used in developing USGS budgets.
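As an illustration of the kind of algorithm involved, the sketch below computes a unit price that tracks reproduction and distribution costs (including indirect costs such as equipment depreciation) and ignores the value of the product, per the pricing philosophy in question 7a. The numbers and cost categories are hypothetical, not the USGS's actual figures or algorithm.

    def cost_recovery_price(reproduction: float, distribution: float,
                            indirect: float, expected_units: int) -> float:
        """Unit price based on costs alone; the value of the product plays no role."""
        return (reproduction + distribution + indirect) / expected_units

    # Hypothetical product line: $12,000 reproduction, $8,000 distribution,
    # $4,000 indirect (depreciation, space, utilities), 6,000 expected orders.
    print(f"${cost_recovery_price(12000.0, 8000.0, 4000.0, 6000):.2f} per unit")  # $4.00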
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. The USGS has made cost recovery a priority activity for the past two years. The overall USGS FY 1998 recovery rate is 100 percent. On a product-line basis, recovery rates for several product lines are less than 95 percent. However, the USGS is taking aggressive steps to update processes, contain costs, and update prices where necessary for each of these product lines.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? No. However, the lack of adequate copyright guidance for federal agencies when publishing in the electronic era is a problem (see question 8d).
As the National Biological Information Infrastructure federation is expanded to include international partners, it is anticipated that problems will arise pertaining to World Intellectual Property Organization (WIPO) issues. However, as yet the USGS has no experience with this. In addition, since it is a government agency, information in USGS possession is subject to Freedom of Information Act (FOIA) guidelines. Since anyone may make a FOIA request for information in the agency's possession, some organizations have been reluctant to pass their data and information to the USGS, for the reasons described in question 4b.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? Because USGS data are not copyrighted, the USGS identity is sometimes not carried or acknowledged on products that reproduce or use USGS data. This practice can be harmful because such products may blend data from multiple sources and of different quality.
Primary harm has been experienced when species have been researched, especially when the data or information produced reveals their exact location. For example, after USGS sent out a FOIA-requested release of information from a research study concerning the location of certain wolves, the animals were soon found dead.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? No differences.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? The problem is that there is no clear mechanism for guiding USGS authors with respect to copyright privileges and responsibilities. The two areas needing policy development are (1) the public-domain status of reports in compliance with OMB Circular A-130 and (2) the use of copyrighted material. In addition, exceptions should be provided to the FOIA guidelines that would exclude the mandatory release of data and information pertaining to threatened and endangered species.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? Yes, especially barriers that deal with difficulty in integrating data from various legacy systems. Headway is being made in these areas, as both more standards and better tools are developed for integrating data from different sources.
Two specific problems are (1) lack of restrictions on FOIA guidelines and (2) potential difficulties in cultivating international partnerships due to WIPO-induced restrictions. Both of these problems will be encountered by any federal agency attempting to provide access to data and information about threatened and endangered species or attempting to partner internationally. The former problem pertains only to federal agencies. The latter problem might be encountered by all who engage in international partnerships if the WIPO were to adopt a treaty based on the E.U. Database Directive model.
General Discussion
PARTICIPANT: Can you tell us something about the financial relationship between USGS and Microsoft?
MS. RYAN: Yes; with the guidelines on entering into CRADAs—cooperative research and development agreements—with the private sector, we are starting to see more of these, not just with Microsoft. So, as pressure starts to hit the public sector for finances, I think there will be a much broader range of partnerships with the private sector.
Right now, Microsoft has purchased the digital orthophoto quadrangle (DOQ) data, just like any other customer would purchase those DOQ data. That is about the only financial exchange involved. In return for that, we had to advertise the CRADA in the Federal Register, so that any other group that wanted to do something similar had the ability to do that right up front.
PARTICIPANT: To follow that, two questions. One, how do you access the information if you don't go through Microsoft? Two, what if Netscape comes along and wants to do the same thing? Will the CRADA with Microsoft permit the USGS to enter into the same deal with someone else?
MS. RYAN: Let me just answer that first question. The DOQ data are probably our best example of information available over the Internet.
For any of these other data sets, that is the challenge that we have internally. Right now we have something like 300 or 400 home pages out there. Each of these individual data sets has its own home page. So, the challenge is currently getting those together, so that when you want to focus on a place on Earth, you can get the full range of these data.
In terms of your question about another group entering into it, I think, in the life of the CRADA, they likely couldn't come in at that juncture. Their opportunity to enter into that was at the beginning when it was advertised in the Federal Register. If they wanted to come, and if it was to our benefit to spin off a different angle, then we would similarly advertise the goals, the missions, the functions for that, and enter into new CRADAs. There are actually a couple of other different partners in this CRADA with Microsoft. They wanted to get worldwide data as well as U.S. data. So, one of the goals was to use other partners for the other parts of the world, such as the Russians and their spy satellite data.
Not-for-Profit Data Activity
James Brunt, Long-Term Ecological Research Network Office, University of New Mexico
Response to Committee Questions
1a. What is the primary purpose of your organization? The Long-Term Ecological Research (LTER) Network Office exists to coordinate network activities of 21 intensive research sites in the United States and Antarctica. The LTER Network Office was established in 1983 and is involved in activities such as:
- Facilitating communication among the LTER sites and between the LTER Program and other scientific communities;
- Supporting the planning and conduct of collaborative research efforts, including provision of some technical support services;
- Facilitating intersite scientific activities, including national and international meetings; and
- Providing a focal point and collective representation of the LTER Network in its external relationships. This includes the development of the LTER Network information system, the primary purpose of which is to facilitate access to LTER data for cross-site analysis and synthesis.
1b. What are the main incentives for your database activities (both economic and other)? The incentive is clearly the advancement of ecological science through the provision of greater access to data—for LTER scientists as well as the scientific community at large.
2a. What are your data sources and how do you obtain data from them? Our primary data sources are the 21 LTER sites around the country, as well as collaborating federal agencies such as NASA. Data are accessed directly from LTER site Web servers in standardized exchange formats. NASA data are obtained in various ways depending on the project, facilitated through memoranda of understanding.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? Today, the main barrier is the limited availability of personnel time at the LTER sites, where effort is focused primarily on on-site science. Since the process is research driven, sites are almost always willing to participate; but the amount of work that can be done by site personnel is limited, so our office helps provide person-power to achieve some of the data acquisition and integration. In the past there were proprietary-data issues, but those have all been resolved by the formulation of site and network data access policies.
3. What are the main cost drivers of your database operations? Our database operation exists to provide data to facilitate research. As such, we have research drivers instead of cost drivers.
4a. Describe the main products you distribute/sell.
- LTER management databases such as personnel, e-mail, etc.;
- LTER all-site bibliographic database;
- LTER data catalog;
- LTER integrated climate database;
- LTER site description database; and
- Remotely sensed data from a variety of sources.
In addition, there are other scientifically specific databases in development such as nitrogen deposition, net primary productivity, leaf area index, etc.
4b. What are the main issues in developing those products? Scientific priority is now the main issue we deal with besides the personnel issues involved in building the data systems. A lot of up-front effort has been put into the databases named above to establish working prototypes and develop operating protocols for further development.
4c. Are you the only source of all or some of your data products? If not, please describe the competition you have for your data products and services. Yes, we are the only source for some of the integrated-site data products. Data are available from individual sites but in a variety of formats. Remotely sensed data are often available directly from the producers, but our products are value-added and significantly modified from the source data.
5a. What methods/formats do you use in disseminating your products? All our data are available via the Internet and have been since before the World Wide Web existed. We also distribute some data on CD-ROM and tape where necessary for portability.
5b. What are the most significant problems you confront in disseminating your data? One of the more significant problems that we are exposed to is the proliferation of offspring data sets. Our data, and especially our metadata, are somewhat dynamic, and it is difficult to get users to check for changes in the data once they have been downloaded. We do not have a system in place to track data users at this point.
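One lightweight remedy, sketched below in Python, is to publish a checksum alongside each data set so that users can detect upstream changes before reanalyzing a stale local copy. The URL and file names are hypothetical; this is not a system the LTER Network actually operates.

    import hashlib
    from pathlib import Path
    from urllib.request import urlopen

    def sha256_hex(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def dataset_changed(url: str, local_copy: Path) -> bool:
        """True if the published data set no longer matches the local download."""
        with urlopen(url) as resp:
            remote_digest = sha256_hex(resp.read())
        return remote_digest != sha256_hex(local_copy.read_bytes())

    if dataset_changed("http://lter.example.org/data/climate.csv", Path("climate.csv")):
        print("Upstream data (or metadata) have changed; re-download before analysis.")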
6a. Who are your principal customers (categories/types)? Our primary customers are scientists and administrators.
6b. What terms and conditions do you place on access to and use of your data? Each site has its own policies but they are all more or less similar to the network policy, which states that “data may be used for legitimate noncommercial scientific purposes” with “no expressed warranty about the quality or content of the data” and “no expressed value beyond that purpose for which the data were originally collected.” For the most part, our policies focus on the openness of our data and not the restrictions. For example, Box 3.2 describes the data access policy for the LTER Network.
However, certain LTER sites have explicit data set warnings. The following is an example from the Sevilleta LTER Program: “All data collected under the umbrella of the Sevilleta Long-Term Ecological Research Program are available here only to qualified scientific interests that agree to cite the data and source appropriately. This agreement must be made in person by contacting the Sevilleta LTER Information Manager (data-use@sevilleta.unm.edu). Failure to make this contact will be considered a disregard for scientific ethics and a violation of University of New Mexico Intellectual Property Rights and could result in civil action.”
Some sites have active mechanisms that include software license-like agreements and registration forms on the Web. LTER scientists are in the process of drafting a document that describes what “ethical use” is for our data.
6c. Do you provide differential terms for certain categories of customers? Data use is restricted to legitimate scientific investigation, whether by a scientist or by a 4th grader, but not by a commercial data provider. The terms restrict any commercial use of the data. Commercial interests would have to negotiate contracts with university-sponsored research programs on an individual basis. Any attempt to commercialize the integrated products would raise a multitude of legal issues because all universities treat intellectual property rights differently, and so do the granting agencies.
7a. What are the principal sources of funding for your database activities? National Science Foundation grants and cosponsoring institutions provide funding for the LTER program.
7b. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? There is no formal pricing structure for the LTER Network. However, some individual sites have a pay-as-you-go policy for anyone requesting value-added data reduction or analysis beyond what they normally make available.
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. Since we provide data as a service to the community, I guess we meet our projections.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? No.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? None of the above statements or policies about data access, commercialization, and intellectual property rights have ever been challenged. Legal challenges to these policies could potentially present a multitude of problems.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? Not applicable.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? I'm very interested in the issues of who owns the data and who controls the data. Do these data rightfully belong to the scientist, the university or institute, the funding agency, the federal government, or the American people? In addition, many scientists are concerned with “good-Samaritan” protection against action resulting from misuse of data. Some would like to see protection from misuse of data beyond simple disclaimers. It would not be conducive to research if scientists started being sued over quality-assurance issues.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? The ecological community as a whole is not of the mindset described above. Most ecological data are held by the investigator until death, revealed only through published analysis and interpretation, and shared only with close colleagues. This practice has come about since journals quit publishing data sets (circa 1930s), and ecologists have consequently collected much larger data sets in their work. To further the collective efforts of ecologists everywhere, the LTER Network advocates open access to ecological data and is demonstrating this by making data available. We are also working with the Ecological Society of America to establish an electronic data journal and a means by which to publish data sets that can be reviewed and cited. The current feeling about data publication among the ecological community is that there are not enough incentives: peer-reviewed publications are the currency of academia, and data sets are not considered publications by tenure and promotion committees. These attitudes are changing.
General Discussion
DR. SERAFIN: It is unfortunate that James Brunt could not be here. Part of the purpose of this workshop, of course, is to give presentations by users of massive amounts of, in this particular panel, geographic data. Our example from the not-for-profit sector was the Long-Term Ecological Research Network, which is a large group of scientists attempting to share scientific data on a massive scale.
We will go straight to Barry Glick, who is the former president and chief executive officer of GeoSystems Global Corporation. GeoSystems is one of the many commercial firms, and perhaps one of the more successful, that have been taking government and commercial geographic data and adding value to create services and products that they then make available to other businesses, as well as to the general consuming public.
Commercial Data Activity
Barry Glick, GeoSystems Global Corporation (retired)
Response to Committee Questions
Provide a description of your organization and database-related operations. GeoSystems Global Corporation is a leading supplier of maps and mapping-related products, services, and technology to companies in the publishing, travel, yellow pages, and real estate markets, as well as directly to consumers. The company's products and services range from supplying highly customized maps for textbooks, travel guides, reference books, and multimedia products to providing the underlying mapping technology and components for hotel reservation systems, driving directions, information kiosks, cellular telephone directory assistance systems, and Internet Web sites.
The company initiated two major expansion efforts to further leverage its data and technology assets. GeoSystems expanded into the Internet information publishing and business services market with the launch of its highly successful MapQuest Web site (www.mapquest.com), the first interactive mapping site on the Internet. GeoSystems also has moved aggressively into consumer publishing, and in 1996 entered into a partnership with the National Geographic Society to be the primary commercial producer, publisher, and distributor of maps and related products under the National Geographic brand name.
Market and Product Focus
GeoSystems is a products- and solutions-centered business that provides high-value location and mapping information to businesses and consumers across all media and distribution channels. A broad spectrum of products and services in all major categories gives GeoSystems an unparalleled advantage over its competition.
GeoSystems offers integrated solutions, services, and a wide range of geographic and map products designed to meet the need for the highest quality mapping and geographic information. It provides digital and multimedia cartography, geographic database development, and comprehensive map and data maintenance through the application of digital and database-driven cartographic techniques. In addition, it offers map-publishing systems, as well as advanced mapping technology and consultation services to clients. Products available for license or purchase include world and U.S. atlases, worldwide electronic map sets in a variety of formats, customized maps and atlases for reference and travel products, as well as U.S. and world map data suitable for high-quality cartographic production.
GeoSystems applies its core technologies to innovative information publishing solutions with a number of leading publishers in the travel, yellow pages, mobile, real estate, online, and consumer software industries. The company provides solutions for operator- and agent-assisted applications and CD-ROM multimedia title development, as well as customized database integration services and Internet/intranet applications.
Custom Services
On a regular basis millions of people benefit from GeoSystems' custom services through our customers' map-enhanced applications.
GeoSystems also provides significant expertise in geographic data management, a critical part of any map-enhanced solution. We build and maintain our own atlas database of the United States—USDB. We also provide street-level mapping databases for over 300 cities worldwide and a gazetteer of over 3 million places. In addition, we maintain strategic partnerships with most mapping data providers in the world, including CompuSearch Micromarketing Data & Systems, Etak Inc., Geographic Data Technologies, Inc. (GDT), AND Mapping B.V., Business Locations Research, Urban Decision Systems, Navigation Technologies, Inc. (NavTech), and Tele Atlas B.V.
Customer Applications
GeoSystems offers a number of customized solutions and applications including an automated trip planner, directions kiosks, commercial real estate systems, client/server systems, consumer CD-ROMs, reservation or OAS systems, intranet applications, and map-enabled business solutions.
Product Management
This group provides sourcing, database design, and enhancement of content that is necessary for many GeoSystems clients' integrated technology products and services. To ensure the success of our information solutions, Product Management provides GeoSystems information publishing clients with an excellent database foundation for multiple application development.
This group manages GeoSystems' strategic partnerships with vendors who are acknowledged leaders in the supply of highly accurate roadway information and point-of-interest data, such as Etak, GDT, and NavTech. This group also formats and optimizes the data for use in routing, display, and geocoding applications. A number of processes and tools can be used to geocode (assign spatial attribution to) points of interest such as businesses, landmarks, and events; a sketch of one common approach follows. Using batch processes and data, Product Management can assign geocoded values to international points of interest as well. When locational data are not available, we utilize the significant map accumulations resident in GeoSystems' library, which holds over 300,000 maps.
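As an illustration of what geocoding involves, the Python sketch below shows address-range interpolation, a common technique in which a street segment carries a house-number range and endpoint coordinates, and an address is placed along the segment proportionally. The data structure and numbers are hypothetical, not GeoSystems' actual process.

    from dataclasses import dataclass

    @dataclass
    class StreetSegment:
        name: str
        low_addr: int     # lowest house number on the segment
        high_addr: int    # highest house number on the segment
        start: tuple      # (lon, lat) at the low-address end
        end: tuple        # (lon, lat) at the high-address end

    def geocode(seg: StreetSegment, house_number: int) -> tuple:
        """Interpolate a (lon, lat) position for house_number along the segment."""
        span = seg.high_addr - seg.low_addr
        t = 0.0 if span == 0 else (house_number - seg.low_addr) / span
        lon = seg.start[0] + t * (seg.end[0] - seg.start[0])
        lat = seg.start[1] + t * (seg.end[1] - seg.start[1])
        return (lon, lat)

    block = StreetSegment("Main St", 100, 198, (-77.0400, 38.9000), (-77.0380, 38.9010))
    print(geocode(block, 150))  # roughly midway along the block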
Other related activities include the “scrubbing” of datasets to eliminate redundancies, correct erroneous addressing information, and facilitate the acquisition of more detailed attribute information (forms of payment, hours of operation, etc.) to substantially increase the usability of data for each customer's engineered solution.
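A minimal sketch of one such scrubbing step follows: normalizing addresses and dropping duplicate point-of-interest records. The record layout and abbreviation table are illustrative assumptions; a production process would use far more robust matching.

    def normalize(addr: str) -> str:
        """Tiny normalizer: uppercase, collapse whitespace, abbreviate street types."""
        addr = " ".join(addr.upper().split())
        for long_form, short in (("STREET", "ST"), ("AVENUE", "AVE"), ("ROAD", "RD")):
            addr = addr.replace(long_form, short)
        return addr

    def dedupe(records):
        seen, clean = set(), []
        for rec in records:
            key = (rec["name"].upper().strip(), normalize(rec["address"]))
            if key not in seen:  # keep only the first record per normalized key
                seen.add(key)
                clean.append(rec)
        return clean

    pois = [
        {"name": "Acme Books", "address": "12 Main Street"},
        {"name": "ACME BOOKS", "address": "12  Main St"},  # same place, messy entry
    ]
    print(dedupe(pois))  # one record survives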
Products and Services
GeoSystems products and services include Adobe Illustrator to ARC/INFO conversion service; authoring cartographic titles; Boundary Litigation Group; CartoTools™; corporate intranet applications; customized map and atlas products; electronic yellow pages applications; fine maps, atlases, globes, and geographic products from Interarts; GeoLocate® Technology; Global Electronic Map Set; GeoRelief™; MapQuest Internet products and services; multimedia product development; rapid application development tools; and world-class cartographic services.
MapQuest
MapQuest provides scalable solutions for individuals, community organizations, and businesses to add interactive mapping to their Web sites. The use of MapQuest is free to consumers, providing content such as travel, reference, classified/yellow pages, real estate, special events, and retail information that relates to the daily lives of individuals. Business information content is layered within geographic databases that cover the entire world. For Web sites, MapQuest offers the MapQuest Connect services for presenting locational and business information on dynamically generated interactive maps. MapQuest's goal is to pioneer new ways for businesses and consumers to use interactive mapping on the World Wide Web. MapQuest is the leading provider of interactive mapping technology and services for Internet publishers, and the Connect product line uses dynamic technology that provides businesses with a full range of mapping and routing services.
The map and point-of-interest data seen in MapQuest comes from numerous international sources, including AND Mapping B.V.; CompuSearch Micromarketing Data & Systems; Geographic Data Technology, Inc.; GeoSystems U.S. Digital Map Database; GeoSystems U.S. Street Files; GeoSystems International City Vector Maps; Navigation Technologies, Inc.; and Spatial Data Sciences.
1a. What is the primary purpose of your organization? The primary purpose of GeoSystems is to provide geographic information-based products and services to consumers and businesses in all media.
1b. What are the main incentives for your database activities (both economic and other)? The main incentives for GeoSystems' database activities are to create value-added products and services that generate consumer usage and business sales.
2a. What are your data sources and how do you obtain data from them? See above for details. GeoSystems' primary data sources are in the U.S. public domain including government-produced maps, digital geographic databases, remotely sensed imagery, and miscellaneous published data/information. Secondary data sources include commercial and non-U.S. government-produced copyrighted maps, digital geographic databases, remotely sensed imagery, and miscellaneous published data/information.
In the past, much of this source information was in analog form and required manual compilation by cartographers. More and more source information is available in digital form and in greater detail and content levels (both in cartographic databases and imagery). In addition, the growing adoption of standards for geographic databases greatly simplifies the importation and integration of disparate databases. Finally, the availability of information on the Internet will allow for even more efficient collection of source information on a worldwide basis.
Information from these sources is digitized (if source is nondigital), edited/updated, reformatted to GeoSystems' internal database formats, and integrated with other sources to create a final “source” database. This database is then extracted to create customized electronic or printed maps, driving directions, software products, etc.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? The major barrier in efficiently exploiting the available source information has been the variability in media, format, data structure, geographic coordinate systems, accuracy, currency, etc., all of which add effort to the process of generating the end product. In the field of geographic information, no single source contains the needed information for creating almost any end product. Therefore, the integration of information from multiple sources is a necessity. Government-source data, while exhibiting very significant advantages (typically the most comprehensive coverage due to the public mission, and zero or very low cost to acquire), also have some important weaknesses, particularly a lack of currency and maintenance and, in some cases, a lack of content needed for commercial usefulness. Therefore, we face a constant decision regarding whether to put the needed updating and enhancement effort into public-domain data to create a “proprietary” database of our own or to license data from third-party commercial vendors. These make-versus-buy decisions are made on a product-by-product basis. As the private sector invests more and more resources into generating databases and competition keeps license fees reasonable, these decisions tend to favor the “buy” rather than the “make” outcome.
We address the incompatibility barriers through a process, sometimes painful, of “decomposing” the source information back to a common geographic frame of reference, thus removing any unique format, structure, and/or coordinate system. In the case of analog sources, digitization is required and then followed by the above-described decomposition process. Once the various sources are all in the common digital source database, the needed editing, updating, reconciliation of conflicts, and data enhancements can take place.
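A toy version of that decomposition step, assuming a source in a simple equirectangular projection with a known origin and scale, is sketched below; real sources would need a projection library and per-source parameters, so this is illustrative only.

    import math

    EARTH_RADIUS_M = 6371000.0  # spherical approximation

    def equirect_to_lonlat(x_m: float, y_m: float, lat0_deg: float = 0.0):
        """Invert an equirectangular projection: meters east/north -> (lon, lat) degrees."""
        lat = math.degrees(y_m / EARTH_RADIUS_M)
        lon = math.degrees(x_m / (EARTH_RADIUS_M * math.cos(math.radians(lat0_deg))))
        return lon, lat

    # A point 111 km east and 222 km north of (0, 0) on the equator:
    print(equirect_to_lonlat(111000.0, 222000.0))  # approximately (1.0, 2.0)

Once every source is expressed in the common longitude/latitude frame, the editing, updating, and conflict reconciliation described above can operate on directly comparable coordinates.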
Nontechnical barriers such as the negotiation of license agreements for commercial data use also exist. In addition to the obvious issue of cost, there are thorny issues having to do with protecting the copyright holders' data confidentiality and enforcing license terms on end users. The use of these data sources in Internet services such as MapQuest makes these issues even thornier and increases the sensitivity of the licensors to the potential for unauthorized copying and use of their data. We have addressed these concerns through the use of copyright notices and by keeping the copyrighted source data in a protected environment, instead using substantially watered-down extracts of the data to generate the maps or other information available to end users. In other words, end users have access only to the results of a query using a small subset of the data, and never to the data themselves.
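A minimal sketch of this serving pattern, with all data and field names invented for illustration: the licensed database stays server-side in the protected environment, and each query returns only the derived extract an end product needs.

    # A sketch of the serving pattern: the licensed database stays in a
    # protected environment, and each query returns only a derived extract.
    # The records and field names are hypothetical.

    LICENSED_DB = {   # never shipped to clients in this form
        "Main St": {"geometry": [(0, 0), (0, 9)], "speed_limit": 30, "lanes": 2},
        "2nd Ave": {"geometry": [(1, 0), (1, 9)], "speed_limit": 25, "lanes": 1},
    }

    def answer_query(street_name):
        """Return only what the end product needs (a drawable centerline),
        not the attribute-rich licensed record."""
        record = LICENSED_DB.get(street_name)
        if record is None:
            return None
        return {"name": street_name, "centerline": record["geometry"]}

    print(answer_query("Main St"))   # the client never sees speed_limit or lanes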
3. What are the main cost drivers of your database operations?
- The labor effort required to clean, reformat, edit, enhance, and integrate data (this cost component is declining over time in relative terms).
- The cost of licensing third-party commercial and non-U.S. government copyrighted data (this cost is increasing over time in relative terms).
- The labor cost of hunting down source information around the world (declining).
- The ongoing cost of data maintenance (this is increasing as a result of the increase in our data holdings requiring maintenance and the requirement for online access to continually maintained data).
4a. Describe the main products you distribute/sell. The main products sold by GeoSystems and the services provided are listed in the organizational description above. The major categories are map images (print and electronic), map datasets, software products, atlases, CD-ROMs, and Internet information services.
4b. What are the main issues in developing those products? The main issues in developing those products are designing the products to meet customers' needs and desires, selecting and obtaining the appropriate source information, enhancing/customizing the data to meet the products' needs, pricing the products in an optimal way, and distributing the products to customers.
4c. Are you the only source of all or some of your data products? If not, please describe the competition you have for your data products and services. GeoSystems is the only source for some of the databases used in our products and services. For example, GeoSystems' cartographic database of international cities is unavailable elsewhere. However, for the vast majority of data products sold or used by GeoSystems, multiple sources are available. The competition includes traditional map and atlas publishers, such as Rand McNally, CD-ROM publishers, such as DeLorme and Microsoft, and geographically oriented software/Internet businesses, such as TravRoute and Vicinity. Since GeoSystems is primarily a developer, distributor, and marketer of finished products and not a database vendor per se, it does not view the primary database vendors in the industry (i.e., NavTech, GDT, Etak) as competitors but as suppliers. Similarly, since GeoSystems is not a vendor of geographic information systems (GIS) software tools, it does not view GIS software vendors, such as Environmental Systems Research Institute, Intergraph, MapInfo, etc., as primary competitors.
5a. What methods/formats do you use in disseminating your products? GeoSystems disseminates its products in multiple channels and media. Printed products are created from digital databases and disseminated by traditional retail and distribution channels, as well as sold directly via the Internet in GeoSystems' “mapstore.com” commerce site. Some software products are sold via retail channels (consumer CD-ROMs); however, most are sold to corporate customers (such as airline reservation systems, car rental agencies, real estate database companies, hotel chains, etc.) and used by intermediaries (travel agents, real estate agents, customer service representatives, etc.) to provide information to their customers. In some cases, interactive kiosks are employed by GeoSystems' customers to provide information directly to their customers without intermediaries. This increasing emphasis on direct access to information is rapidly expanding with the growth of Internet usage. MapQuest.com provides mapping, travel, and routing information directly to consumers as well as feeding information into clients' Web sites for direct access by their customers. Internet-based dissemination is clearly going to dominate the nonmobile uses of geographic information by both consumers and businesses, and may also spread to mobile uses over the next five years.
5b. What are the most significant problems you confront in disseminating your data? It is well known in our industry that the traditional modes of information dissemination are flawed and inefficient. Supplying the “right” set of geographic information for a specific purpose requires access to very large and disparate geo-databases, as well as specialized software and knowledge. It is generally not economical for most organizations to maintain these data and the human, software, and hardware resources required to exploit them. In addition, traditional print products are, by their nature, limited in content, flexibility, and currency. It is far more efficient to deliver need-based information on demand, drawing from a worldwide database, than to produce fixed-media products representing a data extract frozen in time and space. There are also significant economic problems in traditional forms of geographic information dissemination. In many cases, the perceived monetary value of a typical map in print or electronic form is too low to justify the costs of database creation and maintenance as well as product creation and dissemination. Therefore, dissemination costs must be kept as low as possible.
6a. Who are your principal customers (categories/types)? GeoSystems' major customers in each of its product/service areas are listed above in the organizational description. Outside of its MapQuest Internet business, the most important customers for GeoSystems products and services are publishers, both print and electronic/software (reference, educational, yellow pages, travel guide); travel services companies (hotels, car rental, airline reservation systems, travel agencies, auto clubs); real estate information providers (agencies, data services firms); and general corporate users of geographic information (telecommunications firms, oil companies, retail chains, etc.). In the MapQuest Internet segment of its business, the main customer categories are advertisers, Web sites (major national retailers, travel services companies, Web search engines/portal sites, real estate sites), and consumers buying products directly via mapstore.com. In the print publishing segment of its business, the main customers are major bookstore chains, discount stores, and distributors.
6b. What terms and conditions do you place on access to and use of your data? All of the end-user products (maps, Web pages, software, directions/routes) we provide are protected by copyright. In cases in which third-party data are used in the solution, the product carries both GeoSystems and the third-party copyright. Software is sold to end users on a license basis, subject to nonresale and other standard provisions and restrictions found in software licenses.
6c. Do you provide differential terms for certain categories of customers? Yes, differential terms and conditions are provided to certain categories of customers. For example, the least restrictive terms are provided to those customers who acquire a broad technology license that allows them to utilize data and software to create their own products for sale to end users. They must not, in any case, distribute the core proprietary technology outside of their own organization but are restricted to using the technology to produce end products. The most restrictive terms and conditions apply to single-use end users (consumers or corporate customers) who acquire a single copy of a product and are therefore limited to the use of that copy or “instance” of the database, not to include the reselling of the product to others.
7a. What are the principal sources of funding for your database activities? GeoSystems' sources of funding are internal, derived from revenues generated by sales of its products and services. Funding to support major new database/product initiatives has come from venture capital equity investments.
7b. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? Pricing structure is based on a combination of market-based and value-based pricing schemes. In general, prices vary according to the degree of rights obtained by the customer, the number of copies to be made, the content level of the databases involved (e.g., geographic coverage, scale/level of detail, and attribution level), and the functionality of any software licensed. However, the medium or form of dissemination is also a critical element in pricing. Internet access to mapquest.com is free to consumers, although the “serving” of maps or routes into third-party Web sites is priced on an annual license-fee basis, based on estimated numbers of accesses. The free consumer site is advertiser/sponsor supported, with advertisers paying fees based on the number of times their ads are displayed. Print product pricing is highly competitive given the existing competition, as is CD-ROM pricing.
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. GeoSystems' revenues in general have met our forecasts and projections. However, in the case of new products being introduced into the market for the first time, revenue projection is very difficult. For example, in the case of MapQuest, no one knew whether advertising revenue would really work for content-based Web sites or at what rate such revenue would grow; likewise for providing map- and routing-enabling services to other Web sites. While it was clear that there was demand for such a product, pricing was totally unknown and, as is typical in these cases, started high, declined rapidly, and has now stabilized. Outside of the Internet, GeoSystems has enough of a knowledge base, together with long-term contracts and relationships, to be able to forecast revenues fairly accurately.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? Generally speaking, any problems we have encountered from restrictive access or use provisions to an external source database have been addressable through negotiations and usually end up being a pricing issue. Because competition exists in most commercial database categories in the geographic sector, pricing and hence terms and conditions have been realistic and workable. Probably the major problem area we have faced is in dealing with governments outside the United States that have a particularly restrictive approach to geographic databases. In the extreme, these restrictions can sometimes mean that all government map data are considered sensitive and not releasable to outsiders. More commonly, it means that prices are set extremely high (based on the actual costs involved in collecting the data) making commercial exploitation infeasible. In these cases, work-arounds involving the use of source information that is not produced by the government in question (e.g., commercial satellite imagery) can be undertaken, although these are expensive and time consuming.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? We have not experienced any major problems with the legal protection of our databases and products. We have clearly experienced many instances, on a small scale, of unauthorized copying and use of our products. For example, a couple of years ago, I stopped in the Admirals Club at O'Hare and noticed a kiosk that advertised itself as a concierge-type guide to the Chicago area. It became obvious very quickly that the maps used in this kiosk were lifted from a CD-ROM that was published by one of our licensees (with the copyright notice removed from the maps). Our licensee confirmed that they had not authorized this use. After several phone calls and letters from our attorneys (and putting pressure on American Airlines), the company that produced the kiosks withdrew them. Other examples of misuse include publishers producing atlases derived from digitizing our print products.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them? We have used all means at our disposal (procedural, technical, and contractual) to protect our intellectual property. Even though to this point the biggest problems have come from unauthorized copying and use of our print products, the availability of digital data (especially online) clearly has the potential for much more significant and harmful abuses. As mentioned above, as a matter of policy we do not make our “source” databases available directly; they are used to create specified maps, routes, or travel plans. This limits our exposure (at least short of an actual penetration of our internal data management systems) to derived products and not the actual databases. We must also contractually protect our third-party suppliers' databases, and we employ the same procedural, technical, and contractual mechanisms to protect their data.
A major concern for us (and others in the commercial geographic database business) is that the extent to which geographic information (e.g., maps) has legal protection under current copyright law is in doubt. As in the Feist case, courts have recently determined that maps have weak protection, if any, under copyright law because they are assemblages of facts, which are not, in themselves, copyrightable. Even though the original U.S. copyright law specified maps as one of the works to be protected, courts have determined that only the “artistic” design, layout, and possibly the selection of information to be portrayed on a map are protected. Traditional means of protection, such as placing deliberate errors in a map (copyright traps), do not seem to guarantee protection from wholesale copiers of maps. When extended to databases, this lack of protection becomes even more acute, since geographic databases are clearly and unarguably collections of facts only and therefore are akin to maps with the artistic aspect removed. This means we are reluctant to invest in databases (which argues for the “buy” in our make-versus-buy decisions) and very reluctant to make our databases available in any form.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? Some form of legal protection that would proscribe the unauthorized copying and revenue-generating use of a commercial geographic database, and of derived products such as maps, is necessary for our company to have future viability. Since we pay significant license fees to third-party data providers and also spend literally millions of dollars on creating, enhancing, and maintaining data, we would be at a significant cost disadvantage if, through unauthorized use, competitors could offer similar products and services. Because every map involves creative decisions about what to show and not show and how to represent the information via symbols and text, we believe maps should be fully covered under existing copyright law, whether printed or displayed on a screen. Even maps created automatically from databases involve software that selects features for display based on scale, map use, etc., and follows rules (developed by human cartographers) on how to symbolize features and lay out the resulting maps. Since copyright law cannot protect the source databases in the same way, a specific remedy is needed to protect those investments from unauthorized access and use.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? The issues discussed above are, in my opinion, representative of issues confronting the geographic data industry as a whole. In fact, for firms that are wholly or primarily data vendors, such as Navigation Technologies, Inc. (NavTech), Geographic Data Technology, Inc. (GDT), Etak Corporation, and others, the issues are greatly magnified because of their dependence on revenue from data licensing and their need to invest very heavily in database development and maintenance. These firms typically rely on public-domain source data together with self-funded primary data collection, involving their own field staff and/or aerial photography.

The case of NavTech is illustrative, given the extreme nature of the investment being made to develop navigable street map databases in major population centers around the world. Estimates of the investment already made in building the NavTech database are in the hundreds of millions of dollars. This marks the first time in the geographic-data industry that private-sector investment matches or exceeds public-domain investment in map data development and maintenance. The resulting proprietary navigation database clearly exceeds what is available in the public domain in its capture of information related to vehicle navigation, such as directional controls, lane restrictions, turn restrictions, etc. The business proposition is clearly based on the establishment of a mass market for automobile navigation systems and the need for this kind of data to support those systems. To make the huge investment pay off, the price for these data must be kept high, significantly higher than the price of paper maps or road/street atlases. To maintain this high price, inexpensive acquisition of the data through copying, reverse engineering, decompilation, or other methods must be prevented through all means possible, including technical (physical and software copy-protection devices), procedural, and legal/contractual.

Without strong legal and contractual protection (as well as technical and procedural), NavTech would certainly greatly limit access to its databases. Since these databases are also of great utility outside the in-vehicle navigation application (e.g., for emergency dispatch systems, local government applications, logistics planning, and general consumer use), the lack of protection would result in less-than-optimal geographic data applications in a variety of areas. In addition, the databases could be withheld from dissemination via potentially vulnerable channels, such as the Internet and wireless communications, which could effectively prevent advanced applications of the database from coming to fruition.
A related issue, of less importance but nevertheless one that appears from time to time, is the uncertainty of the respective roles of the public and the private sector in geographic data creation and maintenance. There are those who believe that geographic data should be a public good and that even databases such as the NavTech database ought to be taken over by the federal government and put into the public domain. This is precisely because of the business economics discussed above, that is, the necessity to keep the price for the data high, which will limit its potential use in many applications that could benefit from it. Also, NavTech, like any private-sector firm, is limiting its development to major metropolitan regions where it can generate adequate revenue from the data; this will leave data unavailable for nonmetropolitan regions of the United States. In another example, the business of GDT, as well as that of similar firms, is largely built around the enhancement and updating of U.S. Census Bureau geographic data. The extent to which the Census Bureau decides to undertake (and adequately budgets for) enhancing its data for the upcoming 2000 census, as well as its dissemination and pricing policies, can have a life-or-death impact on these firms. In general, the status-quo consensus seems to be that the public sector is responsible for establishing a basic underlying foundation layer for geographic data, including the basic geopositioning and identification of “base map” features such as hydrography (rivers, lakes, coastlines), boundaries, transportation, etc. The federal public-sector approach is “wide and shallow,” with consistent national coverage but with wide gaps in maintenance, causing the data to require updating for most commercial applications. The private-sector approach is “deep but narrow” but generally builds upon the federal public-domain data foundation, which ensures consistency and some degree of standardization.
The role of state and local governments, which are increasingly rich sources of digital geographic data, is another key issue. In the past, the perception in the private sector has been that it is very difficult to work with state and local government databases because of spotty coverage, inconsistent data structure and content, a wide range of database-building systems and approaches, and difficult administrative and contractual issues. As geographic data standards are established and widely adopted, some of these barriers should be lowered, increasing the attractiveness of exploiting these data resources for a variety of geographic data applications. On the administrative/contractual front, some state and local governments have taken an aggressive, nearly private-sector approach to rights in data and pricing issues, while others have more closely emulated the federal public-domain philosophy. Again, some consistency here would be very useful to the industry.
General Discussion
DR. FORESMAN: We heard earlier about barriers to the development of these kinds of things, and about the shift in primary source development and how it reaches a critical mass of data. How is that shift affecting the barriers to development? Have they shifted?
MR. GLICK: Correct me if I am wrong, but I think you are referring to the change in source availability from a digital-source perspective?
DR. FORESMAN: Right, the original analog and digital conversion to get the base maps, the switch to digital.
MR. GLICK: What is happening is that, as I mentioned, the barriers to private-sector people entering into the geographic-data business, whether as data vendors or as creators of end-user applications, have gone way down. Of course, in the United States we are fortunate enough to have very liberal public-domain protection.
One thing I didn't mention regarding barriers, which is significant on the international side, is the difficulty in getting hold of government-produced databases outside the United States, where there is very, very strict control over such databases and, therefore, high prices related to licensing those data. I think it has caused a situation where those countries lag way behind the United States in terms of the availability of these applications. That barrier has gone down. What that has been replaced with, though, for example, are the Internet people who want data that are updated literally on a daily basis.
Given the fact that street-level data are available, for example, with information on things like turn restrictions and one-way streets that change all the time, the burden of maintaining those data has gone way, way up. So, it is easier to get started, much easier than before, but it is difficult to really provide, I think, the level of quality and currency that people expect.
DR. OVERTON: Chris Overton, University of Pennsylvania. It strikes me that you don't need any protection for these, because the size of the databases and the speed with which they are updated are protection enough. I can't imagine anyone pirating these databases and making much use of them. A snapshot in time would not be very useful.
MR. GLICK: I wish that were true. Certainly there are high-quality, high-cost-oriented applications for which that probably is the case. However, there is lots of business for people who create moderate-quality and low-quality products that actually undermine some of the higher-quality, legitimate products that are out there.
I can give you many, many examples of that, where people have either scanned maps or taken databases that, for example, we have created. I was just faced with this a couple of months ago. I think I mentioned in my prepared response that, at the Admirals Club in Chicago, I found a concierge service kiosk that had maps and databases of cities around the country. They were clearly our maps with just the copyright notice removed. Yes, they may have had an agreement with the Admirals Club, which was paying them thousands of dollars to place these things in their clubs. Now, that database was going to go out of date, and eventually they would have had a problem there. I guess they would have had to steal our next version of those data. But I don't think that is enough of a barrier to prevent people from making some commercial use of databases.
If the NavTech database, even a snapshot of that, which contained all the streets in Washington, D.C., and the turn restrictions and address ranges and so forth, were freely available, I guarantee you there would be dozens of people trying to create products from that, even knowing that it would go out of date, and they would either have to go back to the well or start investing themselves in maintaining the data.
MR. REICHMAN: It strikes me that there is a kind of cycle going on in your operation, and I would like to pin it down. I don't know how typical it is of others. On the one hand, you are really dependent on contract at the moment you deliver the application. So, you are one of those people who want Article 2B of the Uniform Commercial Code revised so that you can count on your standardized contract agreements being enforced. At the upper end, you are extremely dependent on access to the public domain, and you are candid enough to admit that.
How about the other way around? What if a scientific body needed access to your data on a kind of regular basis? They were doing some kind of a study and they needed to have massive amounts of your data. So, in other words, the public domain comes to you and says, “Well, now we need some help from you.” Do you have a differentiated pricing policy? Do you have some kind of a two-tiered product or price discrimination that would favor the public-domain users?
MR. GLICK: That is a good question. Let me answer that more from a data-vendor perspective than a GeoSystems perspective. GeoSystems doesn't really license its databases. It creates end products. I don't think the industry has matured yet to that point. Frankly, I don't think there has been demand from the research community to do that. But I think the industry would be very, very receptive; I believe so.
MR. REICHMAN: One follow-up question. You described the difficulty of obtaining comparable public-domain data from Europe. Now, is it not a possibility that if this legal protection were misdirected, that you would experience the same type of difficulty obtaining data that are now readily available to you as a raw material input to your operation? If the same sort of laws and restrictions applied here, you would not have that access to the public domain from which you then make these mainstream applications. Is that a misunderstanding on my part?
MR. GLICK: The laws and restrictions that apply in other countries go under various rubrics. There are things like crown copyright, royal copyright, government copyright, where the governments believe that the data that they have created, that they have invested in, that the governments have invested in, are owned and are really the exclusive property of the government. In some countries, for example, Japan, even the primary act of data collection—in other words, going out and surveying the streets—is an illegal activity, and it is an activity that is reserved for the government.
You know, this is unrelated, I think, to any other copyright protection or intellectual property protection issue. It is just that the government acts as a private-sector vendor would and keeps prices very, very high. It forces people—for example, ourselves and other data collectors like NavTech in Europe—to actually fly photography or take satellite imagery from the United States and create databases of Europe, instead of going to government sources there. That, of course, adds to the cost. That means, for example, that when we create databases in Europe, we charge three to four times the price for a single city that we charge in the United States, because of that issue.
MR. ONSRUD: I think we are going to have to wrap up now. Part of the purpose of having these presentations by various data users and creators, of course, is to uncover some actual examples of problems that are being confronted by the governments, by the noncommercial sectors, and by the commercial sectors, that might be addressed by database legislation. So, part of the attempt here is to find the remnants of the projects that essentially failed, that were unable to move forward.
If we are drafting legislation, we want to hone legislation that would actually address specific problems; otherwise, of course, there is a very real danger of unintended consequences. So, we want to be able to address very specific problems, whether it is day-to-day operational difficulties, project-formation difficulties, etc. or something else. We have had a bit of that today, but in another sense, all three of the entities that presented papers today appear to be thriving. They have been able to manage most of their data-issue problems using current technological, contractual, and intellectual property devices.
We have seen from this panel, at least in my reading of the papers and some of the problems that people are referring to, that there are already violations of copyright law. So far from this panel, we don't really have major empirical evidence yet that illustrates these real-world problems that are ripe to be resolved specifically through database legislation. Perhaps we will see more of these as we talk in the small group sessions and in the other data panel sessions over the next two days. Keep in mind that what we are really after, for many of these experiences, is the actual empirical evidence or directions for finding that.
DR. SERAFIN: We are going to move on to genomic data. Philip Loftus is going to moderate this session.
GENOMIC DATA PANEL
DR. LOFTUS: Genetics and genomics are essentially the new game on the block as far as databases are concerned, and they are growing at an explosive rate. It is obviously very timely to have presentations from each of the key sectors on this topic. To begin the discussion from the government sector, we have James Ostell, chief of the information engineering branch of the National Center for Biotechnology Information (NCBI), which pioneered some of the early connections among key databases for human and other genomic data. NCBI is part of the National Library of Medicine at the National Institutes of Health.
Government Data Activity
James Ostell, National Center for Biotechnology Information
Response to Committee Questions
Provide a description of your organization and database-related operations. The National Center for Biotechnology Information (NCBI) was established by Public Law 100-607 on November 4, 1988, as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). NCBI's mission is to (1) create automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; (2) perform research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules; (3) facilitate the use of databases and software by biotechnology researchers and medical personnel; and (4) coordinate efforts to gather biotechnology information worldwide.
Basic Research From the inception of NCBI, it was considered essential to have a multidisciplinary group of in-house investigators concentrated on basic research in computational molecular biology. These investigators not only make important contributions to basic science, but also serve as a wellspring of new methods for applied-research activities. A research group composed of computer scientists, molecular biologists, mathematicians, biochemists, research physicians, and structural biologists is studying fundamental biomedical problems at the molecular level using mathematical and computational methods. These problems include gene organization and genome analysis, theory of sequence analysis, biomolecular structure modeling and prediction, and statistical approaches to text retrieval.
A sampling of current research projects includes detection and analysis of gene organization, repeating sequence patterns, protein domains, and structural elements; creation of a gene map of the human genome; mathematical modeling of the kinetics of HIV infection; analysis of the effects of sequencing errors on database searching; development of new algorithms for database searching and multiple sequence alignment; construction of nonredundant sequence databases; mathematical models for estimation of the statistical significance of sequence similarity; and vector models for text retrieval. Additionally, NCBI investigators maintain ongoing collaborations with several institutes within the NIH and with numerous academic and government research laboratories.
Databases and Software NCBI provides integrated access to the GenBank DNA sequence database, related molecular biology databases, and other NCBI services through the World Wide Web at <http://www.ncbi.nlm.nih.gov>. The major database services are summarized below.
NCBI assumed responsibility for the production and distribution of GenBank in October 1992. NCBI staff with advanced training in molecular biology build the database from sequences submitted by individual laboratories, by high-throughput sequencing centers, and by data exchange with the international nucleotide sequence databases, European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ). Arrangements with the U.S. Patent and Trademark Office enable the incorporation of patented sequence data.
The current Release 110.0 of GenBank contains more than 3 million sequence records, yielding more than 2 billion base pairs. GenBank has been growing at an exponential rate since its beginning in 1982 and doubles approximately every 14 months. More than 100,000 sequences from individual laboratories and high-throughput sequencing centers are added each month.
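As rough arithmetic based on the doubling figure above (our extrapolation, not a GenBank statistic): if N(t) = N0 · 2^(t/14), with t in months, the database grows by a factor of 2^(12/14), roughly 1.8, each year, and roughly 19-fold over five years.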
In addition to GenBank, NCBI supports and distributes a variety of databases for the medical and scientific communities. These include the Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) of 3-D protein structures, the Unique Human Gene Sequence Collection (UniGene), a Gene Map of the Human Genome, and the Cancer Genome Anatomy Project, which is done in collaboration with the National Cancer Institute.
Entrez is NCBI's search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data. Entrez also provides graphical views of sequences and chromosome maps. A unique feature of Entrez is the ability to retrieve related sequences, structures, and references. The journal literature is available through PubMed, a World Wide Web interface developed at NCBI that provides access to the 9 million journal citations in MEDLINE and contains links to full-text articles at participating publishers' Web sites. The MEDLINE database is produced and distributed by the Library Operations Division of the National Library of Medicine, and NCBI provides the Web access via PubMed.
BLAST is a program for sequence similarity searching developed at NCBI and is instrumental in identifying genes and genetic features. BLAST can execute sequence searches against the entire DNA database in less than 15 seconds. Additional software tools provided by NCBI include the Open Reading Frame Finder (ORF Finder), Electronic PCR, and the sequence submission tools Sequin and BankIt. All of NCBI's databases and software tools are available from the World Wide Web or by File Transfer Protocol (FTP). NCBI also has e-mail servers that provide an alternative way to access the databases for text searching or sequence similarity searching.
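Purely as an illustration of the seed-and-extend idea that underlies BLAST-style similarity searching (and in no way NCBI's actual implementation, which adds scoring matrices, gapped extension, and statistical significance estimates), a toy version might look like this:

    # Toy illustration of seed-and-extend sequence search: index every k-mer
    # ("word") in the database, find exact word matches with the query, and
    # extend each match outward while the bases continue to agree.

    def build_index(db_seq, k=4):
        """Index the start positions of every k-mer in the database sequence."""
        index = {}
        for i in range(len(db_seq) - k + 1):
            index.setdefault(db_seq[i:i + k], []).append(i)
        return index

    def extend(query, db_seq, q_pos, d_pos, k):
        """Extend a seed match in both directions while bases agree."""
        left = 0
        while (q_pos - left - 1 >= 0 and d_pos - left - 1 >= 0
               and query[q_pos - left - 1] == db_seq[d_pos - left - 1]):
            left += 1
        right = 0
        while (q_pos + k + right < len(query) and d_pos + k + right < len(db_seq)
               and query[q_pos + k + right] == db_seq[d_pos + k + right]):
            right += 1
        return d_pos - left, k + left + right   # start in db, match length

    def search(query, db_seq, k=4):
        index = build_index(db_seq, k)
        hits = set()
        for q in range(len(query) - k + 1):
            for d in index.get(query[q:q + k], []):
                hits.add(extend(query, db_seq, q, d, k))
        return sorted(hits)

    print(search("GATTACA", "TTGATTACATTGATCACA"))   # -> [(2, 7)]

The k-mer index is what makes such searches fast: only database regions sharing an exact word with the query are ever examined and extended, rather than aligning the query against the entire database.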
The high rate of access to NCBI's services makes it one of the most highly used federal Web sites. The site receives more than 4 million hits per day from 90,000 unique user addresses. About one-third of the searches are for molecular biology databases and two-thirds for the PubMed interface to MEDLINE. The BLAST service receives 70,000 search requests daily.
Education and Training NCBI fosters scientific communication in the area of computers as applied to molecular biology and genetics by sponsoring meetings, workshops, and lecture series. A Scientific Visitors Program has been established to foster collaborations with extramural scientists. Postdoctoral fellow positions are available as part of the NIH Intramural Research program.
1a. What is the primary purpose of your organization? The primary purpose of NCBI is to develop databases of molecular biology and related information, develop effective search and analysis methods for the data resources, and conduct research in computational molecular biology.
1b. What are the main incentives for your database activities (both economic and other)? The primary incentive is to support national and international molecular biology research activities and contribute to basic research in computational molecular biology.
2a. What are your data sources and how do you obtain data from them? NCBI obtains data through direct contributions from individual scientists, collaborations with other databases, and links to related outside resources. The data sources for NCBI's major database development programs are summarized below.
GenBank DNA and protein sequence data are submitted to GenBank directly by the scientific community and represent the results of their experimental research. Many database users are also data contributors. Our international database collaborators, DDBJ and EMBL, also receive data from individual scientists, and the three databases exchange data nightly. DNA and protein sequence data submission is voluntary on the part of the scientific community, with added encouragement from funding institutions and journal editors. Many molecular biology journals require that molecular sequence data be deposited in a public database as a condition of publication. Recipients of NIH research grants are also required to deposit sequence data in the public databases. Policy statements related to release of research data include: (1) the Public Health Service policy on distribution of unique research resources, September 11, 1992; (2) NIH Grants Policy on availability of research results, October 1998; (3) National Human Genome Research Institute (NHGRI) policy on release of human genomic sequence data, March 7, 1997; and (4) NHGRI policy on availability and patenting of human genomic DNA sequence data, April 9, 1996.
Molecular Modeling Database The Molecular Modeling Database (MMDB) contains experimentally determined biopolymer 3-D structures obtained from the Protein Data Bank (PDB) produced by Brookhaven National Laboratory. With MMDB, NCBI has encoded the protein structures from PDB in a data structure designed to facilitate molecular modeling. MMDB is fully integrated into the Entrez retrieval system for sequence, mapping, and structure data and includes the Cn3D structure viewer developed at NCBI. The PDB database is available via FTP without restriction.
Genome Mapping Information Resources The Genomes domain within the Entrez database system contains genetic, cytogenetic, physical, and sequence maps that have been integrated to show common markers. The seven organizations that maintain the individual source maps make their data publicly available via FTP from their host Web sites. A collaborative working relationship with the mapping groups contributes greatly to the effectiveness of NCBI's service. Links from NCBI to the source organizations are provided directly from within the Genomes domain of Entrez.
GeneMap 98—a transcript map of the human genome—and its 1996 predecessor resulted from a collaborative effort between NCBI and the International Radiation Hybrid Mapping Consortium. NCBI organized and distributed the sequence data to the mapping centers, and the centers carried out the mapping using a consistent set of radiation hybrid reagents and methodologies. NCBI then developed the database and retrieval systems that provide access to the integrated human gene map, with links to the original source mapping organizations for more detailed information.
Literature Providing access to the scientific literature is an important component of NCBI's molecular biology database services. The PubMed retrieval system provides access to the MEDLINE database that is also a product of the National Library of Medicine. By agreement with the Johns Hopkins University, NCBI also provides access to the OMIM database, a comprehensive catalog of human genetic disorders that includes comprehensive state-of-the-art reviews of the scientific literature and extensive references.
Links to Outside Resources and Related Databases NCBI provides links to numerous outside resources that offer related molecular biology data. The NCBI FTP site also serves as a data repository for third-party databases containing specialized data, for example, databases on specific organisms, a restriction enzyme database, a metabolic pathways database, and a database of protein motifs.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers?
Direct Contributions from Scientists NCBI has encountered few barriers in obtaining and integrating data contributed directly by the scientific community. The primary issues that have arisen in this respect relate to the timeliness of data deposition, the completeness of sequence data annotations, requests to release partial data, and concerns about the confidentiality of data deposited prior to publication.
Since there are still some journals that do not require deposition of sequence data in a public database as a condition of publication, an author may delay release of the full sequence data until additional research can be completed. Contribution to GenBank is voluntary, so NCBI cannot force any individual scientist to submit sequence data, but we can contact scientists and request that they do so. NCBI also has a mechanism for entering sequence data into GenBank directly from the published literature, so that the portion of the sequence that appears in print can be included in the database. This situation is by far the exception to the rule, however.
In the rush to submit their data and obtain an accession number to be included in a manuscript submitted for publication, authors sometimes fail to do a complete job of making biological annotations on their sequence data. Incomplete data annotations can also result from inexperience using the sequence submission software. In both cases, an increased production burden is placed on NCBI GenBank annotation staff.
There are some cases in which a scientist requests that only part of the sequence be released, for example, only that portion that actually appeared in print or only the protein sequence. GenBank policy requires that the full sequence, both DNA and translated protein, be released when the accession number of any portion of the sequence is published.
Some authors want to be certain that their data will remain confidential prior to publication of the associated paper. GenBank has a “hold until published” policy, so that submitters can request that the data be held confidential for that period of time.
From a technical standpoint, NCBI has facilitated the data submission process by developing two easy-to-use software packages for the preparation of sequence database submissions and by making these available free of charge. BankIt is designed for submitting sequence data through the World Wide Web, and Sequin is a stand-alone program for preparing sequence submissions on Macintosh, Windows, and UNIX platforms. Our international sequence database collaborators have also developed submission tools for their users, and the collaborators employ a common data format to facilitate exchange.
Collaborations with Other Databases The barriers to obtaining access to data from outside databases have also been few and relate to timeliness of updates. To the limited extent that NCBI makes use of third-party databases in its integrated retrieval systems or in its BLAST service, we are dependent on the update schedules of the outside entity.
3. What are the main cost drivers of your operations? The primary costs are for personnel. The increase in sequence and mapping data has necessitated an increase in staff required to process those data. The increase in data has also resulted in development of specialized database services and a concomitant increase in personnel to design, implement, and maintain these resources. The second major cost area is for information technology to support the growing database development and distribution activities.
4a. Describe the main products you distribute/sell. NCBI's primary products are molecular biology databases containing DNA and protein sequence information, genome mapping information, 3-D biomolecular structure information, and associated published literature. NCBI also distributes and provides Web access to the BLAST family of sequence analysis programs, which were developed at NCBI and are used as research tools in sequencing laboratories worldwide. Some examples of NCBI's more than 30 database and analysis services include GenBank, Entrez, and the Molecular Modeling Database, all of which are described above.
In addition, the UniGene database organizes expressed sequence tags (ESTs) and full-length cDNA sequences into more than 55,000 clusters, each of which represents a unique human gene. Clusters are annotated with mapping and expression information. The UniGene database served as the foundation for the collaborative project that resulted in the 1998 and 1996 Gene Map of the Human Genome.
Another example of an NCBI database is GeneMap 98, which is a database representing the physical map of more than 30,000 human genes constructed by the International Radiation Hybrid Mapping Consortium using a consistent set of radiation hybrid reagents and methodologies. This map provides a framework and focus for accelerated sequencing efforts by highlighting key landmarks of the chromosomes. It represents the cooperative efforts of more than 100 scientists worldwide.
Sequence Analysis Tools BLAST sequence similarity search programs allow scientists to compare a nucleotide or protein sequence against the full sequence database or a subset thereof. Electronic PCR makes it possible to determine the gene map location of a newly identified sequence. ORF Finder is a graphical analysis tool that finds all open reading frames in a sequence.
4b. What are the main issues in developing those products? The main issue is identifying the needs in the research community. In response to research needs, issues of data collection, organization, and access are addressed.
4c. Are you the only source of all or some of your data products? NCBI is the sole source of most of our data and software products. Much of the data are made available without restriction from the NCBI FTP site, and outside organizations can and do obtain them and redistribute them in full or part.
Because of the international sequence database collaboration, the information in GenBank is also contained in the EMBL and DDBJ sequence databases. Each database receives, processes, and maintains data submissions independently, so each maintains control over a unique set of sequence submissions. However, the sequence data processed at each of the three databases are exchanged on a daily basis, so that all three provide access to essentially the same universe of DNA and protein sequence information.
GenBank is made available for downloading in full or in part from our FTP site and is installed as a local application in hundreds of academic, government, and commercial institutions. Therefore, while we are the sole source of GenBank, many other organizations provide access to it. Approximately 200 sites download the GenBank updates daily.
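By way of illustration, a site mirroring GenBank updates might script the transfer along these lines; the host below is NCBI's public FTP server, but the directory path and file-name pattern are assumptions about its layout rather than documented facts:

    # Hedged sketch of fetching GenBank daily update files over anonymous FTP.
    # Directory and file-name pattern are assumed, not guaranteed.
    from ftplib import FTP

    ftp = FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()                        # anonymous access
    ftp.cwd("/genbank/daily-nc/")      # assumed location of daily update files
    for name in ftp.nlst():
        if name.endswith(".flat.gz"):  # assumed naming of daily flat files
            with open(name, "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
    ftp.quit()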
NCBI is the only original source of specialized database services that have resulted from internal research and development efforts on data organization, consolidation, and analysis. Examples of these include UniGene and GeneMap 98. However, as noted above, for many of these services, the underlying data are made available by FTP, so local database development projects may be under way based on those underlying data.
The BLAST sequence similarity search software is also available for downloading from our FTP site and is installed as a local application in academic, government, and commercial laboratories. It is also available as a Web application from academic and commercial Web servers.
5a. What methods/formats do you use in disseminating your products? NCBI disseminates its databases and software through online interaction on the Web, through client/server programs, and via FTP. CD-ROM dissemination was completely discontinued earlier in 1998.
5b. What are the most significant problems you confront in disseminating your data? In terms of computing resources, the most significant problems are keeping pace with the demand for access and the sheer growth of data. In terms of providing reliable access to the data, there are performance issues related to the Internet itself or related to network availability at local user sites. Performance issues are greater for international access.
6a. Who are your principal customers (categories/types)? For NCBI's molecular biology database services, the primary customers are research scientists in academic, government, and commercial organizations.
For the services that have a clinical component, such as OMIM, GeneMap 98, and Genes and Disease, customers also include health professionals and, to a more limited degree, students and the general public.
6b. What terms and conditions do you place on access to and use of your data? GenBank and the other molecular biology databases produced by NCBI are freely available with no copyright or access restrictions. NCBI requests that it be acknowledged as the source of the data, but this is not required. The OMIM database is produced by and proprietary to the Johns Hopkins University, with NCBI providing the computer support for database maintenance and access. OMIM is subject to copyright restrictions regarding redistribution, and the Johns Hopkins University is the copyright owner.
6c. Do you provide preferential terms for certain categories of customers? No differentiation is made among different categories of users.
7a. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? All services are free of charge.
7b. Do your revenues meet your targets/projections? Please elaborate, if possible. NCBI's operation is completely funded by congressional appropriation, and our operations are conducted within annual budget limits.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? There has been only one instance in which there was an attempt by an outside database to restrict access. The SWISS-PROT database, which is based in Geneva, Switzerland, and had been included as an integral part of the Entrez retrieval system for eight years, attempted to change the terms of data distribution and require that NCBI impose access restrictions for commercial users. NCBI was able to negotiate mutually acceptable terms that do not require us to release information on commercial or other users of our database and that allow us to continue to incorporate the SWISS-PROT data in our services.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? NCBI's intent has been to encourage use of the databases by third-party vendors and distributors, and we have not encountered problems in this regard. To the contrary, we have experienced successful interactions in which producers of specialized sequence analysis software provide hooks directly into the NCBI databases. In addition, outside organizations have developed customized user interfaces that provide access to NCBI services, for example, foreign language interfaces.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? Not applicable.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? As an organization within the NLM and the NIH, we need to defer to NIH's Office of Policy in regard to legal or regulatory changes.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? The only comparable government sector organizations involved in this type of data activity are our international partners, the producers of the EMBL and DDBJ databases. They are both providers of large public databases that are funded with public monies. The issues described above would be representative of these other two databases.
General Discussion
MR. PETTINGER: Larry Pettinger from the USGS. In your prepared response the statement was made that work (findings) from agencies like the National Institutes of Health is exempt from copyright. Given that this is government-funded research, how is that possible, or are there other ways in which the public's interests are protected?
DR. OSTELL: This is not a copyright on the sequence data themselves. Normally, there is a process with the sequence data whereby the journals require an accession number from the public database in order to show the sequence as supporting evidence for the paper. So, obviously, you can obtain a copyright for the paper, but once the sequence is deposited in the public database, the public database has a policy of no restrictions. Now, the laws could change about that.
DR. BENSON: You are exactly right, in terms of the sequence data. The copyright would be for the paper.
DR. PETTINGER: So, they don't own the data directly?
MS. BROOKS: It is still in a public database, so it is publicly available for other people to use.
DR. OSTELL: I think that the grantee has the right to patent an invention, that is, if they discovered, say, a pharmaceutical use for some assay they have. But the sequence itself is considered primary data, and it is required that it be deposited in a public database.
MR. MILES: Is there a commercial publishing activity that uses the data that you have?
DR. OSTELL: Yes, a large number of them republish sequence data, sometimes in the context of, say, a software tool or a set of analysis software tools. For example, we don't provide interactive solutions that can be taken in-house by, say, a pharmaceutical company. So, there is a large industry whose members essentially build information systems combining public data with the customer's private data into an information system designed for in-house use, as opposed to public use.
There are also companies that don't need fresh new technology. For example, a company that builds hardware for measuring gene expression has a new product that uses human reference sequences, that is, a standard set of publicly available sequences. For that company, using publicly available data is a benefit, because the data set provided with its hardware product will become a standard. The company can also expect the same data to be included in software and database products from other vendors, so its hardware will be referring to the same data set.
PARTICIPANT: What prevents someone from using commercial data provided in your database, from SWISS-PROT, for example, for other commercial purposes?
DR. OSTELL: In a sense, that is the gamble SWISS-PROT has to take to allow us to include its data. We don't consider it our job to police data in this database, which is why we generally explain to people that the data are publicly available. Some of the data we provide specifically for bulk downloading. SWISS-PROT does not provide for bulk downloading; it is visible only through the retrieval system. I think if a drug company were to retrieve dozens of SWISS-PROT records and use them internally, we really couldn't distinguish that from an individual doing the same. We do have protections, basically against abuse of the system, somebody using an interactive system to download tens or hundreds of thousands of records. Those protections would kick in for SWISS-PROT just the same as they would for anything else. Because of those constraints, and the advantage to SWISS-PROT of being in there, they are willing to take the risk. But we don't provide policing functions. They have to do that.
DR. LOFTUS: The genomic data give you the blueprint, ultimately, of all forms of life, human and others. Essentially what you have is the bill of materials, the pieces from which life is built. It doesn't tell you how that information is expressed, it doesn't tell you how that expression is controlled, and it doesn't tell you how the protein products of that expression fit into the various biochemical pathways.
A whole area of science, bioinformatics, has grown up around taking genomic information, adding value to it, and turning it into meaningful scientific insights. So, from the not-for-profit sector, we have Chris Overton, who is director of the Center for Bioinformatics at the University of Pennsylvania.
Not-for-Profit Data Activity
G. Christian Overton, Center for Bioinformatics, University of Pennsylvania
Response to Committee Questions
1a. What is the primary purpose of your organization? The Center for Bioinformatics was established by the University of Pennsylvania in 1997 to provide a focal point for ongoing research and educational programs in bioinformatics and computational biology. The Center is interdisciplinary; faculty and students from the schools of Medicine, Arts and Sciences, and Engineering and Applied Science participate in the program. Research activities in the Center range from basic research in advanced database technology and algorithms to the application of databases and algorithms to furthering our understanding of biological structures and processes. Educational activities reflect the breadth of knowledge demanded by bioinformatics and cover fundamentals in data management, analysis and visualization, scientific computing, molecular and cellular biology, evolution, and genomics. An important activity of the Center is the creation of databases in support of the Human Genome Project and related efforts in genomics, as well as databases designed for hypothesis-driven research in biomedicine.
1b. What are the main incentives for your database activities (both economic and other)? Databases hold a unique status in biological research. Because all life is related through evolution, the study of virtually any question in biology is informed by consideration of the historical record of life as reflected in modern organisms. For example, understanding the processes of development in model organisms such as the fruit fly or the round worm C. elegans, whose complete genome (more or less) has recently been determined, provides powerful insights into homologous systems in other animals, including humans. Similarly, predictions of gene and protein functions and macromolecular structures are all driven by comparison to known macromolecules and their structures. Consequently, development and maintenance of databases of biological data, information, and knowledge are critical to the rapid advance of research in fundamental problems in biomedicine. As a corollary, unfettered access to the data housed in the large and diverse collection of online biology resources is essential if the pace of research is not to be inhibited.
2a. What are your data sources and how do you obtain data from them? We obtain data through four principal mechanisms: (1) proprietary and public-domain experimental data generated at the University of Pennsylvania or by collaborators at other academic research facilities; (2) manual curation and encoding of data from the published scientific literature; (3) transformation and integration of data from the large collection of public online resources; and (4) computational analysis to yield derived data, information, and knowledge. With respect to existing online resources, the biological sciences are extraordinarily rich in public-domain repositories of information. For example, one biological database, INFOBIOGEN, which tracks many of the available online resources, lists 410 resources. Most of these databases relate to molecular and cellular biology and genomics.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? Until recently, the barriers to accessing and integrating biological data resources were primarily technical in nature. Indeed, many of the issues involved in integrating diverse, heterogeneous, distributed biological data resources—such as data resource evolution, transformation, and integration, and data provenance—have motivated significant research efforts in information technology. Because the rich data resources for biology are largely in the public domain, they have become important testbeds for advances in information technology not readily available elsewhere.
A growing trend, which will surely impact ready access to vital information, is the commercialization and restrictive licensing of formerly freely distributed data resources. In some cases this has been motivated by the need to secure stable long-term funding for data resource development and maintenance. Regardless, this trend could introduce insurmountable barriers to database integration efforts, particularly distributed database integration, as we are forced to negotiate with each provider terms for data access, acceptable data formats, and distribution on the Web. One solution under consideration would be to block access to our Web sites by all commercial domains, a strategy that would greatly simplify our compliance with licensing agreements.
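A minimal sketch of what such domain-based blocking might look like, assuming the crude heuristic of a reverse-DNS lookup, follows; the functions are invented for illustration and do not describe the Center's implementation.

```python
# Hypothetical sketch of the "block all commercial domains" option:
# refuse requests whose reverse-DNS name ends in .com. A crude heuristic,
# invented here for illustration only.
import socket

def is_commercial(ip_address: str) -> bool:
    """Guess whether a request originates from a commercial domain."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)
    except OSError:
        return False  # unresolvable hosts pass; the heuristic is permissive
    return hostname.endswith(".com")

def allow_request(ip_address: str) -> bool:
    """Serve only requests that do not appear to come from .com hosts."""
    return not is_commercial(ip_address)
```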
3. What are the main cost drivers of your database operations? Our costs are dominated almost entirely by development efforts for software infrastructure and by the creation and maintenance of database content. With few exceptions, we do not produce shrink-wrapped software and databases, so our distribution costs are currently modest. Access to our databases is primarily through World Wide Web interfaces. We do distribute software and bulk copy files of our databases on request from our FTP site. We regularly receive requests from industry, government, and academic sites.
4a. Describe the main products you distribute/sell. Ours is primarily a research group, not a production facility. We work on both enabling technology and database content development. Examples of enabling technologies include:
- K2 project: In collaboration with colleagues in the Computer and Information Sciences Department at the University of Pennsylvania, we explored advanced query languages for database integration. This led to the creation of K2, a practical system for the integration of distributed, heterogeneous databases and other online resources. K2, while a generic system, has been tuned for databases of interest to biologists.
- TESS project: A transcription element search system that integrates pattern recognition tools, database access, and visualization tools in support of the analysis of gene regulatory elements in DNA sequence.
- bioWidgets project: A Java-based graphical user interface toolkit for the construction of scientific visualization applications in genomics.
Examples of database content projects include:
- GAIA, which is a testbed for exploring issues surrounding automated annotation of genomic sequence in higher organisms. It is both a system for systematically analyzing uncharacterized genomic sequence and a warehouse of derived data available for public access. GAIA integrates information from DNA and protein sequence databases, gene mapping databases, literature information retrieval systems, and genetics databases, among others.
- EpoDB, which is a prototype framework for building deep coverage databases for a specific problem of interest to biologists. Like GAIA, it is both a collection of tools for database construction and maintenance and a repository of integrated data. In the case of EpoDB, data have been gathered to support the analysis of gene expression during red blood cell differentiation. It is initialized with data drawn from the nucleic acid sequence database GenBank, the protein database SWISS-PROT, the transcription factor databases Transfac and TRRD, and the gene expression database GERD. Value is then added to the database by data analysis and manual curation of information.
4b. What are the main issues in developing those products? Since these are research projects, the main issues are in technology development and in unfettered access to data to drive database integration efforts.
4c. Are you the only source of all or some of your data products? If not, please describe the competition you have for your data products and services. The competition largely comes from other research groups, although there are some areas where the difficulty and complications of accessing commercial data have forced us to re-create some of these products.
5a. What methods/formats do you use in disseminating your products? Software is distributed as executables except when source code is specifically requested by collaborators. Database content is distributed as relational bulk copy files. Most data access, however, is via the Web.
5b. What are the most significant problems you confront in disseminating your data? Up until this year, we have had no significant problems in data dissemination. We now are beginning to experience problems as a consequence of licensing restrictions on various databases we would normally redistribute in part or in whole.
6a. Who are your principal customers (categories/types)? Requests for software and database content come from national and international government, academic, and commercial users.
6b. What terms and conditions do you place on access to and use of your data? We currently place no restrictions on the use of our data when extracted from our Web sites. Conditions on data distributed in bulk format are tailored to the user.
6c. Do you provide differential terms for certain categories of customers? In general, we do not provide bulk copies of our databases to commercial users.
7a. What are the principal sources of funding for your database activities? Federal grants and, to a lesser extent, sponsored research from industry.
7b. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? No charge to this point.
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. Not applicable.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? Yes. Licensing agreements on several databases constrain our work on database integration. Among the most annoying over the years have been restrictions on access to literature citation databases, none of which provides adequate query facilities for the data mining tasks we are interested in.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? We have on multiple occasions had users attempt to systematically extract large sections of our databases by performing thousands of sequential queries on our Web site.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? We limit the number of queries that can be performed from a particular site between database updates. However, this is primarily an effort on our part to contain abusive use of our system rather than prevent access to the data.
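As a rough sketch of how such a per-site cap might work, consider a counter keyed on the requesting host and reset at each database update; the limit and all names below are assumptions for illustration, not the Center's actual code.

```python
# Minimal sketch of a per-site query cap reset at each database update.
# The cap, the notion of a "site" (the requesting host), and the function
# names are illustrative assumptions.

MAX_QUERIES_PER_SITE = 5000   # hypothetical cap between database updates
query_counts = {}             # host -> queries since the last update

def allow_query(host: str) -> bool:
    """Return True if this host may run another query in the current cycle."""
    count = query_counts.get(host, 0)
    if count >= MAX_QUERIES_PER_SITE:
        return False  # contain bulk extraction without blocking normal use
    query_counts[host] = count + 1
    return True

def on_database_update() -> None:
    """Counters start over whenever the database is refreshed."""
    query_counts.clear()
```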
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? I would like to make sure that access to data is as unrestricted as possible.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? These are reasonably representative of the kinds of problems experienced by other content providers.
General Discussion
DR. GILBERT: Richard Gilbert, University of California at Berkeley. Do you attribute new restrictions on the provision of commercial databases to new directives?
DR. OVERTON: I don't know exactly. I attribute it to the fact that SWISS-PROT, a number of years ago, was having funding problems. They decided that the only way they could continue to produce these data, which are extraordinarily accurate data, was to generate funding from other sources.
DR. SCOTCHMER: Is SWISS-PROT itself a commercial entity?
DR. OVERTON: I am not the one to ask that. It is my understanding they are quasi-commercial. They are sponsored by the Swiss Institute of Bioinformatics.
DR. WILLIAMS: GeneBio is a commercial company, and it has exclusive rights to the distribution of several software products produced by the Institute.
DR. OVERTON: I think this is a trend we are seeing. I have been approached on a number of occasions to commercialize several of the databases that we produce. What I would like to see is simply easier access to all of this, especially since what we do is database integration. To make it more complicated, we do database integration on the fly. That is, we query across these sources for heterogeneous, distributed data. Some of the data come from different parts of the world. I actually have no idea what is going to happen. I don't know if we will be able to do that in the future if the restrictions become universal.
MR. PERLMAN: Harvey Perlman, University of Nebraska. I don't know enough about biology to really form a question, but does the existence of your database, which incorporates some of the SWISS-PROT database, deprive them of any customers they would otherwise have?
DR. OVERTON: No, absolutely not. A lot of what we do enhances SWISS-PROT.
MR. STEFIK: I think the value of the examples that you put up here had to do with combining items from multiple databases to create some new output. Is there any experience with the restrictions on that—i.e., the combined results? Does it carry all the restrictions from all the sources of data, no matter how small the contribution to the database?
DR. OVERTON: I think that is a great question, and we have just not addressed it at all at this point. All the licensing restrictions have really come in the last year. We have worked on these data for a number of years. We haven't changed what we do, but it makes us uncomfortable now, especially when the users are commercial. So, we have the providers of some of these databases asking us to check whether some of these commercial users are actually licensed. We have had cases where a commercial user has come in on the Web site and tried to download data without restriction.
We are moving toward restricting, just as you said, massive downloads from the Web site. Mostly, we are doing that not just because of the licensing agreement, but because it puts a burden on the resources. We don't want some user to wipe us out, bring our relatively modest system to its knees.
DR. LOFTUS: We probably need to move on. I think genomics is a very exciting area of science where, because of the currency of the information, there is enormous value in the databases. We have seen that a lot of the value also comes from what is added, in areas of science like bioinformatics, as well as from research opportunities. That creates commercial opportunities both in the database domain and in derived products, whether they are products that help you navigate, search, and access the databases or original scientific capabilities that add value to the information in the databases by telling you how gene products are expressed or how they manifest their actions biologically.
Our third speaker gives us insight into that commercial domain in the marketplace. Myra Williams is the president and chief executive officer of Molecular Applications Group.
Commercial Data Activity
Myra Williams, Molecular Applications Group
Response to Committee Questions
1a. What is the primary purpose of your organization? Pharmaceutical and agricultural research has undergone a paradigm shift that reflects the impact of the genomics initiative, combinatorial chemistry, and high-throughput screening on the discovery process. Differential gene expression and proteomics are providing insight into disease pathways and the functional roles of proteins. Genomes of entire organisms are being elucidated, and massive stores of information are accumulating. The goal of Molecular Applications Group is to provide important new science and computational capabilities to mine these data to enhance the discovery process. This goal is accomplished through (1) continued research on new algorithms for moving from gene sequence information to protein function; (2) the development of software for storing, mining, and visualizing these massive stores of data; and (3) combining the results of our proprietary algorithms with public and proprietary information to provide databases of direct relevance to the discovery process.
As a privately held company, we must also translate these advances in science into value for our investors.
1b. What are the main incentives for your database activities (both economic and other)? Our database activities are extensive. They include mining existing databases to extract relevant information for analysis as well as developing value-added databases, which include information extracted from other sources.
Some of our database activities are required for us to conduct research in a proprietary environment. For example, information that is transmitted over the World Wide Web is considered to be published. Hence, it is crucial for databases and algorithms to be available in-house for the analysis of proprietary sequences (not only for us, but also for our customers) to avoid placing intellectual property at risk.
Our database activities also result from an economic incentive. We have developed proprietary algorithms that are directly relevant to novel target identification, the prioritization of likely targets, and the linkage of our structural prediction capabilities to chemistry. Moreover, we have developed software that provides a very powerful biological database, one that permits a scientist to ask questions about relatedness, not just facts. For example, one could automatically assign newly discovered sequences to families of proteins, query information about "similar" proteins, predict protein function, and identify key residues in a protein. Examples of such families include G-protein-coupled receptors, proteases, and kinases. Having a compilation of such information is of great importance to scientists, since it is the integration of information from numerous sources that frequently provides insight that might otherwise be missed. The discovery process benefits from mining all known information about disease involvement, known ligands, biological selectivity, associated toxicity, and so on. Our databases will address these needs and can be applied to many different protein families.
2a. What are your data sources and how do you obtain data from them? We retrieve information from over 150 sites on the World Wide Web, of which almost 100 are backups for the primary sites. Sources such as GenBank, PROSITE, SWISS-PROT, SCOP, and the Protein Data Bank (PDB) are accessed dynamically, with the desired information extracted automatically and parsed into the appropriate fields. Our systems know where to look for particular types of data and roll over to the secondary site should the primary site be unavailable. Frequently, multiple predictors of the same information are accessed, which provides a degree of verification of the individual predictions. This feature is particularly important given the variability in the quality of data on the World Wide Web. New sources, whether public or proprietary, can easily be added to the system. For example, the Incyte database LifeSeq is used as a source of sequence and expression information. The retrieved information is analyzed, clustered, and represented as an "annotation" on the relevant area of the sequence, or as a protein structure in the structure window. At any time, a scientist can drill down to the raw data, reviewing everything from BLAST search results to PubMed records.
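A rough sketch of this primary-to-backup rollover follows; the URLs, source names, and function are hypothetical placeholders rather than Molecular Applications Group's implementation.

```python
# Sketch of primary-to-backup rollover for Web data sources.
# All URLs and source names are invented placeholders.
from urllib.request import urlopen

SOURCES = {
    # source name -> ordered list of primary and backup sites (invented)
    "swiss-prot": [
        "https://primary.example.org/sprot",
        "https://mirror1.example.org/sprot",
        "https://mirror2.example.org/sprot",
    ],
}

def fetch(source: str, timeout: float = 30.0) -> bytes:
    """Try each site for a source in order, rolling over on failure."""
    last_error = None
    for url in SOURCES[source]:
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError as err:  # site down or unreachable
            last_error = err    # roll over to the next site in the list
    raise RuntimeError(f"all sites for {source!r} failed") from last_error
```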
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? An increasing number of databases that used to be freely available on the World Wide Web are now being privatized. This forces us to obtain licenses for our own use and, in some cases, to request licenses that permit us to redistribute the data. The latter has been particularly problematic. Many scientists and academic institutions have minimal experience negotiating such agreements; as a result, decision making is slow. Since our products depend upon having a rich variety of information available, these situations often require us to look for other information sources rather than dealing with the recognized leader. We have not yet faced any legal issues in creating a derivative database based in part on information extracted from a different database. Should we lose the right to reuse information in the public domain, our entire product focus would be invalidated. For example, sequence alignments available publicly can currently be combined with our proprietary technology to generate Hidden Markov Models for protein families and to produce evolutionary trees. These models and evolutionary trees are then stored with the alignments in a new database, along with other information gleaned from numerous sources. Three-dimensional protein structures published in the PDB provide the data for the prediction of homologous structures. The original structure as well as the calculated structures may also be stored in a database.
Science builds upon science, with one discovery becoming the basis for another. In the past, providing appropriate credit for the source of the information was adequate. Should that situation change, science would be seriously impeded.
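For readers unfamiliar with profile methods, the sketch below is a deliberately simplified stand-in for the Hidden Markov Models mentioned above: it builds a position-specific log-odds profile from a gap-free alignment and scores a candidate sequence against it. A real profile HMM additionally models insertions and deletions; everything here is illustrative.

```python
# Simplified stand-in for a profile HMM: a position-specific log-odds
# profile built from a gap-free protein alignment (illustrative only).
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background, a crude assumption

def build_profile(alignment):
    """Per-column log-odds scores from equal-length aligned sequences."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: math.log(((counts.get(aa, 0) + 1) / (n + 20))
                                     / BACKGROUND)  # +1 pseudocount
                        for aa in AMINO_ACIDS})
    return profile

def score(profile, sequence):
    """Higher scores suggest the sequence fits the family profile."""
    return sum(column[aa] for column, aa in zip(profile, sequence))

# Toy usage: two aligned fragments define the "family"; score a candidate.
family = ["ACDE", "ACDD"]
print(score(build_profile(family), "ACDE"))
```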
3. What are the main cost drivers of your database operations?
Software Development Costs The initial substantial investment is in creating the software upon which the database will run. If the database captures new science, it will require special software. In the case of our DiscoveryBase™ product, the software consists of a tool kit that makes it easy to add additional information resources and analytical capabilities. The ongoing costs for extending this database include the costs of obtaining access to the information sources and a modest investment for extending and updating the database.
The development of the GeneMine Enterprise software and database has been a multiyear, multimillion dollar project. In this case, effort has focused on the database schema, user interface design, and a powerful biological query system that does not require knowledge of SQL. This is a very complex system that is designed to automate the information collection and analysis of feature and structure information for hundreds of genes at a time. The system is completely dependent upon sources from the World Wide Web.
The software for our content databases will reuse some of the software in GeneMine Enterprise; however, software development will still represent a multi-person-year effort. This software will continue to be enhanced over time as new science and new types of information become desirable.
Obtaining Rights to Information The second issue, that of getting rights to the use of desired information, is associated more with the issue of time than of cost. Many information providers have not yet established guidelines and prices for the reuse of electronic information that they provide. Negotiations with such groups go back to the days of the initial electronic delivery of literature searches. Although some progress has been made, we still need to sort out many issues.
Our databases are populated with information derived from numerous different sources on the World Wide Web. If legislation should be passed that makes the creation of derivative databases illegal, all of our database activities as well as our current software products would have to be removed from the market.
Curation One of the most expensive elements of the database activities will be curation. In each case, it is vitally important for the database to be curated by an outstanding scientist who will vet the information proposed for the database. Curation is an ongoing cost that will determine the scientific relevance of the database.
4a. Describe the main products you distribute/sell.
Look™ v.3 The Look™ product accelerates research along the path from sequence to structure by providing a suite of powerful tools to molecular biologists. This product provides scientists with integrated systems for data query and retrieval from a range of available sources via the Internet and intranet. It also provides automated sequence alignment, automated segment match homology modeling, and mutant modeling of protein structures. In addition, the system provides scientists with a convenient mechanism for communication of their ideas by allowing them to link their experimental data, research notes, references, sequence alignments, and 3-D structures through hypertext. The Look™ product is flexible at many levels to accommodate the degree of interactivity desired by expert users. For example, sequence alignments can be adjusted manually, gaps or insertions in a sequence can be added manually, and criteria defining the scope of the search for homologous sequences can be adjusted by the user. By simultaneously highlighting residues in both the sequence and structure windows, the system dynamically links sequence and structure data in one interface, providing valuable insight to molecular biologists. Scientists working toward a better understanding of protein function currently use Look™ to incorporate integrated structure and sequence information into the planning and interpretation of their experiments, accelerating the process of drug discovery.
MacLook™ with Modeling Server The MacLook™ product provides scientists with the same functional capabilities as Look™ v.2.0. The two products differ in that MacLook™ was developed to run on a Macintosh PowerPC computer with a Silicon Graphics workstation acting as the remote modeling server. MacLook™ brings molecular modeling capabilities to the scientist's desktop computer system.
GeneMine™ (Look™ v.3.0 with Discovery Engine) GeneMine™ is an expert bioinformatics data-mining system designed to provide scientists with the automated data query, retrieval, and analysis capabilities required for knowledge discovery. The system supports automated query and collection processes via the Internet and intranet to access data in public, licensed, and proprietary sources. GeneMine™ processes the results—filtering, calculating, and clustering data to extract meaningful information and support comprehensive data analysis and visualization. Sequence, structure, and function information is seamlessly integrated within a single application, enabling the user to visualize broad patterns in a concise, interactive, customized display. The visualization format provides the user with the ability to evaluate individual protein features quickly and efficiently. The system supports sequence alignment and 3-D modeling and enables scientists to communicate their ideas through hypernotes containing linked sequences, structures, and text. GeneMine™ also introduces a functional capability that allows scientists to publish quickly and easily to the World Wide Web in HTML.
DiscoveryBase™ Molecular Applications Group developed DiscoveryBase™ for internal use to duplicate the primary information services that GeneMine™ accesses on the Internet. This server provides us with a secure, stable environment to support our projects and our research programs. Other companies have similar needs; thus, we decided to commercialize this product. Pharmaceutical and biotechnology companies are concerned about the protection of their intellectual property and as a result prohibit people from sending proprietary sequences to Internet servers. Although scientists can use Internet services for sequences in the public domain, these services are constantly changing and sometimes are not available at all. This lack of reliability results in great frustration and inefficiency for the user. To address this issue, major companies have cloned a number of databases from the World Wide Web in-house, which are updated nightly. This approach tends to be limited only to the most heavily used databases and requires significant internal support. DiscoveryBase™ brings information services such as GenBank and other frequently used sources in-house for our customers. This server allows our customers to query and run analyses from a secure internal information server that mirrors the external data. DiscoveryBase™ can be updated nightly to provide the most current source of genomic information to the customer. It needs to be configured at each site to interface with the customer's existing analysis tools.
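The mirroring pattern described here reduces to a very small sketch: copy frequently used public releases to a local server on a nightly schedule so that proprietary queries never cross the firewall. The paths and URLs below are placeholders, not DiscoveryBase™ internals.

```python
# Hypothetical sketch of nightly in-house mirroring of public data releases.
# Paths and URLs are placeholders for illustration only.
import shutil
import urllib.request

MIRRORS = {
    # local path -> public source (both invented)
    "/data/mirror/genbank/release.dat":
        "https://ftp.example.gov/genbank/release.dat",
}

def nightly_update() -> None:
    """Refresh each local copy from its public source (run once per night)."""
    for local_path, url in MIRRORS.items():
        with urllib.request.urlopen(url) as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # replace the local copy wholesale
```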
Stingray™ Expression Analysis System Molecular Applications Group has a partnership with Affymetrix in the development of software modules for mining of differential gene expression data. Our first products will be available in the second quarter of this year. They will include (1) algorithms for clustering the results of expression experiments in various ways, such as clustering genes that appear to be coregulated; (2) the linkage of genes of interest to stored analyses of structural and feature information for the 6,800-gene human chip and the 6,500-gene mouse chip; and (3) databases that subset expression data via hierarchically organized pathway function classes and automate this prediction where feasible.
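As an illustration of the first module's goal, and not of Stingray's actual algorithms, genes are often called coregulated when their expression profiles correlate strongly across experiments; a minimal greedy grouping on that basis might look as follows.

```python
# Illustrative grouping of apparently coregulated genes: genes whose
# expression profiles correlate above a threshold share a cluster.
# Not Stingray's algorithm; a minimal greedy sketch.
from statistics import correlation  # Python 3.10+

def cluster_coregulated(expression, threshold=0.9):
    """expression: {gene: [level per experiment]} -> list of gene clusters."""
    clusters = []
    for gene, profile in expression.items():
        for cluster in clusters:
            if correlation(profile, expression[cluster[0]]) >= threshold:
                cluster.append(gene)  # greedy: join first matching cluster
                break
        else:
            clusters.append([gene])
    return clusters

# Toy usage with three experiments per gene.
data = {"geneA": [1.0, 2.0, 3.0],
        "geneB": [2.1, 4.0, 6.2],
        "geneC": [3.0, 1.0, 0.5]}
print(cluster_coregulated(data))  # geneA and geneB cluster together
```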
4b. What are the main issues in developing those products? We are competing against internal development by the major pharmaceutical companies as well as against organizations that are receiving substantial government support. We conduct thorough marketing research to be certain that we are addressing an important question scientifically and one that will have a significant market. However, this field is moving rapidly, so one can never be certain that an opportunity identified today will still be significant once the development effort is completed.
The scarcity of appropriate talent is also an issue. There are few talented bioinformaticists and even fewer software developers who are comfortable with the science. Since the environment is competitive, these individuals command significant salaries—a factor that drives up development costs. Consequently, adequately staffing the project team is difficult, even for a high-priority project.
We try to protect our software and databases through license agreements that recognize the use of the software for creation of derivative databases for internal use, but we place a restriction on the use of the software for creating databases for commercialization. To the best of our knowledge, only one company is using our software for the development of a commercial product; however, several have expressed interest in doing so. We recognize that our products represent only one of numerous modules that any company must have available and that the systems need to be integrated with existing technology.
Certainly, an overwhelming issue would result from any change in copyright law that limited our ability to extract data from multiple sources and to add value to those data through the use of our proprietary technology.
4c. Are you the only source of all or some of your data products? If not, please describe the competition you have for your data products and services. The bioinformatics market is a highly competitive one with new companies being announced almost weekly. Each company has distinctive technology, but there is some overlap as well. The only product we market that is subject to direct competition is DiscoveryBase™. In this case, some of our customers have developed similar products internally. In addition, at least four other companies market similar products. Our other database products appear to be more distinctive—at least at this time.
5a. What methods/formats do you use in disseminating your products? Our older products such as Look™, DiscoveryBase™, and GeneMine™ are downloaded from the World Wide Web and the license authorization is provided electronically. Our newer products such as Stingray™ will require installation by experts from Molecular Applications Group on each site. These are complex products that use Oracle and Java. Both server and client software will need to be installed.
5b. What are the most significant problems you confront in disseminating your data? The most significant problems we face are minor issues with the license authorization, usually due to lack of knowledge on the part of the user who inadvertently does something incorrectly. A more important problem is making certain that our software operates properly in each environment and successfully gets through the firewall for its information retrieval. In addition, we frequently have to customize the software to access internal information sources in each company.
6a. Who are your principal customers (categories/types)? Our customers fall into three major categories: (1) commercial, (2) not-for-profit or government laboratories, and (3) academic. Over 25 commercial organizations use our software with many of these using it on multiple sites. Our software is installed at over 100 institutions, many of which are international sites.
6b. What terms and conditions do you place on access to and use of your data? Our terms are standard across all categories with a standard license agreement being signed. The only nonstandard term is the statement that the software license does not support the use of the software for creating a database for commercialization. Occasionally, changes in wording are implemented at the request of a customer's lawyer, but we strive to minimize substantive changes.
6c. Do you provide differential terms for certain categories of customers? At this time, the standard terms apply. We are only now beginning discussions with companies that would like to use our software to enrich a commercial product.
7a. What are the principal sources of funding for your database activities? The majority of our funding to date has been provided by license revenues and by our venture capital investors. For some products, we expect to receive advance subscriptions to support development activities.
7b. What pricing structure do you use and how do you differentiate (e.g., by product, time format, type of customer, etc.)? Each product is priced according to the functional scope of the product and whether it is designed for use by only a small number of people or if it is an enterprise system. We typically have single-user licensing with volume discounts provided for multiple copies of the software. In the case of those products designed to function across the enterprise, the pricing is on a “per seat” basis for the client licenses and a separate fee is charged for the server license. Site or corporate licenses are also available for unlimited use.
In general, substantial discounts are provided in the United States for not-for-profit organizations and academic institutions. The price for not-for-profit organizations or government laboratories is typically about 10 to 20 percent of the price charged for commercial use of the software. The price for academics is 5 to 10 percent of the commercial price for our current products. The ratios are considered with each new product and hence may be subject to change. In some cases, customer support requirements after the sale is completed may prevent us from discounting as heavily as suggested here.
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. Since we are a privately held company, our policy is not to discuss revenue issues.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? Databases licensed or sold by academic groups increasingly present problematic licensing issues. In some cases, we would like to be able to sublicense the database to our customers, but this is frequently not possible. Some academic software, and software from organizations such as EMBL, is now becoming the basis for commercial ventures, which further complicates our negotiations. Owing to limited experience, all institutions appear to take the most conservative position possible.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? At this point, we have not had any legal problems with protecting our databases or with misuse of our data.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them? We use our license agreement to define appropriate use.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? The current policies have been adequate from our perspective. As an increasing number of information sources become private, compulsory licensing might be needed for products created with government funds.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? One of the major differences is that some of our competitors provide databases, which require royalties to be paid if the use of the database results in the discovery of a product that is later marketed. This difference greatly intensifies the issue of license compliance for those companies and results in more stringent license requirements.
General Discussion
DR. SCOTCHMER: This is a question partly to you but also to the panel. SWISS-PROT here has emerged as one example of the trend toward strong property protection—I am sure there will be others—in the sense of commercializing or privatizing public data and imposing increasing burdens on science. From a more integrated economic perspective, the justification for that would be a need to have those kinds of price returns in order to support their operation and generate information and so on. I am wondering whether there is any justification for these high prices that they are charging, in terms of what it costs them to provide their service. I am asking, is there any public-policy justification for those kinds of prices?
DR. WILLIAMS: At this point, I don't believe that any of the bioinformatics companies are profitable. As I indicated, putting together these products costs us many millions of dollars. GeneBio, the company distributing SWISS-PROT, can run on a much smaller budget because the Swiss Institute generates the product and all GeneBio has to do is commercialize it. Therefore, they should be profitable earlier than companies that must also cover R&D costs.
Another company, Lion Bioscience, has exclusive rights to commercialize several software and database capabilities developed by EMBL. So, we are beginning to see this happen. Lion also has multimillion-level funding from the German government. Thus, the privatization of information that used to be in the public domain is something that has been occurring increasingly.
It is very expensive not only to produce the sort of systems that we have all been describing today, but also to curate them. Although people may feel that current prices are too high, what they really need to do is to take a step back and look at the value to drug discovery.
I visited a company yesterday to discuss a product that we are considering developing that focuses on G-protein-coupled receptors (GPCRs). Over half of all drugs marketed today act on one of these receptors. A scientist at the company highlighted the importance of this information to drug discovery and commented that considerable effort had been dedicated internally to generating a small subset of the information we were proposing. Moreover, the company lacked access to algorithms comparable to ours that can be used to provide new insight into this important drug class.
If you consider that a scientist costs at least $200,000 a year, the dedication of three people to a project represents an annual investment of over $500,000. Many laboratories, for example, have at least one person who does nothing other than bring data in from the Web to make this information available in their proprietary environment. Commercial systems that cost less than $50,000 annually are available to do this. It does not seem prohibitive for a company to spend $50,000 for such routine services while both saving internal resources and freeing employees to work in areas of greater proprietary value.
Up until the time Incyte hit the market, people didn't value information very highly. In fact, it was very difficult to value information by trying to assess a concrete impact on drug discovery. Now we can assign value, because we can show where the use of information lets you accomplish things you couldn't have done otherwise, or that you were accomplishing through a manual process that required person-years of effort.
We are still in that transition as people learn to value information. The Incyte database was the first product that I can remember for which R&D groups were willing to pay huge sums to get access to proprietary information.
MR. GLICK: You said that sequences, if they are on the Internet, are considered to be published. Can you explain that? I am particularly interested in the implications and why someone would want a sequence to be considered published.
DR. WILLIAMS: If a sequence is proprietary, as I understand it, that sequence cannot be transmitted over the public Internet without encryption. The Internet is not considered a secure environment; as a result, the release of a proprietary sequence to the Internet is essentially the same as publishing it. Scientists work in a restricted environment with great concern over doing anything that would prevent their proprietary material from being patented later on.
DR. MARTINEZ: Joe Martinez, Department of Energy. Part of the liability issue associated with providing information in databases to individuals who might derive products from them arises if the information they contain is somehow incorrect.
DR. WILLIAMS: That issue is one that we certainly have not had to face yet. We are providing science captured in our database products, and science is not perfect. We do a lot of work with companies in validating the science that we are including. The nature of science is that as we become wiser and improve our scientific strategies, we will find that there were errors in the way that we have done it in the past. At least in the past, the onus has really been on the recipient to verify accuracy, at all times; just because something comes out of a computer does not mean that the answer is right.
We work very, very hard at validating our algorithms, and we publish that validation. An analogous situation might involve a pharmaceutical company that has performed a comprehensive analysis of a drug candidate. They believe the drug is safe, they market the drug, and unexpected adverse reactions occur. In that case, as long as they can prove that there was no way to anticipate those reactions, they have some degree of protection. Our goal is to make certain that we validate our products, that we curate these products, that we strive for the highest possible quality, and that these actions will provide us some assurance.
In the area of basic research, I can't think of anyone yet who has ever been sued based on an algorithm having an error. It would be different, obviously, for other applications that have direct commercial relevance. Our software enables predictions to be made. The real validation still occurs in the laboratory or in the clinic.
DR. LOFTUS: I hope this panel has given you a flavor of the excitement in this area of genomics. You have certainly seen from these sessions that there is real value in the database products that can be created from genomic data. You have also seen that there is real value that can be added, scientifically and in terms of products.
I think you heard a strong message from all three speakers saying that for the science to progress, and even for the commercial part of the market to progress, the whole is more than the sum of the parts. You also heard strong messages saying that the ability to access information across those databases, to take a strong cross-sectional view of that information and combine it in new and imaginative ways, is also a key to success.
DR. SERAFIN: The next panel focuses on chemical and chemical engineering data. The moderator is Roberta Saxon. She is a patent agent with Skjerven, Morrill, MacPherson, Franklin & Friel, LLP.
CHEMICAL AND CHEMICAL ENGINEERING DATA PANEL
DR. SAXON: In my previous life I was doing research in chemistry, possibly making some contributions and certainly using the products of some of our speakers. So, it is a pleasure to start off. Our first speaker is Richard Kayser, who is the chief of the Physical and Chemical Properties Division of the National Institute of Standards and Technology, which is a division of the Department of Commerce.
Government Data Activity
Richard Kayser, National Institute of Standards and Technology
Response to Committee Questions
1a. What is the primary purpose of your organization? An agency of the U.S. Department of Commerce's Technology Administration, the National Institute of Standards and Technology (NIST) exists to promote U.S. economic growth by working with industry to develop and apply technology, measurements, and standards. Within NIST, the Measurement and Standards Laboratories are responsible for providing the nation and U.S. industry with the technology infrastructure (reference measurements, standards, and data) needed to underpin commerce both within the United States and abroad.
As one of the seven Measurement and Standards Laboratories at NIST, the Chemical Science and Technology Laboratory provides the nation's measurement infrastructure in the areas of chemistry, biotechnology, and chemical engineering. Within the Chemical Science and Technology Laboratory, the Physical and Chemical Properties Division is the nation's reference laboratory for the thermophysical and thermochemical properties of gases, liquids, and solids and for the rates and mechanisms of chemical reactions in the gas and liquid phases. The Chemical Science and Technology Laboratory and the Physical and Chemical Properties Division have adopted, as one of their three principal goals, assuring that U.S. industry has access to accurate and reliable data and predictive models for determining the chemical and physical properties of materials and processes.
In 1968, NIST established its formal program on data evaluation, the Standard Reference Data Program, in response to congressional legislation to ensure that "critically evaluated data is available to scientists, engineers, and the general public." The program built upon a decades-long NIST tradition of data evaluation in thermochemistry, thermophysics, and atomic spectroscopy. Today, the Standard Reference Data Program, together with the NIST Measurement and Standards Laboratories, coordinates on a national level the production and dissemination of critically evaluated reference data for the physical sciences and engineering. The Physical and Chemical Properties Division is a major contributor to that effort and oversees the majority of data evaluation activities at NIST in chemistry and chemical engineering. As the measure of data quality, data evaluation is a crucial link in the measurement chain.
1b. What are the main incentives for your database activities (both economic and other)? The main incentives for NIST's database activities stem directly from its mission to promote U.S. economic growth by working with industry to develop and apply technology, measurements, and standards.
The NIST Act of 1988 (15 U.S.C. 271 et seq.) states: “The future well-being of the United States economy depends on a strong manufacturing base and requires continual improvements in manufacturing technology, quality control, and techniques for ensuring product reliability and cost effectiveness.” To assure that future well-being, the Act authorizes and directs NIST to “determine, compile, evaluate, and disseminate physical constants and the properties and performance of materials when they are important to science, engineering, manufacturing, education, commerce, and industry and are not available with sufficient accuracy elsewhere.”
Similarly, the Standard Reference Data Act of 1968 (15 U.S.C. 290-290f) states: “The Congress hereby finds and declares that reliable standardized scientific and technical reference data are of vital importance to the progress of the Nation's science and technology. It is therefore the policy of Congress to make critically evaluated reference data readily available to scientists, engineers, and the general public. It is the purpose of this Act to strengthen and enhance this policy.” The Act authorizes and directs the Department of Commerce (NIST) “to provide or arrange for the collection, compilation, critical evaluation, publication and dissemination of standard reference data.” It empowers the Department to recover the costs of producing and disseminating reference data and to copyright, on behalf of the United States, standard reference data prepared or made available under the Standard Reference Data Act.
Evaluated chemical data are important in diverse areas, including research and development, process and product design, energy efficiency, chemical analysis and identification, custody transfer, and safety, health, and the environment. For example, “process modeling and simulation” has emerged in recent years as a key enabling technology in many industries, and the availability and accuracy of massive amounts of data are crucial to generating results that can be used with confidence. Applications of process modeling and simulation range from the design of chemical plants and air-conditioning and refrigeration equipment to the modeling of combustion and semiconductor manufacturing processes.
2a. What are your data sources and how do you obtain data from them? In many areas of chemistry and chemical engineering, NIST relies primarily on experimental measurements published in the open literature. NIST acquires such data by at least three different mechanisms: (1) direct acquisition of the data from the literature by members of the NIST staff; (2) direct collection of such data by outside experts under grants or contracts from NIST; and (3) donations of private data collections to NIST. NIST also uses experimental measurements reported in master's and PhD theses and in data archives such as VINITI in Russia.
In some cases, NIST acquires extensive sets of experimental data for specific data efforts. NIST often performs such measurements itself but also acquires such data from outside organizations under grants or contracts or via donations. Sometimes these data never appear in the archival literature.
In addition to primary experimental results, NIST uses the results of evaluations published by outside experts. In an evaluation, experts analyze multiple data sets, obtained with a single measurement technique or with several, and choose or generate a "preferred" value and an associated uncertainty. In many cases, evaluation involves examining interrelated data and measurements to ensure internal consistency. In fields with a well-developed theoretical underpinning, evaluation may involve theoretical calculations.
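One standard statistical device in such evaluations, though not necessarily the one applied in any given NIST assessment, is the inverse-variance weighted mean of independent measurements, sketched below.

```python
# Generic sketch of one common evaluation step: combining independent
# measurements of the same quantity by inverse-variance weighting.
# This illustrates a textbook device, not NIST's procedure for any dataset.
def weighted_mean(measurements):
    """measurements: list of (value, standard_uncertainty) pairs."""
    weights = [1.0 / (u * u) for _, u in measurements]
    total = sum(weights)
    mean = sum(w * x for w, (x, _) in zip(weights, measurements)) / total
    uncertainty = total ** -0.5  # standard uncertainty of the weighted mean
    return mean, uncertainty

# Toy usage: three laboratory values for the same property.
print(weighted_mean([(4.18, 0.02), (4.21, 0.05), (4.19, 0.01)]))
```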
Increasingly, NIST is using state-of-the-art computational chemistry methods as a source of data. Such methods have improved to the point where in some areas, e.g., gas-phase thermochemistry, the calculations are nearly as accurate as the experiments but much easier to do and less expensive. This trend will accelerate as both algorithms and computing capabilities continue to advance.
Several recent developments are having an impact on the availability of primary experimental data. First, many journals are no longer willing to publish extensive tables of experimental data (or calculations for that matter), especially when the measurements are considered "routine"; for this reason, NIST started some time ago publishing extensive internal reports, and the journals themselves started providing the data as supplementary information. For some journals, this information is now available free of charge via the Internet. Second, over the past several decades, the United States as a source of high-quality experimental data for chemistry and chemical engineering has been declining relative to Europe and Asia; thus, many sources of such data are now overseas. Third, because of pressure on R&D resources worldwide, researchers in some technical areas are coordinating their efforts, often internationally, and depositing and sharing their results in data depositories.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? The principal barriers to obtaining data from the open literature are locating all the data and putting them in a common electronic format. Several factors are exacerbating this problem, including the proliferating number of scientific and technical journals and the growth in Web publishing. The appearance of electronic journals (and associated collections of supplementary data) and of ever more powerful programs for searching the literature represent countervailing trends. Finding and obtaining data from more obscure sources remains difficult.
The principal barriers to integrating data are putting them into common electronic formats (often starting with hard copy only, e.g., as in the case of many spectra); adding auxiliary information such as Chemical Abstracts Service registry numbers and chemical structures (in electronic form); and dealing with missing, incomplete, or unclear information, e.g., concerning experimental conditions or measurement uncertainties. In addition, there are no universally accepted data exchange standards, which makes it difficult to integrate data obtained from different sources and in different formats. On the positive side, the Internet has emerged as a powerful means of communicating and exchanging information and offers dramatic new possibilities for collaborating on data activities both within and across organizations.
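To make the format problem concrete, the sketch below normalizes two invented source layouts into one record shape; all field names and values are hypothetical.

```python
# Invented illustration of the integration problem: two sources report the
# same property in different layouts and units; a common record is built.
def from_source_a(row):
    # source A layout (hypothetical): name, CAS number, boiling point in C
    return {"name": row["name"], "cas_rn": row["cas"],
            "boiling_point_K": row["bp_celsius"] + 273.15}

def from_source_b(row):
    # source B layout (hypothetical): compound, registry, boiling point in K
    return {"name": row["compound"], "cas_rn": row["registry"],
            "boiling_point_K": row["boiling_K"]}

records = [
    from_source_a({"name": "water", "cas": "7732-18-5", "bp_celsius": 100.0}),
    from_source_b({"compound": "water", "registry": "7732-18-5",
                   "boiling_K": 373.15}),
]
print(records)  # both rows now share one schema and one unit convention
```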
Finally, financial limitations always constitute a significant barrier to obtaining and integrating data.
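As a minimal sketch of the format-integration barrier described above, the fragment below maps records from two differently formatted sources onto a single schema keyed by Chemical Abstracts Service registry number, converting units along the way. The field names and source layouts are hypothetical assumptions for illustration, not actual NIST formats.

```python
# Hypothetical source formats; the target schema keys records by
# CAS registry number and stores values in a single unit (kJ/mol).

def from_source_a(rec):
    return {"cas_rn": rec["CASRN"], "name": rec["ChemName"],
            "value": float(rec["Hf_kJ_mol"]), "units": "kJ/mol"}

def from_source_b(rec):
    # Source B reports kcal/mol; convert to the common unit.
    return {"cas_rn": rec["rn"], "name": rec["compound"],
            "value": float(rec["hf"]) * 4.184, "units": "kJ/mol"}

merged = {}
for rec in [from_source_a({"CASRN": "71-43-2", "ChemName": "benzene",
                           "Hf_kJ_mol": "82.9"}),
            from_source_b({"rn": "74-82-8", "compound": "methane",
                           "hf": "-17.9"})]:
    merged[rec["cas_rn"]] = rec
print(merged)
```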
3. What are the main cost drivers of your database operations? Cost drivers vary significantly across NIST's 40 data activities. In general, they include the cost of acquiring data experimentally; acquiring data from external sources; selecting and acquiring relevant papers from the literature and extracting the data; confirming or adding auxiliary information such as formulas, structures, Chemical Abstracts Service registry numbers, experimental conditions, and uncertainties; putting data and auxiliary information in a common electronic format; evaluating the data; developing models to represent the data within their uncertainties; packaging the data in electronic form with appropriate tools for accessing, displaying, and using the data; distributing the data; and providing technical support. Many of these activities are ongoing and highly labor intensive.
The response to question 4a describes four different NIST databases in chemistry and chemical engineering. The main cost drivers for these databases are as follows:
NIST/EPA/NIH Mass Spectral Library (NIST 98) In NIST 98, the data consist primarily of complete experimental spectra, which have been acquired specifically for the NIST library and have been critically evaluated. Cost drivers include evaluation (including the development of evaluation tools); confirmation or addition of auxiliary information; experimental acquisition of data; and packaging of the data in an electronic database with appropriate tools for accessing, displaying, and using them.
NIST Thermodynamic and Transport Properties of Refrigerants and Refrigerant Mixtures Database: Version 6.0 (REFPROP) In REFPROP, the principal results are mathematical models that have been developed to represent extensive sets of high-quality experimental data within their uncertainties and that can be used to calculate essentially any thermophysical property of selected pure fluids or fluid mixtures with high accuracy over wide ranges of conditions of temperature, pressure, and composition. Cost drivers are evaluation; development of models to represent the data within their uncertainties; and selection and acquisition of relevant papers from the literature and extracting the data. This program also involves extensive experimental measurements and theoretical work, neither of which is included here as a cost driver.
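A toy illustration of what “representing the data within their uncertainties” entails is given below: a simple model is fit to hypothetical measurements, and each residual is checked against the stated experimental uncertainty. Actual REFPROP models are far more elaborate equations of state; the quadratic form and the numbers here are assumptions made only for illustration.

```python
import numpy as np

T = np.array([280.0, 300.0, 320.0, 340.0])        # temperature, K
rho = np.array([1320.0, 1250.0, 1175.0, 1095.0])  # measured density, kg/m^3
u = np.array([2.0, 2.0, 2.5, 3.0])                # stated uncertainties, kg/m^3

coeffs = np.polyfit(T, rho, 2)                 # toy model: quadratic in T
residuals = rho - np.polyval(coeffs, T)

# The model "represents the data within their uncertainties" if no
# residual exceeds the stated uncertainty of its data point.
assert np.all(np.abs(residuals) <= u)
```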
NIST Chemical Kinetics Database: Version 2Q98 The Kinetics Database consists mostly of experimental data obtained from the open literature. The major cost drivers are evaluation, together with the selection and acquisition of relevant papers from the literature and the extraction of the data.
NIST Chemistry WebBook In the WebBook, NIST primarily makes already-existing data collections available over the Internet. Cost drivers of this database include packaging already-existing data in an electronic database with appropriate tools for accessing, displaying, and using the data; and converting existing data and auxiliary information to a common electronic format.
4a. Describe the main products you distribute/sell. NIST makes available over 60 databases and online data systems, including more than a dozen in the areas of chemistry and chemical engineering. Of the following illustrative examples, three are sophisticated but quite different personal-computer-based packages that have gained wide acceptance and approval, and the fourth is a popular online source of chemical reference data; all these products have extensive help systems. NIST also regularly publishes standard reference data in the archival literature (e.g., in the Journal of Physical and Chemical Reference Data) and in publicly available reports and monographs.
NIST/EPA/NIH Mass Spectral Library (NIST 98) NIST 98 is the world's largest collection of evaluated mass spectra for use in identifying unknown chemicals via their electron-impact-ionization fragmentation patterns. Virtually all of the 3,000 mass spectrometers sold annually for identifying unknown chemicals incorporate the NIST library and algorithms in their data analysis systems.
NIST 98 contains 129,136 evaluated spectra for 107,886 compounds. It is the product of a 10-year effort by a team of experienced mass spectrometrists in which each spectrum was examined for correctness. This led to thousands of selections, deletions, and modifications to produce an optimal reference library for compound identification by spectrum matching and library searching.
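At the heart of library searching is the ranking of reference spectra by similarity to the spectrum of an unknown. The sketch below uses a plain normalized dot product over spectra represented as {m/z: intensity} maps; it is an idealized stand-in, since the actual NIST algorithms use weighted and otherwise more sophisticated match factors.

```python
import math

def match_factor(unknown, reference):
    """Normalized dot product (cosine similarity) between two mass
    spectra, each a dict mapping m/z to intensity."""
    peaks = set(unknown) | set(reference)
    dot = sum(unknown.get(m, 0.0) * reference.get(m, 0.0) for m in peaks)
    norm = (math.sqrt(sum(v * v for v in unknown.values())) *
            math.sqrt(sum(v * v for v in reference.values())))
    return dot / norm if norm else 0.0

def library_search(unknown, library):
    """Rank library compound names by similarity to the unknown."""
    return sorted(library, reverse=True,
                  key=lambda name: match_factor(unknown, library[name]))
```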
NIST 98 incorporates 75 percent more spectra than its predecessor, including many complete, high-quality spectra measured specifically for the library or taken from major practical collections of spectra of commercially important chemicals, crime-related chemicals, flavors and fragrances, toxic chemicals, drugs, urinary acids, and chemical-weapons-related chemicals.
NIST 98 is available in an ASCII version or with the enhanced, full-featured NIST MS Search Program for Windows with integrated tools for GC/MS deconvolution, MS interpretation, and chemical substructure analysis.
NIST Thermodynamic and Transport Properties of Refrigerants and Refrigerant Mixtures Database: Version 6.0 (REFPROP) REFPROP is the de facto standard in the refrigeration industry and in research labs for the property data needed to evaluate new non-ozone-depleting refrigerants and to optimize the energy efficiency of heat pumps and other refrigeration equipment.
Version 6.0 is a complete revision based on the most accurate pure fluid and mixture models currently available. Users may generate tables and plots of the thermodynamic and transport properties of any of 33 pure fluids and of mixtures with up to 5 components given a wide variety of possible input conditions. Many commercially available refrigerant blends are predefined in the database.
A separate Windows-based graphical user interface provides a convenient means of accessing the models in REFPROP, which are implemented in a suite of FORTRAN subroutines. An online help system provides information on how to use the program. Data information screens and documentation for the property models are available at any time. Numerous options exist for importing and exporting data.
NIST Chemical Kinetics Database: Version 2Q98 The NIST Chemical Kinetics Database provides a unique tool for producers and users of gas-phase kinetic data. With a few commands, users of the database can examine all of the data on many different reactions, compare the rates measured to their own data, generate files for inclusion in a modeling program, or produce citations for use in a word processor.
Data coverage in version 2Q98 is current through the first quarter of 1998. The data include 37,400 rate constants; 15,000 reactions with 11,400 distinct reactant sets; 9,000 compounds; and 11,200 literature references.
Searching is possible by reactants, by author (all authors in a given paper are included), for reactions in a particular paper, and for all reactions producing a given product. The user may select and fit sets of rate data to Arrhenius equations using least-squares fitting and may edit the resulting graphics on the screen and save the fits to a file suitable for use in a modeling program. Users can also enter their own data and comments, which are then displayed and graphed with literature data. Graphical output is via Windows drivers.
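The Arrhenius fitting mentioned above rests on linearizing k = A exp(-Ea/(R T)) to ln k = ln A - (Ea/R)(1/T), so that ordinary least squares applies. A minimal sketch with hypothetical rate data:

```python
import numpy as np

R = 8.314  # gas constant, J/(mol K)

# Hypothetical rate constants k measured at temperatures T (K):
T = np.array([300.0, 400.0, 500.0, 600.0])
k = np.array([1.2e-14, 8.5e-13, 1.1e-11, 6.0e-11])

# Least-squares straight line through (1/T, ln k):
slope, intercept = np.polyfit(1.0 / T, np.log(k), 1)
A = np.exp(intercept)   # pre-exponential factor
Ea = -slope * R         # activation energy, J/mol

print(f"A = {A:.2e}, Ea = {Ea / 1000:.1f} kJ/mol")
```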
NIST Chemistry WebBook The NIST Chemistry WebBook is NIST's first large-scale effort to make its major collections of thermochemical, thermophysical, and spectral reference data for industrially important chemicals available over the Internet. In two years the WebBook has become by far the most comprehensive source of chemical reference data available on the Web, with data for almost 32,000 chemical species.
The current version of the NIST Chemistry WebBook contains thermochemical data for over 5,000 organic and small inorganic compounds; reaction thermochemistry data for over 8,000 reactions; infrared spectra for over 5,000 compounds; mass spectra for over 10,000 compounds; ultraviolet/visible spectra for over 400 compounds; electronic and vibrational spectra for over 3,000 compounds; constants of diatomic molecules (spectroscopic data) for over 600 compounds; ion energetics data for over 14,000 compounds; and thermophysical property data for 16 fluids.
Those accessing the WebBook can search for data on specific compounds based on chemical name, chemical formula, Chemical Abstracts Service registry number, molecular weight, or selected ion-energetic and spectral properties.
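Supporting several search keys for the same record amounts to maintaining multiple indexes over one collection, as the toy sketch below shows; the records are illustrative and do not reflect WebBook internals.

```python
# Index the same compound records under several identifiers so that a
# query by name, formula, or CAS registry number reaches one record.
records = [
    {"name": "water", "formula": "H2O", "cas_rn": "7732-18-5", "mw": 18.015},
    {"name": "methane", "formula": "CH4", "cas_rn": "74-82-8", "mw": 16.043},
]

index = {}
for rec in records:
    for key in (rec["name"], rec["formula"], rec["cas_rn"]):
        index[key.lower()] = rec

def lookup(query):
    return index.get(query.lower())

assert lookup("H2O") is lookup("7732-18-5") is lookup("water")
```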
4b. What are the main issues in developing those products? See answers to questions 2 and 3. Developing standard reference databases is generally a long-term proposition requiring stable, long-term funding. Thus, obtaining sufficient long-term support from NIST is a key issue and one exacerbated by a lack of funding growth. Within NIST, data activities must compete for funds with all other technical programs, and the criteria used to allocate resources among competing programs are (a) the magnitude and time frame of the industrial need to be addressed, (b) the degree of correspondence between a particular need and NIST's mission, (c) the opportunity for NIST participation to make a major difference, (d) the nature and size of the anticipated impact resulting from NIST's participation, (e) NIST's capability to respond in a timely fashion with a high-quality solution, and (f) the nature of opportunities afforded by recent advances in science and technology. To thrive within NIST, a data program must score high against these criteria.
4c. Are you the only source of all or some of your data products? If not, please describe the competition you have for your data products and services. NIST's data products are unique for three reasons. First, NIST specializes in comprehensive, high-accuracy data and critical data evaluation. Second, NIST has a mandate to serve as an impartial source of measurements, standards, and data, including standard reference data. Third, NIST cooperates with or remains cognizant of other data programs worldwide to ensure that data activities are complementary rather than overlapping. The four databases described above are unique.
5a. What methods/formats do you use in disseminating your products? NIST distributes its data products using a variety of methods/formats determined primarily by customer needs. The Standard Reference Data Program is the central point of contact for all electronic databases available from NIST.
NIST distributes some databases in electronic form on CD-ROM or floppy disks (e.g., Mass Spec, REFPROP, Chemical Kinetics) and some databases online (e.g., the NIST Chemistry WebBook). NIST also enters into numerous agreements with secondary distributors of NIST data products (e.g., manufacturers of mass spectrometers in the case of Mass Spec; the Air Conditioning and Refrigeration Institute in the case of REFPROP; Aspen Technology in the case of the properties of water and steam). NIST also publishes papers, monographs, and reports in the open literature (e.g., proton affinity database, the NIST/Joint Army-Navy-Air Force Thermochemical Tables) and contributes to data efforts outside of NIST (e.g., NASA and International Union of Pure and Applied Chemistry data panels on atmospheric chemistry, AIChE/DIPPR projects on data for chemical process design).
NIST also customizes its methods/formats in response to specific customer needs and concerns. In the case of REFPROP, NIST provides the FORTRAN subroutines for the underlying thermophysical property models because many users want to incorporate these subroutines in proprietary equipment design codes. In the case of Mass Spec, the NIST database can read the data files of commercial instruments in their native formats, which facilitates the use of the NIST search algorithms while ensuring that the data are represented properly. In the case of the WebBook, NIST intends to make an intranet version available to alleviate the concerns of some organizations that accessing the publicly available version could compromise their proprietary information.
NIST will continue to distribute its data products in a variety of forms driven by customer needs. However, we can expect that the Internet will continue to grow rapidly as a method/format for distributing chemical and chemical engineering data and for communicating and exchanging data with users and with other data activities around the world. In addition, the demand for comprehensive databases with broad coverage will continue to grow, leading to the consolidation and integration of smaller, specialized databases. Finally, standardized formats will undoubtedly emerge to facilitate data exchange.
5b. What are the most significant problems you confront in disseminating your data? Problems in disseminating data vary from database to database. General problems include making potential customers aware that databases exist, getting potential customers to pay a reasonable fee (a problem exacerbated by NIST being a U.S. government agency; also see question 7a), keeping up with changing dissemination technology, and overcoming bureaucratic obstacles to entering into licensing and distribution agreements with outside parties. NIST has addressed these problems for the four examples described above by working closely with customers throughout the database development process.
6a. Who are your principal customers (categories/types)? Customers for NIST databases vary significantly from database to database, but generally include scientists and engineers from industry, academia, and other U.S. government agencies. For the four examples presented under question 4a, the principal customers are as follows:
NIST/EPA/NIH Mass Spectral Library (NIST 98) NIST 98 has approximately 4,500 customers per year, including manufacturers of mass spectrometers (e.g., Hewlett Packard, Varian, Finnigan), who along with other organizations act as secondary distributors of the NIST library and associated algorithms; and users of mass spectrometers, primarily in applications involving the identification of unknown chemical compounds, e.g., research and development, health care, forensics, environmental measurements, and chemical and drug manufacturing.
The NIST Thermodynamic and Transport Properties of Refrigerants and Refrigerant Mixtures Database: Version 6.0 (REFPROP) This database has approximately 150 customers per year, including scientists and engineers in the air conditioning and refrigeration industry (e.g., Copeland, York, Carrier, Trane), primarily in the design and optimization of air-conditioning and refrigeration equipment; scientists and engineers in the chemical industry (e.g., DuPont, Allied Signal), primarily to identify and characterize new products for the air-conditioning and refrigeration industry; and researchers in academia and other U.S. government agencies working in the aforementioned areas.
NIST Chemical Kinetics Database: Version 2Q98 This database has approximately 150 customers per year, including researchers in industry (35 percent), academia (50 percent), and other U.S. government agencies (15 percent) working in areas such as combustion, atmospheric chemistry, and chemical and materials processing.
NIST Chemistry WebBook This database has 6,000 to 8,000 distinct users per week, approximately half of whom are return customers, including scientists and engineers working in research and development in industry, academia, and other U.S. government agencies; and teachers and students in high school and college using the WebBook in classes. Of all users, 15 percent are from U.S. industry and 20 percent are from academia.
6b. What terms and conditions do you place on access to and use of your data? NIST electronic databases are available for sale to any interested party through the NIST Standard Reference Data Program and from secondary distributors who have entered into licensing agreements with NIST.
All NIST databases include the following copyright statement: “©1998 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved. No part of this database may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the distributor.”
NIST electronic databases also include the following disclaimer: “The National Institute of Standards and Technology (NIST) uses its best efforts to deliver a high quality copy of the Database and to verify that the data contained therein have been selected on the basis of sound scientific judgment. However, NIST makes no warranties to that effect, and NIST shall not be liable for any damage that may result from errors or omissions in the Database.”
6c. Do you provide differential terms for certain categories of customers? NIST treats all customers equally. All individual users receive the same terms as do all secondary distributors. Options made available to one are made available to all.
7a. What are the principal sources of funding for your database activities? Congressional appropriations are the principal source of funding for NIST data activities. Some data activities also receive external support from U.S. industry and other government agencies, and from sales of databases. Although the Standard Reference Data Act allows NIST in principle to recover the costs of all data activities from the sales of databases, few data activities are capable of generating enough income from sales to offset a significant fraction of their costs.
The biggest exception to the above is the NIST/EPA/NIH Mass Spectral Library, for which all of the funding derives from sales of the library and associated algorithms; that is, the mass spec data program is self-supporting. Although they both depend on appropriated funds, REFPROP and Chemical Kinetics are also exceptions. REFPROP has received substantial outside support from U.S. industry and from other government agencies, and significant income from sales. Similarly, Chemical Kinetics has leveraged externally funded efforts in kinetics and also received significant income from sales. The Chemistry WebBook has relied entirely on appropriated funds thus far; however, many of the data collections that the WebBook includes had other sources of support.
At present, all of NIST's online databases, including the NIST Chemistry WebBook, are available free of charge to all users. NIST is considering various options for charging for online data.
7b. What pricing structure do you use and how do you differentiate (e.g., byproduct, time, format, type of customer, etc.)? In general, the pricing structure at NIST reflects such factors as the amount of data, the level of evaluation, and the complexity of tools for accessing, displaying, and using the data.
NIST is currently reviewing its policies on cost recovery for databases because those policies provide the underlying basis for setting prices. For example, NIST would have one underlying basis if it tried to recover the costs of collection, compilation, evaluation, publication, and dissemination of standard reference data to the extent practicable and appropriate for each data product, and quite another if it tried to recover no costs at all. In the former case, NIST could make some data products available for free (or for a nominal fee) if recovering costs did not appear possible or cost effective. At the other extreme, if it were possible to recover all costs for a particular database, then the cost of that database would be determined by the cost of the program required to meet national/industry needs and the projected number of sales.
As noted under question 7a, all of NIST's online databases, including the NIST Chemistry WebBook, are currently available free of charge to all users. The options under consideration for charging for such data include offering a limited version (in terms of data) for free and a complete version for a small annual fee; offering limited access to the complete version for free and unlimited access for a fee (analogous to going to the library occasionally to look up a number in a handbook you use infrequently versus buying a handbook you use frequently); and offering the complete version for use on a customer's PC or intranet, again for a fee.
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. In general, yes; however, as mentioned under question 7a, few data activities are capable of generating enough income from sales to offset a significant fraction of their costs. Although everyone agrees that high-quality data are extremely valuable, users are reluctant to pay prices commensurate with that value, perhaps because the separation between the use of the data and the end results is too great, and consequently, the impact on the bottom line is hard to quantify.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? NIST has experienced significant problems regarding the costs of certain data-related services provided by outside organizations, such as assigning auxiliary information to chemical compounds, and regarding the subsequent distribution of such information in NIST databases.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? NIST has experienced several problems. For example, outside organizations have taken FORTRAN subroutines such as those in REFPROP and tried to package and market them, and users of the NIST Chemistry WebBook have downloaded massive amounts of data from the NIST Web site. However, it is not clear that these and similar actions have caused NIST significant harm, at least not yet.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? In the case of databases like REFPROP, NIST has addressed the problem by writing warning letters to the offending parties but has taken no further action. In the case of the WebBook, NIST has not yet taken any action.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? None.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? Yes.
General Discussion
DR. ALEXANDER: You have copyright privileges to copyright material. Do you also license products?
DR. KAYSER: The only sense in which we license them is that we enter into agreements with secondary distributors who are then free to distribute the database on behalf of NIST. In some cases, we are willing to enter into customized agreements. “Secondary distributors,” I think, covers it.
DR. SCOTCHMER: In your authorized use that you referred to, have you taken steps, or is your database amenable to encryption methods or restricted access methods, to stop massive downloading of data?
DR. KAYSER: In the case of REFPROP and similar databases, we have pretty much ignored it because it wasn't obvious to us that it was doing a substantial amount of harm. We didn't pursue it beyond writing the warning letters. In the case of the WebBook, we haven't yet developed any countermeasures to stop people from downloading massive amounts of data.
DR. SAXON: Our second speaker is James Lohr. He is the director of information industry relations for the Chemical Abstracts Service, which is part of the American Chemical Society and a service that I think is one of the original, and one of the largest, bibliographic databases.
Not-for-Profit Data Activity
James Lohr, Chemical Abstracts Service
Response to Committee Questions
1a. What is the primary purpose of your organization? Chemical Abstracts Service's (CAS) mission is to be the world's leader in meeting the needs of the world's scientists and researchers for chemical and related scientific information. CAS fulfills its mission in part by (a) producing the world's most important secondary database (Chemical Abstracts summarizing the world's publicly available journal and patent information since 1907); (b) creating the CAS Registry System of new chemical substances; and (c) developing and deploying state-of-the-art delivery modalities to permit searching and information retrieval from these databases.
1b. What are the main incentives for your database activities (both economic and other)? CAS is a division of the American Chemical Society (ACS). The ACS was founded in 1876 and later chartered by the U.S. Congress; it is currently the world's largest scientific society, with well over 150,000 members. The mission of the ACS is to encourage in the broadest and most liberal manner the advancement of the chemical enterprise and its practitioners.
Central to this mission from the beginning has been the important function of organizing and disseminating chemical information. The ACS accomplishes this via a sustained strategy of continuing to be the world's leading provider and deliverer of chemical information. In practice, the strategy is implemented through the journal publishing activities of the Publications Division and the CAS databases.
Both the Publications Division and CAS are “self-sustaining” divisions of the ACS. This means they are expected to generate revenues sufficient to cover (a) all of their operating expenses, (b) the cash flow necessary for business reinvestment, (c) certain overheads allocated from the ACS, and (d) a budgeted annual surplus to contribute to the funding of other ACS activities related to its mission.
2a. What are your data sources and how do you obtain data from them? CAS's main data sources are the publishers of the world's chemical research journals and the major global patent-issuing offices. Traditionally, the data were obtained by acquiring print journals from publishers and patent gazettes from patent-issuing offices. More recently, a substantial portion of this input is acquired in electronic form from the same sources.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? There are few barriers to obtaining data for the production of CAS databases. Chemical practitioners are frequently also the original sources of chemical information contained in CAS databases. These practitioners universally recognize the benefits of inclusion of references to their published work in CAS databases and act to ensure that no segment in the chemical information delivery chain does anything to impede their inclusion. This is rational, owing to (a) the vast amount of chemical literature available in the world, (b) the consequent effort involved in organizing this information in an efficiently searchable form, and (c) the near impossibility of effectively conducting chemical research absent ready recourse to the chemical literature.
There are, of course, costs involved in obtaining some of the data. These are generally subject to negotiation, however, and are less “barriers” than they are analogs to the costs any business might incur procuring raw materials for its operations.
3. What are the main cost drivers of your operations? The main cost driver in the production of CAS databases is labor cost associated with hiring and retaining a large cadre of highly trained chemical professionals to analyze documents and extract salient features for inclusion in the databases. There are smaller costs associated with gaining access to raw materials as noted above.
For CAS as a whole, there are also significant costs incurred procuring electronic delivery product hardware and developing software necessary to provide electronic access to the CAS databases.
4a. Describe the main products you distribute/sell. Historically, CAS's main product has been the weekly publication Chemical Abstracts (currently more than 1,800 pages per week), which contains references to the world's chemical literature. Chemical Abstracts is produced in both print and electronic forms. In addition, CAS electronically creates and maintains the CAS Registry System database; CASREACT, a chemical reaction database; and MARPAT, a Markush chemical structure file whose input is drawn mainly from the world's patent literature. While CAS's print products remain viable, CAS is increasingly selling access to the databases via a variety of electronic delivery modalities, which are frequently viewed as “products” themselves.
Prominent among these electronic delivery products are:
- The global online system STN International (the Scientific and Technical Information Network), which is comanaged by CAS, Fachinformationszentrum Karlsruhe (Germany), and the Japan Science and Technology Corporation;
- The award-winning desktop system SciFinder, which puts direct access to the CAS databases at the scientists' fingertips;
- STNEasy, an Internet service that permits access to the CAS databases from anywhere in the world with an Internet connection; and
- A variety of CD-ROM products.
4b. What are the main issues in developing those products? Given the high labor component in database building, a continuing challenge is to find ways to eliminate non-value-adding work by the professional staff.
On the distribution side, since the main mode is changing rapidly from paper to electronic, a significant challenge continues to be the availability of adequate software development resources to meet market demands for new variations of electronic delivery.
4c. Are you the only source of all or some of your data products? No; CAS products have their unique features, mostly related to comprehensiveness, but there is generic competition for all of them when different sources are accessed.
5a. What methods/formats do you use in disseminating your products? As noted in the response to question 4a above, CAS products are distributed in print and electronic forms. Electronic access to CAS data can be via the proprietary global online system STN, the proprietary desktop software SciFinder, over the Internet with STNEasy, or a variety of CD-ROM offerings.
5b. What are the most significant problems you confront in disseminating your data? As noted in the response to question 4a above, CAS increasingly conducts its business by providing electronic access to its databases. A large fraction of all CAS commerce is done in real time, 24 hours a day, on a global basis. CAS's most significant dissemination problems involve maintaining high service levels for this complex electronic system. Extremely high reliability—and redundancy—in computer hardware, software, and telecommunications links is essential.
6a. Who are your principal customers (categories/types)? Customers for CAS data are individuals involved in the chemical enterprise who require a complete knowledge of existing experimental data and results to do their jobs. These include mainly scientists and engineers occupied in research and development, examiners of chemical patents, attorneys preparing chemical patents, academicians, chemistry students, and any others who need a comprehensive and current knowledge of some aspect of chemistry.
6b. What terms and conditions do you place on access to and use of your data? CAS database products are all copyrighted and their use is regulated by the copyright restrictions pertaining to such materials.
Electronic access to CAS data via telecommunications systems is routinely covered by agreements between CAS and the organizations with which individuals accessing the data are affiliated. These agreements are often tailored to the specific needs of an organization and can govern such things as the identity and/or number of individuals with access privileges, the number of simultaneous users, the number of records that may be downloaded and retained, the length of time such records may be retained, and so forth.
CAS CD-ROM electronic data products contain restrictions on the electronic redistribution of data. With special arrangements, CAS does permit data from CD-ROM products to be downloaded to an organization's internal network for the exclusive use of its affiliates.
6c. Do you provide preferential terms for certain categories of customers? Academic institutions are able to purchase CAS products and services at costs substantially below those of other customers.
7a. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? CAS print products have standard prices, which are adjusted annually. Prices for electronic access to CAS data depend on a variety of factors, including modality (e.g., STN, SciFinder, etc.), the number of simultaneous users (SciFinder Scholar), the number of software subscriptions or tasks contracted for (SciFinder basic client), the inclusion of special software features (e.g., SciFinder Substructure Search Module), any specific arrangements tailored for special organizational needs, and so forth. As noted above, academic institutions enjoy substantial discounts on all CAS products.
7b. Do your revenues meet your targets/projections? Please elaborate, if possible. CAS revenues have met or exceeded targets in recent years.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? Until recently there have been some restrictions on the ways information emanating from certain patent-issuing bodies could be used in building chemical databases. These restrictions have been eased in the last several years.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? The two main CAS databases, Chemical Abstracts File and Registry, enjoy copyright protection. Also, as noted above, electronic access to the data in these files is frequently governed by agreements that further protect CAS's interests.
CAS has a major problem with the piracy of printed Chemical Abstracts in the People's Republic of China. Large numbers of copies are illegally printed and distributed to institutions throughout China by an organ of the Chinese government. CAS estimates the loss at roughly $20 million, based on what the equivalent number of copies would bring at prices prevailing in the rest of the world.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? As noted above, CAS has been unable to effectively address the problem of piracy of printed products by foreign governments. Abuse by the same sources via electronic access has not been a problem because CAS has declined to grant such access. Abuse via electronic access has been uncovered from time to time in the rest of the world and has been managed by a combination of managerial, technical, and contractual measures.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? Clearly, it would be in CAS's interests to have the People's Republic of China brought fully into conformance with generally prevailing intellectual property conventions and practice. This is an active goal of U.S. policy toward the People's Republic of China, but it has met with only limited success to date.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? CAS is not aware of how general any of the issues contained in the responses to question 8 may be within the chemical database-building community.
General Discussion
PARTICIPANT: I have a question for both Dr. Lohr and the NIST speaker. That is, both of you have mentioned that the work prepared by your organizations is covered by copyright. As a copyright lawyer, I am very doubtful, particularly in the case of registries, that copyright protects you much at all. The same thing applies to NIST; you may assert it, just as I may assert that I am a citizen of Mongolia, but it may not do you much good in a court of law. Have you looked into that problem?
DR. LOHR: It has never been challenged in court, no. You have to make certain representations to the Patent and Trademark Office and to the Copyright Office periodically, to have them certify if you have that status, and we do that. To have to defend that before a court, no, we have never done that.
MR. PERLMAN: It seems like you are doing well. Yet, you suggested protocols be developed to give greater protection to the databases. In addition to copyrights and in addition to the licensing agreements in the past, what other problems need to be solved?
DR. LOHR: I think, Professor Perlman, it is hard to anticipate exactly what those problems are. If someone had asked us all to come to Washington for a couple of days and talk about this subject five years ago, we would have all scratched our heads and said, “What are you talking about? We will stay home and enjoy ourselves.” The environment is changing so dramatically and so rapidly you cannot really foresee what all the problems might be. If you look at all these agreements that we have and the things that we do right now, they are adding inefficient friction to the system. If public policy, even in today's environment, were such that people like us and others felt adequately protected to go about their business, we wouldn't have to exert all the effort and the energy and the time. It really is just wasteful. Nothing is contributed, in an economic sense, by creating all these agreements and so forth.
First, I think even in the present environment people feel a need for something that gives them more assurance that others will be prohibited from just expropriating their property to their disadvantage somehow. Second, though, I cannot foresee the future. I really don't know what awful things will happen. What I do know is that things are changing very, very rapidly and in a somewhat unpredictable way. To take the position that we will wait until we really get clobbered, we will wait until the ceiling falls in on us, and then decide to do something about it is just not a prudent way to go about your business.
DR. SAXON: We are working on the electronic delivery of information.
MR. PERLMAN: It seems the potential threat to your mode of operation is if one of the sources of your information—proprietary journals—begins to assert restrictions on what you can do with the information that you have.
DR. LOHR: Yes, and it depends on how “restrictive” restrictive is. Right now I would say that we have arrangements and agreements, contracts of various sorts, and transactional relationships with a very large percent—I don't know what percent—of all the suppliers of our raw material, whereby they supply the information and we use it in certain ways that they approve of. If they changed their minds and decided to just choke that off, it would be very simple. It would just put us out of business, in all probability.
DR. SAXON: For the commercial sector in this area, we are going to have a presentation by Leslie Singer, who is the president of ISI—the Institute for Scientific Information.
Commercial Data Activity
Leslie Singer, Institute for Scientific Information
Response to Committee Questions
Provide a description of your organization and related database activities. The Institute for Scientific Information (ISI), a company with approximately 800 employees, is a leading secondary publisher of bibliographic databases to support scholarly research. ISI is a subsidiary of the Thomson Corporation, which is headquartered in Stamford, Connecticut, and listed on the Toronto, Montreal, and London stock exchanges.
The foundation of ISI's products is the ISI database, which includes the highest-quality science, social science, arts, and humanities publications, covering about 16,000 journals, books, and conference proceedings. ISI's database contains data from 1945 to the present.
Although ISI's core competency is database creation, ISI also devotes substantial energy to creating and marketing the software (proprietary to ISI) to manage the database. ISI offers its products in a variety of media, including print, diskette, CD-ROM, magnetic tape, and Internet or intranet. The first electronic product was offered in 1988 (revenues then were 15 percent electronic and 85 percent print); 1998 estimates are 79 percent electronic and 21 percent print. ISI's database is also available through third-party vendors, such as Ovid Technologies, SilverPlatter, OCLC, Dialog, STN, and Dimdi, each of which offers access to the ISI database through its own software or online system.
A key feature of ISI's database is the inclusion of searchable cited references (bibliographies or footnotes) published with each article. These cited references are links to prior relevant research established by the publishing authors themselves—an acknowledgment of previous research that provided the basis for the author's current research. Cited references can be used to retrieve related articles even when the terminology of the research has changed over time. For example, cited reference searching lets the user take a known paper and find other, more recent papers that cite it. It also enables the researcher to identify cocitations (articles that include common cited references), to analyze the impact of published research, and to identify experts in a field. Through cited references, the researcher can track developments forward and backward in time, crossing disciplinary boundaries, and uncovering relevant links that might otherwise remain hidden. ISI processed approximately 22 million cited references in 1998.
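In data-structure terms, these operations are traversals of a citation graph. The sketch below, built on hypothetical records, shows a cited-reference search (finding later papers that cite a known one) and a count of shared cited references between pairs of articles, in the sense described above.

```python
# Each paper maps to the set of papers it cites (hypothetical data).
cites = {
    "paper_A": {"classic_1965"},
    "paper_B": {"classic_1965", "paper_A"},
    "paper_C": {"classic_1965", "paper_A"},
}

def citing_papers(target):
    """Cited-reference search: later papers whose reference lists
    include the target paper."""
    return {p for p, refs in cites.items() if target in refs}

def shared_references(p, q):
    """Number of cited references two articles have in common."""
    return len(cites[p] & cites[q])

print(citing_papers("classic_1965"))            # {'paper_A', 'paper_B', 'paper_C'}
print(shared_references("paper_B", "paper_C"))  # 2
```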
ISI's products can be categorized as:
- Current awareness—Table of Contents products;
- Alerting services—customer profiles with e-mail delivery;
- Citation indexes—index to comprehensive research literature with cited reference searching;
- Chemical services—index to bibliographic, reaction, structure and cited reference for new compounds and new synthesis in chemistry; and
- Linkages—a relatively new service that provides customers with the ability to hyperlink from ISI's database to the primary publishers' full-text databases.
A complete list of ISI's products is given in the response to question 4a below; for additional information about ISI and its products, please see ISI's Web page at <http://www.isinet.com>.
1a. What is the primary purpose of your organization? ISI's primary purpose and its mission is to provide essential, high-quality products and services that enable all participants in the scholarly and applied research process to optimize their access to and management of published materials.
1b. What are the main incentives for your database activities (both economic and other)? As a for-profit company, ISI's main incentive must be self-perpetuation through fiscally responsible behavior. However, ISI's heritage is rooted in the pursuit of scholarly research. The idea for Current Contents began when Dr. Eugene Garfield, a graduate student of chemistry, would prepare a packet of the table-of-contents pages of the leading chemistry journals for his fellow researchers. ISI's focus will always be to serve the scholarly community by providing the information necessary to advance research.
2a. What are your data sources and how do you obtain data from them?
Data Sources ISI's data sources are scholarly journals, books, and conference proceedings. Coverage is multidisciplinary, including the arts and humanities, the social sciences, and the sciences. Approximately 16,000 peer-reviewed journals, books, and proceedings are processed for ISI's database each year. Many more new journals, books, and proceedings are reviewed to determine if they meet ISI's coverage standards. Many publishers provide their journals on a complimentary basis because of the exposure associated with coverage in ISI products; however, we do spend considerable funds on subscriptions as well, primarily from societies and associations.
Traditionally, all materials processed by ISI were print format originating from publishers, societies, and university presses. In the past three years, the larger publishers have begun supplying journal data in electronic format. Also, with development of the World Wide Web, many new sources of articles, primary databases, and other scholarly information are available and being evaluated by ISI on a regular basis.
How We Acquire Them The process for acquiring ISI's raw materials involves negotiating with publishers, acquiring the materials (new and ongoing), and evaluating the literature for coverage. The process requires varied skill levels and is therefore divided into three functional areas.
Publisher Relations The Publisher Relations division is responsible for initiating and maintaining strong, positive relations with the more than 2,500 publishers who provide the journals, books, and proceedings included in ISI products. Publisher Relations is responsible for obtaining the publications used for coverage; negotiating arrangements with primary publishers for the use and storage of their electronic materials; negotiating rights to supply document delivery; and negotiating rights to link between the ISI database and publishers' full-text primary materials.
Acquisitions The acquisition of journals, books, and conference proceedings for editorial evaluation purposes is a labor-intensive activity. A number of print, online, and Web resources are utilized to identify and request newly published material that is potentially appropriate for inclusion in ISI's products and services. In addition, we have developed relationships with a large portion of the scientific publishing community that include automatic provision of all new publications. The Acquisitions Division's role includes requesting over 6,000 books and conference proceedings annually; evaluating over 7,000 monographs (requested and auto-provided) annually; evaluating over 2,000 new journals (corresponding to 16,000 issues) annually; managing all journal subscriptions; and tracking claims for missing issues.
Editorial Selection ISI's primary editorial goal is the selection of the most important, internationally influential publications for coverage in each of the over 200 subject categories in ISI's multidisciplinary database. We have selected only those publications most highly valued by the international community of researchers and scholars. Thus, the ISI database is comprehensive, but not all-inclusive.
The work of journal selection is performed by a team of ISI editors who have educational backgrounds relevant to their areas of responsibility. Several editors are also librarians, and all editors have broad knowledge of the literature of their field. They have the full resources of the ISI citation database as a primary tool in evaluating journals.
Each year editors review approximately 3,000 journals from which fewer than 200 are selected for coverage. Another 7,000 books and proceedings are evaluated, resulting in coverage of around 4,600 volumes in ISI products.
How We Obtain the Data from the Sources The process of populating the ISI database with new source materials involves three key steps of cataloguing the new journal issues or books, capturing the bibliographic and other ISI data from the source materials, and verifying the integrity of the data and database after the source data has been captured.
Publication Processing—Cataloging Many functions parallel those in libraries. The Publication Processing division acts as our technical services group. Its staff catalogue 8,000 to 9,000 books and proceedings volumes per year; maintain up-to-date serials records for over 8,500 journals; input ongoing journal receipts into the online serials system; and accession, label, and ship the receipts to the data capture facility.
Data Capture Over the past several years ISI has made a successful transition from a keying-based data-capture process to that of a scanning/OCR-based system. We are in the early stages of the next major transition, which is the shift from processing print source material to processing from electronic input files. In 1998 we processed nearly 1.3 million source articles (up 23 percent from 1993), more than 4 million authors' names (up 35 percent from 1993), over 2 million addresses (up 31 percent from 1993), nearly 22 million cited references (up 41 percent from 1993), and over 802,000 abstracts (up 51 percent from 1993).
Database Edit There are several types of quality-control edits applied to the data, both during and after capture. These edits fall into three general categories: (1) machine edits, (2) manual edits, and (3) dictionary processing. The edits are designed to correct any errors introduced either by the OCR or entry processing, or by the author—especially errors in reference lists.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with these barriers? The major barriers to getting the data are similar to those in most library acquisitions processes, including:
- Evaluating a large set of potential new journals to obtain a small number for coverage;
- Negotiating subscription fees and rights for electronic storage and delivery, and starting and maintaining subscriptions in electronic and/or print formats;
- Receiving journals in a timely manner; and
- Monitoring and claiming missing issues.
In addition, there is a new trend toward electronic journals whereby issues are made available online and must be “retrieved” by the subscriber. This process requires more manual effort and tracking because it is similar to an ongoing claiming process.
The major barriers for integrating data are the diversity of formats and styles, as follows:
- There are different formats for journals, proceedings, and books;
- Each publisher and even each journal has different styles and formats for presenting data elements, requiring ISI to impose standards during data capture, so that data can be indexed properly for searching;
- Electronic materials with their unique formatting complexities (such as PDF, SGML, HTML, and XML) further burden the integration process; and
- Sweeping changes are required in all systems and products to process these electronic materials and to accommodate new data elements and new citation methodology.
3. What are the main cost drivers of your database operations? The main cost drivers are volume of materials and the labor required to support the translations, data capture, database support, quality assurance, data extraction and dissemination, and search and retrieval software support.
Over the last 5 to 10 years, ISI has experienced a shift in labor requirements whereby more technical personnel (programmers, hardware technicians, and communications staff) are necessary to keep pace with the quickly changing technical environment.
Translations ISI provides an English translation of article titles when a journal is published in a language other than English. ISI's translation staff are selected to provide broad coverage of languages as well as disciplines. Because ISI products encompass science, social science, and the arts and humanities, the staff must have a working knowledge of the vocabulary of each discipline and its non-English equivalents.
Data Capture
General Data Capture After the scanning and OCR of journal material, human editing of the data is required to confirm the accuracy of the data and also to apply extensive ISI policy rules that act to unify the data for indexing purposes. Policies affect every field (author, title, address, abstract, page span, citation, etc.) captured by ISI. In addition, because of the lack of citation format standards within bibliographies, manual keying of citations (22 million in 1998) is still required.
ISI has also been making the transition to accept source materials from publishers in electronic format; however, since there are no standards in electronic publishing, each new journal usually requires a new programming effort. In addition, because very few publishers are currently providing data in electronic form, the two systems (electronic and paper input) must be maintained.
Chemical Data Capture Data capture of chemical data, particularly chemical structures and reactions, is highly labor intensive. ISI chemists enter the graphical representation of all compounds presented in a reaction. In addition, editors read the complete article to obtain and capture specific data about the reaction (e.g., reagents, key steps, R-groups, temperature, yield rates, advantages, and other comments germane to the reaction or new compound).
Arts and Humanities Data Capture Cited references in the humanities are notorious for being incomplete. ISI humanities editors, therefore, must have extensive knowledge in music, literature, theatre, and art in order to provide a complete cited reference (author, full title of the artistic work, and the year of creation) when only partial information is given.
Database Support and Quality Assurance Major support functions for the database involve the use of automated “cleansing algorithms” with human intervention, whereby ISI repairs references in bibliographies that may be cited incorrectly, thus extending ISI's citation indexing capabilities. Similarly, ISI performs quality assurance checks of all other data elements using automated algorithms with human intervention and correction.
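Any such cleansing algorithm must group variant renderings of the same citation. The sketch below reduces a free-form cited-reference string to a rough (author, year, page) key; the rules and sample strings are hypothetical, and a production pipeline would use far richer rules, dictionaries of known journals, and human review.

```python
import re

def citation_key(ref):
    """Extract a rough (first author, year, last number) key so that
    variant forms of the same cited reference can be grouped."""
    author = re.match(r"\s*([A-Za-z'-]+)", ref)
    year = re.search(r"\b(?:18|19|20)\d{2}\b", ref)
    numbers = re.findall(r"\d+", ref)
    return (author.group(1).lower() if author else None,
            year.group(0) if year else None,
            numbers[-1] if numbers else None)

variants = [
    "GARFIELD E, 1955, SCIENCE, V122, P108",
    "Garfield, E. (1955) Science 122: 108",
]
assert citation_key(variants[0]) == citation_key(variants[1])
```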
ISI data capture policies are reviewed regularly to adapt existing rules and to create new rules for new elements, such as how to capture a reference to an electronic publication or Web site. Any policy change is structured to maintain maximum consistency with the 50 years of existing ISI data.
Data Extraction and Dissemination Data extraction and dissemination require separate programs, operational procedures, and quality processes for each operating system (e.g., DOS, Windows, Mac, Sun, DEC, etc.). In addition, creation of master files is required for each media type within each operating system (e.g., diskette, CD-ROM, FTP transfer, magnetic cartridge, online vendor formats, etc.).
Search and Retrieval Software ISI's proprietary search and retrieval software is an integral part of the database. Each software package, for each media type and operating system, is upgraded regularly to incorporate the new capabilities of the operating systems and to provide customers with new search capabilities and techniques to get the most benefit from the database. Development of new products and services to leverage the new technology capabilities (such as the Web) is an ongoing initiative at ISI.
4a. Describe the main products you sell. ISI provides a variety of scholarly information tools to support the worldwide research community. Our offerings range from broad, interdisciplinary products to products that focus on a particular discipline or specialty.
Despite the company name that emphasizes science, we offer a wide variety of products in the arts, humanities, and the social sciences. The ISI products include:
- Bibliographic management tools including ProCite® and Reference Manager®;
- Chemical information products including Current Chemical Reactions®, Index Chemicus®, ISI Chemistry ServerSM, and Reaction CenterSM and its Reaction Citation Index™;
- Citation databases including multidisciplinary citation indexes (Arts & Humanities Citation Index®, Science Citation Index®, Social Sciences Citation Index®, Web of ScienceSM) and specialty citation indexes (Biochemistry & Biophysics Citation Index™, Biotechnology Citation Index™, Chemistry Citation Index™, CompuMath Citation Index®, Materials Science Citation Index®, and Neuroscience Citation Index™);
- Current awareness products such as Current Book Contents®; Current Contents®, including Current Contents Connect™ and the Current Contents editions (Agriculture, Biology & Environmental Sciences; Arts & Humanities; Clinical Medicine; Engineering, Computing & Technology; Life Sciences; Physical, Chemical & Earth Sciences; Social & Behavioral Sciences); Current Contents Collections (Business Collection, Electronics & Telecommunications Collection) and Current Contents® Proceedings; Focus On (Psychopharmacology, Sports Science & Medicine, and Veterinary Science & Medicine); ISI Alerting ServicesSM (Corporate Alert®, Discovery AgentSM, Journal Tracker™, and Personal Alert®); and Reference Update®;
- Document delivery including ISI Document SolutionSM;
- Indexes to proceedings, book contents, and reviews including Index to Scientific Book Contents®, Index to Scientific Reviews®, Index to Scientific & Technical Proceedings®, and Index to Social Sciences & Humanities Proceedings®;
- Journal evaluation including Journal Citation Reports®;
- MetaMaps™; and
- Research performance and evaluation tools including High-Impact Papers, Institutional Citation Report, Institutional Indicators, Journal Analysis Database, Journal Performance Indicators, Local Journal Utilization Report, National Citation Report, National Science Indicators, Personal Citation Reports, Research Fronts, SCI-MAP, Science Watch®/Hot Papers on Diskette, Topical Citation Report, and University Indicators.
4b. What are the main issues in developing those products? Our main barrier in developing products is the shortage of skilled data processing professionals available to perform the highly sophisticated development required to maintain our proprietary search engine (no commercial product was able to meet our requirements), to maintain a massive database (approximately 166 GB), to provide state-of-the-art user interfaces, and to process large volumes of data.
Other issues include the rapid change in technology and the impact on development of products. This rapid change requires us to monitor technological advancements, evaluate which technologies will be widely accepted, and estimate when the market will be ready to accept products for a given technology. Software in the current electronic environment has a very short shelf life. New versions of operating systems are being released at regular intervals, making it necessary to upgrade ISI software to accommodate the new capabilities.
Another issue is the disparity of technical capabilities in ISI's worldwide market, which makes it virtually impossible to discontinue a product (such as microfiche or MS-DOS products); thus the unit cost to produce each “old technology” format becomes higher over time as the customer base erodes. ISI's customer base ranges from customers with no computer equipment, for whom print is the only acceptable format, to customers with sophisticated remote-access network capabilities, for whom the World Wide Web is the medium of choice.
4c. Are you the only source of some or all of your data products? If not, please describe the competition you have for your data products and services. ISI or its authorized agents are the sole sources of its proprietary search and retrieval software combined with a unique scholarly multidisciplinary database. Portions of ISI's database are also available through third-party distributors. The set of materials covered by ISI is unique; however, it does overlap with other secondary publishers' database holdings.
Other secondary publishers (e.g., Chemical Abstracts Service, BIOSIS, the National Library of Medicine, and IEEE) are ISI's traditional competitors in the scholarly market. One of the noteworthy aspects of the competitive landscape is that the competition generally remains niche-oriented, in contrast to ISI, which is differentiated by its broad-based, multidisciplinary coverage.
In the electronic environment, nontraditional competitors have emerged in the form of primary publishers (e.g., Elsevier's ScienceDirect, Adonis) and aggregators/distributors (e.g., Ovid, SilverPlatter, EBSCO, Swets, British Library). As electronic data become more accessible, new competitors, with no print legacy, find it easier to enter the market.
5a. What methods/formats do you use in disseminating your products? ISI sells most of its products directly to libraries or end users, developing and supplying its own search and retrieval software. ISI also works with third-party vendors (e.g., Ovid, SilverPlatter) who provide ISI data to customers using their own software.
ISI's products are available in a number of formats:
- Electronic product formats include diskette (1.3 million distributed per year) and CD-ROM (over 300,000 per year); we also supply tape and FTP files to vendors and customers who load data locally.
- We host two Internet-based products, the Web of Science and Current Contents Connect, on the World Wide Web. The Web of Science is also available for intranet loading by customers.
- Distribution of e-mail data files began two years ago in a series of four new alerting service products.
- While electronic products are what most subscribers prefer, we continue to sell more than 600,000 print volumes (paperback and hardbound) comprising more than 3 billion pages of data annually.
5b. What are the most significant problems you confront in disseminating your data?
- Supporting all the various formats of data is labor intensive and expensive, particularly with an eroding subscriber base of the earlier formats, such as print and diskette.
- Staying current with technology trends, especially those for producing data and those that customers will likely adopt for receiving data. Many technologies are investigated but only a few become operational at ISI and within our customers' sites (e.g., Lotus Notes never achieved its potential in much of our marketplace).
- Meeting market expectations for turnaround time, both internally and with service vendors for processing, replication, and mailing.
- Dependence on the Internet, which may be slow or disrupted.
- Dependence on distribution vendors (Ovid, Dialog, etc.), which may be acquired or may change policies in ways that affect ISI or its customers.
6a. Who are your principal customers (categories/types)? ISI's principal customers are academic libraries, library consortia of graduate-level universities, research-oriented corporations (such as pharmaceutical and biotechnology firms), government research facilities, and the end-user researchers themselves. ISI's market is international with 50 percent of revenue attributed to North America and 50 percent contributed by Europe, Middle East, Africa, Asia Pacific, Australia, and Latin America. ISI's customer base embraces all disciplines within the sciences, social sciences, and humanities.
6b. What terms and conditions do you place on access to and use of your data? Authorized use of ISI's database is established through license agreement. The main provisions of the agreement allow for printing and downloading of search results for personal or internal business use by an authorized user. The results may not be used for purposes of publication or commercial use or distribution outside the licensing institution.
6c. Do you provide differential terms for certain categories of customers? No; authorized use and copyright terms are standard language for all categories of customers. Any special use of the data must be permitted by ISI on an individual-case basis.
7a. What are the principal sources of funding for your database activities? Our database activities are funded exclusively from our product sales. Although our parent company (the Thomson Corporation) occasionally makes investment funding available, ISI has used this source of funds only once.
7b. What pricing structure do you use, and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? ISI's pricing differentiates by product (see extensive list under question 4a above), medium (print, microform, diskette, CD-ROM, magnetic tape, online), length of subscription, networked vs. standalone, number of simultaneous users (Internet systems), and number of products purchased from ISI.
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. The Thomson Corporation expects its operating companies to plan carefully, and to meet the commitment specified in their annual management plans. ISI has developed a comprehensive planning process that includes input from a strategic advisory board, a meeting with key publishers, an externally facilitated planning retreat, and a series of internal reviews. Our new product plans are validated through an extensive market research process, frequent reviews with current customers, and presentations at trade shows and scholarly meetings.
As a result of our careful planning and research processes, ISI consistently meets or exceeds its revenue targets and projections.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? We occasionally encounter objections from publishers to our use of their abstracts, particularly in the electronic environment, but have been successful in resolving these issues. Publishers generally regard ISI as a solid, neutral source of scholarly information rather than as a competitor.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? We use license agreements to protect our data and provide guidelines to customers on product use. Over the years we have seen various types of data misuse. For the most part these situations have been resolved simply by calling the terms of the license to the customer's attention, but occasionally it has taken more insistence, in the form of a strongly worded “cease and desist” letter, to get resolution. Most often these instances involve redistribution of data, either for free or for a fee. Misuse of the data is more easily accomplished in electronic form.
Historically, data misuse occurred in Third World or Iron Curtain countries, where one print subscription to Current Contents or the Science Citation Index was purchased and then photocopied and distributed to scientists throughout the entire country. The lost revenue that resulted could not be calculated, and the situation was very difficult to control.
The Journal Citation Reports is frequently the target of data piracy. It has been available for several years on CD-ROM and will soon be on the Web. Since it has been in electronic form, we have found many instances of users taking large parts, or all, of this database and posting it on the Web for anyone to use. In one case the posted data came with a note to “use this quickly before ISI finds it.” Some infringements even note that the material is copyrighted by ISI, but none claim that they have sought or received permission to post it.
ISI develops and sells bibliometric analyses of its data, both on a set-product and a custom basis. Others also use our data for bibliometric analyses, and this use is not in itself restricted; however, these data may not be redistributed or resold without express permission. Three recent infringement cases have now been settled amicably, but all started with flagrant misuse of ISI data that was repackaged and sold to third parties. One involved an academic institution that, for a fee, provided data and analyses to a third party to evaluate research activity in Europe. Another involved a government agency that licensed our data and then subcontracted the analysis to a for-profit company. While this is an acceptable use, we found that the subcontractor was also using the data to sell additional analyses to other parties. The third example is a not-for-profit agency that uses our database on CD-ROM to create a bibliographic product that it sells to other agencies like itself.
Each of these examples serves to illustrate situations where customers purchased or leased data from us under license and then went well beyond the terms of that agreement and charged fees to others without recompense to ISI for such use.
More controversial in the industry is the use by private search firms of data uncovered by online searches that is then repackaged and sold at a profit to third parties. Most database providers do not allow redistribution and resale of data without permission and possible payment of fees. This type of use is nearly impossible to track, however.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? As mentioned above, once data are distributed in electronic form, redistribution—either for profit or not—becomes that much easier to accomplish. We have had to become very specific in our license agreements for all electronic products and vigilant in our monitoring of data use.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? Database protection legislation should acknowledge that database producers add value to the database through their editorial selection process, data capture, translation, and policy standardization, as well as the unique search and retrieval programs that allow access to that data. Database suppliers should have legal recourse to protect their intellectual property rights against violation of authorized use as outlined in the legislation, copyright, and license agreements.
Legislation for copyright protection of patents, databases, and software should be negotiated and enforced worldwide so that U.S. providers are protected internationally against piracy and unauthorized use.
In addition, ISI would like to see a noncompete policy whereby government-funded agencies would not provide free services that compete with nongovernment entities (for-profit or not-for-profit), and that international users be charged fair market value for services they receive, rather than be subsidized by U.S. taxpayers.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? In general, all secondary publishers face the same basic issues as described above.
A summary of the major issues facing the secondary publishers in general and ISI in particular is:
- Negotiating electronic use agreements with primary publishers for database inclusion, document delivery, and more recently links to full-text publisher files;
- Maintaining selected highest quality coverage from among a vast international publishing arena;
- Increasing numbers of articles and issues within a journal, such that the volume of records processed increases continually even when no new journals are added;
- Standardizing and unifying data from various nonhomogeneous sources, particularly within cited references;
- Dealing with the trend toward technology-centric jobs and skills;
- Maintaining a variety of formats, media, and operating systems for a declining customer base of each format, resulting in increased unit costs; and
- Creating a price structure that is flexible yet equitable for all customers.
Perhaps ISI's situation is different from that of other database suppliers in that ISI covers a broad range of disciplines, each with its own set of issues. However, it is important to note that the methods each database producer uses to resolve discrepancies in standards and formats are exactly what give that producer its competitive advantage.
General Discussion
PARTICIPANT: I notice that you have translations at the bottom of the process. To what extent do you use machine translation, aside from human translators?
MS. SINGER: We at present don't use any machine translators. Our sister company, Derwent, does a lot of machine translating from Japanese patents. Most of our translation actually occurs in our arts and humanities entities. In arts and humanities, there is a lot of standardization that goes along with the translation. We just basically have people who are fluent in multiple languages and also come out of the arts and humanities venue.
PARTICIPANT: So much of the data that you incorporate in your database today comes from the public sector. You said you didn't want the governments to be competing. Can you speak to the social value of that position? It seems to me that the social value is enhanced by having the public sector generate the data and possibly distribute them as well.
MS. SINGER: It is difficult to compete with an entity that, in most cases, doesn't have to make a profit and, at this time, has very, very deep pockets. We fully recognize that the government is a generator of information, and they certainly have every right to package that information and disseminate it.
What we do is take journal material from not-for-profit or from for-profit entities, integrate it, standardize it, and package and disseminate it.
One of our competitors—and maybe it is not a complete overlap—is certainly the National Center for Biotechnology Information (NCBI). And although NCBI serves a great social need and a great social good, it gives us pause every once in a while when there are rumors that NCBI may be taking in additional information that falls outside the medical parameters that were set.
DR. SAXON: It is clear that chemistry, despite being a mature field in contrast to genomics discussed in the previous panel, has a great deal of database activity, and I think what we have learned helps to broaden our understanding of what this issue is about.
METEOROLOGICAL DATA PANEL
Government Data Activity
DR. SERAFIN: We have three speakers in the meteorological area this morning. The first is Ken Hadeen, who is a former director of the National Climatic Data Center in Asheville, North Carolina. The National Climatic Data Center is part of the National Oceanic and Atmospheric Administration (NOAA) and the Department of Commerce.
Kenneth Hadeen, National Climatic Data Center (retired)
Response to Committee Questions
1a. What is the primary purpose of the organization? The National Climatic Data Center (NCDC) serves as the National Weather Records Data Center under guidance from the National Archives and Records Administration. NCDC's primary purpose is to manage the nation's resource of global climatological in situ and remotely sensed data and information to promote global environmental stewardship; to describe, monitor and assess the climate; and to support efforts to predict changes in the Earth's environment. This effort requires the acquisition, quality control, processing, summarization, dissemination, and preservation of a vast array of meteorological data generated by national and international meteorological services.
1b. What are the main incentives for your database activities? The main incentives are to provide long-term preservation, management, and ready accessibility of environmental data, and to assemble quality-controlled databases of climatological information for use in engineering; construction; litigation support; natural disaster damage amelioration; insurance claims; urban planning; socioeconomic studies; transportation; aircraft operations; local, state, and federal planning; global climate change projects; and the monitoring and prediction of climatic events.
2a. What are your data sources and how do you obtain data from them? Principal sources are the observational networks of the National Weather Service, the international World Meteorological Organization (WMO) Global Telecommunications Network, exchanges through the World Data Center system, NASA, bilateral agreements with other countries, and special collections gathered in conjunction with global climate change projects.
Data are received in a variety of form factors and media. Some data are downloaded electronically via T-1 lines from the National Centers for Environmental Prediction. Other data are received on standard magnetic tape, floppy disk, CD-ROM, 8-mm tape, ZIP disk, paper tape, and manuscript records.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? The major barriers are primarily in the area of rapid technology changes in observing methods, instruments, formats, and media, rather than administrative or political barriers. Few barriers are encountered in getting current data. Occasional communication link outages may cause delay but seldom result in significant loss of data. Acquisition of historical data sets, especially those used in long-term climate studies, may be quite another matter. The barriers range from reluctance to part with data sets that have been gathered and processed in the national interest, to meteorological services of other nations not having enough money to pay the postage to send the data to us. There may also be reluctance on the part of principal investigators to share their data sets with anyone other than selected close peers, even after the initial research has resulted in publication of their findings.
A barrier may also be encountered in the actual format of the data, a lack of documentation, or a deficiency of information about what quality-control steps have already been undertaken. The media form factor can also play a part in making integration more difficult.
NCDC attacks these problems routinely through international councils, one-on-one communication, participation in global programs, quid pro quo arrangements between research fellows, etc. In the arena of global climate change, it is generally recognized that it will take the efforts of many nations, working independently and in concert, to develop the databases from which valid scientific conclusions may be drawn.
Integration of diverse data sets is accomplished through applying both computer techniques and statistical analyses to ensure that homogeneity of the basic data exists before submitting them to the scientists for use in their studies.
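As one concrete illustration of such a statistical screen, the sketch below flags the candidate breakpoint in a station series where the shift in segment means is largest relative to the overall spread. It is only a toy example using invented data, not NCDC's actual homogeneity procedure.

```python
# A minimal homogeneity screen for a station time series: flag the split
# point where the difference in segment means is largest relative to the
# series' overall spread. Toy illustration only; real quality control uses
# far more rigorous tests and neighbor-station comparisons.
from statistics import mean, stdev

def largest_mean_shift(series):
    """Return (index, shift_in_sigma) for the most suspicious breakpoint."""
    overall_sd = stdev(series)
    best_idx, best_shift = None, 0.0
    for i in range(2, len(series) - 2):      # keep both segments non-trivial
        shift = abs(mean(series[:i]) - mean(series[i:])) / overall_sd
        if shift > best_shift:
            best_idx, best_shift = i, shift
    return best_idx, best_shift

# Hypothetical annual means with an artificial jump at index 6, as might
# result from a station move or instrument change:
temps = [11.2, 11.0, 11.4, 11.1, 11.3, 11.2, 12.7, 12.8, 12.6, 12.9, 12.7]
idx, shift = largest_mean_shift(temps)
print(f"Candidate discontinuity at index {idx} ({shift:.1f} sigma)")
```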
3. What are the main cost drivers of your database operations? These may be divided into three separate categories: (1) ingest and quality control of the data; (2) validation of formats, merging into the database, tape management, inventory, and ensuring accessibility; and (3) security backup of magnetic tape files, storage, and migration.
Category 1, ingest and quality control, is the most labor intensive, requiring meteorological technicians, computer technicians, programmers, meteorologists, data entry clerks, and systems analysts. As stated earlier, data are still received in a wide variety of formats. Manuscript forms, charts, and paper tapes must be processed and entered prior to quality control. Tapes received routinely from the National Weather Service or other entities must be checked for format and completeness before going to the technicians for conversion and quality control. Today, more and more data are ingested through automatic “fetch” programs, which periodically poll either communications hubs or individual stations to download the required observational data. Although we use the term “automatic,” it should be noted that these fetch operations require constant vigilance and monitoring to ensure that communications and ingest systems are operating properly. Communication line charges are a significant portion of the cost of this type of ingest, amounting to several hundred thousand dollars per year.
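A minimal sketch of such a polling “fetch” follows, assuming a hypothetical hub endpoint and an hourly interval; failures are logged for operator attention, reflecting the constant monitoring mentioned above.

```python
# Sketch of an automated "fetch" that periodically polls a communications
# hub for new observations. The URL, interval, and file naming are all
# hypothetical; the point is that even "automatic" ingest needs monitoring,
# so failures are logged rather than silently dropped.
import time
import logging
import urllib.request

HUB_URL = "http://example.gov/obs/latest"   # hypothetical hub endpoint
POLL_SECONDS = 3600                          # hourly poll, for illustration

logging.basicConfig(level=logging.INFO)

def poll_once() -> None:
    try:
        with urllib.request.urlopen(HUB_URL, timeout=60) as resp:
            data = resp.read()
        fname = time.strftime("obs_%Y%m%d%H%M.dat")
        with open(fname, "wb") as f:
            f.write(data)
        logging.info("ingested %d bytes into %s", len(data), fname)
    except OSError as exc:                   # network or disk failure
        logging.error("fetch failed, operator attention needed: %s", exc)

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(POLL_SECONDS)
```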
The database management functions are the next step in the process of developing the long-term archive. Each tape received from either internal operational units or from external providers is checked for readability, format, and completeness. Tapes are inventoried and these inventories compared against header information or manuscript submission forms. The data are inventoried and the various station histories, automated index files, or online inventory information documents are updated. Tape management systems assign bar code numbers, provide location of the physical tape within the library, monitor the usage of the individual tapes, and furnish tape catalog information used to provide quick access to the databases.
Library (“L”) tapes are used to create security backup files. The copy is compared bit-for-bit before being stored as a “B” tape at a secure off-site location. “L” tapes are used for all routine processes. “B” tapes are accessed only on the authority of the database manager, and then only to create a new “L” tape as required. Tapes are stored in controlled environments. Each year a random sampling is made to evaluate tape condition and readability. Migration is nominally scheduled every seven years, but in practice the entire digital archive is migrated at shorter intervals, to take advantage of new media technology and higher-density form factors.
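The bit-for-bit comparison step might look like the following sketch, with ordinary file paths standing in for tape volumes; real operations would of course run through the tape management system described above.

```python
# Sketch of the bit-for-bit comparison performed before a copy is accepted
# as a "B" (backup) tape. File paths stand in for tape volumes here.
def identical(path_a: str, path_b: str, chunk: int = 1 << 20) -> bool:
    """Compare two volumes chunk by chunk; any mismatch rejects the copy."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a, block_b = a.read(chunk), b.read(chunk)
            if block_a != block_b:
                return False                 # copy must be redone
            if not block_a:                  # both exhausted together
                return True

# if identical("L_volume.img", "B_volume.img"):
#     ... store the "B" copy at the secure off-site location ...
```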
4a. Describe the main products you distribute/sell. Products include published monthly and annual Climatological Summaries from principal National Weather Service stations and from the extensive cooperative network in the United States. Other serial publications are Hourly Precipitation Data, Monthly Climate Data for the World, and Storm Data. These publications are sold through subscription or provided in response to individual requests. These primary products are the output from extensive database activities, which include the ingest, processing, quality control, tape merges, and final tape archive, as described earlier.
In addition to these “bread and butter” climatological summaries, products include periodic publications such as Normals of Temperature and Precipitation, Heating and Cooling Degree Days, and other specialized climatic summaries. Another item of particular interest currently is the construction of historical long-term climatic databases for both the United States and the world.
The fastest-growing dissemination methodology of data and information is over the Internet. In October 1998, users downloaded over 100,000 MB of data and information. This compares to only 150 MB delivered in 1992. Users during October 1998 also accessed the NCDC Web site to plot, graph, and download over 100,000 images of data and information. The more popular downloads were U.S. and Global Summary of the Day time-series plots, satellite and Doppler radar images, and Global Historical Climatology Network temperatures and precipitation plots.
4b. What are the main issues in developing those products? First and foremost is the question of accuracy, timeliness, and completeness. The age of the Internet and World Wide Web has brought with it a sense that all information should be immediately available to a wide range of the user community. Caution has to be exercised to ensure that premature conclusions do not result in erroneous information being distributed.
The advent of automated observing systems has perhaps caused the most significant challenge to the production of climate summaries in a manner similar to those done in the past. Different units of measurement, high temporal resolution, difficulty in measuring some of the basic climate information such as liquid and frozen precipitation, cloud types and amounts, etc. have all impacted the “traditional” climate summarization. The impact is perhaps most obvious in the development of the long-term databases mentioned earlier. Discontinuity in resolution of temperature is an example of this situation.
Products from the NEXRAD radar system present another challenge because of volume and type of media used to store the nearly 100 terabytes of data generated each year. Extracting products from the Level III WORM disks is slow and expensive, with the main output media being paper copies of the specified radar product.
4c. Are you the only source of all or some of your data products? If not, please describe the competition you have for your data products and services. NCDC is the only source for most of the products described earlier. There are groups in the private sector that purchase products or data sets from us and repackage them for specific markets and individual customers. In the area of satellite products, NASA maintains large archives similar to those at NCDC and provides them to a variety of customers. The National Weather Service provides real-time data to their customers and some climatological data in the form of paper copies and information to the public, but these latter products are usually generated in response to ad hoc requests concerning particular events at the time.
5a. What methods/formats do you use in disseminating your products? Products are disseminated in a multitude of methods embracing almost unlimited formats. These include copies of paper records; prints made from microfiche/microfilm; copies of microfiche/microfilm; printed publications; standard magnetic tape; 8-mm magnetic tape; floppy disk; facsimile; CD-ROM; FTP/IP; online at the NCDC Web site; information given over the telephone in response to requests; and special products developed at customer specifications.
5b. What are the most significant problems you confront in disseminating your data? Probably the demand for near real-time information, often even before the data are received at the Center. In many instances, observational data are received a few days after the end of the data month. Significant events that may have occurred near the beginning of the month are thus not available for processing and publication for many weeks after the fact. For extraordinary events such as tornado outbreaks or major hurricanes, for example, we attempt to gather information from satellites and other real-time systems, then develop a package for inclusion on our home page on the Internet.
The only other problem is that of cost. Our customers range from the man in the street to major engineering, manufacturing, and insurance firms. Customers who pass the cost of our goods and services on to their clients do not normally complain about our charges. Academia and researchers, often requesting very large data sets, complain that the normal charges are exorbitant and that they should be treated differently from other customers. In some instances this is possible, but not in all cases. An agency or group requesting $100,000 worth of processing output while having only $10,000 is not an uncommon encounter.
6a. Who are your principal customers (categories/types)? NCDC monitors customer profiles routinely in an effort to ensure our products and services reflect the needs of the various user communities. As stated earlier in this presentation, our primary mission is to collect, preserve, and publish data sufficient to describe the climate of the United States. To that end, all citizens may be described as customers.
The customer profile is basically a judgment call on the part of the customer service representative who takes the order. For example, if a law firm requests data to be used in litigation against a business or insurance firm, the customer would be listed under “Legal.”
The most recent 12-month period shows these categories, most of which have not changed significantly over the past several years.
User Category | Percentage of Requests
---|---
Legal | 28
Individual | 16
Insurance | 15
Business | 13
Consultant | 10
Engineering | 7
NOAA | 5
Government | 4
Research | 2
Please note that these are profiles only and may have little or no relationship to the amount or cost of the data ordered.
6b. What terms and conditions do you place on access to and use of your data? In general the data and information in the NCDC databases are considered to be in the public domain and no restrictions or conditions are placed on their use or further distribution.
Occasionally there are temporary restrictions placed on selected products or data during times of national emergency or military operations. In the case of our Web site this means that we would not place sensitive products for areas of concern on our home page. These data/products would not be available to anyone else during that time.
Recent international restrictions have been placed on the further distribution of certain data obtained from foreign countries. WMO Resolution 40 allows countries to define certain stations and data types that are not to be resold or distributed. The ramifications of data management required by this resolution were soon evident. In essence we would have to maintain two separate archives—one that can be freely used, and one that we could use for climate studies but that we could not distribute to users outside the government.
In response to these concerns, NCDC developed an approved warning statement that users who access our Web site must read before downloading these restricted data. They are cautioned that use of the data for commercial gain is illegal and that they must contact the meteorological service of the originating country to arrange permission to use the data. The same cautionary statement will be included in any shipments of tape archives containing data from the countries in question.
6c. Do you provide differential terms for certain categories of customers? The short answer is yes. We do have a multitiered system that allows approved researchers and government agencies to receive data at less cost than that charged to commercial customers. It is even possible for some researchers engaged in studies of global climate change to receive data free, although there is a cap on the amount of data that can be provided in this manner.
7a. What are the principal sources of funding for your database activities? The Department of Commerce is the primary source of base funding for the Center. Additional activities are carried out through special NOAA data management programs, which require written proposals for specific projects that have a finite terminal date, and the more recent congressional data rescue initiatives, which provide funds for retrospective data management/rescue to be used in specified congressional districts. These activities are aimed primarily at preservation of paper and microform records, although there is some keying of manuscript to digital format also being done. Routine database activities described in other sections of this report are accomplished through allotment of base funds.
This base funding is inadequate to operate the Center, and the sale of data, publications, and information is taking on an evermore important role in being able to continue operations. To that extent, these sales certainly support various aspects of the database management.
7b. What pricing structure do you use, and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? Pricing is normally based on detailed analysis of the actual cost of providing the product or service, including the cost of the media, computer charges, administrative costs associated with processing orders and checks or credit cards, printing, distribution, and postage. Personnel costs are factored in where appropriate. Other than as described in the previous section, there is no differentiation by type of customer. Time is not a consideration except in the case of rush orders, for which there is a surcharge. There is also a surcharge for Department of Commerce certification, which is required in most court cases. In accordance with the Office of Management and Budget's (OMB) Circular A-130, information gathered at government expense should be distributed at the lowest possible cost, and charges for collecting the data or observing, for basic database preparation, and for data management are not to be considered in determining the cost to the end user.
7c. Do your revenues meet your targets/projections? Please elaborate if possible. Seldom. If we are talking about overall operation of the Center, base funding accounts for a little less than 50 percent of the required operating expenses. Erosion of base funding takes several paths. For example, cost-of-living increases approved by the President and the Congress rarely come with offsetting increases in base. The agencies are expected to “cover” these costs through improved productivity, etc. Changes in technology require more sophisticated computer resources and more technically skilled personnel, all of which come with an increased price tag. Increased payments to the General Services Administration for rent and utilities are seldom covered fully. Communications costs to access new observing systems of the National Weather Service are only partially covered by increased funding.
If the discussion is about revenues from data sales, the situation is not much different. We were directed to recoup an additional $2 million from data sales through increased fees to the end users. This was a top-down decision and, although NCDC had some input regarding the fee schedules, it was a judgment call how much additional revenue would be generated versus how much sales would decline because of the increased costs. Statistics show that both sales and income are down, rather than up as was anticipated and hoped for. In an effort to trim printing and distribution costs, NCDC has placed more data online, further reducing sales. The increased service costs and newly instituted charges for accessing online data have not been in place long enough to predict the long-term impact of these policies.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? Other than the situation described in question 6b above, we seldom encounter any real problems because of undue restrictions. We do occasionally have some bilateral agreements that restrict the use of data received from a foreign country to NOAA or other U.S. government agencies. The difference between this situation and WMO Resolution 40 is that in the bilateral agreements the entire data set is restricted from further commercial distribution as opposed to certain stations, elements, etc., applied under the recent resolution. There is little additional database cost incurred in the management of these bilateral agreements.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? As stated earlier, with noted exceptions, our databases are in the public domain; therefore, we do not have examples of harm or misuse of our data. That is not to say that individuals and/or corporate entities may not draw some erroneous conclusions by using our data without full understanding of the caveats normally associated with climatological observations.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? Please refer to response to question 8b.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? The free exchange of meteorological and climatological data has been traditional in the international community at least since the 1929 Copenhagen Convention where standard formats were agreed upon. It is well recognized that weather and climate know no political boundaries. In this age of concern over global warming and other changes, whether natural or man-made, it seems incongruous that this traditional free exchange of information is being decreased rather than vigorously enhanced. Abolition of WMO Resolution 40 would be a step in the right direction.
Another concern is the new proprietary protection legislation being proposed for databases. This may well turn out to be a significant impediment to the continued open exchange of data. It appears that such a law could result in very restrictive practices on the part of the national and international scientific data archive centers.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? I believe that these issues are not restricted to NCDC but are inherent in the operations of the other data centers within NOAA. Further, I believe that other agencies within the federal government that have a vested interest in climate archives and databases, such as the Department of Agriculture, NASA, USGS, and the Forest Service, face the very same problems.
Not-for-Profit Data Activity
DR. SERAFIN: Our next speaker is Dave Fulker, who is the director of the Unidata program at the University Corporation for Atmospheric Research.
David Fulker, University Corporation for Atmospheric Research
Response to Committee Questions
1a. What is the primary purpose of your organization? Unidata offers software and services that help universities acquire and use atmospheric and related data—especially current data—on their own computer systems. These tools and services as well as the data (with one exception, where fees are paid directly to a provider) are offered at no cost. The Unidata program is operated by a not-for-profit organization, the University Corporation for Atmospheric Research (UCAR), whose purpose is to advance knowledge of Earth's atmosphere and related systems.
1b. What are the main incentives for your database activities (both economic and other)? Access to global meteorological data is essential for studying and predicting atmospheric behavior, even on regional scales. To fulfill its mission, UCAR always has engaged in database activities, including efforts at the National Center for Atmospheric Research to create and archive major holdings of atmospheric observations and simulations. Extending this overarching tradition, Unidata—starting in 1985—responded to specific university pleas for economical access to current (i.e., quasi-real-time) data from a variety of sources. Such data are crucial for meteorology instruction in the United States, where the practice of challenging students with real-life prediction problems is well established. (Prior to Unidata's inception, the only available current “data” were facsimile maps.) More recently, we began responding to university needs for accessing retrospective information, including case-study data sets focused on specific atmospheric phenomena.
Our primary economic incentive is as follows: with limited core funding from the National Science Foundation and severe funding constraints at universities (especially the smaller colleges), Unidata has sought to maximize the return on database expenditures via community effort. This has been achieved using technology, especially distributed computing. Unidata operates without a data center by providing tools that help universities acquire, manage, and share data on the Internet, using either “push” or “pull” methods. Our economic model, in essence, substitutes modest human effort and (surplus) computer power at each campus for a centralized database with access fees or other funding means.
2a. What are your data sources and how do you obtain data from them? Our principal sources are the National Weather Service (NWS) and the National Environmental Satellite, Data, and Information Service; some of the NWS data originate from foreign weather services. Private-sector sources include a network of lightning sensors and (soon) the automated weather sensors carried in commercial aircraft. The means for accessing these sources are varied; they include contractual agreements with (commercial and noncommercial) third parties, as well as a variety of voluntary and collaborative arrangements. We now are planning a project in which university-based sensor systems (built on GPS receivers) will be the sources for new data streams that depict wave propagation delay (and, indirectly, other parameters) in the atmosphere and ionosphere.
It is worth noting that Unidata often does not really “acquire” the data, in the usual sense. Instead we act as a broker, creating relationships between data providers and users by providing software that enables such relationships and by negotiating suitable terms and conditions for data acquisition and use.
The technical mechanisms for acquiring data from our various sources fall into three categories:
- All of the quasi-real-time data we acquire are placed on the Internet—usually by the providers—using Unidata “push” software. This mechanism, called Internet Data Distribution (or IDD), is a distributed application in which any node can be a source or a sink for data. At some nodes, data are down-linked from communications satellites and then immediately injected into the IDD. (A toy sketch of this push pattern appears after this list.)
- Our case-study data sets are acquired from their creators via electronic file transfers. The creators assemble them from various sources.
- We are planning a new form of retrospective data access built on the pattern of special servers being run by data providers. The server mechanism—dubbed the Distributed Oceanographic Data System (DODS) by its authors at the University of Rhode Island—is compatible with key Unidata software and is well designed for remote access to multidimensional data sets and subsets thereof.
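The toy sketch below mimics the shape of the IDD push pattern described in the first item above: a node accepts a product once and fans it out to every downstream subscriber, any of which may itself be another relay. It illustrates the point-to-multipoint idea only, and none of the real IDD protocol details.

```python
# In-process illustration of "push" distribution: any node can be a source
# or a sink, and products fan out from relays to subscribers. This is only
# a toy shape of the IDD, not its actual protocol.
class RelayNode:
    def __init__(self, name: str):
        self.name = name
        self.subscribers = []                # downstream nodes or handlers

    def subscribe(self, handler) -> None:
        self.subscribers.append(handler)

    def inject(self, product: bytes) -> None:
        """Called by an upstream source; pushes the product downstream."""
        print(f"{self.name}: relaying {len(product)} bytes")
        for handler in self.subscribers:
            handler(product)

# Source -> regional relay -> two campus sinks (event-driven processing):
regional = RelayNode("regional-relay")
regional.subscribe(lambda p: print("campus A stored", len(p), "bytes"))
regional.subscribe(lambda p: print("campus B stored", len(p), "bytes"))
regional.inject(b"SAMPLE SURFACE OBSERVATIONS")  # e.g., a satellite downlink node
```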
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? For most of our data sources, the main barriers are complexity of access (especially in quasi-real time) and complexity of use. The latter includes gaining the metadata needed for proper interpretation and integration; of particular concern is the absence of common methods and metadata to handle spatial/temporal referencing consistently across our various data sources. We employ several mechanisms—all under continual development—to deal with these obstacles:
- Our IDD system simplifies quasi-real-time access by providing a tool that inexpensively links any data source to the Internet. The IDD has mechanisms for reliable, point-to-multipoint delivery, even in the face of relatively severe network congestion. For recipients, the IDD supports event-driven processing and user-defined patterns for data selection and data storage.
- For case-study data sets, the COoperative, Distributed Interactive Atmospheric Catalogue system facilitates data discovery based on a variety of criteria, including user-defined geographic and temporal limits.
- We provide “decoder” routines (i.e., format translation codes) that match our data streams and create files (from the IDD) for use with several data-analysis systems.
- Our highly portable Network Common Data Form (netCDF) software facilitates creating and accessing multidimensional arrays stored in a self-describing, machine-independent file format. (Some of the aforementioned decoders produce netCDF files.) The abstract data model of the netCDF permits data sets to be accompanied by geographic referencing, units of measure, and other metadata needed to integrate and synthesize data from multiple sources. (A minimal netCDF usage sketch appears after the summary below.)
- We believe DODS (described in the answer to question 2a) also will reduce the complexity of data access and use, especially because it is compatible with the netCDF software.
In summary, we are dealing with the most common barriers through evolving technological mechanisms that utilize the Internet, operate in diverse computers, and simplify—through abstraction—the complex nature of atmospheric and related data.
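As a minimal illustration of the netCDF data model mentioned in the list above, the sketch below writes a small self-describing file using the modern Python netCDF4 bindings (these bindings postdate this workshop; the core library is C/Fortran). The variable names and values are invented.

```python
# Minimal self-describing netCDF file: dimensions, units, and coordinate
# metadata travel with the array, so data from different sources can be
# integrated without side documentation. Names and values are hypothetical.
from netCDF4 import Dataset

with Dataset("toy_grid.nc", "w") as nc:
    nc.createDimension("lat", 3)
    nc.createDimension("lon", 4)
    lat = nc.createVariable("lat", "f4", ("lat",))
    lon = nc.createVariable("lon", "f4", ("lon",))
    temp = nc.createVariable("temperature", "f4", ("lat", "lon"))
    lat.units = "degrees_north"              # geographic referencing...
    lon.units = "degrees_east"
    temp.units = "K"                         # ...and units of measure as metadata
    lat[:] = [30.0, 35.0, 40.0]
    lon[:] = [-110.0, -105.0, -100.0, -95.0]
    temp[:, :] = 288.0                       # broadcast a uniform field

# Any netCDF-aware tool can now recover the grid, units, and coordinates:
with Dataset("toy_grid.nc") as nc:
    print(nc.variables["temperature"].units, nc.variables["lat"][:])
```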
At present, cost is a significant barrier to only one of our data sources: the NWS network of Doppler radars. To minimize its internal networking costs, the NWS established contractual agreements that grant—to a few commercial firms—access rights to real-time radar data. Hence these data are essentially proprietary, and the associated costs and redistribution constraints have greatly limited their use in the Unidata context. We have been unable to completely overcome this barrier, even though Unidata has a contract (secured competitively) with one of the firms that has access. The costs (though reduced) remain too high for widespread university use. The NWS is reexamining these access agreements, so a solution may soon be at hand.
3. What are the main cost drivers of your operations? The principal Unidata costs are:
- Software engineering—including design, development, upgrades, porting, release packaging, testing, and collaborating with external developers.
- User support—including training, documentation, consultation, troubleshooting, and various community-building activities such as news publications, workshops, special-interest e-mail lists, participant databases, and Web-based reference materials.
- Data acquisition—including contracting for data provision, collaborating with data providers, organizing and populating databases of case-study data sets, creating or referencing metadata, representing university needs for data, and coordinating the use of Unidata software for community-wide (real-time, push-style) data sharing.
Clearly, human effort dominates the Unidata Program Center's costs. The drivers for Unidata universities are computers, Internet connections, human resources, and (for about 15 percent of our users) access fees for radar data.
4a. Describe the main products you distribute/sell. We sell no products. The main products we distribute are:
Software Packages These software packages meet university needs for managing and analyzing atmospheric and related data on a variety of computers, all of which run some variant of Unix. We have initiated a shift toward Java software and eventual platform independence. Some Unidata packages were designed and developed by us, and others were developed elsewhere and donated, on the condition that we provide user support.
Quasi-Real-Time Data Streams These data streams are suitable for university-level research and instruction in a variety of Earth-science subjects on regional and global scales. Most of the data are atmospheric or oceanic, and they range from in situ and remotely sensed observations to the outputs of forecast and data-assimilation models. Increasingly, Unidata universities employ these data streams to create derived products that are made available on the Web or through Unidata's real-time dissemination system.
Case-Study Data Sets These data sets are created primarily by the NWS, but increasingly by universities as well, to facilitate studying specific atmospheric phenomena and attendant forecasting problems. A typical data set spans two to three days and includes most or all of the relevant observations and computer analyses/forecasts from that period.
4b. What are the main issues in developing those products? The main issues pertaining to software development are the complexities of multiplatform use, keeping pace with data stream changes, exploiting technology advances, and making the software easy to use while offering comprehensive functionality.
Unidata disseminates but does not “develop” its quasi-real-time data products. For those universities creating derived products, timeliness and spatial resolution seem to be the main issues because many of the efforts are geared toward studying the problems of creating accurate, detailed forecasts of severe weather on regional and local scales (i.e., mesoscale forecasting and “nowcasting”).
In developing case-study data sets, the main issues are segmentation, metadata, and formats. We strive to segment the data sets in ways that permit Internet access to useful subsets, without excessively large transfers. We strive to provide ample metadata for classroom and similar uses as well as to help academicians find the data they need. We have gravitated toward storing these data sets using the same formats in which they were created, even though these are far from ideal in many respects; the implication is increased complexity in the decoding (or format translation) software.
4c. Are you the only source of all or some of your data products? We are not the sole source for any data products, but for universities who seek data in quasi-real-time, Unidata is by far the dominant source.
5a. What methods/formats do you use in disseminating your products? The data products disseminated by Unidata are unaltered from the forms in which they are acquired from providers. Our real-time distribution system embeds each product in a frame with a metadata tag (for routing and other event-driven decisions) and a unique signature (for duplicate detection and queue indexing). This framing method/format is unique to Unidata, but it is well documented, and Unidata software for generating and receiving data in this form is freely available.
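A toy version of that framing follows, assuming a JSON header and an MD5 content signature; Unidata's documented wire format differs, but the routing-tag and duplicate-detection ideas are the same.

```python
# Toy version of the framing described above: each product is wrapped with
# a routing tag and a content signature; receivers use the signature to
# drop duplicates. The header layout here is invented for illustration.
import hashlib
import json

def frame(product: bytes, feedtype: str, ident: str) -> bytes:
    header = {
        "feedtype": feedtype,                # drives routing / event actions
        "id": ident,
        "signature": hashlib.md5(product).hexdigest(),
    }
    return json.dumps(header).encode() + b"\n" + product

seen = set()

def receive(framed: bytes) -> None:
    header_line, product = framed.split(b"\n", 1)
    header = json.loads(header_line)
    if header["signature"] in seen:          # duplicate: already queued
        return
    seen.add(header["signature"])
    print("accepted", header["id"], "on feed", header["feedtype"])

msg = frame(b"METAR KDEN 251753Z ...", "SURFACE-OBS", "metar-001")
receive(msg)
receive(msg)                                 # second copy silently dropped
```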
Though many Unidata users store our data products in their original forms, we provide decoders that facilitate other options, such as storing data in the forms expected by our data-analysis software.
5b. What are the most significant problems you confront in disseminating your data? Aside from the cost and redistribution constraints associated with radar data, as previously discussed, the most significant problems we confront in disseminating data are:
- Interactions with providers and users that are necessary to ensure proper implementation of Unidata technologies;
- Coordination of community efforts to yield effective, coherent results;
- Accommodation of a rapidly changing Internet, with sporadic local outages, etc.;
- Adaptation of software and metadata to changes in the data streams;
- Human effort required to broker relations between providers and users; and
- Incorporation of new data streams that interest our users.
6a. Who are your principal customers (categories/types)? Unidata serves academic departments of colleges and universities in North America, the Caribbean, and Central America, though most users are in the United States and Canada. Participants use Unidata capabilities primarily for meteorological instruction and research, but the software and data products have been employed in a wide range of natural-science studies at two-year, four-year, and graduate-level institutions.
6b. What terms and conditions do you place on access and use of your data? Some of the data streams available through Unidata—specifically, those from the Global Atmospherics lightning detection network, the NWS radar network, and (soon) the commercial airlines—can be accessed only after direct agreements are struck between the university and the provider. In no case are licenses or contractual agreements with Unidata required to access data, though we point recipients to a warning statement (see <http://www.unidata.ucar.edu/data/data_usage.html> for additional information), which refers to conditions placed on the data by the NWS and foreign weather services and cautions against using the data for purposes other than education and research.
6c. Do you provide preferential terms for certain categories of customers? Yes; colleges and universities in North America, the Caribbean, and Central America have essentially unlimited, free access to Unidata software and services, including comprehensive support. Much Unidata software is freely available to anyone via Internet, but support is not guaranteed.
7a. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? We do not price our products, and most are available to universities at no cost. In the one exception—radar data—the pricing structure was set by the vendor who won our competitive procurement. In an effort to minimize university costs, the evaluation criteria for our procurement included the pricing structure that would be imposed upon university recipients of the data.
There are a few nonuniversity recipients of our data products. These are groups (mostly government agencies) with whom we collaborate, and such organizations can receive only a subset of the data available to our university users, in accordance with our data-access agreements.
Except where prohibited by the (external) owner, Unidata software is available to anyone, and the cost is always zero.
7b. Do your revenues meet your targets/projections? Please elaborate if possible. Unidata seeks no revenue from its products, and we meet that target exactly. The contractor who provides our radar data probably has a revenue target that is not being met. I estimate that the provider's Unidata-related revenues—the sum of our contract (about $70,000 per year) and fees from universities (about $50,000 per year)—fall short of the target by at least 50 percent.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? Though I hesitate to describe the provisions as “unduly restrictive,” it is clear that costs and redistribution constraints are limiting the educational uses of certain data we acquire. In contrast, where data can be used without restrictions, our university community has shown remarkable ingenuity in creating Web-based materials of educational value in a surprising number of fields.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? We have not sought legal protections for our database activities, and we do not think Unidata products have been misused with respect to our rights or those of our data providers. Our view notwithstanding, complaints have been raised—to the NWS and the U.S. Congress—about university use of Unidata services to create Web pages that “unfairly compete” with private-sector products here and abroad.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? Where the data being conveyed are proprietary, we have helped protect providers' rights by using point-to-point delivery methods (i.e., direct from provider to university) rather than the data-sharing delivery methods we employ for most data streams. This imposes a greater computing and networking load on the provider, but allows more direct control over who receives data. For example, some providers require signed usage agreements.
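To make the contrast concrete, here is a minimal, hypothetical Python sketch of the two delivery models: a relayed (data-sharing) topology in which each site forwards data to downstream sites, versus point-to-point delivery restricted to recipients holding signed agreements. The site names, data structures, and functions are invented for illustration and do not depict Unidata's actual software.

```python
# Hypothetical sketch only; not Unidata's actual distribution software.

def relay_delivery(topology: dict, site: str = "provider") -> set:
    """Data-sharing model: each site forwards data to its downstream peers,
    so the provider has little direct control over the final recipients."""
    reached = {site}
    for downstream in topology.get(site, []):
        reached |= relay_delivery(topology, downstream)
    return reached

def point_to_point_delivery(recipients: list, signed_agreements: set) -> set:
    """Point-to-point model: the provider sends directly, and only to sites
    that have signed usage agreements, at greater cost to the provider."""
    return {site for site in recipients if site in signed_agreements}

topology = {"provider": ["univ_a", "univ_b"], "univ_a": ["univ_c", "univ_d"]}
print(relay_delivery(topology))                                   # all five sites
print(point_to_point_delivery(["univ_a", "univ_b"], {"univ_a"}))  # only univ_a
```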
Except for the above technical approach—where providers implement their own (contractual) protections—Unidata generally employs informal (managerial) mechanisms to prevent data misuse. For example, certain data from the NWS are designated by the country of origin as “not for export, except for research and education purposes.” We have, through e-mail and newsletter announcements, discouraged universities from posting these data or derived products on the Web, even though such restraint may not be legally required. This matter is under discussion.
8d. What specific legal or policy changes would you like to see implemented to help address the problems identified above? The ideal—from a purely educational and research perspective—would be for data depicting Earth's natural systems to be available at no cost and without distribution constraints. Of course similar benefits would derive from a policy allowing unlimited use specifically for research and education, if such usage could be properly distinguished. However, educational use increasingly depends on access via the Web, and user/usage characteristics cannot be determined in this medium without a level of effort that is beyond most educational organizations.
I am unable to articulate an overarching approach that fully resolves this issue, knowing that Web-based distribution can cause monetary or other harm. However, there are clear educational and economic benefits to government policies that maximize the availability of data depicting our environment. Perhaps the law of eminent domain should apply to databases and their encryption keys.
In addition, it might be sensible for governments to offer legal protections only to those database authors who guarantee access at marginal cost for uses “such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research,” as described in the current copyright law.
9. Do you believe the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? Though Unidata focuses primarily on current data, I think the problems, barriers, and issues we face are similar for retrospective databases in all of the natural sciences. In particular, the absence of common methods and metadata to handle spatial and temporal referencing—especially across databases from different disciplines—is a problem faced in all of the Earth sciences. Similarly, the tension between educational and commercial data interests exists in all disciplines. Actually, the tension may be worse in other disciplines because the global nature of atmospheric phenomena has created a culture of free and open data exchange, at least on some levels.
Issues in geoscience, broadly defined, that have not arisen in Unidata include cases where the data are politically loaded (because they reflect government activities, government inaction, threats to tourism, etc.) or where the most crucial data are unaffordable (as with Landsat, for example) or highly proprietary (as with oil-well data).
Finally, I am concerned that current efforts to strengthen database protections may damage a long history of judicial and legislative efforts to balance authors' rights to exclusive control over their creative works against users' rights to utilize the ideas contained in such works. The need for balance—as reflected, for example, in current “fair-use” legislation—derives from the “Progress” objective set forth in the Constitution: “The Congress shall have Power . . . To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.”
To an increasing extent, the “progress of science” is manifest as a succession of databases, each predicated on previous ones. (I note that a computer model can be encoded in a database; hence even the evolution of models may be viewed as a series of databases.) As yet this is not an issue in Unidata. However, I foresee the need for regulations and policies that foster rather than inhibit the creation of derivative databases, especially where the derivatives show creative differences from the originals.
General Discussion
MR. REICHMAN: Jerry Reichman, Vanderbilt Law School. It seems to me you already have a consortium of universities that is exchanging data for noncommercial purposes. I wonder if this model is capable of being enlarged into something much bigger and broader. In other words, would it be workable, in your opinion, if universities did this generally with data that they generate? Would it be workable to have at least a two-tiered price structure, or term structure—one for other universities participating in the consortium and one for outside commercial people who want to take these data and do other things with them? In simple form, would a consortium system solve the problem of universities, which want to generate data, need access to data, and want to distribute data for scientific purposes, but also to commercialize data?
MR. FULKER: I think you pose a good question. I don't know that it could be put in quite such a broad context as that. We have been motivated to avoid creating sensitivities about competition with private-sector vendors, and we have been very careful to devise distribution methods that serve universities. You are proposing a different model. Quite frankly, I can't think of any reason why it wouldn't be possible.
PARTICIPANT: Can you give an example of database protection that would inhibit your ability to provide service?
MR. FULKER: The service that we provide most directly is not, I think, especially vulnerable to most of the database protection efforts. The biggest problem that we have has to do with redistribution constraints that prevent our universities from exercising the full range of educational opportunities, which have included—very successfully, I believe—the provision of information in the K-12 context. Instead of using our distribution system, they are turning around and putting information on the Web, making it accessible for use in the schools.
The general indication from our universities is that access control is impractical in such extended contexts. I don't think there are any examples where we or universities are directly using the data for other than education or research, but there may be secondary usage via the Web that is not so constrained. Thus I find myself alarmed by provisions that rely heavily on distinctions between educational and private uses of data.
The problem concerning World Meteorological Organization Resolution 40 is that I believe nations have a public-good responsibility to share data with other nations on an unrestricted basis. The erosion of that free exchange is, I think, the biggest threat, and database protections encourage it.
DR. SERAFIN: I would just like to comment on that. Were you talking about some example beyond the radar example that Dave described?
PARTICIPANT: I was just asking the general question.
DR. SERAFIN: That radar example is an interesting one. The radar data are actually provided or collected or acquired through the National Weather Service radar. The National Weather Service determined that it did not have the resources to broadly distribute those data to the community, even its own weather forecasting offices in the network. So, it went to a private-sector mechanism for doing that, actually contracted with several vendors so that there would be competition, and allowed them, through charging for those services, to distribute those data. Whenever you see, on the Weather Channel or your local weathercast, the radar picture of the country or the radar picture of your region, they are getting those data through a private-sector company, but those data originated with the National Weather Service.
What we have seen is that a rather large number of universities feel that they can't afford that. Of course, they can turn on The Weather Channel in their departments and see some of it there. They may not have some of the same tailored products that they would prefer.
The next speaker is Bob Brammer. Bob, a long-time colleague of mine, is the vice president and chief technology officer for TASC.
Commercial Data Activity
Robert Brammer, TASC
Response to Committee Questions
1a. What is the primary purpose of your organization? TASC is a diversified information systems integration corporation. Our customers are both government and commercial organizations, primarily in the United States but with a growing international segment.
For the purposes of this NRC workshop, we will focus on TASC's information businesses in weather and agriculture. These operating entities are organized as TASC subsidiaries—the WSI Corporation (weather) and Emerge, Inc. (agriculture). While these do not account for the majority of TASC's revenues, they are significant parts of our business. WSI recently celebrated its twentieth anniversary, while Emerge is a recently formed start-up.
1b. What are the main incentives for your database activities (both economic and other)? As a commercial for-profit business and a subsidiary of a publicly traded firm (Litton Industries), TASC obviously expects its business units to be growing and profitable, according to approved business plans. In addition, TASC believes that these information businesses are strong strategic fits with the information technology focus of TASC's overall business and have excellent growth potential over the next several years.
2a. What are your data sources and how do you obtain data from them? The WSI Corporation is primarily a real-time business. We receive our information via several digital communication networks from a variety of sources, both government and commercial. Our primary supplier is the U.S. National Weather Service (Family of Services). We also downlink information directly from both U.S. and international weather satellites. Additionally, we receive information from a variety of other government agencies and private organizations under many types of terms and conditions. The information from these sources is integrated and processed in many ways to create a variety of information products.
Conceptually, this model has not materially changed in the past five years, although we have a significantly more diverse database today than we had five years ago. We expect that this model will still be relevant in the next five years, although we will likely have a much broader range of commercial data sources than we have today.
In our Emerge agricultural information unit, the primary data sources are aircraft multispectral remote-sensing systems. We lease aircraft, mount our uniquely designed scanners on them, and fly surveys under contract for various agribusiness organizations. The data are sent back to our central computing facility for processing to create value-added information products, which are then transmitted to our clients. In the course of these surveys, we also use data from the Global Positioning System for precise navigation and data from our clients concerning their agricultural operations.
Since our Emerge unit is new, we don't have five years of history or a strong basis for future prediction. However, we anticipate rapid growth in data sources as the business builds.
2b. What barriers do you encounter in getting these data and integrating them, and how do you deal with those barriers? The primary barriers are the technology issues and associated costs of implementing data communication networks, satellite downlink stations, aircraft remote-sensing systems, etc. Obviously, we deal with those challenges with a mix of staff expertise and technology.
Occasionally in the weather aspects of our business there are political barriers to receiving data from international organizations. We work cooperatively with the U.S. National Weather Service in those areas.
3. What are the main cost drivers of your database operations? The main cost drivers are the costs of the skilled labor required to preprocess and quality-assure the incoming data, to operate the information systems, and to respond to customer questions and requests. The associated hardware, software, and networking technology are also significant budget items.
4a. Describe the main products you distribute/sell. For the weather information part of our business, we have a variety of workstation products and weather information products that are addressed to our various markets. The primary markets are the news media (network and cable television), aviation, energy and power, and agribusiness. Our agricultural information services are targeted at large growers. (These are described in further detail at our Web sites; see <www.wsicorp.com> and <www.emerge.wsicorp.com>.)
WSI Weather Information Products and Systems
Weather Radar Products
- NOWrad® mosaic radar imagery providing local, regional, and national coverage with 5- and 15-minute updates. Unaltered single-site NEXRAD imagery: 4-tilt base reflectivity, composite reflectivity, 3-layer composite reflectivity, echo tops.
- Velocity azimuth display winds. Vertically integrated liquid. 4-tilt radial velocity. 2-tilt mean storm-relative velocity maps. Increased radar sensitivity for better coverage and definition of precipitation.
- One- and three-hour storm accumulation. Total storm precipitation. Hourly digital rainfall array. Free text message.
- Product updates: 10 minutes in clear-air mode, 6 minutes in precipitation mode, 5 minutes when local severe weather is detected.
- Enhanced NEXRAD mosaic imagery: composite reflectivity, 3-layer composite reflectivity, echo tops, vertically integrated liquid, constant-altitude plan position indicator winds, enhanced velocity azimuth display winds, contoured echo tops.
- Radar summary.
- Regional and national coverage. Combines NOWrad radar mosaics with NEXRAD storm information—including storm-cell movement, echo-top heights, hail, mesocyclones, tornadic vortex signatures, and severe weather watch boxes. Simultaneous viewing of multiple radar sites in a single image. Automatic suppression of most false echoes. 15-minute updates via dial-up or via satellite delivery on WSI's HCSN.
- Winter storm mosaic: regional and national coverage. 15-minute updates via dial-up or satellite delivery on WSI's HCSN. Color-coded NOWrad mosaic radar indicates precipitation type: rain, snow, or mixed. Automatic suppression of most false echoes. Simultaneous viewing of multiple radar sites in a single image.
- PRECIP rainfall estimates: regional and national coverage. NOWrad mosaic radar interpreted into quantitative precipitation amounts. Cumulative totals appear in color-contoured bands. Real-time hourly estimates available by dial-up or via satellite delivery on WSI's HCSN. Climatic summaries: daily, weekly, monthly, seasonal, and yearly.
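To illustrate the kind of computation behind such quantitative precipitation estimates, the sketch below applies the widely used Marshall-Palmer Z-R relationship (Z = 200 R^1.6) to convert radar reflectivity into rain rates and accumulate them over time. WSI's actual PRECIP algorithm is not public; this is a generic, hypothetical illustration of the technique.

```python
# Illustrative sketch of reflectivity-to-rainfall conversion using the
# standard Marshall-Palmer Z-R relationship; not WSI's proprietary method.

def rain_rate_mm_per_hr(dbz: float, a: float = 200.0, b: float = 1.6) -> float:
    """Estimate rain rate R (mm/hr) from reflectivity in dBZ.

    Z (mm^6/m^3) = 10^(dBZ/10); inverting Z = a * R^b gives R.
    """
    z = 10.0 ** (dbz / 10.0)
    return (z / a) ** (1.0 / b)

def hourly_accumulation(dbz_samples, minutes_per_sample: float = 15.0) -> float:
    """Accumulate rainfall (mm) from a sequence of radar scans."""
    return sum(rain_rate_mm_per_hr(d) * minutes_per_sample / 60.0
               for d in dbz_samples)

# Example: four 15-minute scans at 40 dBZ (moderate rain, ~11.5 mm/hr)
print(round(hourly_accumulation([40.0, 40.0, 40.0, 40.0]), 1))
```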
Meteorological Satellite Image Products WSI provides worldwide satellite imagery with 100% global coverage, including the U.S. GOES and NOAA polar orbiters, Japan's GMS, and Europe's Meteosat. Imagery includes infrared, visible, water vapor, thresholded, and full-spectrum products.
Alphanumeric Data Raw data, decoded or plain-language observations, severe weather, forecasts, technical discussions, numerical model output data, weather summaries, calculations, and conversions. Access to National Weather Service domestic, public, and international data plus the FAA 604 circuit. Data are available within seconds of receipt from the NWS, by dial-up or via satellite delivery on WSI's HCSN.
DIFAX Operational weather charts with timely, frequent updates. AVcharts™ for aviation professionals; weather charts for professionals and enthusiasts. The high-resolution forecast model data service supplies gridded data from the following models: Aviation spectral, Nested Grid, European Centre for Medium-Range Weather Forecasts, Medium Range Forecast, Rapid Update Cycle, and Eta. Timely delivery via satellite on WSI's HCSN. Raw data available hours before NWS DIFAX charts.
DATAsuite DATAsuite incorporates all of WSI's data and value-added products into one offering with the added advantage of including all future data products still in development during the life of a customer's contract. DATAsuite includes unlimited domestic and international satellite imagery, and the NOWrad® family of radar products—winter storm mosaics, radar summary, STORMcast®, and PRECIP™ rainfall mosaics. Also, unlimited NEXRAD single-site products from all WSR-88D sites, HRS Forecast Model data, DIFAX and SUPERfax™ charts, and NWS text products and our complete family of on-air WEATHERcharts™ and more:
- STORMcast®: Weather information for the media market. STORMcast® automatically locates, tracks, and forecasts intense storms as they bear down on a station's area. Images showing storm-cell position, movement, and intensity are updated and sent over a dedicated network within two minutes of the WSI radar scan. Severe storm tracking and path projection are depicted in clear, crisp icons with smooth, visually appealing radar echoes.
- WEATHERcast: Forecasting information for the media. With this software package for WEATHERproducer, broadcasters now have access to ready-made, on-air graphical products together with meteorological tools that actually illustrate what their viewers want most—future weather conditions, automatically.
Embedded intelligence puts WSI data and reliable, science-based tools in the hands of the meteorologist. The latest projections, detailed graphics, and proven computer modeling from WEATHERcast create graphical forecasts that help viewers peer into the future. They can watch as their weather week emerges: sun and cloud casts, temperature, rain or snow, fog, thunderstorms, and severe weather forecasts.
- WEATHERproducer for Media: The totally integrated, data-to-graphics workstation from WSI builds ratings by delivering more of what broadcasters want—the forecast—automatically. As a single, integrated workstation, WEATHERproducer appeals to both the science-driven meteorologist and the audience-driven station management.
- WEATHERworkstation for Aviation is a monitoring and alerting system designed for operations where weather plays a critical role in safety and profit and loss. Briefings can be tailored to a user's specific needs.
- WEATHERworkstation for Industry is a one-of-a-kind weather monitoring, alerting, and forecasting system designed for strategic and tactical industry applications. Markets include utilities, transportation, geology, construction, agriculture, travel, insurance, education, and entertainment.
Internet Services Advertiser-sponsored, consumer-oriented Web site, Intellicast (see <www.intellicast.com>), as well as a subscription service for energy companies, EnergyCast (see <www.energycast.wsicorp.com>).
Services Services include round-the-clock customer and technological support. Customers can talk to WSI meteorologists to consult on weather or report anomalies, or reach a systems expert for technical support. Service also includes a full range of specialties, such as consulting, design, animation, programming, and forecasting services.
Emerge Agricultural Information Products Emerge is a comprehensive precision agricultural information service that provides real-time site-specific data to subscribers. Emerge products assist in detecting crop variability, determining possible causes, and deciding what remedial actions could be taken, if necessary. Emerge gives growers a complete informational view of a farm or agricultural operations, with access 24 hours a day, 7 days a week.
The Emerge service includes information products such as:
- Detailed infrared imagery and enhanced vegetation maps, enabling detection and measurement of areas of variability (see the sketch following this list).
- Critical weather data, forecasts, and agricultural weather alerts, at both a regional and field-specific level. These include such parameters as growing degree days, evapotranspiration, local inversions, and other information essential for crop management.
- Complete management of and access to important field data, such as yield maps, soil tests, and field inputs.
- Pest and disease alerts based on the exact weather conditions on designated fields.
- Crop yield modeling software, predicting potential yields based on specific seed, soil, and other inputs.
- EmergeView™ mapping workstation software for information display and analysis.
- Information access through a customized and secure Internet site.
- Ongoing field-level support and assistance.
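As a hypothetical illustration of two standard computations underlying products like the vegetation maps and growing-degree-day reports above, the following Python sketch computes the normalized difference vegetation index (NDVI) from near-infrared and red bands, and daily growing degree days from temperature extremes. Emerge's actual algorithms are proprietary; the formulas shown are the textbook versions.

```python
# Minimal sketches of two standard agronomic computations; hypothetical
# illustrations only, not Emerge's proprietary algorithms.

import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index from multispectral bands.

    NDVI = (NIR - Red) / (NIR + Red); values near 1 indicate dense,
    healthy vegetation, values near 0 indicate bare soil.
    """
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / np.maximum(nir + red, 1e-9)  # avoid divide-by-zero

def growing_degree_days(t_max: float, t_min: float, t_base: float = 10.0) -> float:
    """Daily growing degree days (deg C) above a crop's base temperature."""
    return max(0.0, (t_max + t_min) / 2.0 - t_base)

# Example: a 2x2 reflectance patch and one warm day
print(ndvi(np.array([[0.8, 0.6], [0.5, 0.3]]),
           np.array([[0.1, 0.2], [0.3, 0.3]])))
print(growing_degree_days(t_max=28.0, t_min=16.0))  # -> 12.0
```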
4b. What are the main issues in developing those products? The main issues are ensuring that our products are focused on the specific applications that our customers require, that our implementations are better than the competition's, and that we deal effectively with the various technology issues associated with these developments in a cost-effective way.
4c. Are you the only source for all or some of your data products? If not, please describe the competition you have for your data products and services. WSI is the largest of the providers of real-time weather information. However, there are competitors in the various segments of the weather information business. In the United States these competitors tend to be small, privately held firms who focus their expertise and competitive products in various specific market segments. Internationally, to the extent that there are competing services, these are generally provided by the different countries' national weather services.
Our aircraft remote sensing information service for agriculture is a relatively new business, and it does not yet have direct competitors providing similar services.
5a. What methods/formats do you use in disseminating your products? Our products are transmitted through various private and public data communications networks. For the weather information part of our business, we make heavy use of satellite broadcasting services from various satellite providers. For the agricultural unit, much of our information is distributed on a subscription basis through the Internet. We also use the Internet for the weather information business. Additionally, there are various private networks that some of our customers use to obtain our information products.
5b. What are the most significant problems you confront in disseminating your data? There are many operational problems in dealing with a variety of telecommunications providers. Variations in quality of service and reliability are significant and expensive issues. The Internet is also a somewhat uncertain medium.
6a. Who are your principal customers (categories/types)? Television meteorologists, major airlines, air freight companies, electric power utilities, and major agribusiness firms are our principal customers. Some federal, state, and local government agencies are also important customers.
6b. What terms and conditions do you place on access to and use of your data? Generally, a monthly subscription fee provides access to a defined broadcast stream. Dial-up connections are also available on a connect-time fee basis. Licenses for specialized user software and redistribution rights are also established. There are also statements about the advisory nature of the forecasting services and certain limitations of liability. In addition, there are advertising fees, since some of our Internet services are advertiser-sponsored.
6c. Do you provide differential terms for certain categories of customers? Yes; distinctions on resolution (spatial and spectral) and timeliness are commonly used differentiators. Variations in user software functionality and in redistribution rights are also used.
7a. What are the principal sources for funding for your database activity? These are commercial businesses. The funds for the database activities come from the revenues from selling the products on commercial terms.
7b. What pricing structure do you use and how do you differentiate (e.g., by product, time, format, type of customer, etc.)? As noted in the response to question 6b, most of our revenue is derived from subscriptions. The customers sign a contract for a period of time (generally a year) and pay monthly for the information that we provide.
Product differentiation is done by all of the methods listed in the question above. Products can be differentiated by resolution (spatial or spectral), by timeliness (minutes are very significant in some applications), or by type of customer (we differentiate by functionality and by data volume).
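As a concrete, hypothetical illustration of such differentiation, the sketch below derives tiered products from a single source grid by coarsening spatial resolution and delaying release. The tier names and factors are invented for the example and do not reflect WSI's actual product tiers.

```python
# Hypothetical product-tiering sketch: one source grid, several offerings.

import numpy as np

TIERS = {
    "premium":  {"downsample": 1, "delay_minutes": 0},
    "standard": {"downsample": 2, "delay_minutes": 15},
    "basic":    {"downsample": 4, "delay_minutes": 60},
}

def make_product(grid: np.ndarray, tier: str):
    """Return a (coarsened grid, release delay) pair for a subscription tier."""
    t = TIERS[tier]
    k = t["downsample"]
    return grid[::k, ::k], t["delay_minutes"]

full = np.arange(64).reshape(8, 8)          # stand-in for a radar/satellite grid
coarse, delay = make_product(full, "basic")
print(coarse.shape, delay)                  # -> (2, 2) 60
```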
Additional revenues are derived from the sale of workstation systems and/or local area networks that receive our information products. In some cases we provide integration services to connect our systems with customer operations.
7c. Do your revenues meet your targets/projections? Please elaborate, if possible. In general, we meet our business plan objectives. If there were to be significant deviations from plans, we would make necessary changes. We do not report revenues at the subsidiary level.
8a. Have you encountered problems from unduly restrictive access or use provisions pertaining to any external source databases? In general, within the United States we can get the information needed on a commercial basis, if we feel that there is a sufficient market demand.
Until recently the commercial terms from many national weather services were far too expensive for us to obtain data from them on a profitable basis. However, we are now seeing some very large price reductions due to commercialization efforts in some countries that are changing this situation significantly. These changes, if sustained, may do much to stimulate weather information services internationally.
8b. What problems have you had with legal protection of your own database activities and what are some examples of harm to you or misuse of your data that you have experienced, if any? We have had some instances of unauthorized copying or redistribution of data. Although this has not yet been a major problem in our businesses, there are enough instances that we have to devote some staff time to reviewing reports of misuse. Certainly, there is the possible risk that such problems could grow. For example, we have seen some of our image products (e.g., weather radar images) used in promotional material without attribution despite the clear presence of copyright statements on these image products. We have called the offending organizations to attempt to resolve these issues with varying degrees of success. Almost surely, there are incidents like this that we never hear about.
To some extent, there is loss of revenue and profit from this type of misuse. We do not feel that this has yet been material in our business, but we will certainly continue to monitor it within our resources.
8c. How have these problems differed according to data product, medium, or form of delivery, and how have you addressed them (e.g., using management, technology, and contractual means)? Much of our revenue and profit derives from image and graphic products. In recent years, we have marked all these products with copyright statements. We believe that this has helped inhibit some misuse. The real-time nature of much of our information business is also a partial inhibitor to redistributors. The delays involved in redistribution would limit the value of this type of unauthorized use.
We use the various methods of intellectual property protection including trademarks, trade secrets, copyrights, etc. Our contracts specify the rights of the customer for redistribution. In some cases redistribution is the intent of the agreement, and there are specific measures detailing how such redistribution is to be done and what limits are placed on such redistribution.
We have experimented with “watermark” technical approaches to inhibit unauthorized copying or redistribution. Subtle signatures can be placed into image, graphic, or other types of information products to demonstrate authorship. These encrypted signatures can be embedded in the data without being apparent to uninformed users. We are currently investigating the operational implications of such techniques before proceeding to full-scale development and implementation.
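A minimal sketch of this idea follows, assuming a simple least-significant-bit embedding—a generic steganographic technique, not necessarily the approach WSI evaluated, and with encryption of the signature omitted for brevity.

```python
# Generic LSB watermarking sketch; hypothetical, not WSI's actual method.

import numpy as np

def embed_signature(image: np.ndarray, signature: bytes) -> np.ndarray:
    """Hide signature bytes in the least-significant bit of each pixel.

    The LSB changes a pixel value by at most 1 out of 255, so the mark
    is imperceptible to viewers but recoverable by the author.
    """
    bits = np.unpackbits(np.frombuffer(signature, dtype=np.uint8))
    flat = image.flatten()  # flatten() returns a copy; original untouched
    if bits.size > flat.size:
        raise ValueError("image too small to hold the signature")
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits
    return flat.reshape(image.shape)

def extract_signature(image: np.ndarray, n_bytes: int) -> bytes:
    """Recover an embedded signature from the pixel LSBs."""
    bits = image.flatten()[: n_bytes * 8] & 1
    return np.packbits(bits).tobytes()

# Example round trip on a random 8-bit grayscale image
img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
marked = embed_signature(img, b"(c) WSI 1999")
assert extract_signature(marked, 12) == b"(c) WSI 1999"
```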
We also use logging and reporting techniques to see who is using our Internet sites. In some cases, we have found apparent program-automated accesses that indicate likely retrieval and storage of some of our data. We are able to track the users and to investigate their usage. Generally, we can limit this type of access with today's technology. This may be more difficult in the future, depending on technical developments in computer security.
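A hypothetical sketch of this kind of log analysis follows: it counts requests per client address in a common-format access log and flags request rates implausible for interactive browsing. The threshold and log format are assumptions for the example, not WSI's actual monitoring system.

```python
# Hypothetical access-log analysis to flag program-automated retrieval.

from collections import Counter

def flag_automated_clients(log_lines, threshold_per_hour: int = 600) -> dict:
    """Count requests per client IP (common log format: IP is the first
    field) and return clients exceeding a plausible human browsing rate."""
    counts = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return {ip: n for ip, n in counts.items() if n > threshold_per_hour}

# Example with toy one-hour log entries
log = ['10.0.0.5 - - [01/Jan/1999:12:00:00] "GET /radar.gif"'] * 1000 \
    + ['10.0.0.9 - - [01/Jan/1999:12:00:01] "GET /index.html"'] * 20
print(flag_automated_clients(log))  # -> {'10.0.0.5': 1000}
```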
8d. What specific legal or policy changes would you like to see implemented to help address the problems addressed above? In the United States there are already applicable policies and laws governing our types of products and services. In particular, it seems clear that our image and graphic products are protected under copyright. In some cases, better enforcement might help. Further legislation does not appear necessary, although consistency in court rulings on what types of information can be copyrighted would be of benefit to the information industry.
Internationally, however, there are certain countries in which stronger local laws and enforcement would definitely be an improvement. The lack of a uniform legal framework is an inhibitor to certain types of information businesses in these countries. As a company with growing international markets, we would like to see uniformity in international laws for intellectual property.
9. Do you believe that the main problems/barriers/issues you have described above are representative of other similar data activities in your discipline or sector? If so, which ones? If not, what other major issues can you identify that other organizations in your area of activity face? The problems that we face are representative of those faced by similar data activities elsewhere. The time-critical nature of much of our business limits some of the unauthorized copying and redistribution problems that other types of information businesses may face. Furthermore, image and graphic products are somewhat easier to protect under copyright than archival text databases. These are not the reasons that we focus primarily on real-time information services, but that focus does provide some measure of protection.
General Discussion
PARTICIPANT: You mentioned that some of your sales go back to government agencies. What, if any, restrictions are placed on the redistribution or open access to those data sets that go back to government organizations?
DR. BRAMMER: Generally, the government agencies contract for them for their own use; the redistribution and use of those services are settled by agreement in the contract.
PARTICIPANT: Could one access it under the Freedom of Information Act?
DR. BRAMMER: That really hasn't come up. One of the advantages of that part of our business is that those are real-time products for the most part. The unauthorized redistribution has not, at least to date, been a real problem for us. Occasionally we see some of our image products on the covers of publications, maybe an image product from a hurricane or some other special event. We copyright all of these image products, and we believe that these copyrights are viable. Occasionally they are violated. So, it hasn't been a big loss in revenue, but we do see it once in a while. As far as I am aware, we haven't had a Freedom of Information Act occurrence with our customers.
DR. SERAFIN: I was reminded by Barbara Ryan earlier that we have been looking at four different disciplinary types of databases. Within each of these, we have heard that there are diverse, distributed data sets whose combination or integration can result in rather significant scientific advances.
She also pointed out—and I think this is important—that there are also benefits to be gained, perhaps even greater benefits, by going across those disciplines, and the four that we talked about this morning are only four; there are many others that would be valid and worthwhile to cut across. We are using these today, I think, as our examples of databases and how they might be used. By no means do we have an exhaustive list before us.
Footnotes
[Note: Though Unidata is not directly involved, its freely available software is employed by many public and private organizations to facilitate distribution and use of data from numerous sources beyond those described above.]