Appendix C
Workshop on International Benchmarking of US Research

Summary

COSEPUP held a 1-day workshop on June 16, 1999, to discuss the methodology and utility to policy-makers of its three benchmarking experiments with invited guests from federal agencies, Congress, universities, and other institutions.

During the morning session, leaders of the three benchmarking experiments (Peter Lax, mathematics; Arden Bement, materials science and engineering; and Irv Weissman, immunology) summarized the work of their panels. During the afternoon, discussants provided comments on the utility and methodology of the benchmarking experiments.

Highlights

  • Panel leaders started out as skeptics, but came to believe that the process of benchmarking is feasible, quick, and accurate.
  • Benchmarking can produce a rapid, broadly accurate "snapshot" of a field. With greater rigor and the generation of relevant local data, it can probably be applied with specificity as well, for example, to evaluate particular agency programs.
  • Through the use of a "virtual congress", it is possible to get 80% of the value in 20% of the time.
  • Benchmarking produces data, not policy decisions.
  • Benchmarking should use both qualitative tools (the "virtual congress") and quantitative tools (citation and publication analysis, prizes, and presentations).
  • Because of the statistical risks in using small samples for rapid assessments, all available tools should be used.
  • The benchmarking model might be adapted at the agency level to evaluate research programs and instruct advisory committees. Agencies would need to determine their own benchmarks, such as comparison with other US scientists or programs.
  • Two independent benchmarking experiments in mathematics—despite dissimilar panels, mandates, and leadership—produced similar results, lending credibility to the technique.
  • Experts on several panels suggested that a diminished flow of foreign-born scientists and engineers to the United States could weaken the research enterprise in coming years. This underscores the importance of drawing more American students into science.
  • In seeking to explain the overall dominance of US research, participants pointed to diversity, flexibility, research-based graduate education, a balanced research portfolio, national imperatives, and a favorable innovation climate.
  • It was suggested that an accumulation of benchmarking exercises might lead to better understanding of the factors that yield research excellence and might better educate the public about the value of research.

Report

The following excerpts and quotations are offered to summarize the issues discussed.

Was the Exercise Successful?

The benchmarking chairs were initially skeptical that the study could be conducted. For example, Arden Bement, chair of the materials science and engineering study, said:

"I started out as a skeptic, but became more of a believer that 1) it's possible to do, 2) it can be done within a short time, 3) the committee got a lot out of it. It's not perfect. I think we got 80-90% of what we looked for, within the scope we chose to probe. You can get 80% of the value in 20% of the time. After we'd finished, my thinking on it was much clearer than before."

Comments received by Dr. Bement after the release of the report were positive. Such concerns as were raised focused on sub-subfields that the report did not address, such as research areas important to industry (for example, corrosion).

The analysis also brought some surprises to light. For example, Peter Lax, chair of the mathematics panel, indicated that although he knew that many leading mathematicians were from abroad, he was surprised by the degree to which US leadership depended on non-US talent working in the United States.

Methodology

All panels used the same general methods: a combination of qualitative judgment by experts and quantitative tools (measures of publications, citations, prizes, and speakers at international meetings). Within each experiment, all assessment methods—qualitative and quantitative—gave similar results.

The use of a "virtual congress", or "reputation survey", was found to be effective for determining the relative standing of a nation's research. In immunology, for example, the panel created a virtual congress by identifying and polling international leaders in major subfields. These leaders were asked to imagine that they were about to organize an international meeting in their particular sub-subfield and to furnish a list of 5 to 20 potential speakers for that meeting. The identities of the speakers were used to create a "snapshot" of international leadership.
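
The mechanics of such a tally are simple enough to sketch in a few lines of code. The sketch below is illustrative only, with invented names and countries; it assumes each polled leader returns a list of proposed speakers tagged with the country in which each speaker works, and it computes each country's share of the distinct speakers named.

    # Tallying a "virtual congress" poll (illustrative sketch; all data invented).
    # Each polled leader supplies proposed speakers tagged with a work country.
    from collections import Counter

    poll_responses = [
        [("A. Smith", "US"), ("B. Jones", "UK"), ("C. Tanaka", "Japan")],
        [("A. Smith", "US"), ("D. Mueller", "Germany"), ("E. Chen", "US")],
    ]

    def leadership_snapshot(responses):
        """Return each country's share of the distinct speakers named."""
        # Deduplicate speakers named by more than one respondent.
        speakers = {name: country for resp in responses for name, country in resp}
        counts = Counter(speakers.values())
        total = sum(counts.values())
        return {country: n / total for country, n in counts.items()}

    print(leadership_snapshot(poll_responses))
    # -> {'US': 0.4, 'UK': 0.2, 'Japan': 0.2, 'Germany': 0.2}

Because the poll converges quickly on the same well-known names, even a first pass of this kind can approximate the final result, as the panel leaders observed.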

An advantage of the virtual congress is that results are available quickly. "The results of [a first quick] go-round were virtually the same as the final result, which was that about half of the best [immunology] scientists are in the US."

The virtual congress identified sub-subfields of great current interest but did not perform in-depth analysis. In materials research, "the utility [of benchmarking] depends on what kind of assessment you want. We were talking to experts and asking them for hot topics, not looking at whole portfolios."

Concerns About Methodology

Several participants voiced concerns about the virtual congress. One was that the sample was small and therefore in danger of being biased. In addition, there was no consensus on how many foreign members a panel should have (immunology had three, two of whom have since moved to the United States) or on what effect foreign membership (or lack of it) had on decisions. Several people suggested that half the members of each benchmarking panel should be non-American scientists or engineers.

Small panel sizes might be unavoidable if the goal is to poll the most talented leaders: "In asking our questions we quickly started hitting the same people." A general concern was that the method of selecting people can somewhat bias the outcome by promoting an accepted "party line" and by including fewer people who are inclined to follow less-popular directions. However, panel leaders said that the advantages of reaching those most knowledgeable and active in a field outweighed those drawbacks.

Several other concerns about methodology were noted:

  • Panel members might use different modes of polling, variable sample sizes, and subjective criteria in assessing leadership.
  • Citation measures might sometimes be skewed by clustering in some areas, self-citation, and cross-citation among certain authors.
  • For benchmarking reports to be of greatest use, results should be described as fully as possible. For example, if 50% of a field's leaders are in the United States, are 49% in a single other country, or do 10 countries each have 5% of the leaders?
  • It was suggested that the United States is "number one" in most fields because it spends the most money on research. Another standard of performance might be "research productivity": if the United States funds 60% of the research in a field, it should attain at least 60% of the leadership in that field, whereas another nation might fund 10% of the research in the field and attain 25% of the leadership. The second country might be considered more effective at converting funding into leadership. (This standard is illustrated in the sketch after this list.)
  • When the United States is dominant even with respect to whole regions (such as Europe), it is difficult to make country-by-country comparisons.
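
To make the "research productivity" standard concrete, the sketch below computes leadership share divided by funding share for two hypothetical countries; the figures are invented, chosen only to mirror the 60%/10% example above.

    # "Research productivity" as leadership share per unit of funding share.
    # All figures are hypothetical, mirroring the example in the text.
    shares = {
        # country: (funding_share, leadership_share)
        "Country A": (0.60, 0.60),  # funds 60% of the field, holds 60% of leadership
        "Country B": (0.10, 0.25),  # funds 10% of the field, holds 25% of leadership
    }

    for country, (funding, leadership) in shares.items():
        print(f"{country}: leadership/funding = {leadership / funding:.1f}")
    # Country A: leadership/funding = 1.0  (leadership proportional to funding)
    # Country B: leadership/funding = 2.5  (more leadership per unit of funding)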

One person suggested that the validity of benchmarking should be tested by repeating an experiment with a different panel or by using two expert panels simultaneously. Such testing, however, might be hindered by the small number of experts and might not produce new results. In fact, two independent tests were done in mathematics, and they produced essentially the same conclusions even though the two panels were quite different in composition. One panel consisted mostly of mathematicians (it included eight American mathematicians), the other mostly of non-mathematicians (it included two American mathematicians); one was asked to draw conclusions, the other was asked not to.

Journal-publication and citation analysis suffered from some shortcomings. In immunology, journal analysis was generally limited to the largest international journals, which tend to be American or English (such as Nature, Science, Cell, and Blood). In mathematics, the use of only mathematics journals excluded many fields of which mathematics is a part (geophysics, probability, and mathematical biology); for this reason, the judgment of experts familiar with sub-subfields was especially valuable.

One participant urged greater uniformity of methods so that results would be comparable and easier to use by agencies or Congressional committees. A panel chair responded that the cost of seeking uniform comparability might not add value compared with small, frequent, "snapshot" assessments. He suggested trying an experiment with more-rigorous methods to see whether they produced different results.

Social-science terms are sometimes used in the reports without sufficient rigor. For example, the concepts of "leader" and "fast follower" should be defined uniformly among reports and connected to earlier COSEPUP reports. Is it "almost as good" to be a fast follower as to be a leader? When is maintaining leadership worth the investment required?

On Being a "Fast Follower"

Panel members suggested that in some fields, being a "fast follower" can be a good strategy. It is unrealistic to think that any country can be the leader in every field or subfield of science or engineering. However, a country can support scientists and programs that are "among the leaders" in every field. The advantage of being among the leaders is that a nation can react quickly to new discoveries or fields that are suddenly hot. The main text uses the example of high-temperature superconductivity, a sub-subfield in which the United States was not the leader but was able to move rapidly because US researchers were among the leaders in related subfields or sub-subfields.

The Question of Timing

Workshop members discussed several aspects of timing. The choice of year and even of decade can alter results; had the mathematics benchmarking been done 10 years earlier, the Soviet Union would have loomed much larger.

There was also a concern about frequency. It is not clear, one participant said, whether benchmarking would lose its impact if done every 5 years, because the same leadership status and issues might be repeated. Another commented that regular evaluations would provide the opportunity to determine whether situations had improved or remained the same.

Several agency representatives raised the issue of "old data": the National Science Foundation's (NSF's) Science and Engineering Indicators come out every 2 years and require an additional 2 years for validation, creating at least a 4-year lag. Panel members responded that an assessment by a virtual congress can be done relatively quickly, using data that are current.

Quantitative and Qualitative Benchmarking

Several participants asked whether benchmarking could use more rigorous, quantitative measurements. In response, COSEPUP members said that quantitative analysis (of numbers of publications, citations, patents, and so on) is helpful in assessing some research programs or projects, especially when the goal of the research is an incremental improvement or achievement of a known goal. But expert judgment is required to analyze the relative importance of various journals, citations, and patents—the tools of quantitative analysis.

Moreover, quantitative tools offer little information about important aspects of research programs. The current judgment of practicing researchers, managers, policy experts, and users of research is needed to answer such questions as where the most promising ideas are emerging, what locations are being chosen by the best new scientific talent, and what the comparative quality of research facilities in different areas is.

General Comments on the Science and Engineering System

Diversity and flexibility are essential qualities of the American system that allow a rapid upshift from fast follower to leader. For example, NSF recognizes the value of these qualities by not penalizing researchers who divert funds to areas of new or emerging importance. One panel member noted that in the last dozen years, none of the important discoveries made in his laboratory [IW] had been anticipated in a grant proposal. He argued against hierarchies or planned strategies: one cannot know where important new areas might emerge.

One panelist said that the strength in immunology depends heavily on pluralism, decentralized funding, and entrepreneurship: "We are so entrepreneurial that everyone is their own principal investigator. There are many points of light."

In discussing potential threats to the research system, a number of participants mentioned the tendency of industry and government to emphasize applied, results-driven research at the expense of high-risk, high-return basic research. A healthy research enterprise requires a balanced portfolio of research. As one panelist said, "one can't cultivate only the fruit of the tree."

Factors That Lead to Strength in Science and Engineering Research

Human resources, breadth, and motivation were mentioned:

  • Human resources: There was concern that not enough American students are staying in mathematics and that there is little understanding of why. One participant said, "it depends on your point of view. US students might make the logical choice to go into computer science instead."
  • Breadth: A "healthy" field requires not just a few stars, but broad strength—both creators (leaders) and innovators (followers).
  • Motivation: The truly successful are those most motivated. A person needs intelligence, but what defines the real achievers is "everything else". Motivation might not be apparent at a very early age—a strong argument against preselecting early achievers and in favor of early mentoring of all students.

Benchmarking, with properly defined terms, might show the factors that have brought fields to their present positions. By capturing the key factors, the process could elucidate serious issues worthy of case studies and could lead to a better understanding of the science and engineering system.

The process might also be a useful tool to demonstrate how fields evolve, what brings success, and how funding is decided. Many people and agencies have little understanding of the scientific enterprise. An accumulation of reports produced under the Government Performance and Results Act (GPRA) could be a valuable educational and public-policy instrument.

Funding alone does not determine leadership. National imperatives and the ability to capitalize on research have also been major factors.

Foreign Scientists and Engineers

Participants observed that, on the benchmarking panels themselves, strong-minded and independent foreign members are crucial to successful evaluations.

One of the surprises to the panels was the number of foreign-born scientists and engineers in the United States. "We knew that many American mathematicians had come from abroad, but there were more than we thought." The United States attracts the best PhDs and postdoctoral students from abroad, adding to the excellence of US education. Some participants called the United States "very dependent on them, maybe overly so."

The United States is attractive to foreign scientists and engineers in part because of the flexibility of the system. "In the [former] Soviet Union or Europe, a professor gets a chair and stays there. Here they move around," increasing the diversity of the US research enterprise.

At the same time, it is hard to attract US students to enter mathematics and science, partly because of the strong economy and the low perceived economic value of advanced degrees in some fields. As foreign universities catch up, the need to attract more of the best US students will probably grow.

Agency Responses

Several representatives of federal agencies that support research gave valuable perspectives on the concept of benchmarking and on COSEPUP's experimental efforts.

Some agencies might be able to adapt the benchmarking model for their own use and then apply it with greater specificity and rigor.

Some features of the experiments might be useful to agencies for instructing their own advisory committees.

A representative of the National Institutes of Health (NIH) found the experiment in immunology "very useful; it validates the picture of US research with a large number of experts from the area. It will allow us to authenticate these results to a greater degree than we now can. It adds strength to our request for a balance of funds in NIH. We are familiar with this approach, without asking for quantitative data to back it up. It helps us in requesting additional resources when this is necessary. The data on subfields gives us guidance to go and explore those areas further. We do not find it useful to make comparisons with other countries, but we do want to know about areas that are not being developed and probably should be."

Summary of Suggestions

In summary, workshop participants suggested the following actions to improve the benchmarking process:

  • Conduct an additional set of experimental benchmarking studies of agency research programs within the context of GPRA.
  • Increase the number of non-US researchers on each panel to 50%.
  • Evaluate fields of interest to industry, as well as to the federal government.
  • Invite representatives of federal agencies and national laboratories to discuss topics for benchmarking and to participate in benchmarking itself.
  • Augment analysis of research with economic and market data.
  • Determine international leadership status for both industrial research and academic research.
  • Focus not only on the standing of individual researchers but also on the research establishment as a whole.
  • Evaluate the research community not only for leadership status, but also for its ability to be a "fast follower".
  • Assess the nation's ability to capitalize on the results of research.
  • Evaluate the research standing not only for the present, but also in relation to previous assessment(s), to gauge progress.
  • Analyze how funding is allocated among research, education, and facilities.
  • Analyze the relative roles of researchers in universities, federally funded facilities, and industry.
  • Provide the degree of detail and rigor required by policy-makers to make funding decisions based on the report.
  • Include international leadership status in the charge to existing advisory committees that review federal programs.
  • Provide clearer definition of "leadership" in a field.
  • Make methods as comparable as possible.
  • Conduct a study twice, with different committees, and compare results.
  • Undertake more careful, in-depth studies of issues identified through benchmarking.