Genome-scale datasets have been used extensively in model organisms to screen for specific candidates or to predict functions for uncharacterized genes. However, despite the extensive knowledge available in model organisms, the planning of genome-scale experiments in poorly studied species still relies on expert intuition or heuristic trials. We propose that systematic computational approaches can drive experiment planning in poorly studied species, using the data and knowledge available in closely related model organisms. In this paper, we suggest a computational strategy for recommending genome-scale experiments based on their capability to interrogate diverse biological processes and thereby enable protein function assignment. To this end, we use the data-rich functional genomics compendium of the model organism to quantify how accurately each dataset predicts each specific biological process, and how much this coverage overlaps between datasets. Our approach uses an optimized combination of these quantifications to recommend an ordered list of experiments for accurately annotating most proteins in the poorly studied related organism to most biological processes, as well as a set of experiments that target each specific biological process. The effectiveness of this experiment-planning system is demonstrated for two related yeast species: the model organism Saccharomyces cerevisiae and the comparatively poorly studied Saccharomyces bayanus. Our system recommended a set of S. bayanus experiments based on an S. cerevisiae microarray data compendium. In silico evaluations estimate that fewer than 10% of the experiments could achieve functional coverage similar to that of the whole microarray compendium. This estimate was confirmed by performing the recommended experiments in S. bayanus, significantly reducing the labor needed to characterize the poorly studied genome.
This experiment-planning framework could readily be adapted to the design of other types of large-scale experiments as well as other groups of organisms.
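The ordering idea described above (score each dataset per biological process, discount overlap, recommend an ordered list) can be illustrated with a simple greedy sketch. This is not the optimization used in the study, which is not specified here; it only assumes that each dataset has a per-process accuracy score and that the "best score so far" per process is a reasonable stand-in for overlap. The function name, dataset names, process names, and scores are all hypothetical.

```python
from typing import Dict, List


def rank_experiments(accuracy: Dict[str, Dict[str, float]]) -> List[str]:
    """Greedily order datasets by the coverage they add beyond those already chosen.

    accuracy[dataset][process] is an assumed per-process prediction score
    (higher is better). This is an illustrative heuristic, not the study's method.
    """
    processes = sorted({p for scores in accuracy.values() for p in scores})
    best = {p: 0.0 for p in processes}  # best score reached so far per process
    remaining = set(accuracy)
    order: List[str] = []

    while remaining:
        def gain(dataset: str) -> float:
            # Marginal gain: improvement over the current best, summed over processes.
            # Coverage that overlaps with already chosen datasets contributes nothing.
            return sum(
                max(accuracy[dataset].get(p, 0.0) - best[p], 0.0) for p in processes
            )

        # Iterate in sorted order so ties are broken deterministically by name.
        choice = max(sorted(remaining), key=gain)
        order.append(choice)
        remaining.remove(choice)
        for p in processes:
            best[p] = max(best[p], accuracy[choice].get(p, 0.0))

    return order


# Hypothetical scores, made up purely for illustration.
example_scores = {
    "heat_shock": {"stress response": 0.8, "protein folding": 0.7},
    "cell_cycle": {"cell cycle": 0.9, "DNA replication": 0.7},
    "oxidative_stress": {"stress response": 0.75, "protein folding": 0.2},
}
print(rank_experiments(example_scores))
```

In this toy input, the oxidative stress dataset is ranked last because the processes it covers are already covered at least as well by the heat shock dataset, which is the kind of redundancy the overlap term is meant to penalize.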
Overall design
This dataset contains 53 experiments as follows: (experiment: number of datasets (number of arrays))
growth at different temperatures, strain backgrounds, cross progeny, Tn7 insertions, nutrient limited chemostat growth, mating type and ploidy, 555.11 and 670.20 knockout tetrads
Almost all samples were hybridized versus a common reference prepared from a mixture of RNA from MATa, MATα, and MATa/α cells sampled in both exponential and stationary phase. Additionally, RNA from stress conditions was included: hydrogen peroxide treatment sampled at 10, 30 and 45 minutes, and heat shock from 25 to 37 degrees sampled at 10 and 30 minutes.
Samples from the following datasets were not hybridized versus the common reference (the reference used is given in parentheses and in the array annotations): cell cycle (asynchronous culture), constant temperatures (log phase culture), mating type and ploidy (log phase culture), diauxic shift (log phase culture), and strain backgrounds (log phase culture).