National Academies of Sciences, Engineering, and Medicine; Health and Medicine Division; Board on Health Sciences Policy; Forum on Regenerative Medicine; Beachy SH, Nicholson A, Teferra L, et al., editors. Applying Systems Thinking to Regenerative Medicine: Proceedings of a Workshop. Washington (DC): National Academies Press (US); 2021 Mar 26.
Important Points Highlighted by Individual Speakers
- New single-cell RNA sequencing technologies have yielded powerful developments in the capabilities to characterize cell types, link newly uncovered axes of cellular identity to observed phenotypes, and model cell-state transitions using new algorithms. (Fertig)
- Advances in the use of fully spatially resolved single-cell data (i.e., spatial transcriptomics) can now be used to infer cellular interactions within and across the tumor and immune systems. (Fertig)
- Challenges related to interpreting big data, machine learning, and mathematical models and translating them into predictions and mechanisms can be addressed by modeling dynamic data to identify a latent space. This approach involves reducing the data's dimensionality in a supervised way that accounts for biological knowledge, modeling the dynamics in low-dimension space, and then re-expanding the models into higher-dimensional space to reconstruct biological properties. (Francois)
- High-throughput methods using high-quality data can precisely quantify and analyze cytokine response through a generative model of cytokine dynamics. Effective two-dimensional dynamics controlled by immune velocity have been found to parsimoniously explain cytokine behavior. Immune velocity is mostly controlled by antigen strength, and response modulation is related to the characteristics and number of antigen-presenting cells. Work is under way to quantitatively assess immune response in immunotherapy. (Francois)
- The application of systems biology and machine learning can help elucidate the relations and dependencies among the multitude of influencing factors along the biopharmaceutical drug development pipeline. This can contribute to producing drugs with the best quality attributes to target diseases as precisely as possible as well as to optimizing the volume and efficiency of drug production while still maintaining efficacy and safety. (Richelle)
- Knowledge gaps and technical limitations that currently restrict the ability to routinely integrate systems biology approaches into the biopharmaceutical product pipeline include (1) an inadequate ability to conduct bioprocess monitoring in real time, (2) the complexity of metabolic networks, and (3) challenges involved in modeling with hybrid approaches. (Richelle)
The fourth session of the workshop focused on challenges and opportunities associated with systems-level analysis and modeling. Malcolm Moos of the Food and Drug Administration (FDA) moderated the session, which featured presentations on the development of algorithms for single-cell genomics, modeling dynamic data to identify a latent space, and adapting metabolic modeling tools in biopharmaceutical drug development. This session's objectives were to discuss the current state of the art of systems thinking approaches and talk about how these approaches are being used to inform the identification of important variables to measure and to illuminate current gaps in knowledge and areas for further study.
DEVELOPING ALGORITHMS FOR SINGLE-CELL GENOMICS
Elana Fertig, an associate professor of oncology, biomedical engineering, and applied mathematics and statistics and the associate director of the Convergence Institute at Johns Hopkins University, explored the development of algorithms for single-cell genomics by discussing single-cell technologies, matrix factorization, and other computational techniques. She acknowledged the importance of technological innovation but emphasized the importance of developing computational methods for understanding single-cell genomics as well. She also described some of the needs and challenges involved in developing those computational methods.
Overview of New Single-Cell RNA Technologies
Fertig began with an overview of single-cell technologies by drawing an analogy to different preparations of fruit, comparing the different technologies to smoothies, individual pieces of fruit, and a fruit tart. She compared the previous generation of transcriptional profiling technologies (i.e., bulk RNA sequencing, or RNA-seq) to a fruit smoothie, in that each component of the system is included, but the cellular and molecular components are all blended together to investigate the resulting mixture. In contrast, single-cell RNA-seq can be compared to pieces of fruit, in that each piece of fruit is evaluated individually to observe its molecular state. She likened future single-cell technologies—specifically, spatial transcriptomics—to an elaborate fruit tart. These newer technologies have the potential to reveal not only the presence of each cell, but also the spatial alignment of those cells in a tissue, allowing a system to be characterized in its native context in order to understand its underlying processes. These single-cell RNA technologies are being used to better understand particular cell types involved in these systems, Fertig said. This was previously done using older technologies, such as flow cytometry, but single-cell RNA technologies provide higher-dimensional resolution than their predecessors. She discussed some of these new developments in single-cell RNA technologies related to cell types, axes of cellular identity, and cellular state transitions (Wagner et al., 2016).
Characterizing Cell Types and Axes of Cellular Identity
The single-cell development of the retina across developmental time-points and cell types has been characterized (Clark et al., 2019; Stein-O'Brien et al., 2019), and Fertig showed a dataset where dots are plotted three-dimensionally to represent individual cells in a retina. The dataset included more than 100,000 cells coded to indicate age and cell type, thus illustrating how the dataset transitions over time. This presentation allows the dataset to be viewed in two ways: either in terms of age or in terms of cell type. Annotation based on these labels limits characterization to a priori knowledge of the system, Fertig said, but the dataset has the power to characterize the entire transcriptional profile of the system beyond just cell types. For example, within each cell type, the single-cell approach can provide information about its phenotype, temporal progression, developmental trajectory, progress along the cell cycle, and spatial position (Wagner et al., 2016).
Encoding this additional information raises the question of how to explore these data in a deeper way in order to determine what those phenotypes are, rather than merely clustering or annotating the cells, Fertig said. This question has been addressed in the field of mathematics using matrix factorization (Stein-O'Brien et al., 2018). Using Coordinated Gene Activity in Pattern Sets (CoGAPS), a Bayesian matrix factorization method, a large dataset containing thousands of genes can be analyzed in terms of amplitudes (i.e., gene weights for each biological process) and patterns (i.e., biological processes in each sample) (Fertig et al., 2010). The matrix factorization algorithm finds patterns that define the biological activity in a set of samples (e.g., time points) and the amplitude at which each gene contributes to that biological process. Matrix factorization can be used to learn about the dimensions within a dataset. By linking the results of matrix factorization back to genes or cells, the learned dimensions can be connected to observed phenotypes, which are fundamental to the system.
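To make the decomposition concrete, the following is a minimal sketch of an amplitude-and-pattern factorization. It uses scikit-learn's generic non-negative matrix factorization on simulated counts rather than the Bayesian CoGAPS implementation Fertig described, and the matrix sizes and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Toy expression matrix: genes x samples (e.g., time points), non-negative.
n_genes, n_samples, n_patterns = 500, 40, 4
expression = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_samples))

# Factorize expression ~ amplitudes @ patterns, where
#   amplitudes (genes x k):   weight of each gene in each biological process
#   patterns   (k x samples): activity of each process in each sample
model = NMF(n_components=n_patterns, init="nndsvda", max_iter=500, random_state=0)
amplitudes = model.fit_transform(expression)  # genes x k
patterns = model.components_                  # k x samples

# The top-weighted genes of a pattern link the learned dimension back to
# interpretable biology (e.g., marker genes of a cell state).
top_genes = np.argsort(amplitudes[:, 0])[::-1][:10]
print("Top gene indices for pattern 0:", top_genes)
print("Reconstruction error:", round(model.reconstruction_err_, 3))
```

In practice, the amplitude matrix would be compared against known marker genes or pathway annotations to attach biological meaning to each learned pattern.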
Matrix factorization can be applied to uncover the features of cellular identity using the single-cell reference dataset of retinal development, Fertig said (Clark et al., 2019; Stein-O'Brien et al., 2019). She presented a three-dimensional plot showing how some features that undergo the CoGAPS matrix factorization approach reflect individual retina cell types, such as the neurogenic cell type. In addition to learning about cell types—which was already possible with clustering methods—this approach can be used to learn about the fates that reflect changes in cell states. In these three-dimensional Uniform Manifold Approximation and Projection (UMAP) plots with weights of the patterns learned across cells through the matrix factorization, a transition region indicates a space where some cells occur in a pattern that is enriched for cell cycle genes. This pattern is consistent with the desynchronization of the cell cycle, as cells transition from stem cells to developing their final states. By looking across the different axes of this dataset, it is possible to tease out all the different cell types as well as the different transitory states related to age within one algorithm.
Modeling Cell-State Transitions
The inference of cellular state transitions offers a global perspective for observing the system, but the other axes of variation provide more insight into what is occurring within the system, Fertig said. Thus, inferring cell-state transitions and interactions requires new algorithms. Identifying a cell type alone does not account for the additional molecular heterogeneity within that cell type; some of that variation may be associated with cell-state trajectories, such as in stem cells. In addition, there is the challenge of understanding how these interaction networks function and how the interactions between cells and molecules drive phenotypes.
Matrix factorization is not the only approach for modeling cell-state transitions, Fertig said. Indeed, to look at dynamic cell-state transitions, other types of computational methods are more apt. In recent years, two dominant methods have emerged. The first is built on the notion of RNA velocity (Melsted et al., 2019), which is based on the idea that the relative maturity of a cell can be calculated by observing the ratio of spliced to unspliced gene products; that ratio can be used to estimate how each gene's expression is changing and thus to infer the cell's trajectory over time. The other approach relates to trajectory inference, or pseudotime, which involves observing cellular maturity in certain clusters—as well as cells' distances away from each cluster—to determine how the cells change over time and how they are ordered (Trapnell et al., 2014). Each of these metrics can be used to determine a cell's state transitions. These approaches, which use single-cell data, will help move the field toward a systems approach that is data-driven but also integrates information about time from dynamic models, Fertig said.
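As a rough illustration of the RNA velocity idea, the sketch below applies the simple steady-state intuition (unspliced transcript levels above or below their equilibrium ratio with spliced levels suggest genes being switched on or off) to simulated counts. This is a toy rendering for exposition, not the published velocity pipeline, and the simulated kinetics are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: spliced (s) and unspliced (u) abundances, cells x genes.
n_cells, n_genes = 200, 50
s = rng.poisson(lam=5.0, size=(n_cells, n_genes)).astype(float)
gamma_true = rng.uniform(0.2, 1.0, size=n_genes)
u = np.clip(gamma_true * s + rng.normal(scale=0.5, size=s.shape), 0.0, None)

# Steady-state assumption: at equilibrium u = gamma * s, so a per-gene gamma
# can be estimated by regressing u on s without an intercept.
gamma_hat = (u * s).sum(axis=0) / (s * s).sum(axis=0)

# Velocity is positive where unspliced expression exceeds its steady-state
# expectation (gene turning on) and negative where it falls below (turning off).
velocity = u - gamma_hat * s
print("Estimated gamma for first 5 genes:", gamma_hat[:5].round(2))
print("Mean velocity of first 5 cells:", velocity.mean(axis=1)[:5].round(3))
```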
Molecular Heterogeneity with Distinct Cellular Subtypes
Even within a single cell type, temporal changes can be observed, suggesting that there may be additional molecular heterogeneity that can occur within distinct cellular subtypes, Fertig said. Not all cells will use the same pathways at the same level. Moreover, it is not the case that a pathway merely turns on or off in a binary fashion; pathways could be more variable in one state than another. The Expression Variation Analysis method can be used to quantify the degree of difference in variation among transcriptional signatures between cells within one state relative to cells from another (Davis-Marcisak et al., 2019). For example, the variation in expression of cell cycle genes has been found to increase over developmental time in the landmark single-cell retinal development data, arising from the greater diversity of cell cycle states within the same mature system, Fertig said, rather than from the cell cycle simply turning off when mature cell states are attained (Clark et al., 2019).
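The sketch below conveys the underlying idea of comparing transcriptional variability between two cell states with a generic permutation test on simulated data; it is not the published Expression Variation Analysis implementation, and the variability score is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy log-expression of a cell cycle gene set (cells x genes) in two states:
# mature cells are simulated with more variable expression, as in the talk.
early = rng.normal(loc=2.0, scale=0.5, size=(150, 20))
mature = rng.normal(loc=2.0, scale=1.2, size=(150, 20))

def dispersion(x):
    """Mean per-gene variance of the gene set: a simple variability score."""
    return x.var(axis=0).mean()

observed = dispersion(mature) - dispersion(early)

# Permutation test: shuffle state labels to build a null distribution.
pooled = np.vstack([early, mature])
n_early = early.shape[0]
null = []
for _ in range(1000):
    idx = rng.permutation(pooled.shape[0])
    null.append(dispersion(pooled[idx[n_early:]]) - dispersion(pooled[idx[:n_early]]))
p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (len(null) + 1)
print(f"Dispersion difference: {observed:.3f}, p = {p_value:.3f}")
```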
Applying Regulatory Algorithms to Uncover Drivers of Cell-State Transitions
Interaction networks can be informed by the genes involved in cell-state transitions or by prior knowledge of biological regulatory networks, Fertig said. These networks can be used to investigate the drivers of cell-state transitions and how a system can become dysregulated by helping to reveal the molecular networks underlying these processes. Two dominant approaches are emerging in this field of inquiry. One approach is based on the idea that knowledge about the trajectories (i.e., how the system is changing over time) can be used to understand the system's causal network in terms of which genes are coming up at specific points over time (Deshpande et al., 2019). The second approach focuses on linking prior molecular knowledge (e.g., gene regulation for ligands and receptors triggering a cascade to transcription factors) with data to understand how those molecular networks are being turned on between cell types (Cherry et al., 2020). These are promising developments because older bulk RNA-seq approaches for understanding these regulatory networks were confounded by intercellular interactions. These new approaches offer a way to discriminate between intercellular and intracellular interactions, Fertig added.
Inferring Cellular Interactions Using Spatial Transcriptomics
Within this broader network-focused perspective, Fertig described a transition in the field from using single-cell data to using fully spatially resolved single-cell data. She showed breast tumor data from a dataset created in collaboration with 10X Genomics using its Visium platform. By annotating the data into separate regions, each region can be spatially resolved into a full transcriptional profile, and matrix factorization can be used to determine the axes of cellular identity in this dataset. This approach makes it possible to differentiate regions in a de novo way in order to identify the dominant signaling processes that distinguish various tumor regions. The same process can be applied within the immune system. Tumor patterns can be linked back to distinct molecular pathways to begin revealing the spatial interactions among tumor and immune systems. In conjunction with network inference algorithms, this approach can provide information about space, time, and regulatory networks. This contributes to the understanding of systems' potential for interaction and of the physical interactions within systems. In closing, Fertig said that single-cell algorithm development is a broad field with much promise. However, new approaches for matrix factorization and latent space analysis are still needed to answer the open questions in the field, such as characterizing cellular heterogeneity, conducting trajectory and velocity analyses, and inferring regulatory networks.
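As a loose illustration of how spatially resolved pattern weights might suggest interactions, the sketch below builds a fixed-radius neighbor graph over simulated spot coordinates and scores how strongly pairs of patterns co-occur across neighboring spots. The radius, the simulated weights, and the scoring rule are all assumptions for exposition, not the analysis Fertig presented.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)

# Toy Visium-like data: spot coordinates plus per-spot pattern weights
# (e.g., the output of a factorization like the one sketched earlier).
n_spots, n_patterns = 300, 4
coords = rng.uniform(0, 100, size=(n_spots, 2))
weights = rng.dirichlet(np.ones(n_patterns), size=n_spots)  # spots x patterns

# Pairs of spots within a fixed spatial radius.
tree = cKDTree(coords)
pairs = np.array(list(tree.query_pairs(r=8.0)))

# Average pattern co-occurrence across neighboring spots: high off-diagonal
# entries would suggest two programs (e.g., a tumor pattern and an immune
# pattern) tend to abut in space.
cooccur = np.zeros((n_patterns, n_patterns))
for i, j in pairs:
    cooccur += np.outer(weights[i], weights[j]) + np.outer(weights[j], weights[i])
cooccur /= 2 * len(pairs)
print(np.round(cooccur, 4))
```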
MODELING DYNAMIC DATA TO IDENTIFY A LATENT SPACE
Paul Francois, an associate professor in the Department of Physics at McGill University, discussed how dynamic data can be modeled to identify a reduced (i.e., latent) variable space.
Fundamental Challenges
Francois framed the challenges involved in this type of modeling by describing an ideal scenario. Ideally, a high-throughput experimental method could be used to generate data such as mapped representations or complex interaction networks. Then machine learning and mathematical models could be used to generate predictions and reveal the mechanisms or principles of the system's organization. In reality, however, the challenge lies in the translation of big data, machine learning, and mathematics into predictions and mechanisms. Francois described what he calls the “connectionist nightmare,” which emerges, for example, when studying protein interaction networks for T cells (Altan-Bonnet and Germain, 2005; Lipniacki et al., 2008). In this case the mathematical modeling of the systems being studied generates numerous equations, making it difficult to understand what is occurring within the system. A similar effect occurs when using deep neural networks to study systems because it is difficult to understand how deep learning and machine learning networks work (Nielsen, 2015). Finally, this effect is related to adversarial examples, in which machine learning is fragile or highly vulnerable to the introduction of noise (Goodfellow et al., 2013). In such examples the introduction of a slight amount of perturbation to a machine learning process may substantially affect the accuracy of the classification algorithm, Francois said.
Another fundamental challenge lies in how data are plotted and interpreted, Francois said. To demonstrate this point, he invoked the heliocentric and geocentric models of the solar system. Although the heliocentric model is simpler to describe, the geocentric model of the solar system can be plotted using the same data. Similarly, when using an unsupervised method such as UMAP, the UMAP two-dimensional and three-dimensional representations of datasets are different, which raises the question about which representation to choose.1 The same considerations apply to representations of datasets over time. When researchers are working with complex dynamical systems, they must choose how best to represent movement over time in two dimensions. Francois emphasized the importance of choosing the best way to represent data and the impact of these choices on how researchers think about the problems they study.
General Modeling Approach
Francois described the general approach used to model dynamic data by identifying a latent space. The first step is to take the data and reduce their dimensionality in a supervised way—one that accounts for biological knowledge—using tailored algorithms, such as an algorithm that reduces large networks (Proulx-Giraldeau et al., 2017). Evolutionary algorithms and auto-encoders can also be applied to understand biological data and identify the most relevant features within datasets (Beaupeux and François, 2016; Henry et al., 2018). Once the dimensionality of the complex dynamical system has been reduced, its dynamics can be modeled in low-dimension space. Low-dimension models are easier to study, and one can derive more general results there (e.g., theorems can be more easily proven). The models can then be expanded back into higher-dimensional space to validate them and reconstruct biological properties.
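A minimal sketch of this reduce, model, and re-expand loop follows. Unsupervised principal component analysis stands in for the supervised, biology-informed reduction Francois described, and a fitted linear map stands in for the latent dynamics; both substitutions are simplifying assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Toy high-dimensional time series driven by two latent oscillating variables.
T, n_obs = 200, 30
t = np.linspace(0, 8 * np.pi, T)
latent_true = np.column_stack([np.sin(t), np.cos(t)])
mixing = rng.normal(size=(2, n_obs))
data = latent_true @ mixing + rng.normal(scale=0.05, size=(T, n_obs))

# Step 1: reduce dimensionality.
pca = PCA(n_components=2)
z = pca.fit_transform(data)

# Step 2: model the latent dynamics as a linear map, z[t+1] = z[t] @ A.
A, *_ = np.linalg.lstsq(z[:-1], z[1:], rcond=None)

# Step 3: re-expand a simulated latent trajectory into observation space
# to check that the low-dimensional model reconstructs the data.
z_sim = [z[0]]
for _ in range(T - 1):
    z_sim.append(z_sim[-1] @ A)
reconstruction = pca.inverse_transform(np.array(z_sim))
print("Mean squared reconstruction error:",
      round(float(np.mean((reconstruction - data) ** 2)), 4))
```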
Generative Model of Cytokine Dynamics
This general approach has been applied to develop a generative model of cytokine dynamics, Francois said (Altan-Bonnet and Mukherjee, 2019). T cells interact in the immune system and produce cytokines, which determine how the cell reacts to antigens. From that point there are numerous possibilities with various outcomes, Francois said. For example, varying the antigen concentration or the number of T cells changes the production of IL-2 (a cytokine), which can rise or fall depending on the situation. Many other dimensions or parameters involved in this system can be manipulated to investigate cytokine dynamics. Francois highlighted the importance of ligand strength as a biochemical parameter within cytokine dynamics: because the immune response depends on ligand strength, it has direct implications for immunotherapy.
Francois and his colleagues developed a stepwise process with multiple readouts that uses a robot to study immune response in vitro at multiple time points. The process begins with harvesting primary immune cells, after which mixtures of T cells and antigen-presenting cells (APCs) are prepared. Next the robot collects time series data, which are processed along with data collected via flow cytometry. The pipeline currently follows 7 cytokines and 12 markers and takes data at 12 time points, with 50,000 cells per condition. The result is a reliable and precisely generated map of the ways in which T cells react in the test tube in the presence of different antigens. This process generates highly multidimensional data, which are modeled using new techniques developed by Francois and colleagues, including evolutionary algorithms.
Reducing Dimensionality in a Supervised Way
Francois also described the process of reducing dimensionality in a supervised way that accounts for biological knowledge. The first step, he said, is finding the right variables. An evolutionary algorithm was used to find the variables and suggested the time integral of the natural logs of the cytokine concentrations. Indeed, analyzing three-dimensional plots of data from three cytokines reveals that the integral-of-log representation visually appears to be the best way to disentangle and represent the data. This choice about how best to represent data is analogous to the choice about whether to represent planetary motion data using a heliocentric or geocentric model, he said. In this integral-of-log space, the various antigens being studied follow consistent trajectories. Next, classical machine learning methods can be used to project the trajectories of the resulting plot and classify antigens based on ligand strength.
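The sketch below conveys the flavor of this feature construction on simulated cytokine time series: each trajectory is summarized by the time integral of its log concentration, and a standard classifier then separates strong from weak ligands. The simulated kinetics and the choice of logistic regression are assumptions for exposition, not the group's actual pipeline.

```python
import numpy as np
from scipy.integrate import trapezoid
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

# Toy cytokine time series: strong antigens drive faster-rising trajectories.
n_cond, n_t = 60, 12
time = np.linspace(0.5, 72, n_t)                 # hours
strong = rng.integers(0, 2, size=n_cond)         # label: strong vs. weak ligand
rates = np.where(strong[:, None], 0.08, 0.03) + rng.normal(0, 0.005, (n_cond, 1))
cytokines = [1.0 + 50 * (1 - np.exp(-rates * time)) + rng.gamma(1, 0.2, (n_cond, n_t))
             for _ in range(3)]                  # three cytokines per condition

# One feature per cytokine: the integral over time of the log concentration,
# the representation suggested by the evolutionary algorithm.
features = np.column_stack([trapezoid(np.log(c), time, axis=1) for c in cytokines])

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, features, strong, cv=5)
print("Ligand-strength classification accuracy per fold:", scores.round(2))
```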
Modeling Dynamics in Low-Dimension (Latent) Space
This process of reducing dimensionality in a supervised way can be used to model cytokine dynamics in low-dimension space, Francois said. The trajectory can be modeled in two dimensions by using physics to model the dynamics. In the reduced (or "latent") space, ballistic physics equations appear as a convenient way to parameterize the cytokine curves and describe the trajectories using a few parameters. All of these parameters were found to correlate with one another, meaning that the complicated cytokine trajectories can be modeled with a single parameter, essentially the initial slope. This parameter is called immune velocity because it is connected to the initial angle of the trajectory and the way it will decrease.
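A small sketch of the idea follows: a latent-space trajectory is fit with a projectile-like curve, and the fitted initial slope plays the role of an immune-velocity-like parameter. The functional form and numbers are illustrative assumptions, not the published model.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(6)

def ballistic(t, v0, g):
    """Projectile-like curve: rises with initial slope v0, then decays."""
    return v0 * t - 0.5 * g * t ** 2

# Toy latent-space trajectory (e.g., an integral-of-log cytokine coordinate).
t = np.linspace(0, 10, 40)
y = ballistic(t, v0=2.5, g=0.4) + rng.normal(scale=0.1, size=t.size)

# Fit the two ballistic parameters; the initial slope summarizes the curve.
(v0_hat, g_hat), _ = curve_fit(ballistic, t, y, p0=[1.0, 0.1])
print(f"Immune-velocity-like initial slope: {v0_hat:.2f} (true 2.5)")
print(f"Decay parameter: {g_hat:.2f} (true 0.4)")
```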
Re-Expanding the Model to Explain Biology
Once the high-dimensional system has been reduced and modeled in low-dimensional space, it is possible to begin testing hypotheses, Francois said. For instance, the data that were not used to train the model can be fed into the parameters of the model to test whether the model works. In the case of immune velocity, probing the correlation between time and initial immune velocity reveals an appropriate structure in the data. New antigens are entered into the system and layered intermittently between other antigens in order to predict the trends of those new antigens. Furthermore, the complete mathematical model of the cytokine response generated upon expanding back to the biological data can be applied back to understand a specific aspect of the original data, such as IL-2, in order to connect immune velocity to parameters of the curve (e.g., cutoff time). This type of process led to the discovery of the innate versus adaptive cytokine parameter, he added. An entire ensemble of connections between cytokines was discovered using this dimensional reduction process by employing the simple parameter of immune velocity.
Major Findings and Ways Forward in Applying the Model
In conclusion, Francois reiterated that high-throughput methods using high-quality data can be used to precisely quantify and analyze cytokine response. The challenge is determining the best means to render the data in two dimensions using evolutionary algorithms in a supervised manner to refine the data and determine the immune velocity. Effective two-dimensional dynamics controlled by immune velocity have been found to parsimoniously explain all cytokine behavior, but immune velocity is mostly controlled by antigen strength, and the modulation of the response is related to the nature of APCs and the number of cells. Francois noted that there is work under way to apply this approach to assessing immune response quantitatively in immunotherapy.
ADOPTING METABOLIC MODELING TOOLS IN BIOPHARMACEUTICAL DRUG DEVELOPMENT
Anne Richelle, a senior specialist on metabolic modeling at GlaxoSmithKline, explored the adoption of metabolic modeling tools in the biopharmaceutical industry. She discussed how systems thinking can be applied to strengthen the global drug production pipeline and described challenges related to implementing systems thinking in the biopharmaceutical industry.
Timeline of Drug Development
Richelle began by describing the typical timeline of research activities in drug development. The process begins with studying a disease, with a focus on drug discovery and target identification. Emergent technologies are used to identify a host organism to be engineered. The aim of this step is to identify or engineer a host cell that can produce a drug whose quality attributes enable the disease to be targeted as precisely as possible. During the process development phase, host organisms are cultivated in controlled conditions to create an optimal environment for producing the drug. Finally, this process is scaled up during the manufacturing phase in order to produce as much of the drug as possible—as quickly as possible—while still maintaining efficacy and safety.
This is a simplified timeline of the drug development pipeline, Richelle noted; in reality, there are many additional barriers and subprocesses that greatly influence analyses and development processes at every step of the way. For instance, a patient's genetics, sex, and age can influence the expression of disease and the molecular interactions identified in the prospective drug. Moreover, host organisms have their own genetic attributes that can influence protein or molecule attributes. The proper interaction between the target and the drug depends on these conformational attributes and can strongly influence the drug's efficacy. Environmental factors can also affect the development process. If the environment is not optimal, it can influence the genetic expression of the host organism, which in turn influences the protein and molecule quality attributes that determine the drug's therapeutic ability to treat the target disease.
Large amounts of data are collected during the development process to monitor and manipulate these types of influencing factors, Richelle said. In her opinion, the biopharma sector as it currently stands is far from being “big data.” Instead, she described the data generated in the biopharma field as “low and expensive” and often not easily accessible in machine-readable format2 (Richelle and von Stosch, 2020). Furthermore, the lack of standardization in existing databases (e.g., varying nomenclatures, heterogeneity of experimental and analysis methods resulting in many hidden variables) makes them difficult to exploit without extensive processing. With the digital transformation occurring in this sector, biopharmaceutical researchers are now faced with finding ways to integrate these diverse data sources and extract meaningful information. These challenges might take a decade to resolve, she added, although the emergence of new technologies in this field may shorten that timeline.
Machine Learning as a Driver of Discovery
Machine learning is a powerful tool that helps researchers in all fields extract information from data, Richelle said. Some sectors are positioned for greater gains from the use of big data, while other fields may not see the same benefits. The benefits of machine learning depend on the relationship between the amount of data available and the impact of exploiting those data (Manyika et al., 2011). For instance, Richelle said, machine learning has been applied successfully in the field of personalized marketing, in which the potential impact of exploiting the large volume of available data is relatively high. In the case of using machine learning to optimize bioprocess development, the potential impact of exploiting data is relatively high, but the amount of data available is relatively low. Still, even though the volume of available data is relatively low, machine learning will likely have a substantial impact on the biopharmaceutical pipeline, Richelle predicted. But large datasets “cannot speak for themselves,” especially when those datasets contain randomness that may result in spurious correlations due to non-causal coincidences, hidden factors, or the nature of big randomness (Calude and Longo, 2017). Furthermore, many of the algorithms used in machine learning are difficult to carefully scrutinize during peer review, for example. According to research by Patrick Riley, “many of the algorithms [used in machine learning] are so complicated that it is impossible to inspect all parameters or to reason about how exactly the inputs have been manipulated” (Riley, 2019, p. 27). Richelle added that typical machine learning processes are like black boxes: although they may be instrumental in developing predictive models, they do not provide explanations for their results. To address this challenge, she suggested that the body of empirical knowledge that has been developed in the field of biology should be incorporated into efforts to apply machine learning to drug development.
Applying Systems Biology to Drug Development
The factors that influence the drug development process are related and interdependent, Richelle said. These relations and dependencies—including the feedback effects among these factors—need to be considered in applying the systems thinking approach, which observes the network of interconnected influencing factors in order to alter that network. Altering such networks is generally done to achieve three outcomes: (1) producing as much of the drug as possible, (2) producing the drug as quickly as possible, and (3) producing a drug that all data suggest is the best product to cure the targeted disease. Systems biology typically uses a network-based approach to organize large datasets and glean insights about complex biological systems, Richelle said. Coherently organizing large datasets into biological networks can provide non-intuitive insights on biological systems that in vivo experiments alone cannot provide. Network-based approaches also offer a platform for integrating and interpreting omics data to explore links between genotype and phenotype. For instance, this approach has been used to map metabolic networks and integrate data sources.
The systems biology approach can reveal more than just a network of reactions; it can generate an interconnected map of cellular functions. This approach has been used to develop algorithms that recapitulate the metabolism of specific cell and tissue types, offering useful insights into metabolic activity under these conditions, Richelle said (Opdam et al., 2017). These systems tools have proven to be invaluable at the level of preclinical research, she said. For example, in designing new drugs, the tools can be used to inform target selection and to make it possible to engineer cells to rewire their metabolism toward the production of a product of interest. While the systems thinking approach has brought much value to the study and manipulation of biological networks, it also has potential to be applied to many other influence factors across the drug development pipeline, Richelle said. The tools of systems thinking can be used for process design, monitoring, and control. They can also be applied to lower experimental effort, increase process robustness, and facilitate the implementation of regulatory requirements, such as quality-by-design and process analytical tools (Richelle et al., 2020).
Knowledge Gaps and Technical Limitations
Metabolic modeling tools can be used in drug development to expedite the process and integrate existing knowledge regarding the target disease, the protein, or the molecule being produced and the host organism, Richelle said. However, she highlighted three areas in which gaps in knowledge and technical limitations currently restrict the routine integration of systems biology tools into the biopharmaceutical product pipeline: (1) real-time bioprocess monitoring, (2) the complexity of metabolic networks, and (3) modeling with hybrid approaches (Richelle et al., 2020).
Real-Time Bioprocess Monitoring
Real-time monitoring technologies are a critical aspect of automation, Richelle said. Current data collection methods are unable to acquire real-time in situ measurements of metabolites and cell concentrations, which limits the ability to take a “live snapshot” of cell metabolism. Although there is increasing interest in spectroscopic methods (e.g., near-infrared spectroscopy, Fourier-transform infrared, mid-infrared, Raman, fluorescence) that can capture a “molecular fingerprint” of samples, the current capacity to systematically extract accurate quantitative data is limited to a handful of metabolites. She noted, however, that advances in the chemometric modeling field will contribute to developing methods to effectively extract maximum information from these types of spectra. Furthermore, advances in online single-cell probing are ushering in a new generation of high-throughput omics technologies. Richelle speculated that near-real-time measurements of cell transcriptomes and proteomes may be available relatively soon.
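As an illustration of the chemometric workflow Richelle alluded to, the sketch below fits a partial least squares model that maps simulated Raman-like spectra to metabolite concentrations. The spectra, peak positions, and metabolite names are simulated placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Toy spectra: each is a sum of two Gaussian peaks whose heights track two
# metabolite concentrations (say, glucose and lactate), plus noise.
n_samples, n_wavenumbers = 120, 400
grid = np.linspace(0, 1, n_wavenumbers)
peaks = np.exp(-((grid[None, :] - np.array([[0.3], [0.7]])) ** 2) / 0.002)
conc = rng.uniform(0.5, 5.0, size=(n_samples, 2))
spectra = conc @ peaks + rng.normal(scale=0.05, size=(n_samples, n_wavenumbers))

X_tr, X_te, y_tr, y_te = train_test_split(spectra, conc, random_state=0)

# Partial least squares, a chemometric workhorse, maps spectra to concentrations.
pls = PLSRegression(n_components=4)
pls.fit(X_tr, y_tr)
print("Held-out R^2:", round(pls.score(X_te, y_te), 3))
```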
Complexity of Metabolic Networks
The complexity of metabolic networks limits the ability to make predictions, Richelle said. Furthermore, the complexity of large networks hinders their utility in practical applications. For example, solving large metabolic networks requires large quantities of data; otherwise, the systems are underdetermined, leading to a multitude of alternative yet equivalent solutions to a given problem. Therefore, these types of complex networks are not often used to develop feedback control and optimize processes. Various approaches have been proposed for tailoring metabolic networks based on a priori knowledge or available experimental data. However, it is difficult to define the data to be used and determine how to overlay these data on the network. This problem is highlighted by the open question in biology about the links between genes, proteins, and metabolites and how to define the phenotype that should be “protected” during the reduction. Due to the lack of a quantitative description of gene–protein–reaction rules, strong assumptions are needed to link gene expression and metabolic reaction activity. Moreover, network-tailoring approaches typically do not completely solve the problem of system underdetermination; thus, an adequate strategy to solve the system will always be required to achieve an instantaneous picture of the flux distributions in the cell. Ultimately, a shortage of mathematicians working on these complex problems is also a limiting factor, Richelle said.
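The underdetermination Richelle described can be made concrete with a toy flux balance calculation: at steady state, even a small network admits infinitely many flux distributions, and an optimization objective is what selects one. The network below is invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy stoichiometric matrix, metabolites (rows) x reactions (columns):
# uptake -> A, A -> B, A -> C, B -> biomass, C -> biomass, biomass export.
S = np.array([
    [ 1, -1, -1,  0,  0,  0],   # A
    [ 0,  1,  0, -1,  0,  0],   # B
    [ 0,  0,  1,  0, -1,  0],   # C
    [ 0,  0,  0,  1,  1, -1],   # biomass
])
bounds = [(0, 10)] + [(0, None)] * 5   # uptake flux capped at 10 units

# The steady-state constraint S v = 0 leaves the system underdetermined:
# flux can split arbitrarily between the A->B and A->C branches. Maximizing
# biomass export (reaction index 5) picks out one flux distribution.
c = np.zeros(6)
c[5] = -1.0   # linprog minimizes, so negate to maximize
res = linprog(c, A_eq=S, b_eq=np.zeros(4), bounds=bounds)
print("Optimal biomass flux:", round(res.x[5], 2))
print("Full flux distribution:", res.x.round(2))
```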
Modeling with Hybrid Approaches
Further challenges relate to modeling with hybrid approaches, Richelle said. Interest is increasing in the use of machine learning and artificial intelligence for bioprocessing and engineering, such as technologies that use digital imaging for automated counting of cell colonies grown on petri dishes (Ferrari et al., 2017). Using machine learning and artificial intelligence tools for bioprocessing and engineering could be a powerful approach and may inform control and optimization strategies, as such tools could be used to establish relationships between metabolism and operating conditions that cannot yet be mechanistically explained. Many aspects of complexity cannot be explained with mechanistic descriptions, Richelle noted. Combining these tools with mechanistic knowledge can contribute to resolving the complexity within the “black box” of machine learning and artificial intelligence, she suggested. However, this approach would require more applications that combine artificial intelligence and machine learning with the tools of biology, a dedicated software approach, and interdisciplinary experts with broad skill sets working on the problem. Several major open questions related to pursuing these approaches also remain unaddressed, such as how to effectively generate sufficiently informative experimental data and how to identify joint parameters across the mechanistic and data-driven parts of the model. In closing, Richelle emphasized that advancing the field toward so-called “biopharma 4.0” would benefit from a more global approach that applies machine learning to reveal interactions among factors that influence the drug production pipeline.
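A minimal sketch of one common hybrid pattern follows: a mechanistic Monod growth term is retained, and a data-driven model learns only the residual as a function of operating conditions. The kinetics, parameter values, and choice of a random forest are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)

# Operating conditions and substrate levels for a set of simulated runs.
n_runs = 300
temp = rng.uniform(30, 39, n_runs)        # degrees C
ph = rng.uniform(6.6, 7.4, n_runs)
substrate = rng.uniform(2, 20, n_runs)    # g/L

# "True" growth rate: a Monod term plus a condition effect that the
# mechanistic model does not capture.
mu_max, Ks = 0.06, 4.0
monod = mu_max * substrate / (Ks + substrate)
hidden = 0.02 * np.exp(-((temp - 36.5) ** 2) / 4) * (1 - np.abs(ph - 7.0))
observed = monod + hidden + rng.normal(0, 0.002, n_runs)

# Hybrid model: keep the mechanistic part, learn only the residual.
residual = observed - monod
ml = RandomForestRegressor(n_estimators=200, random_state=0)
ml.fit(np.column_stack([temp, ph]), residual)

hybrid = monod + ml.predict(np.column_stack([temp, ph]))
print("Hybrid RMSE:    ", round(float(np.sqrt(np.mean((hybrid - observed) ** 2))), 4))
print("Monod-only RMSE:", round(float(np.sqrt(np.mean((monod - observed) ** 2))), 4))
```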
DISCUSSION
Dimensionality of Biological Systems
The panel discussion began with a comment from Moos on the theme of reducing dimensionality, and he referred the audience to Huang's earlier discussion about the characterization of cell state in terms of a state vector. Early on, it was suggested that the vector would need to contain 2,000 to 3,000 elements in order to fully describe a cell transcriptome (Huang et al., 2005). Moos asked whether that number of elements is necessary to convey most of the information, because (1) many of those elements are not relevant to the distinction between states and (2) some of the elements could be collapsed into simpler signal transduction pathway elements. Work on factorization approaches has begun to reveal that the dimensionality is a function of the biology that researchers are trying to capture, Fertig said, so there are no right or wrong dimensions. Instead, the focus is on the dimension that the research is trying to uncover in the system. In the days of bulk RNA-sequencing, researchers found that by representing two dimensions in a system it was possible to separate tumor from normal tissue. By increasing the number of dimensions it was possible to separate various tumor subtypes (Fertig et al., 2013). Fertig added that the hierarchy of dimensions in biological systems likely warrants a hierarchy of methods—in terms of multi-scale dimensionality—to uncover a system (Way et al., 2020). This suggests that the systems are fundamentally low dimensional but multi-scale, depending on which dimension the work is trying to capture. Furthermore, various computational methodologies will yield different dimensions or features, requiring different latent space methods based on the target dimension researchers are trying to uncover.
Accounting for Zeros in Datasets
A technical issue related to sampling can arise when a zero appears in a data matrix, and speakers were asked if that means “no expression” or “dropout.” Relatively few efforts have explicitly addressed this question, Moos said, although Fabian Theis and colleagues have started to do so using an autoencoder. The pipeline developed by Francois and his colleagues allows for multiple repeated experiments to account for these types of problems with data points, he said. This and other aspects of their process allow researchers to maintain some amount of quality control when data points are missing. Francois's team can use their dataset to correct for missing data points in a reasonable and coherent way, either by accounting for the missing data points or by using them to show that the experiment has not worked. Unlike single-cell RNA sequencing, this method can systematically evaluate many dimensions of the same data point.
For sampling reasons, researchers may not capture all of the potentially useful elements when using semi-supervised methods to search for the presence or absence of an activated signal transduction pathway, Moos said. He asked whether the one-or-zero issue might be a logistic regression problem that, in a neural-network approach, could help determine whether a neuron has fired. Embedding architecture into systems might help elucidate where things are occurring or not, Fertig said. Even factorization approaches can reveal different features depending on the prior knowledge encoded into the factorized genes. This multi-scale nature of systems is critical for understanding the processes involved, she said. A long-standing issue in genomics research is that the depth of knowledge generated depends on the depth of research focused on, for example, a specific cell type. In genomics, the data analysis should be just as hypothesis-driven as the experiment if the aim is to uncover some phenomenon, she emphasized.
Insights About Biologic Processes from the Projection Space
The projection space that works for visualizations of data correlations also provides insight into the biologic process or network, Francois said. For example, the manifold where data are projected is related to the relative magnitude of two broad types of cytokines: those more associated with innate responses versus those more associated with adaptive responses. This requires further explanation, he said, because there is likely to be interesting biological information in where this low-dimension manifold sits, which might vary between different antigens and T cell receptors.
Using Immune Velocity to Phenotype Immune Health
The possibility of using immune velocity to phenotype a patient's immune health is an area that is still being explored, Francois said. It has not yet reached the stage where it is possible to look at what happens to patients based on their age, gender, and pre-existing conditions, he said. Currently, he and his colleagues are looking at different T cells and T cell receptors to see how immune velocity is defined. He and his team have found that immune velocity is a property of the antigen itself and that the approach can also be applied to other types of immune cells. He was optimistic that the tool he developed with his colleagues would eventually allow for quantifying those properties in various contexts. Moos asked whether immune velocity may be used as a functional test (e.g., for immune system–directed cellular cancer therapy). This is precisely what his group is aiming to achieve with their work, Francois replied.
Optimization in Process Design
Given that many developmental processes are modular by nature, Moos asked if the concept of sequential attractor states might help to simplify process design. A more global perspective would be helpful, Richelle said, but framing all processes as single, unitary operations is also problematic. She cautioned that focusing exclusively on optimizing every step in a drug development process (e.g., in pursuit of optimal conditions for an organism) risks overlooking the direct impact of these optimizations on the drug itself. Optimization efforts should consider the influence of that optimization on the molecule being produced and how that affects the efficacy of the drug. Solutions that initially seem suboptimal may actually be optimal for treating the disease, she added.
Hybrid Models of Machine Learning and Biological Systems
Many researchers are moving toward the combination of machine learning and the study of biological systems, sometimes called the hybrid model, Richelle said. However, there is more than one way to consider and integrate these two approaches. For example, in studying the influence of temperature on metabolism, some researchers might introduce numerous kinetic parameters to see how they affect each metabolic reaction, while others might try to combine these parameters into a single statistical effect. There is no one right approach, as different approaches fit different purposes: developing a very complex model to describe the influence of temperature on metabolism at the level of individual metabolic reactions might be conceptually interesting, she said, but less useful for controlling and developing a process. The precise way that data-driven approaches are merged with systems biology will depend on the purpose of the merging.
Patient-Focused Biological Systems Approach
In thinking about a biological systems approach, it may be useful to consider if it can begin with a patient, Moos said. Although a patient represents a sample of one, each individual hosts many complex, interacting biological activities. Given the aim of developing a broadly successful suite of therapies, it might be worthwhile to position iterative regenerative drug development processes so that they begin by distinguishing patients who respond from patients who do not. Modeling should begin at the level of disease, and the disease should be a focus throughout the process, Richelle said. A model that is person-specific could be used to explore parallel patient-specific factors (e.g., genetics, age, gender) that influence the development of personalized medicines, thus providing opportunities to expedite the development process. One goal would be to use a patient's DNA sample to evaluate how the disease is manifesting in that patient. This information could be used to appropriately target the specific quality attributes of a molecule that would allow the disease to be targeted most efficiently in that patient. Theoretically, understanding the interactions among those factors could be used to rapidly produce a molecule that is targeted uniquely for the specific patient's expression of the disease, Richelle said.
Clustering Algorithms Versus Matrix Factorization
Many methods for identifying different cell types are based on clustering algorithms that treat all transcripts identically, Moos said, even though the determination of one cell fate over another may depend less on, for example, whether a cell is expressing aldolase than on whether Bmp or Wnt signaling has been activated. Classical algorithms for comparing protein sequences recognize that changing an aspartate for a glutamate is not nearly as consequential as changing an aspartate for a phenylalanine. However, few clustering methods incorporate that recognition. Moos asked how this limitation of existing clustering methods might be addressed. Fertig said that these concerns have motivated her work with latent space representations and matrix factorization, rather than clustering methods, because the former approaches allow for gene reuse. A major limitation of clustering algorithms is that genes are forced to be a member of one class or another, despite the knowledge that genes are reused for multiple processes, as demonstrated by biological systems and systems-level approaches. This is a major advantage of matrix factorization over clustering, she added. Fertig and colleagues are exploring a transfer-learning approach to benchmark transitions from one cell type to another in datasets in which there is a known “ground truth” of a particular cell type. This avoids the reliance on a single gene; rather, it involves looking at gene signatures and how they are preserved across datasets and across cell types, she added.
Impact of Scale and Function Within a Single Transcript
Fertig also discussed the impact of scale and function within a single transcript, which is often ignored even in matrix factorization techniques. Her CoGAPS approach explicitly encodes the uncertainty in each matrix element as a variable. This allows for scaling by how much each transcript is expressed, which can then be down-weighted to avoid bias toward the most highly expressed transcripts. This semi-supervised learning approach may be powerful if various features can be encoded into the system. Furthermore, she added, it may help researchers understand the relevant factors without introducing bias toward the most highly expressed or most correlated genes. The dominance of elements that are abundant rather than important (e.g., comparing transcripts for structural proteins versus receptors and transcription factors) is a limitation in this work that has not received adequate attention, Moos said. Certain practices, such as scaling to zero mean and unit variance, continue to endure despite the knowledge that important genes are being regulated in a bistable manner. He asked how this trend may be connected to the idea that multi-omics data—if they too are regulated in a bistable manner—may be amenable to simple scaling as 1s and 0s. Fertig said that this has been considered in terms of weighting the system by the uncertainty matrix as well as by modeling regulation in a specific manner. The varying error terms in different modalities can then be included in an integrated framework. It can be useful to follow a smaller number of genes by repeating the same experiment over time, Francois added, because following a trajectory over time can reveal interesting geometric features in the data. If different experiments conducted at different times are compared, noise emerges that might not be controllable. Francois also commented on the critical period of transition between the monostable and bistable phases, during which a set of cells can be followed as a function of time. Following the entire trajectory can also reveal interesting features. The critical period of time between phase transitions is crucial in studying and modeling developmental systems, he emphasized.
Footnotes
- 1. More information about UMAP is available at https://umap-learn.readthedocs.io/en/latest (accessed November 23, 2020).
- 2. Richelle noted that the work being done to mine the world's research papers (Pulla, 2019) might change the way to extract information from historical studies.