Varki A, Cummings RD, Esko JD, et al., editors. Essentials of Glycobiology [Internet]. 4th edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2022. doi: 10.1101/glycobiology.4e.52

Chapter 52 Glycoinformatics

Glycans are branched metabolic products whose biosynthesis commonly depends on multiple genes. Unique genes may be involved in the biosynthesis of specific glycan classes (glycoproteins, glycolipids, glycosaminoglycans, etc.), and at the same time, many glycogenes participate in the biosynthesis of more than one glycan class. These intricacies relating to gene expression, enzyme specificity, endoplasmic reticulum (ER)–Golgi compartment–specific localization of enzymes, the branched nature of glycan structures, and species-specific variation in monosaccharide composition make the analysis of glycosylation processes complicated. To aid this effort, a variety of analytical methods have been developed to identify and quantify the structure of glycans and their conjugates in biological samples. Glycoinformatics tools and software aim to use computers to integrate these experimental data, using our knowledge of glycan biosynthetic pathways as a backbone. Glycoinformatics databases ideally curate experimental data, allowing glycan structures to be rigorously defined, archived, organized, searched, and annotated. When linked to other relational databases, glycoscience data may then be integrated with related genomic, transcriptomic, proteomic, lipidomic, and metabolomic information. This chapter describes the current status of glycoinformatics databases and software development, with a focus on efforts to bridge the gap between glycan structure and function.

THE NEED FOR INFORMATICS IN GLYCOBIOLOGY

Informatics plays a critical role in virtually every aspect of modern biology. Our ability to compile the genomes of diverse organisms made it possible to predict the protein sequences. These sequences are widely used as the basis to predict protein function and to enable proteomic analyses. Proteomic studies in turn advance biological research by providing experimental evidence that supports the expression of specific proteins, in particular tissues or cell types (Figure 52.1). The informatics resources that support these endeavors are facilitated by the fact that genes and the translated polypeptides (i.e., proteins) are typically linear molecules whose sequences are readily specified as a series of characters. These representations are easy to digitize and store and many powerful informatics tools for comparing and classifying polypeptides have been developed. In contrast, the development of informatics tools for glycobiology is more difficult for several reasons. Notably, glycans are not directly encoded by genes. Rather, they are the metabolic products of multiple enzyme activities, both glycosyltransferases [GTs] and glycosidases (Chapter 6). These reactions are tightly controlled by the availability of the donor and acceptor substrates in the cellular compartment containing the GT as well as by modulation of enzyme activity and expression levels. In addition, the organization and function of the ER–Golgi biosynthetic pathway is sensitive to many factors such as the metabolic or developmental stage of the cell or its level of nutrients (Chapter 4), and this leads to structures that can be complex (Chapter 3). Furthermore, glycans are often highly branched and their structures cannot be described as a simple linear sequence (Chapter 3).

FIGURE 52.1. The critical role of glycomics in systems biology. Glycan structures have no template from which to be predicted, are regulated by cellular metabolism and glyco-enzyme expression, and modify both proteins and lipids. Glycomics thus requires the tools ...

Because of this biosynthetic and structural complexity, it is not currently possible to accurately predict the structures of the glycans that an organism can produce under different environments, or how these glycans are conjugated with other molecules, armed only with knowledge of the genome or proteome. Rather, each glycan in a biological sample must be identified using analytical methods (Chapter 50 and Chapter 51) that are sufficiently sophisticated to detect and discern the glycan's diverse structural features. Thus, research aimed at understanding the biological roles and consequences of glycan structures depends on the availability of integrated glycoinformatics databases. A consortium of international scientists under the GlySpace Alliance has made considerable progress in this area in recent years to streamline the annotation of glycans and related expression patterns such that they can be linked across diverse databases. Thus, a clearer path from glycan structure to biosynthetic pathways is beginning to emerge from these efforts.

Interpreting glycan structural information in the context of diverse types of biological and chemical information is a challenge. For example, most glycans in animals are covalently linked to proteins or lipids. The glycan moieties of a glycoprotein are linked to specific amino acids (usually asparagine, serine, or threonine) (Chapters 9 and 10). Which sites are glycosylated and which structures are present at a particular site often vary, depending on many factors, including the type, developmental stage, and disease state of the cell or tissue. Collecting, storing, and retrieving a description of each protein's glycosylation, which is time-, tissue-, organism-, interaction-, and disease-dependent, presents a major challenge to bioinformaticians working in the glycosciences (glycoinformaticians) because it requires integration of conceptually diverse information. Moreover, many different types of digital tools are necessary, ranging from basic visualization software, to software that assists in the interpretation and structural annotation of glycoanalytical data (e.g., mass spectra), to algorithms that identify correlations between glycosylation and other biological phenomena (e.g., gene expression, cell differentiation, disease). A major challenge facing the glycoinformatician is the representation of the information that is processed and produced by these software tools in ways that are conceptually accessible to scientists who do not have an extensive background in glycobiology.

Glycoinformatics enables the development of streamlined data reporting and sharing standards. For glycoproteins, the structures of both the glycan and the protein must be represented along with the relationship between these two entities (e.g., the identity of the glycosylation site and the fraction of the protein molecules that bear the glycan in each physiological state). To make this information relevant, the scientist often requires explicit information about the biological context (e.g., tissue and disease state) corresponding to the specified glycosylation or describing how the glycosylation changes when the tissue or cell is perturbed. The glycoscience community is building on the efforts of the Human Proteome Organization-Proteomics Standards Initiative (HUPO-PSI) to develop similar resources that describe the information that should be included when reporting experimental data, with digital data exchange formats to facilitate communication of structural and biological information and controlled vocabularies that allow the data that is exchanged to be unambiguously interpreted. For example, the Minimum Information Required for A Glycomics Experiment (MIRAGE) initiative is modeled after the well-established Minimum Information for A Proteomics Experiment (MIAPE) initiative of HUPO-PSI. These standards are required for the emergence of glycobiology as a mature discipline that is accessible to the scientific community as a whole.
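As a rough illustration of the information such a report has to carry, the following Python sketch models a glycoprotein record with its glycosylation sites, the glycans observed at each site, and the biological context. All class names, field names, identifiers, and values are hypothetical placeholders chosen for illustration; they are not taken from MIRAGE, HUPO-PSI, or any database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GlycanOccupancy:
    glycan_id: str   # placeholder identifier, not a real accession number
    fraction: float  # fraction of protein molecules bearing this glycan at the site


@dataclass
class GlycosylationSite:
    position: int    # 1-based residue number in the protein sequence
    amino_acid: str  # one-letter code, e.g. "N", "S", or "T"
    occupancies: List[GlycanOccupancy] = field(default_factory=list)

    def total_occupancy(self) -> float:
        """Return the summed glycan fractions, rejecting totals outside [0, 1]."""
        total = sum(o.fraction for o in self.occupancies)
        if total < 0.0 or total > 1.0 + 1e-9:
            raise ValueError(f"site {self.position}: fractions sum to {total}")
        return total


@dataclass
class GlycoproteinRecord:
    protein_id: str          # placeholder protein identifier
    context: Dict[str, str]  # biological context, e.g. tissue and disease state
    sites: List[GlycosylationSite] = field(default_factory=list)


# Hypothetical example record: one N-linked site carrying two glycoforms.
record = GlycoproteinRecord(
    protein_id="EXAMPLE_PROTEIN",
    context={"tissue": "liver", "disease_state": "healthy"},
    sites=[
        GlycosylationSite(
            position=88,
            amino_acid="N",
            occupancies=[
                GlycanOccupancy("EXAMPLE_GLYCAN_A", 0.5),
                GlycanOccupancy("EXAMPLE_GLYCAN_B", 0.25),
            ],
        )
    ],
)
```

In this sketch the site is 75% occupied overall, so the remaining quarter of the protein molecules is implicitly unglycosylated at that position. A real exchange format would also need controlled vocabularies for the context fields.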

GLYCAN STRUCTURE DRAWING

A major component of databases is the standardized depiction of glycan structures. The Symbol Nomenclature for Glycans (SNFG) (see Online Appendix 1B), a universal symbol nomenclature for the graphical representation of glycan structures, has been developed to facilitate such standardization and is used throughout this book. Online Appendix 52A lists current databases and journal publishers that have thus far accepted or strongly recommend the use of this nomenclature. A variety of drawing software has been developed to simplify the usage of the SNFG (Online Appendix 1B). Among them, GlycanBuilder is a tool for drawing glycan structures that can be used stand-alone, embedded in webpages, or integrated into other programs. Recently, an updated version of the tool was included in the GlyGen and GlyTouCan databases, providing an intuitive method for searching structural content. Built-in features support conversion between glycan display styles and formats, allowing users to switch efficiently between the SNFG, Oxford, hybrid, and International Union of Pure and Applied Chemistry (IUPAC) symbol formats while supporting different text formats: LinearCode, KCF (KEGG Chemical Function), GlycoCT, GLYDE-II, Oxford, LINUCS, and WURCS.
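To make the idea of a text format for a branched structure concrete, the sketch below stores a glycan as a tree and writes it out as a bracketed string loosely modeled on the IUPAC condensed style. It is an illustration only: the `Residue` class and `to_linear` function are hypothetical names, the first child is simply assumed to be the main chain, and none of the canonical branch-ordering rules of the real formats listed above are implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Residue:
    name: str                      # monosaccharide abbreviation, e.g. "Man"
    linkage: Optional[str] = None  # linkage to the parent, e.g. "a1-3"; None for the root
    children: List["Residue"] = field(default_factory=list)


def to_linear(node: Residue) -> str:
    """Serialize the tree from the non-reducing ends toward the root.

    The first child is written as the main chain; every further child is
    written in square brackets immediately before its parent residue.
    """
    parts = []
    for index, child in enumerate(node.children):
        text = to_linear(child)
        parts.append(text if index == 0 else f"[{text}]")
    link = f"({node.linkage})" if node.linkage else ""
    return "".join(parts) + node.name + link


# Trimannosyl chitobiose core shared by N-glycans, rooted at the reducing-end GlcNAc.
core = Residue("GlcNAc", children=[
    Residue("GlcNAc", "b1-4", [
        Residue("Man", "b1-4", [
            Residue("Man", "a1-3"),
            Residue("Man", "a1-6"),
        ]),
    ]),
])

print(to_linear(core))  # Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc
```

Because the same tree could be written with either mannose branch first, two tools can emit different strings for one structure unless they agree on an ordering rule. This is one reason why format conversion and canonicalization are needed when structures are exchanged.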

RECOGNITION OF THE NEED FOR GLYCOINFORMATICS DATABASES

The Complex Carbohydrate Structure Database (CCSD) (commonly referred to as CarbBank) was established in the mid-1980s. It was developed and maintained by the Complex Carbohydrate Research Center of the University of Georgia (United States). The main design objective of CCSD was to allow researchers to find publications in which specific carbohydrate structures were reported. The need to develop CarbBank as an international effort was clearly recognized and led to worldwide curation teams responsible for specific classes of glycans, which together contributed more than 30,000 entries to the database. During the 1990s, a Dutch group (led by Hans Vliegenthart) assigned nuclear magnetic resonance (NMR) spectra to CCSD entries (SugaBase). This was the first attempt to create a carbohydrate NMR database that complemented CCSD entries with proton and carbon chemical shift values.

Following the end of CCSD development in 1997, other large projects followed. Among these, the EUROCarbDB initiative (ceased in 2011) seeded integrated tools for streamlining European glycomics research through the development of databases, bioinformatics standards, analysis methods, and Web-based software components. The Australian company, Proteome Systems, provided commercial access to mammalian N- and O-glycan structures, and glycoprotein data curated from the literature, in GlycoSuiteDB. The final release of GlycoSuiteDB comprised more than 3000 glycoprotein-derived glycan structure entries and relevant metadata descriptors including taxonomy, disease, and methods of determination. This effort to provide curated information at the glycoprotein level now continues under the GlyGen and GlyConnect initiatives. In the United States, the Consortium for Functional Glycomics (CFG) was established in 2001. This project aimed to deepen understanding of the function of carbohydrate–protein interactions on the cell surface and in cell–cell communication. The CFG generated diverse data sets of (1) gene expression of glycosyltransferases and glycan-binding proteins (GBPs) from gene microarray experiments, (2) phenotypic analysis of transgenic mice, (3) mass spectrometric profiling of glycan structures isolated from selected cells and tissues, and (4) glycan affinity of proteins using glycan arrays.

CURRENT GLYCOINFORMATICS EFFORTS

Several large-scale initiatives to organize and integrate various glycan-related information and resources have been launched in recent years. Among these, GlyGen, GlyCosmos, and Glycomics@Expasy present integrated portals to query diverse databases related to glycomics, genomics, and proteomics. In this regard, GlyGen retrieves carbohydrate- and glycoconjugate-related data from several international data sources and integrates and harmonizes them. A user-friendly Web portal allows this information, including key associations of diverse data types, to be queried, browsed, displayed, and downloaded. GlyCosmos includes data previously collected by the Japan Consortium for Glycobiology and Glycotechnology Database (JCGGDB). This encompasses experimentally verified MS data, lectin affinity data, glycoprotein data, and glyco-gene information. For example, activity data about glycogenes such as glycosyltransferases and sugar nucleotide transporters from the GlycoGene Database (GGDB) have been integrated with KEGG Orthologs and are further linked with glycan structure (GlyTouCan) and disease information (OMIM). Pathways in which glycoproteins are involved are also integrated and can be cross-searched using the main GlyCosmos search form. Glycomics@Expasy aims to host and interconnect available resources to reflect the diversity of glyco-related interactions at the cell surface. The collection is organized around GlyConnect (glycoproteins) and UniLectin (glycan-binding proteins). It also provides tools to support data interpretation such as Compozitor that maps glycomes at any level (site, protein, cell, tissue) into interactive graphs.

Whereas the above resources are portals that query other data resources, additional unique, broadly useful data sets and webtools are available at Glycomics@Expasy, GLYCOSCIENCES.de, the Asian Community of Glycoscience and Glycotechnology database (ACGG), and the Consortium for Functional Glycomics (CFG) websites. Additional resources have curated non-mammalian glycan information (e.g., CSDB), lectin binding data (e.g., UniLectin, SugarBind), and enzyme data (e.g., CAZy, Kyoto Encyclopedia of Genes and Genomes [KEGG] GLYCAN). Data on the sites of addition of single O-GlcNAc to Ser/Thr on nuclear/cytoplasmic proteins is now being incorporated into the general glycoprotein databases.

Table 52.1 provides a more detailed description of the individual database resources that were active in 2021. Note that there is no certainty that all databases and tools cited in this chapter will remain active over time. The maintenance and quality of databases and software depend on secure funding and curated input, which historically have not been continuous.

TABLE 52.1. Glycoscience databases, repositories and web portals

DATA STANDARDIZATION AND ONTOLOGIES

A critical component of glycoinformatics is the availability of standardized approaches to share data and tools. To address this problem a number of international efforts are underway to establish standards for the presentation of glycomics data to facilitate data comparison, exchange, and verification. The Glycomics Ontology (GlycO) was the first ontology developed to provide standard terminologies for representing experimentally verified glycan structures as collections of chemically and contextually defined constituents, facilitating the association of these structural elements with biosynthetic and functional processes. Another effort, GlycoRDF, is a proposed standard ontology for glycan and related metadata using Resource Description Framework (RDF) to provide consistent terminologies for representing glycan sequences, related biological sources, publications, and experimental data on the Semantic Web. GlycoRDF is now being used by several glycomics database providers, to enable large-scale integration of diverse data collections in the glycosciences. For example, GlyTouCan uses the GlycoRDF ontology to represent the registered data such that other data resources also using this ontology can be integrated and queried by glycan structure. Different information resources (e.g., databases and publications) can reference these identifiers, thus facilitating identification and interpretation of diverse but complementary data sets that embody information about specific glycan structures.

The Semantic Web is a new technology that provides a framework for making data available directly on the Internet, provided with semantics, such that inferences can be made automatically based on the data. For example, a researcher often refers to various publications to derive a new hypothesis to test. Using the Semantic Web, the data in the publications would be formatted in such a way (using predefined vocabulary, or ontologies) that the meaning behind the data is preserved in a computable form, on the Web. Because a common vocabulary, or ontology, would be used across different publications in different websites (i.e., journals), the terminology used to encapsulate the semantics is preserved. Therefore, the Semantic Web becomes a virtual online database in which all linked data can be queried directly, without any need to transfer large amounts of data. Moreover, with such data available on the Semantic Web, machine learning technologies allow computers to make inferences based on the data available, just as a researcher would think of new hypotheses.
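The linking idea behind RDF can be sketched with plain Python tuples: every statement is a (subject, predicate, object) triple, and two resources that use the same identifier for a glycan can be queried together without copying either data set into the other. This is a minimal sketch, not GlycoRDF itself. The `ex:` predicates and all identifiers are made-up placeholders, and a real deployment would use an RDF store queried with SPARQL.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def match(triples: List[Triple], pattern: Pattern) -> List[Triple]:
    """Return every triple that agrees with the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if all(p is None or p == value for p, value in zip(pattern, t))
    ]


# Two independent, hypothetical resources that share glycan identifiers.
structure_db: List[Triple] = [
    ("glycan:EX1", "ex:hasComposition", "Hex5HexNAc2"),
    ("glycan:EX2", "ex:hasComposition", "Hex3HexNAc2"),
]
protein_db: List[Triple] = [
    ("protein:EX_P1", "ex:hasGlycan", "glycan:EX1"),
    ("protein:EX_P2", "ex:hasGlycan", "glycan:EX2"),
]
merged = structure_db + protein_db


def proteins_with_composition(triples: List[Triple], composition: str) -> List[str]:
    """Join the two resources on the shared glycan identifier."""
    glycans = {s for s, _, _ in match(triples, (None, "ex:hasComposition", composition))}
    return sorted(
        s for s, _, o in match(triples, (None, "ex:hasGlycan", None)) if o in glycans
    )


print(proteins_with_composition(merged, "Hex5HexNAc2"))  # ['protein:EX_P1']
```

The join works only because both resources refer to the glycan by the same identifier and use an agreed vocabulary for the predicates, which is the role that a shared ontology and a shared registry of structures play.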

Accurate interpretation of glycoanalytical data for glycan structure determination requires well-documented metadata, including the parameters used to acquire and process the raw data along with supporting biological source information for the sample being analyzed. The MIRAGE initiative was formed to develop guidelines for researchers to report the qualitative and quantitative results obtained by diverse types of glycomics analyses (e.g., chromatography, mass spectrometry, and glycan/lectin arrays).

In an effort to allow glycan data to be shared seamlessly among glycan resources, GlyTouCan has been established as a stable repository and registry of chemically valid glycan structures. For each glycan, it provides a unique accession number that is now used by diverse glycobiology data systems so that information on the glycan can be linked across databases. These unique GlyTouCan identifiers thus provide the foundation for linking glycan-related knowledge to the Semantic Web. The initial data set of GlyTouCan structures came from GlycomeDB, an undertaking that consolidated structures from a number of established glycan structure databases, including CarbBank, and provided links to the original sources. Registered users can submit any glycan structure, whether it is fully defined, contains ambiguous linkages, or is simply a monosaccharide composition, independent of experimental evidence. All structures registered in GlyTouCan are checked only for representational and chemical consistency and not for biological relevance. Thus, databases that leverage GlyTouCan structural representations still require improved methods for establishing and validating the biological context of these structures. This registry facilitates the interpretation of results in the context of structural and biological information that is available from other sources. Nevertheless, the meaningful comparison of incompletely defined structures in terms of sequence, linkage, and anomericity remains an ongoing challenge for glycoinformaticians. It should be noted that, because of the reliance on analytical structure determination, a limitation for a majority of databases is the number of fully characterized entries (i.e., defined linkages and anomeric configurations between monosaccharides). For example, fewer than 15,000 of the 40,000 structures in the amalgamated GlycomeDB are fully defined.

These unique identifiers, however, provide the semantic foundation required for individuals or databases to effectively communicate by recording the identifier of the structures they have characterized.

SOFTWARE TOOLS FOR EXPERIMENTAL DATA INTERPRETATION

Many advances in glyco-related databases and informatics tools development have focused on the interpretation and storage of analytical data, including liquid chromatography (LC), capillary electrophoresis (CE), interaction arrays, mass spectrometry (MS), and three-dimensional (3D)/modeling/nuclear magnetic resonance (NMR) (Table 52.2).

TABLE 52.2. Glycoinformatics data analysis software tools

Mass Spectrometry

Most efforts so far have focused on tools that assist the interpretation of MS data. A number of commercial and publicly accessible software tools are now available. Introduced in 1999, the widely used GlycoMod was the first glycoinformatics Web-based tool to be released in this context; it suggests possible glycan compositions from experimental mass values of either free or derivatized glycans or glycopeptides. The Glycosciences.de portal started a toolset for analyzing and interpreting experimental data, and a decade later, the EUROCarbDB initiative launched GlycoWorkbench as a freely downloadable software tool to assist the interpretation of MS/MS data by matching a theoretical list of fragment masses against the experimental peak list derived from the mass spectrum. This tool has been integrated into several glycomics resources as it provides an easy-to-use interface, a comprehensive collection of fragmentation types, and a list of annotation options. UniCarb-DB adopts the approach of storing annotated and curated experimental glycan MS/MS data against which spectral matching can be used to identify unknown structures. RINGS provides Web-based software (Glycan Miner Tool, ProfilePSTMM) to analyze distinguishable glycan fragments from glycan profiling (MS) data, and GlycomeAtlas is a visualization tool for glycan profiling data in mouse and human that shows the distribution of glycans across various tissues. The GRITS toolbox is also a freely available integrated environment for processing glycoanalytic data. It uses a plug-in approach to facilitate reuse and integration of data-processing modules. GRITS includes a module called GELATO for collecting, annotating, and comparing mass spectral data, manipulating the corresponding metadata, and generating reports. It also uses libraries from GlycoWorkbench.
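The task of suggesting compositions from an experimental mass can be sketched as a brute-force search over monosaccharide counts. The code below is not GlycoMod's algorithm. It assumes a neutral, underivatized free glycan, uses monoisotopic residue masses, and limits the search with arbitrary upper bounds (`MAX_COUNT`) chosen only to keep the example small.

```python
from itertools import product
from typing import Dict, List

# Monoisotopic masses (Da) of monosaccharide residues, i.e. after loss of water.
RESIDUE_MASS: Dict[str, float] = {
    "Hex": 162.05282,
    "HexNAc": 203.07937,
    "dHex": 146.05791,
    "NeuAc": 291.09542,
}
WATER = 18.010565  # one water is added back for the free reducing glycan

# Illustrative search bounds; an assumption, not a biological limit.
MAX_COUNT: Dict[str, int] = {"Hex": 12, "HexNAc": 8, "dHex": 4, "NeuAc": 4}


def candidate_compositions(observed_mass: float, tolerance: float = 0.01) -> List[Dict[str, int]]:
    """List compositions whose neutral monoisotopic mass lies within tolerance (Da)."""
    names = list(RESIDUE_MASS)
    hits = []
    for counts in product(*(range(MAX_COUNT[n] + 1) for n in names)):
        mass = WATER + sum(c * RESIDUE_MASS[n] for c, n in zip(counts, names))
        if abs(mass - observed_mass) <= tolerance:
            hits.append({n: c for n, c in zip(names, counts) if c})
    return hits


candidates = candidate_compositions(1234.4334)
```

For an observed neutral mass of 1234.4334 Da, the candidates include Hex5HexNAc2. A composition is not a structure: many isomeric glycans share it, so a mass match can only narrow the possibilities.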

With recent advances in high-accuracy, high-throughput mass spectrometers (Chapter 51), glycoinformatics tools have emerged (and continue to emerge and be improved) to analyze these data both for glycomics and glycoproteomics applications (Table 52.2). Despite the rapid speed of these glycoinformatics tools, the analysis may only reveal partial glycan structure information as it is challenging to determine glycosidic bond linkage type and position using MS/MS data alone, particularly in the absence of standards. To alleviate these problems in glycomics analysis, glycan spectral library data repositories have been established by various resources including NIST and UniCarb-DR, and their integration into computational programs is awaited. From the biological perspective, however, even partial information is helpful as it enables comparative glycomics analysis of different healthy and disease tissues.

Similar approaches have also been undertaken in the field of glycoproteomics, in which glycoproteins and glycopeptides are analyzed in their native form, with glycans attached to them, using MS techniques (Figure 52.1; Chapter 51). Glycoproteomics not only allows the identification of the glycoprotein and the sites of the attached glycans, but can also provide some specific microheterogeneity information on the glycan compositions. Different fragmentation methods (electron transfer dissociation [ETD], collision-induced dissociation [CID], and higher-energy collision dissociation [HCD]) each provide distinct information about some different structural features of glycoproteins. More recently both stepped-energy HCD, and ETD supplementation with HCD (EThcD) have emerged as powerful strategies for revealing both the composition of the attached glycans and the sites of glycosylation. The application of multiple fragmentation modes to a single candidate glycopeptide sets the stage for the development of glycoinformatics tools that may reveal some information on the attached glycan structure (in addition to composition) at specific sites. The mass spectrum matching algorithms used in these programs vary widely (e.g., database search, de novo sequencing/open-glycan searching, or spectral library matching) to identify specific spectral fragment masses and thereby assign glycan/glycopeptide structural features. Following the ability now to obtain large data sets on glycopeptides generated from complex mixtures of glycoproteins (Chapter 51), a bottleneck that has severely limited the field of glycoproteomics is the downstream glycopeptide structural identification. The identification process was, until recently, largely driven by manual expert annotation of the resulting MS/MS spectra. 
However, many glycoinformatics initiatives are under continuing development to automate this glycopeptide identification process using various strategies to identify intact glycopeptides by using characteristic fragment ions. Several software tools are freely available to address the challenge of analyzing glycoproteomics data, such as pGlyco, GlycoPAT, MSFragger-Glyco, GlycReSoft, GP Finder, IQ-GPA, O-Pair, and GPQuest. Licensed Byonic and open-source Protein Prospector and MASCOT, initially designed for proteomics studies, allow semi-automated identification of N- and O-glycopeptides from high-resolution MS/MS data. The recent HUPO-HGI (Human Glycoproteomics Initiative) interlaboratory glycoproteomics analysis provides a side-by-side comparison of some of these programs, with a focus on highlighting best practices in the different approaches. Quantitation analysis can either be performed using the generic SkyLine platform or dedicated tools in the Happy-Tools collection.
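One simple use of characteristic fragment ions is to screen MS/MS spectra for oxonium ions before attempting a full glycopeptide search. The following sketch checks a peak list against a few commonly used singly protonated oxonium ion masses. It is a simplified illustration under these assumptions, it does not describe how any of the tools named above work, and the function name and the 20 ppm default tolerance are arbitrary choices.

```python
from typing import Dict, List

# Monoisotopic m/z of common singly protonated oxonium ions.
OXONIUM_IONS: Dict[str, float] = {
    "Hex": 163.0601,
    "HexNAc": 204.0867,
    "NeuAc-H2O": 274.0921,
    "NeuAc": 292.1027,
    "HexHexNAc": 366.1395,
}


def find_oxonium_ions(peaks_mz: List[float], ppm: float = 20.0) -> Dict[str, float]:
    """Map each oxonium ion found in the peak list to its closest observed m/z."""
    found: Dict[str, float] = {}
    for label, theoretical in OXONIUM_IONS.items():
        window = theoretical * ppm / 1e6  # absolute tolerance in m/z units
        matches = [mz for mz in peaks_mz if abs(mz - theoretical) <= window]
        if matches:
            found[label] = min(matches, key=lambda mz: abs(mz - theoretical))
    return found


# Hypothetical peak list from one MS/MS spectrum.
spectrum = [138.055, 204.087, 366.140, 500.25]
hits = find_oxonium_ions(spectrum)
```

Here the spectrum yields matches for the HexNAc and HexHexNAc ions, which marks it as a likely glycopeptide spectrum. Real software would also weigh peak intensities and go on to score peptide and glycan fragments.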

Liquid Chromatography

In comparison to MS, few software tools are available for supporting (ultra)high-performance liquid chromatography (U/HPLC) data analysis and storage. GlycoStore is a curated chromatographic and capillary electrophoretic composition database of labeled glycans (2-AB, RFMS, and 2-AA) of N-, O-, glycosphingolipid (GSL) glycans and free oligosaccharides. The database is built on publicly available experimental LC data sets from GlycoBase, which has now been commercialized. To assist analysis, GlycanAnalyzer is available for pattern-matching N-glycan LC peak shifts following exoglycosidase digestions, which can be used to assign structures to each peak. GlycoStore also provides access to CE migration data for a limited set of glycan structures.

Nuclear Magnetic Resonance Spectroscopy

NMR data on carbohydrate structures were collected in the 1980s and 1990s, and NMR is still the best analytical technique available to obtain complete structural information on purified oligosaccharides. It is less used now because of the difficulty in obtaining sufficient material from biological sources. The CASPER (Computer-Assisted Spectrum Evaluation of Regular Polysaccharides) program predicts 1H- and 13C-NMR chemical shifts of glycans and is therefore used to determine glycan structures from experimental NMR data.

Glycan-Binding Data and Interpretation

Another area of software analysis of experimental data has been in mining glycan array data sets to identify glycan sequence motifs recognized by various GBPs, such as plant and animal lectins, viral and bacterial pathogen proteins, and antibodies. Several data analysis tools for glycan array experiments have recently emerged, including MotifFinder, GLYMMR, CCARL, Glycan Miner, Glycan Microarray Database (GlyMDB), and GLAD. Such data analysis determines the relative binding strength/specificity of a GBP to a glycan motif or determinant on the array. RINGS also has analytical tools for predicting glycan-binding patterns from glycan array data.

A few databases collect information on glycan-binding proteins. The Lectin Frontier DataBase (LfDB) provides affinity data for a few hundred lectins. UniLectin, which includes the UniLectin3D collection of thousands of curated lectin 3D structures, suggests a classification based on protein folds and stores these predictions. These classes enable the definition of profiles that can be used to screen sequence databases and predict glycan-binding domains. SugarBind is a curated database of literature-derived knowledge of pathogen–glycan binding, and MatrixDB collects glycosaminoglycan-binding proteins.

3D Glycan Structure Modeling

Because of their inherent flexibility, oligosaccharides typically exist, in solution or on proteins, as an ensemble of conformations, making it a challenge to describe their 3D structure (see Chapters 30 and 50 for a description of 3D structures). Computational chemistry is an essential tool for analyzing glycan experimental data, making predictions that may be tested experimentally, and unraveling and explaining chemical processes at the atomic level.

Web-based tools are available to generate a theoretical model of a carbohydrate 3D structure. A useful resource is GLYCAM-Web that provides tools for modeling oligosaccharides and glycoproteins in addition to providing downloadable structure files that can be used for molecular modeling. SWEET-II is also a carbohydrate 3D builder that is available on the GLYCOSCIENCES.de website.

The two major databases for storing experimentally determined 3D carbohydrate structures are the PDB and the Cambridge Structural Database. Crystal structures of oligosaccharides are also available at Glyco3D. A recent extension of the latter is GAG-DB, centered on the 3D description of glycosaminoglycan-binding proteins. Most of the carbohydrates in the PDB are either connected covalently to a glycoprotein or form a complex with a lectin, enzyme, or antibody. Recently, the PDB has undergone a carbohydrate remediation, ensuring that carbohydrates are accurately annotated. Therefore PDB entries now contain glycan annotations, which are available in LINUCS, GLYCAM (IUPAC-like), and WURCS formats.

Glycan Attachment Sites on Proteins

Despite the known sequon (NXT/S, where X is not Pro) for N-linked glycosylation, many potential sites are not glycosylated in vivo, and there are no clear motifs for predicting O-linked glycosylation. Understanding the “rules” of attachment site specificity for the glycosylation of proteins is thus an ongoing challenge for glycoinformaticians. Over the past 20 years, neural networks, hidden Markov models (HMMs), and support vector machines (SVMs) have been implemented to predict N- or O-glycosylation and C-mannosylation. Although the original tools were hosted on the Danish CBS Prediction Servers, additional resources have emerged in the last few years.
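Locating candidate N-glycosylation sites from the sequon alone is a simple pattern search, sketched below with a regular expression. It only reports positions where the N-X-S/T pattern occurs (with X not proline) and says nothing about whether a site is actually occupied, which is what the machine learning predictors mentioned above try to estimate. The function name and example sequences are made up for illustration.

```python
import re
from typing import List

# "N", then any residue except "P", then "S" or "T". The lookahead keeps the
# match one residue wide so that overlapping sequons are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


print(find_sequons("MNGTNPSANAS"))  # [2, 9]
```

In the example, the asparagine at position 5 is skipped because it is followed by proline, whereas positions 2 and 9 are reported as potential sites.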

Glycoprotein informatic resources of GlyGen, GlyCosmos, and Glycomics@Expasy now provide information on the glycan structures as attached to proteins in a complementary manner. The coverage and content depend on both automated and manual efforts to mine or curate current literature that contains characterized glycan structures and their sites of attachment to proteins and on supporting data from experimental conditions and biological sources. This collaborative, international bioinformatic integration of complex molecular data from all types of glycoanalytical techniques and interactions is in constant development and is essential for the continued progress of glycobiological research.

FUTURE PERSPECTIVES FOR GLYCOINFORMATICS DEVELOPMENT

Glycobiology as a Part of Systems Biology

Systems biology involves the development, simulation, and analysis of biological systems (including whole-body and environmental systems) at the molecular and cellular levels. As research on glycan biosynthetic pathway simulation progresses, its integration with genomics, transcriptomics, proteomics, lipidomics, and metabolomics data represents the next step. This will result in a holistic understanding of biological processes (Figure 52.1), such that glycomics data can be viewed in the context of complementary data. Such integrated knowledge will result in better elucidation of the glycosylation process, reveal new interactions with GBPs, and enhance our understanding of related functional consequences.

The current coordinated trend toward RDF-based data integration is already shaping future developments of glycoscience databases. It will help bridge the gap between glycomics and other -omics that have already adopted RDF ontologies but that are still very much DNA sequence–centered. Indeed, 50 years after the advent of sequencing technology, DNA sequences, along with their links to other data types (gene expression, protein structure, etc.), remain the most prominent entities in molecular biology databases and repositories. Scientists gather sequence-centered information in the course of elucidating a cellular process or a pathological behavior, simply because a gene/protein sequence is usually the common element shared across -omics domains. The problem here is that glycans are only linked to the gene via their biosynthetic enzymes and substrates. The advancement of glycoscience as a discipline depends on expanding the integration of data describing glycoproteins, glycolipids, glycosaminoglycans, lipopolysaccharides, and the genome-coded enzymatic machinery that generates or breaks down these glycans, together with the ever-increasing information about the interactions of these glycoconjugates with other components of the cell.
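The indirect link between glycans and genes described above can be illustrated with a toy set of subject-predicate-object statements in the spirit of RDF. This is a sketch under stated assumptions: the identifiers (`glycan:G1`, `enzyme:E1`, `gene:A`) and predicate names (`synthesized_by`, `encoded_by`) are hypothetical and do not come from any real glycoscience ontology or database, and plain Python tuples stand in for a real triple store.

```python
# Hypothetical triples (subject, predicate, object); every identifier is illustrative.
TRIPLES = [
    ("glycan:G1", "synthesized_by", "enzyme:E1"),
    ("glycan:G1", "synthesized_by", "enzyme:E2"),
    ("glycan:G2", "synthesized_by", "enzyme:E2"),
    ("enzyme:E1", "encoded_by", "gene:A"),
    ("enzyme:E2", "encoded_by", "gene:B"),
]


def objects(triples, subject, predicate):
    """Collect all objects of statements with the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def genes_for_glycan(triples, glycan):
    """Reach genes from a glycan in two hops: glycan -> enzyme -> gene."""
    genes = set()
    for enzyme in objects(triples, glycan, "synthesized_by"):
        genes |= objects(triples, enzyme, "encoded_by")
    return sorted(genes)


print(genes_for_glycan(TRIPLES, "glycan:G1"))
```

No statement connects a glycan directly to a gene, so the lookup must pass through the enzymes. In this toy model, that two-step traversal is the kind of join that shared identifiers and ontologies are meant to make possible across databases.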

Linking Glycan Structures with Function

The ultimate goal of glycoscience research is to link glycan structures with their function. Although it is still difficult to identify fully defined glycan structures on their conjugates in a high-throughput manner, various hypotheses regarding the relationships between the specific structure of a glycan and its biological functions have been developed. One hypothesis maintains that the structural features characteristic of a group of glycans, rather than of a single structure, are required for biological function. This hypothesis is plausible but is unlikely to hold in all cases, as many discrete glycans are known to have quite specific functions, and changing a single monosaccharide or glycosidic conformation can greatly affect their capacity to perform those functions. Thus, additional work is required to accumulate as much experimental glycomics data as possible in standardized formats, such that comprehensive, integrated analyses can be performed using bioinformatics technologies.

Collaboration in Glycoinformatics

The future of current endeavors in glyco-related informatics lies in the consolidation of international consortia. The small size of the glycoscience community has prompted several cooperative initiatives across all continents for representing and collecting glycomics data (as described above). To favor interactions between these complementary initiatives, the international Glycome Informatics Consortium (GLIC) was founded in 2015 to provide and maintain a centralized software resource for developers, thus enabling cooperative database and tool development. Also in 2015, a Glycomics section was established on the Swiss Institute of Bioinformatics (SIB) Expasy proteomics resource portal, and glycoprotein entries in UniProtKB have also been linked to glycan structural information, when known, in GlyGen and GlyConnect. In addition, the U.S. National Institutes of Health (NIH), as part of the Common Fund Glycoscience Program, is now focused on creating new methodologies and resources to study glycans, including the development of data integration and analysis tools.

In 2018, the GlySpace Alliance (glyspace.org) was formed for further standardization and collaboration between glycan database portals. This alliance consists of GlyGen, funded by the U.S. NIH Common Fund; GlyCosmos, funded by the Japan Science and Technology Agency–National Bioscience Database Center; and Glycomics@Expasy of the Swiss Institute of Bioinformatics. The key goals of this alliance are to ensure that all data are available under a completely open license, to provide the provenance of all data, to share all data between resources, and to quality-check all data.

These international efforts affirm that the importance of bioinformatics resources for glycoscience is finally being recognized, such that the role of glycans may be more easily understood and accessed by the broader research community. However, it is clear that there is a long way to go before the entire community can have routine access to what aficionados of nucleic acid and protein biology currently take for granted: reliable, well-curated, user-friendly, cross-referenced databases that are permanently and safely housed in major long-term government-funded central servers. Achievement of this goal will be critical to bringing the study of glycans into the mainstream of evolutionary, molecular, and cellular biology, and its applications to medicine, materials science, and other fields that benefit humankind.

ACKNOWLEDGMENTS

The authors acknowledge contributions to previous versions of this chapter by Ram Sasisekharan and appreciate helpful comments and suggestions from Manfred Wuhrer.

Copyright © 2022 The Consortium of Glycobiology Editors, La Jolla, California; published by Cold Spring Harbor Laboratory Press; doi:10.1101/glycobiology.4e.52. All rights reserved.

The content of this book is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 Unported license. To view the terms and conditions of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/

Bookshelf ID: NBK579995; PMID: 35536985; DOI: 10.1101/glycobiology.4e.52
