Archive for July, 2009
Visual and statistical comparison of metagenomes
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Background: Metagenomics is the study of the genomic content of an environmental sample of microbes. Advances in the throughput and cost-efficiency of sequencing technology are fueling a rapid increase in the number and size of metagenomic datasets being generated. Bioinformatics is faced with the problem of how to handle and analyze these datasets in an efficient and useful way. One goal of these metagenomic studies is to get a basic understanding of the microbial world both surrounding us and within us. One major challenge is how to compare multiple datasets. Furthermore, there is a need for bioinformatics tools that can process many large datasets and are easy to use.
Results: This article describes two new and helpful techniques for comparing multiple metagenomic datasets. The first is a visualization technique for multiple datasets and the second is a new statistical method for highlighting the differences in a pairwise comparison. We have developed implementations of both methods that are suitable for very large datasets and provide these in Version 3 of our standalone metagenome analysis tool MEGAN.
Conclusion: These new methods are suitable for the visual comparison of many large metagenomes and the statistical comparison of two metagenomes at a time. Nevertheless, more work needs to be done to support the comparative analysis of multiple metagenome datasets.
Availability: Version 3 of MEGAN, which implements all ideas presented in this article, can be obtained from our web site at: www-ab.informatik.uni-tuebingen.de/software/megan.
Contact: mitra@informatik.uni-tuebingen.de
Supplementary information: Supplementary data are available at Bioinformatics online.
UTGB toolkit for personalized genome browsers
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
The advent of high-throughput DNA sequencers has increased the pace of collecting enormous amounts of genomic information, yielding billions of nucleotides on a weekly basis. This advance represents an improvement of two orders of magnitude over traditional Sanger sequencers in terms of the number of nucleotides per unit time, allowing even small groups of researchers to obtain huge volumes of genomic data over a fairly short period. Consequently, a pressing need exists for the development of personalized genome browsers for analyzing these immense amounts of locally stored data. The UTGB (University of Tokyo Genome Browser) Toolkit is designed to meet three major requirements for personalization of genome browsers: easy installation of the system with minimal effort, browsing locally stored data and rapid interactive design of web interfaces tailored to individual needs. The UTGB Toolkit is licensed under an open source license.
Availability: The software is freely available at http://utgenome.org/.
Contact: moris@cb.k.u-tokyo.ac.jp
CORAL: aligning conserved core regions across domain families
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: Homologous protein families share highly conserved sequence and structure regions that are frequent targets for comparative analysis of related proteins and families. Many protein families, such as the curated domain families in the Conserved Domain Database (CDD), exhibit similar structural cores. To improve accuracy in aligning such protein families, we propose a profile–profile method CORAL that aligns individual core regions as gap-free units.
Results: CORAL computes optimal local alignment of two profiles with heuristics to preserve continuity within core regions. We benchmarked its performance on curated domains in CDD, which have pre-defined core regions, against COMPASS, HHalign and PSI-BLAST, using structure superpositions and comprehensive curator-optimized alignments as standards of truth. CORAL improves alignment accuracy on core regions over general profile methods, returning a balanced score of 0.57 for over 80% of all domain families in CDD, compared with the highest balanced score of 0.45 from other methods. Further, CORAL provides E-values to aid in detecting homologous protein families and, by respecting block boundaries, produces alignments with improved ‘readability’ that facilitate manual refinement.
Availability: CORAL will be included in future versions of the NCBI Cn3D/CDTree software, which can be downloaded at http://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml.
Contact: fongj@ncbi.nlm.nih.gov.
Supplementary information: Supplementary data are available at Bioinformatics online.
Rapid detection, classification and accurate alignment of up to a million or more related protein sequences
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible, for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical.
Results: This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin–Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences.
Availability: A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu.
Contact: aneuwald@som.umaryland.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Using dynamics-based comparisons to predict nucleic acid binding sites in proteins: an application to OB-fold domains
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: We have previously demonstrated that proteins may be aligned not only by sequence or structural homology, but also using their dynamical properties. Dynamics-based alignments are sensitive and powerful tools to compare even structurally dissimilar protein families. Here, we propose to use this method to predict protein regions involved in the binding of nucleic acids. We have used the OB-fold, a motif known to promote protein–nucleic acid interactions, to validate our approach.
Results: We have tested the method using this well-characterized nucleic acid binding family. Protein regions consensually involved in statistically significant dynamics-based alignments were found to correlate with nucleic acid binding regions. The validated scheme was next used as a tool to predict which regions of the AXH-domain representatives (a sub-family of the OB-fold for which no DNA/RNA complex is yet available) are putatively involved in binding nucleic acids. The method, therefore, is a promising general approach for predicting functional regions in protein families on the basis of comparative large-scale dynamics.
Availability: The software is available upon request from the authors, free of charge for academic users.
Contact: michelet@sissa.it
Supplementary information: Supplementary data are available at Bioinformatics online.
Predictor correlation impacts machine learning algorithms: implications for genomic studies
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: The advent of high-throughput genomics has produced studies with large numbers of predictors (e.g. genome-wide association, microarray studies). Machine learning algorithms (MLAs) are a computationally efficient way to identify phenotype-associated variables in high-dimensional data. There are important results from mathematical theory and numerous practical results documenting their value. One attractive feature of MLAs is that many operate in a fully multivariate environment, allowing for small-importance variables to be included when they act cooperatively. However, certain properties of MLAs under conditions common in genomic-related data have not been well-studied—in particular, correlations among predictors pose a problem.
Results: Using extensive simulation, we showed that considering correlation within predictors is crucial in making valid inferences using variable importance measures (VIMs) from three MLAs: random forest (RF), conditional inference forest (CIF) and Monte Carlo logic regression (MCLR). Using a case–control illustration, we showed that the RF VIMs, even permutation-based ones, were less able to detect association than other algorithms at effect sizes encountered in complex disease studies. This reduction occurred when ‘causal’ predictors were correlated with other predictors, and was sharpest when RF tree building used the Gini index. Indeed, RF Gini VIMs are biased under correlation, dependent on predictor correlation strength/number and over-trained to random fluctuations in data when tree terminal node size was small. Permutation-based VIM distributions were less variable for correlated predictors and are unbiased, and thus may be preferred when predictors are correlated. MLAs are a powerful tool for high-dimensional data analysis, but well-considered use of algorithms is necessary to draw valid conclusions.
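The permutation-based importance idea mentioned above can be sketched in a model-agnostic form: shuffle one predictor column, re-score the model, and record the average drop in accuracy. This is a minimal illustration, not the authors' simulation setup. The fixed threshold rule standing in for a trained classifier, the toy data and all function names are assumptions made for this example.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, column, rng, repeats=20):
    """Mean drop in accuracy after shuffling one predictor column."""
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats


# Toy data (hypothetical): x0 drives the label, x1 is strongly correlated
# with x0, and x2 is independent noise.
rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    x0 = rng.gauss(0.0, 1.0)
    x1 = x0 + rng.gauss(0.0, 0.1)
    x2 = rng.gauss(0.0, 1.0)
    rows.append((x0, x1, x2))
    labels.append(x0 > 0)


def rule(row):
    """Stand-in for a fitted classifier; it only ever reads column 0."""
    return row[0] > 0


importances = [
    permutation_importance(rule, rows, labels, column, rng)
    for column in range(3)
]
```

In this toy setting the correlated column x1 receives zero importance only because the stand-in rule never reads it. A fitted forest may split on either of two correlated predictors, which is the situation the abstract is concerned with.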
Contact: kristin.nicodemus@well.ox.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
Complex discovery from weighted PPI networks
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: Protein complexes are important for understanding principles of cellular organization and function. High-throughput experimental techniques have produced a large number of protein interactions, which makes it possible to predict protein complexes from protein–protein interaction (PPI) networks. However, protein interaction data produced by high-throughput experiments are often associated with high false positive and false negative rates, which makes it difficult to predict complexes accurately.
Results: We use an iterative scoring method to assign weights to protein pairs, and the weight of a protein pair indicates the reliability of the interaction between the two proteins. We develop an algorithm called CMC (clustering-based on maximal cliques) to discover complexes from the weighted PPI network. CMC first generates all the maximal cliques from the PPI networks, and then removes or merges highly overlapped clusters based on their interconnectivity. We studied the performance of CMC and the impact of our iterative scoring method on CMC. Our results show that: (i) the iterative scoring method can improve the performance of CMC considerably; (ii) the iterative scoring method can effectively reduce the impact of random noise on the performance of CMC; (iii) the iterative scoring method can also improve the performance of other protein complex prediction methods and reduce the impact of random noise on their performance; and (iv) CMC is an effective approach to protein complex prediction from protein interaction networks.
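The two stages described above (enumerate maximal cliques, then remove or merge highly overlapping ones) can be sketched as follows. This is a simplified illustration and not the published CMC implementation. The abstract does not say how cliques are enumerated, so the sketch uses the Bron-Kerbosch algorithm. The scoring functions, both thresholds, the minimum clique size and the toy network are assumptions made for this example.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(set(r))
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def weighted_density(nodes, weights):
    """Average edge weight over all pairs of nodes in a cluster."""
    pairs = list(combinations(sorted(nodes), 2))
    if not pairs:
        return 0.0
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def inter_connectivity(a, b, weights):
    """Average edge weight between the non-shared parts of two clusters."""
    only_a, only_b = a - b, b - a
    if not only_a or not only_b:
        return 1.0
    total = sum(weights.get(frozenset((u, v)), 0.0)
                for u in only_a for v in only_b)
    return total / (len(only_a) * len(only_b))


def cmc_sketch(weights, overlap_thres=0.5, merge_thres=0.25, min_size=3):
    """Cliques -> rank by weighted density -> merge or drop overlapping ones."""
    adj = {}
    for edge in weights:
        u, v = tuple(edge)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = [c for c in maximal_cliques(adj) if len(c) >= min_size]
    cliques.sort(key=lambda c: (-weighted_density(c, weights), sorted(c)))
    kept = []
    for clique in cliques:
        handled = False
        for cluster in kept:
            if len(clique & cluster) / len(clique) >= overlap_thres:
                if inter_connectivity(cluster, clique, weights) >= merge_thres:
                    cluster |= clique  # well connected: merge
                handled = True         # otherwise: drop the lower-ranked one
                break
        if not handled:
            kept.append(clique)
    return kept


# Hypothetical weighted PPI network: two overlapping triangles plus one
# separate triangle. Weights stand in for interaction reliability scores.
weights = {frozenset(e): w for e, w in {
    ("A", "B"): 0.9, ("A", "C"): 0.9, ("B", "C"): 0.9,
    ("B", "D"): 0.8, ("C", "D"): 0.8,
    ("X", "Y"): 0.5, ("X", "Z"): 0.5, ("Y", "Z"): 0.5,
}.items()}
complexes = cmc_sketch(weights)
```

Here the clique {B, C, D} overlaps heavily with the higher-scoring {A, B, C}, but A and D are not connected, so it is dropped rather than merged, leaving {A, B, C} and {X, Y, Z}.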
Contact: liugm@comp.nus.edu.sg
Supplementary information: Supplementary data are available at Bioinformatics online.
Hub genes with positive feedbacks function as master switches in developmental gene regulatory networks
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: Spatio-temporal regulation of gene expression is an indispensable characteristic in the development processes of all animals. ‘Master switches’, a central set of regulatory genes whose states (on/off or activated/deactivated) determine specific developmental fate or cell-fate specification, play a pivotal role in whole developmental processes. In this study, the underlying design principles of developmental gene regulatory networks are examined through genome-wide integrative network analysis.
Results: We have found an intriguing design principle of developmental networks: hub nodes, genes with high connectivity, equipped with positive feedback loops are prone to function as master switches. This raises the important question of why positive feedback loops are frequently found in these contexts. The master switches with positive feedback make the developmental signals more decisive and robust such that the overall developmental processes become more stable. This finding provides a new evolutionary insight: developmental networks might have gradually evolved such that the master switches generate digital-like bistable signals by adopting neighboring positive feedback loops. We therefore propose that the combined presence of positive feedback loops and hub genes in regulatory networks can be used to predict plausible master switches.
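The claim that positive feedback yields digital-like bistable signals can be illustrated with a generic one-gene toy model. This is not the network analysis performed in the study. The rate equation dx/dt = signal + b·x²/(1 + x²) − x, the parameter values and the function name are all assumptions chosen for the example. With b = 2.5 and no signal, this equation has stable states at x = 0 and x = 2.

```python
def simulate(feedback_strength, pulse_end=5.0, t_end=30.0, dt=0.01):
    """Euler integration of a gene with optional self-activation.

    A transient input signal is applied until `pulse_end`, then removed.
    Returns the expression level at `t_end`.
    """
    x = 0.0
    steps = int(round(t_end / dt))
    for i in range(steps):
        t = i * dt
        signal = 1.0 if t < pulse_end else 0.0
        # Hill-type positive feedback term (assumed form, Hill coefficient 2)
        production = signal + feedback_strength * x**2 / (1.0 + x**2)
        x += dt * (production - x)  # linear degradation with unit rate
    return x


with_feedback = simulate(feedback_strength=2.5)
without_feedback = simulate(feedback_strength=0.0)
```

With feedback, the gene stays switched on near x = 2 long after the transient signal has ended. Without feedback, expression decays back to zero, so the switch keeps no memory of the signal.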
Contact: ckh@kaist.ac.kr
Supplementary information: Supplementary data are available at Bioinformatics online.
Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. Integrative transcriptomic and proteomic analysis based on partial proteomic data is likely to introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into metabolic mechanisms underlying complex biological systems.
Results: In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. A strong plateau effect in the regions of high mRNA values and sparse data occurred in this model. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. Our non-linear model consists of a set of serial regression tree models with implicit strength in variable selection. The model provides variable relative importance measures using mean square error as a criterion. The results showed that coefficients of determination for our non-linear models ranged from 0.393 to 0.582 in both datasets, providing better results than linear regression used in the past.
We evaluated the validity of this non-linear model using biological information of operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins.
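The validation idea above, comparing the coefficient of variation (standard deviation divided by mean) within biological groups against randomly drawn groups of the same sizes, can be sketched as follows. This is an illustration only. The abundance values, gene names and operon assignments are invented for the example, and the use of the population standard deviation and of 200 random draws are assumptions.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def mean_group_cv(abundance, groups):
    """Average coefficient of variation over groups with at least 2 members."""
    cvs = [
        coefficient_of_variation([abundance[gene] for gene in members])
        for members in groups
        if len(members) >= 2
    ]
    return statistics.fmean(cvs)


def random_groups_like(abundance, groups, rng):
    """Random gene groups with the same sizes as the given groups."""
    genes = list(abundance)
    return [rng.sample(genes, len(members)) for members in groups]


# Hypothetical predicted abundances for nine genes in three operons.
abundance = {
    "g1": 10.0, "g2": 11.0, "g3": 12.0,
    "g4": 100.0, "g5": 105.0, "g6": 110.0,
    "g7": 1000.0, "g8": 1020.0, "g9": 1050.0,
}
operons = [["g1", "g2", "g3"], ["g4", "g5", "g6"], ["g7", "g8", "g9"]]

rng = random.Random(1)
operon_cv = mean_group_cv(abundance, operons)
random_cv = statistics.fmean([
    mean_group_cv(abundance, random_groups_like(abundance, operons, rng))
    for _ in range(200)
])
```

If predicted abundances are biologically coherent, `operon_cv` should come out smaller than `random_cv`, as it does for this constructed toy data.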
Contact: weiwen.zhang@asu.edu; george.runger@asu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Using chemical organization theory for model checking
Posted by Waleed Ghalwash in Oxford journals on July 19th, 2009
Motivation: The increasing number and complexity of biomodels make automatic procedures for checking the models’ properties and quality necessary. Approaches like elementary mode analysis, flux balance analysis, deficiency analysis and chemical organization theory (OT) require only the stoichiometric structure of the reaction network for derivation of valuable information. In formalisms like Systems Biology Markup Language (SBML), however, information about the stoichiometric coefficients required for an analysis of chemical organizations can be hidden in kinetic laws.
Results: First, we introduce an algorithm that uncovers stoichiometric information that might be hidden in the kinetic laws of a reaction network. This allows us to apply OT to SBML models using modifiers. Second, using the new algorithm, we performed a large-scale analysis of the 185 models contained in the manually curated BioModels Database. We found that for 41 models (22%) the set of organizations changes when modifiers are considered correctly. We discuss one of these models in detail (BIOMD149, a combined model of the ERK- and Wnt-signaling pathways), whose set of organizations drastically changes when modifiers are considered. Third, we found inconsistencies in 5 models (3%) and identified their characteristics. Compared with flux-based methods, OT identifies more accurately [in 26 cases (14%)] those species and reactions that can be present in a long-term simulation of the model. We conclude that our approach is a valuable tool that helps to improve the consistency of biomodels and their repositories.
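Why modifiers can change the set of organizations can be illustrated with the closure property from OT: a species set is closed if every reaction whose reactants are all in the set produces only species in the set. The sketch below checks closure only (an organization must also be self-maintaining, which is not checked here) and is not the algorithm from the article. Treating a modifier as a catalyst that appears on both sides of the reaction is an assumption made for this example, as are the class and function names and the toy reaction.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    reactants: frozenset
    products: frozenset
    modifiers: frozenset = frozenset()


def with_modifiers_explicit(reaction):
    """Assumed treatment: a modifier is required and regenerated (catalyst)."""
    return Reaction(
        reaction.reactants | reaction.modifiers,
        reaction.products | reaction.modifiers,
    )


def is_closed(species, reactions):
    """True if no applicable reaction produces a species outside the set.

    Only the reactant and product sets are inspected; the `modifiers`
    field is ignored unless it has been made explicit beforehand.
    """
    return all(
        r.products <= species
        for r in reactions
        if r.reactants <= species
    )


# Hypothetical network: A -> B, which only proceeds in the presence of E.
network = [Reaction(frozenset({"A"}), frozenset({"B"}), frozenset({"E"}))]
explicit = [with_modifiers_explicit(r) for r in network]

closed_ignoring_modifiers = is_closed(frozenset({"A"}), network)
closed_with_modifiers = is_closed(frozenset({"A"}), explicit)
closed_with_enzyme_present = is_closed(frozenset({"A", "E"}), explicit)
```

Ignoring the modifier, {A} is not closed because A appears to produce B on its own. Once E is made explicit, {A} is closed because the reaction cannot occur without E, while {A, E} is not.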
Availability: All data and a Java applet to check SBML models are available from http://www.minet.uni-jena.de/csb/prj/ot/tools
Contact: dittrich@minet.uni-jena.de
Supplementary information: Supplementary data are available at Bioinformatics online.
