Archive for October, 2010
The mRNA landscape at yeast translation initiation sites
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Summary: Although translation initiation has been well studied, many questions remain in elucidating its mechanisms. An ongoing challenge is to understand how ribosomes choose a translation initiation site (TIS). To gain new insights, we analyzed large sets of TISs with the aim of identifying common characteristics that are potentially of functional importance. Nucleotide sequence context has previously been demonstrated to play an important role in the ribosome’s selection of a TIS, and mRNA secondary structure is also emerging as a contributing factor.
Here, we analyze mRNA secondary structure using the folding predictions of the RNAfold algorithm. We present a method for analyzing these results using a rank-ordering approach to assess the overall degree of predicted secondary structure in a given region of mRNA. In addition, we used a modified version of the algorithm that makes use of only a subset of the standard version’s output to incorporate base-pairing polarity constraints suggested by the ribosome scanning process. These methods were employed to study the TISs of 1735 genes in Saccharomyces cerevisiae.
Trends in base composition and base-pairing probabilities suggest that efficient translation initiation and high protein expression are aided by reduced secondary structure upstream and downstream of the TIS. However, the downstream reduction is not observed for sets of TISs with nucleotide sequence contexts unfavorable for translation initiation, consistent with previous suggestions that secondary structure downstream of the ribosome can facilitate TIS recognition.
Contact: mweir@wesleyan.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Selenoproteins are a group of proteins that contain selenocysteine (Sec), a rare amino acid inserted co-translationally into the protein chain. The Sec codon is UGA, which is normally a stop codon. In selenoproteins, UGA is recoded to Sec in the presence of specific features on selenoprotein gene transcripts. Due to the dual role of the UGA codon, selenoprotein prediction and annotation are difficult tasks, and even known selenoproteins are often misannotated in genome databases.
Results: We present a homology-based in silico method to scan genomes for members of the known eukaryotic selenoprotein families: selenoprofiles. The core of the method is a set of manually curated, highly reliable multiple sequence alignments of selenoprotein families, which are used as queries to scan genomic sequences. Results of the scan are processed through a number of steps to produce highly accurate predictions of selenoprotein genes with little or no human intervention. Selenoprofiles is a valuable tool for bioinformatic characterization of eukaryotic selenoproteomes, and can complement genome annotation pipelines.
Availability and Implementation: Selenoprofiles is a python-built pipeline that internally runs psitblastn, exonerate, genewise, SECISearch and a number of custom-made scripts and programs. The program is available at http://big.crg.cat/services/selenoprofiles. The predictions presented in this article are available through DAS at http://genome.crg.cat:9000/das/Selenoprofiles_ensembl.
Contact: marco.mariotti@crg.es
Supplementary information: Supplementary data are available at Bioinformatics online.
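The dual role of UGA described in the abstract above can be illustrated with a small translation routine. This is a minimal sketch, not part of Selenoprofiles: the function name `translate` and the `sec_codons` argument (the codon indices at which UGA should be read as Sec) are hypothetical, and in a real transcript that decision depends on the transcript features the abstract mentions, which this sketch does not model.

```python
# Standard genetic code in the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna, sec_codons=frozenset()):
    """Translate an mRNA string codon by codon.

    sec_codons is a hypothetical input: the 0-based codon indices at which
    UGA is recoded to selenocysteine ('U') instead of terminating translation.
    """
    protein = []
    for index in range(len(mrna) // 3):
        codon = mrna[3 * index:3 * index + 3]
        if codon == "UGA" and index in sec_codons:
            protein.append("U")  # UGA read as Sec
            continue
        residue = CODON_TABLE[codon]
        if residue == "*":  # ordinary stop codon
            break
        protein.append(residue)
    return "".join(protein)
```

Calling `translate("AUGUGAGGCUAA")` stops at the UGA and yields `M`, while `translate("AUGUGAGGCUAA", sec_codons={1})` reads through it and yields `MUG`. The difference shows why a naive annotation pipeline tends to truncate selenoprotein genes.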
Sequencing delivers diminishing returns for homology detection: implications for mapping the protein universe
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Databases of sequenced genomes are widely used to characterize the structure, function and evolutionary relationships of proteins. The ability to discern such relationships is widely expected to grow as sequencing projects provide novel information, bridging gaps in our map of the protein universe.
Results: We have plotted our progress in protein sequencing over the last two decades and found that the rate of novel sequence discovery is in a sustained period of decline. Consequently, PSI-BLAST, the most widely used method to detect remote evolutionary relationships, which relies upon the accumulation of novel sequence data, is now showing a plateau in performance. We interpret this trend as signalling our approach to a representative map of the protein universe and discuss its implications.
Contact: daniel.chubb01@imperial.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
Novel sequence-based method for identifying transcription factor binding sites in prokaryotic genomes
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the human microbiome project underway, current sequencing capacity is significantly greater than the speed at which organisms of interest can be studied experimentally. Most related computational work has been focused on sequence assembly, gene annotation and metabolic network reconstruction. We have developed a method that will primarily use available sequence data in order to determine prokaryotic transcription factor (TF) binding specificities.
Results: Specificity-determining residues (critical residues) were identified from crystal structures of DNA–protein complexes, and TFs with the same critical residues were grouped into specificity classes. The putative binding regions for each class were defined as the set of promoters for each TF itself (autoregulatory) and the immediately upstream and downstream operons. MEME was used to find putative motifs within each separate class. Tests on the LacI and TetR TF families, using RegulonDB annotated sites, showed prediction sensitivities of 86% and 80%, respectively.
Availability: http://ural.wustl.edu/~gsahota/HTHmotif/
Contact: stormo@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
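For readers unfamiliar with the sensitivity figures quoted in the abstract above, sensitivity is the fraction of annotated sites that a method recovers: true positives divided by all annotated positives. The sketch below assumes sites can be compared as simple identifiers; the function name and the toy site labels are hypothetical and are not taken from the paper or from RegulonDB.

```python
def sensitivity(predicted_sites, annotated_sites):
    """Fraction of annotated sites that also appear among the predictions."""
    annotated = set(annotated_sites)
    if not annotated:
        raise ValueError("at least one annotated site is required")
    true_positives = len(annotated & set(predicted_sites))
    return true_positives / len(annotated)


# Toy example with made-up site identifiers: 4 of 5 annotated sites recovered.
annotated = ["site1", "site2", "site3", "site4", "site5"]
predicted = ["site1", "site2", "site3", "site4", "other"]
toy_sensitivity = sensitivity(predicted, annotated)
```

Note that sensitivity says nothing about false predictions such as `other` in the toy data, so it is usually read alongside a precision or false-discovery measure.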
A Novel method for similarity analysis and protein sub-cellular localization prediction
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Biological sequences are an important object of study for many biologists because they contain a large amount of biological information that is helpful for studying cells, DNA and proteins. Many researchers currently use methods based on protein sequences, including some machine-learning methods, for function classification, sub-cellular localization, and structure and functional site prediction. The purpose of this article is to find a new approach to sequence analysis that is simpler yet effective.
Results: Based on the properties of the 64 codons of the genetic code, we propose a simple and intuitive 2D graphical representation of protein sequences. Using this representation, we give a new Euclidean-distance method to compute the distance between sequences for the analysis of sequence similarity. This approach contains more sequence information. A typical phylogenetic tree constructed with this method demonstrated the effectiveness of our approach. Finally, we use this sequence-similarity-analysis method to predict protein sub-cellular localization on two commonly used datasets. The results show that the method is reasonable.
Contact: dragonbw@163.com
SLOPE: a quick and accurate method for locating non-SNP structural variation from targeted next-generation sequence data
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Targeted ‘deep’ sequencing of specific genes or regions is of great interest in clinical cancer diagnostics where some sequence variants, particularly translocations and indels, have known prognostic or diagnostic significance. In this setting, it is unnecessary to sequence an entire genome, and target capture methods can be applied to limit sequencing to important regions, thereby reducing costs and the time required to complete testing. Existing ‘next-gen’ sequencing analysis packages are optimized for efficiency in whole-genome studies and are unable to benefit from the particular structure of targeted sequence data.
Results: We developed SLOPE to detect structural variants from targeted short-DNA reads. We use both real and simulated data to demonstrate SLOPE’s ability to rapidly detect insertion/deletion events of various sizes as well as translocations and viral integration sites with high sensitivity and low false discovery rate.
Availability: Binary code available at http://www-genepi.med.utah.edu/suppl/SLOPE/index.html
Contact: haley@genepi.med.utah.edu
R3D Align: global pairwise alignment of RNA 3D structures using local superpositions
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Comparing 3D structures of homologous RNA molecules yields information about sequence and structural variability. To compare large RNA 3D structures, accurate automatic comparison tools are needed. In this article, we introduce a new algorithm and web server to align large homologous RNA structures nucleotide by nucleotide using local superpositions that accommodate the flexibility of RNA molecules. Local alignments are merged to form a global alignment by employing a maximum clique algorithm on a specially defined graph that we call the ‘local alignment’ graph.
Results: The algorithm is implemented in a program suite and web server called ‘R3D Align’. The R3D Align alignment of homologous 3D structures of 5S, 16S and 23S rRNA was compared to a high-quality hand alignment. A full comparison of the 16S alignment with the other state-of-the-art methods is also provided. The R3D Align program suite includes new diagnostic tools for the structural evaluation of RNA alignments. The R3D Align alignments were compared to those produced by other programs and were found to be the most accurate, both in comparison with a high-quality hand-crafted alignment and according to a series of other diagnostics presented. The number of aligned base pairs as well as measures of geometric similarity are used to evaluate the accuracy of the alignments.
Availability: R3D Align is freely available through a web server http://rna.bgsu.edu/R3DAlign. The MATLAB source code of the program suite is also freely available for download at that location.
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact: r-rahrig@onu.edu
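The merging step in the abstract above relies on finding a maximum clique in a graph. The sketch below shows a basic Bron-Kerbosch search on a toy graph; it is not the R3D Align implementation (which is in MATLAB), and the toy graph is hypothetical. The abstract does not define the ‘local alignment’ graph, so the vertices here are just labels standing in for candidate local alignments, with an edge meaning two candidates are mutually compatible.

```python
def max_clique(adjacency):
    """Return one maximum clique of an undirected graph.

    adjacency maps each vertex to the set of its neighbours.
    Uses the basic Bron-Kerbosch recursion without pivoting, which is
    exponential in the worst case and only suitable for small graphs.
    """
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # clique is maximal; keep it if it is the largest seen so far
            if len(clique) > len(best):
                best = list(clique)
            return
        for vertex in list(candidates):
            expand(
                clique + [vertex],
                candidates & adjacency[vertex],
                excluded & adjacency[vertex],
            )
            candidates = candidates - {vertex}
            excluded = excluded | {vertex}

    expand([], set(adjacency), set())
    return sorted(best)


# Hypothetical toy graph: a, b and c are pairwise compatible.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
toy_clique = max_clique(toy_graph)
```

On the toy graph the largest mutually compatible set is `a`, `b`, `c`. For graphs of realistic size, a pivoting variant or a dedicated graph library would be a more practical choice.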
Metric learning for enzyme active-site search
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Finding functionally analogous enzymes based on the local structures of active sites is an important problem. Conventional methods use templates of local structures to search for analogous sites, but their performance depends on the selection of atoms for inclusion in the templates.
Results: We present a metric learning algorithm that automatically selects atoms so that site matches can be discriminated from mismatches. The algorithm provides not only good predictions, but also some insight into which atoms are important for the prediction. Our experimental results suggest that metric learning automatically provides more effective templates than those whose atoms are selected manually.
Availability: Online software is available at http://www.net-machine.net/~kato/lpmetric1/
Contact: kato-tsuyoshi@k.u-tokyo.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
Model-based clustering of microarray expression data via latent Gaussian mixture models
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: In recent years, work has been carried out on clustering gene expression microarray data. Some approaches are developed from an algorithmic viewpoint whereas others are developed via the application of mixture models. In this article, a family of eight mixture models which utilizes the factor analysis covariance structure is extended to 12 models and applied to gene expression microarray data. This modelling approach builds on previous work by introducing a modified factor analysis covariance structure, leading to a family of 12 mixture models, including parsimonious models. This family of models allows for the modelling of the correlation between gene expression levels even when the number of samples is small. Parameter estimation is carried out using a variant of the expectation–maximization algorithm and model selection is achieved using the Bayesian information criterion. This expanded family of Gaussian mixture models, known as the expanded parsimonious Gaussian mixture model (EPGMM) family, is then applied to two well-known gene expression data sets.
Results: The performance of the EPGMM family of models is quantified using the adjusted Rand index. This family of models gives very good performance, relative to existing popular clustering techniques, when applied to real gene expression microarray data.
Availability: The reduced, preprocessed data that were analysed are available at www.paulmcnicholas.info
Contact: pmcnicho@uoguelph.ca
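The adjusted Rand index used in the abstract above measures agreement between two clusterings of the same items, corrected for chance: 1 means identical partitions and values near 0 mean agreement no better than random. The following is a minimal pure-Python sketch of the standard pair-counting formula; the function name is an assumption, and the code is not taken from the authors' software.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are required")

    # Pairs of items grouped together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # chance-level agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case (e.g. both labelings put everything in one cluster).
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0 because the two labelings describe the same partition under different names, whereas `adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])` is -0.5, indicating less agreement than expected by chance.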
Global modeling of transcriptional responses in interaction networks
Posted by Waleed Ghalwash in Oxford journals on October 22nd, 2010
Motivation: Cell-biological processes are regulated through a complex network of interactions between genes and their products. The processes, their activating conditions and the associated transcriptional responses are often unknown. Organism-wide modeling of network activation can reveal unique and shared mechanisms between tissues, and potentially as yet unknown processes. The same method can also be applied to cell-biological conditions in one or more tissues.
Results: We introduce a novel approach for organism-wide discovery and analysis of transcriptional responses in interaction networks. The method searches for local, connected regions in a network that exhibit coordinated transcriptional response in a subset of tissues. Known interactions between genes are used to limit the search space and to guide the analysis. Validation on a human pathway network reveals physiologically coherent responses, functional relatedness between tissues and coordinated, context-specific regulation of the genes.
Availability: Implementation is freely available in R and Matlab at http://www.cis.hut.fi/projects/mi/software/NetResponse
Contact: leo.lahti@iki.fi; samuel.kaski@tkk.fi
Supplementary information: Supplementary data are available at Bioinformatics online.
