Archive for August, 2011
Tutorial videos of bioinformatics resources: online distribution trial in Japan named TogoTV
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
In recent years, biological web resources such as databases and tools have become more complex because of the enormous amounts of data generated in the field of life sciences. Traditional methods of distributing tutorials include publishing textbooks and posting web documents, but such static content cannot adequately describe recent dynamic web services. Due to improvements in computer technology, it is now possible to create dynamic content such as video with minimal effort and low cost on most modern computers. The ease of creating and distributing video tutorials instead of static content improves accessibility for researchers, annotators and curators. This article focuses on online video repositories for educational and tutorial videos provided by resource developers and users. It also describes a project in Japan named TogoTV (http://togotv.dbcls.jp/en/) and discusses, with examples, the production and distribution of high-quality tutorial videos that are useful to viewers. This article intends to stimulate and encourage researchers who develop and use databases and tools to distribute how-to videos as a tool to enhance product usability.
A toolbox for developing bioinformatics software
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
Creating useful software is a major activity of many scientists, including bioinformaticians. Nevertheless, software development in an academic setting is often unsystematic, which can lead to problems associated with maintenance and long-term availability. Unfortunately, well-documented software development methodology is difficult to adopt, and technical measures that directly improve bioinformatic programming have not been described comprehensively. We have examined 22 software projects and have identified a set of practices for software development in an academic environment. We found them useful for planning a project, supporting the involvement of experts (e.g. experimentalists), and promoting higher quality and maintainability of the resulting programs. This article describes 12 techniques that facilitate a quick start into software engineering. We describe 3 of the 22 projects in detail and give many examples to illustrate the usage of particular techniques. We expect this toolbox to be useful for many bioinformatics programming projects and for the training of scientific programmers.
Letter to the Editor: Current progress in patient-specific modeling by Neal and Kerckhoffs (2010)
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
A recent review article on ‘Current progress in patient-specific modeling’ in Briefings in Bioinformatics contains a statement summarizing the results of our previous study, ‘On the unimportance of constitutive models in computing brain deformation for image-guided surgery’, published in Biomechanics and Modeling in Mechanobiology, as confirmation of the adequacy of a linear elastic model for such computation. The purpose of this Letter to the Editor is to clarify this statement by informing the readers of Briefings in Bioinformatics that our study indicates the following: (i) a simple linear elastic constitutive model for the brain tissue is sufficient when used with an appropriate finite deformation solution (i.e. geometrically non-linear analysis); and (ii) a linear analysis approach that assumes infinitesimally small brain deformations leads to unrealistic results.
LEPSCAN–a web server for searching latent periodicity in DNA sequences
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
A web server for searching latent periodicity based on the method of modified profile analysis has been developed. This method allows searching for latent periodicity in the presence of insertions and deletions. The search uses periodicity classes that we found earlier for various groups of organisms. The period length lies in the range 2–20 nt, not including the triplet periodicity. The results obtained are subjected to various filtration steps to ensure their statistical significance. Availability: Use of the web server is free for non-commercial users. No registration is required. The URL of the server is http://victoria.biengi.ac.ru/lepscan. The current software version is 1.06.
How to cluster gene expression dynamics in response to environmental signals
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
Organisms usually cope with change in the environment by altering the dynamic trajectory of gene expression to adjust the complement of active proteins. The identification of particular sets of genes whose expression is adaptive in response to environmental changes helps to understand the mechanistic basis of gene–environment interactions essential for organismic development. We describe a computational framework for clustering the dynamics of gene expression in distinct environments through Gaussian mixture fitting to the expression data measured at a set of discrete time points. We outline a number of quantitative testable hypotheses about the patterns of dynamic gene expression in changing environments and gene–environment interactions causing developmental differentiation. Future directions of gene clustering, in terms of incorporating the latest biological discoveries and statistical innovations, are discussed. We provide a set of computational tools that are applicable to modeling and analysis of dynamic gene expression data measured in multiple environments.
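To make the idea of Gaussian mixture fitting to time-course data concrete, here is a minimal sketch of a generic expectation-maximization (EM) fit. It is not the framework described in the abstract: it assumes a single environment, diagonal covariances and a deterministic initialization, and the function name `fit_diagonal_gmm`, its parameters and the toy expression values are all hypothetical.

```python
import math

def fit_diagonal_gmm(profiles, k, n_iter=50, var_floor=1e-3):
    """Fit k diagonal-covariance Gaussians to expression profiles by EM.

    profiles: list of equal-length lists, one expression value per time point.
    Returns (means, responsibilities).
    """
    n, t = len(profiles), len(profiles[0])
    # Deterministic start (an assumption made for reproducibility): take the
    # initial means from profiles spread evenly through the input order.
    means = [list(profiles[(j * n) // k]) for j in range(k)]
    variances = [[1.0] * t for _ in range(k)]
    weights = [1.0 / k] * k
    resp = [[1.0 / k] * k for _ in range(n)]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each gene,
        # computed in log space to avoid numerical underflow.
        for i, x in enumerate(profiles):
            log_p = []
            for j in range(k):
                ll = math.log(weights[j])
                for d in range(t):
                    v = variances[j][d]
                    ll -= 0.5 * (math.log(2.0 * math.pi * v)
                                 + (x[d] - means[j][d]) ** 2 / v)
                log_p.append(ll)
            top = max(log_p)
            log_norm = top + math.log(sum(math.exp(lp - top) for lp in log_p))
            resp[i] = [math.exp(lp - log_norm) for lp in log_p]

        # M-step: re-estimate mixing weights, mean trajectories and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            for d in range(t):
                mu = sum(r[j] * x[d] for r, x in zip(resp, profiles)) / nj
                var = sum(r[j] * (x[d] - mu) ** 2
                          for r, x in zip(resp, profiles)) / nj
                means[j][d] = mu
                variances[j][d] = max(var, var_floor)
    return means, resp

# Toy data (invented): three genes rising over four time points, three falling.
profiles = [
    [0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.0, 3.2], [-0.1, 0.9, 2.1, 2.9],
    [3.0, 2.0, 1.0, 0.0], [3.1, 2.0, 0.9, 0.1], [2.9, 2.1, 1.1, -0.1],
]
means, resp = fit_diagonal_gmm(profiles, k=2)
# Assign each gene to its most probable component.
labels = [max(range(2), key=r.__getitem__) for r in resp]
```

Each fitted mean vector can be read as the typical expression trajectory of one cluster. A real analysis would also have to choose the number of components and model the dependence between time points, which this sketch ignores.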
Bioinformatics tools and database resources for systems genetics analysis in mice–a short review and an evaluation of future needs
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
During a meeting of the SYSGENET working group ‘Bioinformatics’, currently available software tools and databases for systems genetics in mice were reviewed and the needs for future developments discussed. The group evaluated interoperability and performed initial feasibility studies. To aid future compatibility of software and exchange of already developed software modules, a strong recommendation was made by the group to integrate HAPPY and R/qtl analysis toolboxes, GeneNetwork and XGAP database platforms, and TIQS and xQTL processing platforms. R should be used as the principal computer language for QTL data analysis in all platforms and a ‘cloud’ should be used for software dissemination to the community. Furthermore, the working group recommended that all data models and software source code should be made visible in public repositories to allow a coordinated effort on the use of common data structures and file formats.
Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
Phylogenomic databases provide orthology predictions for species with fully sequenced genomes. Although the goal seems well-defined, the content of these databases differs greatly. Seven ortholog databases (Ensembl Compara, eggNOG, HOGENOM, InParanoid, OMA, OrthoDB, Panther) were compared on the basis of reference trees. For three well-conserved protein families, we observed a generally high specificity of orthology assignments for these databases. We show that differences in the completeness of predicted gene relationships and in the phylogenetic information are, for the great majority, not due to the methods used, but to differences in the underlying database concepts. According to our metrics, none of the databases provides a fully correct and comprehensive protein classification. Our results provide a framework for meaningful and systematic comparisons of phylogenomic databases. In the future, a sustainable set of ‘Gold standard’ phylogenetic trees could provide a robust method for phylogenomic databases to assess their current quality status, measure changes following new database releases and diagnose improvements subsequent to an upgrade of the analysis procedure.
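As a rough illustration of benchmarking predictions against a reference, the sketch below scores a set of predicted ortholog pairs against pairs derived from a reference tree. The abstract does not define its metrics, so precision and recall are used here as assumed stand-ins; the function `pair_metrics` and the gene identifiers are hypothetical.

```python
def pair_metrics(predicted, reference):
    """Return (precision, recall) of predicted ortholog pairs.

    Pairs are unordered, so ("h1", "m1") and ("m1", "h1") count as the same pair.
    """
    pred = {frozenset(p) for p in predicted}
    ref = {frozenset(p) for p in reference}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    return precision, recall

# Invented example: two of three predictions agree with the reference,
# which itself contains four ortholog pairs.
predicted = [("h1", "m1"), ("h2", "m2"), ("h3", "m9")]
reference = [("m1", "h1"), ("h2", "m2"), ("h4", "m4"), ("h5", "m5")]
precision, recall = pair_metrics(predicted, reference)
```

Under these assumptions, a database with high precision but low recall would be making mostly correct assignments while missing many of the relationships in the reference tree.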
Calculating transcription factor binding maps for chromatin
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
Current high-throughput experiments already generate enough data for retrieving the DNA sequence-dependent binding affinities of transcription factors (TF) and other chromosomal proteins throughout the complete genome. However, the reverse task of calculating binding maps in a chromatin context for a given set of concentrations and TF affinities appears to be even more challenging and computationally demanding. The problem can be addressed by considering the DNA sequence as a one-dimensional lattice with units of one or more base pairs. To calculate protein occupancies in chromatin, one needs to consider the competition of TF and histone octamers for binding sites as well as the partial unwrapping of nucleosomal DNA. Here, we consider five different classes of algorithms to compute binding maps that include the binary variable, combinatorial, sequence generating function, transfer matrix and dynamic programming approaches. The calculation time of the binary variable algorithm scales exponentially with DNA length, which limits its use to the analysis of very small genomic regions. For regulatory regions with many overlapping binding sites, potentially applicable algorithms reduce either to the transfer matrix or dynamic programming approach. In addition to the recently proposed transfer matrix formalism for TF access to the nucleosomal organized DNA, we develop here a dynamic programming algorithm that accounts for this feature. In the absence of nucleosomes, dynamic programming outperforms the transfer matrix approach, but the latter is faster when nucleosome unwrapping has to be considered. Strategies are discussed that could further facilitate calculations to allow computing genome-wide TF binding maps.
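The dynamic programming idea on a one-dimensional lattice can be sketched for the simplest case: a single protein species with a fixed footprint and no nucleosomes or unwrapping. This is a simplified illustration under those assumptions, not the algorithm developed in the article; the function `binding_map` and the weights are hypothetical, with each weight standing for concentration times sequence-dependent affinity.

```python
def binding_map(weights, footprint):
    """Return (Z, start_prob) for one protein species on a 1D lattice.

    weights[i] is the statistical weight of a protein whose footprint starts
    at lattice unit i; footprint is the number of units one protein covers.
    Bound proteins may not overlap.
    """
    n = len(weights)
    # forward[i]: partition function of the first i lattice units.
    forward = [1.0] * (n + 1)
    for i in range(1, n + 1):
        forward[i] = forward[i - 1]  # unit i-1 is left free
        if i >= footprint:
            # a protein ends at unit i-1, so it starts at unit i-footprint
            forward[i] += weights[i - footprint] * forward[i - footprint]
    # backward[i]: partition function of units i..n-1.
    backward = [1.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        backward[i] = backward[i + 1]  # unit i is left free
        if i + footprint <= n:
            backward[i] += weights[i] * backward[i + footprint]
    z = forward[n]
    # Probability that a protein is bound with its footprint starting at i.
    start_prob = [
        forward[i] * weights[i] * backward[i + footprint] / z
        if i + footprint <= n else 0.0
        for i in range(n)
    ]
    return z, start_prob

# Toy lattice of four units and a protein covering two units.
z, start_prob = binding_map([1.0, 2.0, 0.5, 1.0], footprint=2)
# Z = 5.0 here: the empty lattice, three single-protein states and one
# two-protein state (starts at units 0 and 2).
```

Both passes are linear in the lattice length, in contrast to enumerating every configuration, which grows exponentially with DNA length. Competition between several protein species and histone octamers would add further terms to each recursion.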
Comparative genomics approach to detecting split-coding regions in a low-coverage genome: lessons from the chimaera Callorhinchus milii (Holocephali, Chondrichthyes)
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
The recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as a test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references.
Ortholog identification in the presence of domain architecture rearrangement
Posted by Waleed Ghalwash in Oxford journals on August 18th, 2011
Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area.
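One widely used pairwise approach is the reciprocal best hit criterion, sketched below. The abstract does not name a specific method, so this is offered only as an assumed example of pairwise comparison between two genomes; the function `reciprocal_best_hits`, the gene identifiers and the scores are hypothetical.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs that are each other's best-scoring hit.

    scores_ab[a][b] is the similarity score of query a (genome A) against
    subject b (genome B); scores_ba holds the reverse search.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Invented scores: a2 shares a domain with b1, so its best hit is b1,
# but b1's best hit is a1 and a2 is left without a reciprocal partner.
scores_ab = {
    "a1": {"b1": 250.0, "b2": 40.0},
    "a2": {"b1": 90.0, "b2": 80.0},
}
scores_ba = {
    "b1": {"a1": 245.0, "a2": 88.0},
    "b2": {"a2": 79.0, "a1": 41.0},
}
pairs = reciprocal_best_hits(scores_ab, scores_ba)  # [('a1', 'b1')]
```

The toy input shows one way in which a shared domain can mislead a whole-sequence score: the hit between a2 and b1 outranks the a2 to b2 comparison, so the a2 and b2 pair is never reported.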
