Since the development of genome sequencing, a high potential was seen in the benefits of a better understanding of the molecular mechanisms underlying life. Through the technological advances not only more genomic data are produced, but also new scientific questions can be addressed, regarding the relationship of individuals within larger populations. Understanding both the variation between individuals (diversity) and the change in populations over time (dynamics) are essential to fully comprehend biological systems.
In the joint graduate program of Bielefeld University and Simon Fraser University, we will train young academics to be able to develop methods of high importance for the practical analysis of genome diversity and dynamics. This encompasses research in five main areas:
Benedikt G. Brink (started December 2013)
Supervisors: Tim W. Nattkemper (Bielefeld University), Ryan R. Brinkman (BC Cancer Agency), Ralf Hofestädt (Bielefeld University)
Benedikt Löwes (started October 2013)
Viral interactions are essential for the infection of host organisms and the subsequent replication of the viral genome in the host cell. These interactions are often based on specific RNA motifs shared between different evolutionary distant viruses. We observe that convergent evolution is a potential mechanism that explains similar motifs in phylogenetically distant viruses that infect common hosts by interacting with their cellular components. This is supported by the fact that for those specific RNA motifs similar selection criteria prevail. In this regard, the Hammerhead ribozyme is a well-studied example.
We focus on identifying new functional RNAs from viral genomes based on structural agreement of the RNA secondary structure, in order to find new examples of viral interactions with their host cells which are essential for the infection of the host and the replication of the virus. Previous approaches for identifying viral ncRNAs often strongly relied on sequence homology as well as pre-annotated RNA families in databases. Our approach, on the other hand, for the detection of the aforementioned convergent evolution uses conservation and convergence on the secondary structure level as primary information and the primary sequence only as additional source of information. This is based on the assumption that agreement on the level of secondary structure is the primary criterion for different viral RNA molecules to interact in a similar fashion with cellular components.
After identifying clusters of similar RNA elements by either only RNA structure matching, or a fast seed-based approach that takes structure and part of the sequence into account, or hierarchical RNA matching that uses an abstraction of the secondary structure, we incorporate phylogenetic information to ensure that the candidates that form a cluster with similar secondary structure stem from phylogenetically distant viruses. The most promising candidates of convergent evolution can be used as evidence to show that RNAs from phylogenetically distant viruses “look and behave” the same way, but evolved completely independent from one another by a subsequent manual analysis in the laboratory.
Supervisors: Robert Giegerich (Bielefeld University) and Peter Unrau (Simon Fraser University)
Georges Hattab (started October 2014)
Bioimage Informatics is an ever growing field in Bioinformatics for high throughput, high fidelity quantification and sub-cellular localisation. Yet, each of these directions raises a diverse set of questions. Such questions range from but are not limited to the automation of processes and standardisation in the laboratory to the bottleneck of understanding the biological data.
In this project, novel algorithms and depictions were sought for the particular case of bacterial cell growth in microfluidic chambers. The understanding of such time-lapse microscopy data is rendered possible by coupling computer vision, bioinformatics and information visualisation.
Moreover, the relevant information pertains to both the spatial and temporal domains in image sequences. This encompasses addressing both biological (e.g. targeted gene expression using gene knock-out) and informatics questions (e.g. object tracking) enabling us to handle such data quantitatively and qualitatively which in turn can be subject to a posterior statistical analysis.
Supervisors: Tim W. Nattkemper (Bielefeld University, Bielefeld, Germany), Tamara Munzner (University of British Columbia, Vancouver, Canada).
Guillaume Holley (started October 2013)
The advent of High Throughput Sequencing (HTS) technologies raises a major concern about storage and indexing of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pan-genomes.
In the first and second year of this project, an alignment-free and reference-free data structure for indexing a pan-genome has been developed. A pan-genome is represented as a colored de Bruijn graph in which vertices represent colored k-mers, words of length k associated with the identity of the strains in which they occur. For storing and traversing such a graph, we proposed the Bloom Filter Trie (BFT), a trie that compresses and indexes strings of fixed-length k and their colors. The BFT is based on the burst trie and integrates Bloom filters to navigate in the trie and accelerate the graph traversal. The specificity of our work compared to existing pan-genome tools resides in four main characteristics:
Experimental results prove better performance compared to another state-of-the-art data structure. An API written in C is provided to access all functionalities of the the data structure as well as colored de Bruijn graph operations such that methods can be designed on top of the BFT. The BFT source code and documentation are available here.
The data structure was presented at the 15th International Workshop on Algorithms in Bioinformatics in Atlanta, USA. An extended version of the paper was published in the journal Algorithms for Molecular Biology in 2016.
Pan-genomes are also ideal for compression. However, HTS-specific compression tools, developed to counter the massive growth of sequencing data and reduce storage costs, are not designed to process them. These methods can only compress genomes individually such that the compression ratio of a pan-genome does not improve with newly compressed genomes. During the third year of this project, a new alignment-free and reference-free compression method has been developed. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update archives with new genome sequences without full decompression of the archive. This method can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, this method outperforms all other tested tools with a 30 % compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific compression method in our experiments.
Supervisors: Jens Stoye (Bielefeld University), Roland Wittler (Bielefeld University), S. Cenk Sahinalp (Simon Fraser University), Faraz Hach (Simon Fraser University)
Image source : “Ralstonia syzygii, the Blood Disease Bacterium and Some Asian R. solanacearum Strains Form a Single Genomic Species Despite Divergent Lifestyles” by Remenant B and al.
Jan Kölling (started August 2013, affiliated to DiDy since November 2013)
YU Jia (started October 2014)
NGS(next generation sequence) technology let researchers now have the potential to analyze multiple genomes of several related bacterial strains. EDGAR (Efficient Database framework for comparative Genome Analysis using BLAST score Ratios) is a software platform for bacterial genome analysis maintained by Justus-Liebig-University Giessen. It encloses all the NCBI publicly available bacterial genomes and uses statistical approach together with sequencing alignment to provide users othologous analysis. It also provides multiple comparative genomic visualization features.
As NGS technology developing, more genomes are released and EDGAR itself met its processing bottleneck. In order to keep its popular features yet to have a high performance back-end, we proposed three improvable aspects. On the algorithmic level, we test and benchmark the available alignment algorithms and applications to eventually replace the conventional BLAST sequencing alignment. On the computational level, we use cutting edge parallel computation architecture to provide EDGAR a scalable, fault-tolerant and efficient infrastructure. On the database level, a NoSQL database model with its compatible data schema will be used to enhance the data I/O.
We first circled 2 most recent published sequence alignment applications, GHOSTX and DIAMOND to compare to BLAST. Both claim that they are thousands of times faster than BLAST. After carefully examined their alignment results, we found that GHOSTX cannot provide required sensitivity and its result was missing a significant amount of alignment results and thus was not compatible with the statistical orthologous analysis of EDGAR. DIAMOND on the other hand, is sensitive enough to provide results really identical to BLAST. It also fits to EDGAR's analysis. The benchmark of the three alignment applications indicates that DIAMOND has the best performance when dealing with the same dataset in the same environment. Therefore, we chose DIAMOND as the alternative for BLAST in EDGAR.
We then worked on the parallelization of DIAMOND as it is the second aspect of the EDGAR enhancement. We developed a new application, HAMOND. It integrates Apache Hadoop parallelism with DIAMOND to provide scalability, fault-tolerance and high-speed features. EDGAR-HAMOND integration showed to be significantly faster and high identity to previous results. HAMOND is also compatible with any in-house Hadoop cluster and Amazon cloud platform. The development of HAMOND is finished and the publication is ready to be submit.
The next step would be the database model change. Since we are inclined to utilize a NoSQL database, the current MySQL data schema of EDGAR will be re-designed.
Supervisors: Alexander Goesmann (Justus Liebig University Giessen), Alexander Sczyrba (Bielefeld University), Fiona Brinkman (Simon Fraser University)
Karsten Wüllems (started October 2016)
In the recent years spatial metabolomics became very popular through the technological advances in imaging mass spectrometry (IMS). One of the most used techniques is the matrix-assisted laser desorption/ionisation-time of flight imaging mass spectrometry (MALDI-TOF IMS). MALDI-TOF IMS has various applications like study of spatial distribution of biomolecules and discovery of new biomarkers. By combining this technique with other imaging technologies, information can be gathered that is not accessible in stand-alone utilization.
In this project we aim for three goals. First is the utilization of new analysis methods for MALDI IMS data based on data mining and image processing. Second we want to investigate different multimodal approaches by combining IMS with other imaging techniques. Third is to develop visualizations for acquired data to make those accessible for researchers of different fields, especially for those who are not directly related to bioinformatics or imaging mass spectromtry. By achieving these goals we plan to reveal as much information hidden inside the IMS data as possible and to transform the raw data into useful and valuable information for researches from different fields, especially from the area of biology, chemistry and medicine.
Supervisors: Tim W. Nattkemper (University Bielefeld), Karsten Niehaus (University Bielefeld)
Konstantinos Tzanakis(started October 2016)
Recent improvements in technologies and experimental methods have led to rapidly increased amounts of data in all areas of life science. Also in the field of metabolomics, in which mass spectrometry is the key technology to investigate the metabolites that are abundant in an organism or a tissue, state-of-the-art methods allow to gain ten thousands of mass spectra within a few minutes which in turn characterize hundreds of potential compounds. Nowadays, more and more, the processing of these large amounts of data has emerged as a bottleneck and requires new ways of data handling, processing and interpretation.
This project aims at the development and evaluation of novel methods for the storage and analysis of mass spectrometry-based metabolomics data based on so called Big Data frameworks which allow for the distributed processing of large data sets across clusters of computers in a scalable manner. Examples for this are the Apache Hadoop software libraries and Apache Spark as a fast and general engine for large-scale data processing. Goal of the project is to provide users a bioinformatics platform for the efficient and user-friendly handling of own experimental data. Key objectives of the platform are, at first, the automated processing of large numbers of chromatographic datasets in terms of untargeted metabolite profiling, quantification and de novo identification, and at second, their integration with ither omics data to finally enable a target oriented and time efficient interpretation of the data.
Supervisors: Stefan P. Albaum (Bielefeld University), Tim W. Nattkemper (Bielefeld University), Karsten Niehaus (Bielefeld University)
Linda Sundermann (affiliated, project started November 2014)
Cancer samples are often genetically heterogeneous, harboring subclonal populations (subpopulations) with different mutations such as copy number variations (CNVs) or simple somatic mutations (SSMs; i.e., single nucleotide variants and small insertions and deletions). Information about such mutations in the subpopulations can help to identify driver mutations or to choose targeted therapies. Sequencing of bulk tumor samples is current standard practice because singlecell assays are yet not well established due to high cost and limited resolution.
Recently, several methods that attempt to infer the genotype of subpopulations using CNVs, SSMs, or both have been published. Here, we present Onctopus,a new approach to jointly model and reconstruct the subclonal composition of a bulk tumor sample utilizing SSMs and CNVs.
Given variant counts of SSMs and heterozygous germline SNPs, as well as information about the position and read depth of segments affected by CNVs, Onctopus assigns a frequency, CNVs and SSMs to Nsubclonal lineages (sublineages). Each of these lineages is defined through the CNVs and SSMs that arose in this lineage. SNPs, which are needed to identify copy number changes in sublineages, are assigned to the normal lineage. If SSMs or SNPs are influenced by CNVs, our model can phase them relative to the chromosome copy where the CNV occurs independently of known haplotype blocks.
We build a joint likelihood model and model the tumor as consisting of a mixture of lineages on which we infer a partial order. We choose sublineages to avoid ambiguous solutions that can occur when copy numbers are determined for subpopulations. We developed a linear relaxation of our model as a mixed integer linear program that can be solved with stateoftheart solvers.
Supervisors: Jens Stoye (Bielefeld University)
Project co-supervised by: Gunnar Rätsch (ETH Zurich), Quaid Morris (University of Toronto)
HUANG Liren (started October 2014)
The increasing amount of next-generation sequencing data poses a fundamental challenge for genomic analytics. Addressing this issue requires solutions for both hardware and software. Cloud infrastructure and platform services (IaaS and PaaS) have been well established in informatics discipline. To be compatible on the cluster, methods involving message passing and data aggregation between computers must be re-implement with higher level programming interface. In this project, we will present a Spark and Hadoop based bioinformatics framework, Sparkhit, that should be easy to use on local server or in the cloud. We want to implemented a variety of analytical tools and methods for different types of genomic NGS data. These methods will be programmed in MapReduce model, where parallelization will be optimized and supervised (fault tolerance).
Supervisors: Alexander Sczyrba (Bielefeld University), Alexander Goesmann (Giessen University), Colin C. Collins (Vancouver Prostate Centre)
Big improvements in classification algorithms in recent years not only brought new application possibilities but also lead to increased feature selection performance. In the context of biomedical data analysis, feature selection or feature relevance weighting constitutes a particularly important part for model interpretation, since it offers possible candidates for biomarkers. While such methods are capable of reliable extracting clear biomarkers, they are less reliable when it comes to subtle biomarkers in particular for highly correlated input data.
In this project we develop novel interactive methods which enable the computation of relevance bounds which highlight especially subtle or weakly relevant features. One goal is the ability for the researcher to incorporate hyphotheses into the feature weighting, which allows iterative refinement of the output.
Supervisors: Barbara Hammer (Bielefeld University)
Lu Zhu (started October 2014)
Meaningful data and methods to reliably and systematically study protein subcellular localization (SCL) and exploration of network of protein-protein interactions (PPI) have long been the goals of functional biology and the post genomic era for understanding the mechanistic hypotheses of a cell. Each type of cellular compartment provides a unique physiological environment within which specialized functions are carried out.To interact, proteins (or any other molecules) must necessarily share a common SCL, or an interface between physically adjacent SCLs, transiently or conditionally. Several PPI based SCL prediction methods have been developed using different approaches and data sources. When predicting the SCL of a given protein, not only the protein itself, but also all its interacting partners should be considered. Predicting multiple localizations is always a challenge. Some of the existing method made some process. All of the existing methods are designed to predict all the localizations the same time and use the same parameters for maximizing the global prediction performance. This leads to the poor prediction accuracy for certain SCLs. In our method apply MRF analysis to PPI network for protein SCL assignment problem. Each protein SCL is individually predicted using a MRF model with suitable parameter configuration corresponding to this SCL. We construct the MRF models, to predict if an unknown protein locates in a particular SCL, we consider (1) the prior probability of any protein locates in this SCL, (2) the number of interacting neighbors located in the same SCL, (3) the number of interacting neighbors located in the adjacent SCLs, and (4) the sequence-based features of the protein. A value of the probability of a protein locates in a certain SCL is assigned for every protein and every SCL.
Supervisors: Ralf Hofestädt (Bielefeld University), Martin Ester (Simon Fraser University)
Madis Rumming (started November 2013)
Comparing different Metagenomic studies is a very challenging task. No comparable results are yet to be expected or automatically computed.
Various pipelines are applied to meta genomic sample sets and therefore the results also vary in many different ways. The use of interpretable project comparison rules and the interpretation of those comparisons itself are not sufficiently investigated.
The main goal of this project is to find a way for automatic metadata acquisition, processing the heterogenous data, further data analysis with up-to-date Data Mining techniques and to generate reasonable semantics on the data.
Supervisors: Alexander Sczyrba (Bielefeld University), Barbara Hammer (Bielefeld University)
Insights into large-scale omics data require computational methods which reliably and efficiently recognize latent structures and patterns. A particular example is given by metagenomics which is confronted with the problem of detecting clusters representing involved species in a given sample (binning). Albeit powerful technologies exist for the identification of known taxa, de-novo binning is still in it's infancy. Similarly, in single-cell genome assemblies, a major problem is given by sample contamination with foreign genetic material. From a computational point of view, in both metagenomics and single-cell genome assemblies, genomes can be represented as a clusters. Hence, grouping and cluster validity measures can be employed to detect and separate genomes. In my research project, I adapt, compare and evaluate novel machine learning techniques in the context of such data.
Typically, genomic data can be represented in a high-dimensional vector space, making it inherently difficult for classical learning techniques as they suffer from the curse of dimensionality. Thus, it is crucial to use subspace projections or strong regularization. Additionally, the number of clusters is unknown and clustering results heavily depend on the chosen set of parameters which can be large. In order to yield valuable results, it is also beneficial to incorporate biological side information. This is accompanied by the fact that such data sets are often large, making standard algorithms inapplicable because of quadratic or worse runtime or memory complexity.
Supervisors: Barbara Hammer (Bielefeld University)
Nicole Althermeler (started October 2014)
Ancestral processes are driven by biological forces such as mutation, recombination, natural selection and random transmission of genetic material. These forces are largely of random nature, thus statistical models support inferring and predicting genetic patterns created by them. A collection of statistical models is offered by coalescent theory, which describe the demographic history of species. The n-coalescent by Kingman (1982) marks a milestone in the field (Reference). It is a tree-structure that describes the history of a sample of individuals from one species. The corresponding mathematical process describes the neutral case, in which no selection is present. A graph structure that does take selection into account is the ancestral selection graph [@ASG]. Here, one individual and all its potential ancestors are described in one graph. This graph can be constructed by random poisson processes and models not only different reproduction and mutation rates, but also the effects of selection.
Several important question arise in this field of research, for instance: Given an initial frequency x of a beneficial type in a population in the distance past, what is the probability that the ancestor of today's population had the beneficial type?
It is possible to analytically derive these probabilities for simpler, restricted cases (Reference). However, as studied scenarios become more complicated, analytical results become nearly impossible. Here, a solution are simulations of ancestral processes via Markov chains. These should result to close approximations of the underlying probabilities.
The project is devoted to the simulation and thorough understanding of random genealogies and ancestral lineages under various models of mutation and selection. Particular emphasis will be on the single ancestral line that leads back into the distant past and also connects with other species, thus building the bridge to phylogenetics, the evolutionary relationship of species. This may, for example, help to understand the large discrepancy between mutation rates estimated from species trees and those estimated from population samples.
Supervisors: Ellen Baake (Bielefeld University), Cedric Chauve (Simon Fraser University), Leonid Chindelevitch (Simon Fraser University)
Nina Luhmann (October 2013 - December 2016)
Reconstructing ancestral genomes is a long-standing computational biology problem with important applications to large-scale sequencing projects. Informally, the problem can be defined as follows: Given a phylogenetic tree representing the evolutionary history leading to a set of extant genomes, we want to reconstruct the structure of the ancestral genomes corresponding to the internal nodes of the phylogeny. Global approaches simultaneously reconstruct ancestral gene orders at all internal nodes of the considered phylogeny, generally based on a parsimony criterion. However while complex rearrangement models can give insights into underlying evolutionary mechanisms, from a computational point of view, this problem is NP-hard for most rearrangement distances.
Besides the phylogeny and the extant genome sequences, a third source of data for reconstruction became available recently. Due to the progress in sequencing technologies, ancient DNA (aDNA) found in conserved remains can be sequenced. One example is the genome of the ancestor of Yersinia pestis strains that is understood to be the cause of the Black Death pandemic. However, enviromental conditions influence sources for such paleogenomes and result in degradation and fragmentation of DNA molecules over time. This entails the reconstruction of the genome to be specifically challenging and leads only to a fragmented solution requiring additional scaffolding.
The goal of this project is to integrate new ancient sequencing information in the reconstruction of ancestral genomes. The comparison with extant genomes in the phylogeny can scaffold the fragmented assembly of the aDNA data while offering a lot of questions regarding the modeling of the genomes and the rearrangement model applied. This project aims to develop algorithms addressing these problems in reconstruction and scaffolding with the focus on the fragmented ancient assembly. We developed an extension of the exact algorithm to reconstruct genomes under the Single-Cut-or-Join rearrangement distance. It includes ancient DNA sequencing information in the reconstruction of ancestral genomes and also scaffolds the fragmented aDNA assembly while minimizing the SCJ distance in the tree. We then generalized this result in an approach combining the evolution under the SCJ model with prior information of the genome structure at internal nodes of the tree, e.g. derived from the available aDNA data. We further addressed the problem of closing gaps between assembled contigs in ancient Yersinia pestis genomes, taking advantage of related reference sequences. These are first steps towards an integrated phylogenetic assembly of paleogenomes.
Supervisors: Cedric Chauve (Simon Fraser University), Jens Stoye (Bielefeld University), Roland Wittler (Bielefeld University)
Robert Müller (started October 2016)
The advent of high-throughput sequencing (HTS) technologies has drastically changed how research is conducted in the life sciences. Clustering is a common task in many bioinformatics pipelines for the resulting massive data sets comprising millions of sequences. Intense research on related problems such as similarity joins and approximate dictionary matching led to a variety of tools, most of whom focus on speeding up the computation in order to handle the continuously growing data sets. However, they usually fall under the restriction that all data (structures) have to be held in the faster, but relatively small and expensive main memory (compared to disk space). While today even small laboratories can produce data on HTS-scale, they very often do not have the technical or financial capabilities to constantly upgrade their computational infrastructure in order to process these data sets. As cloud computing is not always an option, the goal of this project is to facilitate the processing of HTS-scale data sets on current lab infrastructures by augmenting state-of-the-art clustering tools with succinct data structures.
More specifically, we aim at examining the effect of introducing succinct data structures into clustering tools on their runtime and space consumption. Since succinct data structures often hide their dominant costs in lower-order terms and constants, we are particularly interested in the trade-off between space and runtime. We will analyse the effect of relaxing the space restriction of succinct data structures and study how much additional space we have to spend on top of the information-theoretical minimum in order to obtain practically superior tools. To that end, existing data structures have to be adapted or fine-tuned, and, potentially, new advanced data structures have to be developed. While the focus will be on tools working with the commonly accepted edit distance, we will also consider alternative, biologically motivated notions of similarity. Furthermore, we will investigate the accessibility of the used succinct data structures to practical optimisations such as parallelisation (broadword programming, GPU computations, threading etc.).
Supervisor: Markus Nebel (Bielefeld University)
Tina Zekic (started November 2014)
With the increasing number of available genome sequences, the concept of a pan-genome arised, representing a set of genomes belonging to different strains but the same species. A pan-genome is composed of three parts, a core genome, representing sequences shared among all strains of a species, a dispensable genome, containing sequences shared among a subset of strains and a singleton genome, representing strain-specific sequences. Recently, a data structure for a memory efficient storage of a pan-genome has been developed. The so-called Bloom Filter Trie (BFT) stores a pan-genome as a colored de Bruijn graph, by storing all kmers of the input sequences and the set of genomes they originate from.
The goal of this project is to extend this data structure by methods for the functional analysis of a pan-genome. The first step is the identification of the core genome. Matching sequences can be interrupted by SNPs or other structural variations, resulting in branching nodes and additional paths in the graph. Therefore, we first introduce a formal definition of a core genome in a de Bruijn graph.
For a set of input genomes, we consider the core as being composed of sequences shared by at least a required number of genomes, called the quorum and denoted by q. In order to model variations and include them in the core genome, we introduce the concept of a core bubble and its generalization, a super core bubble. A distance d is used to limit the maximal distance allowed between two core paths, where a core path is a path containing only core nodes. Besides bubbles, we introduce the definition of connected and disconnected core paths in order to cover the remaining cases.
Supervisors: Jens Stoye (Bielefeld University), Roland Wittler (Bielefeld University), Faraz Hach (Simon Fraser University)