Genome assembly is a critical process in bioinformatics, where short DNA sequences (reads) are pieced together to reconstruct an organism's genome. As sequencing technologies generate vast amounts of data, the need for effective computational methods to assemble these reads has grown. Two prominent graph theory-based approaches—the overlap graph and the Hamiltonian path approaches—offer different strategies for this task. This study focuses on the construction and decoding aspects of these graph-based methods, providing a comparative analysis of their effectiveness in genome assembly. This research will explore the intricacies of constructing and decoding these graphs, examining how each approach handles challenges such as repetitive sequences, sequencing errors, and varying read lengths. The construction phase will be analysed in terms of computational efficiency, focusing on the algorithms used to build the graphs and the preprocessing required to manage large datasets. The decoding phase will be evaluated based on the accuracy of the assembled genome, considering factors like contiguity (N50), error rates, and the ability to resolve complex genomic regions. A key aspect of this research is the comparison of the decoding strategies used in both approaches. For the overlap graph approach, the focus will be on greedy algorithms that iteratively connect reads with the best overlaps. The Hamiltonian path approach, on the other hand, will be examined through the lens of heuristic and approximation algorithms designed to tackle its inherent computational complexity. This research will be supported by practical experiments using real-world sequencing data, allowing for a detailed evaluation of how each method performs under different conditions. The research will also consider the scalability of these approaches, particularly in the context of emerging sequencing technologies that produce longer and more accurate reads. Ultimately, this thesis aims to provide a clear understanding of the trade-offs involved in the construction and decoding of overlap graphs and Hamiltonian paths for genome assembly. By focusing on the graph theory aspects of these methods, the research will offer insights into their strengths and limitations, guiding the selection of the most appropriate approach for different genomic challenges. The findings will contribute to the ongoing development of more efficient and accurate genome assembly techniques, with potential applications in a wide range of biological and medical research.
Genome assembly is a critical process in bioinformatics, aimed at reconstructing the complete genome from short DNA sequences (reads) produced by sequencing technologies. As sequencing technologies have advanced, the volume of data has significantly increased, necessitating effective computational methods for genome assembly. Two prominent graph- based approaches in this field are the overlap graph approach and the Hamiltonian path approach. This literature review focuses on the construction and decoding aspects of these methods, providing a comparative analysis of their effectiveness. The overlap graph approach has been widely utilized in genome assembly, particularly with the advent of next-generation sequencing technologies. In this method, each read is represented as a vertex in a graph, with edges connecting vertices that exhibit significant overlaps. The primary goal is to decode the graph by finding a path that maximizes these overlaps to reconstruct the genome sequence. This approach gained prominence with the development of the de Bruijn graph by Pevzner, Tang, and Waterman (2001) [1], which simplified the assembly process by breaking down reads into k-mers and connecting them based on overlaps. This method proved effective for bacterial and viral genomes but has faced challenges with more complex genomes due to repetitive sequences and ambiguous overlaps.
Enhancements to the overlap graph approach have focused on improving computational efficiency and handling repetitive sequences. Myers (2005) [2] introduced the string graph, which simplifies the overlap graph by removing redundant edges, thereby reducing complexity and improving scalability. Despite these advancements, the overlap graph approach still struggles with repetitive sequences and computational efficiency, particularly when dealing with large and complex genomes. Recent advances in sequencing technologies, such as long- read sequencing, have mitigated some of these challenges but have not completely resolved the limitations of the overlap graph approach.
In contrast, the Hamiltonian path approach represents a more theoretically powerful method but is less commonly used due to its computational complexity. This approach constructs a graph where each read is a vertex, and the goal is to find a Hamiltonian path that visits each vertex exactly once. While this can theoretically provide a highly accurate assembly by ensuring that each read is uniquely placed, the Hamiltonian path problem is NP-hard, making it computationally intensive for large datasets (Garey & Johnson, 1979) [3]. Early work in applying this approach to genome assembly was limited by high computational costs. However, theoretical studies have shown that if a Hamiltonian path can be found, it results in a highly accurate assembly. To address the computational challenges of the Hamiltonian path approach, several heuristic and approximation algorithms have been developed. Pevzner (1989)[4] proposed a greedy algorithm for constructing a Hamiltonian path by iteratively adding edges that extend the current path. More recent research has focused on hybrid approaches that combine elements of both the overlap graph and Hamiltonian path methods to improve accuracy and efficiency. With the advent of long-read sequencing technologies, there has been renewed interest in the Hamiltonian path approach, as longer reads can reduce the complexity of finding a Hamiltonian path by providing more unique overlaps (Chaisson & Tesler, 2012) [5] .
The comparative analysis of these approaches highlights their respective strengths and limitations. The overlap graph approach tends to be more computationally efficient and practical for a wide range of genomes, particularly with the use of optimized algorithms like the string graph. However, it can result in fragmented assemblies in genomes with high repeat content. Conversely, the Hamiltonian path approach offers potential advantages in terms of assembly contiguity but is constrained by its computational complexity and the difficulty of solving the NP-hard problem for large datasets (Koren et al., 2017) [6] . Hybrid strategies that leverage the strengths of both approaches may provide a more robust solution; particularly as sequencing technologies continue to advance (Nagarajan & Pop, 2013) [7]. Overlap graphs have been used extensively in genome sequencing, with applications in both prokaryotic and eukaryotic genomes. In prokaryotic genomes, overlap graph assembly is often used to generate high-quality, closed genomes from short-read sequencing data (Jain, M., Olsen, Akeson, M. (2016) [8]. In eukaryotic genomes, long-read sequencing technologies such as PacBio and
Oxford Nanopore have enabled the construction of more complex overlap graphs that can span repetitive regions and structural variations (Rhoads, A., & Au, K. F. (2015) [9].