Abstract
One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear-programming. Experiments with an implementation show we can separate fragments that are over laid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
Original language | English (US) |
---|---|
Pages | 176-183 |
Number of pages | 8 |
DOIs | |
State | Published - 2001 |
Event | 5th Annual Internatinal Conference on Computational Biology - Montreal, Que., Canada Duration: May 22 2001 → May 26 2001 |
Other
Other | 5th Annual Internatinal Conference on Computational Biology |
---|---|
Country/Territory | Canada |
City | Montreal, Que. |
Period | 5/22/01 → 5/26/01 |
Keywords
- Computational biology
- Disambiguating repeats
- Shotgun sequencing
- k-median problem
ASJC Scopus subject areas
- General Computer Science
- General Biochemistry, Genetics and Molecular Biology