Separating repeats in DNA sequence assembly

Research output: Contribution to conferencePaperpeer-review

19 Scopus citations


One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear-programming. Experiments with an implementation show we can separate fragments that are over laid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.

Original languageEnglish (US)
Number of pages8
StatePublished - 2001
Event5th Annual Internatinal Conference on Computational Biology - Montreal, Que., Canada
Duration: May 22 2001May 26 2001


Other5th Annual Internatinal Conference on Computational Biology
CityMontreal, Que.


  • Computational biology
  • Disambiguating repeats
  • Shotgun sequencing
  • k-median problem

ASJC Scopus subject areas

  • Computer Science(all)
  • Biochemistry, Genetics and Molecular Biology(all)


Dive into the research topics of 'Separating repeats in DNA sequence assembly'. Together they form a unique fingerprint.

Cite this