MaSuRCA, Genome assembler
The University of Maryland Assembly Group aims to create the best possible software for whole-genome assembly. We develop the MaSuRCA genome assembler, which can be used on assembly projects of all sizes, from bacterial genomes to mammalian genomes to large plant genomes. MaSuRCA has been used to assemble a variety of genomes de novo, sometimes improving on published genomes using added data, sometimes creating the first publicly available draft genome for the species.
Super-Reads
The super-reads technique aims at improving genome assembly by
replacing many short reads with longer sequences, without losing any
information.
A super-read is an extension of a sequencing
read. Replacing reads by super-reads will improve many kinds of
assemblies. While our assembler MaSuRCA
uses super-reads, many other applications of super-reads are
possible. Our software "masurca-superreads", part of the MaSuRCA
distribution, converts Illumina paired-end reads into
super-reads. Super-reads satisfy the following properties:
- Each of the original reads is contained in a super-read.
- Many of the original reads yield the same super-read, so using super-reads leads to a vastly reduced dataset.
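The two properties above can be illustrated with a toy sketch in Python. It is not the masurca-superreads implementation: it assumes a simple rule (extend a read one base at a time while exactly one k-mer from the read set continues it), treats reads as error-free and single-stranded, and ignores paired-end information. The names `build_kmers`, `extend_right`, `extend_left`, and `super_read`, the toy genome, and the choice k = 5 are all hypothetical.

```python
def build_kmers(reads, k):
    """Collect every k-mer that occurs in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def extend_right(seq, kmers, k):
    """Append bases while exactly one k-mer in the table continues seq."""
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        suffix = seq[-(k - 1):]
        nxt = [suffix + b for b in "ACGT" if suffix + b in kmers]
        # Stop at a dead end, at a branch, or when we would loop.
        if len(nxt) != 1 or nxt[0] in seen:
            return seq
        seen.add(nxt[0])
        seq += nxt[0][-1]

def extend_left(seq, kmers, k):
    """Prepend bases while exactly one k-mer in the table precedes seq."""
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        prefix = seq[:k - 1]
        prev = [b + prefix for b in "ACGT" if b + prefix in kmers]
        if len(prev) != 1 or prev[0] in seen:
            return seq
        seen.add(prev[0])
        seq = prev[0][0] + seq

def super_read(read, kmers, k):
    """Extend a read in both directions for as long as the extension is unique."""
    return extend_left(extend_right(read, kmers, k), kmers, k)

# Toy data: a 19-base "genome" with no repeated 4-mers, sampled as
# overlapping 8-base reads starting at every position.
genome = "ATGGCGTACCTTAGCAAGT"
k = 5
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
kmers = build_kmers(reads, k)
super_reads = [super_read(r, kmers, k) for r in reads]

print(len(reads), "reads ->", len(set(super_reads)), "distinct super-read(s)")
```

In this toy case every read is a substring of its super-read, and all reads collapse to one sequence because the toy genome has no repeats. On real data, extension stops wherever a repeat or sequencing error makes the next base ambiguous, so many distinct super-reads remain.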
Super-reads can be used for large and small projects:
- Our assembly of the Loblolly pine genome began by replacing 15 billion reads with 150 million super-reads. The reads averaged 130 bases and the super-reads averaged 362 bases.
- In a comparison of 8 genome assemblers on 12 bacterial genomes, MaSuRCA scored first on 10 of the 12. MaSuRCA (Maryland Super Read Celera Assembler) is based on super-reads.
- Super-reads can help other assemblers. MaSuRCA can use SOAPdenovo, as an alternative to Celera Assembler, with super-reads to get better assemblies.
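To put the Loblolly pine figures quoted above in perspective, the short calculation below estimates how much the dataset shrank. It uses only the numbers given in the list; because the read and super-read lengths are averages, the base totals are approximations, and the variable names are illustrative.

```python
# Figures quoted for the Loblolly pine assembly.
reads, read_len = 15_000_000_000, 130            # original reads, average length
super_reads_n, super_len = 150_000_000, 362      # super-reads, average length

# Reduction in the number of sequences handed to the assembler.
count_reduction = reads / super_reads_n

# Approximate reduction in total bases (average length times count).
read_bases = reads * read_len
super_bases = super_reads_n * super_len
base_reduction = read_bases / super_bases

print(f"sequence count reduced {count_reduction:.0f}-fold")
print(f"total bases reduced about {base_reduction:.1f}-fold")
```

Under these figures the number of sequences drops 100-fold, while the total number of bases drops roughly 36-fold, since each super-read is longer than the reads it replaces.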