The latest version of MaSuRCA is available here.

Compiling

The MaSuRCA assembler is written in C++ and Perl. It is developed and tested on x86_64 Linux systems. It might work on other UNIX-like systems, but it is not well tested there. The following software is required (all current major Linux distributions include it, but it may need to be installed with the built-in package manager):

  • GNU C++ compiler g++ version 4.7 or higher.
  • GNU make.
  • Perl version 5.8 or higher.
  • Development files for the bz2 library (usually packaged as libbz2-dev or libbz2-devel).
  • Perl Statistics::Descriptive library.
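
The list above can be checked before running the installer. The following is a minimal sketch, not part of the MaSuRCA distribution: the function names (version_ge, report, check_prerequisites) are made up for illustration, the version comparison assumes GNU sort with the -V option, and the bz2 check assumes the development package provides the header bzlib.h.

```shell
#!/bin/sh
# Sketch: report which build prerequisites appear to be present.
# The helpers below are illustrative, not part of the MaSuRCA distribution.

# version_ge A B: succeed when dotted version A is at least B.
# Relies on "sort -V", a GNU coreutils extension.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# report NAME STATUS: print one aligned line of the summary.
report() {
    printf '%-34s %s\n' "$1" "$2"
}

check_prerequisites() {
    # g++ 4.7 or higher; -dumpversion prints the compiler version.
    if command -v g++ >/dev/null 2>&1 &&
       version_ge "$(g++ -dumpversion)" 4.7; then
        report "g++ >= 4.7" ok
    else
        report "g++ >= 4.7" MISSING
    fi

    # GNU make names itself on the first line of --version.
    if make --version 2>/dev/null | head -n 1 | grep -q 'GNU Make'; then
        report "GNU make" ok
    else
        report "GNU make" MISSING
    fi

    # Perl 5.8 or higher: "use 5.008" fails on older interpreters.
    if perl -e 'use 5.008;' 2>/dev/null; then
        report "Perl >= 5.8" ok
    else
        report "Perl >= 5.8" MISSING
    fi

    # The Perl module loads only if it is installed.
    if perl -MStatistics::Descriptive -e 1 2>/dev/null; then
        report "Perl Statistics::Descriptive" ok
    else
        report "Perl Statistics::Descriptive" MISSING
    fi

    # bz2 development files: the preprocessor must find bzlib.h.
    if echo '#include <bzlib.h>' | g++ -x c++ -E - >/dev/null 2>&1; then
        report "bz2 development files" ok
    else
        report "bz2 development files" MISSING
    fi
}

check_prerequisites
```

Any line marked MISSING points to a package to install with the distribution's package manager before running './install.sh'.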

To install, download the latest source and run the installation script './install.sh'. To install in a different directory, say '/opt/MaSuRCA', pass the DEST environment variable, like this:

  DEST=/opt/MaSuRCA ./install.sh

A recent version of the compiler is available from the "Developer Toolset" on RedHat and Scientific Linux/CentOS. Pass the variables CC and CXX to the install script, pointing to your compiler:

  CC=/path/to/gcc47 CXX=/path/to/g++47 ./install.sh

Running the assembler

This is only a quick-start guide. Refer to the documentation for more details. The assembly is driven by a configuration file that specifies the location of the read files and some parameters. A shell script that runs the actual assembler is generated from this configuration. The steps are as follows, assuming that the variable $MASURCA contains the directory where the code was compiled.

Generate a sample configuration file named 'configuration.txt':

  $MASURCA/bin/masurca -g configuration.txt

Then edit the configuration file 'configuration.txt' with your favorite text editor and start the assembly as follows:

  $MASURCA/bin/masurca configuration.txt
  ./assemble.sh
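
The same steps can be collected in a small wrapper script. This is only a sketch under stated assumptions: the DRY_RUN switch and the run and quick_start functions are hypothetical helpers, not part of MaSuRCA, and the fallback value /opt/MaSuRCA for $MASURCA is just the example directory used in the install instructions. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the quick-start steps as one script.
# MASURCA must point at the directory where the code was compiled;
# the /opt/MaSuRCA fallback is only an assumed example location.
MASURCA=${MASURCA:-/opt/MaSuRCA}
CONFIG=configuration.txt
# DRY_RUN is a hypothetical switch: 1 prints commands, 0 runs them.
DRY_RUN=${DRY_RUN:-1}

# run CMD ARGS...: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

quick_start() {
    # First pass: no configuration yet, so generate a sample and stop
    # to let the user edit it.
    if [ ! -e "$CONFIG" ]; then
        run "$MASURCA/bin/masurca" -g "$CONFIG"
        echo "Edit $CONFIG, then run this script again."
        return 0
    fi
    # Second pass: generate assemble.sh from the configuration, then
    # start the assembly.
    run "$MASURCA/bin/masurca" "$CONFIG"
    run ./assemble.sh
}

quick_start
```

Splitting the run into two passes keeps the manual editing step explicit: the script never starts an assembly from an unedited sample configuration that it has just generated.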

Contact

If you experience problems with MaSuRCA, you can contact us. We would like to help if we can, and perhaps we can point you in the right direction. For any questions or comments, contact Aleksey Zimin or Guillaume Marçais.

Change log

Version 3.1.3 (Bug fix version)

  • Fix error on read file preprocessing

Version 3.1.2 (Bug fix version)

  • Fixed bug where CA stops with error: jellyfish-2.0 not found

Version 3.1.1 (Bug fix version)

  • Assembly with SOAPdenovo now runs properly with multiple PE libraries

Version 3.1.0

  • New configuration parameter "SOAP_ASSEMBLY": use SOAPdenovo, instead of Celera Assembler, as last step of assembly.
  • New program "masurca-superreads": create super-reads from Illumina PE reads.

Version 2.3.2 (Bug fix version)

  • Fixed bug in generated assemble.sh script. Versions 2.3.0 and 2.3.1 have severe bugs that render the assemblies unreliable.
    Please rerun these assemblies with version 2.3.2.

Version 2.3.0

  • Improved jumping library filter: more stable and better performing.
  • Newer version of QuorUM.

Version 2.2.2 (Bug fix version)

  • Added GC bias calculation and adjustment for computing the coverage and distinguishing between unique and repeat genome regions
  • Limit the number of short linking mates used in the assembly: their utility quickly diminishes as we use more, but the assembly run time increases
  • Improved technique to choose k-mer sizes for super-reads and for the jumping library filtering

Version 2.2.1 (Bug fix version)

  • Fix compilation errors on CentOS/RedHat.
  • Many bug fixes.
  • Experimental binary distribution for some platforms, available on the ftp site.

Version 2.2.0

  • The error correction with QuorUM is much faster and slightly improved.
  • The jumping libraries are filtered using variable k-mer sizes.
  • The gap filling procedure is faster.
  • Parameters for scaffolding have been fine-tuned.

Version 2.1.0

  • Introduced an additional filtering step for the circularization-based libraries: we now localize the paired-end reads around each jumping pair and attempt to merge the two mates in the pair, pretending it is a non-junction short innie. The merge fails for the correct junction-containing pairs. This is done in the work2.1 folder, and the additional non-junction (chimeric) mate pairs detected are listed in work2.1/output.txt
  • Rewrote renaming/filtering of initial fastq files.
  • Set USE_LINKING_MATES=0 by default, force USE_LINKING_MATES=0 if OTHER long reads are supplied.
  • Set DO_HOMOPOLYMER_TRIM=0 by default.
  • Renamed runSRCA.pl to masurca.
  • The assemble.sh script can be regenerated with './assemble.sh -r'
  • The PATHS section in the configuration file is now deprecated, all paths to the binaries are automatically determined based on the location of the masurca script.
  • Improved the speed of the main jumping library filter code and correctly implemented the --join-aggressive flag in the mate joiner code. This flag joins the mate pair into a single read if any path through the k-mer graph exists leading from one mate to the other.
  • Updated the scaffold merging logic in the CA scaffolder (cgw) to improve speed
  • Changed the logic of handling low kmer counts in the error corrector -- now if the current kmer count is below the Poisson threshold, but the alternative counts are low as well, such that the probability of an error is lower than 10e-6 (computed from binomial distribution), the base is not corrected

Version 2.0.3.1 (Bug fix version)

  • Fix compilation issues

Version 2.0.3

  • Keep skip mers in a Jellyfish hash in the Celera overlapper: the overlapper should not die anymore because of the number of skip mers
  • Fix race condition bug in gap closing: a few more gaps will successfully be reported as closed
  • Various bug fixes