benchshortreadmappingprogramsaug2008

For comments and suggestions, send an email to Jerome.Gouzy@toulouse.inra.fr (do not hesitate: I am not a power user of all of this software).

Data

  • reference: MtrV2Chr5, 37,289,045 bp
  • reads: solexa3B, 4,579,755 reads, length 36 bp

Summary

Specialized General purpose
| Features | rmap | soap | maq | qpalma | genomethreader | blastn/megablast + db indexing | crossmatch manyreads | mummer | gmap | glint |
| url | http://rulai.cshl.edu/rmap/ | http://soap.genomics.org.cn/ | http://maq.sourceforge.net/ | http://www.fml.tuebingen.mpg.de/raetsch/projects/qpalma | http://www.genomethreader.org/ | ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools++/ | NA | http://mummer.sourceforge.net/ | http://www.gene.com/share/gmap/ | NA |
| version | 0.41 | 1.11 | 0.6.8 | NA | 1.1.2 | 17 March 2008 | 0.990329 | 3.20 | 2007-09-28 | rev 228 |
| publication | ? | Ruiqiang Li, Bioinfo 2008 | ? | De Bona, Bioinfo 2008 | G. Gremme, IST 2005 | Morgulis, Bioinfo 2008 | ? | S. Kurtz, Genome Biology 2004 | T. Wu, Bioinfo 2005 | Courcelle, NAR 2008 (narcisse) |
| licence | ? | GPL 3 | GNU | ? | specific | ? | specific | Artistic License | specific | / |
| program src | yes | yes | yes | no | no | yes | yes | yes | yes | yes |
| index/convert ref (on HD) | no | no | yes, bfa | ? | yes | yes | no | no | yes | yes |
| index/convert reads (on HD) | no | no | yes, bfq | ? | yes | no | no | no | yes | no |
| multithreading | no | yes | no, LSF script | ? | no | yes | no | yes | yes | no |
| input formats | fasta, qual | fasta, fastq | fasta, fastq | ? | fasta | fasta | fasta, qual | fasta | fasta | fasta |
| output formats | tabulated - BED | tabulated | bin, tabulated (via mapview) | ? | geneseqr, xml | blast | CM | txt | txt, gff3, blat | blast m8 |
| multiple hits | no (list of ambiguous) | yes | in theory (core dump) | ? | yes | yes | yes | yes | yes | yes |
| cutoff types | #MM | #MM, #GAP, insert size | MM | ? | %coverage | evalue, %ident | / | match length | ? | mask |
| quality files | yes, qual | fastq | fastq | ? | no | no | qual | no | no | no |
| spliced aln | no | no | no | yes | yes | no | no | no | yes | no |
| ! std nuc (non-standard nucleotides) | =MM | ? | ? | ? | ? | ok | ok | ? | ? | =MM |
| add on | no | miRNA | simul, assembly, convert, SNP, etc. | ? | / | / | / | plot | db mngt tools (iit) | / |
| paired ends | no | yes | yes | ? | no | no | no | no | no | no |
  • MM = MisMatch
  • %I = identity percentage

Bench 1: Genome annotation

Goal: map mRNA Solexa reads onto the genome to identify transcribed regions

Cutoffs: at most 3 mismatches over the 36 bp of a read; multiple hits per read are kept
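As a small worked check of this cutoff: 3 mismatches over 36 bp leave 33 matching positions, i.e. 33/36 ≈ 91.67 % identity, which is consistent with the -perc_identity 91.66 value given to blastn in the options row of the table below. The sketch below is only an illustration of that arithmetic and of an ungapped mismatch count; the function names are hypothetical and are not taken from any of the benchmarked programs. Any non-standard nucleotide (e.g. N) is simply counted as a mismatch here, which corresponds to the "=MM" behaviour listed in the summary table.

```python
READ_LENGTH = 36      # read length from the Data section
MAX_MISMATCHES = 3    # cutoff used in Bench 1


def count_mismatches(read: str, ref_window: str) -> int:
    """Ungapped mismatch count between a read and a reference window
    of the same length (case-insensitive)."""
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have the same length")
    return sum(1 for a, b in zip(read.upper(), ref_window.upper()) if a != b)


def passes_cutoff(read: str, ref_window: str, max_mm: int = MAX_MISMATCHES) -> bool:
    """True if the ungapped alignment has at most max_mm mismatches."""
    return count_mismatches(read, ref_window) <= max_mm


def min_identity_percent(read_length: int = READ_LENGTH,
                         max_mm: int = MAX_MISMATCHES) -> float:
    """Identity percentage equivalent to max_mm mismatches over read_length."""
    return 100.0 * (read_length - max_mm) / read_length


if __name__ == "__main__":
    print(round(min_identity_percent(), 2))
```

The same conversion can be reused when a program only exposes an identity or match-length cutoff instead of a mismatch count.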

Arch: tatum, Linux 2.6.18-6-amd64, x86_64, 32 GB, 4*2 Xeon 3.4 GHz (cache 16M)

Specialized General purpose
| Features | rmap | soap | maq | qpalma | genomethreader | blastn-toolkit + db indexing | crossmatch.manyreads | mummer | gmap | glint |
| compilation (pkg default) | O3 | O3 | O2 | not yet available | bin | O2 | O2 | O3 | O2 | O3 |
| program opt. | -v -h 16 -w 36 -m 3 | -s 12 -r 2 -w 9999 -v 3 | -n 3 -C 100000000 -H allhits | / | -seedlength 16 -minmatchlen 33 | -num_descriptions 10000000 -num_alignments 10000000 -use_index true -outfmt 6 -word_size 16 -index_name MtrV2Chr5.idx -perc_identity 91.66 -evalue 3e-7 | -minmatch 20 | (1) -l 33 -F -b -maxmatch (2) -F -l 20 -b -maxmatch | -B 2 -f 3 | (1) -m 8 -w 1 -d 1 -c 15 -X 24 -s 24 -M default (2) -m 8 -c 15 -X 24 -s 24 --no_dyn -M 111012110110211101 (3) -m 8 -c 15 -X 24 -s 24 --no_dyn --step 2 -M 111012110110211101 |
| user time (s) | 219 | 250 | 448 | NA | process killed at 94,320 s ⇒ splitting the cDNA database is required | 11207 | FATAL ERROR: MAX_NUM_BLOCKS exceeded ⇒ splitting the transcript db is required | (1) 159 (2) 2752 | days (in progress) | (1) 4772 (2) 256 (3) 175 |
| # of hits | 271219 + a list of ambiguous | 13,459,540 | 1,086,881 unique + X non unique (core dump reading allhits file) | NA | NA | 13,781,653 | NA | (1) 14,391,016 (2) 2,354,863,462 (yes, 2 Ghits) | | (1) 14,178,398 (2) 14,109,622 (3) 14,064,111 |
| # of hits (after MM filter) | idem | 321,803 | idem | NA | NA | 13,781,653 (idem) | NA | (1) 13,249,315 (2) 13,249,315 | | (1) 13,437,916 (2) 13,312,616 (3) 13,269,587 |
| an external post process is required for filtering on MM | no | yes | no | NA | yes | no | NA | yes | yes | yes |
| % of the chr covered | 4.76 | 5.39 | 5.11 | NA | NA | 7.63 | NA | 5.04 | NA | (1) 7.80 (2) 7.29 (3) 7.19 |
| % of the chr sequence covered both by this test and by pasa2 (Mt ESTs Jun2008) | 2.61 | 2.72 | 2.69 | NA | NA | 2.89 | NA | 2.59 | NA | (1) 2.86 (2) 2.79 (3) 2.78 |
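The two coverage rows can be obtained from hit coordinates along the following lines. This is a minimal sketch and not the script actually used for the benchmark: it assumes hits are available as 1-based inclusive (start, end) pairs on the chromosome, and all function names are illustrative. The chromosome length is the one given in the Data section.

```python
CHR_LENGTH = 37_289_045  # MtrV2Chr5 length in bp, from the Data section


def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_bases(intervals):
    """Number of distinct positions covered by at least one interval."""
    return sum(e - s + 1 for s, e in merge_intervals(intervals))


def percent_covered(intervals, chr_length=CHR_LENGTH):
    """Percentage of the chromosome covered by the intervals."""
    return 100.0 * covered_bases(intervals) / chr_length


def intersect(a, b):
    """Positions covered by both interval sets (e.g. read hits and EST hits)."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            shared.append((start, end))
        # advance whichever interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared
```

With read hits in `reads` and EST alignments in `ests`, `percent_covered(reads)` would correspond to the first row and `percent_covered(intersect(reads, ests))` to the second. Coordinates from a program that reports an end position one base too far would need to be corrected before merging.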

Bench 1: Warnings

  • rmap: the reported end position includes one additional base
  • spliced alignments are not taken into account ⇒ the 3 MM cutoff is computed on the basis of the full read length (36 bp)
  • pasa2 is based on gmap; the comparison of short read mapping with “classical” EST mapping is used as a sensitivity indicator.
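For programs whose output needs an external post-process to apply the mismatch cutoff, the filter could look like the sketch below. It is an assumption about how such a filter might be written, not the post-process used for these results. It reads the standard 12-column tabular BLAST format (-outfmt 6 / m8: query, subject, %identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score), counts every read base left outside the alignment as a mismatch so that the cutoff refers to the full 36 bp read, and discards gapped alignments for simplicity. Function names are hypothetical.

```python
import csv

READ_LENGTH = 36      # read length from the Data section
MAX_MISMATCHES = 3    # Bench 1 cutoff


def effective_mismatches(mismatch, qstart, qend, read_length=READ_LENGTH):
    """Mismatches relative to the whole read: reported mismatches plus
    read bases that are not part of the local alignment."""
    aligned = abs(qend - qstart) + 1
    return mismatch + (read_length - aligned)


def filter_hits(lines, max_mm=MAX_MISMATCHES, read_length=READ_LENGTH):
    """Yield the tabular rows that satisfy the mismatch cutoff.

    Assumption: gapped alignments (gap openings > 0) are rejected outright.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        mismatch, gapopen = int(row[4]), int(row[5])
        qstart, qend = int(row[6]), int(row[7])
        if gapopen > 0:
            continue
        if effective_mismatches(mismatch, qstart, qend, read_length) <= max_mm:
            yield row
```

The function accepts any iterable of lines, so an open file handle on a tabular result file can be passed directly.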

To be tested

  • AMOScmp-shortReads (nucmer based)
  • qpalma (when available)
benchshortreadmappingprogramsaug2008.txt · Last modified: 2008/08/21 14:02 by gouzy