Welcome to the GenomeQuest Documentation Wiki
LSPMUL HELP
Contents |
Introduction
GenomeQuest Engine (GQE, aka Biofacet) handles two different modes of sequence searching:
-
lspcalc - and its derivatives lspcalc.TH and lspcalc.THA.
This is the general purpose All vs. All, pairwise-based sequence comparison module. lspcalc supports numerous algorithms among:- Homology based, e.g. BL2 - Blast, SW - Smith-Waterman, NW - Needleman-Wunsh, etc..
- String matching, proprietary algorithms e.g. KERR, BLW, etc..
In contrast to homology-based algorithms, those algorithms provide exact answers.. - Pattern matching algorithms, handling motifs, e.g. PATT, etc..
-
lspmul - and its derivative lspmul.TH and lspcalc.THA.
This is the High-Speed Sequence Search module that is mainly used for NGS purposes.
10+ different algorithms can be used with lspcalc. They are of completely general purpose and offer a panel of choices to perform various searches useful for precise Comparative Genomics at large. For those interested in archeology, more on lspcalc can be found in the foundation paper, the roots of the GQE.
This document details lspmul. lspmul can be viewed as the "NGS side" of the GQE. It uses a filter based on seeds, to narrow down the search space to sequences that are locally or globally highly similar.
LSPMUL implements two ways of sequence mapping:
- An "historical" (2002) implementation based on the the number of words in common. This technique is also referred as Multiple Seeds.
- A more recent one based on GASSST. GASSST is a single seed approach, described in GASSST article in Bioinformatics.
The single seed algorithm (GASSST) is much faster than the multiple seeds one (LSPMUL). Nevertheless, LSPMUL has some good qualities, especially the ability to use BLAST on long reads.
LSPMUL Overview
LSPMUL computes alignments of sequences having a good-enough local or global window of similarity. Once the filter is realized, one of the four following algorithms can be applied on the fly:
- BL2: BLAST2 finds the most relevant sequences in terms of biological similarity. It relies on the NCBI BLAST2 algorithm
- SW: this is the classical Smith-Waterman algorithm..
- BLW: finds sequences having a common region of length L with at least P percentage of identities. This algorithm is a kind-of Blast-controlled; it is ungapped..
- KERR: computes the minimal number of differences between two sequences, by trying to optimally fit the shorter sequence into the longer one. Its implementation is based on a fast Myers's algorithm. While used for years by GenomeQuest IP audience, this algorithm is the state-of-art aligner for short reads mapping..
Please note that, in contrast to many short-reads-aligners, LSPMUL:
- Implements natively truly gapped algorithms
- Works with any vendor technology, including SOLiD in native colorspace mode (colorspace)
- Works with any read-lenghts
- Can operate in a full-accuracy mode, without missing any hit
- Can address a variety of problems. Beyond classical genome-remapping, transcriptome to genome, transcriptome to transcriptome problems can be processed
- Last, the Engine built-in feature of the dynamic AllxAll allows to successively filter your reads against various databases, non genome- or NGS-related. As an example, any Genbank division, or any combination of them, can be used as the subject database. QA, contamination quantification against viruses or bacteria can be conducted without any special effort. No special databases to extract and build, no special indices, etc. And not speaking about results that are not an ending, but dynamic reusable components for the next step of the workflow (see GenomeQuest Engine Primer).
Say for example you have millions of short reads in the database MyReads, and you want to map them, with up to two errors, against all Human transcripts. After having set the RefSeq mRNA database:
export rsm=/disk/GQtest/data/GQdata/content/hotdrive/RSM/38/RSM38_20091228.
The lspcalc command line would look like:
% time lspcalc.TH -t 8 -M kerr -O '[-errs 2 ]'\
-db $rsm -bfql 'os="homo sapiens"'\
-db -f 'f,r' MyReads\
-best 1 -o res/rsm.res -progress
while the lspmul would be:
% time lspmul.TH -t 8 -M mapnc,kerr -O '[-fltCut 720 -fltThreshold 6 -errs 2 -r ]'\
-db $rsm -bfql 'os="homo sapiens"'\
-db MyRead\
-best 1 -o res/rsm.res -progress
Beside the slight syntax changes (the way forward/reverse strands are specified, etc.), the major differences come from:
- the number of words in common, here 6 (-fltThreshold 6) that you expect to have between a read and,
- any genome window of 720nt (-fltCut 720)
This is what we call the filter that will be used to trigger the KERR aligner
As a result of this filter, speed is in orders of magnitude faster, at the cost of some putative misses of course. The misses can be precisely predicted as a function of the read lengths, the # of errors and fltThreshold. In the above case, with reads of 36nt, we can demonstrate that not a single alignment will be missed (see what is a good threshold).
.How to use:
- Generally speaking, BL2 or BLW (or SW but slower) can be used for 454/Sanger and longer reads such as PacBio, whatever the problem is (Metagenomics, RNA-Seq, etc.). Still with long reads, KERR can be used if data quality is good enough and problem is genomics/genomics or transcripts/transcripts mapping-oriented..
- For Ultra Short and Short reads (say up to 100) KERR is the choice. Some attempts can still be made to use SW, BL2 or BLW when reads are not Ultra Short (say >= 60), and when local alignments are needed, for:
- Either functional purposes, e.g. exon-exon junctions..
- Either technical purposes, e.g. get rid of low-qualities 3'- or 5'- ends..
- time spent in the filter itself. This time depends on the relationship between the query set, the reference target, and other parameters such as the query lengths, the threshold, the cut length etc. In average, it can count as much 50% of the time spent.
- time spent in the alignment method. Here again, two folds:
- time spent in "alignments candidates". Those that pass the filters and rejected by the aligner
- time spent in alignments that succeeded.
The ratio of the two items above reflects the accuracy of the filter. However, the alignment method obviously plays an important role here.
For short reads, and as far as alignment time is concerned, there is approximately one order of magnitude between KERR and BLW/BL2 and another one between BLW/BL2 and SW. For long reads, BL2 is the best choice..
As a bulk summary, it takes about:
- 45mns/64core to map 100M short (50-100nt) reads again the Human Transcriptome using KERR
- 12hours/64core to map 100M short (50-100nt) reads again the Human Genome using KERR.
- 1hour/64core to map 5M 454 Titanium reads against Human Transcriptome using BL2
GenomeQuest is actively working on alternative filters to speed up those numbers. Since our method is generalist (allows gaps, works for any technology, any read lengths, any algorithm, etc.) and since we do not want to sacrifice accuracy, we might use other techniques than the classical Burrows...Wheeler Transform (BWT) that "can" be fast but have severe drawbacks. We expect a significant improve in the short-term. However, please keep in mind that speed by itself is an easy "mirage", and just be the visible part of the NGS iceberg. It is useless without pre- or post-analysis capabilities, and flexibility. Overall improvements will cary over all unique goodies of the GQE that allows: (i) Global Throughput - it's not just about a single run (ii) Global Sequence Data Management and Mining. Last, please remind also that GQE architecture is scalable by design, and can take advantage of elastic computing by on-demand burst using the Cloud.
GASSST has been added to the GenomeQuest Engine. Here is the Press Release. From that point forward, many improvements over the Open Source code have been implemented. They are detailed in the next section.
With GASSST, the numbers above can be revised as follows (depending on sensitivity levels):
- 1-2hours/8core to map 30M short (50-100nt) reads (exome like) against the Human Genome
- 10 hours/96core to map 1.2B reads, 110nt, 44x Human Genome, against the Human Genome
After GASSST description, this document describes lspmul module. The third section details the briefly the 4 algorithms available with lspmuland their options. Algorithms options can be used indifferently with lspcalc or lspmul
GASSST
Rationale
Read mapping is a basic NGS procedure. There are more and more read mappers available. There is one new every month. On average, 10 to 12 emerge. Some are unknown while pretty good. Some have a brand while somewhat limited for certain purposes (e.g. Bowtie), some have a brand with good reasons (BWA).
Generally speaking, read mappers good at doing one thing. GenomeQuest has broader requirements.
The choice of a read mapper for GenomeQuest is subject to the investigation of the following list of properties. Some of them are more important than others, some are non-starters. There are two main categories:
- Functionally, a read mapper for GenomeQuest must:
- Handle gaps. By gap, we mean short gaps, but true gaps. For example Bowtie does not allow at all; BWA, while not so bad, is limited to some extent.
- Handle any read length. Obvious as read length increases, but many mappers are either limited, either “explode” when read lengths increase.
- Handle a variable # of errors The ability to define a # of errors that can vary with the read length is mandatory.
- Be Sequencing Technology Independent. This is almost never the case for read mappers; they do not handle LifeTech SOLiD data.
- Process the “best hits” in a flexible way.
- Allow any kind of reference, whether it be an assembled genome, or a set of 50,000 contigs.
- From a computer science architecture and scalability standpoints, we do not want:
- Fat code: fat is generally bad. Instead light, elegant and easy to integrate is much preferred.
- Pre-chewed indices such as the ones based on Burrows-Wheeler transformation (Bowtie, BWA, ...). Indeed:
- They lead to content overhead, especially in a distributed environment.
- They are static, and do not allow searching against for example Genbank Bacterial Division, or virtual databases as allowed in the GenomeQuest Platform.
- They are limited by design in gaps handling; heuristics that do not scale very well with read lengths are implemented to compensate this model.
- Be linked to any specialized hardware whether it be VLSI, FPGA or GPU.
After an exhaustive evaluation, GASSST not only met these requirements, but its elegant design proved a strong structural and performance fit into our platform. It has been integrated into the GQ engine, and from that point forward, all runtimes have been divided by 7 to 8 on average (sometimes more depending on the type of reads) allowing us to map an exome in less than 1 hour on 8core, and a human genome 44x coverage in 10 hours on 96core (approximate figures that obviously depend on sensitivity parameters).
All native GASSST parameters are available inside GQ Engine, plus additional ones specific to GQE and Colorspace Handling. A typical GASSST command line looks like:
% time lspmul.TH -t 8 -M mapnc,gassst -O '[ -w 14 -s 2 -errs 1:0,30:2,50:4 ]'
-db hg19 \
-db MyReads\
-best 1 -o foo.res -progress 1>foo.log 2>foo.err
will map MyReads against hg19, with a sensitivity threshold of 2, word length of 14, allowing 0 errors with read length < 30 (rl), 2 errors for rl>=30 and rl< 50, and 4 errors from rl >=50. The number of errors is a step function of the read length. -t 400 is a memory control parameter for the subject hash tables. It is advised to take this number for now.
The variable number of errors is especially powerful. The GenomeQuest workflows use the following default errors patterns:
- NUCLEOTIDE
1:0,25:2,38:3,55:4,70:5,85:6,101:1/14 - COLORSPACE
1:0,25:3,38:5,64:6,93:7,125:1/12
Please note that, due to the specific nature of two-bases encoding scheme, the number of errors in colorspace mode is higher than in nucleotide mode.
Options
Many improvements have been added over the original Open Source code from INRIA.
- Paired-End mode: the -best strategy now handles PE mapping in an efficient and accurate way. In a single pass, template fragments are aligned and tagged according to various situations such as: properly paired (aka Happy, meaning inside Expected Insert Size), paired on same chromosome, paired on different chromosomes, single mate mapped, etc. As a result, alignments are one-to-one BAM compliant, and can be exported in a straight and fast way using sam2gq.sh companion script. See section Paired-End Handling for more details.
- Single Hash table: instead of splitting the subject sequences in different chunks, a single hash table of all words and their neighborhood is built. This schema allows to scan a read one time, and allows to implement Paired-End best strategy efficiently. There are three options controlling this mode: saveindex, globaindex, saveindex (see below)
- Striding: a new option (-stride) allows to not consider all words shifted by one nucleotide, but words shifted by the -stride value. Although there might be some loss in sensitivity, the gains in speed improvement and decrease of memory consumption makes this option highly useful.
- Local alignments: local alignments can be produced using for example the -maxcut and -minlen* options.
- Small to medium gaps: deletions in sample, of lengths up to 64Knt can be found using the -midgap option. As a matter of example, the GQ SNP caller can find SNP events such as the alignments below:
AAGATGAAGAGCCCCAGCATCATGGAGAAGGCGCTGGCGTAGTAGGGGTAGGCCGAGGGGATGAAGCGCTCATACTGCGTGTGCTGGAGTGGCCGCACGGAT ccccaCcatcatggagaaggcgctggcgt----------------------gaagcgctcatactgcgtgtgctgg - rev 1921 ccccagcatcatggagaaggcgctggcgt----------------------gaagcgctcatactgcgtgtgctgg - rev 2087 catcatggagaaggcgctggcgt----------------------gaagcgctcatactgcgtgtgctggagtggc + fwd 1205 ggagaaggcgctggcgt----------------------gaagcgctcatactgcgtgtgctggagtggccgcacg - rev 278 >||||||||||||||||||||||<See -midgapmin and -midgapmax options below.
| Option/Sub-Option | Description | Default Value |
|---|---|---|
| -errs <int> -errs <pattern> |
Absolute number of errors. A fixed integer or a pattern. An integer is simple a fixed number of errors (.eg -errs 4). A pattern describes a read-len step function such as shown in the previous example. Additionally, syntax such as -errs 50:4,80:1/14 means that from len 80, 1 error every 14 nucleotides will be allowed. |
3 |
| -p <int> | Relative number of errors. p is the percentage of identities on alignment. Not advised for Ultra Short Reads, as rounding is imprecise. | 94 |
| -w <int> | Seed word size. As described in the GASSST article. With GQ-Engine, w can vary from 8 to 26. Computation time can decrease significantly, at a cost of increasing memory consumption. Nevertheless, GASSST implements a hashing-function for w>14, limiting the memory to be in the range to 8-12GB when w is in the range of 20. GQ defaults implements w=12 to w=15 for reads in the rangeof 50-100. For longer reads, a more aggressive w can be chosen (e.g. w=20). Please note also that the use of global indexes and striding change the memory requirements. | 12 |
| -stride <x> | striding value. By look at words shifted by <x> positions, this option considerably speeds up the mapping, and decreases Hash Table memory. The default – and historical behavior – is: <x>=1. Missing colors here
Stride 1
GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGACCACC
GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGACCACC
GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGACCACC
GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGACCACC
Stride 3
GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGACCACC
GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGACCACC
The drawback is a theoritical putative miss of alignments. Example: two msimatches at extremities of the words
Suject: GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGAC
GCCCCCTCCACGTGGTCACGGCCCCCTGAGGTGGTGCCGAGCTCTTCCGAGAC
Query: GCCCCCTCCACGTGGgCACGGCCCCCTGAGGcGGTGCCGAGCTCTTCCGAGAC
If, and only if, there is NO OTHER word matching from the query in the subject, this alignment will be missed, while it would have been found with stride=1. This can be the case when equally distributed errors (e.g. one error EXACTLY EVERY 14nt). Those cases are considered to be unlikely to happen. Experiments on the effect of striding with a value of 3 led so far to non significant misses for our SNP caller. |
1 |
| -s <int> | Sensitivity. As described in the GASSST article, s varies from 0 (modest depth of the number of matching seeds being considered) to 5 (exhaustive scanning). Compute times are multiplied approximatively by 2 by increasing s by 1. | 2 |
| -maxcut <value> | When value is different from 0, it is the maximum percentage of read that can be cut from the global alignment. It triggers local alignment that can be restricted to 5' and 3' extremities, or be anywhere in the read (see ecut option). maxcut is the opposite of minlenperc. -maxcut 20 means that local alignments with at most 80% of the read will be kept. This option disconnects some GASSST filters and runtimes are slower as soon as value is not 0. It should be used with caution. For example, for remapping non matched reads, looking for exon-exon junctions. This parameter is currently also needed with-midgapmin and -midgapmax. | 0 |
| -minlenperc <int> | Local alignment minimal length, as a percentage of query length. Inverse of maxcut. | 0 |
| -minlen <int> | Local alignment minimal lenght, in nt. Another form of local alignment where the minimal length of the aligned region is expressed in absolute number of nucleotides. Please note that the number of errors becomes RELATIVE to this length, and adapts accordingly. For example -minlen 18 -errs 1 means that the minimal alignment length will be 18nt, with at most 1 error in it. Alignments of length 36 will allow 2 errors, and so on so forth. The adaptive number of errors with a pattern applies. | 18 |
| -cbs <int> | When set, allow to cut both sides of the read. Local alignment can be anywhere in the read. This option slows down the search. | NotSet |
| -midgapmin <int> | Small to medium gaps lower bound. The -midgap[min,max] options allow to find deletions in the sample. The amplitude (length) of the deletion, expressed in number of gaps in the query, varies from min to max. For example, -midgapmin 5 -midgapmax 100 will find alignments with gaps in the sample from 5 to 100 included. Principle and limits. Such an alignment is a derivative of the local alignment scheme. It starts by finding an anchoring extremity of the read with at least half of the query length aligned (necessary condition). While obvious, this requirement is left mandatory for now, and then -mc 50 must be explicitly set in the command line. Once at least half of the read has been aligned, upstream and downstream local alignments on the other extremity of the read are tried with the remaining portion of the read. Boundaries of extension search on the genome is bounded to -midgapmax from the first alignment. Although the reality is a bit more complex than that, it is not expected that the smallest portion of the read aligned (left or right part of the gap) be smaller than -w. For example, if -w 12 has been chosen, and if the read length is 70, alignment lengths surrounding the gap will vary from 12 to 58 nt, including any combination in this interval. |
5 |
| -midgapmax <int> | Small to medium gaps upper bound. See above. | 50 |
| -noendgap <n> | Suppress gaps within n nt at the ends | 0 |
| -noendgaplocal <n> | Suppress gaps within n nt at the ends for local alignment | 0 |
| -saveindex <file> (-w x -stride y -db reference) | Starting from reference genome reference, save a global index with filename file, using word of size x and stride value y. | |
| -loadindex <file> (-w x -stride y -db reference) | Load index from file filereferenced with the triplet (reference,x,y). If the file does not exist, or if the index contained in the file do not match with word size and stride, computation is aborted. | |
| -globindex <file> (-w x -stride y -db reference) | Same as bove, but if the file does not exist, or if the index contained in the file do not match with word size and stride,index is built dynamically. | |
| -smx <int> -smxa <int> |
Controlled sensitivity. GASSST adjusts the sensibility level (s) by some heuristics based on the length of the genome and the word length (-w). This is sometimes not wanted. GQ-Engine allows to set precisely the internal seeds-space number (-smx) and alignments-processed (-smxa) in a fixed way, regardless of subject and word sizes. The numbers below are the ones advised for simulating -s. They must be used with caution. s=0 : smx = 2 smxa=100 |
smx=15 smxa=600 (equivalent to -s 2) |
| -hitmap | Score Distribution. See Hitmap section. | Not Set |
| -fullhitmap | Full Score Distribution. See Hitmap section. | Not Set |
As a matter of example, we provide below the Bash function that implements the default GASSST call in GenomeQuest workflows.
TheGreatGassstMapper()
{
typeset QUERYDB_TYPE=$1 # colorspace or nucleotide - not used anymore since NBERRS below is supposed to be set
typeset NBERRS=$2 # the default pattern - must be set by the caller
typeset SGE_QUEUE=$3 # queue name
typeset JOBNAME=$4 # job name
typeset SBJCT=$5 # subject
typeset SBJCTBFQL="$6" # filter on subject - if no filter, must be set to 'true'
typeset QUERY=$7 # query
typeset NT=$8 # -t nbnode.nbthreads
typeset ALLRES=$9 # results filename
typeset BESTRES=${10} # if specified (not "") name of the best results filename
typeset W=12
typeset S=2
typeset QUERYDB_READLEN=`lspdb $QUERY -bfql 'PRE{echo $seqdb.nbresids / $seqdb.onbseqs;}' -printf ''`
# simple heuristic to adjust W and sensitivity according to read lenghts
if [ $QUERYDB_READLEN -le 50 ]; then W=12 ; S="-smx 15 -smxa 600" # -s 2
elif [ $QUERYDB_READLEN -le 70 ]; then W=13 ; S="-smx 15 -smxa 600" # -s 2
elif [ $QUERYDB_READLEN -le 90 ]; then W=14 ; S="-smx 30 -smxa 1200" # -s 3
elif [ $QUERYDB_READLEN -le 110 ]; then W=15 ; S="-smx 30 -smxa 1200" # -s 3
else W=16 ; S="-smx 30 -smxa 1200" # -s 3
fi
echo TheGreatGassstMapper $QUERYDB_TYPE $NBERRS $SGE_QUEUE $JOBNAME $SBJCT "$SBJCTBFQL" $QUERY $NT $ALLRES $BESTRES >> comparison.log
initstep "MAPPING READS"
echo "[command:] lspcalc.THA -extend -splitrole query -qn $SGE_QUEUE -jobname $JOBNAME -- -t $NT -M mapnc,gassst -O '[ -t 400 -errs $NBERRS -w $W $S -extend -hitmap ]' -db $SBJCT -bfql $SBJCTBFQL -db $QUERY -o ${ALLRES} -best '1,single,hitcnt,{RS}' -progress 1>>lspcalc.progres 2>> lspcalc.stderr " >> comparison.log
lspcalc.THA -extend -splitrole query -qn $SGE_QUEUE -jobname $JOBNAME -- \
-t $NT \
-M mapnc,gassst -O "[ -t 400 -errs $NBERRS -w $W $S -hitmap -extend]" \
-db $SBJCT -bfql "$SBJCTBFQL" -db $QUERY -o ${ALLRES} \
-best '1,single,hitcnt,{RS}' -progress 1>>lspcalc.progress 2>> lspcalc.stderr &
check "MAPPING READS"
if [ "$BESTRES" != "" ]
then
if [ "$QUERYDB_TYPE" == "NUCCS" ]; then
initstep "DUMPING INFORMATIVE READS SOLID"
lspres ${ALLRES} -bfql "rp2<128 && (xf0 & 0x1)" -ocoll ${BESTRES} 2>> comparison.log &
check "DUMPING INFORMATIVE READS SOLID"
else
initstep "DUMPING INFORMATIVE READS"
lspres ${ALLRES} -bfql 'rp2<128' -ocoll ${BESTRES} 2>> comparison.log &
check "DUMPING INFORMATIVE READS"
fi
fi
}
LSPMUL
LSPMUL Principle
LSPMUL technology of the GenomeQuest Engine is based on the principle of filtering sequence pairs prior to align them. It is oriented toward the search of highly-similar regions. The inner workings of the algorithm are described in detail in the article published in IEEE, SCB 2002 (File:High-similarity-clustering.pdf). This technique is also referred as Multiple Seeds. For example SHRiMP (Rumble, S., Lacroute, P., Dalca, A., Fiume, M., Sidow, A., and Brudno, M. (2009). SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Computational Biology, 5(5)) uses a similar approach.
What follows here is a short, simplified summary.
Starting from two sequences, the Query sequence and the Subject sequence, the filter is based on counting the number of words - of lengths W - in common . This number is called the Threshold TC. Once the filter is realized (i.e. TC >= Tcmdline), an alignment with a given method M, together with its options, is started. M is of: BL2,SW,BLW,KERR (see <a href="#Algorithms_at_a_glance" title="">Algorithms Section</a>). The alignment is retained or not for further processing, according to some alignment-scoring criteria C. C is method specific: e.g. a score for BLAST, a number of differences for KERR, etc ....
Please note that in the current version of the GQ Engine, W=10, and it cannot be changed.
.
It has to be noted that the number of words in common TC is computed regardless of the positions of the words in the sequence. It is the alignment that confirms or not the relevance of the filter. The filter has to be seen as a necessary, but not sufficient condition. It is an accelerator. While this non-locality feature might induce many false candidates for the aligner, it might be useful to align sequences that have diverged over evolution. The figure below shows some example of alignments that can be obtained with TC=10, and with different methods..
>The alignment #3 shows an alignment that can be obtained with BL2. It is the short version of what could be found with distantly related sequences from long reads, or sequences having pretty long gaps. Using the scoring matrix NUC.2.1 (where a match counts as 1 and a mismatch counts as -2) , and with gapo=2,gape=2the overall score of such an alignment is:
- number of matches : 36. S1=36*1=36
- number of mismatches : 0. S2=0
- number of gaps : 7. S3=7*2+2*2=18 (there is two gap-opening events)
- Stot=S1-S2-S3=18
meaning that the following command line:
% lspmul.TH -M mapnc,bl2 -O '[ -gape 2 -gapo 2] -mn NUC.2.1 -db subject -db query -scut 18
will keep such an alignment.
What is a good threshold
The threshold is the number of word of length 10 in common between a query and a subject. A query is typically a short read, a read, or anything that has a reasonable size, say up to a transcript length. A subject is any sequence. For efficiency reasons detailed in implementation section, a subject is treated as N smaller overlapping regions, the chunks. By default, a chunk has a size of 10x the longest query.
For a query of length L, there is L-10+1 words. This is the maximum TCmax that would lead to alignments with 0 error. Now, every time one error is possible, it "consumes" 10 words. For example, for a query of length 50, TCmax=41. Allowing one error leads to a theoretical TCth=31; two errors, and TCth=21. And so on so forth. With four errors, we land with TCth=1. From a computing time standpoint, this is too low. Indeed, the lower TC, the higher the probability to have TC words in common is high, so the filter is not efficient. By choosing TC as close as TCth, the more accurate the results will be, but the lower the speed, especially for low TCs. It's all about "what is the theoretical" miss ?.
The theoretical miss occurs if we consider a uniform distribution of errors. The worst case is that errors fall down just at the frontiers of words spaced by 10 nt. This can be analytically estimated (see File:High-similarity-clustering.pdf). In practice, one has to have in mind that it is unlikely to happen as:.
- Although everything is possible, SNPs systematically and consecutively spaced every 10nt is almost impossible.
- If it is about sequencing errors, the low error rate together with the high coverage from NGS, will make this event an irrelevant noise. The probability that the reads in the neighborhood overlapping by one or few nt follow the SAME event is 0 (or close to).
- Practically speaking again, errors are usually shifted to the 5'- or 3'- ends of the query, and variations or SNPs can be mismatches or gaps, but are likely to be stretches of consecutive errors. Therefore, the words consumed by an error overlap, and TC can be chosen higher than TCth..
Our years of experience let us conclude that:
- for short reads in the range 30-50nt, and for say 3 to 5 errors, T=6 or T=7 is pretty optimal in term of tradeoff speed/accuracy.
- for reads in the range 50-100nt, and errors up to 7, T=12 to T=20 is pretty optimal in term of tradeoff speed/accuracy. Again, the enormous level of coverage induced by NGS leads to naturally compensate any singular situation, such as the very rare events of a uniform distribution of errors on ONE read. You can think it that way: 5 stretches of 13 nucleotides is TC=20. This is 5x13=65nt in total. For a read of 76nt, you are still left with 11 possible errors !
Let's take an example; for L=40, TCmax=31.
- two errors means TCth=11. This is good and can be kept as is.
- three errors means TCth=1. One word in common is typically too low. On the other hand, the number of events such as the three errors are equally distributed (i) is extremely low -- needs exact number (ii) will be compensated other reads shifted by one or multiple nucleotides. In practice, taking TC=7 leads to less than 0.1% of misses, and no loss in mapping coverage
Note: When algorithm is BLW, the treshold is calculated with the length of the overlap window (LEN), the minimal percentage of identity (PERID) or number of errors (ERRS), specified by the user and the length of the word (WORDLEN) set by us using the following formula:
threshold = max(1, LEN - WORDLEN +1 -(1- PERID/100)*; LEN *; WORDLEN)
As the number of errors, fltThteshold can be sepcified as a step function of the query length. In GenomeQuest workflows, default proposed threshold is: fltThreshold = 1:14,25:7,38:8,55:11,70:16,85:20,101:1/14 which means: for queries whose lenght is lower or equal than:
- 14: fltThrehsold=1
- 25: fltThrehsold=7
- 38: fltThrehsold=8
- 55: fltThrehsold=11
- And so on and so forth till 101nt, from which ftlThreshold is increased by one every 14nt (from the previous value of 20).
Implementation Notes
- W-Word Hash table on subject sequences are built dynamically; default = max(50000 seqs, 64Meg)
- The longer the sequences, the higher the probability to have TC >= C (any query has for example 10 words in common with Human Chromosome1). To that extent, subject sequences are cut on the fly with overlapping subsequences of length ftlCut and overlap fltOverlap. If L denotes the longest query in the query database, default values of ftlCut and fltOverlap are respectively 10*L and L. For example, with reads of lengths 50, subject sequences will be cut into regions of length 500, overlapping by 50.
Synopsis
lspmul has two ways of operating, and for each of them, a nucleotide mode (with a n) and a proteic mode (with a p).
- Clustering: this is single linkage clustering specified with nc or pc.
This mode will be documented in a separate section. - Mapping: this allows the mapping of NGS reads (or others) against any GQ Enginedatabase. The present document focuses on this way of computing.
-M mappc[,<algo>] ...O [ <filter_options> <algo_options>] -db subject [bfql] -db query [bfql] -best <best_options> -scut <value> -M mapnc[,<algo>] ...O [ <filter_options> <algo_options> [-r]] -db subject [bfql] -db query [bfql] -best <best_options> -scut <value>Let's primarily focus on mapnc, the nucleotide-nucleotide mode. We assume the paradigm -db subject [bfql] -db query [bfql] to be known..</p>.
- <algo>: is optional.
- If not specified, the alignment algorithm used is BLW (ungapped) - see <a href="#BLW_.28aka_Fragment_Search.29" title="">BLW algo</a>
- If specified, <algo> is one of : KERR, BL2, SW.
- <filter_options>
-
-fltThreshold <value>: minimum number of words in common.
- When algorithm is BLW, fltThresholdis automatically computed. See <a href="#What_is_a_good_threshold" title="">what is a good threshold</a>.
- Otherwise, fltThreshold must be set. If not set, the search might start with some values computed automatically. It is strongly advise to set it explicitly. Ther reason why it is intentionally left that way is because (i) this part will be enhanced soon (ii) even automatically computed, you might want to override the defaults.
-
-fltCut <value>: length of the genomic chunk.
For efficiency reasons detailed in (see [#Implementation Notes| Implementation Notes]]), subject sequences are cut on the fly into smaller pieces.
If not specified, automatically computed to 10X the max query length. - -fltOverlap <value>: length of the genomic chunck overlaps.
If not specified, automatically set to the max query length.
-
-fltThreshold <value>: minimum number of words in common.
-
-r : does the search on both forward and reverse strands of the query.
Warning: DO NOT FORGET THIS ONE. Must be explicitly set. Otherwise, only forward strands will be considered. - <algo_options>: Algorithm options (such as ...S, -E, -errs etc ...) as described in the <a href="#Algorithms_at_a_glance" title="">Algorithms Section</a>.
Please remind the importance of hitmap and fullhitmap KERR options, as described in the <a href="#Hitmap" title="">hitmap section</a> that complement the best module. - -scut <value>: Filter out alignments above (or below) a given score.
This score (RS in BFQL, %RS in printf) is algorithm dependent. The sorting order of score cut is automatically set, depending on the algorithm used. For example, for KERR, a score is the number of errors. The lower the score, the better the alignment. Therefore, -scut 2 means that all alignments having a score lower than or equal to 2 will be kept. On the other hand, for BLAST (or SW), -scut 30 means that all alignments having a score greater than or equal to 30 will be kept. As a memo:- BLW: RS=Number of matches in the alignment.
- BL2: RS=Blast score.
- KERR: RS=# of errors.
- SW: RS=Smith-Waterman score.
-
-best <best_options>
Defines how many hits to keep per query, and in which way. This is detailed in the <a href="#best" title="">next</a> subsection.
best
best is a processing module by itself. Its general form is:
$ lspmul -help
-best nbr[,<best strategy>[,{criteria list}]]
best res strategies:.....single.....rmdup.....subjid.....frame.....strand.....cover.....hitcnt.....tagcnt=<annot>....
best res criteria:.....res expression.
More precisely:
-
nbr : max number of alignments kept - stored - per query.
This is different from the total number of computed alignments (see single, hitmap and fullhitmap). - <best strategy>: one or multiple, coma separated, of:
- hitcnt (default) : compute the number of per-query hits that passed the threshold and alignment criterion. This number is stored in the variable RP1 (over-writing it if it is used by the algorithm [blast uses it for bits score]). The cost of computing is very low. It offers a way to know if a read potentially matches for instance one time, 5 times or 1000 times. These three possibilities tell very different stories on a read; for instance, one time is a pure uniquely placed read, 5 times could be because of a conserved domain, and 1000 times should be because of a repeat element or a vector contamination.
-
tagcnt[=<annot>]: like single, but the counting is processed by 'groups'. That is: for a given query, if multiple alignments occur but they all belong to subjects having the same annot property, the readcount (RP1) will be counted as one. If an alignment occurs in another group, RP1 is incremented.
Let's take an example: the GQ RNA-Seq workflows map reads to a transcriptome database, but considers that a read matching any transcript of a given gene is counted as one. This allows to very quickly differentiate reads that align to multiple genes, without having to know how many maximum isoforms can exist for a gene. Provisioning such a max number (human is close to 100) as nbr in bestleads to unmanageable size of results, and complicated post-filtering to separate isoforms from transcripts from other genes. So, assuming the transcript database is annotated with a GeneID (GD field), the following:
-best 1,tagcnt=GD -o resallows to store a minimal set of alignments and:
lspres res -bfql 'RP1==1' -oqcoll goodreadskeeps reads that hit a single gene.
This of course implies that GD annotation is available, on the headnode, and, for Array Computing, on the nodes. The pusher suite provides some tools to stage the nodes with special-added annotations.
The <annot> field is not general purpose. For efficiency and implementation reasons, the first 9 bytes of the annotation are considered. The spirit here is to used GD as a tag, a signature, a unique Identifier, e.g. a GeneID, a taxID, but also an Accession number, but not for example a Description field. Please note that tagcnt=OX, i.e. using the taxID of a sequence, can be particlarily useful for Metagenomics runs.
If [=annot] is omitted, the subject numerical GQ Engine LSPID (a subject record) is used. - single: (default): any alignment is considered as independent of each others. This is neutral for single-hit algorithms such as KERR or SW, but not for BL2, where multiple HSPs can occur. Although this would be enough for a vast majority of cases, even using Blast, the strategies below provide other modes..
- subjid: In that mode, alignments are still sorted, but when one comes into the ranked list to keep, all others HSPs between the query and the subject are pulled together with this best HSP. This allows to keep all other local alignments, whether they be good or not. It conforms to the NCBI way of reporting Blast alignments. This is of course especially useful when using BL2 with long reads as it allows for example transcript-to-transcript or transcript-to-genome mapping.
what happens on the genome ? is-it mutually-exclusive to single ? what happens if both are specified ? - frame: not adressed now.
- strand: rules, and same questions as subjid..
- cover: rules, and relations with subjid and strand.
-
criteria list: can be many different Biofacet scores such as RS, RE, RIQ, ...
As for sorting or grouping, is a multiple sort key, such as for example: {-RS,RE}
Examples (In Progress):
-
Classical EST mapping:
lspmul.TH -t 4 -M mapnc,bl2 -O ' [ -fltThreshold 10 -ftlCut 3000 -r] ' -mn NUC.3.1 -db SOME_GENOME -db SOME_EST -best '5,single,{-RS}' ILLUMINA : short reads, constant quality:
lspmul.TH -t 8 -M mapnc,kerr -O '[ -errs 2 -extend -fltThreshold 6 -hitmap -r]' -db db1 -db db2 -o res.
-
ABI : longer (but still short) reads, Quality decreases at extremities. So, while slower, we can use BL2:
lspmul.TH -t 8 -M mapnc,bl2\ -O [ -ftlThreshold 6 -fltCut 300 -r]\ // engine decides itself on overlap (set to max read lengths) -mn NUC.2.1\ // Modified NCBI matrix is better for short reads -scut 28 454 : longer reads, local alignments:
lspmul.TH -t 8 -M mapnc,bl2\ -O [ -fltThreshold 10 -ftlCut 5000 -fltOverlap 200 -r ] //another way to specify the filter (old <LEN,PERID>) -scut 30
Example of usage of lspcalc.THA for lspmul (see lspcal.THA -help):
jobname=MyJob
nbnodes=5
nbthreads=8
alg=kerr
params="-errs 3 -extend"
range=1:-1 # Set to, for instance, -range 1:30 for testing
genome=tmp/genomic.db
reads=/tmp/myreads
results=myresults
bestcriteria=-RS
lspcalc.THA -splitrole query -qs SGEX -jobname $jobname -- \
-t ${nbnodes}.${nbthreads}\
-M mapnc,$alg\.-O "[ -fltThreshold 6 $params -r ]"\
-mn NUC.2.1\
-db $genome $range -db $reads\
-o $resname\
-best 1,single,hitcnt,{$bestcriteria}"\
-progress 1>logs/${resname}.out 2>logs/${resname}.err
Hitmap for KERR and GASSST
KERR and GASSST handles two very important options: hitmap and fullhitmap. As a reminder, using for example single the best module allows to store the total number of putative alignments for a query. This information is stored in the RP1 field..</p>With KERR or GASSST, one can collect more fine grained information, such as the distribution of hits per-query, per-#-of-errors. In fact, it is not the distribution of hits stricto sensu, but rather an encoding that translates a probabilistic model. What guides us here is the ability to better decide what is a 'unique' alignment versus an 'ambiguous one'. Numerous packages, such as MAQ or Soap include these kind of measures. Generally, they decide that a unique alignment is something such as the second best one has more than two differences with the best. Here, we did extend these notions the following way: <p>Below is the simplified pseudo-code to compute probability that a read falls within a repetitive region. Weighted score that says there is a 25% chance that a hit with one error is really the same as a hit with no errors (A,C,G,T)..
if (best hit has 0 errors)
{
probability = 1 - (1 / (totalNbHits0errs + 0.25* totalNbHits1errs + 0.25*0.25* totalNbHits2errs))
}
else
if (best hit has 1 error)
{
probability = 1 - (1 / (totalNbHits1errs + 0.25* totalNbHits2errs))
}
else
if (best hit has 2 errors)
{
probability = 1 - (1 / totalNbHits2errs)
}
The GQ Engine encodes this model using the following values:.
# Errors Value (RP2) 0 64 1 16 2 4 >=3 1
0 means in fact the # of errors of the best alignment. All values are shifted according to the best hit. So to say that:
- one hit with 0 error : RP2 =64
- one hit with 1 error : RP2 =64
- two hits with 0 error : RP2 = 128
- one hit with 0 error and one hit with 1 error: RP2=80
- one hit with 0 errors and 4 hits with 1 error : RP2=128
- one hit with 0 errors and 5 hits with 2 errors : RP2=84
- one hit with 0 errors and 16 hits with 2 errors : RP2=128
The higher the RP2, the lower the probability to trust in the uniqueness of the best alignment. RP2 values greater than or equal to 128 translate that the probability that the hit is unique is lower than or equal to 0.5.
- -fullhitmap: this flag enables the RP2 computation.
- -hitmap: same as fullhitmap but the computation discards any further mapping of a read when RP2 >= 128. If for example -best 1 has been set, the best alignment will still be kept, and the RP2 will be provided for that read, but it will be in some way, saturated. The main advantage is a compute time cut. It can be seen as a filtering-on-repeats on the fly with no a priori knowledge of repeats, nor pre-processing of reads. We do encourage to use this flag, especially for Eukaryotic genomes resequencing. Depending on reads and species, it is in average 20% to 30% faster and there is no loss of information. The gain is less clear for transcriptome reads remapping as in theory repeats are much less abundant in pure transcriptome. Many GQ workflows use this option and then proceed with an a posteriori filtering such as:lpres res -bfql 'RP2 < 128'.
Please note that for the time being, hitmap and fullhitmapcan only be used with best strategy = hitcnt, and are not operational with best strategy = tagcnt[=<annot>]
Colorspace
The colorspace encoded reads are stored in sequence databases typed as "NUCCS" instead of "NUC" used for nucleic sequences. The sequences are always stored in a compressed form, 2 bits by letter plus the ambiguity positions. When a color space sequence database is used in sequence search or read mapping, GQE is aware of the input data type and transparently decompresses and decodes the reference NUC type sequence database to colorspace equivalent, while the NUCCS sequences are decompressed as colorspace and both are directed to the alignment computation algorithms. This allows unique implementation of all algorithms for nucleic and color space searches/mapping
After the mapping algorithm we obtain the positions of gapped alignments in the color space.
Then GQE tries to convert this alignment into a nucleotide alignment. This is done on the fly, using dynamic programming. We consider that the conversion succeeds if the resulting nucleotide space alignment has no more errors than the original color space alignment. In this case we correct the positions to nucleic, and we check the anchoring point i.e. the first real nucleotide in the colorspace encoding. The anchoring nucleotide is at the leftmost position for the forward orientation and at rightmost for the reverse. Currently the quality information is not used in alignment process and no attempts are made to modify the original read to try to correct the alignment. If the conversion fails the alignment is marked accordingly and will be displayed using color space encoding..
Please note that this strategy, while not not perfect, handles basic colorspace classical valid-adjacent-mismacthes (two colorspace consecutive mistmatches leading a single nucleotide mismtach), as well as more sophisticated rules described in various LifeTech documents A Theoretical Understanding of 2 Base Color Codes and Its Application to Annotation, Error Detection, and Error Correction.
Examples of ungapped alignments with nucleotide alignments together with their corresponding original colorspace ones. Please note that colorspace alignments are 1 position shorter.
ND= 741 NQ= 1326 CsErrs= 2 NucErrs= 1 (fw) XF0= nuc anchor nogap hitcnt= 1 stat=64
Q: 2 ATTGACCCAACAAAGGTTGTGAGAACTGCTTTATTG 37
||||||||||||||| ||||||||||||||||||||
S: 133282 ATTGACCCAACAAAGCTTGTGAGAACTGCTTTATTG 133317
Q: 3 30121001011002010111222012132003301 37
|||||||||||||| |||||||||||||||||||
S: 133282 30121001011002320111222012132003301 133316
ND= 680 NQ= 2380 CsErrs= 4 NucErrs= 2 (fw) XF0= nuc anchor nogap hitcnt= 1 stat=64
Q: 2 GGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCC 37
||||| | ||||||||||||||||||||||||||||
S: 5129 GGAATTCACTAAGGAGTGTGTAACAACTCACCTGCC 5164
Q: 3 02032033230202211111301101221102130 37
|||| |||||||||||||||||||||||||||
S: 5129 02030211230202211111301101221102130 5163
ND= 945 NQ= 1423 CsErrs= 4 NucErrs= 2 (rv) XF0= nuc anchor nogap hitcnt= 1 stat=64
Q: 1 ATCTGCGGTATACAATTTTTAGGTGCCTCATTCGAC 36
|||||||||||||||| || ||||||||||||||||
S: 231908 ATCTGCGGTATACAATATTCAGGTGCCTCATTCGAC 231943
Q: 1 32213301333110300003201130221302321 35
||||||||||||||| | |||||||||||||||
S: 231908 32213301333110333021201130221302321 231942
ND= 345 NQ= 14037 CsErrs= 4 NucErrs= 2 (fw) XF0= nuc anchor nogap hitcnt= 22 stat=70
Q: 2 GTTGGAGACGAGCGTGGCCAACATGGTGAAACCCCA 37
||||||||| ||| ||||||||||||||||||||||
S: 2587 GTTGGAGACCAGCCTGGCCAACATGGTGAAACCCCA 2622
Q: 3 10102221322331103010113101120010001 37
|||||||| || |||||||||||||||||||||
S: 2587 10102221012302103010113101120010001 2621
Examples of gapped alignments with nucleotide alignments together with their corresponding original colorspace ones. Please note that colorspace alignments are 1 position shorter.
ND= 220 NQ= 23339 CsErrs= 4 NucErrs= 2 (rv) XF0= nuc anchor hitcnt= 5 stat=80
Q: 1 ATTTACAAGAAAGAAACAAACAACTCCATCAAAAAG 36
|||||||||||| ||||||||||| |||||||||||
S: 203140 ATTTACAAGAAA-AAACAAACAACCCCATCAAAAAG 203174
Q: 1 30031102200220011001101220132100002 35
||||||||||| |||||||||| ||||||||||
S: 203140 30031102200-00011001101000132100002 203173
ND= 292 NQ= 25880 CsErrs= 4 NucErrs= 3 (rv) XF0= nuc anchor hitcnt= 1 stat=64
Q: 1 AGGTATATAGCCAAGAGAAATGAAA--ATGTGTCCACA 36
||||||||||||||||||||||||| || ||||||||
S: 318426 AGGTATATAGCCAAGAGAAATGAAAACATATGTCCACA 318463
Q: 1 2013333323010222200312000311--1120111 35
||||||||||||||||||||||||| | |||||||
S: 318426 2013333323010222200312000113331120111 318462
ND= 322 NQ= 31295 CsErrs= 3 NucErrs= 3 (fw) XF0= nuc anchor hitcnt= 4 stat=112
Q: 2 CTTAGGCCTTCCAAA-TGCTGGGATTACAAGGCATGA 37
||||| ||||||||| ||||||||||||| |||||||
S: 4079 CTTAG-CCTTCCAAAATGCTGGGATTACA-GGCATGA 4113
Q: 3 20320302020100-313210023031102031312 37
|||| ||||||||| ||||||||||||| |||||||
S: 4079 2032-30202010003132100230311-2031312 4112
Both color space and nucleic score (number of errors) are recorded in the GQE results. Below the variable names and their meaning, available with BFQL and printf
- RS: color space score (number of errors)
- RE1: nucleic score (number of errors)
- RP1: hit count
- RP2: uniqueness indicator (mapping is unique if RP2 < 128 - see <a href="#Hitmap" title="">hitmap section</a>)
Furthermore each result is marked with flags in the XF0 field..Each flag represents a bit mask that can be checked using BFQL.Currently implemented flags are:.
- Nucleic alignment succeeded (mask=1) Anchoring nucleotide is correct (mask=2)
- Non gapped alignment (mask=4)
- Hit count is clamped (mask=8)
The last flag indicates that we stop to accumulate the number of mappings once we are sure the mapping is not unique..Flag masks can be combined using bit positional OR to select results of interest.
Informative Reads
GenomeQuest defines informative reads as those with an MAPQ greater than 4:
initstep "DUMPING INFORMATIVE READS"
lspres ${ALLRES} -bfql 'mapq>4' -ocoll ${BESTRES} 2>> comparison.log &
check "DUMPING INFORMATIVE READS"
Historically, Biofacet stored the mapping quality in the field RP2. RP2 is the uniqueness field that is assigned to reads as they are mapped or aligned to a reference. In order to understand the reasoning behind scoring an RP2 value, you must first understand that reads can have multiple mapping locations when being mapped to the genome. If a read has one mapping location, it is unique. If a read has multiple mapping locations, it is no longer unique and additional measures must be taken in order to determine if it has a best mapping, many equal mappings, and if any of those mappings can be taken by applying a filter based on some threshold (as in the code above). Thereby, the RP2 value allows us to disambiguate a read's multiple mappings by using it as a threshold to filter on (in the case a read has more than one mapping) as well as discard those reads that have multiple mapping locations, and are 'ambiguously' mapped to the extent that we cannot determine even a 'best' mapping location amongst all its mappings. Since these ambiguously mapped reads can have multiple locations with equally scored mappings in some cases, such as repetitive regions of the genome, there may indeed be no way to disambiguate the best alignment, and the RP2 value provides a very quick filter in order to narrow your mapping results down to just your informative reads; aka those that were aligned uniquely to the genome.
Actually, the RP2 value has been sync'ed with the MAPQ quality, as it can be found in BWA, MAQ, and other read mapping packages. According to the standard definition of MAPQ, this is -10*log(Pr) where Pr is the probability that the mapping position is wrong. The resulting MAPQ value is rounded to the nearest integer, and stored as an 8-bit value. Additionally, the value 255 has a special meaning of "quality non available".
The bigger the value, the better the mapping quality. When importing SAM / BAM to biofacet, this value is set directly to the value of the MAPQ field. Thereby, if the definition of RP2 is defined as the probability of correct placement, expressed as a scaled integer, then it converts to MAPQ as:
-10log(1-64/old_RP2)
For compatibility and comparison purposes, below are some some typical values:
old_RP2 MAPQ 64 60 unique result, the real computed value is infinity, clamped to 60 80 7 corresponds to one N error and one N+1 error results. 128 3 corresponds to two results with equal number of errors
To read more about RP2, please see RP2.
Algorithms at a glance
BL2 (Blast2)
Description
The BL2 algorithm is the BLAST NCBI 2.x version. As a summary, BL2:.
- reports gapped local alignments.
- can report multiple hits between two sequences.
- requires a scoring matrix in nucleotide and in protein comparison modes
- reports the correct position of an alignment in subject and query sequences before using the lspextend module
- is invoked as bl2 on the lspcalc command line
The BL2 algorithm reports the following basic alignment properties; names italicized refer to BFQL variables names
- <Field 1> RS or Blast Score.
- <Field 2> RE or Expect Value
- <Field 3> RP or Bits score.
Options
| Description | Option | Default value Proteic | Default value Nucleic |
|---|---|---|---|
| Word length for neighborhood generation | -W | ||
| Primary expect value cutoff | -E. | ||
| Extension cutoff score | -X | ||
| Primary cutoff score | -S | ||
| Secondary cutoff score | -S2 | ||
| Gap Opening Penalty | -gapo | ||
| Gap Extension Penalty | -gape |
About scoring matrix:
- nucleotides. There is basically two main choices: NUC.3.1 and NUC.2.1. The first one (match=1, mistmatch=-3) is better to use for homology-oriented searches, while the second one (match=1, mismatch=-2) is preferred when doing NGS-oriented searches. In the latter case, do not forget to adapt gapo and gape to some values compatible with NUC.2.1 (for example gapo=2, gape=2)
- proteins. Use classical BLOSUM62 or PAM matrices
Notes
There are a number of differences between NCBI blast 2.3.2 and the GQE implementation.
- NCBI tblastx (translated nucleotide vs. translated nucleotide) does not allow gaps. The Biofacet equivalent does allow gaps. Allowing gaps has implications for the statistical calculations. Therefore expect values in this comparison mode are not strictly comparable between the two implementations.
- The NCBI blast version does not use an external scoring matrix in nucleotide comparisons. In compliance with the Biofacet schema, Biofacet bl2 does use an external scoring matrix. This allows Biofacet bl2 to handle other scoring matrices and ambigous residues. The NUC.3.1 scoring matrix with Biofacet bl2 is equivalent to the way NCBI blast code handles nucleotide scoring internally..
- The Biofacet bl2 implementation does not support the always-changing set of heuristic rules NCBI blast uses to connect the statistics of multiple HSPs in an attempt to identify frameshifts. Biofacet bl2 reports Expect values per HSP and does not guess which HSPs to connect or which HSP is better than another.
- The Biofacet bl2 implementation does not automatically apply a low complexity filtering routine to the sequence before comparison (DUST, XNU, SEG).
To fit the blast algorithm into the entire GQE schema, two cosmetical changes had to be introduced. Firstly, the sequence coordinates in GQE are displayed after transformation of the sequence (reverse complement nucleotide sequences, translation into protein). Coordinates are displayed relative to the original sequence in NCBI blast. Secondly, NCBI blast displays the subject sequence as reverse complement while GQE displays the query sequence as reverse complement. This is done because it is more efficient to reverse complement the query sequence instead of the entire subject sequence database. However, using printf all display combinations, including NCBI style, are possible.
SW (Smith & Waterman)
Description
The Smith & Waterman algorithm finds the best local alignment between two sequences and implements [Smith & Waterman, 1981] with affine scoring. The algorithm takes into account substitutions, insertions and deletions. The costs for a substitution are specified by the scoring matrix. The costs for opening and extending a gap are specified by the gap opening and gap extension options. SW:.
- reports gapped local alignments.
- always reports a single alignment between two sequences unless a score cutoff is specified.
- requires a scoring matrix in nucleotide and in protein comparison modes.
- reports the correct position of an alignment in subject and query sequences before using the lspextend module.
- is invoked as sw on the lspcalc command line.
In contrast to algorithms like blast, the Smith & Waterman algorithm always reports an alignment between two sequences. This means that when a single query sequence is compared to a subject database, the Smith & Waterman algorithm will report as many alignments as there are sequences in the subject database. One way to overcome this problem is to set the cutoff score. The Smith & Waterman algorithm will only report alignments with a score greater than or equal to the cutoff score. If the BLOSUM62 matrix is used then a reasonable value for the cutoff score is 75..
The SW algorithm reports the following basic alignment properties; names italicized refer to BFQL variables names.
- <Field 1> RS or SW Score.
Options
| Description | Option | Default value Proteic | Default value Nucleic |
|---|---|---|---|
| Primary cutoff score | -cutoff | ||
| Gap Opening Penalty | -gapo | ||
| Gap Extension Penalty | -gape |
Same remarks as for BL2 applies to the scoring matrix choices.
KERR (aka GenePAST)
Description
The KERR algorithm belongs to the class of approximate string matching algorithms. Given two sequences S1&S2, KERR computes the minimal number of differences between S1&S2, by trying to optimally fit the shorter sequence into the longer one.

Differences in sequence alignments arise from mismatches or gaps. By design, the number of differences is computed on the alignment. Although the alignment is nearly identical in length as the shorter sequence, gaps can induce an alignment longer than the length of the shorter sequence.
Given the two sequences:
S1 = ATGTATAGTATA and S2 = ATAGA.
of respective lengths 12 and 5, kerr computes the following alignment:
ATGTATAGTATA
|||| |
ATAG-A
From an end-user perspective, trailing gaps are not displayed; only the core alignment is presented:
ATAGTA
|||| |
ATAG-A
In our case, the number of differences is 1 (one gap), and the alignment length is 6. The percentage of identities on the alignment is therefore 83.33% (5 matches on 6).
KERR implementation is based on:
E. Myers - A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Progamming, J. of ACM 46, 3 (1999), 539-553.
It incorporates proprietary features such as:
- Works for nucleotide and proteins.
- Support complete IUPAC character set with tuning such as ...T matches U / N does not match N..
- Can privilege mismatches over gaps (leading to more compact alignments more suitable for SNPcallers).
- No length restriction: switches automatically from 64bits ultra-fast packed implementation to a generic length-less implementation..
- Reports the best alignment but computes and stores the distribution of all alignments found.
While KERR is the state-of-art aligner for short reads mapping, its accuracy can be used for more homology-related purposes. The example below is a protein alignment computed with KERR that is a real case of small exon-skipping in one variant:
Q: MKFLILLFNI----------NHGVGPQGASGVDPITFDINSNQTGPAFLTAVEMAGVKYL |||||||||| |||||||||||||||||||||||||||||||||||||||| S: MKFLILLFNILCLFPVLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEMAGVKYL
Options
| Description | Option | Default value Proteic | Default value Nucleic |
|---|---|---|---|
| # of errors | errs | ||
| Compute Exact Positions | -extend | ||
| Score Distribution | -hitmap | ||
| Full Score Distribution | -fullhitmap | ||
| Percentage Errors Alignment | -perc | ||
| Percentage ID on Query | -percq | ||
| Percentage ID on Subject | -percd | ||
| Percentage ID on Alignment | -perca |
Notes:
- Percentage identities options are more IP-usage-oriented
- -extend is used to compute exact positions. For many algorithmics reasons, KERR always compute the exact best alignment, but can or cannot compute the precise starting and end positions. Not putting -extend speeds up computation - 10% to 20% - and can be useful in some cases such as RNA-Seq or Metagenomics studies. When precise coordinates are needed (e.g. Variants) -extend must be used. If this flag has been omitted, precise coordiantes can be computed a posteiori using lspextend
KERR and IP searching
This section might not be a direct interest for NGS, and can be skipped at a first glance.
BLW (aka Fragment Search)
Description
Like KERR algorithm, the BLW algorithm will report matches that are precisely defined before the comparison is started, using the following criteria:
- The minimal length of the required match - refered to as the overlap window - in number of nucleotides or amino acids.
- Tthe maximum number of mismatches within this overlap window as an absolute number, or as a percentage identity relative to the length of the overlap window.
BLW is a multiple hits algorithm. It rapidly detects alignments between having in common a window of length at least L1, with a percentage identity of at leastP..
These criteria (L1,P) are bound to the following limits:
- For nucleotide sequences, the minimal match that can be detected contains at least 10 consecutive residues without mismatches and at least 90% identity over the length of the overlap window.
- For protein sequences, the minimal match that can be detected contains at least 5 consecutive residues without mismatches and at least 80% identity over the length of the overlap window.
- Only non-ambigous residues are taken into account (ACGT for nucleotide sequences, ACDEFGHIKLMNPQRSTVWY for protein sequences). Ambigous residues N for nucleotide and X for protein sequences are always counted as a mismatch.
- Below the boundaries of 90% identity for nucleotide sequences and 80% for amino acids, results are guaranteed to be correct but not complete, i.e. not all matches will be detected.
Options
| Description | Option | Default value Proteic | Default value Nucleic |
|---|---|---|---|
| # of errors | -errs | ||
| Window Length | -len | ||
| Percentage Errors | -perid |
BLW is a kind of directed-Blast'. It is Blast because it computes local alignments; it is directed because it gives control over the lengths and percentage identities or # of errors.