Welcome to the GenomeQuest Documentation Wiki

Search Algorithms

From GQ Wiki
Jump to: navigation, search

GQ ENGINE ALGORITHMICS
Avatar.jpg
The NGS Edition
February 16st 2010 V0.94 Draft

Contents

Introduction

GenomeQuest Engine (GQE) handles two different modes of sequence searching:

  1. lspcalc - and its derivatives lspcalc.TH and lspcalc.THA.
    This is the general purpose All vs. All, pairwise-based sequence comparison module. lspcalc supports numerous algorithms among:
    1. Homology based, e.g. BL2 - Blast, SW - Smith-Waterman, NW - Needleman-Wunsh, etc.
    2. String matching, proprietary algorithms e.g. KERR, BLW, etc..
      In contrast to homology-based algorithms, those algorithms provide exact answers.
    3. Pattern matching algorithms, handling motifs, e.g. PATT, etc.
  2. lspmul - and its derivative lspmul.TH and lspcalc.THA.
    This is the High-Speed Sequence Search module that is mainly used for NGS purposes.

10+ different algorithms can be used with lspcalc. They are of completely general purpose and offer a panel of choices to perform various searches useful for precise Comparative Genomics at large. More on lspcalc can be found in Link title. This document focuses on lspmul. lspmul can be viewed as the "NGS side" of the GQE. It uses a filter based on the number of words in common, to narrow down the search space to sequences that are locally or globally highly similar. It computes alignments of sequences having a good-enough local or global window of similarity. Once the filter is realized, one of the four following algorithms can be applied on the fly:

  1. BL2: BLAST2 finds the most relevant sequences in terms of biological similarity. It relies on the NCBI BLAST2 algorithm.
  2. SW: this is the classical Smith-Waterman algorithm.
  3. BLW: finds sequences having a common region of length L with at least P percentage of identities. This algorithm is a kind-of Blast-controlled; it is ungapped.
  4. KERR: computes the minimal number of differences between two sequences, by trying to optimally fit the shorter sequence into the longer one. Its implementation is based on a fast Myers's algorithm. While used for years by GenomeQuest IP audience, this algorithm is the state-of-art aligner for short reads mapping.

Please note that, in contrast to many short-reads-aligners, LSPMUL:

  • Implements natively truly gapped algorithms.
  • Works with any vendor technology, including SOLiD in native colorspace mode (see colorspace).
  • Works with any read-lenghts.
  • Can operate in a full-accuracy mode, without missing any hit.
  • Can address a variety of problems. Beyond classical genome-remapping, transcriptome to genome, transcriptome to transcriptome problems can be processed.
  • Last, the Engine built-in feature of the dynamic AllxAll allows to successively filter your reads against various databases, non genome- or NGS-related. As an example, any Genbank division, or any combination of them, can be used as the subject database. QA, contamination quantification against viruses or bacteria can be conducted without any special effort. No special databases to extract and build, no special indices, etc. And not speaking about results that are not an ending, but dynamic reusable components for the next step of the workflow (see GenomeQuest Engine Primer).

Say for example you have millions of short reads in the database MyReads, and you want to map them, with up to two errors, against all Human transcripts. After having set the RefSeq mRNA database:

export rsm=/disk/GQtest/data/GQdata/content/hotdrive/RSM/38/RSM38_20091228

the lspcalc command line would look like:

% time lspcalc.TH -t 8
   -M kerr
   -O '[-errs 2 ]'
   -db $rsm -bfql 'os="homo sapiens"'
   -db -f 'f,r' MyReads
   -bfqlbest 1
   -o res/rsm.res
   -progress

while the lspmul would be:

% time lspmul.TH -t 8
   -M mapnc,kerr
   -O '[-fltCut 720 -fltThreshold 6 -errs 2 -r ]'
   -db $rsm -bfql 'os="homo sapiens"'
   -db MyReads
   -bfqlbest 1
   -o res/rsm.res
   -progress

Beside the slight syntax changes (the way forward/reverse strands are specified, etc.), the major differences come from:

  • the number of words in common, here 6 (-fltThreshold 6) that you expect to have between a read and,
  • any genome window of 720nt (-fltCut 720)

This is what we call the filter that will be used to trigger the KERR aligner.

As a result of this filter, speed is in orders of magnitude faster, at the cost of some putative misses of course. The misses can be precisely predicted as a function of the read lengths, the # of errors and fltThreshold. In the above case, with reads of 36nt, we can demonstrate that not a single alignment will be missed (see what is a good threshold).


How to use:

  • generally speaking, BL2 or BLW (or SW but slower) can be used for 454/Sanger and longer reads such as PacBio, whatever the problem is (Metagenomics, RNA-Seq, etc.). Still with long reads, KERR can be used if data quality is good enough and problem is genomics/genomics or transcripts/transcripts mapping-oriented.
  • For Ultra Short and Short reads (say up to 100) KERR is the choice. Some attempts can still be made to use SW, BL2 or BLW when reads are not Ultra Short (say >= 60), and when local alignments are needed, for:
    • Either functional purposes, e.g. exon-exon junctions.
    • Either technical purposes, e.g. get rid of low-qualities 3'- or 5'- ends.

Speed hints: as always, it's all about sensitivity/sensibility. lspmul's speed depends on:

  • time spent in the filter itself. This time depends on the relationship between the query set, the reference target, and other parameters such as the query lengths, the threshold, the cut length etc. In average, it can count as much 50% of the time spent.
  • time spent in the alignment method. Here again, two folds:
    • time spent in "alignments candidates". Those that pass the filters and rejected by the aligner.
    • time spent in alignments that succeeded.
      The ratio of the two items above reflects the accuracy of the filter. However, the alignment method obviously plays an important role here.
      For short reads, and as far as alignment time is concerned, there is approximately one order of magnitude between KERR and BLW/BL2 and another one between BLW/BL2 and SW. For long reads, BL2 is the best choice.

As a bulk summary, it takes about:

  • 45mns/64core to map 100M short (50-100nt) reads again the Human Transcriptome using KERR.
  • 12hours/64core to map 100M short (50-100nt) reads again the Human Genome using KERR.
  • 1hour/64core to map 5M 454 Titanium reads against Human Transcriptome using BL2.

GenomeQuest is actively working on alternative filters to speed up those numbers. Since our method is generalist (allows gaps, works for any technology, any read lengths, any algorithm, etc.) and since we do not want to sacrifice accuracy, we might use other techniques than the classical Burrows–Wheeler Transform (BWT) that "can" be fast but have severe drawbacks. We expect a significant improve in the short-term. However, please keep in mind that speed by itself is an easy "mirage", and just be the visible part of the NGS iceberg. It is useless without pre- or post-analysis capabilities, and flexibility. Overall improvements will cary over all unique goodies of the GQE that allows: (i) Global Throughput - it's not just about a single run (ii) Global Sequence Data Management and Mining. Last, please remind also that GQE architecture is scalable by design, and can take advantage of elastic computing by on-demand burst using the Cloud.

This document first describes lspmul module. The second section details the briefly the 4 algorithms available with lspmuland their options. Algorithms options can be used indifferently with lspcalc or lspmul.

LSPMUL

Principle

LSPMUL technology of the GenomeQuest Engine is based on the principle of filtering sequence pairs prior to align them. It is oriented toward the search of “highly-similar” regions. The inner workings of the algorithm are described in detail in the article published in IEEE, SCB 2002 (File:High-similarity-clustering.pdf). What follows here is a short, simplified summary.

Starting from two sequences, the Query sequence and the Subject sequence, the filter is based on counting the number of words - of lengths W - in common . This number is called the Threshold TC. Once the filter is realized (i.e. TC >= Tcmdline), an alignment with a given method M, together with its options, is started. M is of: BL2,SW,BLW,KERR (see Algorithms Section). The alignment is retained or not for further processing, according to some alignment-scoring criteria C. C is method specific: e.g. a score for BLAST, a number of differences for KERR, etc …

Please note that in the current version of the GQ Engine, W=10, and it cannot be changed.

Lspmul.png

It has to be noted that the number of words in common TC is computed regardless of the positions of the words in the sequence. It is the alignment that confirms or not the relevance of the filter. The filter has to be seen as a necessary, but not sufficient condition. It is an accelerator. While this non-locality feature might induce many false candidates for the aligner, it might be useful to align sequences that have diverged over evolution. The figure below shows some example of alignments that can be obtained with TC=10, and with different methods.

Lspmulexamples.png

The alignment #3 shows an alignment that can be obtained with BL2. It is the short version of what could be found with distantly related sequences from long reads, or sequences having pretty long gaps. Using the scoring matrix NUC.2.1 (where a match counts as 1 and a mismatch counts as -2) , and with gapo=2,gape=2the overall score of such an alignment is:

  • number of matches : 36. S1=36*1=36
  • number of mismatches : 0. S2=0
  • number of gaps : 7. S3=7*2+2*2=18 (there is two gap-opening events)
  • Stot=S1-S2-S3=18

meaning that the following command line:

lspmul.TH -M mapnc,bl2 -O '[ -gape 2 -gapo 2] -mn NUC.2.1 -db subject -db query -scut 18

will keep such an alignment.

What is a good threshold ?

<under work> : the threshold is the number of word of length 10 in common between a query and a subject. A query is typically a short read, a read, or anything that has a reasonable size, say up to a transcript length. A subject is any sequence. For efficiency reasons detailed in implementation section, a subject is treated as N smaller overlapping regions, the chunks. By default, a chunk has a size of 10x the longest query.

For a query of length L, there is L-10+1 words. This is the maximum TCmax that would lead to alignments with 0 error. Now, every time one error is possible, it "consumes" 10 words. For example, for a query of length 50, TCmax=41. Allowing one error leads to a theoretical TCth=31; two errors, and TCth=21. And so on so forth. With four errors, we land with TCth=1. From a computing time standpoint, this is too low. Indeed, the lower TC, the higher the probability to have TC words in common is high, so the filter is not efficient. By choosing TC as close as TCth, the more accurate the results will be, but the lower the speed, especially for low TCs. It's all about "what is the theoretical" miss ?

The theoretical miss occurs if we consider a uniform distribution of errors. The worst case is that errors fall down just at the frontiers of words spaced by 10 nt. This can be analytically estimated (see File:High-similarity-clustering.pdf). In practice, one has to have in mind that it is unlikely to happen as:

  • Although everything is possible, SNPs systematically and consecutively spaced every 10nt is almost impossible.
  • If it is about sequencing errors, the low error rate together with the high coverage from NGS, will make this event an irrelevant noise. The probability that the reads in the neighborhood overlapping by one or few nt follow the SAME event is 0 (or close to).
  • Practically speaking again, errors are usually shifted to the 5'- or 3'- ends of the query, and variations or SNPs can be mismatches or gaps, but are likely to be stretches of consecutive errors. Therefore, the words consumed by an error overlap, and TC can be chosen higher than TCth.

Our years of experience let us conclude that:

  • for short reads in the range 30-50nt, and for say 3 to 5 errors, T=6 or T=7 is pretty optimal in term of tradeoff speed/accuracy.
  • for reads in the range 50-100nt, and errors up to 7, T=12 to T=20 is pretty optimal in term of tradeoff speed/accuracy. Again, the enormous level of coverage induced by NGS leads to naturally compensate any singular situation, such as the very rare events of a uniform distribution of errors on ONE read. You can think it that way: 5 stretches of 13 nucleotides is TC=20. This is 5x13=65nt in total. For a read of 76nt, you are still left with 11 possible errors !

Let's take an example; for L=40, TCmax=31.

  • two errors means TCth=11. This is good and can be kept as is.
  • three errors means TCth=1. One word in common is typically too low. On the other hand, the number of events such as the three errors are equally distributed (i) is extremely low -- needs exact number (ii) will be compensated other reads shifted by one or multiple nucleotides. In practice, taking TC=7 leads to less than 0.1% of misses, and no loss in mapping coverage.

Note: When algorithm is BLW, the treshold is calculated with the length of the overlap window (LEN), the minimal percentage of identity (PERID) or number of errors (ERRS), specified by the user and the length of the word (WORDLEN) set by us using the following formula.

threshold = max(1, LEN - WORDLEN +1 -(1- PERID/100)*; LEN *; WORDLEN)

Synopsis

lspmul has two ways of operating, and for each of them, a nucleotide mode (with a n) and a proteic mode (with a p).

  • Clustering: this is single linkage clustering specified with nc or pc.
    This mode will be documented in a separate section.
  • Mapping: this allows the mapping of NGS reads (or others) against any GQ Enginedatabase. The present document focuses on this way of computing.
-M mappc[,<algo>] –O [ <filter_options> <algo_options>] -db subject [bfql] -db query [bfql] -bfqlbest <best_options> -scut <value> [misc]
-M mapnc[,<algo>] –O [ <filter_options> <algo_options> [-r]] -db subject [bfql] -db query [bfql] -bfqlbest <best_options> -scut <value> [misc]

Let's primarily focus on mapnc, the nucleotide-nucleotide mode. We assume the paradigm -db subject [bfql] -db query [bfql] to be known.

  1. <algo>: is optional.
    • If not specified, the alignment algorithm used is BLW (ungapped) - see BLW algo
    • If specified, <algo> is one of : KERR, BL2, SW.
  2. <filter_options>
    • -fltThreshold <value>: minimum number of words in common.
      • When algorithm is BLW, fltThresholdis automatically computed. See what is a good threshold.
      • Otherwise, fltThreshold must be set. If not set, the search might start with some values computed automatically. It is strongly advise to set it explicitly. Ther reason why it is intentionally left that way is because (i) this part will be enhanced soon (ii) even automatically computed, you might want to override the defaults.
    • -fltCut <value>: length of the genomic chunk.
      For efficiency reasons detailed in (see [#Implementation Notes| Implementation Notes]]), subject sequences are cut on the fly into smaller pieces.
      If not specified, automatically computed to 10X the max query length.
    • -fltOverlap <value>: length of the genomic chunck overlaps.
      If not specified, automatically set to the max query length.
  3. -r : does the search on both forward and reverse strands of the query.
    Warning: DO NOT FORGET THIS ONE. Must be explicitly set. Otherwise, only forward strands will be considered.
  4. <algo_options>: Algorithm options (such as –S, -E, -errs etc …) as described in the Algorithms Section.
    Please remind the importance of hitmap and fullhitmap KERR options, as described in the hitmap section that complement the bfqlbest module.
  5. -scut <value>: Filter out alignments above (or below) a given score.
    This score (RS in BFQL, %RS in printf) is algorithm dependent. The sorting order of score cut is automatically set, depending on the algorithm used. For example, for KERR, a score is the number of errors. The lower the score, the better the alignment. Therefore, -scut 2 means that all alignments having a score lower than or equal to 2 will be kept. On the other hand, for BLAST (or SW), -scut 30 means that all alignments having a score greater than or equal to 30 will be kept. As a memo:
    • BLW: RS=Number of matches in the alignment
    • BL2: RS=Blast score
    • KERR: RS=# of errors
    • SW: RS=Smith-Waterman score
  6. -bfqlbest <best_options>
    Defines how many hits to keep per query, and in which way. This is detailed in the next subsection.

bfqlbest

bfqlbest is a processing module by itself. Its general form is:

$ lspmul -help
...
	-bfqlbest nbr[,<best strategy>[,{[bfql criteria list]}]]
			best res strategies:
				single
				rmdup
				subjid
				frame
				strand
				cover
				hitcnt
				tagcnt=<annot>
			best res criteria:
				bfql res expression
  1. nbr : max number of alignments kept - stored - per query.
    This is different from the total number of computed alignments (see single, hitmap and fullhitmap).
  2. <best strategy>: one or multiple, coma separated, of:
    • hitcnt (default) : compute the number of per-query hits that passed the threshold and alignment criterion. This number is stored in the variable RP1 (over-writing it if it is used by the algorithm [blast uses it for bits score]). The cost of computing is very low. It offers a way to know if a read potentially matches for instance one time, 5 times or 1000 times. These three possibilities tell very different stories on a read; for instance, one time is a pure uniquely placed read, 5 times could be because of a conserved domain, and 1000 times should be because of a repeat element or a vector contamination.
    • tagcnt[=<annot>]: like single, but the counting is processed by 'groups'. That is: for a given query, if multiple alignments occur but they all belong to subjects having the same annot property, the readcount (RP1) will be counted as one. If an alignment occurs in another group, RP1 is incremented.
      Let's take an example: the GQ RNA-Seq workflows map reads to a transcriptome database, but considers that a read matching any transcript of a given gene is counted as one. This allows to very quickly differentiate reads that align to multiple genes, without having to know how many maximum isoforms can exist for a gene. Provisioning such a max number (human is close to 100) as nbr in bfqlbestleads to unmanageable size of results, and complicated post-filtering to separate isoforms from transcripts from other genes. So, assuming the transcript database is annotated with a GeneID (GD field), the following:
      -bfqltest 1,tagcnt=GD -o res allows to store a minimal set of alignments and:
      lspres res -bfql 'RP1==1' -oqcoll goodreads keeps reads that hit a single gene.
      This of course implies that GD annotation is available, on the headnode, and, for Array Computing, on the nodes. The pusher suite provides some tools to stage the nodes with special-added annotations.
      The <annot> field is not general purpose. For efficiency and implementation reasons, the first 9 bytes of the annotation are considered. The spirit here is to used GD as a tag, a signature, a unique Identifier, e.g. a GeneID, a taxID, but also an Accession number, but not for example a Description field. Please note that tagcnt=OX, i.e. using the taxID of a sequence, can be particlarily useful for Metagenomics runs.
      If [=annot] is omitted, the subject numerical GQ Engine LSPID (a subject record) is used.
    • single: (default): any alignment is considered as independent of each others. This is neutral for single-hit algorithms such as KERR or SW, but not for BL2, where multiple HSPs can occur. Although this would be enough for a vast majority of cases, even using Blast, the strategies below provide other modes.
    • subjid: In that mode, alignments are still sorted, but when one comes into the ranked list to keep, all others HSPs between the query and the subject are pulled together with this best HSP. This allows to keep all other local alignments, whether they be good or not. It conforms to the NCBI way of reporting Blast alignments. This is of course especially useful when using BL2 with long reads as it allows for example transcript-to-transcript or transcript-to-genome mapping.
      what happens on the genome ? is-it mutually-exclusive to single ? what happens if both are specified ?
    • frame: not adressed now.
    • strand: rules, and same questions as subjid.
    • cover: rules, and relations with subjid and strand.
  3. bfql criteria list: can be many different Biofacet scores such as RS, RE, RIQ, …
    As for sorting or grouping, is a multiple sort key, such as for example: {-[RS],+[RE]}

Examples (In Progress).

Classical EST mapping.

lspmul.TH -t 4 -M mapnc,bl2 -O ' [ -fltThreshold 10 -ftlCut 3000 -r] ' -mn NUC.3.1 
	-db SOME_GENOME -db SOME_EST -best '5,single,{-RS}' 

ILLUMINA : short reads, constant quality. We want ....

lspmul.TH -t 8 -M mapnc,kerr
     -O ’[ -errs 2 –extend –fltThreshold 6 -hitmap -r]’
     -db db1
     -db db2
     -o res

ABI : longer (but still short) reads, Quality decreases at extremities. So, while slower, we try with BL2.

lspmul.TH -t 8 -M mapnc,bl2
     -O ’[ –ftlThreshold 6 –fltCut 300 -r]’
      // engine decides itself on overlap (set to max read lengths.
     -mn NUC.2.1 …
       // Modified NCBI matrix is better for short reads
     -scut 28

454 : longer reads, local alignments.

lspmul.TH -t 8 -M mapnc,bl2
     -O ’[ -fltThreshold  10 –ftlCut 5000 –fltOverlap 200 –r ]’ …
      //another way to specify the filter (old <LEN,PERID>)
     -scut 30

Example of usage of lspcalc.THA for lspmul (see lspcal.THA –help)

jobname = “MyJob”
nbnodes = 5
nbthreads = 8
alg = “kerr”
params = “-errs 3 -extend”
range = “” # Set to, for instance, “-range 1:30” for testing
genome = “/tmp/genomic.db”
reads = “/tmp/myreads”
results = “myresults”
bestcriteria = “-RS”

lspcalc.THA -splitrole query -qs SGEX -jobname $jobname --\
 	-t ${nbnodes}.${nbthreads} -M mapnc,$alg\
-O "[ -fltThreshold 6 $params -r ]" –mn NUC.2.1\
-db $genome $range\
-db $reads\
-o  $resname\
-best “1,single,hitcnt,{$bestcriteria}" 
-progress  1>logs/${resname}.out 2>logs/${resname}.err 

Implementation Notes

<in progress>

  • W-Word Hash table on subject sequences are built dynamically; default = max(50000 seqs, 64Meg)
  • The longer the sequences, the higher the probability to have TC >= C (any query has for example 10 words in common with Human Chromosome1). To that extent, subject sequences are cut on the fly with overlapping subsequences of length ftlCut and overlap fltOverlap. If L denotes the longest query in the query database, default values of ftlCut and fltOverlap are respectively 10*L and L. For example, with reads of lengths 50, subject sequences will be cut into regions of length 500, overlapping by 50.

Colorspace

The colorspace encoded reads are stored in sequence databases typed as "NUCCS" instead of "NUC" used for nucleic sequences. The sequences are always stored in a compressed form, 2 bits by letter plus the ambiguity positions. When a color space sequence database is used in sequence search or read mapping, GQE is aware of the input data type and transparently decompresses and decodes the reference NUC type sequence database to colorspace equivalent, while the NUCCS sequences are decompressed as colorspace and both are directed to the alignment computation algorithms. This allows unique implementation of all algorithms for nucleic and color space searches/mapping.

After the mapping algorithm we obtain the positions of gapped alignments in the color space.

Then GQE tries to convert this alignment into a nucleotide alignment. This is done on the fly, using dynamic programming. We consider that the conversion succeeds if the resulting nucleotide space alignment has no more errors than the original color space alignment. In this case we correct the positions to nucleic, and we check the anchoring point i.e. the first real nucleotide in the colorspace encoding. The anchoring nucleotide is at the leftmost position for the forward orientation and at rightmost for the reverse. Currently the quality information is not used in alignment process and no attempts are made to modify the original read to try to correct the alignment. If the conversion fails the alignment is marked accordingly and will be displayed using color space encoding.

Please note that this strategy, while not not perfect, handles basic colorspace classical valid-adjacent-mismacthes (two colorspace consecutive mistmatches leading a single nucleotide mismtach), as well as more sophisticated rules described in various LIFE documents (A Theoretical Understanding of 2 Base Color Codes and Its Application to Annotation, Error Detection, and Error Correction).

Examples of ungapped alignments with nucleotide alignments together with their corresponding original colorspace ones. Please note that colorspace alignments are 1 position shorter.

ND= 741 NQ= 1326 CsErrs= 2 NucErrs= 1 (fw) XF0= nuc anchor nogap  hitcnt= 1 stat=64
Q:          2  ATTGACCCAACAAAGGTTGTGAGAACTGCTTTATTG  37
               ||||||||||||||| ||||||||||||||||||||
S:     133282  ATTGACCCAACAAAGCTTGTGAGAACTGCTTTATTG  133317

Q:          3  30121001011002010111222012132003301  37
               ||||||||||||||  |||||||||||||||||||
S:     133282  30121001011002320111222012132003301  133316
-------------
ND= 680 NQ= 2380 CsErrs= 4 NucErrs= 2 (fw) XF0= nuc anchor nogap  hitcnt= 1 stat=64
Q:          2  GGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCC  37
               ||||| | ||||||||||||||||||||||||||||
S:       5129  GGAATTCACTAAGGAGTGTGTAACAACTCACCTGCC  5164

Q:          3  02032033230202211111301101221102130  37
               ||||    |||||||||||||||||||||||||||
S:       5129  02030211230202211111301101221102130  5163

-------------
ND= 945 NQ= 1423 CsErrs= 4 NucErrs= 2 (rv) XF0= nuc anchor nogap  hitcnt= 1 stat=64
Q:          1  ATCTGCGGTATACAATTTTTAGGTGCCTCATTCGAC  36
               |||||||||||||||| || ||||||||||||||||
S:     231908  ATCTGCGGTATACAATATTCAGGTGCCTCATTCGAC  231943

Q:          1  32213301333110300003201130221302321  35
               |||||||||||||||  |  |||||||||||||||
S:     231908  32213301333110333021201130221302321  231942
-------------
ND= 345 NQ= 14037 CsErrs= 4 NucErrs= 2 (fw) XF0= nuc anchor nogap  hitcnt= 22 stat=70
Q:          2  GTTGGAGACGAGCGTGGCCAACATGGTGAAACCCCA  37
               ||||||||| ||| ||||||||||||||||||||||
S:       2587  GTTGGAGACCAGCCTGGCCAACATGGTGAAACCCCA  2622

Q:          3  10102221322331103010113101120010001  37
               ||||||||  ||  |||||||||||||||||||||
S:       2587  10102221012302103010113101120010001  2621

Examples of gapped alignments with nucleotide alignments together with their corresponding original colorspace ones. Please note that colorspace alignments are 1 position shorter.

ND= 220 NQ= 23339 CsErrs= 4 NucErrs= 2 (rv) XF0= nuc anchor  hitcnt= 5 stat=80
Q:          1  ATTTACAAGAAAGAAACAAACAACTCCATCAAAAAG  36
               |||||||||||| ||||||||||| |||||||||||
S:     203140  ATTTACAAGAAA-AAACAAACAACCCCATCAAAAAG  203174

Q:          1  30031102200220011001101220132100002  35
               |||||||||||  ||||||||||  ||||||||||
S:     203140  30031102200-00011001101000132100002  203173
-------------
ND= 292 NQ= 25880 CsErrs= 4 NucErrs= 3 (rv) XF0= nuc anchor  hitcnt= 1 stat=64
Q:          1  AGGTATATAGCCAAGAGAAATGAAA--ATGTGTCCACA  36
               |||||||||||||||||||||||||  || ||||||||
S:     318426  AGGTATATAGCCAAGAGAAATGAAAACATATGTCCACA  318463

Q:          1  2013333323010222200312000311--1120111  35
               ||||||||||||||||||||||||| |   |||||||
S:     318426  2013333323010222200312000113331120111  318462
-------------
ND= 322 NQ= 31295 CsErrs= 3 NucErrs= 3 (fw) XF0= nuc anchor  hitcnt= 4 stat=112
Q:          2  CTTAGGCCTTCCAAA-TGCTGGGATTACAAGGCATGA  37
               ||||| ||||||||| ||||||||||||| |||||||
S:       4079  CTTAG-CCTTCCAAAATGCTGGGATTACA-GGCATGA  4113

Q:          3  20320302020100-313210023031102031312  37
               |||| ||||||||| ||||||||||||| |||||||
S:       4079  2032-30202010003132100230311-2031312  4112


Both color space and nucleic score (number of errors) are recorded in the GQE results. Below the variable names and their meaning, available with BFQL and printf.

  1. RS  color space score (number of errors)
  2. RE1 nucleic score (number of errors)
  3. RP1 hit count
  4. RP2 uniqueness indicator (mapping is unique if RP2 < 128 - see hitmap section).

Furthermore each result is marked with flags in the XF0 field. Each flag represents a bit mask that can be checked using BFQL.
Currently implemented flags are:

  1. Nucleic alignment succeeded    (mask=1),
  2. Anchoring nucleotide is correct (mask=2),
  3. Non gapped alignment           (mask=4),
  4. Hit count is clamped            (mask=8),

The last flag indicates that we stop to accumulate the number of mappings once we are sure the mapping is not unique. Flag masks can be combined using bit positional OR to select results of interest.

BWT, Soap, Bowtie etc.

Algorithms at a glance

BL2 (Blast2)

Description

The BL2 algorithm is the BLAST NCBI 2.x version. As a summary, BL2:

  • reports gapped local alignments
  • can report multiple hits between two sequences
  • requires a scoring matrix in nucleotide and in protein comparison modes
  • reports the correct position of an alignment in subject and query sequences before using the lspextend module
  • is invoked as bl2 on the lspcalc command line

The BL2 algorithm reports the following basic alignment properties; names italicized refer to BFQL variables names.

  • <Field 1> RS or Blast Score
  • <Field 2> RE or Expect Value
  • <Field 3> RP or Bits score

Options

Description Option Default value Proteic Default value Nucleic
Word length for neighborhood generation -W
3
11
Primary expect value cutoff -E
10
10
Extension cutoff score -X
0
Automatic
Primary cutoff score -S
0
0
Secondary cutoff score -S2
0
0
Gap Opening Penalty -gapo
11
5
Gap Extension Penalty -gape
1
2

About scoring matrix:

  • nucleotides. There is basically two main choices: NUC.3.1 and NUC.2.1. The first one (match=1, mistmatch=-3) is better to use for homology-oriented searches, while the second one (match=1, mismatch=-2) is preferred when doing NGS-oriented searches. In the latter case, do not forget to adapt gapo and gape to some values compatible with NUC.2.1 (for example gapo=2, gape=2).
  • proteins. Use classical BLOSUM62 or PAM matrices.

Notes

There are a number of differences between NCBI blast 2.3.2 and the GQE implementation

  • NCBI tblastx (translated nucleotide vs. translated nucleotide) does not allow gaps. The Biofacet equivalent does allow gaps. Allowing gaps has implications for the statistical calculations. Therefore expect values in this comparison mode are not strictly comparable between the two implementations.
  • The NCBI blast version does not use an external scoring matrix in nucleotide comparisons. In compliance with the Biofacet schema, Biofacet bl2 does use an external scoring matrix. This allows Biofacet bl2 to handle other scoring matrices and ambigous residues. The NUC.3.1 scoring matrix with Biofacet bl2 is equivalent to the way NCBI blast code handles nucleotide scoring internally.
  • The Biofacet bl2 implementation does not support the always-changing set of heuristic rules NCBI blast uses to connect the statistics of multiple HSPs in an attempt to identify frameshifts. Biofacet bl2 reports Expect values per HSP and does not guess which HSPs to connect or which HSP is better than another.
  • The Biofacet bl2 implementation does not automatically apply a low complexity filtering routine to the sequence before comparison (DUST, XNU, SEG).

To fit the blast algorithm into the entire GQE schema, two cosmetical changes had to be introduced. Firstly, the sequence coordinates in GQE are displayed after transformation of the sequence (reverse complement nucleotide sequences, translation into protein). Coordinates are displayed relative to the original sequence in NCBI blast. Secondly, NCBI blast displays the subject sequence as reverse complement while GQE displays the query sequence as reverse complement. This is done because it is more efficient to reverse complement the query sequence instead of the entire subject sequence database. However, using printf all display combinations, including NCBI style, are possible.

SW (Smith & Waterman)

Description

The Smith & Waterman algorithm finds the best local alignment between two sequences and implements [Smith & Waterman, 1981] with affine scoring. The algorithm takes into account substitutions, insertions and deletions. The costs for a substitution are specified by the scoring matrix. The costs for opening and extending a gap are specified by the gap opening and gap extension options. SW:

  • reports gapped local alignments
  • always reports a single alignment between two sequences unless a score cutoff is specified
  • requires a scoring matrix in nucleotide and in protein comparison modes
  • reports the correct position of an alignment in subject and query sequences before using the lspextend module
  • is invoked as sw on the lspcalc command line

In contrast to algorithms like blast, the Smith & Waterman algorithm always reports an alignment between two sequences. This means that when a single query sequence is compared to a subject database, the Smith & Waterman algorithm will report as many alignments as there are sequences in the subject database. One way to overcome this problem is to set the cutoff score. The Smith & Waterman algorithm will only report alignments with a score greater than or equal to the cutoff score. If the BLOSUM62 matrix is used then a reasonable value for the cutoff score is 75.

The SW algorithm reports the following basic alignment properties; names italicized refer to BFQL variables names.

  • <Field 1> RS or SW Score

Options

Description Option Default value Proteic Default value Nucleic
Primary cutoff score -cutoff
0
0
Gap Opening Penalty -gapo
11
5
Gap Extension Penalty -gape
1
2

Same remarks as for BL2 applies to the scoring matrix choices.

KERR (aka GenePAST)

Description

The KERR algorithm belongs to the class of approximate string matching algorithms. Given two sequences S1&S2, KERR computes the minimal number of differences between S1&S2, by trying to optimally fit the shorter sequence into the longer one.

Kerr.jpg

Differences in sequence alignments arise from mismatches or gaps. By design, the number of differences is computed on the alignment. Although the alignment is nearly identical in length as the shorter sequence, gaps can induce an alignment longer than the length of the shorter sequence.

Given the two sequences,

S1 = ATGTATAGTATA
S2 = ATAGA

of respective lengths 12 and 5, kerr computes the following alignment:

ATGTATAGTATA 
    |||| |
    ATAG-A

From an end-user perspective, trailing gaps are not displayed; only the core alignment is presented:

ATAGTA
|||| |
ATAG-A

In our case, the number of differences is 1 (one gap), and the alignment length is 6. The percentage of identities on the alignment is therefore 83.33% (5 matches on 6).

KERR implementation is based on:

E. Myers - A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Progamming, J. of ACM 46, 3 (1999), 539-553.

It incorporates proprietary features such as:

  • Works for nucleotide and proteins
  • Support complete IUPAC character set with tuning such as ‚T matches U / N does not match N.
  • Can privilege mismatches over gaps (leading to more compact alignments more suitable for SNPcallers)
  • No length restriction: switches automatically from 64bits ultra-fast packed implementation to a generic length-less implementation.
  • Reports the best alignment but computes and stores the distribution of all alignments found.

While KERR is the state-of-art aligner for short reads mapping, its accuracy can be used for more homology-related purposes. The example below is a protein alignment computed with KERR that is a real case of small exon-skipping in one variant:

Q: MKFLILLFNI----------NHGVGPQGASGVDPITFDINSNQTGPAFLTAVEMAGVKYL
   ||||||||||          ||||||||||||||||||||||||||||||||||||||||
S: MKFLILLFNILCLFPVLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEMAGVKYL

Options

Description Option Default value Proteic Default value Nucleic
# of errors -errs
1
1
Compute Exact Positions -extend
off
off
Score Distribution -hitmap
off
off
Full Score Distribution -fullhitmap
off
off
Percentage Errors Alignment -perc
not set
not set
Percentage ID on Query -percq
not set
not set
Percentage ID on Subject -percd
not set
not set
Percentage ID on Alignment -perca
not set
not set

Notes:

  • Percentage identities options are more IP-usage-oriented.
  • -extend is used to compute exact positions. For many algorithmics reasons, KERR always compute the exact best alignment, but can or cannot compute the precise starting and end positions. Not putting -extend speeds up computation - 10% to 20% - and can be useful in some cases such as RNA-Seq or Metagenomics studies. When precise coordinates are needed (e.g. Variants) -extend must be used. If this flag has been omitted, precise coordiantes can be computed a posteiori using lspextend.

Hitmap

KERR handles two very important options: hitmap and fullhitmap. As a reminder, using for example single the bfqlbest module allows to store the total number of putative alignments for a query. This information is stored in the RP1 field.

With KERR, one can collect more fine grained information, such as the distribution of hits per-query, per-#-of-errors. In fact, it is not the distribution of hits stricto sensu, but rather an encoding that translates a probabilistic model. What guides us here is the ability to better decide what is a 'unique' alignment versus an 'ambiguous one'. Numerous packages, such as MAQ or Soap include these kind of measures. Generally, they decide that a unique alignment is something such as the second best one has more than two differences with the best. Here, we did extend these notions the following way:

Below is the simplified pseudo-code to compute probability that a read falls within a repetitive region. Weighted score that says there is a 25% chance that a hit with one error is really the same as a hit with no errors (A,C,G,T).

    if (best hit has 0 errors) {
        probability = 1 - (1 / (totalNbHits0errs + 0.25* totalNbHits1errs + 0.25*0.25* totalNbHits2errs))
    }
    else
    if (best hit has 1 error) {
        probability = 1 - (1 / (totalNbHits1errs + 0.25* totalNbHits2errs))
    }
    else if (best hit has 2 errors) {
        probability = 1 - (1 / totalNbHits2errs)

The GQ Engine encodes this model using the following values:

# Errors        Value (RP2)
0                            64
1                            16
2                              4
>=3                        1

0 means in fact the # of errors of the best alignment. All values are shifted according to the best hit. So to say that:

  • one hit with 0 error : RP2 =64
  • one hit with 1 error : RP2 =64
  • two hits with 0 error : RP2 = 128
  • one hit with 0 error and one hit with 1 error: RP2=80
  • one hit with 0 errors and 4 hits with 1 error : RP2=128
  • one hit with 0 errors and 5 hits with 2 errors : RP2=84
  • one hit with 0 errors and 16 hits with 2 errors : RP2=128

The highest RP2, the lower the probability to trust in the uniqueness of the best alignment. RP2 values greater than or equal to 128 translate that the probability that the hit is unique is lower than or equal to 0.5.

  • -fullhitmap: this flag enables the RP2 computation.
  • -hitmap: same as fullhitmap but the computation discards any further mapping of a read when RP2 >= 128. If for example -bfqlbest 1 has been set, the best alignment will still be kept, and the RP2 will be provided for that read, but it will be in some way, saturated. The main advantage is a compute time cut. It can be seen as a filtering-on-repeats on the fly with no a priori knowledge of repeats, nor pre-processing of reads. We do encourage to use this flag, especially for Eukaryotic genomes resequencing. Depending on reads and species, it is in average 20% to 30% faster and there is no loss of information. The gain is less clear for transcriptome reads remapping as in theory repeats are much less abundant in pure transcriptome. Many GQ workflows use this option and then proceed with an a posteriori filtering such as:lpres res -bfql 'RP2 < 128'.
    Please note that for the time being, hitmap and fullhitmapcan only be used with bfqlbest strategy = hitcnt, and are not operational with bfqlbest strategy = tagcnt[=<annot>].

KERR and IP searching

This section might not be a direct interest for NGS, and can be skipped at a first glance.

BLW (aka Fragment Search)

Description

Like KERR algorithm, the BLW algorithm will report matches that are precisely defined before the comparison is started, using the following criteria:

  1. The minimal length of the required match - refered to as the overlap window - in number of nucleotides or amino acids
  2. Tthe maximum number of mismatches within this overlap window as an absolute number, or as a percentage identity relative to the length of the overlap window
Blw.png

BLW is a multiple hits algorithm. It rapidly detects alignments between having in common a window of length at least L1, with a percentage identity of at leastP.

These criteria (L1,P) are bound to the following limits:

  • For nucleotide sequences, the minimal match that can be detected contains at least 10 consecutive residues without mismatches and at least 90% identity over the length of the overlap window.
  • For protein sequences, the minimal match that can be detected contains at least 5 consecutive residues without mismatches and at least 80% identity over the length of the overlap window.
  • Only non-ambigous residues are taken into account (ACGT for nucleotide sequences, ACDEFGHIKLMNPQRSTVWY for protein sequences). Ambigous residues N for nucleotide and X for protein sequences are always counted as a mismatch.
  • Below the boundaries of 90% identity for nucleotide sequences and 80% for amino acids, results are guaranteed to be correct but not complete, i.e. not all matches will be detected.

BLW produce local alignments. Therefore mismatches at the begin and end of an alignment, for example in overlapping sequences, are not taken into account.

Options

Description Option Default value Proteic Default value Nucleic
# of errors -errs
6
0
Window Length -len
150
20
Percentage Errors -perid
not set
not set

BLW is a kind of directed-Blast'. It is Blast because it computes local alignments; it is directed because it gives control over the lengths and percentage identities or # of errors.

Personal tools