Welcome to the GenomeQuest Documentation Wiki

Variant Detection Workflow

From GQ Wiki
Jump to: navigation, search

This page will show you how to get from NGS reads to a database of annotated SNPs and indels. The SNPs and indels it will detect are usually in the range of 1 to 8 bases, depending on the length of your NGS reads.

Contents

How to detect and annotate variants

There are several steps in this workflow:

  • Upload and process raw NGS reads using the read processing workflow
  • Align the processed NGS reads against a reference genome
  • Call the variants using the alignments generated in the previous step
  • Annotate the called variants using public data sources like dbSNP

How to handle these steps in the interface:

  • The Read Processing step is a separate workflow that can be found under the launch workflow / utility menu
  • The alignment, variant calling and variant annotation steps can be run all at once using default parameters, or can be run separately with the possibility to change the default parameters.

Upload and process raw NGS reads

This workflow will work with data coming from all the sequencing vendors that we support. It is preferred that your data contains base call quality values, but this is not a prerequisite. Detailed information on how to get NGS reads into the system and on how to process them (clean them up) can be found on the Read Processing Workflow page.

Align processed NGS reads against reference genome

About GASSST / read mapping algorithm

  • GenomeQuest uses GASSST for read mapping (paper here: http://bioinformatics.oxfordjournals.org/content/26/20/2534.abstract)
  • The algorithm has been completely integrated into our engine, meaning that:
    • it works in a distributed way (on multiple nodes, on multiple CPUs and cores)
    • it works with our (very efficient) sequence and alignment database formats
  • it handles both gaps and mismatches very well
  • it handles any read length and doesn't slow down with longer reads like other algorithms
  • it handles different read lengths in the same run, adjusting the number of mismatches / gaps to the read length
  • it is technology independent, it handles: Illumina, Helicos, Pac Bio, 454, and Solid (in color space)
  • has the flexibility to redefine what the best alignment for a read is
  • works with any reference sequence easily, from nice finished genomes to sets of 50,000 contigs

GASSST parameters we use

  • Variable number of mismatches / indels (errors) depending on the read length:
    • for nucleotide space: 1-24 bases 0 errors, 25-37 bases 2 errors, 38-54 bases 3 errors, 55-69 bases 4 errors, 70-84 bases 5 errors, 85-100 bases 6 errors, 101 bases and more 1 error every 14 bases
  • Seed word size (see GASSST paper) depending on the read length:
    • 1-50 bases W=12, 51-70 bases W=13, 71-90 bases W=14, 91-110 bases W=15, 111 bases and up W=16
  • Controlled sensitivity, seed space number (SMX) and number of alignments processed (SMXA) (see GASSST paper)
    • 1-70 bases SMA=15 / SMAX=600, 71bases and up SMA=30 / SMAX=1200

Best alignment

  • We look at all the alignments for a read, but we only keep the best one
  • The best alignment for a read is the one with the best score (least amount of errors)
    • Reads that "map well" to multiple locations on the reference genome are discarded
    • The definition of "mapping well" depends on the number of errors in the best alignment
      • a read can't have 2 (or more) equivalent alignments on the reference
      • equivalent alignment is defined as:
        • two alignments with the same number of errors (or two alignments without errors)
        • 4 alignments with 1 error counts as 1 alignment with 0 errors
        • 4 alignments with 2 errors counts as 1 alignment with 1 error
        • 16 alignments with 2 errors counts as 1 alignment with 0 errors, etc.

Report statistics and results

  • After a mapping we show the following statistics
    • Number of reads we started with
    • Number of reads that produced at least one alignment
    • Number of reads that mapped well (informative reads), these are used for further analysis
    • Number of reads that did not map

Call the variants starting from alignments

Detailed information on this workflow can be found here: Variant Calling Workflow.

The calling algorithm runs through all alignments that have at least one mismatch or indel (event). Every event in an alignment is run through the following filter:

  • the base call quality at the event position is at least X (20 by default)
  • the base call quality of bases directly flanking the event (one on each side) is at least Y (15 by default)
  • the distance between the event and the edge of the alignment (left and right) is at least Z bases (3 by default)
  • there are no more than K (1 by default) identical alignments that have showed this event before

In short this does:

  • Neighborhood Quality Score
  • getting rid of sequencing errors towards beginning and end of reads, getting rid of alignment artifacts which are more frequent at these positions
  • getting rid of sample preparation artifacts causing the exact same read to appear many times

Note that each of these steps can be changed, or turned off.

A genomic position (chromosome, start & stop position) is flagged for allele calling if

  • at least L reads confirm the same event at that position (6 by default)
  • at least M of those reads align to the forward strand (1 by default)
  • at least N of those reads align to the reverse strand (1 by default)

For all regions flagged for allele calling we examine all alignment thats overlap the region to see what is happening at the event position. Everything that is found (both variant and reference bases) is subjected to the event filtering rules described above. An allele is called when it passes the allele flagging rules described above.

The output we generate is per allele, meaning that if more than one allele was called at a single position, multiple records will be created.

Annotate the called variants using public data sources

Personal tools