Welcome to the GenomeQuest Documentation Wiki
Variant Calling Workflow
The variant calling workflow finds SNPs and small insertions and deletions (indels) starting from a database of aligned reads.
Workflow Input and Output
This workflow operates on an alignment database produced by the read mapping or alignment import workflow. It generates a database of SNPs and indels that can be browsed within the GenomeQuest interface or downloaded as an Excel spreadsheet.
Next Steps
The next logical step after running this workflow is to annotate the variant database that this workflow produces with public information like gene positions, known SNPs, etc. This can be done using the variant annotation workflow.
Workflow Launch Page
The launch page lists the following parameters:
- Run Name: set this to something meaningful to be able to find the workflow run later on.
- Mapping: select the database of alignments that will be used to find the SNPs and indels.
- Sample Information: text fields holding things like the sample and experiment ID that will be stored with the variant database.
- Read Selection Parameters: parameters that decide whether an individual read will contribute to a SNP or indel.
  - Minimum Variant Base Quality: minimum base call quality of the variant base in the read.
  - Minimum Flanking Base Quality: minimum base call quality of the bases directly flanking the variant (left and right).
  - Minimum Edge Distance: minimum distance between the variant and the nearest alignment edge (left or right). This filters out sequencing errors, which typically occur at the beginning or end of a read.
  - Maximum Alignment Redundancy: maximum number of times that the exact same alignment may have been seen before (caused by identical reads generating identical alignments). This filters out highly redundant reads.
- Allele Calling Parameters: parameters that decide how much evidence is needed to call a SNP or indel.
  - Minimum Number of Reads: minimum number of reads that pass the read selection criteria (detailed above) and support a particular SNP or indel.
  - Minimum Number of Forward / Reverse Reads: number of those reads that need to align to the forward and / or reverse strand.
  - Maximum Number of Reads: maximum total coverage (number of reads covering the position, regardless of whether they support a variant). When this maximum is reached, no variant is reported at this position.
- Flanking Sequence Length: when this workflow creates a variant database it includes
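As an illustration of how the read selection parameters above could interact, here is a minimal Python sketch of a per-read filter for a single mismatch. It is not GenomeQuest code: the class `ReadSelection`, the function `read_passes`, and the default values are all hypothetical, and it assumes that a read is rejected when its alignment has been seen more often than the configured maximum.

```python
from dataclasses import dataclass


@dataclass
class ReadSelection:
    # Hypothetical defaults; the real values are set on the launch page.
    min_variant_base_quality: int = 20
    min_flanking_base_quality: int = 15
    min_edge_distance: int = 5
    max_alignment_redundancy: int = 3


def read_passes(params, variant_pos, qualities, times_seen_before):
    """Return True if the read may contribute to a variant call.

    variant_pos is the 0-based offset of the variant base in the alignment,
    qualities holds one base call quality per aligned base, and
    times_seen_before counts earlier identical alignments.
    """
    length = len(qualities)
    # Distance to whichever alignment edge is closest.
    edge_distance = min(variant_pos, length - 1 - variant_pos)
    if edge_distance < params.min_edge_distance:
        return False
    # Quality of the variant base itself.
    if qualities[variant_pos] < params.min_variant_base_quality:
        return False
    # Quality of the bases directly left and right of the variant.
    flanks = [qualities[i] for i in (variant_pos - 1, variant_pos + 1)
              if 0 <= i < length]
    if any(q < params.min_flanking_base_quality for q in flanks):
        return False
    # Assumption: reject once the alignment was seen more often than allowed.
    return times_seen_before <= params.max_alignment_redundancy
```

For example, with the hypothetical defaults, a mismatch at offset 2 of a 20-base alignment would be rejected because it lies closer than 5 bases to the left edge.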
Workflow Report
The report of a finished workflow run shows the following things:
- Interactive Reports
  - Unannotated variant database (that can be browsed within the GenomeQuest interface)
  - Excel spreadsheet with all variants
- A link to the Read Mapping workflow that has been used as input.
- Some statistics on how well the variant alleles and variant regions are covered.
- Workflow Parameters coming from the launch page.
Contents of a Variant Database
- The unannotated variant database contains sequence records representing variant alleles.
- A sequence record is about a single variant allele. When multiple alleles have been found at a genomic position, then each allele will have its own sequence record (even when that allele is the reference sequence).
- Each sequence record contains the following information:
  - Variant ID and description
  - Reference sequence ID (the chromosome) and begin and end position on the chromosome.
  - The type of variant (mismatch, insertion, deletion, reference) and its length (in nucleotides)
  - Coverage Allele. The number of reads passing quality tests and thereby confirming the allele this sequence record is about (allele-specific coverage).
  - Coverage Locus. The number of reads passing quality tests, confirming any allele at the genomic position (the total coverage).
  - % Coverage Allele. Percentage of the total coverage confirming this specific allele. Use this number to find heterozygous and homozygous alleles.
  - Reference pileup. See which reads contribute to this specific allele and how they align to the reference genome.
  - Table of all alleles found at this position. This table lists everything found at the position:
    - The good alleles: all alleles (variant or reference) found at the position that pass the allele calling tests.
    - The bad alleles: all alleles found at the position with insufficient information to pass the allele calling test.
    - For all alleles (good or bad), the number of reads passing and the number of reads failing the read selection tests (good or bad reads).
  - Number good alleles. This field can be used to find all heterozygous variant positions.
  - Number bad alleles. Variant positions with a high number of bad alleles are suspicious, as they are likely repetitive genomic regions.
  - Reference sequence around the genomic position. Every record in a variant database is a sequence record that can be searched by text and by sequence (using BLAST for example).
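The relationship between Coverage Allele, Coverage Locus and % Coverage Allele can be sketched in Python as follows. The `AlleleRecord` class and its field names are hypothetical stand-ins for a subset of the fields listed above, not the actual database schema.

```python
from dataclasses import dataclass


@dataclass
class AlleleRecord:
    # Hypothetical subset of the fields of one sequence record.
    chromosome: str
    begin: int
    end: int
    variant_type: str     # "mismatch", "insertion", "deletion" or "reference"
    coverage_allele: int  # reads passing quality tests that confirm this allele
    coverage_locus: int   # reads passing quality tests for any allele here

    @property
    def percent_coverage_allele(self) -> float:
        """Share of the total coverage that confirms this allele, in percent."""
        if self.coverage_locus == 0:
            return 0.0
        return 100.0 * self.coverage_allele / self.coverage_locus
```

Under these assumptions, an allele confirmed by 12 of the 25 good reads at a position has a % Coverage Allele of 48.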
Variant Calling Algorithm
The variant calling algorithm works in the following way:
- Find variant positions
  - Loop through all alignments with a mismatch or indel.
  - For every mismatch or indel, apply the following tests (in this order):
    - See if the mismatch / indel is far enough from the edge of the alignment.
    - Test if the base call quality at and around the mismatch / indel position is good enough.
    - See how many times an identical alignment was encountered (should be below a given number).
  - If the following criteria are satisfied for a specific mismatch / indel, then the genomic position is added to a list of variant positions:
    - The number of reads passing the read selection criteria detailed above is above a certain number.
    - A number of these reads align specifically to the forward and / or reverse strand of the reference genome.
- Merge overlapping and adjacent variant positions.
  - All overlapping and adjacent variant positions reported in the previous step are merged into a single variant position.
- Analyze what is found at each variant position.
  - Loop through all alignments (including those without mismatches or indels).
  - For every alignment that spans a variant position found in the first step:
    - Extract from the NGS read the bases at the variant position (regardless of whether they indicate a mismatch, indel or reference sequence).
    - Apply the same quality tests that are used to find the variant positions:
      - See if the bases are far enough from the edge of the alignment.
      - Test if the base call quality at and around the bases is good enough.
      - See how many times an identical alignment was encountered (should be below a given number).
  - When the number of reads spanning a given variant position reaches the maximum, the entire variant position and everything found there is removed.
- The following output is generated in the form of an Excel spreadsheet and an interactive GenomeQuest sequence database:
  - Good alleles. Specific events (mismatches, indels or reference) for which there is a sufficient number of reads that pass the quality tests. The number of reads that pass / fail the quality tests for every event is listed.
  - Bad alleles. Specific events for which not enough evidence was found to call an allele. Here as well, the number of reads that pass / fail the quality tests is listed.
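Two of the steps above, merging variant positions and sorting the events at one position into good and bad alleles, can be sketched in Python. This is an illustration only, not the GenomeQuest implementation: the function names and the `(allele, strand, passed)` tuple layout are hypothetical, positions are assumed to be inclusive `(begin, end)` pairs, and the sketch assumes that both the forward and the reverse minimum must be met.

```python
from collections import Counter


def merge_positions(positions):
    """Merge overlapping and adjacent (begin, end) positions (inclusive)."""
    merged = []
    for begin, end in sorted(positions):
        if merged and begin <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous position: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def call_alleles(observations, min_reads, min_forward, min_reverse, max_reads):
    """Classify the events seen at one variant position.

    observations is a list of (allele, strand, passed) tuples, one per read
    spanning the position; strand is "+" or "-" and passed tells whether the
    read passed the read selection tests. Returns (good, bad), each mapping
    allele -> (passing reads, failing reads), or None when the total coverage
    reaches max_reads and the position is dropped.
    """
    if len(observations) >= max_reads:
        return None
    passing, failing = Counter(), Counter()
    forward, reverse = Counter(), Counter()
    for allele, strand, passed in observations:
        if passed:
            passing[allele] += 1
            (forward if strand == "+" else reverse)[allele] += 1
        else:
            failing[allele] += 1
    good, bad = {}, {}
    for allele in set(passing) | set(failing):
        enough = (passing[allele] >= min_reads
                  and forward[allele] >= min_forward
                  and reverse[allele] >= min_reverse)
        (good if enough else bad)[allele] = (passing[allele], failing[allele])
    return good, bad
```

Keeping the pass and fail counts for bad alleles as well mirrors the output described above, where both numbers are listed for every event.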
Frequently Asked Questions
How do I select heterozygous / homozygous alleles?
While browsing the variant database, apply a filter on the % Coverage Allele field: for example, >=40% to find heterozygous alleles and >=95% to find homozygous alleles.
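The same filter can be applied to an exported variant list. The Python sketch below assumes each variant is a dictionary with a hypothetical `percent_coverage_allele` key; the actual column names in the Excel export may differ. Note that a >=40% filter on its own also returns alleles above 95%, so the sketch accepts an optional upper bound to separate the two groups, which is an assumption rather than a feature of the interface.

```python
def select_by_percent_coverage(records, minimum, maximum=100.0):
    """Keep records whose % Coverage Allele lies in [minimum, maximum]."""
    return [r for r in records
            if minimum <= r["percent_coverage_allele"] <= maximum]


# Hypothetical example records.
variants = [
    {"id": "v1", "percent_coverage_allele": 48.0},
    {"id": "v2", "percent_coverage_allele": 99.0},
    {"id": "v3", "percent_coverage_allele": 12.0},
]
homozygous = select_by_percent_coverage(variants, 95.0)
heterozygous = select_by_percent_coverage(variants, 40.0, 94.9)
```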
How do I see what other alleles are found at a position?
While browsing the variant database, group the alleles by the Chromosome and Position field (click Grouping / Group By / Chromosome and Position).
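Outside the interface, an equivalent grouping of exported records could look like the following Python sketch. The dictionary keys `chromosome`, `position` and `allele` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def group_by_position(records):
    """Group allele records by (chromosome, position)."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["chromosome"], record["position"])].append(record)
    return dict(groups)


# Hypothetical example: two alleles at chr1:1000, one at chr2:500.
alleles = [
    {"chromosome": "chr1", "position": 1000, "allele": "A"},
    {"chromosome": "chr1", "position": 1000, "allele": "G"},
    {"chromosome": "chr2", "position": 500, "allele": "T"},
]
grouped = group_by_position(alleles)
```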
How do I know if a SNP is already known in dbSNP or which gene it falls in?
You need to annotate the variant database first using the Variant Annotation workflow.