Welcome to the GenomeQuest Documentation Wiki

NGS Reads

From GQ Wiki
Revision as of 13:44, October 7, 2011 by John knepper (Talk | contribs)

What does it do?

  • The NGS Reads workflow (better known as the Read Processing workflow) allows you to upload your raw NGS reads to GenomeQuest servers for use with the GenomeQuest platform. Once uploaded, the reads can be aligned or searched against a large number of public databases maintained by GenomeQuest.



What does it produce?

  • This workflow takes your reads in a number of formats, and produces a GenomeQuest database that can be viewed in the sequence results database browser as well as exported to a number of downstream tools. The processing is implemented through the Read Processing workflow.
  • The supported formats include:
    • FASTA
      • Standard FASTA
      • Other FASTA
      • 454 Standard FASTA
      • Illumina Standard FASTA Format
      • CSFASTA (color space FASTA)
      • Standard Helicos FASTA
    • FASTQ
      • Standard FASTQ
      • 454 Standard FASTQ
      • Illumina Standard FASTQ format with Phred-scaled quality values
      • Illumina FASTQ v1.2
      • Illumina FASTQ v1.3
      • CSFASTQ (color space FASTQ)
    • Helicos SMS
    • Helicos SRF
    • Pacific Biosciences Basecall HDF5
    • Roche 454 SFF
    • Roche 454 multiplexed data



Roche 454 standard FASTA format

The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.

Example of FASTA format

>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
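
The ID rule described above (everything after > up to the first white-space) can be sketched in a few lines of Python. This is only an illustration of the format, not the code GenomeQuest runs; the function name parse_fasta is made up for this example.

```python
def parse_fasta(text):
    """Yield (sequence_id, sequence) pairs from single-line FASTA text."""
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID ends at the first white-space; the rest of the header is ignored.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
        elif seq_id is not None:
            # Each sequence is expected on exactly one line.
            yield seq_id, line
            seq_id = None


example = """>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
"""
records = list(parse_fasta(example))
```

Here records holds two entries, and the ID of the second entry is seqid2 because length=42 is dropped.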



Roche 454 standard FASTQ format

The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values are each on a single line, separated by a line that starts with a plus sign (+). This line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence and the quality value line need to have the same length.

Example of FASTQ format

@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!
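
As a rough illustration of the four-line record layout and the Phred/Sanger encoding (score = ASCII code minus 33), the following Python sketch parses FASTQ text. The function name parse_fastq and the short sample record are hypothetical and only serve this example.

```python
def parse_fastq(text, offset=33):
    """Yield (sequence_id, sequence, phred_scores) from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence and quality lengths differ for {header}")
        # The ID ends at the first white-space of the header line.
        seq_id = header[1:].split()[0]
        # Phred/Sanger: ASCII 33..126 encodes scores 0..93.
        yield seq_id, seq, [ord(c) - offset for c in qual]


sample = "@read1 length=8\nACGTACGT\n+read1\nII55!!#9\n"
parsed = list(parse_fastq(sample))
```

For the sample record the decoded scores are 40, 40, 20, 20, 0, 0, 2 and 24.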



Roche 454 separate sequence and quality files

This format contains sequence IDs, sequences and base call quality values. Sequences and quality values are in separate files. In most cases the file names end in .fna and .qual for sequences and quality values respectively. Both files must contain the same number of sequence IDs in the same order. As with any other FASTA format, sequences and base call quality values must be on a single line. Base call quality values are standard Phred scores as numbers separated by a space.

Example of sequence file in FASTA format (file usually ends with .fna)

>FZKE0LS02RGRJ2
AGAGAGTGGTTCACAGTATTATCGCACACGCACTAACCGGTGAG
>FZKE0LS02PSSZH
ATTATTCACACGTGAGACATATTCACCTGAACGGGTATCGAC

Example of quality file in FASTA format (file usually ends with .qual)

>FZKE0LS02RGRJ2
37 37 37 37 37 37 37 37 37 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 39 39 37 36 36 36 37 37
>FZKE0LS02PSSZH
39 39 39 39 39 39 39 39 39 39 39 40 40 40 40 40 40 40 40 40 39 38 36 39 39 39 39 33 27 23 23 21 28 30 26 35 30 30 29 29 21 21 17
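
The requirement that both files list the same IDs in the same order can be checked while combining them. The Python sketch below assumes the contents of the .fna and .qual files have already been read into strings; the function names and the tiny sample data are hypothetical.

```python
def read_single_line_records(text):
    """Return (id, payload) pairs from FASTA-style text with one payload line per ID."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected one payload line per header line")
    records = []
    for header, payload in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got: {header}")
        records.append((header[1:].split()[0], payload))
    return records


def pair_sequences_with_qualities(fna_text, qual_text):
    """Combine sequence and quality records into (id, sequence, scores) tuples."""
    seqs = read_single_line_records(fna_text)
    quals = read_single_line_records(qual_text)
    # Both files must contain the same IDs in the same order.
    if [seq_id for seq_id, _ in seqs] != [seq_id for seq_id, _ in quals]:
        raise ValueError("sequence IDs differ between the two files")
    paired = []
    for (seq_id, seq), (_, qual_line) in zip(seqs, quals):
        # Quality values are space-separated Phred scores.
        paired.append((seq_id, seq, [int(v) for v in qual_line.split()]))
    return paired


fna = ">r1\nACGT\n>r2\nGG\n"
qual = ">r1\n40 40 30 20\n>r2\n12 8\n"
paired = pair_sequences_with_qualities(fna, qual)
```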



Roche 454 Sequence Flowgram Format (SFF)

This format comes as a binary file (not a text file) that contains sequence IDs, sequences, base call quality values and can contain information on how sequences need to be clipped.



Illumina standard FASTA format

The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.

Example of FASTA format

>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA



Illumina standard FASTQ format

The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values are each on a single line, separated by a line that starts with a plus sign (+). This line can (but does not have to) contain the sequence ID again. The sequence and the quality value line need to have the same length.

The following quality value formats can be found with the Illumina FASTQ format:

  • ASCII encoded Phred scores, ASCII characters 33-126 encoding scores of 0 to 93
  • ASCII encoded Illumina pipeline 1.2 or earlier scores, ASCII characters 59-126 encoding scores of -5 to 62
  • ASCII encoded Illumina pipeline 1.3 or later scores, ASCII characters 64-126 encoding scores of 0 to 62

If the sequence data has been recently produced then the quality values are most likely in "ASCII encoded Illumina pipeline 1.3 or later" format. If you are unsure about the format please contact your sequence service provider.

Example of FASTQ format

@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!
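
The three quality encodings listed above differ only in their ASCII offset and valid score range. The Python sketch below decodes a quality string under a chosen encoding and rejects characters outside the range. The encoding labels (phred33, illumina_1_2, illumina_1_3) are names invented for this example, and the Solexa-to-Phred conversion is the commonly published formula, not something taken from the workflow itself.

```python
import math

# (ASCII offset, lowest score, highest score) per encoding; labels are hypothetical.
ENCODINGS = {
    "phred33": (33, 0, 93),
    "illumina_1_2": (64, -5, 62),
    "illumina_1_3": (64, 0, 62),
}


def decode_quality(qual, encoding):
    """Decode an ASCII quality string into a list of integer scores."""
    offset, low, high = ENCODINGS[encoding]
    scores = [ord(c) - offset for c in qual]
    for score in scores:
        if not low <= score <= high:
            raise ValueError(f"score {score} is outside the range of {encoding}")
    return scores


def solexa_to_phred(solexa_score):
    """Convert a pipeline 1.2 (Solexa-scaled) score to the Phred scale."""
    return 10 * math.log10(10 ** (solexa_score / 10) + 1)
```

A character below ASCII 64 (for example ;) decodes under illumina_1_2 but raises an error under illumina_1_3, which is one way to notice that the wrong encoding was selected.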



Illumina standard SCARF format

The Illumina SCARF format contains sequence IDs, sequences and base call quality values on a single line separated by colons (:). All fields before the sequence are joined together to create a unique sequence ID.

The following quality value formats can be found with the Illumina SCARF format:

  • ASCII encoded Phred scores, ASCII characters 33-126 encoding scores of 0 to 93
  • Numerical Phred scores, space-separated numbers from 0 to 93
  • ASCII encoded Illumina pipeline 1.2 or earlier scores, ASCII characters 59-126 encoding scores of -5 to 62
  • ASCII encoded Illumina pipeline 1.3 or later scores, ASCII characters 64-126 encoding scores of 0 to 62

If the sequence data has been recently produced then the quality values are most likely in "ASCII encoded Illumina pipeline 1.3 or later" format. If you are unsure about the format please contact your sequence service provider.

Example of Illumina SCARF format with Phred quality values

HWI-EAS68_2_FC205D4:7:1:565:389:GAAGCAAAAAGAAGACTAATAAAAACTTTACACTTT:>>>>=>>>>>???>?>>?>?;B<8>=<7864767/5
HWI-EAS68_2_FC205D4:7:1:164:493:GTGTGCATGTGTATGTGTTTTTTTTTTTTTTTTTCT:>'>>5=.>54'?.>?6,;49B733+5+(40351/0-

Example of Illumina SCARF format with numerical Phred quality values

HWI-EAS68_2_FC205D4:7:1:565:389:TAAATGTTTTCAACGTTAAACTTCTCTA:21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
HWI-EAS68_2_FC205D4:7:1:164:493:AGTAAACGATAACGCTTTCACAGACATT:21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
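
A SCARF line can be split as sketched below in Python. Two assumptions are made that the text above does not state: the ID always consists of five colon-separated fields (as in the examples), and the joined ID keeps the colons. Because an ASCII encoded quality string may itself contain a colon, the line is split a limited number of times so that the quality field stays intact. The function name and sample lines are hypothetical.

```python
def parse_scarf_line(line, id_fields=5, offset=33):
    """Split one SCARF line into (sequence_id, sequence, scores)."""
    # Split only id_fields + 1 times so colons inside the quality string survive.
    parts = line.rstrip("\n").split(":", id_fields + 1)
    if len(parts) != id_fields + 2:
        raise ValueError("line does not have the expected number of fields")
    seq_id = ":".join(parts[:id_fields])  # assumption: ID fields are joined with ':'
    seq = parts[id_fields]
    qual = parts[id_fields + 1]
    if " " in qual:
        # Numerical Phred scores separated by spaces.
        scores = [int(v) for v in qual.split()]
    else:
        # ASCII encoded scores; offset 33 is Phred, 64 the Illumina pipeline encodings.
        scores = [ord(c) - offset for c in qual]
    if len(scores) != len(seq):
        raise ValueError("sequence and quality lengths differ")
    return seq_id, seq, scores


ascii_record = parse_scarf_line("MACHINE_1:7:1:565:389:ACGT:II:5")
numeric_record = parse_scarf_line("MACHINE_1:7:1:164:493:ACG:21 21 20")
```

Note that a read of length one with a numerical quality value would be misread as ASCII by this simple space test, so a real parser would need the quality format to be stated explicitly.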



Standard FASTA format

The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.

Example of FASTA format

>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA



Standard FASTQ format

The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values are each on a single line, separated by a line that starts with a plus sign (+). This line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence and the quality value line need to have the same length.

Example of FASTQ format

@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!



Next Steps:

  • The next logical step after running this workflow is to map the reads against a reference genome, using one of the many assemblies or builds that GenomeQuest maintains from public databases such as NCBI, or to assemble them de novo. Mapping can be done with the Whole Genome Mapping, ChIP-Seq, and RNA-Seq workflows, whereas de novo assembly and rapid annotation can be done with the Newbler, Velvet, or Metagenomics workflows. Additionally, reads may be searched against a number of databases using the BLAST Search or IP Search workflows.



Important Parameters that are used in the Workflow:

  • The launch page lists the following parameters:


  • What do you want to call this database?
    • An intuitive name to identify your results later on.
  • Species:
    • The species the reads came from.
  • Sample Source:
    • What kind of nucleic acid polymer (genome, transcriptome, small-RNA (sRNA), unknown)
  • Sequencing Machine Vendor:
    • What vendor produced the reads (454, Life Tech, Illumina, etc.)
  • Sequencing Machine:
    • The sequencing machine type (454 Flex, Illumina GA IIx, Solid 4.0)
  • Sequence File Format:
    • The format-type the reads are in (FASTQ, FASTA, Helicos SMS, etc.)
  • I have paired end reads in separate files:
    • Check this if the reads are paired-end and the forward and reverse reads are in two separate files.
  • Sequence File:
    • The sequence file pertaining to the reads about to be uploaded.
  • Quality File:
    • The quality file pertaining to the reads about to be uploaded.
  • Trim ambiguous residues (Ns) from both sides of the sequence:
    • Trim any no-calls or undetermined residues (Ns) from both ends of each read.
  • Trim bases with Phred quality below 'x' from left side (5' end) of the sequence:
    • Trim residues that fall below threshold 'x' from left side within read.
  • Trim bases with Phred quality below 'x' from right side (3' end) of the sequence:
    • Trim residues that fall below threshold 'x' from right side within read.
  • Remove reads with more than 'x' residues with quality below 'n':
    • Remove all reads with more than 'x' residues whose quality falls below threshold 'n'.
  • Remove adaptor sequence:
    • If yes, provide linkers to remove, what side they are present on (5' or 3' end), and their minimum length.
  • Remove reads that contain ambiguous residues (Ns):
    • Remove reads that contain any ambiguous residues.
  • Remove reads shorter than 'x' residues:
    • Remove reads that are shorter than threshold 'x'.


  • For paired end or mate pair data:


  • Dataset # 1:
    • The forward-oriented reads
  • Dataset # 2:
    • The reverse-oriented reads
  • Insert Size:
    • Estimated size, and standard deviation
  • Paired Library Type:
    • Paired End or Mate Pair



Report Page:

  • The report of a finished workflow run shows the following things:
    • Processing Parameters:
      • File information
      • Sequence Quality Trimming and Filtering params
      • Adapter Clipping
      • General Cleaning
      • Sequence Mode
      • Multiplexing
    • Original File Cleaning:
      • Number of reads processed
      • No. of reads removed total
      • Reads removed because of ambiguous residues
      • Reads removed in overall quality filtering
      • Reads removed because too short
      • Reads removed to keep only plain pairs
      • Number of reads kept
    • Database Details:
      • Total number of sequences (kept)
      • Read length distribution (longest, shortest, mean, and standard deviation)



A Description of the Workflow's Interactive Results:

  • The processed database contains sequence records (one record per sequence).
    • Each sequence record describes a single read.
    • Each sequence record contains the following information:
      • Identifier: the name of the sequence record
      • Database Name: the name of the sequence record's database
      • Base Quality Values: the quality score of each base of the read
      • Sequence Length
      • The raw sequence itself: the actual base-content of the read



A High-level and Algorithmic Description of this workflow:

  • The NGS reads algorithm works in the following way:
    • Walk through all reads and apply the trimming and filtering options specified in the launch page parameters
    • Remove all reads that fail the filters
    • Return the results as a browsable sequence database in GenomeQuest's biofacet format
    • Make the database available to users for downstream analyses
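
The walk-and-filter steps above can be pictured with the following Python sketch. It is not the workflow's implementation: the function process_reads, the filter names and the sample reads are all hypothetical, and the per-filter counts merely mimic the kind of numbers shown on the report page.

```python
def process_reads(reads, filters):
    """Apply named filters to (id, sequence) reads; return kept reads and removal counts."""
    kept = []
    removed = {name: 0 for name, _ in filters}
    for read in reads:
        for name, should_remove in filters:
            if should_remove(read):
                # A read is counted once, under the first filter that rejects it.
                removed[name] += 1
                break
        else:
            kept.append(read)
    return kept, removed


reads = [("r1", "ACGTACGT"), ("r2", "ACNT"), ("r3", "ACG")]
filters = [
    ("ambiguous_residues", lambda read: "N" in read[1]),
    ("too_short", lambda read: len(read[1]) < 4),
]
kept, removed = process_reads(reads, filters)
```

With these sample reads only r1 is kept; r2 is removed for containing an N and r3 for being shorter than four residues.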



FAQs:

  • Frequently Asked Questions:
    • Does the workflow accept multiplexed data?
      • The workflow accepts multiplexed data only in Roche 454's SFF format.
    • What are ambiguous residues (the Ns)?
      • Ambiguous residues are designated by N according to the IUPAC/IUB codes. An N means that the base could not be determined during sequencing, for example because the quality was too poor. In other contexts N can designate "any base", but not in sequencing reads and not here.
    • Does the NGS Reads workflow accept paired end data?
      • Yes, it accepts both paired end as well as mate pair data.
    • Is there a maximum number of reads allowed for import?
      • No, there is no upper limit. Large numbers of reads, including data sets expected to generate very high coverage, can be uploaded.



Miscellaneous:

Dealing with paired end / mate pair sequences

The processing of paired end / mate pair sequences is supported in the following way:

  • Forward and reverse sequences are in separate files. In this case simply check the check box that says "I have paired end reads (forward/reverse) in separate files" on the Read Processing launch page, and you will be asked to choose two files for processing (one for forward and one for reverse reads).
  • With 454 paired-end technology, forward and reverse sequences are physically joined together by a paired end linker and sequenced as a single sequence. In this case you can copy & paste a linker sequence and the workflow will split the concatenated sequences into forward and reverse sequences. Currently this is only supported for 454 Sequence Flowgram Format (SFF).

The processing workflow will automatically add _F or _R to the sequence ID of each sequence.



Sequence quality trimming and filtering

Trim ambiguous residues (Ns) from both sides of the sequence

This option will remove ambiguous residues (N in the case of nucleotide sequences, or . in the case of color space sequences) from both ends of each sequence. This removes the trailing stretches of Ns that some analysis pipelines add to reads to "fill them up" to a specified length, as well as bases that the sequencing technology was unable to call. For example, SOLiD reports no-calls as '.'; these are typically translated to 'N' because they are unknown, although some mapping algorithms instead substitute the base from the reference sequence.
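
A minimal Python sketch of end trimming is shown below, assuming that N and . are the only ambiguous symbols and that a per-base quality list, if present, should be trimmed in step with the sequence. The function name trim_ambiguous_ends is hypothetical.

```python
def trim_ambiguous_ends(seq, qual=None, ambiguous="N."):
    """Strip ambiguous residues from both ends; interior ones are left in place."""
    start, end = 0, len(seq)
    while start < end and seq[start] in ambiguous:
        start += 1
    while end > start and seq[end - 1] in ambiguous:
        end -= 1
    # Keep the quality values aligned with the remaining bases.
    trimmed_qual = qual[start:end] if qual is not None else None
    return seq[start:end], trimmed_qual


trimmed, _ = trim_ambiguous_ends("NNACGNTNN")
```

The interior N in the example survives, because only the ends of the read are trimmed; a read made up entirely of Ns is reduced to an empty sequence.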



Trimming poor quality sequence ends

This option will trim low quality residues from the 5' and/or 3' end of each read. By default, a Phred base call quality of 11 or below is considered to be a low quality residue, but this value can be changed. Doing this will reduce the number of potential sequencing errors in a read, thereby increasing the chance that a read can be aligned afterward. In just about every sequencing technology or platform, chemistry deteriorates towards the ends of sequences, and, thus, the quality values decrease in a corresponding manner.
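
The trimming of low quality ends can be sketched in Python as follows. The sketch assumes that trimming stops at the first base whose score is above the cutoff and that a score equal to the cutoff (11 by default) still counts as low quality; the function and parameter names are hypothetical.

```python
def trim_low_quality_ends(seq, scores, low_quality_max=11,
                          trim_left=True, trim_right=True):
    """Trim bases with score <= low_quality_max from the 5' and/or 3' end."""
    start, end = 0, len(seq)
    if trim_left:
        while start < end and scores[start] <= low_quality_max:
            start += 1
    if trim_right:
        while end > start and scores[end - 1] <= low_quality_max:
            end -= 1
    return seq[start:end], scores[start:end]


seq, scores = trim_low_quality_ends("ACGTACGT", [5, 11, 30, 40, 8, 35, 10, 2])
```

In this example two bases are removed from each end. The base with score 8 in the middle of the read is kept, since only the ends are trimmed.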



Overall quality filtering

This option will remove reads of overall bad quality by counting the number of low quality residues in the first part of a sequence. Doing this will speed up the alignment step and reduce the number of false positive mismatches, insertions and deletions in the alignment step.
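
One way to picture this filter is the Python sketch below. The length of the "first part" of the sequence is not specified here, so it appears as a hypothetical window parameter; max_low_quality and min_quality stand in for the 'x' and 'n' values on the launch page.

```python
def passes_overall_quality(scores, max_low_quality=3, min_quality=12, window=None):
    """Return True if the read has at most max_low_quality scores below min_quality."""
    # Only the first `window` bases are inspected; None means the whole read.
    part = scores if window is None else scores[:window]
    low_count = sum(1 for score in part if score < min_quality)
    return low_count <= max_low_quality


read_scores = [40, 5, 5, 40, 5, 5, 40]
keep_whole_read = passes_overall_quality(read_scores)
keep_first_four = passes_overall_quality(read_scores, window=4)
```

The example read has four low quality residues in total, so it is rejected when the whole read is inspected, but it passes when only the first four bases are counted.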



Removing adapter sequences

Remove known adapter(s) from each sequence

Use this option to remove linker sequences from the left (5') and right (3') ends. For example, when the left linker sequence is AAATTTGGCCGC and the minimum linker length is 5, it will try to remove the following suffixes of the linker from the left end of each read:

AAATTTGGCCGC
 AATTTGGCCGC
  ATTTGGCCGC
   TTTGGCCGC
    TTGGCCGC
     TGGCCGC
      GGCCGC
       GCCGC

When the right linker sequence is CCGGTTGC and the minimum linker length is 5, it will try to remove the following prefixes of the linker from the right end of each read:

CCGGTTGC
CCGGTTG
CCGGTT
CCGGT
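
The candidate lists for both linkers can be generated as sketched below in Python. Trying the longest candidate first and requiring an exact match are assumptions of this sketch, and all function names are hypothetical.

```python
def left_linker_candidates(linker, min_length):
    """Suffixes of the linker, longest first, down to min_length bases."""
    return [linker[i:] for i in range(len(linker) - min_length + 1)]


def right_linker_candidates(linker, min_length):
    """Prefixes of the linker, longest first, down to min_length bases."""
    return [linker[:n] for n in range(len(linker), min_length - 1, -1)]


def remove_linkers(seq, left_linker=None, right_linker=None, min_length=5):
    """Remove the longest matching linker remnant from each end of the read."""
    if left_linker:
        for candidate in left_linker_candidates(left_linker, min_length):
            if seq.startswith(candidate):
                seq = seq[len(candidate):]
                break
    if right_linker:
        for candidate in right_linker_candidates(right_linker, min_length):
            if seq.endswith(candidate):
                seq = seq[:len(seq) - len(candidate)]
                break
    return seq


cleaned = remove_linkers("TTGGCCGCACGTACGTCCGGTT",
                         left_linker="AAATTTGGCCGC", right_linker="CCGGTTGC")
```

In the example the read starts with the 8-base linker remnant TTGGCCGC and ends with the 6-base remnant CCGGTT; both are removed, leaving ACGTACGT.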

Additionally, this option allows one to remove known adapter sequences from the 5' and/or 3' end of each sequence using regular expressions. Use this option when the sequence of the adapter is known.

The following regular expression syntax applies:

^      Match the beginning of the line
$      Match the end of the line
.      Match any character
[]     Character class
()     grouping
+      Match 1 or more times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times

Example adapter definitions:

>adapter1
GATCGGAAGAGCTC
>adapter1a
^GATCGGAAGAGCTC
>adapter1b
^ATCGGAAGAGCTC
>adapter1c
^TCGGAAGAGCTC
>adapter1d
^CGGAAGAGCTC
>adapter1e
^GGAAGAGCTC
>adapter1a
GATCGGAAGAGCTC$

>adapter1b
GATCGGAA[ACGT]AGCTC$
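
To show how anchored adapter patterns such as those above behave, the Python sketch below applies them with the standard re module. Python's regular expression syntax covers the constructs listed in the table, but the workflow's own matching engine may differ in details; clip_adapters is a hypothetical name.

```python
import re


def clip_adapters(seq, patterns):
    """Remove the first match of each adapter pattern from the read."""
    for pattern in patterns:
        seq = re.sub(pattern, "", seq, count=1)
    return seq


# ^ anchors the adapter to the 5' end of the read.
five_prime = clip_adapters("GATCGGAAGAGCTCACGTACGT", ["^GATCGGAAGAGCTC"])

# $ anchors it to the 3' end; [ACGT] tolerates any base at that position.
three_prime = clip_adapters("ACGTACGTGATCGGAATAGCTC", ["GATCGGAA[ACGT]AGCTC$"])
```

Both calls leave ACGTACGT. An adapter anchored with ^ is not removed when it occurs in the middle of a read.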



Split up the dataset using barcodes and sequence ids within the sequences

Use this option to de-multiplex the reads file into multiple separate sequence databases. The sequences are split based on barcodes and on regular expressions applied to the sequence identifiers. You can also set the name suffixes of the resulting sequence databases. For example,

barcode: AACAATCTA, sequence id regex: ^SRRR153, suffix: db1

It will put all sequences whose identifiers start with SRRR153 and contain barcode AACAATCTA into a new database with db1 as name suffix. You will be able to see all the databases produced in the report page of the workflow. All the non-matching sequences will be put into a database with suffix _NOMATCH.

You can also append '/n' to the barcode, where n is a number, to allow errors in the barcode matching. For example, AACAATCTA/1 allows 1 error during barcode matching.
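
The splitting rule can be sketched in Python as follows. Several points are assumptions of this sketch rather than documented behavior: an "error" is treated as a single-base mismatch (no insertions or deletions), the barcode may occur anywhere in the read, and the first matching rule wins. All function names are hypothetical.

```python
import re


def parse_barcode(spec):
    """Split 'AACAATCTA/1' into the barcode and the number of allowed errors."""
    barcode, _, errors = spec.partition("/")
    return barcode, int(errors) if errors else 0


def contains_barcode(seq, barcode, max_errors=0):
    """Return True if some window of the read matches the barcode within max_errors mismatches."""
    k = len(barcode)
    for i in range(len(seq) - k + 1):
        mismatches = sum(1 for a, b in zip(seq[i:i + k], barcode) if a != b)
        if mismatches <= max_errors:
            return True
    return False


def demultiplex(reads, rules):
    """Assign each (id, sequence) read to a database suffix; rules are (barcode, id_regex, suffix)."""
    databases = {}
    for seq_id, seq in reads:
        target = "_NOMATCH"
        for spec, id_regex, suffix in rules:
            barcode, max_errors = parse_barcode(spec)
            if re.search(id_regex, seq_id) and contains_barcode(seq, barcode, max_errors):
                target = suffix
                break
        databases.setdefault(target, []).append(seq_id)
    return databases


rules = [("AACAATCTA/1", "^SRRR153", "db1")]
reads = [
    ("SRRR153.1", "AACAATCTAGGGG"),   # exact barcode
    ("SRRR153.2", "AACAATGTAGGGG"),   # one mismatch, allowed by /1
    ("SRRR153.3", "TTTTTTTTTTTTT"),   # no barcode
    ("XYZ.1", "AACAATCTAGGGG"),       # barcode present, ID does not match
]
result = demultiplex(reads, rules)
```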



General sequence cleanup

Remove reads that contain ambiguous residues (Ns)

Ambiguous residues indicate that the sequencer was unable to call a base, that is, to determine the base's nucleotide type (A/C/G/T). This can be the result of poor sequencing chemistry or of a low quality or fluorescence score, both of which lead to the same issue: the inability to determine whether a base is a specific nucleotide. These poor quality bases are therefore given an 'N' to reflect the sequencer's inability to detect or "call" any nucleotide. Depending on the number of Ns, this can significantly affect a read's ability to map and can hinder the accurate detection of SNPs and/or indels. Therefore, the Read Processing workflow allows users to filter and remove any reads that contain ambiguous N residues. This is not to be confused with removing Ns from either side of a read, which is explained under "Trim ambiguous residues (Ns) from both sides of the sequence".



Remove reads shorter than X residues

Depending on the application, read lengths are sometimes not consistent across a library. A good example is the sequencing of poly-adenylated RNAs. When sequencing poly(A)-RNAs, you receive a mixture of RNA types: fRNAs, ncRNAs, pre-mRNAs, and mRNAs. Which types of RNA you receive depends on the methods used during library preparation, for example whether an RNA filtering kit was used. When an RNA is sequenced up to a specified length but is much shorter than that, either because of its physical size or because of its fragmentation size (say it was fragmented at the end of an mRNA), the result is a mixture of cDNA lengths when sequencing is finished. For reasons such as these, the Read Processing workflow allows users to filter and remove reads shorter than a specified length.



Multiplexing

Split up the sequencing run using barcodes within the sequences

Currently, this method is only available for 454's SFF format.
