Welcome to the GenomeQuest Documentation Wiki

ReadProcessing

About the Read Processing workflow

The read processing workflow creates GenomeQuest sequence databases from raw sequence files while allowing the user to clean up and organize the data. NGS sequence data follows this flow:

  • Prepare an experimental sample and have it sequenced
  • Upload the raw sequence file(s)
  • Process the file(s) and turn them into a GenomeQuest sequence database
  • Run workflow(s) on the sequence database(s) and interact with the results


The read processing workflow supports the following operations:

  • Dealing with paired end / mate pair sequences
  • Sequence quality trimming and filtering - trimming low quality bases from the sequence ends, and removing sequences of overall low quality
  • Removing adapter sequences
  • General sequence cleanup - removing short sequences, redundant sequences and sequences that are obvious repeats
  • Multiplexing - splitting up a single sequencing run into multiple separate experiments


Supported sequence formats

Sequences come in many different formats, and we try to support all of them. If your sequence format is not in the list, feel free to contact us at support@genomequest.com.


Roche 454

Roche 454 standard FASTA format

The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.

Example of FASTA format

>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
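
The rules above can be illustrated with a small reader. This is only a sketch of the format as described here, not the parser GenomeQuest uses; the function name parse_fasta is hypothetical, and the digit check applies to nucleotide FASTA only.

```python
def parse_fasta(lines):
    """Yield (sequence_id, sequence) pairs from single-line FASTA records."""
    seq_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID ends at the first white-space; the rest of the header is ignored.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
        elif seq_id is not None:
            # Sequences may not contain numbers or white-space characters.
            if any(ch.isdigit() or ch.isspace() for ch in line):
                raise ValueError(f"invalid characters in sequence {seq_id!r}")
            yield seq_id, line
            seq_id = None


records = list(parse_fasta([
    ">seqid1",
    "ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG",
    ">seqid2 length=42",
    "GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA",
]))
print([seq_id for seq_id, _ in records])
```

Note that the second ID comes out as seqid2, because length=42 follows the first white-space.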

Roche 454 standard FASTQ format

The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values each occupy a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence line and the quality value line need to have the same length.

Example of FASTQ format

@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!
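
As a rough illustration of the four-line record layout and the Phred/Sanger encoding (score = ASCII code minus 33), a reader could look like the sketch below. The function name parse_fastq and the short test record are made up for this example; this is not the GenomeQuest implementation.

```python
def parse_fastq(lines):
    """Yield (sequence_id, sequence, phred_scores) from four-line FASTQ records."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        record_no = i // 4 + 1
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record {record_no}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in record {record_no}")
        # The ID ends at the first white-space of the header line.
        seq_id = (header[1:].split() or [""])[0]
        # Phred/Sanger encoding: ASCII 33 to 126 encodes scores 0 to 93.
        scores = [ord(ch) - 33 for ch in qual]
        yield seq_id, seq, scores


reads = list(parse_fastq(["@read1 length=4", "ACGT", "+read1", "!+5I"]))
print(reads)
```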

Roche 454 separate sequence and quality files

This format contains sequence IDs, sequences and base call quality values. Sequences and quality values are in separate files. In most cases the file names end in .fna and .qual for sequences and quality values respectively. Both files must contain the same number of sequence IDs in the same order. As with any other FASTA format, sequences and base call quality values must be on a single line. Base call quality values are standard Phred scores, written as numbers separated by a space.

Example of sequence file in FASTA format (file usually ends with .fna)

>FZKE0LS02RGRJ2
AGAGAGTGGTTCACAGTATTATCGCACACGCACTAACCGGTGAG
>FZKE0LS02PSSZH
ATTATTCACACGTGAGACATATTCACCTGAACGGGTATCGAC

Example of quality file in FASTA format (file usually ends with .qual)

>FZKE0LS02RGRJ2
37 37 37 37 37 37 37 37 37 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 39 39 37 36 36 36 37 37
>FZKE0LS02PSSZH
39 39 39 39 39 39 39 39 39 39 39 40 40 40 40 40 40 40 40 40 39 38 36 39 39 39 39 33 27 23 23 21 28 30 26 35 30 30 29 29 21 21 17
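
The pairing rule (same IDs, same order, one line per record) can be sketched as follows. The helper names are hypothetical, and the sketch deliberately does not check that the number of quality values equals the sequence length, since this page does not state that requirement for separate files.

```python
def read_single_line_records(lines):
    """Return [(sequence_id, payload)] from single-line FASTA-style records."""
    rows = [line.strip() for line in lines if line.strip()]
    records = []
    for header, payload in zip(rows[0::2], rows[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got {header!r}")
        records.append((header[1:].split()[0], payload))
    return records


def merge_fna_qual(fna_lines, qual_lines):
    """Combine a sequence file and a quality file into (id, sequence, scores) tuples."""
    seqs = read_single_line_records(fna_lines)
    quals = read_single_line_records(qual_lines)
    if len(seqs) != len(quals):
        raise ValueError("files contain a different number of sequence IDs")
    merged = []
    for (seq_id, seq), (qual_id, qual) in zip(seqs, quals):
        if seq_id != qual_id:
            raise ValueError(f"ID order differs: {seq_id!r} vs {qual_id!r}")
        # Quality values are space-separated numbers.
        merged.append((seq_id, seq, [int(value) for value in qual.split()]))
    return merged


merged = merge_fna_qual(
    [">FZKE0LS02RGRJ2", "AGAG", ">FZKE0LS02PSSZH", "ATTA"],
    [">FZKE0LS02RGRJ2", "37 37 40 40", ">FZKE0LS02PSSZH", "39 39 33 27"],
)
print(merged[1])
```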

Roche 454 Sequence Flowgram Format (SFF)

This format comes as a binary file (not a text file) that contains sequence IDs, sequences, base call quality values and can contain information on how sequences need to be clipped.


Illumina

Illumina standard FASTA format

The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.

Example of FASTA format

>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA

Illumina standard FASTQ format

The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values each occupy a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. The sequence line and the quality value line need to have the same length.

The following quality value formats can be found with the Illumina FASTQ format:

  • ASCII encoded Phred scores, ASCII characters 33-126 encoding scores of 0 to 93
  • ASCII encoded Illumina pipeline 1.2 or earlier scores, ASCII characters 59-126 encoding scores of -5 to 62
  • ASCII encoded Illumina pipeline 1.3 or later scores, ASCII characters 64-126 encoding scores of 0 to 62

If the sequence data has been recently produced then the quality values are most likely in "ASCII encoded Illumina pipeline 1.3 or later" format. If you are unsure about the format, please contact your sequence service provider.
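
The three encodings listed above differ only in their ASCII offset and allowed score range. The sketch below decodes a quality string into raw scores and rejects characters outside the stated range; the encoding names and the function decode_quality are labels invented for this example.

```python
# (ASCII offset, lowest score, highest score) for each encoding listed above.
ENCODINGS = {
    "phred": (33, 0, 93),
    "illumina_1.2_or_earlier": (64, -5, 62),
    "illumina_1.3_or_later": (64, 0, 62),
}


def decode_quality(quality, encoding):
    """Return the numeric scores encoded in an ASCII quality string."""
    offset, lowest, highest = ENCODINGS[encoding]
    scores = [ord(ch) - offset for ch in quality]
    for ch, score in zip(quality, scores):
        if not lowest <= score <= highest:
            raise ValueError(f"character {ch!r} is outside the {encoding} range")
    return scores


print(decode_quality("I", "phred"))
print(decode_quality("h", "illumina_1.3_or_later"))
print(decode_quality(";", "illumina_1.2_or_earlier"))
```

A range check like this can rule an encoding out (a character below ASCII 59 cannot be an Illumina pipeline score), but it cannot always confirm one, which is why asking the sequence service provider remains the safer route.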

Example of FASTQ format

@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!

Illumina standard SCARF format

The Illumina SCARF format contains sequence IDs, sequences and base call quality values on a single line separated by colons (:). All information before the sequence will be concatenated to create a unique sequence ID.

The following quality value formats can be found with the Illumina SCARF format:

  • ASCII encoded Phred scores, ASCII characters 33-126 encoding scores of 0 to 93
  • Numerical Phred scores, space-separated numbers from 0 to 93
  • ASCII encoded Illumina pipeline 1.2 or earlier scores, ASCII characters 59-126 encoding scores of -5 to 62
  • ASCII encoded Illumina pipeline 1.3 or later scores, ASCII characters 64-126 encoding scores of 0 to 62

If the sequence data has been recently produced then the quality values are most likely in "ASCII encoded Illumina pipeline 1.3 or later" format. If you are unsure about the format, please contact your sequence service provider.

Example of Illumina SCARF format with Phred quality values

HWI-EAS68_2_FC205D4:7:1:565:389:GAAGCAAAAAGAAGACTAATAAAAACTTTACACTTT:>>>>=>>>>>???>?>>?>?;B<8>=<7864767/5
HWI-EAS68_2_FC205D4:7:1:164:493:GTGTGCATGTGTATGTGTTTTTTTTTTTTTTTTTCT:>'>>5=.>54'?.>?6,;49B733+5+(40351/0-

Example of Illumina SCARF format with numerical Phred quality values

HWI-EAS68_2_FC205D4:7:1:565:389:TAAATGTTTTCAACGTTAAACTTCTCTA:21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
HWI-EAS68_2_FC205D4:7:1:164:493:AGTAAACGATAACGCTTTCACAGACATT:21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
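
A SCARF line can be taken apart roughly as sketched below. Several things here are assumptions rather than documented behavior: that exactly five fields precede the sequence (as in the examples above), that the ID parts are joined with an underscore, and that ASCII encoded values are Phred scores with offset 33. The split is limited so that a colon occurring inside an ASCII encoded quality string is not treated as a separator.

```python
def parse_scarf_line(line, id_fields=5, joiner="_"):
    """Split one SCARF line into (sequence_id, sequence, scores)."""
    # Limit the split so colons inside the quality string are preserved.
    parts = line.rstrip("\n").split(":", id_fields + 1)
    if len(parts) != id_fields + 2:
        raise ValueError("expected ID fields, a sequence and quality values")
    seq_id = joiner.join(parts[:id_fields])
    sequence, quality = parts[id_fields], parts[id_fields + 1]
    if " " in quality:
        # Numerical Phred scores, separated by spaces.
        scores = [int(value) for value in quality.split()]
    else:
        # Assumed here: ASCII encoded Phred scores (offset 33).
        scores = [ord(ch) - 33 for ch in quality]
    if len(scores) != len(sequence):
        raise ValueError(f"sequence/quality length mismatch in {seq_id!r}")
    return seq_id, sequence, scores


print(parse_scarf_line("HWI-EAS68_2_FC205D4:7:1:565:389:ACGT:21 21 21 21"))
```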


Life Tech Solid

Solid standard FASTA format

Solid Color Space standard FASTA format ; Solid Nucleotide Space standard FASTA format

The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.

Example of nucleotide space FASTA format

>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA

Sequences are either in nucleotide or color space format. GenomeQuest will align color space sequences in color space and only translate into nucleotide space when the alignment is needed for further analysis. As can be seen in the example below, color space sequences need to start with an anchoring nucleotide.

Example of color space FASTA format

>853_64_807_F3
T00021321231230002103301312101101323
>853_64_861_F3
T00101210200133033001332333320203311

Solid standard FASTQ format

The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values each occupy a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence line and the quality value line need to have the same length.

Example of FASTQ format

@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!


Sequences are either in nucleotide or color space format. GenomeQuest will align color space sequences in color space and only translate into nucleotide space when the alignment is needed for further analysis. As can be seen in the example below, color space sequences need to start with an anchoring nucleotide.

Example of color space FASTQ format

@853_64_807_F3
T00021321231230002103301312101101323
+
>>>>=>>>>>???>?>>?>?;B<8>=<7864767/5
@853_64_861_F3
T00101210200133033001332333320203311
+
>>>>=??>?>>?>?;B<8>=<7864767/5323334


Solid separate sequences and quality files

Solid Color Space separate sequence and quality files ; Solid Nucleotide Space separate sequence and quality files

This format contains sequence IDs, sequences and base call quality values. Sequences and quality values are in separate files. In most cases the file names end in .fna and .qual for sequences and quality values respectively. Both files must contain the same number of sequence IDs in the same order. As with any other FASTA format, sequences and base call quality values must be on a single line. Base call quality values are standard Phred scores, written as numbers separated by a space.

Example of sequence file in FASTA format (file usually ends with .fna)

>853_15_64_F3
CAGCACTAGCATTTACGAGAGCAGCGACTTAGCAGC
>853_15_79_F3
GCGCATCAGCATAGCAGCAGTAGCAGCGATTAGCAG

Example of a standard SOLiD output sequence file in CSFASTA (ColorSpace FASTA) format (file usually ends with .csfasta)

>853_15_64_F3
T0120301212131221133200
>853_15_79_F3
T0131212301230120120312

Example of quality file in FASTA format (file usually ends with .qual)

>853_15_64_F3
-1 6 -1 5 2 -1 5 -1 5 12 -1 8 -1 5 3 7 13 23 5 2 2 16 10 7 -1 3 7 5 5 -1 6 -1 5 10 -1 
>853_15_79_F3
-1 24 -1 25 25 -1 25 -1 25 24 -1 27 -1 28 21 25 25 26 27 28 26 24 26 25 -1 27 22 27 29 -1 26 -1 26 29 -1 


Sequences are either in nucleotide or color space format. GenomeQuest will align color space sequences in color space and only translate into nucleotide space when the alignment is needed for further analysis. As can be seen in the example below, color space sequences need to start with an anchoring nucleotide.

Example of color space sequence file in FASTA format

>853_15_64_F3
T.1.00.0.01.0.31231133003.2023.2.13.
>853_15_79_F3
T.2.31.2.00.2.00322313301.3101.3.32.

Other formats

Standard FASTA format

The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.

Example of FASTA format

>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA

Standard FASTQ format

The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values each occupy a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence line and the quality value line need to have the same length.

Example of FASTQ format

@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!


Processing operations

Dealing with paired end / mate pair sequences

The processing of paired end / mate pair sequences is supported in the following way:

  • Forward and reverse sequences are in separate files. In this case simply check the check box that says "I have paired end reads (forward/reverse) in separate files" on the Read Processing launch page, and you will be asked to choose two files for processing (one for forward and one for reverse reads).
  • With 454 paired-end technology, forward and reverse sequences are physically joined together by a paired end linker and sequenced as a single sequence. In this case you can copy & paste a linker sequence and the workflow will split the concatenated sequences into forward and reverse sequences. Currently this is only supported for 454 Sequence Flowgram Format (SFF).

The processing workflow will automatically add _F or _R to the sequence ID of each sequence.
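
The linker-based split and the _F / _R suffixes can be pictured with the sketch below. It is a simplification under stated assumptions: the linker is matched exactly, the part before the linker is treated as the forward read, and reads without a linker are returned unchanged. How the actual workflow handles partial linkers, orientation or unsplit reads is not described on this page.

```python
def split_on_linker(seq_id, sequence, linker):
    """Split a joined paired-end read at the first exact linker match."""
    position = sequence.find(linker)
    if position == -1:
        # No linker found: return the read as-is (an assumption of this sketch).
        return [(seq_id, sequence)]
    forward = sequence[:position]
    reverse = sequence[position + len(linker):]
    return [(seq_id + "_F", forward), (seq_id + "_R", reverse)]


print(split_on_linker("read1", "AAAACCGGTTTT", "CCGG"))
```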

Sequence quality trimming and filtering

Trim ambiguous residues (Ns) from both sides of the sequence

This option removes ambiguous residues (N in the case of nucleotide sequences, or . in the case of color space sequences) from both ends of each sequence. It removes the trailing stretches of Ns that some analysis pipelines add to reads to "fill them up" to a specified length, as well as bases that the sequencing technology could not call. For example, SOLiD reports no-calls as '.'; these are typically translated to 'N' because the base is unknown, although some mapping algorithms instead substitute the base from the reference sequence.

Keep only the first X residues of each read

This option keeps only the first X residues of each read, which makes it possible to obtain reads that all have the same length.
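
A minimal sketch of this kind of end trimming for nucleotide reads is shown below. The function name and the optional quality argument are illustrative assumptions; a color space read would additionally need its anchoring nucleotide handled separately, which this sketch does not attempt.

```python
def trim_ambiguous(sequence, quality=None, ambiguous="N"):
    """Remove ambiguous residues from both ends; internal ones are kept."""
    start, end = 0, len(sequence)
    while start < end and sequence[start] in ambiguous:
        start += 1
    while end > start and sequence[end - 1] in ambiguous:
        end -= 1
    # Keep the quality values in step with the trimmed sequence.
    trimmed_quality = quality[start:end] if quality is not None else None
    return sequence[start:end], trimmed_quality


print(trim_ambiguous("NNACGTNNN"))
print(trim_ambiguous("NACNGTN", "abcdefg"))
```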

Trimming poor quality sequence ends

This option trims low quality residues from the 5' and/or 3' end of each read. By default, a residue with a Phred base call quality of 11 or below is considered low quality, but this value can be changed. Trimming reduces the number of potential sequencing errors in a read, which increases the chance that the read can be aligned afterward. With just about every sequencing technology or platform, the chemistry deteriorates towards the end of a read, and the quality values decrease accordingly.
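
One straightforward reading of this rule is sketched below: residues are removed from each selected end for as long as their score is at or below the threshold (11 by default, as stated above). The function and parameter names are hypothetical, and the actual workflow may use a different trimming strategy.

```python
def trim_low_quality_ends(sequence, scores, threshold=11, trim_5=True, trim_3=True):
    """Trim residues with score <= threshold from the 5' and/or 3' end."""
    start, end = 0, len(sequence)
    if trim_5:
        while start < end and scores[start] <= threshold:
            start += 1
    if trim_3:
        while end > start and scores[end - 1] <= threshold:
            end -= 1
    return sequence[start:end], scores[start:end]


print(trim_low_quality_ends("ACGTACGT", [5, 11, 30, 30, 12, 30, 10, 2]))
```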

Overall quality filtering

This option removes reads of overall low quality by counting the number of low quality residues in the first part of a sequence. Doing this will speed up the alignment step, and reduce the number of false positive mismatches, insertions and deletions in the alignment step.
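
The counting approach can be sketched as follows. The length of the "first part" (window), the low quality threshold and the number of tolerated low quality residues (max_low) are placeholder values chosen for this example, not the workflow's defaults.

```python
def passes_quality_filter(scores, window=20, threshold=11, max_low=5):
    """Return True if the first `window` residues contain at most `max_low` low scores."""
    low_count = sum(1 for score in scores[:window] if score <= threshold)
    return low_count <= max_low


print(passes_quality_filter([30] * 20))
print(passes_quality_filter([5] * 6 + [30] * 14))
```

Low scores beyond the window do not count against a read in this sketch; poor quality read ends are the concern of end trimming rather than of this filter.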

Removing adapter sequences

Remove a fixed number of nucleotides from each sequence

This option clips a fixed number of residues from the 5' and/or 3' end of each sequence. Use it when the position and length of the adapter sequence in the read are known, for example in an RNA-Seq library (or a similar library) where some RNA species are shorter than the target sequencing length, so that the read includes adapter sequence.
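
Fixed-length clipping amounts to simple slicing, as in the sketch below. The function and parameter names are hypothetical.

```python
def clip_fixed(sequence, five_prime=0, three_prime=0):
    """Remove a fixed number of residues from the 5' and/or 3' end."""
    if five_prime < 0 or three_prime < 0:
        raise ValueError("clip lengths must not be negative")
    end = len(sequence) - three_prime
    # If the clips overlap, nothing is left of the read.
    return sequence[five_prime:max(end, five_prime)]


print(clip_fixed("ACGTACGT", five_prime=2, three_prime=3))
```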

Remove known adapter(s) from each sequence

This option removes known adapter sequences from the 5' and/or 3' end of each sequence using regular expressions. Use it when the sequence of the adapter is known.

The following regular expression syntax applies:

^      Match the beginning of the line
$      Match the end of the line
.      Match any character
[]     Character class
()     Grouping
+      Match 1 or more times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times

Examples of adapter patterns in FASTA format

>adapter1
GATCGGAAGAGCTC
>adapter1a
^GATCGGAAGAGCTC
>adapter1b
^ATCGGAAGAGCTC
>adapter1c
^TCGGAAGAGCTC
>adapter1d
^CGGAAGAGCTC
>adapter1e
^GGAAGAGCTC
>adapter1a
GATCGGAAGAGCTC$

>adapter1b
GATCGGAA[ACGT]AGCTC$
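
The effect of such patterns can be tried out with Python's re module, which supports all of the syntax listed above. This is only a stand-in for illustration; which regular expression engine the workflow uses, and whether it removes one or several matches per read, is not stated here. The function name remove_adapters is hypothetical.

```python
import re


def remove_adapters(sequence, patterns):
    """Delete the first match of each adapter pattern from the sequence."""
    for pattern in patterns:
        sequence = re.sub(pattern, "", sequence, count=1)
    return sequence


# 5' adapter anchored with ^, 3' adapter anchored with $ and one variable base.
print(remove_adapters("GATCGGAAGAGCTCACGTACGT", ["^GATCGGAAGAGCTC"]))
print(remove_adapters("ACGTACGTGATCGGAATAGCTC", ["GATCGGAA[ACGT]AGCTC$"]))
```

The shortened variants in the adapter examples above (^ATCGGAAGAGCTC, ^TCGGAAGAGCTC and so on) appear intended to catch reads in which the first bases of a 5' adapter are missing.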

Use clipping information from sequence provider

This option only works with 454 Sequence Flowgram Format (SFF) files. Use this option to clip sequences using the information provided by the sequence service provider.

General sequence cleanup

Remove reads that contain ambiguous residues (Ns)

Flag reads shorter than X residues

Flag reads with low complexity regions in them (NCBI dust)

Flag reads hitting a database of well known repetitive sequences

Multiplexing

Split up the sequencing run using barcodes within the sequences

Split up the sequencing run using sequence IDs

The report explained

Explain all the items in the report


Working with the sequence databases

Explain where to find the sequence databases and how to work with them


Frequently asked questions

  • What should I do if my file format is not supported?
  • What should I do if my file format is not correctly recognized?
  • Why does one of the operations described here not show up when I try to process my reads?
  • In what order are operations applied to sequences?
  • What is the difference between removing and flagging sequences?
  • Where do I find the GenomeQuest sequence databases after processing?