Welcome to the GenomeQuest Documentation Wiki
ReadProcessing
About the Read Processing workflow
The read processing workflow creates GenomeQuest sequence databases out of raw sequence files while allowing the user to clean up and organize the data. This is the flow NGS sequence data follows:
- Prepare an experimental sample and have it sequenced
- Upload the raw sequence file(s)
- Process the file(s) and turn them into a real GenomeQuest sequence database
- Run workflow(s) on the sequence database(s) and interact with the results
The read processing workflow supports the following operations:
- Dealing with paired end / mate pair sequences
- Sequence quality trimming and filtering - trimming bad quality bases from the sequence ends, and removing overall low quality sequences
- Removing adapter sequences
- General sequence cleanup - removing small sequences, redundant sequences and sequences that are obvious repeats.
- Multiplexing - splitting up a single sequencing run into multiple separate experiments.
Supported sequence formats
Sequences can come in many different formats, all of which we try to support. If your sequence format is not in the list, feel free to contact us at support@genomequest.com.
Roche 454
Roche 454 standard FASTA format
The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.
Example of FASTA format
>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
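The FASTA rules described above can be sketched as a small reader. This is a minimal illustration, not the parser GenomeQuest uses; the function name `parse_fasta` is hypothetical, and the sketch covers nucleotide space sequences only.

```python
def parse_fasta(text):
    """Yield (seq_id, sequence) pairs from single-line FASTA text."""
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without a sequence ID")
            # The ID ends at the first white-space; the rest is ignored.
            seq_id = parts[0]
        else:
            if seq_id is None:
                raise ValueError("sequence line without a preceding ID line")
            # Sequences must not contain numbers or white-space characters.
            if any(ch.isdigit() or ch.isspace() for ch in line):
                raise ValueError(f"invalid character in sequence {seq_id}")
            yield seq_id, line
            seq_id = None  # a second sequence line would now be an error


example = ">seqid1\nATGC\n>seqid2 length=42\nGCGC\n"
for seq_id, sequence in parse_fasta(example):
    print(seq_id, sequence)
```

Note that `length=42` never reaches the output, because the ID stops at the first white-space.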
Roche 454 standard FASTQ format
The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values are each on a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence line and the quality value line need to have the same length.
Example of FASTQ format
@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!
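As a rough sketch of the FASTQ layout described above, the reader below takes four lines per record and decodes the quality line by subtracting 33 from each character code. The name `parse_fastq` is hypothetical and this is not GenomeQuest's own implementation.

```python
def parse_fastq(text):
    """Yield (seq_id, sequence, phred_scores) from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must consist of four lines each")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        seq_id = header[1:].split()[0]  # ID stops at the first white-space
        # Phred/Sanger encoding: ASCII 33..126 maps to scores 0..93.
        scores = [ord(ch) - 33 for ch in quality]
        yield seq_id, sequence, scores


record = "@seqid1 length=4\nATGC\n+seqid1\n!5I~\n"
print(list(parse_fastq(record)))
```

Here `!`, `5`, `I` and `~` (ASCII 33, 53, 73 and 126) decode to the scores 0, 20, 40 and 93.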
Roche 454 separate sequence and quality files
This format contains sequence IDs, sequences and base call quality values. Sequences and quality values are in separate files. In most cases the file names end in .fna for sequences and .qual for quality values. Both files must contain the same number of sequence IDs in the same order. As with any other FASTA format, sequences and base call quality values must be on a single line. Base call quality values are standard Phred scores written as numbers separated by a space.
Example of sequence file in FASTA format (file usually ends with .fna)
>FZKE0LS02RGRJ2
AGAGAGTGGTTCACAGTATTATCGCACACGCACTAACCGGTGAG
>FZKE0LS02PSSZH
ATTATTCACACGTGAGACATATTCACCTGAACGGGTATCGAC
Example of quality file in FASTA format (file usually ends with .qual)
>FZKE0LS02RGRJ2
37 37 37 37 37 37 37 37 37 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 39 39 37 36 36 36 37 37
>FZKE0LS02PSSZH
39 39 39 39 39 39 39 39 39 39 39 40 40 40 40 40 40 40 40 40 39 38 36 39 39 39 39 33 27 23 23 21 28 30 26 35 30 30 29 29 21 21 17
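The pairing rule for separate sequence and quality files (same number of IDs, same order) can be illustrated with the sketch below. The helper names `read_records` and `merge_seq_and_qual` are hypothetical, and the code only shows the consistency checks, not how GenomeQuest performs them.

```python
def read_records(text):
    """Return [(seq_id, payload)] from FASTA-style text with one-line records."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("every ID line needs exactly one data line")
    records = []
    for header, payload in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected an ID line, got: {header}")
        records.append((header[1:].split()[0], payload))
    return records


def merge_seq_and_qual(fna_text, qual_text):
    """Combine a .fna and a .qual file into (seq_id, sequence, scores) tuples."""
    sequences = read_records(fna_text)
    qualities = read_records(qual_text)
    if len(sequences) != len(qualities):
        raise ValueError("files contain a different number of sequence IDs")
    merged = []
    for (seq_id, sequence), (qual_id, values) in zip(sequences, qualities):
        if seq_id != qual_id:
            raise ValueError(f"ID order differs: {seq_id} vs {qual_id}")
        # Quality values are Phred scores written as space-separated numbers.
        merged.append((seq_id, sequence, [int(v) for v in values.split()]))
    return merged


fna = ">read1\nACG\n>read2\nTT\n"
qual = ">read1\n37 40 12\n>read2\n5 6\n"
print(merge_seq_and_qual(fna, qual))
```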
Roche 454 Sequence Flowgram Format (SFF)
This format comes as a binary file (not a text file) that contains sequence IDs, sequences, base call quality values and can contain information on how sequences need to be clipped.
Illumina
Illumina standard FASTA format
The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.
Example of FASTA format
>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
Illumina standard FASTQ format
The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values are each on a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. The sequence line and the quality value line need to have the same length.
The following quality value formats can be found with the Illumina FASTQ format:
- ASCII-encoded Phred scores, ASCII characters 33-126 encoding scores of 0 to 93
- ASCII-encoded Illumina pipeline 1.2 or earlier scores, ASCII characters 59-126 encoding scores of -5 to 62
- ASCII-encoded Illumina pipeline 1.3 or later scores, ASCII characters 64-126 encoding scores of 0 to 62
If the sequence data has been produced recently, the quality values are most likely in "ASCII-encoded Illumina pipeline 1.3 or later" format. If you are unsure about the format, please contact your sequence service provider.
Example of FASTQ format
@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!
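The three quality encodings listed for Illumina FASTQ differ only in their ASCII offset and score range, which the following sketch makes explicit. The encoding keys (`phred33`, `illumina_1_2`, `illumina_1_3`) and function names are hypothetical labels chosen for this example. The `solexa_to_phred` helper uses the commonly published conversion for pipeline 1.2 and earlier (Solexa-scaled) scores; the text above does not state whether or how GenomeQuest applies such a conversion.

```python
import math

# (ASCII offset, lowest score, highest score) for each encoding listed above.
ENCODINGS = {
    "phred33": (33, 0, 93),
    "illumina_1_2": (64, -5, 62),
    "illumina_1_3": (64, 0, 62),
}


def decode_quality(quality_line, encoding):
    """Turn an ASCII quality line into a list of integer scores."""
    offset, lowest, highest = ENCODINGS[encoding]
    scores = [ord(ch) - offset for ch in quality_line]
    for score in scores:
        if not lowest <= score <= highest:
            raise ValueError(f"score {score} is outside the {encoding} range")
    return scores


def solexa_to_phred(score):
    """Convert a Solexa-scaled score to the Phred scale (standard formula)."""
    return 10 * math.log10(10 ** (score / 10) + 1)


print(decode_quality("!I~", "phred33"))
print(decode_quality(";@h", "illumina_1_2"))
print(decode_quality("@h", "illumina_1_3"))
```

A character below `@` (ASCII 64) raises an error under `illumina_1_3`, which is one simple way to notice that a file uses a different encoding than expected.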
Illumina standard SCARF format
The Illumina SCARF format contains sequence IDs, sequences and base call quality values on a single line separated by colons (:). All fields before the sequence are joined together to create a unique sequence ID.
The following quality value formats can be found with the Illumina SCARF format:
- ASCII-encoded Phred scores, ASCII characters 33-126 encoding scores of 0 to 93
- Numerical Phred scores, space-separated numbers from 0 to 93
- ASCII-encoded Illumina pipeline 1.2 or earlier scores, ASCII characters 59-126 encoding scores of -5 to 62
- ASCII-encoded Illumina pipeline 1.3 or later scores, ASCII characters 64-126 encoding scores of 0 to 62
If the sequence data has been produced recently, the quality values are most likely in "ASCII-encoded Illumina pipeline 1.3 or later" format. If you are unsure about the format, please contact your sequence service provider.
Example of Illumina SCARF format with Phred quality values
HWI-EAS68_2_FC205D4:7:1:565:389:GAAGCAAAAAGAAGACTAATAAAAACTTTACACTTT:>>>>=>>>>>???>?>>?>?;B<8>=<7864767/5
HWI-EAS68_2_FC205D4:7:1:164:493:GTGTGCATGTGTATGTGTTTTTTTTTTTTTTTTTCT:>'>>5=.>54'?.>?6,;49B733+5+(40351/0-
Example of Illumina SCARF format with numerical Phred quality values
HWI-EAS68_2_FC205D4:7:1:565:389:TAAATGTTTTCAACGTTAAACTTCTCTA:21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
HWI-EAS68_2_FC205D4:7:1:164:493:AGTAAACGATAACGCTTTCACAGACATT:21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
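A SCARF line can be taken apart roughly as sketched below. Because ASCII-encoded quality strings may themselves contain a colon, the sketch does not split from the right; instead it assumes (heuristically) that the sequence is the first field after the leading one that consists only of the characters A, C, G, T, N or `.`. The function names are hypothetical and this is not necessarily how GenomeQuest locates the sequence field.

```python
import re

# Assumption: a sequence field contains only these residue characters.
SEQUENCE_FIELD = re.compile(r"[ACGTN.]+")


def parse_scarf_line(line):
    """Split one SCARF line into (seq_id, sequence, quality_text)."""
    fields = line.rstrip("\n").split(":")
    for index, field in enumerate(fields):
        if index > 0 and SEQUENCE_FIELD.fullmatch(field):
            # Everything before the sequence is joined into the sequence ID.
            seq_id = ":".join(fields[:index])
            # Everything after it is quality data, colons included.
            quality = ":".join(fields[index + 1:])
            return seq_id, field, quality
    raise ValueError("no sequence field found")


def numeric_scores(quality_text):
    """Parse space-separated numerical Phred scores."""
    return [int(value) for value in quality_text.split()]


line = "HWI-EAS68_2_FC205D4:7:1:565:389:TAAATG:21 21 21 21 21 21"
seq_id, sequence, quality = parse_scarf_line(line)
print(seq_id, sequence, numeric_scores(quality))
```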
Life Tech Solid
Solid standard FASTA format
Solid Color Space standard FASTA format ; Solid Nucleotide Space standard FASTA format
The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.
Example of nucleotide space FASTA format
>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
Sequences are either in nucleotide or color space format. GenomeQuest will align color space sequences in color space and only translate into nucleotide space when the alignment is needed for further analysis. As can be seen in the example below, color space sequences need to start with an anchoring nucleotide.
Example of color space FASTA format
>853_64_807_F3
T00021321231230002103301312101101323
>853_64_861_F3
T00101210200133033001332333320203311
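To show what translating color space into nucleotide space involves, and why the anchoring nucleotide is required, here is a small sketch of the standard SOLiD two-base decoding. Each color describes the transition from the previous base to the next one. The function name is hypothetical, and this is an illustration only, not GenomeQuest's implementation.

```python
BASES = "ACGT"  # index 0..3; a color is the XOR of two adjacent base indices


def decode_color_space(read):
    """Translate a color space read (anchor base + colors) into nucleotides."""
    anchor, colors = read[0], read[1:]
    if anchor not in BASES:
        raise ValueError("read must start with an anchoring nucleotide")
    current = BASES.index(anchor)
    decoded = []
    for color in colors:
        if color not in "0123":
            # A no-call ('.') cannot be decoded; every later base is unknown.
            raise ValueError(f"cannot decode color {color!r}")
        current ^= int(color)  # color 0 keeps the base, 1-3 change it
        decoded.append(BASES[current])
    # The anchor belongs to the primer and is not part of the read itself.
    return "".join(decoded)


print(decode_color_space("T0123"))  # prints "TGAT"
```

Because every base depends on all colors before it, a single wrong color changes the whole remainder of the translated read, which is one reason to align in color space first.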
Solid standard FASTQ format
The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values are each on a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence line and the quality value line need to have the same length.
Example of FASTQ format
@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!
Sequences are either in nucleotide or color space format. GenomeQuest will align color space sequences in color space and only translate into nucleotide space when the alignment is needed for further analysis. As can be seen in the example below, color space sequences need to start with an anchoring nucleotide.
Example of color space FASTQ format
@853_64_807_F3
T00021321231230002103301312101101323
+
>>>>=>>>>>???>?>>?>?;B<8>=<7864767/5
@853_64_861_F3
T00101210200133033001332333320203311
+
>>>>=??>?>>?>?;B<8>=<7864767/5323334
Solid separate sequences and quality files
Solid Color Space separate sequence and quality files Solid Nucleotide Space separate sequence and quality files
This format contains sequence IDs, sequences and base call quality values. Sequences and quality values are in separate files. In most cases the file names end in .fna for sequences and .qual for quality values. Both files must contain the same number of sequence IDs in the same order. As with any other FASTA format, sequences and base call quality values must be on a single line. Base call quality values are standard Phred scores written as numbers separated by a space.
Example of sequence file in FASTA format (file usually ends with .fna)
>853_15_64_F3
CAGCACTAGCATTTACGAGAGCAGCGACTTAGCAGC
>853_15_79_F3
GCGCATCAGCATAGCAGCAGTAGCAGCGATTAGCAG
Example of a standard SOLiD output sequence file in CSFASTA (ColorSpace FASTA) format (file usually ends with .csfasta)
>853_15_64_F3
T0120301212131221133200
>853_15_79_F3
T0131212301230120120312
Example of quality file in FASTA format (file usually ends with .qual)
>853_15_64_F3
-1 6 -1 5 2 -1 5 -1 5 12 -1 8 -1 5 3 7 13 23 5 2 2 16 10 7 -1 3 7 5 5 -1 6 -1 5 10 -1
>853_15_79_F3
-1 24 -1 25 25 -1 25 -1 25 24 -1 27 -1 28 21 25 25 26 27 28 26 24 26 25 -1 27 22 27 29 -1 26 -1 26 29 -1
Sequences are either in nucleotide or color space format. GenomeQuest will align color space sequences in color space and only translate into nucleotide space when the alignment is needed for further analysis. As can be seen in the example below, color space sequences need to start with an anchoring nucleotide.
Example of color space sequence file in FASTA format
>853_15_64_F3
T.1.00.0.01.0.31231133003.2023.2.13.
>853_15_79_F3
T.2.31.2.00.2.00322313301.3101.3.32.
Other formats
Standard FASTA format
The FASTA format contains sequence IDs and sequences. Sequence IDs start with the greater than sign (>) and continue until the first white-space or the end of the line (length=42 for seqid2 in the example below is ignored). Sequences need to be on a single line and cannot contain numbers or white-space characters. This format does not support quality values.
Example of FASTA format
>seqid1
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
>seqid2 length=42
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
Standard FASTQ format
The FASTQ format contains sequence IDs, sequences and base call quality values. Sequence IDs start with the at sign (@) and continue until the first white-space or the end of the line (length=42 for seqid1 in the example below is ignored). The sequence and the base call quality values are each on a single line, separated by a line that starts with a plus sign (+). This separator line can (but does not have to) contain the sequence ID again. Quality values are standard Phred/Sanger scores (0 to 93) encoded as ASCII characters 33 to 126. The sequence line and the quality value line need to have the same length.
Example of FASTQ format
@seqid1 length=42
ATGCGCGCATCAGCGACTACGACGACGACTACGCATCGGGGAG
+seqid1
98876776889898#$@@!!999999999999999999999!!
@seqid2
GCGCATTAGCGAGATTAGCGACTTAGCAGCGCATTAGCAGCGA
+
98876776889898#$@@!!999999999995466466539#!
Processing operations
Dealing with paired end / mate pair sequences
The processing of paired end / mate pair sequences is supported in the following way:
- Forward and reverse sequences are in separate files. In this case simply check the check box that says "I have paired end reads (forward/reverse) in separate files" on the Read Processing launch page, and you will be asked to choose two files for processing (one for forward and one for reverse reads).
- With 454 paired-end technology, forward and reverse sequences are physically joined together by a paired end linker and sequenced as a single sequence. In this case you can copy and paste a linker sequence, and the workflow will split the concatenated sequences into forward and reverse sequences. Currently this is only supported for 454 Sequence Flowgram Format (SFF).
The processing workflow will automatically add _F or _R to the sequence ID of each sequence.
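The linker-based split and the _F/_R naming described above could look roughly like the sketch below. It assumes an exact linker match and that the part before the linker is the forward read; both are simplifications made for this example, and the function name `split_on_linker` is hypothetical. The actual workflow may tolerate mismatches or orient the reads differently.

```python
def split_on_linker(seq_id, sequence, linker):
    """Split a joined 454 paired-end read at the linker sequence."""
    position = sequence.find(linker)
    if position == -1:
        # No linker found: keep the read as a single, unpaired sequence.
        return [(seq_id, sequence)]
    forward = sequence[:position]
    reverse = sequence[position + len(linker):]
    # Tag both halves so the pair can be recognized later by its ID.
    return [(seq_id + "_F", forward), (seq_id + "_R", reverse)]


print(split_on_linker("read1", "AAAACCGGTTTT", "CCGG"))
```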
Sequence quality trimming and filtering
Trim ambiguous residues (Ns) from both sides of the sequence
This option removes ambiguous residues (N in nucleotide sequences, or . in color space sequences) from both ends of each sequence. This makes it possible to remove the trailing stretches of Ns that some analysis pipelines add to reads to "fill them up" to a specified length, as well as bases that the sequencing technology was unable to call. For example, SOLiD reports no-calls as '.'. These no-calls are typically translated to 'N' because they are unknown, although some mapping algorithms instead substitute the base from the reference sequence.
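A minimal sketch of trimming ambiguous residues from both ends is shown below, keeping an optional quality string in step with the sequence. The function name and the `ambiguous` parameter are hypothetical; how the anchoring nucleotide of color space reads is treated is not covered here.

```python
def trim_ambiguous(sequence, quality=None, ambiguous="N"):
    """Remove leading and trailing ambiguous residues from a read."""
    start, end = 0, len(sequence)
    while start < end and sequence[start] in ambiguous:
        start += 1
    while end > start and sequence[end - 1] in ambiguous:
        end -= 1
    # Ambiguous residues inside the read are left untouched.
    trimmed_quality = quality[start:end] if quality is not None else None
    return sequence[start:end], trimmed_quality


print(trim_ambiguous("NNACGTNAN", "123456789"))  # ('ACGTNA', '345678')
```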
Keep only the first X residues of each read
This option keeps only the first X residues of each read, so that no read is longer than X residues.
Trimming poor quality sequence ends
This option trims low quality residues from the 5' and/or 3' end of each read. By default, a residue with a Phred base call quality of 11 or below is considered low quality, but this value can be changed. Trimming reduces the number of potential sequencing errors in a read, thereby increasing the chance that the read can be aligned afterward. With just about every sequencing technology or platform, the chemistry deteriorates towards the ends of sequences, and the quality values decrease accordingly.
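End trimming with the default threshold of 11 can be sketched as follows. The function and parameter names are hypothetical, and the sketch simply removes residues at or below the threshold until the first better residue is reached from each side; the workflow's actual algorithm may differ.

```python
def trim_low_quality_ends(sequence, scores, threshold=11,
                          trim_5_prime=True, trim_3_prime=True):
    """Trim residues with Phred score <= threshold from the read ends."""
    start, end = 0, len(sequence)
    if trim_5_prime:
        while start < end and scores[start] <= threshold:
            start += 1
    if trim_3_prime:
        while end > start and scores[end - 1] <= threshold:
            end -= 1
    # Low quality residues in the middle of the read are kept.
    return sequence[start:end], scores[start:end]


read = "ACGTACGT"
quality = [5, 11, 30, 8, 40, 35, 11, 2]
print(trim_low_quality_ends(read, quality))  # ('GTAC', [30, 8, 40, 35])
```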
Overall quality filtering
This option removes reads of overall bad quality by counting the number of low quality residues in the first part of each sequence. Doing this speeds up the alignment step and reduces the number of false positive mismatches, insertions and deletions in the alignment.
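The counting approach described above can be sketched like this. The text does not say how long "the first part" is or how many low quality residues are tolerated, so `window`, `threshold` and `max_low_quality` and their default values are assumptions made purely for illustration.

```python
def passes_quality_filter(scores, window=20, threshold=11, max_low_quality=3):
    """Return True if the start of the read has few enough low quality residues.

    window, threshold and max_low_quality are hypothetical defaults.
    """
    first_part = scores[:window]
    low_quality = sum(1 for score in first_part if score <= threshold)
    return low_quality <= max_low_quality


reads = {
    "good": [40] * 20 + [2] * 10,  # poor residues only after the window
    "bad": [5] * 4 + [40] * 16,    # four poor residues inside the window
}
kept = [name for name, scores in reads.items() if passes_quality_filter(scores)]
print(kept)  # ['good']
```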
Removing adapter sequences
Remove a fixed number of nucleotides from each sequence
This option clips a fixed number of residues from the 5' and/or 3' end of each sequence. Use this option when the position and length of the adapter sequence in the read are known. This is the case, for example, in an RNA-Seq library (or other library) where some RNA species are shorter than the target sequencing length, so that the reads include adapter sequence.
Remove known adapter(s) from each sequence
This option removes known adapter sequences from the 5' and/or 3' end of each sequence using regular expressions. Use this option when the sequence of the adapter is known.
The following regular expression syntax applies:
^ Match the beginning of the line
$ Match the end of the line
. Match any character
[] Character class
() grouping
+ Match 1 or more times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
>adapter1
GATCGGAAGAGCTC

>adapter1a
^GATCGGAAGAGCTC
>adapter1b
^ATCGGAAGAGCTC
>adapter1c
^TCGGAAGAGCTC
>adapter1d
^CGGAAGAGCTC
>adapter1e
^GGAAGAGCTC

>adapter1a
GATCGGAAGAGCTC$
>adapter1b
GATCGGAA[ACGT]AGCTC$
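The adapter patterns above can be tried out with Python's `re` module, which supports every construct in the syntax list. The sketch below applies the patterns in the given order and stops at the first one that matches; that ordering rule and the function name `remove_adapter` are assumptions for this example, not a description of how the workflow evaluates them.

```python
import re


def remove_adapter(sequence, patterns):
    """Remove the first matching adapter pattern from a read."""
    for pattern in patterns:
        trimmed, replacements = re.subn(pattern, "", sequence, count=1)
        if replacements:
            return trimmed  # first matching pattern wins
    return sequence


# Patterns anchored with ^ remove a full or partial adapter at the 5' end.
five_prime = ["^GATCGGAAGAGCTC", "^ATCGGAAGAGCTC", "^TCGGAAGAGCTC"]
print(remove_adapter("ATCGGAAGAGCTCACGTACGT", five_prime))

# Patterns anchored with $ remove an adapter at the 3' end; the character
# class [ACGT] tolerates any base at that one position.
three_prime = ["GATCGGAAGAGCTC$", "GATCGGAA[ACGT]AGCTC$"]
print(remove_adapter("ACGTACGTGATCGGAACAGCTC", three_prime))
```

Both calls print `ACGTACGT`. Listing the longest pattern first avoids removing only part of an adapter that is present in full.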
Use clipping information from sequence provider
This option only works with 454 Sequence Flowgram Format (SFF) files. Use this option to clip sequences using the information provided by the sequence service provider.
General sequence cleanup
Remove reads that contain ambiguous residues (Ns)
Flag reads shorter than X residues
Flag reads with low complexity regions in them (NCBI dust)
Flag reads hitting a database of well known repetitive sequences
Multiplexing
Split up the sequencing run using barcodes within the sequences
Split up the sequencing run using sequence IDs
The report explained
Explain all the items in the report
Working with the sequence databases
Explain where to find the sequence databases and how to work with them
Frequently asked questions
- What should I do if my file format is not supported?
- What should I do if my file format is not correctly recognized?
- One of the operations described here doesn't show up when I try to process my reads
- What is the order in which operations are applied to sequences?
- What is the difference between removing and flagging sequences?
- Where do I find the GenomeQuest sequence databases after processing?