Welcome to the GenomeQuest Documentation Wiki
Sequence File Formats
Follow one of the sequence file formats described here whenever you supply more than one sequence at a time. This applies when uploading your own sequence databases and when pasting multiple sequences into the sequence search page.
Fasta Format
This format is suitable when each sequence carries only minimal annotation: an identifier and a brief description.
- In fasta format, each new sequence starts off with a header line.
- Start every header line with a > (greater than) sign.
- The first word on the header line is taken as the identifier of the sequence.
- The rest of the header line (after the first word) is taken as the description of the sequence.
- End the header line with a line break.
- All content up to the next header line is considered the sequence.
- The sequence may be on a single long line or split into multiple lines.
- Spaces, numbers, and other extraneous characters in the sequence are ignored.
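The rules above can be sketched as a small Python reader. This is only an illustration, not the parser GenomeQuest itself uses: the function name `parse_fasta` is hypothetical, and treating every non-letter character as "extraneous" is an assumption made for this sketch.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    ident, desc, chunks = None, "", []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if ident is not None:
                records.append((ident, desc, "".join(chunks)))
            parts = line[1:].split(None, 1)
            ident = parts[0] if parts else ""                  # first word
            desc = parts[1].strip() if len(parts) > 1 else ""  # rest of line
            chunks = []
        elif ident is not None:
            # Assumption: only letters belong to the sequence; spaces,
            # digits, and any other characters are dropped.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    if ident is not None:
        records.append((ident, desc, "".join(chunks)))
    return records


example = ">seq1 first test\nACGT 10\nTT-GA\n>seq2\nGGCC\n"
for ident, desc, seq in parse_fasta(example):
    print(ident, desc, len(seq))
```

The sequence of a record is collected from every line up to the next header, so it makes no difference whether the input uses one long line or many short ones.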
Example Fasta File
The following example shows a fasta file of two nucleotide sequences:
>HSU07918 Human G-protein-coupled inwardly rectifying potassium channel (KCNJ3) gene, polymorphic repeat sequence.
GGAATCGTTGTGGTTAGATTAGCAAAATGTTAGGACATATATGGTCTTTCAGACCACCATTTTTTTATTGCATCTGCACT
TGATAATAGTCATATGGAGAAACAAGACAACATCCAATTTGGAATATAGATATATATATGGAGGATATGTATACACACAC
ACACACACACACACACACATATATATGGAGGATAAAAGAAGGTGAGCACAAAAATAACATATGTGATGTTAGAAGAGGAA
AGGAATGTACTCTATTACCTTTTGACAAGTGAAAGTTAAGGATTGCACAGCTGACCTCTTAGGAAAGAAAGAGGATGCCT
ATTGGCAATAAATAG
>HUMGPCRB Human EBV induced G-protein coupled receptor (EBI2) mRNA, complete cds.
GGAATTCCCTGATATACACCTGGACCACCACCAATGGATATACAAATGGCAAACAATTTTACTCCGCCCTCTGCAACTCC
TCAGGGAAATGACTGTGACCTCTATGCACATCACAGCACGGCCAGGATAGTAATGCCTCTGCATTACAGCCTCGTCTTCA
TCATTGGGCTCGTGGGAAACTTACTAGCCTTGGTCGTCATTGTTCAAAACAGGAAAAAAATCAACTCTACCACCCTCTAT
TCAACAAATTTGGTGATTTCTGATATACTTTTTACCACGGCTTTGCCTACACGAATAGCCTACTATGCAATGGGCTTTGA
CTGGAGAATCGGAGATGCCTTGTGTAGGATAACTGCGCTAGTGTTTTACATCAACACATATGCAGGTGTGAACTTTATGA
CCTGCCTGAGTATTGACCGCTTCATTGCTGTGGTGCACCCTCTACGCTACAACAAGATAAAAAGGATTGAACATGCAAAA
EMBL Format
This format is suitable when the annotation is more extensive. In this format, each sequence gets its own section. Sections are delimited by a line containing two slashes, i.e.
//
Within the section for each sequence, all its annotations and its sequence are specified.
Specifying annotations
- Every sequence has multiple annotation fields.
- Each annotation field is specified as a key-value pair.
- The key for an annotation field is a two character identifier.
- Each annotation is specified on a single annotation line.
- The first two characters on the annotation line must be the two character identifier for that field.
- The identifier is always followed by three blanks, so that the value on each line begins at character position 6.
- The rest of the line (up to the line break) is the value of that annotation.
- For example, to specify that the accession number (AC) of a sequence is U07918, use
AC   U07918
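As a minimal sketch of the column layout described above (two-character identifier, three blanks, value from position 6), the following Python helpers split and build a single annotation line. The names `parse_annotation_line` and `format_annotation_line` are hypothetical and exist only for this illustration.

```python
def parse_annotation_line(line):
    """Split one annotation line into (key, value)."""
    line = line.rstrip("\r\n")
    key, gap, value = line[:2], line[2:5], line[5:]
    # Columns 1-2 hold the key, columns 3-5 must be blank,
    # and the value starts at column 6.
    if len(key) != 2 or key.isspace() or (value and gap != "   "):
        raise ValueError("not a valid annotation line: %r" % line)
    return key, value


def format_annotation_line(key, value):
    """Build an annotation line from a two-character key and a value."""
    if len(key) != 2:
        raise ValueError("key must be exactly two characters")
    return key + "   " + value


print(parse_annotation_line("AC   U07918"))
print(format_annotation_line("AC", "U07918"))
```

A line with only one blank after the identifier is rejected by this sketch, because the value would not start at position 6.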
Specifying the sequence
- Lines specifying the sequence must begin with two space characters.
- The sequence may be specified on a single long line or on multiple lines.
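Putting the section delimiter, the annotation lines, and the sequence lines together, a reader for this format might look like the Python sketch below. It is an illustration under stated assumptions, not GenomeQuest's own parser: the name `parse_embl` is hypothetical, blank lines are skipped, and if a key appears twice in one section the later value simply replaces the earlier one.

```python
def parse_embl(text):
    """Return a list of {"annotations": dict, "sequence": str} records."""
    records = []
    annotations, chunks = {}, []
    for line in text.splitlines():
        if line.strip() == "//":
            # The delimiter line closes the current section.
            if annotations or chunks:
                records.append({"annotations": annotations,
                                "sequence": "".join(chunks)})
            annotations, chunks = {}, []
        elif not line.strip():
            continue  # assumption: blank lines carry no information
        elif line.startswith("  "):
            # Sequence lines begin with two spaces; inner blanks are removed.
            chunks.append("".join(line.split()))
        else:
            # Annotation line: two-character key, value from position 6.
            annotations[line[:2]] = line[5:].strip()
    # Tolerate a final section that lacks its closing delimiter.
    if annotations or chunks:
        records.append({"annotations": annotations,
                        "sequence": "".join(chunks)})
    return records


example = "ID   S1\nAC   U07918\n  ACGT\n  TTGA\n//\nID   S2\n  GG CC\n//\n"
for record in parse_embl(example):
    print(record["annotations"]["ID"], record["sequence"])
```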
Example EMBL File
The following example shows a file with two sequences. Each sequence has the following annotation fields defined:
- ID: Identifier.
- AC: Accession number.
- SV: Sequence version number.
- GI: GenBank identifier.
- DE: Brief description.
- OS: Source organism.
- OX: NCBI's taxonomy identifier for the organism.
ID   HSU07918
AC   U07918
SV   1
GI   469481
DE   Human G-protein-coupled inwardly rectifying potassium channel (KCNJ3) gene, polymorphic repeat sequence.
OS   Homo sapiens (human)
OX   9606
  GGAATCGTTGTGGTTAGATTAGCAAAATGTTAGGACATATATGGTCTTTCAGACCACCATTTTTTTATTGCATCTGCACT
  TGATAATAGTCATATGGAGAAACAAGACAACATCCAATTTGGAATATAGATATATATATGGAGGATATGTATACACACAC
  ACACACACACACACACACATATATATGGAGGATAAAAGAAGGTGAGCACAAAAATAACATATGTGATGTTAGAAGAGGAA
  AGGAATGTACTCTATTACCTTTTGACAAGTGAAAGTTAAGGATTGCACAGCTGACCTCTTAGGAAAGAAAGAGGATGCCT
  ATTGGCAATAAATAG
//
ID   HUMGPCRB
AC   L08177
SV   1
GI   292056
DE   Human EBV induced G-protein coupled receptor (EBI2) mRNA, complete cds.
OS   Homo sapiens (human)
OX   9606
  GGAATTCCCTGATATACACCTGGACCACCACCAATGGATATACAAATGGCAAACAATTTTACTCCGCCCTCTGCAACTCC
  TCAGGGAAATGACTGTGACCTCTATGCACATCACAGCACGGCCAGGATAGTAATGCCTCTGCATTACAGCCTCGTCTTCA
  TCATTGGGCTCGTGGGAAACTTACTAGCCTTGGTCGTCATTGTTCAAAACAGGAAAAAAATCAACTCTACCACCCTCTAT
  TCAACAAATTTGGTGATTTCTGATATACTTTTTACCACGGCTTTGCCTACACGAATAGCCTACTATGCAATGGGCTTTGA
  CTGGAGAATCGGAGATGCCTTGTGTAGGATAACTGCGCTAGTGTTTTACATCAACACATATGCAGGTGTGAACTTTATGA
  CCTGCCTGAGTATTGACCGCTTCATTGCTGTGGTGCACCCTCTACGCTACAACAAGATAAAAAGGATTGAACATGCAAAA
  GGCGTGTGCATATTTGTCTGGATTCTAGTATTTGCTCAGACACTCCCACTCCTCATCAACCCTATGTCAAAGCAGGAGGC
//
FastQ Format
Use this format when the sequences come with quality scores. Quality scores matter in the variant detection workflow, where they are used to judge the reliability of a predicted polymorphism.
In the FastQ format, each sequence is represented by four lines:
- The first line starts with an '@' character, followed by the identifier of the sequence.
- The second line contains the sequence itself.
- The third line starts with a '+' sign and may carry some annotation for the sequence.
- The fourth line contains the quality scores of the sequence.
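The four-line layout can be read with a short Python sketch such as the one below. The function names are hypothetical. Two things in it go beyond what this page states and should be treated as assumptions: the check that the quality line has the same length as the sequence line, and the `offset=33` default used to turn quality characters into numbers (this page does not say which quality encoding is expected).

```python
def parse_fastq(text):
    """Return a list of dicts with id, sequence, annotation, and quality."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per sequence")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        # Assumption: one quality character per base.
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append({"id": header[1:].strip(),
                        "sequence": seq,
                        "annotation": plus[1:].strip(),
                        "quality": qual})
    return records


def quality_values(quality, offset=33):
    """Convert quality characters to integers.

    The offset of 33 is an assumption of this sketch; adjust it if your
    files use a different encoding.
    """
    return [ord(ch) - offset for ch in quality]


example = '@r1\nACGT\n+\n!!#$\n@r2\nGG\n+note\n""\n'
for record in parse_fastq(example):
    print(record["id"], record["sequence"], quality_values(record["quality"]))
```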
Example FastQ File
The following example shows a FastQ file with five sequences:
@SRR002813.10000010F
GGGGAAGAATAACCTTTACAAACGGAAATACCACTT
+
$$$$""$&$&&""###%""""%##"!"#""!!"!"#
@SRR002813.10000043F
CAACACCTTTCAGGGGACAACCGTGGTACTGAGGAT
+
"$$#&$&$%&$"$!!!"""%"%"""!"!!"!!!!"#
@SRR002813.10000156F
CCTGTTACCCTCACCCAGGTCTCAGCAGGGCTGATT
+
!"$!!$$!#$$""""&!!""$"$"#""!!!!"!"#!
@SRR002813.10000250F
CTTTTTCACAAGAAATAAGTACCTTCTCAAATTGCT
+
$$$"$$"$$""!"##$"%!!$$"$$"''"!!!!!!"
@SRR002813.10000280F
CCTTACAGTATCTATTGTATGTGTTGCATTTAATTA
+
&!$$&&&$!&$"!##$!!##!!!%!!!%"""!"&""