Welcome to the GenomeQuest Documentation Wiki

Sequence File Formats

Sequence file formats matter whenever you work with multiple sequences at once, in particular when you upload your own sequence databases or paste multiple sequences into the sequence search page.

Fasta Format

This format is suitable when each sequence carries only minimal annotation: an identifier and a brief description.

  1. In fasta format, each new sequence starts off with a header line.
    1. Start every header line with a > (greater than) sign.
    2. The first word on the header line is taken as the identifier of the sequence.
    3. The rest of the header line (after the first word) is taken as the description of the sequence.
    4. End the header line with a line break.
  2. All content up to the next header line is considered the sequence.
    1. The sequence may be on a single long line or split into multiple lines.
    2. Spaces, numbers, and other extraneous characters in the sequence are ignored.
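
The rules above can be sketched as a small Python parser. This is only an illustration, not the parser GenomeQuest uses: the function names `parse_fasta` and `_finish` are hypothetical, and the sketch assumes that "extraneous characters" means anything that is not a letter.

```python
import re


def _finish(header, chunks):
    """Split a header into identifier and description and join the sequence."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header line closes the previous sequence, if any.
            if header is not None:
                records.append(_finish(header, chunks))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Assumption: keep letters only; drop spaces, digits, and the rest.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
    if header is not None:
        records.append(_finish(header, chunks))
    return records


records = parse_fasta(">seq1 a short test sequence\nACGT 1234\nTTAA\n>seq2\nGGCC\n")
for identifier, description, sequence in records:
    print(identifier, description, sequence)
```

Characters such as `*` or `-` are also dropped under this assumption, so adjust the pattern if your sequences rely on them.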

Example Fasta File

The following example shows a fasta file of two nucleotide sequences:

>HSU07918 Human G-protein-coupled inwardly rectifying potassium channel (KCNJ3) gene, polymorphic repeat sequence. 
GGAATCGTTGTGGTTAGATTAGCAAAATGTTAGGACATATATGGTCTTTCAGACCACCATTTTTTTATTGCATCTGCACT
TGATAATAGTCATATGGAGAAACAAGACAACATCCAATTTGGAATATAGATATATATATGGAGGATATGTATACACACAC
ACACACACACACACACACATATATATGGAGGATAAAAGAAGGTGAGCACAAAAATAACATATGTGATGTTAGAAGAGGAA
AGGAATGTACTCTATTACCTTTTGACAAGTGAAAGTTAAGGATTGCACAGCTGACCTCTTAGGAAAGAAAGAGGATGCCT
ATTGGCAATAAATAG
>HUMGPCRB Human EBV induced G-protein coupled receptor (EBI2) mRNA, complete cds. 
GGAATTCCCTGATATACACCTGGACCACCACCAATGGATATACAAATGGCAAACAATTTTACTCCGCCCTCTGCAACTCC
TCAGGGAAATGACTGTGACCTCTATGCACATCACAGCACGGCCAGGATAGTAATGCCTCTGCATTACAGCCTCGTCTTCA
TCATTGGGCTCGTGGGAAACTTACTAGCCTTGGTCGTCATTGTTCAAAACAGGAAAAAAATCAACTCTACCACCCTCTAT
TCAACAAATTTGGTGATTTCTGATATACTTTTTACCACGGCTTTGCCTACACGAATAGCCTACTATGCAATGGGCTTTGA
CTGGAGAATCGGAGATGCCTTGTGTAGGATAACTGCGCTAGTGTTTTACATCAACACATATGCAGGTGTGAACTTTATGA
CCTGCCTGAGTATTGACCGCTTCATTGCTGTGGTGCACCCTCTACGCTACAACAAGATAAAAAGGATTGAACATGCAAAA

EMBL Format

This format is suitable when the annotation is more extensive. Each sequence gets its own section, and sections are delimited by a line containing two slashes, i.e.

//

Within the section for each sequence, all its annotations and its sequence are specified.

Specifying annotations

  1. Every sequence has multiple annotation fields.
  2. Each annotation field is specified as a key-value pair.
  3. The key for an annotation field is a two character identifier.
  4. Each annotation is specified on a single annotation line.
    1. The first two characters on the annotation line must be the two character identifier for that field.
    2. The identifier is always followed by three blanks, so that the value on each line begins in character position 6.
    3. The rest of the line (up to the line break) is the value of that annotation.
  5. For example, to specify that the accession number (AC) of a sequence is U07918, use
    AC   U07918

Specifying the sequence

  1. Lines specifying the sequence must begin with two space characters.
  2. The sequence may be specified on a single long line or on multiple lines.
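
As a rough illustration of the annotation and sequence rules above, the following Python sketch reads this format into dictionaries. The function name `parse_embl_like` is hypothetical, and the sketch makes two assumptions: it accepts any number of blanks after the two character identifier, and it keeps only the last value if a field is repeated within one section.

```python
def parse_embl_like(text):
    """Return one dict per section: {'annotations': {...}, 'sequence': '...'}."""
    records = []
    annotations = {}
    chunks = []
    for line in text.splitlines():
        if line.strip() == "//":
            # A line with two slashes closes the current section.
            if annotations or chunks:
                records.append(
                    {"annotations": annotations, "sequence": "".join(chunks)}
                )
            annotations, chunks = {}, []
        elif line.startswith("  "):
            # Sequence lines begin with two space characters.
            chunks.append("".join(line.split()))
        elif line.strip():
            # Annotation line: two character key, then blanks, then the value.
            key = line[:2]
            annotations[key] = line[2:].strip()
    # Assumption: tolerate a final section that lacks the closing "//".
    if annotations or chunks:
        records.append({"annotations": annotations, "sequence": "".join(chunks)})
    return records


sample = "ID   X1\nAC   U07918\n  ACGT\n  TTAA\n//\n"
for record in parse_embl_like(sample):
    print(record["annotations"]["AC"], record["sequence"])
```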

Example EMBL File

The following example shows a file with two sequences. Each sequence has the following annotation fields defined:

  1. ID: Identifier.
  2. AC: Accession number.
  3. SV: Sequence version number.
  4. GI: GenBank identifier.
  5. DE: Brief description.
  6. OS: Source organism.
  7. OX: NCBI's taxonomy identifier for the organism.

ID   HSU07918
AC   U07918
SV   1
GI   469481
DE   Human G-protein-coupled inwardly rectifying potassium channel (KCNJ3) gene, polymorphic repeat sequence.
OS   Homo sapiens (human)
OX   9606
  GGAATCGTTGTGGTTAGATTAGCAAAATGTTAGGACATATATGGTCTTTCAGACCACCATTTTTTTATTGCATCTGCACT
  TGATAATAGTCATATGGAGAAACAAGACAACATCCAATTTGGAATATAGATATATATATGGAGGATATGTATACACACAC
  ACACACACACACACACACATATATATGGAGGATAAAAGAAGGTGAGCACAAAAATAACATATGTGATGTTAGAAGAGGAA
  AGGAATGTACTCTATTACCTTTTGACAAGTGAAAGTTAAGGATTGCACAGCTGACCTCTTAGGAAAGAAAGAGGATGCCT
  ATTGGCAATAAATAG
//
ID   HUMGPCRB
AC   L08177
SV   1
GI   292056
DE   Human EBV induced G-protein coupled receptor (EBI2) mRNA, complete cds.
OS   Homo sapiens (human)
OX   9606
  GGAATTCCCTGATATACACCTGGACCACCACCAATGGATATACAAATGGCAAACAATTTTACTCCGCCCTCTGCAACTCC
  TCAGGGAAATGACTGTGACCTCTATGCACATCACAGCACGGCCAGGATAGTAATGCCTCTGCATTACAGCCTCGTCTTCA
  TCATTGGGCTCGTGGGAAACTTACTAGCCTTGGTCGTCATTGTTCAAAACAGGAAAAAAATCAACTCTACCACCCTCTAT
  TCAACAAATTTGGTGATTTCTGATATACTTTTTACCACGGCTTTGCCTACACGAATAGCCTACTATGCAATGGGCTTTGA
  CTGGAGAATCGGAGATGCCTTGTGTAGGATAACTGCGCTAGTGTTTTACATCAACACATATGCAGGTGTGAACTTTATGA
  CCTGCCTGAGTATTGACCGCTTCATTGCTGTGGTGCACCCTCTACGCTACAACAAGATAAAAAGGATTGAACATGCAAAA
  GGCGTGTGCATATTTGTCTGGATTCTAGTATTTGCTCAGACACTCCCACTCCTCATCAACCCTATGTCAAAGCAGGAGGC
//
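
To produce a file like the example above from your own data, a writer can be sketched in a few lines of Python. The function name `format_embl_like` and the `width` parameter are hypothetical. The sketch follows the rules stated earlier: a two character key followed by three blanks, sequence lines that begin with two spaces, and a `//` line closing each section. The default line width of 80 is an arbitrary choice, since the sequence may also be written on a single long line.

```python
def format_embl_like(records, width=80):
    """Render (annotations, sequence) pairs as text in the format above.

    `annotations` is a dict mapping two character keys to values;
    Python dicts keep insertion order, so fields are written as given.
    """
    lines = []
    for annotations, sequence in records:
        for key, value in annotations.items():
            # Key, three blanks, then the value starting at position 6.
            lines.append(f"{key}   {value}")
        for start in range(0, len(sequence), width):
            # Every sequence line begins with two space characters.
            lines.append("  " + sequence[start:start + width])
        lines.append("//")
    return "\n".join(lines) + "\n"


text = format_embl_like(
    [({"ID": "HSU07918", "AC": "U07918", "OX": "9606"}, "GGAATCGTTGTGG")],
    width=10,
)
print(text, end="")
```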

FastQ Format

This format is needed when sequences carry quality scores. Quality scores matter in the variant detection workflow because they are used to judge the reliability of a predicted polymorphism.

In the FastQ format, each sequence is represented by four lines:

  1. The first line starts with an '@' character, followed by the identifier of the sequence.
  2. The second line contains the sequence itself.
  3. The third line starts with a '+' sign and may carry some annotation for the sequence.
  4. The fourth line contains the quality scores of the sequence.
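
The four-line layout above can be read with a short Python sketch. The function names `parse_fastq` and `phred33_scores` are hypothetical. Two assumptions go beyond what this page states: the quality line has one character per base, and quality characters use the common Phred+33 encoding (ASCII code minus 33). Check the encoding of your own data before relying on the decoded numbers.

```python
def parse_fastq(text):
    """Return one dict per sequence, reading four lines at a time."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per sequence")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = (ln.strip() for ln in lines[i:i + 4])
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1}: header must start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"line {i + 3}: third line must start with '+'")
        # Assumption: one quality character per base.
        if len(quality) != len(sequence):
            raise ValueError(f"line {i + 4}: quality length differs from sequence")
        records.append(
            {
                "id": header[1:].strip(),
                "sequence": sequence,
                "annotation": plus[1:].strip(),
                "quality": quality,
            }
        )
    return records


def phred33_scores(quality):
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


sample = '@read1\nACGT\n+\n!"#$\n'
for record in parse_fastq(sample):
    print(record["id"], record["sequence"], phred33_scores(record["quality"]))
```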

Example FastQ File

The following example shows a FastQ file with five sequences:

@SRR002813.10000010F
GGGGAAGAATAACCTTTACAAACGGAAATACCACTT
+
$$$$""$&$&&""###%""""%##"!"#""!!"!"#
@SRR002813.10000043F
CAACACCTTTCAGGGGACAACCGTGGTACTGAGGAT
+
"$$#&$&$%&$"$!!!"""%"%"""!"!!"!!!!"#
@SRR002813.10000156F
CCTGTTACCCTCACCCAGGTCTCAGCAGGGCTGATT
+
!"$!!$$!#$$""""&!!""$"$"#""!!!!"!"#!
@SRR002813.10000250F
CTTTTTCACAAGAAATAAGTACCTTCTCAAATTGCT
+
$$$"$$"$$""!"##$"%!!$$"$$"''"!!!!!!"
@SRR002813.10000280F
CCTTACAGTATCTATTGTATGTGTTGCATTTAATTA
+
&!$$&&&$!&$"!##$!!##!!!%!!!%"""!"&""