Welcome to the GenomeQuest Documentation Wiki

Annotate Workflow

Use the Annotate Workflow to add annotation data from subject sequences onto query sequences in any sequence search.

Use Case

I have a database of 100,000 reads (454 machine) derived from a sample from the Sargasso Sea. I would like to know what is already known about these reads in the public domain. So, I run different Blast searches with my reads against various databases: SwissProt, Refseq mRNAs and the Genbank EST division. Now I would like to aggregate all this data: add annotation to my reads from my SwissProt hits (preferably), from Refseq hits (if there are no SwissProt hits), or from Genbank EST hits (as a last resort). For this, I use the Annotate workflow and specify my pecking order (the order of preference for copying the annotation over).

The Overall Flow of Data

Annotate Data Flow.png

When you run a Blast workflow you get a results database which shows:

  1. your query sequence(s)
  2. hits to your queries in the subject database(s)
    1. alignments between your query and corresponding subject sequence and
    2. full annotation of the subject sequence from the subject database.

The results are displayed in a result browser which allows you to filter, group and sort them in a variety of ways.

It is often useful to take the annotation of a particularly closely related subject sequence and create a new copy of the query sequence enriched with this annotation. This is the purpose of the Annotate workflow. Annotate can also combine multiple Blast runs that relate to the same query set.

For each query sequence, the Annotate workflow:

  • Looks for a "good" match in the first Blast workflow.
  • If there is a good alignment, it takes annotation from that sequence and adds it to your query sequence.
    • If not, it looks in the second Blast workflow for a good match and adds annotation from there.
  • And so on.
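
The tiered lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GenomeQuest's implementation: the function name `pick_annotation`, the dictionary layout of the Blast runs, the placeholder accessions, and the `good` quality test are all hypothetical. In particular, which hit is chosen when a single run contains several good hits is not described here; the sketch simply takes the first one it encounters.

```python
def pick_annotation(query_id, blast_runs, is_good):
    """Return (run_name, hit) for the first good hit, or None.

    blast_runs is a list of (run_name, hits_by_query) pairs, ordered
    from most to least preferred (hypothetical layout).
    """
    for run_name, hits_by_query in blast_runs:
        for hit in hits_by_query.get(query_id, []):
            if is_good(hit):
                return run_name, hit
    return None


# Hypothetical results: two runs listed in order of preference.
runs = [
    ("SwissProt", {"read1": [{"accession": "SP_A", "percent_id": 92.0}]}),
    ("Refseq", {
        "read1": [{"accession": "RS_A", "percent_id": 99.0}],
        "read2": [{"accession": "RS_B", "percent_id": 88.0}],
    }),
]


def good(hit):
    # Assumed quality test: identity at or above 85%.
    return hit["percent_id"] >= 85.0


for read in ("read1", "read2", "read3"):
    print(read, pick_annotation(read, runs, good))
```

In this example, read1 takes its annotation from SwissProt even though its Refseq hit has a higher identity, because the order of preference decides. read2 falls through to Refseq, and read3 has no hits at all and stays unannotated.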

How to Use the Annotate workflow

Your query sequences all have to be nucleotide sequences for Annotate to work. In principle, it is possible to make Annotate available for protein queries as well, but this does not appear to be a big need. Note, however, that the subject databases (from which you extract annotation for your queries) can be nucleotide or protein.

To annotate your set of query sequences, first think about where (which databases) you are most likely to find good matches to your sequences. Which databases to run Blast searches against (and in what order) depends on your sequence data and the goals of your project.

  1. For rodent (or primate) sequences, you might want to try SwissProt, Refseq and Genbank rodent (or primate) division (in that order).
  2. If you have deep sequencing of human transcripts, you could try to annotate the reads against the following (in that order):
    1. virtual databases of human transcripts
    2. other primate transcripts
    3. rodent transcripts
    4. the human genome

Note that the order in which the Blast searches are considered is specified on the launch page of the Annotate workflow.

Step 1: Get Data to the Server

  • Ensure that your sequence data files are on the GenomeQuest server. You can do this by using the upload procedure.
  • One of your colleagues may have already uploaded the data. If so, ask them to share it with you.

Step 2: Launch the Blast workflow

Run a sequence search with your sequence data against the reference databases of your choice. It might be helpful to create virtual databases of special interest to you. If you have proprietary sequences, you can run Blast searches against them as well. For full details on how to run Blast searches, please see the wiki page on launching a sequence search.

As mentioned above, you will likely have to run multiple searches since Annotate allows you to have a tiered preference of where to copy the annotation from.

Step 3: Launch Annotate

Once all your Blast workflows are completed, you can run the Annotate workflow on them to copy annotation from the subject sequences to your queries. You have to fill out the following parameters.


Annotate Launch Page.png

Run name

Give your run an appropriate name. The name defaults to a time stamp. This is the name that you will see on your My GenomeQuest page.

Choose your Blast runs

The list displays all the Blast runs you can access in which the queries are nucleotide sequences. Choose the ones that relate to the queries you would like to annotate. You can choose up to 10 Blast workflows.

  1. Choose multiple Blast workflows on the left and move them to the right by clicking the >> button.
  2. Set the pecking order (the order in which the runs are consulted for annotation) by moving the name of each Blast run up or down the list.
    1. Ex: Put your search against SwissProt at the top and the annotation from that database will be given the highest priority.

Set Percent ID

I may prefer annotation from Refseq to that from, say, the Genbank EST division. However, if the Refseq hit is very weak, I run the risk of annotating my query with information from an unrelated sequence. In setting the percentage identity, you are specifying how strong you want the hit to be before the annotation information is copied over. This is set by default to 85%. This means annotation information is not copied over unless the subject sequence is at least 85% identical along the entire length of the query.
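
The effect of measuring identity along the entire length of the query can be shown with a small calculation. The sketch below is one plausible reading of that rule, not the documented formula: it assumes identity is the number of identical aligned positions divided by the full query length, and the function names are hypothetical.

```python
def identity_over_query(identical_positions, query_length):
    """Percent identity measured against the whole query (assumed definition)."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return 100.0 * identical_positions / query_length


def passes_threshold(identical_positions, query_length, threshold=85.0):
    """True if the hit is strong enough for annotation to be copied over."""
    return identity_over_query(identical_positions, query_length) >= threshold


# A 100-base query whose alignment covers only 80 bases, all identical:
# a perfect local match, but only 80% identity over the whole query.
print(passes_threshold(80, 100))

# The same query with 90 identical positions clears the default threshold.
print(passes_threshold(90, 100))
```

Under this reading, a short but perfect local alignment is not enough on its own; the match also has to cover most of the query.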

Annotation Fields

Here, you specify which annotation fields from the subject databases are added to your query. For example, when you choose Accession, the accession number of the subject sequence is added to your query sequence. This is likely to be very informative if your subject database is Genbank or Refseq; on the other hand, it may not be very informative if the subject database contains genomic chromosome sequences. The fields you can choose from include:

  1. Accession. Subject accession number
  2. Description. The description or title of the subject sequence.
  3. Organism and Taxonomy ID. The name of the organism and its Tax ID (as annotated on the subject sequence).
  4. Species Taxonomy. This includes the full classification scheme of the subject organism.
  5. Gene Name of the subject sequence.
  6. Database. The name of the database (like "SwissProt" or "Refseq") from where the hit was found. This and the accession should give you a unique way to trace back to the original subject sequence.
  7. Keywords.

These two features are properties of the alignment between query and subject.

  1. Blast Score. When chosen, this number is available in a separate field to allow you to filter on it as a number.
  2. Display the alignment. If you check this box, the alignment between the query and subject sequence is included in the annotation copied over.

The following is a property of the query sequence.

  1. Base Quality Values. If you check this box, the quality values of the query sequences are preserved in the new copy of the query.

Note that if a particular annotation field is absent in a subject database (say, if Gene Name is missing from a database like the Genbank EST division), the Annotate workflow will simply leave that field blank.
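
The blank-field behaviour can be pictured as follows. This is a minimal sketch with hypothetical field keys, a hypothetical record layout, and a made-up subject record; it does not reflect how GenomeQuest stores annotation internally.

```python
def copy_fields(subject, selected_fields):
    """Copy the selected fields from a subject record.

    A field the subject record does not carry is left as an empty string.
    """
    return {field: subject.get(field, "") for field in selected_fields}


# Hypothetical EST record that has no gene name.
est_hit = {"accession": "EST0001", "description": "example EST clone"}

annotation = copy_fields(est_hit, ["accession", "description", "gene_name"])
print(annotation)
```

The resulting annotation keeps a `gene_name` entry with an empty value, so every annotated query exposes the same set of fields regardless of which subject database supplied them.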

Step 4: View the results

Report

The report page contains a link to the query database (beefed up with more annotation) and also some statistics.

Annotate Report Page.png

Database of annotated reads

The database of annotated query sequences can be viewed and browsed just like any other sequence database.

Annotated Query DB.png

Material and Methods

Other related workflows

  • Sequence Search lets you compare reads against a wide variety of content (nucleotide sequences and peptides) using a choice of algorithms.
  • Newbler lets you assemble reads into longer contigs. For example, you can focus on a set of reads of interest (specific to an organism, or a gene), and assemble them with Newbler.
