Welcome to the GenomeQuest Documentation Wiki
Rapid Annotation Process
The Metagenomics Process (formerly the "Rapid Annotation Process" or "RAP") is a sequence search process that accepts millions of sequences as input (the query database) and compares them against a broad collection of target sequences (the subject databases). The result of this workflow is a GenomeQuest database of reads annotated with their "best hit". The annotated read database can then be filtered, grouped, and selected for further analysis such as gene mapping. It can also be used in additional workflows such as contig assembly with the 454 Newbler assembler.
The Metagenomics Process has been designed to annotate large datasets of long reads (454 or Sanger). By "annotate", we mean providing an overview of the “landscape” of the dataset as compared to various subject databases. The landscape is described by several measures of the distribution of annotated reads:
- composition by datasets
- composition by organisms
Also included in this process is a fully annotated database of reads with the "best hit".
In the example below, we start with 300,000 454 reads from the Medicago truncatula species. The Metagenomics workflow will provide you with information such as a global picture of species distribution and the ability to drill down to a per-read deep annotation of the best hit.
Metagenomics Database distribution pie chart
Video for demonstration
How to Use the Metagenomics workflow
There are three steps in using this workflow:
Get Data to the Server
- Ensure that your sequence data files are on the GenomeQuest server. You can do this by using the upload procedure.
- One of your colleagues may have already uploaded the data. If so, ask them to share it with you.
Launch the Metagenomics Process
To launch a new Metagenomics Process, you need to click the "Launch New" button on the My GenomeQuest page. The submit page will then be displayed:
You need to fill in a few parameters, which are described below:
You need to name your run with an appropriate name. The default will be "Metagenomics DATE", where DATE is today's date.
This dropdown list displays the list of datasets you have uploaded or that have been shared with you. You can upload new datasets by clicking on the "Add a new dataset..." link.
- Multiple datasets can be added by holding the CTRL key while selecting the items in the list of datasets.
- Only 454 or Sanger sequences will be efficiently annotated with Metagenomics analysis. Shorter reads (Illumina, SOLiD) need to be assembled first.
Choose the reference databases you want to use to annotate your sequences:
- Human: GenBank primate division, as well as human genomes in RefSeq Genomes.
- Mouse: GenBank rodent division, as well as mouse genomes in RefSeq Genomes.
- Rat: GenBank rodent division, as well as rat genomes in RefSeq Genomes.
- Vertebrates: GenBank mammalian, primate, rodent, and vertebrate divisions, as well as the vertebrate mammalian and vertebrate other divisions of RefSeq Genomes.
- Invertebrates: GenBank invertebrate division, as well as the invertebrate division of RefSeq Genomes.
- Bacteria: GenBank bacteria, bacteriophage, environmental samples, and synthetic divisions, as well as Microbial and Plasmid divisions of RefSeq Genomes.
- Virus: GenBank viral and environmental samples divisions, as well as plasmid and viral divisions of RefSeq Genomes.
- Plants: GenBank plant/fungal/algal division, as well as plant division of RefSeq Genomes.
- Fungi: GenBank plant/fungal/algal division, as well as fungi and microbial divisions of RefSeq Genomes.
- Protozoa: GenBank environmental samples division, as well as protozoa division of RefSeq Genomes.
- mRNA: The RefSeq mRNA division.
Simple and Advanced Modes
An advanced mode is available for the Metagenomics workflow. It's intended for advanced users with knowledge in Bioinformatics and in-depth understanding of the HS3 algorithm. Click here for the advanced Metagenomics parameters.
View the results
The Metagenomics report is divided into the sections described below. Each section contains links that, once clicked, take you to the corresponding reads.
This section displays global statistics such as the total number of reads, the total number of non-redundant reads, and the total number of assigned reads (those with at least one hit).
Metagenomics report: statistics
This section contains a table of the number of reads and percentage of reads annotated by the different databases you selected on the submit page.
These statistics are an excellent way to assess the purity of your datasets (for example, contamination by viruses or bacteria), and to quickly drill down into the reads annotated by a specific content (only human reads, only virus reads, and so on).
On the screen capture below, most of the reads have been annotated by Plant sequences, since the dataset came from a sequencing experiment of Medicago truncatula.
Metagenomics report: database/dataset distribution
This section contains a pie chart, and a table of the distribution of annotated reads among the datasets you used in your Metagenomics workflow. This section is particularly useful when you select several datasets to annotate at once.
Metagenomics report: dataset distribution
This section contains a pie chart of the 10 most common organisms in your experiment, as well as a table listing all organisms. As with the other tables in the report, you can click on the organism links to quickly drill down into a specific organism of interest.
Metagenomics report: organism distribution
Databases of reads
From the report, you can drill down into any specific content, such as:
- All the reads annotated by a specific content (human, mouse, rat, plant, virus, ...)
- All the unannotated reads (those without hits)
- All the reads annotated by a specific organism
The statistics part of the Metagenomics report provides links to the complete read set, as well as to the annotated and unannotated reads.
Links to the annotated and unannotated reads are embedded in the Statistics section of the Metagenomics report
Database of annotated reads
When you browse the annotated reads, you will see the list of reads, along with the annotations taken from their best hit. You can filter/sort/group on the annotations of those reads.
Metagenomics: database of annotated reads
Database of unannotated reads
The Metagenomics report contains a link to the unannotated reads as well. Those reads don't have a match to any sequence and are unassigned or unannotated. You can launch a Sequence Search to find similarities to proteins, or you can launch a de novo assembly with Newbler on those reads.
Database of results
For each annotated read, you can get the evidence for the assignment of this read to the corresponding sequence by viewing the best alignment to this read.
You can access those alignments by clicking on the "See results for this read" link in the Comments field of an annotated read, as shown below.
Metagenomics: annotation details for a read. Note the "See results for this read" link embedded in the Comments field
Material and Methods
As a first step, HS3 uses a word-based search that maintains high sensitivity while reducing the total computation time by cutting down the overall number of pairwise alignments. This step generates a subset of the data that is then subjected to BLASTN analysis.
It offers a significant speedup over BLAST when large sequence databases are analyzed and high percentages of similarity are required.
In the first step, the subject sequences are cut into fragments of 5,000 bases with an overlap of 500 bases, and two sequences must have at least 10 words of 10 nucleotides in common. In the second (BLASTN) step, the minimum score is 30, and the 5 best-scoring alignments are kept for each read.
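The two sets of parameters above can be illustrated with a minimal Python sketch. This is not the HS3 implementation: the function names `fragment` and `keep_best_hits` are hypothetical, hits are represented as simple `(subject_id, score)` pairs for illustration, and it is assumed that "an overlap of 500 bases" means consecutive fragments share 500 bases (so each fragment starts 4,500 bases after the previous one).

```python
from typing import Iterator, List, Tuple

FRAGMENT_LENGTH = 5000  # fragment size in bases (from the text)
OVERLAP = 500           # bases shared by consecutive fragments (assumed meaning)
MIN_SCORE = 30          # minimum alignment score in the second step (from the text)
MAX_HITS = 5            # alignments kept per read (from the text)


def fragment(sequence: str,
             length: int = FRAGMENT_LENGTH,
             overlap: int = OVERLAP) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) pairs covering the whole sequence."""
    step = length - overlap
    start = 0
    while True:
        yield start, sequence[start:start + length]
        if start + length >= len(sequence):
            break  # this fragment reached the end of the sequence
        start += step


def keep_best_hits(hits: List[Tuple[str, float]],
                   min_score: float = MIN_SCORE,
                   max_hits: int = MAX_HITS) -> List[Tuple[str, float]]:
    """Drop hits below min_score, then keep the max_hits best-scoring ones."""
    passing = [hit for hit in hits if hit[1] >= min_score]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:max_hits]
```

For a hypothetical 12,000-base subject sequence, `fragment` produces windows starting at positions 0, 4,500, and 9,000; the last window is shorter than 5,000 bases because it is cut off at the end of the sequence.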
The algorithm explained
The inner workings of the HS3 algorithm are described in Dudoignon et al. 2002. What follows here is a short, simplified summary.
The HS3 algorithm acts as a filter before an attempt to align two sequences is made. This filter is based upon the number of words two sequences have in common. A word is defined as a number of consecutive residues, for example 10 nucleotides. The lspmul algorithm counts and stores the number of words in every individual sequence involved in the comparison. For every sequence, the frequency of all the encountered words is stored in a table. This table is used to determine the number of words in common between two sequences. If the number of common words exceeds a precalculated threshold, the algorithm will attempt to align the two sequences using a derivative of the BLAST algorithm.
The threshold (T) is calculated from the length of the overlap window (L), the minimal percentage of identity (P) specified by the user, and the length of the word (W), which is set by us, using the following formula:
T = max(1, L - W + 1 - (1 - P/100) * L * W)
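The word-based filter described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual HS3 or lspmul code: the function names are hypothetical, common words are counted with multiplicity (a word occurring twice in both sequences counts twice), and a pair is taken to pass when the count reaches the threshold. Both of these counting conventions are assumptions.

```python
from collections import Counter

WORD_LENGTH = 10  # word size for nucleotide sequences (from the text)


def word_counts(sequence: str, word_length: int = WORD_LENGTH) -> Counter:
    """Build the frequency table of all overlapping words in a sequence."""
    return Counter(
        sequence[i:i + word_length]
        for i in range(len(sequence) - word_length + 1)
    )


def common_words(counts_a: Counter, counts_b: Counter) -> int:
    """Count the words shared by two frequency tables, with multiplicity."""
    # Counter intersection keeps the minimum count of each shared word.
    return sum((counts_a & counts_b).values())


def passes_filter(seq_a: str, seq_b: str, threshold: int,
                  word_length: int = WORD_LENGTH) -> bool:
    """Return True if the pair shares enough words to be worth aligning."""
    shared = common_words(word_counts(seq_a, word_length),
                          word_counts(seq_b, word_length))
    return shared >= threshold
```

Only pairs for which `passes_filter` returns `True` would be handed to the more expensive alignment step, which is where the reduction in the number of pairwise alignments comes from.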
For reasons explained in the article, the word length is set to 10 residues for nucleotide sequences and to 5 residues for protein sequences.
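As a worked example of the threshold formula, here is a small Python sketch. The function name `word_threshold` is hypothetical. One way to read the formula: a window of length L contains L - W + 1 overlapping words, it may contain up to (1 - P/100) * L mismatches, and each mismatch can destroy at most W words, so the remainder is a lower bound on the number of words that two sufficiently similar sequences must share.

```python
def word_threshold(window_length: int,
                   percent_identity: float,
                   word_length: int = 10) -> float:
    """Compute T = max(1, L - W + 1 - (1 - P/100) * L * W)."""
    total_words = window_length - word_length + 1
    max_mismatches = (1 - percent_identity / 100) * window_length
    # Each mismatch can fall inside at most `word_length` overlapping words.
    return max(1, total_words - max_mismatches * word_length)


# L = 100, P = 95, W = 10: 91 words, up to 5 mismatches destroying up to
# 50 words, so T is approximately 41 (up to floating-point rounding).
print(word_threshold(100, 95))

# L = 100, P = 80, W = 10: the bound is negative, so T falls back to 1.
print(word_threshold(100, 80))
```

The second call shows why the `max(1, ...)` guard is needed: for low identity percentages the bound drops below zero, and the filter then requires only a single common word.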
- Sequence Search lets you compare reads to a wide variety of content (nucleotide sequences and peptides) using a choice of algorithms.
- Newbler lets you assemble reads into longer contigs. For example, you can focus on a set of reads of interest (specific to an organism or a gene) and assemble them with Newbler.
- Dudoignon L, Glemet E, Heus HC, Raffinot M. High similarity sequence comparison in clustering large sequence databases. Proc IEEE Comput Soc Bioinform Conf. 2002;1:228-36.