Welcome to the GenomeQuest Documentation Wiki

RNA Seq


Use the RNA-Seq workflow to measure gene expression profiles from next-generation sequencing (NGS) data. This analysis is sometimes also referred to as Digital Gene Expression.

Example use cases

  • You have 40 million Illumina reads derived from healthy human liver and human hepatoma samples. You would like to see which genes are overexpressed in each sample.
  • You have 500 million Illumina reads derived from multiple lines of human autism samples. You would like to see which genes are overexpressed in each sample.
  • You have 800 million SOLiD reads from an experimental time series from Arabidopsis thaliana. You would like to view the gene expression profiles over the time series.
  • Your organization has the (as yet unpublished) transcriptome sequences of organism X. You now have 250 million Helicos reads from organism X and would like to measure gene expression.

How to Use RNA Seq

RNA-Seq takes one or two experiments as input, processes them against the Transcriptome and the Genome of a given species, and produces a report with statistics, two spreadsheets, and a database of genes annotated with read counts.

Video for Demonstration

There are three steps in using this workflow:

Get Data to the Server

  • Ensure that your sequence data files are on the GenomeQuest server. You can do this by using the upload procedure.
  • One of your colleagues may have already uploaded the data. If so, ask them to share it with you.

Launch the RNA-Seq Workflow

Go to the RNA-Seq launch page as described in the My GenomeQuest help page.

Datasets

Step 1: Give your run a name.

Step 2: Choose the data set you would like to use for each of Experiment A and Experiment B. If necessary, get your data to the GenomeQuest server as explained above.

What is one experiment? One experiment is a set of reads from one or more FASTA files that hold data from the same experiment, e.g. a tissue expression measurement under a single experimental condition. Any number of FASTA files can make up an experiment; typically, multiple files correspond to the lanes of a multi-lane run. The first experiment is labeled Experiment A.

What is two experiments? With two experiments, you provide two complete read datasets as input. The second experiment is labeled Experiment B. It is typically a different tissue from Experiment A, or the same tissue under different conditions. When two experiments are present, additional statistical measures of differential expression are provided.

Multiple Files:

  • Note that multiple files can be uploaded to form a single database on GenomeQuest.
  • When a database composed of multiple files is used as Experiment A or B (or both), GenomeQuest produces a Complete Table of the number of reads from each file mapping to each gene. This is particularly useful when different lanes represent different sample IDs.

What is the difference between having one or two experiments?

  • Whether one or two experiments are provided, all reads from all files of the experiment(s) are processed against the subject databases (Transcriptome, Genome). For each experiment, the read count and the RPKM (reads per kilobase per million reads) are computed per gene for every file. These per-file data are provided as a raw spreadsheet through the link “Complete Table”.
  • An additional spreadsheet and a database of annotated genes are provided that contain, per experiment and per gene:
    • the total read count: the sum of the read counts of all files in the experiment for that gene.
    • the mean RPKM: the mean of the RPKM values of all files in the experiment for that gene.
  • Finally, in the case of two experiments, a statistical model is applied to the two datasets above to qualify whether a gene in Experiment B is overexpressed, underexpressed, or neutral (the difference is not statistically significant) relative to Experiment A. A P-value for this qualifier is also computed. The statistics are processed with winflat, and the model is based on the following paper: Stéphane Audic and Jean-Michel Claverie, "The significance of digital gene expression profiles", Genome Research, Vol. 7, No. 10, pp. 986-995, 1997.
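
The two-experiment comparison can be sketched in Python. The formula is the conditional probability p(y|x) published in the Audic and Claverie paper. The function names, the two-sided tail convention, and the example counts are assumptions made for illustration; this is not a description of how GenomeQuest or winflat is implemented.

```python
import math


def ac_prob(x, y, n1, n2):
    """Audic-Claverie probability of seeing y reads in B given x reads in A.

    x, y   : read counts for one gene in Experiment A and Experiment B
    n1, n2 : total number of reads in Experiment A and Experiment B
    """
    r = n2 / n1
    # Work in log space so that large counts do not overflow the factorials.
    log_p = (
        y * math.log(r)
        + math.lgamma(x + y + 1)
        - math.lgamma(x + 1)
        - math.lgamma(y + 1)
        - (x + y + 1) * math.log(1 + r)
    )
    return math.exp(log_p)


def ac_pvalue(x, y, n1, n2):
    """Two-sided P-value (assumed convention: twice the smaller tail, capped at 1).

    For very extreme counts the upper tail is limited by floating-point
    precision and may be reported as 0.0.
    """
    lower = sum(ac_prob(x, k, n1, n2) for k in range(y + 1))   # P(Y <= y)
    upper = max(0.0, 1.0 - lower + ac_prob(x, y, n1, n2))      # P(Y >= y)
    return min(1.0, 2.0 * min(lower, upper))


# Hypothetical gene: 10 reads in A, 40 reads in B, one million reads per experiment.
p = ac_pvalue(10, 40, 1_000_000, 1_000_000)
print(f"P-value: {p:.3g}")
```

A small P-value suggests that the difference in counts between the two experiments is unlikely to be due to sampling alone; which threshold GenomeQuest uses to call a gene overexpressed or underexpressed is not stated here.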

Dataset Type

Dataset types can be 454, Illumina, Helicos or SOLiD.

Reference Transcriptomes

Choose the reference transcriptome to compare to. The following reference transcriptomes are available:

  1. Human (Homo sapiens)
  2. Mouse (Mus musculus)
  3. Rat (Rattus norvegicus)
  4. Rice (Oryza sativa)
  5. Thale cress (Arabidopsis thaliana)
  6. Maize / corn (Zea mays)
  7. Sorghum (Sorghum bicolor)
  8. Soybean (Glycine max)

GenomeQuest is constantly adding more reference databases. You can also add a transcriptome that you have collated yourself, using the following procedure:

  1. Create a sequence database in EMBL format.
  2. The following fields are required for a transcriptome database to be available in RNA-Seq:
    1. AC for accession number.
    2. GN for gene name.
    3. Other fields are allowed. Please see annotation fields list for the full list of other annotation fields recognized by GenomeQuest.
  3. Upload the database to GenomeQuest
  4. Please drop us a line for further assistance.
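
Before uploading, it can help to check that every record carries the two required fields. The sketch below assumes the usual EMBL flat-file layout, where each annotation line starts with a two-letter line code and each record ends with a line containing only `//`. The function name and the example records (including the way the GN value is written) are made up for illustration; the exact syntax GenomeQuest expects inside each field is not specified on this page.

```python
def missing_required_fields(embl_text, required=("AC", "GN")):
    """Return {record index: [missing line codes]} for an EMBL-style flat file.

    Assumption: records are terminated by a line containing only '//',
    and annotation lines begin with a two-letter uppercase line code.
    """
    problems = {}
    codes_seen = set()
    record_index = 0
    for line in embl_text.splitlines():
        if line.strip() == "//":
            # End of record: report any required code that never appeared.
            missing = [code for code in required if code not in codes_seen]
            if missing:
                problems[record_index] = missing
            record_index += 1
            codes_seen = set()
        elif len(line) >= 2 and line[:2].isalpha() and line[:2].isupper():
            codes_seen.add(line[:2])
    if codes_seen:
        # Last record had no '//' terminator; check it anyway.
        missing = [code for code in required if code not in codes_seen]
        if missing:
            problems[record_index] = missing
    return problems


# Two made-up records; the second one lacks a GN line.
example = """ID   TX0001
AC   TX0001;
GN   geneA
SQ   Sequence 12 BP;
     acgtacgtac gt
//
ID   TX0002
AC   TX0002;
SQ   Sequence 12 BP;
     acgtacgtac gt
//
"""
print(missing_required_fields(example))  # {1: ['GN']}
```

An empty result means every record has both an AC and a GN line.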

View the results

Go to the result page of your run through the My GenomeQuest page.

The top of the page shows overall statistics and lists of genes that are over-expressed in experiment A and experiment B.

RNA SEQ Report 1.png

Below the lists, you will see statistics of the computational details of the workflow.

Database of Genes

As described in the workflow details section below, the main end result of the workflow is a database of all genes in the species, annotated with:

  1. raw count of reads mapping to its transcript(s).
  2. normalized (RPKM) count of reads mapping to its transcript(s).
  3. In case of 2 experiments, whether the gene is over-expressed in Experiment A as compared to B.

This database is accessible from the result page shown above, and you can browse it just like any other sequence database.

Export to Other Tools: You can export your results from the database of genes to an Excel spreadsheet. From Excel, it should be possible to export into other tools such as GeneSpring or Spotfire to view heat maps or other renderings of the expression profiles.

Workflow Details

The RNA-Seq workflow operates on two subject databases: a Transcriptome database and a Genome database. Both are built natively from Gene databases. Gene databases come from two main sources, each with its own format:

  1. Databases from NCBI Entrez Gene. These mainly cover vertebrate (mammalian) species and a couple of plant species. The data are very well annotated and structured. GenomeQuest reads them directly in the NCBI ASN.1 format.
  2. Databases from reference web sites, such as public domain consortiums. Unfinished plant genomes, such as corn or soybean (soja), are typical of these. The data are generally less well annotated than the NCBI data; the native format is GFF.

GenomeQuest mines all the sources in different formats, and produces annotated Gene / Genome / Transcriptome / Proteome databases. This provides benefits like:

  1. Remapping of all GenBank records not included in RefSeq mRNA.
  2. A powerful viewer, GQGene Viewer, that displays gene structures and read alignments relative to exons, introns, etc.

As a result, the Transcriptome databases used are more comprehensive than classical ones. Additionally, GenomeQuest filters out some transcripts that could lead to wrong mappings, such as HTC divisions and GenBank transcripts that are only partially mapped.

RNA-Seq can potentially take any transcriptome and genome data source as input. The only constraint is that every transcript record must have a unique GeneID in its annotation (as explained above).

Details of Workflow

For simplicity, the description below considers only one experiment, which may or may not be composed of multiple lanes. All gene counts are computed relative to a single lane from a single experiment. The process operates against a single given species.

The overall process is split into two main steps:

  1. Transcriptome Analysis
  2. Genome Correction

Step 1 (Transcriptome Analysis) aligns all reads to a database of all transcripts available for a given species. Every transcript is assigned to a unique Gene Name (or GeneID). Aligned reads are counted on a per-gene basis: a read aligning to multiple transcripts of the same gene is counted once. A read aligning to two genes, regardless of the number of transcripts it aligns to within either gene, is counted twice (once for each gene), and so on, as illustrated in the figure below.

RNA Seq Workflow.png
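
The per-gene counting rule can be sketched in a few lines of Python. The data structures, read names, and gene names below are hypothetical and only illustrate the rule; they are not GenomeQuest's internal representation.

```python
from collections import Counter


def count_reads_per_gene(read_hits, transcript_to_gene):
    """Count each read once for every distinct gene it aligns to.

    read_hits          : read id -> list of transcript ids the read aligns to
    transcript_to_gene : transcript id -> gene id (each transcript has one gene)
    """
    counts = Counter()
    for transcripts in read_hits.values():
        # Collapse transcripts to their genes so that several transcripts
        # of the same gene contribute a single count for this read.
        genes = {transcript_to_gene[t] for t in transcripts}
        counts.update(genes)
    return counts


tx2gene = {"tx1a": "geneA", "tx1b": "geneA", "tx2a": "geneB"}
hits = {
    "read1": ["tx1a", "tx1b"],  # two transcripts of geneA: counted once
    "read2": ["tx1a", "tx2a"],  # geneA and geneB: counted once for each gene
    "read3": ["tx2a"],
}
gene_counts = count_reads_per_gene(hits, tx2gene)
# geneA and geneB each end up with a count of 2.
print(dict(gene_counts))
```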

The alignment algorithm used to map reads to transcripts and genes is Mega Search, the same one used on the Sequence Search page. Depending on the type of reads (short or long), it performs a global (GenePAST) or a local (BLAST) alignment.

  1. BLAST alignment is used for 454 reads. A read and a transcript having at least 10 words in common are candidates for the BLAST aligner, and only alignments with a score greater than or equal to 20 are kept.
  2. For short reads, the main parameter is the number of errors (mismatches or gaps) allowed (#errs). Typically, for Illumina reads with an average length of 36, #errs = 3. For SOLiD 3 data (51 cs), #errs = 5.

At the end of this process, only reads that map to a unique gene are kept. Any read that aligns to more than one gene is discarded. These criteria can be changed with the GQ Engine API.


Step 2 (Genome Correction) takes all reads uniquely mapped to the Transcriptome and rescans them against the Genome. All reads that align to multiple locations on the genome are then discarded as well. Here too, these criteria can be modified using the GQ Engine API.
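
The two filtering steps can be sketched as follows. This is a minimal illustration with hypothetical inputs, not the GQ Engine API. How reads with no genome hit at all are handled is not stated on this page; the sketch assumes they are kept, since the text only mentions discarding reads with multiple genome locations.

```python
def uniquely_mapped_reads(read_genes, genome_hits):
    """Return {read id: gene id} for reads that pass both filtering steps.

    read_genes  : read id -> set of genes hit in the transcriptome analysis
    genome_hits : read id -> number of genome locations hit in the rescan
    """
    kept = {}
    for read_id, genes in read_genes.items():
        if len(genes) != 1:
            continue  # Step 1: keep only reads that map to exactly one gene
        if genome_hits.get(read_id, 0) > 1:
            continue  # Step 2: discard reads with multiple genome locations
        kept[read_id] = next(iter(genes))
    return kept


read_genes = {
    "r1": {"geneA"},
    "r2": {"geneA", "geneB"},  # two genes: dropped in Step 1
    "r3": {"geneB"},           # three genome locations: dropped in Step 2
    "r4": {"geneC"},           # no genome hit: kept (assumption, see above)
}
genome_hits = {"r1": 1, "r3": 3, "r4": 0}
kept_reads = uniquely_mapped_reads(read_genes, genome_hits)
print(kept_reads)  # {'r1': 'geneA', 'r4': 'geneC'}
```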

All reads passing Step 1 and Step 2 are submitted to various statistical processes to produce a gene database annotated with different values (read count, RPKM, etc.).
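
As a rough illustration of these values, the sketch below computes RPKM per lane and then the total read count and mean RPKM per gene for one experiment. It uses the common definition of RPKM (reads mapped to the gene, divided by the gene length in kilobases and by the total number of reads in millions). Which length and which read total GenomeQuest uses are not stated on this page, so both are assumptions here, as are the function names and example numbers.

```python
def rpkm(read_count, length_bp, total_reads):
    """Reads per kilobase per million reads."""
    return read_count * 1e9 / (length_bp * total_reads)


def summarize_experiment(lane_counts, gene_lengths):
    """Total read count and mean RPKM per gene across the lanes of one experiment.

    lane_counts  : lane name -> {gene id: read count}, each lane non-empty
    gene_lengths : gene id -> length in base pairs (assumed to be known)
    """
    summary = {}
    for gene, length in gene_lengths.items():
        counts, rpkms = [], []
        for per_gene in lane_counts.values():
            lane_total = sum(per_gene.values())  # assumption: reads kept in this lane
            count = per_gene.get(gene, 0)
            counts.append(count)
            rpkms.append(rpkm(count, length, lane_total))
        summary[gene] = {
            "total_reads": sum(counts),
            "mean_rpkm": sum(rpkms) / len(rpkms),
        }
    return summary


# Hypothetical two-lane experiment with two genes.
lanes = {
    "lane1": {"geneA": 100, "geneB": 900},
    "lane2": {"geneA": 300, "geneB": 700},
}
lengths = {"geneA": 2000, "geneB": 1000}
summary = summarize_experiment(lanes, lengths)
print(summary["geneA"])
```

The example lanes are tiny, so the RPKM values come out very large; with real runs of millions of reads the values fall in a more familiar range.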

RNA-Seq On Multiple Experiments

Set up

The example below merges RNA samples into one combined Virtual Database (VDB); the same procedure works for any multi-experiment situation.

  1. Suppose that Control01, Control02, Control03, Control04 and Control05 are five files containing reads from control samples at five different time points.
  2. Suppose that Case01, Case02, Case03, Case04 and Case05 are five files containing reads from case samples at five different time points.

To analyze the controls and the cases (treated samples), you can merge them into a GenomeQuest Virtual Database (VDB) prior to running the RNA-Seq workflow.

Procedure

  • Go to the GenomeQuest Dashboard.
  • Click on the search button next to the search tool.

Dashboard.jpg

  • Uncheck the 'Folder' and 'Workflow' buttons, but leave 'Seqdb' checked.
  • Now select "producer workflow type" for the left-most drop-down, keep the middle drop-down on "equals", and select "Read Processing" for the right-most drop-down.
  • Next, hit 'Apply'. You will now have a list of your previously uploaded and processed read sequence databases.

Select.jpg

  • Check the runs that you wish to merge into a larger virtual database, then click "Browse Sequence Databases".

Merge.jpg

  • Once inside the sequence database browser, click "Results" -> "Save as Virtual Database", and give your VDB an intuitive name when prompted to do so.

Vdb.jpg

  • Next, access the RNA-Seq Workflow through the Workflows drop down via "Workflows" -> "Other" -> "RNA-Seq", and you will now see your VDB with the rest of your reads.

Please email us if you are interested in learning more about this feature.

Export to GeneSpring

If you use GeneSpring, you can export calculated expression levels from GenomeQuest to GeneSpring. See our export page for this procedure.

Other related workflows