Welcome to the GenomeQuest Documentation Wiki

Sequence Database Browser


Use this page to run an annotation search (keyword search), that is, to look up sequences by words in their annotation, such as gene name, organism, or patent assignee.

Use cases

  • Get all human EST sequences and create a virtual database of them.
  • Find all human kinase sequences in the curated RefSeq mRNA sequence database.
  • Get sequences belonging to a particular patent (given the patent number).
  • Get the genome sequence of organism Arabidopsis thaliana.
  • Get my NGS reads from Sample_1234 which I uploaded and processed.

Getting Started

For each of these use cases, there are three simple steps:

  1. Identify the databases in which to look and open them in browse mode.
  2. Apply operations to show only the sequences that are of interest.
  3. Finish up by creating a virtual database or launching other applications.

Step 1: Identify the databases

For browsing, you can choose either Reference Databases in GenomeQuest (such as Genbank, patents, and genomes) or one of your own databases (such as your NGS reads database). Browsing works the same way for either type of database. However, reference databases are typically easier to find.

Finding Reference Databases

To select a reference database "by hand":

  1. Click the "+" sign next to Reference Databases in the left panel of your My GenomeQuest page.
  2. Choose the appropriate node by clicking its name. For example, click Patents to see patent databases, or click Nucleotide to see nucleotide databases like Genbank. In the picture below, I am looking for the Genbank primate and rodent divisions, so I click the Nucleotide link.
  3. Then in the main panel, select the appropriate databases by clicking the checkbox next to their name. In this example, I selected the Genbank primate and rodent divisions.
  4. To browse the selected databases, click the Browse Sequence Databases button at the top.

Choosing Genbank Primate and Rodent divisions to browse

Since there are many reference databases, GenomeQuest also provides a way to select databases by searching for their name in the search box at the top of the left panel. For the above example, I type in "Genbank" and then select the Primate and Rodent divisions.

Note that I can always click the header of the Name column in the table to sort the databases alphabetically by name, which helps in locating them.

Using the Searchbox to look for Genbank entries

Finding your own databases

If you don't do NGS work, this section is not for you; move on to the next section, Step 2: Applying Operations.

A reads database and the Reads Processing workflow that created it have the same name, which is the name you specified on the workflow launch page. Browsing is done on the product of the workflow, that is, on the reads databases and not on the workflows themselves. This can be confusing because the two share a name. (Note that some Reads Processing workflows create multiple databases, for example in the case of multiplexed data.)

The simplest case is to find just one reads database, which you can do by searching for the name of the Reads Processing workflow of interest. In the picture below, I searched for my_ilmn_. The results show a reads processing workflow and a reads database, both named my_ilmn_dataset.

Example searching for "my_ilmn_" in the database names

A few important things to note here:

  1. The reads database has an icon that looks like a stack of disks next to it, whereas the workflow has an icon that looks like a cogwheel. Even though the names are the same, the icons set the two entities apart.
  2. As on the My GenomeQuest page, clicking on the name does the expected thing (opens the results for a workflow and opens the database for browsing). Clicking on the line, but not on the name, opens up details about the entity; this is one more way of telling which is the database and which is the workflow.
  3. You can extend the same operation to the more complicated case: if multiple reads databases are on the displayed list, you can select several of them and click the Browse Sequence Databases button.
    1. Make sure that you do not select a workflow (cogwheel icon) in this case. If you do, you will generate a warning.
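
The distinction between a reads database and its same-named workflow can be pictured as two entries that share a name but differ in type. The following is only a conceptual sketch in Python, not GenomeQuest code: the `Entry` class, the `kind` values, and both helper functions are hypothetical, and dropping the workflow from the selection after the warning is an assumption made for illustration.

```python
import warnings
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    """Hypothetical list entry; not a GenomeQuest data structure."""
    name: str
    kind: str  # "database" (stack-of-disks icon) or "workflow" (cogwheel icon)


def search_by_name(entries: List[Entry], text: str) -> List[Entry]:
    """Return every entry whose name contains the search text."""
    return [e for e in entries if text in e.name]


def databases_to_browse(selection: List[Entry]) -> List[Entry]:
    """Warn if a workflow was selected, then keep only the databases.

    Assumption: the workflow is simply left out of the browse set.
    """
    workflows = [e.name for e in selection if e.kind == "workflow"]
    if workflows:
        warnings.warn(f"workflows cannot be browsed: {workflows}")
    return [e for e in selection if e.kind == "database"]


entries = [
    Entry("my_ilmn_dataset", "workflow"),
    Entry("my_ilmn_dataset", "database"),
    Entry("other_run", "database"),
]

matches = search_by_name(entries, "my_ilmn_")   # finds both same-named entries
to_browse = databases_to_browse(matches)        # only the database remains
```

The name alone cannot tell the two entries apart; only the type (shown in the interface as the icon) does.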

Creating a Virtual Database of Reads

The procedure for selecting multiple reads databases is similar. If each database has a different name, one useful trick is to create a folder and move the reads databases into it. For example, suppose you need to select 5 reads databases named "lane_1", "lane_2", ... "lane_5" for running an RNA-Seq workflow. You might do this through the following steps:

  1. Create a folder called "Five lanes" as described here.
  2. Search for each of the reads processing workflows as described above. For example, search in turn for "lane_1", "lane_2", ... and "lane_5".
  3. After each search, select the database (stack icon as shown above) and move it into the Five lanes folder.
  4. Now the new folder Five lanes has the five reads databases you need. Open the folder to see them.
  5. Select all five of them by clicking the checkbox next to them and click the "Browse sequence databases" button as described above.
  6. This opens all the reads databases together. Go on to the next steps below; in this particular RNA-Seq example, you probably want to create a virtual database so that you can use the consolidated reads database in the RNA-Seq workflow.
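
Conceptually, the steps above combine several lane databases into one consolidated view. The sketch below models that idea in Python. It is an illustration only: how GenomeQuest actually stores or implements virtual databases is not described on this page, and the `ReadsDatabase` and `VirtualDatabase` classes, as well as the read identifiers and sequences, are hypothetical.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Iterator, List, Tuple

Read = Tuple[str, str]  # (read_id, sequence)


@dataclass
class ReadsDatabase:
    """Hypothetical stand-in for one reads database (for example, one lane)."""
    name: str
    reads: List[Read]


@dataclass
class VirtualDatabase:
    """Hypothetical view over several member databases, without copying reads."""
    name: str
    members: List[ReadsDatabase]

    def __iter__(self) -> Iterator[Read]:
        # Walk the member databases one after another.
        return chain.from_iterable(db.reads for db in self.members)

    def __len__(self) -> int:
        return sum(len(db.reads) for db in self.members)


# Five lanes with two made-up reads each.
lanes = [
    ReadsDatabase(f"lane_{i}", [(f"lane_{i}_read_{j}", "ACGT") for j in range(2)])
    for i in range(1, 6)
]

five_lanes = VirtualDatabase("Five lanes", lanes)
all_reads = list(five_lanes)
```

A downstream workflow can then treat `five_lanes` as a single dataset, which is the purpose of creating a virtual database before running RNA-Seq.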

Step 2: Applying Operations

Once you have opened sequence database(s) for browsing, you can apply several operations on the list of results.

  • Filter: Apply further filters to the returned sequences to reduce their number and increase their relevance. E.g. a search for human kinase might return sequences where the word human occurs only in the description of the sequence. You can exclude such sequences by requiring the match to occur in the Organism field, using the filter: Organism matches homo sapiens. For further details on filtering, please see the filtering widget help page.
  • Group: The returned sequences can be grouped by various attributes like organism and gene name. E.g. group the results of the human kinase search by gene name in order to see all entries belonging to the same gene together. Grouping is done through the Grouping Widget.
  • Sort: When viewing the table of results, you can sort the entries by the values in any column by clicking on its header.

NOTE: For very large sequence databases, the grouping and sorting functions are disabled because they would take an inordinate amount of time. However, if you first apply filters to reduce the dataset, these operations become available.
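
The difference between a keyword search and a field-restricted filter, followed by grouping and sorting, can be illustrated with a small Python sketch. This is not the GenomeQuest implementation or API: the record fields, the sample records, and all three helper functions are hypothetical, and the matching here is plain case-insensitive substring matching, which is an assumption.

```python
from collections import defaultdict

# Hypothetical annotation records; field names are illustrative only.
records = [
    {"id": "seq1", "organism": "Homo sapiens (human)", "gene": "MAPK1",
     "description": "mitogen-activated protein kinase 1"},
    {"id": "seq2", "organism": "Mus musculus", "gene": "Mapk1",
     "description": "kinase similar to human MAPK1"},
    {"id": "seq3", "organism": "Homo sapiens (human)", "gene": "MAPK1",
     "description": "MAPK1 kinase, transcript variant 2"},
    {"id": "seq4", "organism": "Homo sapiens (human)", "gene": "CDK2",
     "description": "cyclin dependent kinase 2"},
]


def keyword_search(recs, words):
    """Keep records in which every word occurs in some field."""
    hits = []
    for r in recs:
        text = " ".join(r.values()).lower()
        if all(w.lower() in text for w in words):
            hits.append(r)
    return hits


def field_filter(recs, field, value):
    """Keep records whose given field contains the value."""
    return [r for r in recs if value.lower() in r[field].lower()]


def group_by(recs, field):
    """Map each distinct field value to the ids of the records that have it."""
    groups = defaultdict(list)
    for r in recs:
        groups[r[field]].append(r["id"])
    return dict(groups)


# The keyword search also returns the mouse entry, because "human"
# appears in its description.
hits = keyword_search(records, ["human", "kinase"])

# Restricting the match to the organism field removes it.
filtered = field_filter(hits, "organism", "homo sapiens")

# Group the remaining entries by gene name, then sort by gene for display.
by_gene = group_by(filtered, "gene")
sorted_rows = sorted(filtered, key=lambda r: r["gene"])
```

Filtering first also shrinks the set that grouping and sorting have to process, which mirrors the advice in the note above.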

Step 3: Finish up

Most of the finishing-up operations are available through the top-level menu. For example, you can create the following:

  • Virtual Database: Save the results as a virtual database under the Result menu. Several workflows accept a virtual database as input. For example, you can combine several reads databases, each coming from a different lane, into a single database and then run, say, the RNA-Seq workflow on the combined dataset.
  • Report: Under the export menu, create a report of the results you are seeing in table (e.g. Excel) or document (e.g. Word) format.