Welcome to the GenomeQuest Documentation Wiki

Sequence Search Launch Page

Sequence searches for patent landscape analysis are best done through the IP Search page. GenomeQuest also supports sequence searches for more biological purposes through its Blast workflow. This page describes the Blast workflow.

Use Blast workflow or IP workflow? There are many differences between the two sequence search workflows in GenomeQuest. As a rule of thumb, if you are looking for the IP landscape around your query sequences (which patents match them), use the IP workflow. If you are analyzing NGS reads, use the Blast workflow exclusively.

Launch Page

You can reach the launch page of the Blast workflow in two ways:

  1. Launch Blast Search button: If this button is available on your My GenomeQuest landing page, just click it to launch the Blast workflow.
  2. Through menu: On your My GenomeQuest page, click the Launch workflow button in the left panel and choose Sequence search -> Blast search.

Both actions lead you to the Blast workflow launch page. On the launch page, specify the following to initiate a sequence search: (a) query sequence(s), (b) subject database(s), (c) search strategy and (d) other options.

Specifying Query Sequence(s)

Choose the type of query sequences first. All query sequences must be of the same type: protein or nucleotide. Next, specify the query sequences in one of two ways.

Paste

Paste in your query sequences from a text file in FASTA or EMBL format.

  1. If you have just one sequence, you can simply paste it in raw format; any spaces and numbers are ignored in the sequence.
  2. Protein sequences must use the single-letter amino acid code. For example, enter the protein sequence Met Asp Leu Ser Ala Leu as MDLSAL.
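The two rules above can be applied before pasting. The following is a minimal sketch, not part of GenomeQuest: the helper names `strip_raw` and `three_to_one` are hypothetical, and the lookup table is the standard three-letter to single-letter amino acid code.

```python
# Standard three-letter to single-letter amino acid code.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def strip_raw(text):
    """Drop spaces, line breaks and digits from a raw pasted sequence.

    Hypothetical helper: mirrors the rule that spaces and numbers are ignored.
    """
    return "".join(ch for ch in text if ch.isalpha())


def three_to_one(text):
    """Convert whitespace-separated three-letter residues to single letters.

    Hypothetical helper: raises KeyError on an unknown residue name.
    """
    return "".join(THREE_TO_ONE[token.capitalize()] for token in text.split())


print(three_to_one("Met Asp Leu Ser Ala Leu"))
print(strip_raw("1 atgc gtta 10\n11 ccgg"))
```

The first call reproduces the MDLSAL example from the text; the second shows position numbers and spaces being dropped from a raw nucleotide sequence.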

My Databases

Uploaded sequence databases (including reads databases) and saved virtual databases are available in the query section as your databases. To get a drop-down list of these databases, first click the "choose one of my databases" link.

Get the list of My DBs by clicking the choose one of my databases link

Then choose your query database from the drop down list of your databases.

Specifying Subject Database(s)

Choose the subject database(s) you want to search by clicking the check-boxes to their left. Databases are classified into two types: Reference databases (public and collated databases mentioned in the content section) and Your databases (your uploaded and virtual databases). Choose either some combination of Reference databases or one of Your databases.

First choose the type of database you wish to search: nucleotide or protein. Unlike in the IP workflow, there is no option in the Blast workflow to search nucleotide and protein databases together. You must choose a single database type to search.

Next, click the appropriate link to search a Reference database or Your database.

Reference Databases

Select the reference database(s) you wish to search. When choosing from reference databases, you can select multiple databases to search: simply click the checkboxes next to each of them.

Reference databases belong to the GenomeQuest administrator and are shared with all users. These databases are organized into a tree view. The tree view might change or evolve over time.

Your Databases

Just as with query sequences, you can choose one of your own databases as subject. Thus Blast allows you to run a search with one of your own query databases against another one of your own databases.

These databases belong to you, or they are shared with you by a colleague who owns them. They might be databases you uploaded or databases you saved as virtual databases.

Specifying Search Strategy

When comparing sequences, looking for "related" sequences has a specific meaning which is strategy dependent. The Sequence search page allows you to choose from among the following algorithm strategies:

Strategy | Description | When to use
Blast | Heuristics-based search for biologically related sequences. | When looking for "local" alignments which are shorter than the length of the sequence. When "related sequence" is defined by evolutionary distance.
GenePAST - by percentage identity | Get related sequences specified by percentage identity over the entire length of the query or subject sequence. | When you are dealing with short sequences. When investigating patent claims of the kind "80% identity to SEQ ID NO: 3".
GenePAST - by number of errors | Get related sequences specified by the number of errors (mismatches or gaps) between the query and subject sequence. | When you want to find all sequences with at most X mismatches from your query.
Fragment Search | Search for highly similar fragments. | When you want to use a non-heuristic method to find "local" matches shorter than your query.
Mega Search | Search with large numbers of queries (more than, say, 500). | When you have large query sets, this is the algorithm to use since all other strategies are too slow. If your number of queries is greater than 1000, GenomeQuest will force you into this choice.


The parameters for each strategy are described below. In most cases it is safe to leave them at their default values.

Blast Parameters

  1. Limit output to: Number of results to return (Default 500)
  2. Sort order: the value by which results are ranked to select the top X (500) results (Default: E-value)
  3. Word size to use in the initial step of Blast run.
    1. Default: 11 for Nucleotide searches and 3 for Protein searches.
    2. The practical consequence is that no nucleotide "match" that does not have at least 11 consecutive nucleotides (a word 11-long) in common with the query will be considered a match.
  4. E-val cutoff. Hits with an expectation value worse than this cutoff will not be returned.
    1. E-value of an alignment is a statistical measure of its strength. It is defined as the number of such alignments expected to be found in the database by random chance.
    2. Thus, a strong alignment will have a low E-value since few such alignments are expected in the database by chance.
    3. A weak alignment will have a high E-value (several such alignments are expected by pure random chance).
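The practical consequence of the word size can be illustrated with a short sketch. This is not GenomeQuest's or Blast's implementation: the function name `shares_word` is hypothetical, and it only checks for exact shared words, which matches the nucleotide case described above (protein Blast seeding is more permissive than exact matching).

```python
def shares_word(query, subject, word_size=11):
    """Return True if query and subject share at least one exact word.

    Hypothetical helper: a nucleotide hit can only be found if some run of
    `word_size` consecutive letters is identical in both sequences.
    """
    if len(query) < word_size or len(subject) < word_size:
        return False
    # Index every word of the query, then scan the subject for any of them.
    words = {query[i:i + word_size] for i in range(len(query) - word_size + 1)}
    return any(
        subject[j:j + word_size] in words
        for j in range(len(subject) - word_size + 1)
    )


# 11 consecutive nucleotides in common: eligible to be reported.
print(shares_word("ACGTACGTACGTTT", "GGGACGTACGTACGCC"))
# Only 10 consecutive nucleotides in common: never considered a match.
print(shares_word("ACGTACGTACTTTTT", "GGGACGTACGTACGGGGG"))
```

Lowering the word size makes the search more sensitive to short or divergent matches, at the cost of more candidate positions to examine.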

Parameters of Blast searching

GenePAST Parameters

There are two modes for GenePAST: (a) Percent Identity mode and (b) Number of errors mode. Toggle between these two modes by clicking the appropriate link in the Strategy area.


Percent Identity Mode

  1. Limit output to: Number of results to return (Default 500)
  2. Percentage Identity: specify the number (Default: 80)
  3. Over what: The percent identity can be measured over the query or the subject. Example: Imagine that a 50 NT query matches perfectly to the middle of a 200 NT subject sequence. The percent identity over the query is 100, and that over the subject is 50/200, i.e. 25 percent. The options here are:
    1. my query: the length of your query
    2. any subject: the length of the subject sequence found in the database or
    3. query or subject: compute the percent identity over the shorter of the two sequences.
      1. If the query is shorter - it is fit into the middle of the subject and the percent identity is measured over the length of the query.
      2. Conversely, if the subject is shorter, it is fit into the middle of the query and the percent identity is measured over the length of the subject.
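The three "Over what" options differ only in the length used as the denominator. A minimal sketch of that arithmetic follows; the function name `percent_identity` and its arguments are hypothetical, and it assumes the number of identical positions is already known (how GenePAST aligns the sequences and counts gaps is not covered here).

```python
def percent_identity(matches, query_len, subject_len, over="query or subject"):
    """Percent identity for a given number of identical positions.

    Hypothetical helper: `over` uses the option labels from the launch page.
    """
    if over == "my query":
        denominator = query_len
    elif over == "any subject":
        denominator = subject_len
    elif over == "query or subject":
        # Measured over the shorter of the two sequences.
        denominator = min(query_len, subject_len)
    else:
        raise ValueError(f"unknown option: {over!r}")
    return 100.0 * matches / denominator


# The example from the text: a 50 NT query matching perfectly
# inside a 200 NT subject.
for option in ("my query", "any subject", "query or subject"):
    print(option, percent_identity(50, 50, 200, over=option))
```

With the default cutoff of 80, this hit would be reported under "my query" and "query or subject" but not under "any subject".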

Percent Identity mode of GenePAST

Number of errors mode

  1. Limit output to: Number of results to return (Default 500)
  2. Max number of errors: specify how many errors you are willing to tolerate between the query and the subject.
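One common way to count errors (mismatches or gaps) between two sequences is the edit distance. The sketch below uses it purely as an illustration, assuming each mismatch and each gap position costs one error; the names `count_errors` and `within_error_limit` are hypothetical, and GenePAST's own counting may differ in detail.

```python
def count_errors(query, subject):
    """Edit distance: minimum number of mismatches and gaps between two sequences.

    Hypothetical helper using the classic dynamic-programming recurrence,
    keeping only the previous row to save memory.
    """
    prev = list(range(len(subject) + 1))
    for i, q in enumerate(query, start=1):
        curr = [i]
        for j, s in enumerate(subject, start=1):
            cost = 0 if q == s else 1
            curr.append(min(
                prev[j] + 1,         # gap in the subject
                curr[j - 1] + 1,     # gap in the query
                prev[j - 1] + cost,  # match or mismatch
            ))
        prev = curr
    return prev[-1]


def within_error_limit(query, subject, max_errors):
    """True if the pair would pass a 'Max number of errors' cutoff."""
    return count_errors(query, subject) <= max_errors


print(count_errors("ACGTACGT", "ACGTTCGT"))   # one mismatch
print(count_errors("ACGTACGT", "ACGTCGT"))    # one gap
print(within_error_limit("ACGTACGT", "ACGAACGA", 1))
```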

Number of errors mode of GenePAST

Fragment Search

Use this strategy if you wish to find local matches from a fragment of your query to a fragment of the subject sequence. Parameters available here are:

  1. Limit output to: Number of results to return (Default 500)
  2. Sort order: the order in which results are ranked to determine the "top 500".
  3. Window size: minimum length of fragmentary match considered.
  4. Percent ID: what should be the percentage identity within the window in order for the match to be reported.
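How the window size and percent ID interact can be sketched with an exhaustive, ungapped comparison. This is an illustration only, under the assumption that a match means some window-long fragment of the query and some window-long fragment of the subject reach the required identity; the function name `has_fragment_match` is hypothetical and this is not the algorithm GenomeQuest uses.

```python
def has_fragment_match(query, subject, window=10, min_percent_id=90.0):
    """True if any window-long fragments of query and subject are similar enough.

    Hypothetical helper: compares every pair of fragments position by
    position (no gaps), so it is exhaustive rather than heuristic, and slow.
    """
    for i in range(len(query) - window + 1):
        q_frag = query[i:i + window]
        for j in range(len(subject) - window + 1):
            s_frag = subject[j:j + window]
            identical = sum(a == b for a, b in zip(q_frag, s_frag))
            if 100.0 * identical / window >= min_percent_id:
                return True
    return False


query = "TTTTACGTACGTAAGGGG"
subject = "CCCCACGTACGTACCCCC"
# A 10-long window with 9 identical positions passes a 90 percent cutoff...
print(has_fragment_match(query, subject, window=10, min_percent_id=90.0))
# ...but not a 100 percent cutoff.
print(has_fragment_match(query, subject, window=10, min_percent_id=100.0))
```

Sequences shorter than the window can never match, which is why the window size acts as the minimum length of a reported fragment.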

Fragment search parameters

MegaSearch

This strategy is useful for rapid searching of large numbers of query sequences against large subject databases. Such database vs database comparisons are very compute intensive, and Mega Search is designed to handle them. The parameters for this strategy depend on whether your query sequences are long or short. Please choose the appropriate mode for your query sequences; this is one place where relying on default values can lead you astray.

Short query sequences

When sequences are short (all are shorter than 120 nucleotides), it makes better sense to keep track of the number of errors rather than the percent identity. So, the parameters available here are:

  1. Max number of errors allowed: Only matches with at most this many errors are returned.
  2. Max number of results: this many top hits are returned.

Mega search with short sequences

Long query sequences

These include contigs, long-read NGS sequences, full-length mRNA sequences and others. Specifically, when one or more of your sequences is 120 nucleotides or longer, you should use the long mode. The parameters here are:

  1. Score cutoff: hits with Blast score lower than this cutoff will not be returned
  2. Max number of results: this many top hits will be returned (default: 5) for each query. Mega search is typically run with very large numbers of query sequences. So, it makes sense to keep the number of results per query to be relatively small.
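The choice between the short and long modes follows directly from the query lengths. A minimal sketch, assuming the 120-nucleotide boundary described above; the function name `mega_search_mode` and the constant `SHORT_LIMIT` are hypothetical and not part of GenomeQuest.

```python
SHORT_LIMIT = 120  # nucleotides; boundary between short and long mode


def mega_search_mode(query_lengths):
    """Pick the Mega Search mode for a set of query sequence lengths.

    Hypothetical helper: short mode only if every query is shorter than
    SHORT_LIMIT; a single longer query means long mode.
    """
    if all(length < SHORT_LIMIT for length in query_lengths):
        return "short"
    return "long"


print(mega_search_mode([36, 50, 100]))   # typical short reads
print(mega_search_mode([36, 50, 1500]))  # one contig among short reads
```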

Mega search with long sequences

Other Options

  1. Send E-mail on completion?
    By default, your searches complete quietly and their results are added to your My GenomeQuest page. If you check this box, an email will be sent upon completion.
  2. Email address.
    This defaults to your email address on record at GenomeQuest but you can specify where to send the result of this search, if you wish.
  3. Result Name.
    Specify a name for the result here. This is the name by which the result will be displayed on your My GenomeQuest page.

Limits on Sequence Search

The Sequence search page allows you to run very high throughput sequence comparisons, subject to the following limits:

If you are a paying user:

  1. you can run searches using the Mega Search algorithm without any limits on query or subject size.
  2. When using a regular algorithm (Blast, GenePAST or Fragment search),
    1. your query database can have a total of up to one million characters.
    2. the longest sequence in the subject database can be up to 10 million chars long.
  3. When you input too large a query set or choose a subject database whose longest sequence is longer than 10 million characters, the strategy will be automatically set to Mega Search.
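The limits for paying users amount to a simple rule for which strategy actually runs. The sketch below restates that rule; the function name `effective_strategy` and the constant names are hypothetical, and the numbers are the ones given in the list above.

```python
MAX_QUERY_CHARS = 1_000_000         # total characters in the query database
MAX_SUBJECT_SEQ_CHARS = 10_000_000  # longest single sequence in the subject


def effective_strategy(requested, total_query_chars, longest_subject_chars):
    """Return the strategy that will run for a paying user.

    Hypothetical helper: regular strategies fall back to Mega Search when
    either limit is exceeded; Mega Search itself has no size limits.
    """
    if requested == "Mega Search":
        return requested
    if (total_query_chars > MAX_QUERY_CHARS
            or longest_subject_chars > MAX_SUBJECT_SEQ_CHARS):
        return "Mega Search"
    return requested


print(effective_strategy("Blast", 250_000, 3_000_000))
print(effective_strategy("GenePAST", 2_500_000, 3_000_000))
```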

If you are a Trial user, stricter limits apply.

Results of Sequence Search: Results Browser

Results of a sequence search are displayed in the Sequence Results Browser. On this browser, you can:

  • Filter: Apply filters to reduce the number of results while increasing their relevance. E.g. to view only human sequences in your results, apply the filter: Organism matches homo sapiens. For further details on filtering, please see the filtering widget help page.
  • Group: The returned sequences can be grouped by various attributes like (for biologists) organism, gene name or (for patent searches) patent number, patent family. Grouping is done through the Grouping Widget.
  • Sort. When viewing the table of results, you can sort the entries by values in any of the columns by clicking on their header.
  • Using the top level menu on the browser you can
    • Save. You can save the "results" as a Virtual Database.
    • Report. Create a report in table (e.g. Excel) or document (e.g. Word) format of the results you are seeing.
    • Launch Applications. Launch 3rd party applications on specified sequences.

Differences between Blast and IP workflows

  1. The number of errors mode of GenePAST is not available in the IP workflow.
  2. My Databases are not available as query in the IP workflow.
  3. Motif search is not available as a strategy in the Blast workflow.
  4. Mega search is not available in the IP workflow.