Welcome to the GenomeQuest Documentation Wiki

IP Workflow

Use the IP search workflow to analyze your sequences. You can run Freedom-to-Operate searches or find out whether your sequences are publicly known or similar to known sequences.

Example use cases

  • Your scientists want to work on a new protein. You would like to do a Freedom-to-Operate search to be confident that they can keep working on this new protein.
  • You would like to patent a series of newly discovered bacterial genes and you would like to know whether anything similar to those sequences is publicly known in patent and public reference databases.
  • You believe a competitor could potentially infringe on one of your patents. You would like to monitor weekly if anything similar to your patented sequences is in our patent database.

Launching the IP workflow

There are two ways to reach the IP workflow launch page:

  1. Launch IP Search button: If this button is available on your My GenomeQuest landing page, just click it to launch the IP workflow.
  2. Through menu: On your My GenomeQuest page, click the Launch workflow button in the left panel and choose Sequence search -> IP.

Once on the IP launch page, you have to specify three things: (a) query sequence(s), (b) search strategy and (c) subject databases.

Query Sequence Input

Enter your query sequence(s) in the main window, after choosing whether your sequences are nucleotide or protein:

Inputsequence.png

Sequences must be in one-letter code and in one of the following formats:

Acceptable sequence formats

Raw format. You can input one sequence in raw format, i.e. just the sequence. All spaces, numbers and punctuation marks will be automatically removed. For instance:

tacgacgcagca agcagcactca acat agga 34
    atnagagataggnnatataggaggcccc 

will be transformed into:

tacgacgcagcaagcagcactcaacataggaatnagagataggnnatataggaggcccc
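
The clean-up described above can be sketched in a few lines of Python. This is only an illustration, assuming that every character other than a letter is discarded; the function name normalize_raw_sequence is hypothetical and not part of GenomeQuest.

```python
import re


def normalize_raw_sequence(text: str) -> str:
    """Drop spaces, digits, punctuation and line breaks (assumed clean-up rule)."""
    return re.sub(r"[^A-Za-z]", "", text)


raw = """tacgacgcagca agcagcactca acat agga 34
    atnagagataggnnatataggaggcccc """

# Prints the single-line sequence with spaces, digits and line breaks removed.
print(normalize_raw_sequence(raw))
```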

FASTA format. You can input one or more sequences in FASTA format. Again, all spaces and punctuation marks are ignored as long as they do not interfere with the FASTA format. Each record starts with a header line: a greater-than sign ('>') immediately followed by an identifier (no space after the '>', and only alphanumeric characters and underscores in the identifier). The lines that follow, up to the next line starting with '>', are taken as the sequence. A second record can then follow, again with a '>' sign, an identifier and a sequence:

>seq_1
gacatcacgacgcacgacctacac
acacggananannnaggagaatga

>seq_2
acagcagcgaccgacgaccagca
atcagcagcagcaccacactacgcagctacac
atcacgac
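
As a rough illustration of the rules above, the following Python sketch splits FASTA text into identifier/sequence pairs. It is not GenomeQuest's parser: the function name parse_fasta, the error handling and the decision to reject identifiers containing other characters are assumptions made for this example.

```python
import re


def parse_fasta(text: str) -> dict[str, str]:
    """Return {identifier: sequence} for FASTA text (illustrative sketch)."""
    records: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            current = line[1:]
            # Identifier: letters, digits and underscore only, no space after '>'.
            if not re.fullmatch(r"[A-Za-z0-9_]+", current):
                raise ValueError(f"invalid identifier: {current!r}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Same clean-up as for raw input: keep letters only.
            records[current] += re.sub(r"[^A-Za-z]", "", line)
    return records


example = """>seq_1
gacatcacgacgcacgacctacac
acacggananannnaggagaatga

>seq_2
acagcagcgaccgacgaccagca
atcagcagcagcaccacactacgcagctacac
atcacgac
"""

for identifier, sequence in parse_fasta(example).items():
    print(identifier, len(sequence))
```

Note that in this sketch a repeated identifier would silently overwrite the earlier record; a stricter version could raise an error instead.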


Uploaded queries

You can also use sequences that you have already uploaded as "annotated sequences".

Other Options in the Query area

Type of search: FTO or Patentability. These are the two default search types. Selecting either one does not prevent you from manually overriding the database selection later.

  1. Patent Databases only. This is the default option and selecting it searches patent databases.
  2. Patents and Public Reference Databases. Selecting this option searches patent and reference databases.

See section #Automatic selection for more details.

Result Name. Give the result a name. This is the name under which the result will be displayed on the My GenomeQuest page.

Compare to both nucleotide and protein databases. Check this box, if you want to compare your sequence(s) to both nucleotide and protein databases. By default (if this box is left unchecked), nucleotide query sequences are searched against nucleotide databases and protein queries are searched against protein databases.

When this box is checked and your query sequences are nucleotide, GenomeQuest translates them in all 6 frames and compares the translated sequences to protein databases. If your query sequences are protein, all sequences from the nucleotide databases are translated in all 6 frames prior to comparison.
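
To make "all 6 frames" concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It only illustrates the general technique and is not GenomeQuest's implementation; the function names are hypothetical, and translating incomplete or ambiguous codons to 'X' is an assumption of this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna: str) -> str:
    """Translate one reading frame; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> list[str]:
    """Return frames +1, +2, +3 followed by -1, -2, -3."""
    forward = dna.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]


print(six_frame_translation("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```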

IP Launch page

Search Strategy

Four different search strategies are available in GenomeQuest. The table below shows some typical use cases for each search strategy.

Algorithm decision table

Use case | Description | Search strategy
Homology search | Find sequences evolutionarily related to my query sequence(s) | BLAST
Classic patent search | Investigate claims on similar sequences where BLAST is specified as the method | BLAST
Defined patent search | Investigate claims on similar sequences specified in terms of percentage identity over that sequence | Percent Identity - GenePAST
Small sequence search | Investigate matches to short query sequences like primers, probes and peptides | Percent Identity - GenePAST
Primer search | Investigate all possible primers that can be made out of a longer sequence | Fragment Search
SNP search | Search with small sequences that have a variable position | Motif Search
Protein domain search | Search with a protein sequence that has a known domain structure, e.g. an antibody with small variable regions and long fixed spacers | Motif Search

GenePAST

GenomeQuest's GenePAST algorithm is the base choice for most patent-related sequence searches. The GenePAST "percent identity" algorithm finds the best fit between the query sequence and the subject sequence, and expresses the alignment as an exact percentage. Unlike BLAST, GenePAST makes no alignment scoring adjustments based on considerations of biological relevance between query and subject sequences. Use GenePAST when relevant patent considerations are, or will be, expressed in terms of sequence identity. GenePAST is particularly useful when your query sequence is short or when you are looking for hits based on sequence identity over the entire length of the sequence.

The alignment that GenePAST finds is guaranteed to be the longest alignment between two sequences that meets or exceeds a given percent identity threshold. This is useful when investigating claims on a sequence based on percent identity over its length. Where BLAST provides a shorter alignment of high homology, GenePAST provides a longer alignment of perhaps lower identity (but still exceeding the desired minimum threshold).

Unlike traditional approaches, where the percent identity is computed only relative to the alignment, GenePAST lets you specify percent identities relative to the sequence itself. For example, consider the following alignment:

Q:  10 ATG-TATA
       ||| ||||
S: 756 ATGGTATA

This alignment has 87.5% identity on the alignment (7 matches over an alignment of length 8) and 100% identity on the query (7 matches over 7 aligned query residues).
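
The arithmetic of this example can be reproduced with a short Python sketch. It assumes, as in the example, that each percentage is the number of matching positions divided by the alignment length or by the number of residues of the given sequence inside the alignment; the subject value is added here by symmetry and is an assumption, as is the function name. GenePAST's exact definitions may differ.

```python
def percent_identities(query_aln: str, subject_aln: str, gap: str = "-") -> dict[str, float]:
    """Percent identity relative to alignment, query and subject (illustrative)."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    # A column counts as a match only if both residues are identical and not gaps.
    matches = sum(q == s and q != gap for q, s in zip(query_aln, subject_aln))
    query_residues = sum(q != gap for q in query_aln)
    subject_residues = sum(s != gap for s in subject_aln)
    return {
        "alignment": 100.0 * matches / len(query_aln),
        "query": 100.0 * matches / query_residues,
        "subject": 100.0 * matches / subject_residues,
    }


print(percent_identities("ATG-TATA", "ATGGTATA"))
# {'alignment': 87.5, 'query': 100.0, 'subject': 87.5}
```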

Percentage Identities

GenePAST makes extensive use of percentage identities. There are three kinds of percentage identity, and it is critical to understand what each one means. The illustration below shows their meanings:

PercentID.png

BLAST

BLAST is GenomeQuest's implementation of the NCBI BLAST2 algorithm and finds the most relevant sequences in terms of biological similarity. Use BLAST when you are looking for the biological relationship between a query sequence and sequences in the subject database you are searching. BLAST helps scientists develop hypotheses about gene function by scoring alignments with respect to previously discovered relationships between homologous sequences.

In general, BLAST should not be used for searches with short query sequences (fewer than 20 nucleotides). Even with substantial fine-tuning of the search parameters, a BLAST search with a short query sequence can miss results because of its heuristic approach.

The sequence search has the following default BLAST parameters for nucleotide searches:

  • Word size : 11
  • E‐value cutoff : 10
  • Scoring Matrix : NUC.3.1 (match = 1, mismatch = -3)
  • Gap Opening : 5
  • Gap Extension : 2

and the following parameters for protein searches:

  • Word size : 3
  • E‐value cutoff : 10
  • Scoring Matrix : BLOSUM62
  • Gap Opening : 11
  • Gap extension : 1

Fragment Search

The Fragment Search strategy uses the gapless BLAST v1 algorithm over a "sliding window" and calculates percent identity as it slides the window along the query sequence(s). Use Fragment Search when you are evaluating claims containing language like "and any fragment 15-25 nucleotides thereof." Set the Fragment Search percent identity parameter to 100% to find the longest perfect matches between your query sequence and subject sequences. (By contrast, BLAST returns top hits containing gaps and extensions, and GenePAST returns alignments over the entire query sequence.)

Note: The fragment search algorithm does not introduce gaps in constructing the alignment between the query and subject sequences.

For example, with Fragment Search it is possible to find all 30 bp fragments of a particular sequence in the entire public domain, provided they are 95% identical or more to your query. Fragment Search finds all regions that are at least 30 bp long with at least 95% identity, and extends each region for as long as this criterion is fulfilled. The end result is a fragment of length L, where L is at least 30, with at least 95% identity to the query. Another example is to specify 100% identity over a given length L. This finds the longest subsequences that match perfectly over a window of at least length L. Thus, Fragment Search can be viewed as a "controlled BLAST" where the percent identity and the length of the local alignment are specified as input.
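
The sliding-window idea can be illustrated with a brute-force Python sketch that reports every ungapped window of a given length meeting an identity threshold. This is a simplified illustration, not the gapless BLAST algorithm that GenomeQuest uses, and it omits the extension step described above; the function name and return format are hypothetical.

```python
def window_hits(query: str, subject: str, window: int, min_identity: float):
    """Return (query_start, subject_start, percent_identity) for each ungapped
    window of the given length whose identity meets the threshold."""
    hits = []
    for q in range(len(query) - window + 1):
        for s in range(len(subject) - window + 1):
            # No gaps: position i of the query window is compared only with
            # position i of the subject window.
            matches = sum(
                a == b
                for a, b in zip(query[q:q + window], subject[s:s + window])
            )
            identity = 100.0 * matches / window
            if identity >= min_identity:
                hits.append((q, s, identity))
    return hits


# Perfect 8-residue match starting at subject position 2 (0-based).
print(window_hits("ACGTTGCA", "GGACGTTGCAGG", window=8, min_identity=100.0))
# One mismatch inside the window still passes an 87.5% threshold.
print(window_hits("ACGTTGCA", "GGACGTAGCAGG", window=8, min_identity=87.5))
```

A production search would index the subject database rather than compare every pair of positions; the nested loops here are only meant to show what "percent identity over a window, without gaps" means.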

Motif search

The Motif search lets you search for nucleotide motifs in the nucleotide databases and for protein motifs in the protein databases. Note that no translation is done: a nucleotide motif cannot be searched in protein databases, and a protein motif cannot be searched in nucleotide databases.

Please note that GenomeQuest Motif search is IUPAC compliant. This means that, for DNA, an M is replaced by [AC], i.e. an A or a C. Similarly, a T is always automatically equivalent to a U and vice versa: T = [TU] = U.

Examples of GenomeQuest Motif Search Patterns

Search type | GenomeQuest motif | Matches
SNP search | AGCAGGGG[AC]CGCGCAT | AGCAGGGGACGCGCAT or AGCAGGGGCCGCGCAT
Repeat search | (ATG){5,} | ATGATGATGATGATG and more
Repeat search | ATTA{5,15}TT | ATTAAAAATT up until ATTAAAAAAAAAAAAAAATT
Domain search | AQV[LE]PRSIG | AQVLPRSIG or AQVEPRSIG
Advanced domain search | C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H | For instance CXXCXXXLXXXXXXXXHXXXH, where X can be any residue.
Antibody search | VBVV.*VDDEEEF.*BVBVVV | The three Complementarity Determining Regions (CDRs) VBVV, VDDEEEF and BVBVVV interspersed by any other amino acid sequence.


Syntax Description

Syntax       Description
.            The dot character represents any residue
T{n,m}       Stretch of T residues with length at least n and at most m residues
T{n}         Stretch of T residues of length exactly n residues
T{n,}        Stretch of T residues of length at least n residues (no upper limit)
T*           Any length stretch of T residues or no T at all. Shortcut for T{0,}
T+           Any length stretch of T residues. Shortcut for T{1,}
T?           A single T residue or no T at all. Shortcut for T{0,1}
[TG]         Either a T or G residue
[^TG]        Anything but a T or G residue
(ATTG|GT)    Either ATTG or GT
^ATCG        The sequence starts with ATCG
ATCG$        The sequence ends with ATCG
^ATCG$       The sequence is exactly ATCG
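
The syntax above closely resembles standard regular expressions, so its behaviour can be explored with Python's re module. The sketch below is an approximation for nucleotide motifs only: it expands IUPAC ambiguity codes into character classes and treats T and U as equivalent, as described earlier. The names IUPAC_DNA, nucleotide_motif_to_regex and find_motif are hypothetical, and GenomeQuest's own motif engine may differ in details.

```python
import re

# IUPAC nucleotide codes, with U added wherever T appears (T = [TU] = U).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "TU", "U": "TU",
    "R": "AG", "Y": "CTU", "S": "CG", "W": "ATU", "K": "GTU", "M": "AC",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG", "N": "ACGTU",
}


def nucleotide_motif_to_regex(motif: str) -> str:
    """Expand IUPAC codes in a nucleotide motif into a Python regex."""
    out = []
    in_class = False  # True while between '[' and ']'
    for ch in motif.upper():
        if ch == "[":
            in_class = True
            out.append(ch)
        elif ch == "]":
            in_class = False
            out.append(ch)
        elif ch in IUPAC_DNA:
            letters = IUPAC_DNA[ch]
            if in_class or len(letters) == 1:
                out.append(letters)  # already inside a class, or unambiguous
            else:
                out.append("[" + letters + "]")
        else:
            out.append(ch)  # quantifiers, digits, '.', '^', '$', '(', ')', '|'
    return "".join(out)


def find_motif(motif: str, sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each occurrence of the motif."""
    pattern = re.compile(nucleotide_motif_to_regex(motif))
    return [(m.start(), m.group()) for m in pattern.finditer(sequence.upper())]


print(nucleotide_motif_to_regex("AGCAGGGG[AC]CGCGCAT"))  # AGCAGGGG[AC]CGCGCA[TU]
print(find_motif("ATG", "ccaugcc"))  # [(2, 'AUG')]
```

Protein motifs would need a separate mapping, because letters such as B have a different meaning for amino acids than for nucleotides.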

Subject database selection

The GenomeQuest IP workflow provides a subset of the databases available in GenomeQuest that is meaningful for IP searches. If you need other databases added, do not hesitate to let us know at support@genomequest.com.

Apply filters to subject databases

It is now possible to apply filters to the subject databases before the comparisons are made. This allows for more accurate results in some cases, such as:

  • Only compare to database entries with a publication date BEFORE a given date
  • Remove mega-patents by only comparing to patents with fewer than a certain number of sequences. This greatly improves speed and probably also the relevance of the results.
  • Only compare to granted and pending patents (You need access to Gold Plus or Premium databases)
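
The effect of these filters can be pictured as a pre-selection step applied to database entries before any sequence comparison. The Python sketch below is purely illustrative: the PatentEntry fields, the status values and the apply_filters function are hypothetical and do not reflect GenomeQuest's actual data model or filter names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PatentEntry:
    # Hypothetical record layout, for illustration only.
    patent_id: str
    publication_date: date
    sequence_count: int
    status: str  # assumed values: "granted", "pending", "expired"


def apply_filters(entries, published_before=None, max_sequences=None,
                  allowed_statuses=None):
    """Keep only the entries that pass every filter that was supplied."""
    kept = []
    for entry in entries:
        if published_before is not None and entry.publication_date >= published_before:
            continue  # published on or after the cut-off date
        if max_sequences is not None and entry.sequence_count >= max_sequences:
            continue  # mega-patent: too many sequences
        if allowed_statuses is not None and entry.status not in allowed_statuses:
            continue
        kept.append(entry)
    return kept


entries = [
    PatentEntry("P1", date(2005, 3, 1), 12, "granted"),
    PatentEntry("P2", date(2012, 7, 15), 250000, "pending"),
    PatentEntry("P3", date(2009, 1, 20), 40, "expired"),
]
selected = apply_filters(entries, published_before=date(2010, 1, 1),
                         max_sequences=1000,
                         allowed_statuses={"granted", "pending"})
print([entry.patent_id for entry in selected])  # ['P1']
```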

The databases

Some of the databases that we do not make available are:

Databank | Reason for exclusion
RefSeq Genomic | It is generally unnecessary to search genomes for IP searches, especially large plant and mammalian genomes.
GQGene | It is composed of transcripts (already available in GenBank) and parts of genomes.
Drugbank | It is composed of sequences available elsewhere.

Automatic selection

When you choose Patent Databases only, only patent databases are selected. When you select Patents and Public Reference Databases, the following databanks are selected, because we believe that these reference databases represent the best set of non-genomic databanks with minimal redundancy:

Patent databases (nucleotide and protein databanks):

  • GQPAT
  • Derwent Geneseq (paying option)

Reference databases, nucleotide databanks:

  • Protein Data Bank
  • GenBank Core Divisions:
      • Unannotated division
      • Other Mammalian division
      • Invertebrate division
      • Bacterial division
      • Bacteriophage division
      • Rodent division
      • Primate division
      • High-throughput cDNA division
      • GB SET
      • Synthetic division
      • Plant / fungal / algal division
      • Environmental Samples
      • Other Vertebrate division
      • Viral division
  • GenBank Expressed Sequence Tags division

Reference databases, protein databanks:

  • Protein Data Bank
  • GenPept

Manual selection

You can also manually select other databanks, or remove selected databanks, by clicking the checkboxes next to them.

Virtual databanks

It is also possible to use a virtual databank as the subject database. For technical reasons, a virtual databank cannot be mixed with a normal databank in the same search. You create a virtual databank after running a keyword search.

Viewing results

The picture of the summary report below gives some overall details about how the search was done and what results were found. The most important link is at the top of the page "900 results"; clicking this link leads to a Sequence Result Browser with these results.

Warning about maximum number of results reached

A recently added feature shows a warning on the report page when the maximum number of results has been reached. The warning is active for all algorithms except BLAST, since BLAST often reaches the limit. It helps you decide whether you need to increase the maximum number of results computed.

MaxReached.png

Redo and alerts

In the result report, there is a redo button on the top-right corner. Clicking this button allows one to go back to the launch page. The launch page is now filled in exactly the way the original search was done. You can now change any values except the queries since the queries are what tie all redos together. Launching a redo will produce one added element in the report. It will show links to the subject sequences that were not found by the previous search.
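
Conceptually, the added report element is the difference between the subject sequences found by the redo and those found by the previous search. A minimal Python sketch of that idea follows, assuming each hit is identified by a subject sequence identifier; the identifiers and the function name are made up for illustration.

```python
def new_subject_hits(previous_ids, redo_ids):
    """Return the subject identifiers found by the redo but not by the
    previous search, in the order the redo reported them, without duplicates."""
    previous = set(previous_ids)
    # dict.fromkeys removes duplicates while preserving order.
    return [sid for sid in dict.fromkeys(redo_ids) if sid not in previous]


previous_run = ["SUBJ_001", "SUBJ_002", "SUBJ_003"]
redo_run = ["SUBJ_002", "SUBJ_004", "SUBJ_001", "SUBJ_005", "SUBJ_004"]
print(new_subject_hits(previous_run, redo_run))  # ['SUBJ_004', 'SUBJ_005']
```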

Similarly, setting an alert on a result will automatically launch a redo when one of the subject databases is updated. To set up an alert:

  1. Go to your My GenomeQuest page and view the listing of results.
  2. Click on the line of the IP search on which you want to set an alert. (Don't click on the name of the result, as this opens the result itself. Instead, click on an area of the entry outside the name.) This opens the details about the result.
  3. Click on the sharing tab.
  4. Click the check box in this tab.

The operations are marked with arrows in the picture below.

Alerting operation

An alert runs every time one of your chosen subject databases is updated and it triggers an email to you with a summary of the new findings and a link to the results.

Picture of Report page

Top-linked.png Venn.png Stats.png Drugbank.png Audit-trail.png
