Welcome to the GenomeQuest Documentation Wiki
IP Workflow
Use the IP search workflow to analyze your sequences. You can run Freedom-to-Operate searches or find out whether your sequences are publicly known or similar to known sequences.
Example use cases
- Your scientists want to work on a new protein. You would like to do a Freedom-to-Operate search to be confident that they can keep working on this new protein.
- You would like to patent a series of newly discovered bacterial genes and you would like to know whether anything similar to those sequences is publicly known in patent and public reference databases.
- You believe a competitor could potentially infringe on one of your patents. You would like to monitor weekly if anything similar to your patented sequences is in our patent database.
Launching the IP workflow
There are two ways to reach the IP workflow launch page:
- Launch IP Search button: If this button is available on your My GenomeQuest landing page, just click it to launch the IP workflow.
- Through menu: On your My GenomeQuest page, click the Launch workflow button in the left panel and choose Sequence search -> IP.
Once on the IP launch page, you have to specify three things: (a) query sequence(s), (b) search strategy and (c) subject databases.
Query Sequence Input
Enter your query sequence(s) in the main window after choosing whether they are nucleotide or protein sequences.
Sequences must be in one-letter code and in one of the following formats:
Acceptable sequence formats
Raw format. You can input one sequence in raw format, i.e. just the sequence. All spaces, numbers and punctuation marks will be automatically removed. For instance:
    tacgacgcagca agcagcactca acat agga 34
    atnagagataggnnatataggaggcccc
will be transformed into
    tacgacgcagcaagcagcactcaacataggaatnagagataggnnatataggaggcccc
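The clean-up described for raw input can be approximated as follows. This is only a minimal sketch of the stated rule (drop spaces, numbers and punctuation). The function name `clean_raw_sequence` is hypothetical, and the way GenomeQuest actually normalizes input is not documented here.

```python
import re

def clean_raw_sequence(text: str) -> str:
    """Keep only letters; drop spaces, digits, punctuation and line breaks.

    Assumption: any character that is not a letter counts as removable.
    """
    return re.sub(r"[^A-Za-z]", "", text)

raw_input_text = "tacgacgcagca agcagcactca acat agga 34\natnagagataggnnatataggaggcccc"
print(clean_raw_sequence(raw_input_text))
```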
Fasta format. You can input one or more sequences in fasta format. Again, all spaces and punctuation marks are ignored as long as they do not interfere with the fasta format. A basic fasta record starts with a greater-than sign ('>') immediately followed by an identifier (no space after '>', and only alphanumeric characters and underscores in the identifier). The following lines, up to the next line that starts with '>', are taken as the sequence. A second record can then follow, again with a '>' sign, an identifier and a sequence:
    >seq_1
    gacatcacgacgcacgacctacac
    acacggananannnaggagaatga
    >seq_2
    acagcagcgaccgacgaccagca
    atcagcagcagcaccacactacgcagctacac
    atcacgac
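To check a fasta query before pasting it in, a small parser along these lines may help. It is a sketch that follows the rules stated above (identifier right after '>', only alphanumeric characters and underscores, sequence lines concatenated). The name `parse_fasta` and the decision to raise `ValueError` on bad input are assumptions for illustration, not GenomeQuest behaviour.

```python
import re

def parse_fasta(text: str) -> dict:
    """Return {identifier: sequence} for a basic multi-record fasta string."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            # Identifier rule from the text: alphanumerics and underscore only.
            if not re.fullmatch(r"[A-Za-z0-9_]+", current):
                raise ValueError(f"invalid identifier: {current!r}")
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Spaces, digits and punctuation inside sequence lines are dropped.
            records[current].append(re.sub(r"[^A-Za-z]", "", line))
    return {name: "".join(parts) for name, parts in records.items()}

example = """>seq_1
gacatcacgacgcacgacctacac
acacggananannnaggagaatga
>seq_2
acagcagcgaccgacgaccagca
atcagcagcagcaccacactacgcagctacac
atcacgac"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```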
Uploaded queries
It is possible to use sequences that you already have uploaded as "annotated sequences".
Other Options in the Query area
Type of search: FTO or Patentability. These are the two default search types. Selecting either one does not prevent you from manually overriding the database selection later.
- Patent Databases only. This is the default option and selecting it searches patent databases.
- Patents and Public Reference Databases. Selecting this option searches patent and reference databases.
See the "Automatic selection" section for more details.
Result Name. Give the result a name. This is the name under which the result will be displayed on the My GenomeQuest page.
Compare to both nucleotide and protein databases. Check this box if you want to compare your sequence(s) to both nucleotide and protein databases. By default (if this box is left unchecked), nucleotide query sequences are searched against nucleotide databases and protein queries are searched against protein databases.
When this box is checked and your query sequences are nucleotide sequences, GenomeQuest translates them in all six reading frames and compares the translated sequences to the protein databases. If your query sequences are protein sequences, all sequences from the nucleotide databases are translated in all six frames prior to comparison.
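For readers unfamiliar with six-frame translation, the idea can be sketched as below: three reading frames on the forward strand and three on the reverse complement. The sketch assumes the standard genetic code and writes "X" for codons containing ambiguous bases. The function names are illustrative, and GenomeQuest's own translation settings may differ.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(seq: str) -> list:
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    forward = seq.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (forward, reverse) for offset in range(3)]

print(six_frames("ATGGCCTAA"))
```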
Search Strategy
Four different search strategies are available in GenomeQuest. The table below shows some typical use cases for each search strategy.
| Use case | Description | Search strategy |
|---|---|---|
| Homology Search | Find sequences evolutionarily related to my query sequence(s) | BLAST |
| Classic patent search | Investigate claims on similar sequences where BLAST is specified as the method | BLAST |
| Defined patent search | Investigate claims on similar sequences specified in terms of percentage identity over that sequence | Percent Identity ‐ GenePAST |
| Small sequence search | Investigate matches to short query sequences like primers, probes and peptides | Percent Identity ‐ GenePAST |
| Primer search | Investigate all possible primers that can be made out of a longer sequence | Fragment Search |
| SNP search | Searching with small sequences that have a variable position. | Motif Search |
| Protein domain search | Searching with a protein sequence that has a known domain structure, e.g. an antibody with small variable regions and long fixed spacers | Motif Search |
GenePAST
GenomeQuest's GenePAST algorithm is the base choice for most patent‐related sequence searches. The GenePAST "percent identity" algorithm finds the best fit between the query sequence and the subject sequence, and expresses the alignment as an exact percentage. Unlike BLAST, GenePAST makes no alignment scoring adjustments based on considerations of biological relevance between query and subject sequences.
Use GenePAST when relevant patent considerations are, or will be, expressed in terms of sequence identity. GenePAST is particularly useful for getting straightforward search information when your query sequence is short or when you are looking for hits based on sequence identity over the entire length of the sequence.
The alignment that GenePAST finds is guaranteed to be the longest alignment between two sequences that meets or exceeds a given percent identity threshold. This is useful when investigating claims on a sequence based on percent identity over its length. Where BLAST will provide a shorter alignment of high homology, GenePAST will provide a longer alignment of perhaps lower identity (but still meeting the desired minimum threshold).
Unlike traditional approaches, where percent identity is computed only relative to the alignment, GenePAST lets you specify percentage identities relative to the sequence itself. For example, consider the following alignment:
    Q:  10 ATG-TATA
           ||| ||||
    S: 756 ATGGTATA
This gives 87.5% identity on the alignment (7 matches over an alignment of length 8) and 100% identity on the query (all 7 aligned query residues match).
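The arithmetic of this example can be reproduced with a few lines of Python. The sketch takes the two gapped alignment rows as input and assumes that "identity on query" means matches divided by the number of query residues in the alignment, as in the example above. The function name is illustrative.

```python
def percent_identities(query_aln: str, subject_aln: str, gap: str = "-") -> tuple:
    """Return (identity on alignment, identity on query) in percent."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned rows must have the same length")
    # A column counts as a match only if both residues are equal and not gaps.
    matches = sum(q == s and q != gap for q, s in zip(query_aln, subject_aln))
    alignment_length = len(query_aln)
    query_residues = sum(q != gap for q in query_aln)
    return (100.0 * matches / alignment_length, 100.0 * matches / query_residues)

on_alignment, on_query = percent_identities("ATG-TATA", "ATGGTATA")
print(on_alignment, on_query)  # 87.5 100.0
```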
Percentage Identities
GenePAST makes extensive use of percentage identities. There are three kinds of percentage identities and it is critical to understand what they mean. Below is an illustration of their meanings:
BLAST
BLAST is GenomeQuest’s implementation of the NCBI BLAST2 algorithm and finds the most relevant sequences in terms of biological similarity. Use BLAST when you're looking for the biological relationship between a query sequence and sequences in the subject database you're searching. BLAST helps scientists develop hypotheses about gene function by scoring alignments with respect to previously discovered relationships between homologous sequences.
In general, BLAST should not be used for searches with short query sequences (fewer than 20 nucleotides). Even with substantial “fine tuning” of the BLAST search parameters, a BLAST search with a short query sequence will miss results because of its heuristic approach.
The sequence search uses the following default BLAST parameters for nucleotide searches:
- Word size : 11
- E‐value cutoff : 10
- Scoring Matrix : NUC.3.1 (match = 1, mismatch = ‐3)
- Gap Opening : 5
- Gap Extension : 2
and the following parameters for protein searches:
- Word size : 3
- E‐value cutoff : 10
- Scoring Matrix : BLOSUM62
- Gap Opening : 11
- Gap extension : 1
Fragment Search
The Fragment Search strategy uses the gapless BLAST v1 algorithm over a “sliding window” and calculates percent identity as it slides the window along the query sequence(s). Use Fragment Search when you're evaluating claims containing language like “and any fragment 15‐25 nucleotides thereof.” Set the Fragment Search percent identity parameter to “100%” to find the longest perfect matches between your query sequence and subject sequences. By contrast, BLAST returns top hits containing gaps and extensions, and GenePAST returns alignments over the entire query sequence.
Note: The fragment search algorithm does not introduce gaps in constructing the alignment between the query and subject sequences.
For example, with Fragment Search it is possible to find all 30 bp fragments of a particular sequence in the entire public domain, provided they are at least 95% identical to your query. Fragment Search finds all regions that are at least 30 bp long with at least 95% identity, and extends each region for as long as this criterion is fulfilled. The end result is a fragment of length L, where L is at least 30, that has at least 95% identity to the query. Another example is to specify 100% identity over a given length. This finds the longest stretches of consecutive matches that span at least a window of length L. Thus, Fragment Search can be viewed as a “controlled BLAST” where the percent identity and the minimum length of the local alignment are specified as input.
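The sliding-window idea can be illustrated with a deliberately naive sketch. It is not GenomeQuest's implementation (which is described above as based on gapless BLAST v1). It compares the query and subject along each gapless diagonal, looks for a window that meets the identity threshold, and then greedily extends it to the right while the overall identity stays at or above the threshold. All names and the greedy extension rule are assumptions made for illustration.

```python
def fragment_hits(query: str, subject: str, window: int = 30, min_identity: float = 0.95) -> list:
    """Return (query_start, subject_start, length, identity) for gapless regions."""
    hits = []
    # A diagonal fixes the offset (subject position - query position); no gaps.
    for diag in range(-(len(query) - window), len(subject) - window + 1):
        q0, s0 = max(0, -diag), max(0, diag)
        n = min(len(query) - q0, len(subject) - s0)
        if n < window:
            continue
        matches = [query[q0 + i] == subject[s0 + i] for i in range(n)]
        i = 0
        while i + window <= n:
            m = sum(matches[i:i + window])
            if m / window < min_identity:
                i += 1
                continue
            end = i + window
            # Greedy extension: stop at the first position that breaks the threshold.
            while end < n and (m + matches[end]) / (end + 1 - i) >= min_identity:
                m += matches[end]
                end += 1
            hits.append((q0 + i, s0 + i, end - i, m / (end - i)))
            i = end
    return hits

print(fragment_hits("GATTACAGGC", "CCCCGATTACAGGCTTTT", window=5, min_identity=1.0))
```

With a 100% threshold and a window of 5, the example reports the single perfect 10-residue match starting at query position 0 and subject position 4 (0-based).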
Motif search
The Motif search lets you search nucleotide motifs against the nucleotide databases and protein motifs against the protein databases. Note that no translation is performed, i.e. a nucleotide motif cannot be searched in protein databases and a protein motif cannot be searched in nucleotide databases.
Please note that the GenomeQuest Motif search is IUPAC compliant. This means that, for DNA, an M is replaced by [AC], i.e. an A or a C. Similarly, a T is always automatically equivalent to a U and vice versa: T = [TU] = U.
| Search Type | GenomeQuest Motif | Matches |
|---|---|---|
| SNP search | AGCAGGGG[AC]CGCGCAT | AGCAGGGGACGCGCAT or AGCAGGGGCCGCGCAT |
| Repeat search | (ATG){5,} | ATGATGATGATGATG and more |
| Repeat search | ATTA{5,15}TT | ATTAAAAATT up until ATTAAAAAAAAAAAAAAATT |
| Domain search | AQV[LE]PRSIG | AQVLPRSIG or AQVEPRSIG |
| Advanced Domain search | C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H | For instance CXXCXXXLXXXXXXXXHXXXH, where X can be any residue. |
| Antibody search | VBVV.*VDDEEEF.*BVBVVV | The three Complementarity Determining Regions (CDRs) VBVV, VDDEEEF and BVBVVV interspersed by any other amino acid sequence. |
| Syntax | Description |
|---|---|
| . | The dot character represents any residue |
| T{n,m} | Stretch of T residues with length at least n and maximum m residues |
| T{n} | Stretch of T residues of length exactly n residues |
| T{n,} | Stretch of T residues of at least length n residues (no upper limit) |
| T* | Any length stretch of T residues or no T at all. Shortcut for T{0,} |
| T+ | Any length stretch of T residues. Shortcut for T{1,} |
| T? | A single T residue or no T at all. Shortcut for T{0,1} |
| [TG] | Either a T or G residue |
| [^TG] | Anything but a T or G residue |
| (ATTG\|GT) | Either ATTG or GT |
| ^ATCG | The sequence starts with ATCG |
| ATCG$ | The sequence ends with ATCG |
| ^ATCG$ | The sequence is exactly ATCG |
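The motif syntax above closely resembles common regular-expression syntax, so motifs can be prototyped locally before launching a search. The sketch below uses Python's `re` module as a stand-in. That GenomeQuest's engine behaves identically is an assumption. The helper `motif_to_regex` (a hypothetical name) expands the IUPAC nucleotide ambiguity codes and the T/U equivalence described earlier. It is meant for nucleotide motifs only, so protein motifs are passed to `re` unchanged.

```python
import re

# IUPAC nucleotide ambiguity codes; T and U are treated as equivalent.
IUPAC_DNA = {
    "R": "AG", "Y": "CTU", "S": "CG", "W": "ATU", "K": "GTU", "M": "AC",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG", "N": "ACGTU",
    "T": "TU", "U": "TU",
}

def motif_to_regex(motif: str) -> str:
    """Expand ambiguity codes in a nucleotide motif into character classes."""
    out = []
    in_class = False  # True while inside [...] so no nested brackets are produced
    for ch in motif:
        if ch == "[":
            in_class = True
            out.append(ch)
        elif ch == "]":
            in_class = False
            out.append(ch)
        elif ch in IUPAC_DNA:
            letters = IUPAC_DNA[ch]
            out.append(letters if in_class else "[" + letters + "]")
        else:
            out.append(ch)
    return "".join(out)

def dna_motif_search(motif: str, sequence: str) -> list:
    """Return every non-overlapping stretch of the sequence matching the motif."""
    pattern = re.compile(motif_to_regex(motif))
    return [m.group(0) for m in pattern.finditer(sequence.upper())]

print(motif_to_regex("AGCAGGGG[AC]CGCGCAT"))
print(dna_motif_search("(ATG){5,}", "CC" + "ATG" * 6 + "CC"))

# Protein motif from the table, used as-is.
domain = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")
print(bool(domain.fullmatch("CXXC" + "XXX" + "L" + "X" * 8 + "H" + "XXX" + "H")))
```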
Subject database selection
The GenomeQuest IP workflow provides the subset of the databases available in GenomeQuest that is meaningful for IP searches. If you need other databases to be added, do not hesitate to let us know at support@genomequest.com.
Apply filters to subject databases
It is now possible to apply filters to the subject databases before the comparisons are made. This gives more accurate results in some cases, for example:
- Only compare to database entries with a publication date before a given date.
- Remove mega-patents by only comparing to patents with fewer than a certain number of sequences. This can greatly improve speed and will probably improve the relevance of the results.
- Only compare to granted and pending patents. (You need access to the Gold Plus or Premium databases.)
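Conceptually, these filters reduce the set of subject entries before any alignment is computed. The sketch below shows that logic on an in-memory list. The record type `SubjectEntry`, its field names and the status labels are all hypothetical and do not reflect GenomeQuest's data model. In the product, the filters are set on the launch page.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SubjectEntry:
    # Hypothetical fields, chosen only to mirror the three filters above.
    accession: str
    publication_date: date
    sequences_in_patent: int
    legal_status: str  # e.g. "granted", "pending", "lapsed"

def apply_filters(entries, published_before=None, max_sequences=None, allowed_status=None):
    """Keep only entries passing every filter that was supplied."""
    kept = []
    for entry in entries:
        if published_before is not None and not entry.publication_date < published_before:
            continue
        if max_sequences is not None and not entry.sequences_in_patent < max_sequences:
            continue
        if allowed_status is not None and entry.legal_status not in allowed_status:
            continue
        kept.append(entry)
    return kept

entries = [
    SubjectEntry("P1", date(2005, 3, 1), 12, "granted"),
    SubjectEntry("P2", date(2012, 7, 15), 250000, "pending"),  # a "mega-patent"
    SubjectEntry("P3", date(2009, 1, 20), 40, "lapsed"),
]
selected = apply_filters(entries, published_before=date(2010, 1, 1),
                         max_sequences=1000, allowed_status={"granted", "pending"})
print([entry.accession for entry in selected])
```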
The databases
Some of the databases that we do not make available are:
| Databank | Reason for exclusion |
|---|---|
| RefSeq Genomic | It is generally unnecessary to search genomes for IP searches, especially large plant and mammalian genomes. |
| GQGene | It is composed of transcripts (already available in GenBank) and parts of genomes. |
| Drugbank | It is composed of sequences available elsewhere |
Automatic selection
When you choose Patent Databases only, only patent databases are selected. When you choose Patents and Public Reference Databases, the following databanks are selected, because we believe that these reference databases represent the best set of non-genomic databanks with minimal redundancy:
| Nucleotide databanks | Protein databanks | Type |
|---|---|---|
| GQPAT | GQPAT | Patent databases |
| Derwent Geneseq (paying option) | Derwent Geneseq (paying option) | Patent databases |
| Protein Data Bank | Protein Data Bank | Reference databases |
| | GenPept | Reference databases |
| GenBank Expressed Sequence Tags division | | Reference databases |
Manual selection
You can also manually select other databanks, or deselect some of the selected databanks, by clicking the checkboxes next to them.
Virtual databanks
It is also possible to use a virtual databank as the subject database. For technical reasons, it is not possible to mix a virtual databank and a normal databank in the same search. You create a virtual databank from the results of a keyword search.
Viewing results
The summary report gives overall details about how the search was run and which results were found. The most important link is the result count at the top of the page (for example, "900 results"); clicking this link opens a Sequence Result Browser with these results.
Warning about maximum number of results reached
A recently added feature shows a warning on the report page when the maximum number of results has been reached. The warning is activated for all algorithms except BLAST, since BLAST often reaches the limit. It helps you decide whether you need to increase the maximum number of results computed.
Redo and alerts
In the result report, there is a redo button in the top-right corner. Clicking this button takes you back to the launch page, which is filled in exactly as it was for the original search. You can change any value except the queries, since the queries are what tie all redos together. Launching a redo adds one element to the report: links to the subject sequences that were not found by the previous search.
Similarly, setting an alert on a result automatically launches a redo whenever one of the subject databases is updated. To set up an alert:
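The "not found by the previous search" element of a redo report amounts to a set difference between the two hit lists. A minimal sketch, assuming each hit can be identified by a subject accession string (the identifiers below are made up):

```python
def new_hits(previous_hits, redo_hits):
    """Return the redo hits that were absent from the previous search, in redo order."""
    seen = set(previous_hits)
    return [hit for hit in redo_hits if hit not in seen]

previous = ["AX0001", "BY0002"]                      # hypothetical accessions
redo = ["BY0002", "CZ0003", "AX0001", "DW0004"]
print(new_hits(previous, redo))  # ['CZ0003', 'DW0004']
```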
- Go to your My GenomeQuest page and view the listing of results.
- Click on the line of the IP search on which you want to set an alert. (Don't click on the name of the result, as this opens the result. Instead, click on an area of the entry outside the name.) This opens the details of the result.
- Click on the sharing tab.
- Click the check box in this tab.
The operations are marked with arrows in the picture below.
An alert runs every time one of your chosen subject databases is updated and it triggers an email to you with a summary of the new findings and a link to the results.








