Welcome to the GenomeQuest Documentation Wiki
BFQLPrimer
You should seriously consider reading System Concepts and the GQ Engine Primer before attacking this primer.
Intro
BFQL is the Biofacet Query Language ("Biofacet" is the code name for the GenomeQuest Engine). It is available in Biofacet through the -bfql option and in GQHTx through the -q option. One can query sequence databases (lspdb -bfql…, gqfetch -q…) and results (lspres -bfql…, gqresult -q…).
There are two ways to use BFQL. The simple way uses basic comparison operators combined with classical logic operators. The complex way uses conditional expressions, loops and other constructs of the language. Indices are used only with the simple way.
BFQL: Simply
- BFQL allows the use of comparison operators for strings, numbers and ranges.
- All sequence annotations and alignment properties can be used and compared to constants.
- Whether one uses BFQL to examine a sequence databank or a sequence comparison result, some variables are implicit.
- All the power of first order logic is at hand.
- BFQL is case insensitive, except where case sensitivity is explicitly requested.
When querying a sequence databank, the sequence $seq is the implicit variable. All annotations are directly accessible, e.g. ID for identifier, as well as sequence properties, e.g. L for sequence length.
When querying a result databank, the result $res is the implicit variable. All result properties are directly accessible, e.g. RIQ for %identity on the query, as well as query and subject sequences, e.g. q.ID and s.AC for query identifier and subject accession number respectively.
In all examples below, we assume the BFQL is placed inside of single quotes in a -bfql statement running on lspdb or lspres.
Looking for a specific ID and a word in the description:
id="NC_0000001" AND de="chromosome"
is equivalent to
ID="nc_0000001" aNd De="ChRoMoSoMe"
The above expressions, if used when exploring a sequence databank, will assume the annotations ID and DE are applied to sequences.
It could have been written explicitly as:
$seq.id="NC_0000001" AND $seq.de="chromosome"
with $seq meaning the current sequence in this context.
For results, $res is the implicit variable:
$res.s.id="NC_0000001" AND $res.rs > 300
is equivalent to
s.id="NC_0000001" AND rs > 300
Expressions are tied together with logic operators and parentheses:
(id="NC_0000001" AND NOT ac="ABCDEF") AND (de="chromosome" OR de="contig")
is equivalent to
(id="NC_0000001" && ! ac="ABCDEF") && (de="chromosome" || de="contig")
Although comparing annotations to non-constants is perfectly valid, the indexing will not be used. Each sequence (resp. result) will be checked in turn:
id=ac+".1"
In this case, since ac is a sequence annotation field rather than a constant, each sequence must be checked in turn, without the benefit of indices.
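The difference between the two evaluation strategies can be pictured with a small Python sketch. It is only a model: it assumes an index that maps annotation values to record positions, which is not how Biofacet is documented to work internally, and the names `records`, `id_index`, `lookup_by_constant` and `scan` are hypothetical. Case folding is left out for brevity.

```python
# Toy model: three records with ID and AC annotations (made-up values).
records = [
    {"id": "A1.1", "ac": "A1"},
    {"id": "B2", "ac": "B7"},
    {"id": "C3.1", "ac": "C3"},
]

# Assumed index: annotation value -> positions of the matching records.
id_index = {}
for pos, rec in enumerate(records):
    id_index.setdefault(rec["id"], []).append(pos)


def lookup_by_constant(value):
    """Model of id="constant": a single index lookup, no scan."""
    return id_index.get(value, [])


def scan(predicate):
    """Model of id=ac+".1": the right-hand side differs per record,
    so every record has to be evaluated in turn."""
    return [pos for pos, rec in enumerate(records) if predicate(rec)]


print(lookup_by_constant("B2"))
print(scan(lambda rec: rec["id"] == rec["ac"] + ".1"))
```

The first call touches one dictionary entry; the second visits all three records, which is the cost the paragraph above describes.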
Strings and Numbers
Complex String Matching
Strings can be compared in a variety of ways. Letter modifiers appended to the = sign select the matching mode. The basic modifiers are:
- i for case insensitive
- kw for the special keyword modifier
- w for word
- s for substring
- r for regex
- m for exact match
For instance:
DE =kwi '(transketolase*, TKT) ! FAD'
The kw operator is equivalent to simply writing =. It interprets the right-hand-side expression and decomposes it into simpler expressions. In this case, it looks, in a case-insensitive way, for records containing a word starting with transketolase (transketolase, transketolases) or the word TKT, and not FAD.
DE =w 'transketolase Formaldehyde'
matches records with both the words transketolase and Formaldehyde in a case sensitive way.
DE =mi 'Formaldehyde transketolase'
The description record must contain exactly the given phrase and nothing more (case insensitive).
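To make the w, s, r and m modifiers concrete, here is a minimal Python model of their behaviour as described above. It is a sketch, not the engine's implementation: the function name `match` is hypothetical, words are assumed to be runs of alphanumeric characters, and the kw modifier is deliberately left out because its expression syntax is richer than this model.

```python
import re


def match(field, pattern, modifiers=""):
    """Toy model of =w, =s, =r and =m, each optionally combined with i."""
    ignore_case = "i" in modifiers
    mode = modifiers.replace("i", "")
    field_cmp = field.lower() if ignore_case else field
    pattern_cmp = pattern.lower() if ignore_case else pattern

    if mode == "w":  # every word of the pattern occurs as a word in the field
        words = set(re.findall(r"\w+", field_cmp))
        return all(word in words for word in pattern_cmp.split())
    if mode == "s":  # substring
        return pattern_cmp in field_cmp
    if mode == "r":  # regular expression
        flags = re.IGNORECASE if ignore_case else 0
        return re.search(pattern, field, flags) is not None
    if mode == "m":  # exact match, nothing more
        return field_cmp == pattern_cmp
    raise ValueError("modifier not covered by this sketch: " + modifiers)


de = "Formaldehyde transketolase"
print(match(de, "transketolase Formaldehyde", "w"))
print(match(de, "transketolase formaldehyde", "w"))
print(match(de, "formaldehyde transketolase", "mi"))
```

The second call fails only because =w without i is case sensitive, which mirrors the =w example above.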
Ranges
Ranges apply to sequence databanks or results. A range is one or more intervals (comma separated) allowing access to sequences or results. These numbers are the direct indices of each sequence or result in the associated file, an extremely compact representation for large collections of entities. These intervals are numerical. A range is either associated with a result, with a sequence databank or with nothing.
For sequence databank querying:
[1:1000, 2000:*]
will select the first 1000 sequences and sequences 2000 to the last one included.
For results, there are three types of ranges: ranges of results, of the query database, or of the subject database:
[-3]@$resdb
will select the third result from the end.
[-8:-1]@$resdb.qbnk
will select the last eight query sequences.
[*]@$resdb.sbnk
will select all subject sequences (it could have been written as [1:-1]).
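The way negative bounds and * resolve can be sketched in Python. This is a model inferred from the examples above (1-based, inclusive bounds, -1 being the last item); the helper names `resolve` and `expand` are hypothetical, and the collection size `n` has to be supplied by hand here.

```python
def resolve(bound, n):
    """Turn one bound into a 1-based position in a collection of n items."""
    if bound == "*":
        return n
    value = int(bound)
    return n + 1 + value if value < 0 else value


def expand(interval, n):
    """Expand 'a:b', 'a' or '*' into an inclusive (first, last) pair."""
    if interval == "*":
        return (1, n)
    first, sep, last = interval.partition(":")
    if not sep:  # a single position such as -3
        last = first
    return (resolve(first, n), resolve(last, n))


print(expand("-3", 10))     # third item from the end
print(expand("-8:-1", 10))  # last eight items
print(expand("*", 10), expand("1:-1", 10))
```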
Ranges can be manipulated with classic set operators:
Union
[1:100, 200:300] | [120:320] == [1:100,120:320]
Intersection
[1:100, 200:300] & [120:320] == [200:300]
Difference
[1:100, 200:300] - [120:320] == [1:100]
Complement
~[1:100, 200:300] == [101:199,301:*]
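These identities can be checked with ordinary Python sets. The sketch below models a range as a set of integers; `to_set`, `to_intervals` and the size `N` standing in for * are assumptions made for illustration, since the real size depends on the databank.

```python
def to_set(*intervals):
    """Build a set of positions from inclusive (first, last) pairs."""
    items = set()
    for first, last in intervals:
        items.update(range(first, last + 1))
    return items


def to_intervals(items):
    """Collapse a set of integers back into sorted (first, last) pairs."""
    result = []
    for value in sorted(items):
        if result and value == result[-1][1] + 1:
            result[-1] = (result[-1][0], value)
        else:
            result.append((value, value))
    return result


N = 1000  # assumed collection size, plays the role of *
a = to_set((1, 100), (200, 300))
b = to_set((120, 320))
universe = to_set((1, N))

print(to_intervals(a | b))         # union
print(to_intervals(a & b))         # intersection
print(to_intervals(a - b))         # difference
print(to_intervals(universe - a))  # complement
```

Python sets have no ~ operator, so the complement is written as a difference from the full collection.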
Lists
A list is an ordered bag of miscellaneous items:
{1, 4, 4, 12}
{4, "GQ", {"ABC", [1:1024]}}
The second list is composed of a number, a string and another list, the latter containing a string and a range.
Files
BFQL can read a file of items and directly compare the content of the file to, for instance, an annotation. Let's suppose there is a file called MYFILE of accession numbers (one per line). It can be directly applied to a sequence databank:
ac=@"MYFILE"
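Conceptually, the file comparison is a membership test against the lines of the file. The Python sketch below models that idea; the helper names and the accession values are made up, a temporary file stands in for MYFILE, and the case folding is an assumption based on BFQL being case insensitive by default.

```python
import os
import tempfile


def load_items(path):
    """Read one item per line, skipping blank lines and folding case."""
    with open(path) as handle:
        return {line.strip().casefold() for line in handle if line.strip()}


def select_by_file(records, field, path):
    """Keep the records whose field value appears in the file."""
    wanted = load_items(path)
    return [rec for rec in records if rec[field].casefold() in wanted]


# Temporary stand-in for MYFILE, one accession per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("P00001\nq00002\n\n")
    myfile = tmp.name

records = [{"ac": "P00001"}, {"ac": "Q00002"}, {"ac": "X99999"}]
hits = select_by_file(records, "ac", myfile)
os.remove(myfile)
print([rec["ac"] for rec in hits])
```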
Putting it all together
To find badly annotated transketolases in eukaryotes, one can use such a query:
(de={"transketolase,tkt*"} or gn="tkt*") and oc="eukaryota" and de != "fragment" and l < 400
This will find sequences whose description contains transketolase or a word prefixed by tkt or a gene name prefixed by tkt, from a species classified as eukaryota, with a description not containing the word fragment and with a sequence length smaller than 400. These sequences are probably fragments (but not annotated as such) or wrongly annotated transketolases.
BFQL: Complex Queries
Using the complex features of BFQL disables BFQL indexing, which can make queries slower. Whenever the simple form of BFQL is sufficient, it is the more efficient choice. However, there are many useful features in complex BFQL.
The basic workings of BFQL are very similar to C/C++ or Perl:
PRE {
// some commands to run before doing any query.
}
//
//commands to run for each item(result or sequence) in turn.
//
POST {
//some commands to run after selection has been done.
}
Note that PRE and POST blocks are optional. // is used for comments. The central block is not surrounded by { }.
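The control flow of such a script can be modelled in a few lines of Python. This is a sketch of the execution order only (PRE once, the central block once per item, POST once); the function `run` and the shared `state` dictionary are hypothetical stand-ins for the engine and for BFQL variables. The value returned by the central block decides whether the item is selected, as this primer explains in the section on complex selection.

```python
def run(items, pre, main, post):
    """Model of a BFQL run: PRE, then main for each item, then POST."""
    state = {}
    pre(state)
    selected = [item for item in items if main(state, item)]
    post(state)
    return selected


def pre(state):
    state["seen"] = 0  # runs once, before any item


def main(state, item):
    state["seen"] += 1  # runs for each item in turn
    return len(item) < 5  # truthy value selects the item


def post(state):
    print("seen:", state["seen"])  # runs once, after all items


selected = run(["ACGT", "ACGTACGT", "GG"], pre, main, post)
print(selected)
```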
Because we're going to get larger BFQL scripts now, it's useful to point out that you can write BFQL in a file. If you write on the command line, you proceed with:
% lspdb MYDB -bfql '<code here>'
If you instead place your BFQL into a file, you can run with the -fbfql option. The below example assumes you wrote a BFQL program into the file myprogram.bfql:
% lspdb MYDB -fbfql myprogram.bfql
Basic Syntax
Lines and Blocks
Lines are ended with ; and blocks are surrounded by {}.
Variables
Variables are prefixed by a $ sign and the assignment operator is :=
$mylist := {1, 1, 2, 3, 5, 8, 13};
$myvar := 4;
Conditions and loops
A simple if/else.
if ($myvar == 4) {
$myvar += 10;
}
else {
$myvar++;
}
A while loop using the echo operator.
$counter := 10;
while ($counter-- > 0) {
echo $counter;
}
The foreach expression.
$mylist := {"1", "2", "3"};
foreach ($val in $mylist) {
echo $val;
}
Here we loop over the list of generic names of a virtual database and stop when/if we reach one called "my_gname" using the break statement (note that the continue statement also exists).
$found := false;
$nb := 0;
foreach ($db in $seqdb.phybnk) {
$nb++;
if ($db.gname == "my_gname") {
echo "found and breaks";
$found := true;
break;
}
}
echo "found: " + $found;
echo "nb: " + $nb;
Functions
BFQL fully supports functions, even lambda functions! Here is just a sampling of some of the more general-purpose functions. For further reference, see File:BFQLReference.pdf.
| Name | Example | Description |
|---|---|---|
| funlist | echo funlist(); | outputs the list of available functions |
| version | echo version(); | prints the version |
| string | $mystr := string(1); echo $mystr+"3"; | prints 13. transforms a number into a string |
| num | $mynum := num("1024"); | transforms a string into a number |
| min / max | $ten := min(42, 10); | returns the min/max number |
| strlen | echo strlen("012345"); | prints 6. returns the string length |
| substr | echo substr("012345",2); echo substr("012345", 2, 3); | prints 2345. prints 234. |
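Judging from the outputs in the table, substr takes a 0-based start position and an optional length. A Python model of that reading, together with the string and num conversions, is sketched below; the Python function `substr` is a stand-in written for this illustration and its semantics are inferred from the two examples only.

```python
def substr(text, start, length=None):
    """Model of substr(text, start[, length]) with a 0-based start."""
    if length is None:
        return text[start:]
    return text[start:start + length]


print(substr("012345", 2))     # same call as in the table
print(substr("012345", 2, 3))  # same call as in the table
print(str(1) + "3")            # string(1) + "3"
print(int("1024") + 1)         # num("1024") used as a number
print(min(42, 10), len("012345"))
```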
Complex Query Usage
There are two ways to use complex BFQL. Either one wants to tell Biofacet which sequences/results are selected, or one wants to output something computed by the BFQL script.
Complex selection of items
One can construct a complex query to extract only the sequences/results wanted. When the main block returns true, the current sequence/result is passed on. Since conditions and loops do not return anything, the selection value must be stated explicitly outside the loop/condition:
$toreturn := false;
if (ID=="1VS7") {
$toreturn := true;
}
$toreturn;
Note the last line: it explicitly returns the value of $toreturn (true or false) to the GQ Engine.
Complex scripting - Computing GC content
With the echo operator, one can produce complex output based on reading the sequence/result database.
In the below example we compute the %GC content for a sequence database:
PRE {
$globalcount := 0;
}
$count := 0; // reset the G/C counter for each sequence
$ii := 0;
while ($ii < L) {
if (substr(s, $ii, 1) == "C" || substr(s, $ii, 1) == "G") {
$count++;
}
$ii++;
}
$globalcount := $globalcount + $count;
echo ID + " " + $count + " " + 100.0*$count/l;
POST {
echo "global " + $globalcount + " " + 100.0*$globalcount/$seqdb.nbresids;
}
Notes:
- The use of uppercase L (line 6) is just for clarity. A lowercase l could have been written as well.
- To compute a percentage, the use of 100.0 turns the expression into a floating-point one, so we have inline casting. Using 100 would have kept the expression as an integer operation.
- Even if only one statement is done in a loop/condition, { } are necessary.
- The script can be put in a file and called with the -fbfql switch instead of -bfql.
- To run this with biofacet, use -printf '' to avoid having extra information printed, for instance:
% lspdb MYDB -fbfql myscriptabove -printf ''
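For readers who want to cross-check the numbers outside the engine, the same computation can be sketched in Python. This is an independent model, not a translation produced by any GenomeQuest tool: `gc_report` is a hypothetical name, sequences are passed in as (identifier, residues) pairs, and the G/C count is computed afresh for each sequence before being added to the global total.

```python
def gc_report(sequences):
    """Return the per-sequence and global G/C report lines."""
    global_count = 0
    total_residues = 0
    lines = []
    for identifier, residues in sequences:
        # Per-sequence count: uppercase C and G only, as in the script.
        count = sum(1 for base in residues if base in ("C", "G"))
        global_count += count
        total_residues += len(residues)
        lines.append(f"{identifier} {count} {100.0 * count / len(residues)}")
    lines.append(f"global {global_count} {100.0 * global_count / total_residues}")
    return lines


for line in gc_report([("s1", "ACGT"), ("s2", "GGCC")]):
    print(line)
```

As in the BFQL script, multiplying by 100.0 keeps the division in floating point.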
More Docs
Now that you've read the BFQL Primer, there are a few directions you can go:
- Go Broad: Get the overall System Concepts for the entire GQ platform, from command line to web
- Go Deep: Learn more about the GQ Engine.
- Go Deeper: Learn more about BFQL with the BFQL Reference Manual: File:BFQLReference.pdf
- Get Busy: Learn how to write your own workflows
As always, we're here to help. Reach out to support@genomequest.com for any question, any time.