Welcome to the GenomeQuest Documentation Wiki

BFQLPrimer


You should seriously consider reading System Concepts and the GQ Engine Primer before attacking this primer.


Intro

BFQL is the Biofacet Query Language ("Biofacet" is the code name for the GenomeQuest Engine). It is available in Biofacet through the -bfql option and in GQHTx through the -q option. One can query sequence databases (lspdb -bfql…, gqfetch -q…) and results (lspres -bfql…, gqresult -q…).

There are two ways to use BFQL. The simple way uses basic comparison operators combined with classical logic operators. The complex way uses conditional expressions, loops and other constructs of the language. Indices are used only with the simple way.

BFQL: Simply

  • BFQL allows the use of comparison operators for strings, numbers and ranges.
  • All sequence annotations and alignment properties can be used and compared to constants.
  • Whether one uses BFQL to examine a sequence databank or a sequence comparison result, some variables are implicit.
  • All the power of first order logic is at hand.
  • BFQL is case insensitive except where explicitly stated otherwise.

When querying a sequence databank, the sequence $seq is the implicit variable. All annotations are directly accessible, e.g. ID for identifier, as well as sequence properties, e.g. L for sequence length.

When querying a result databank, the result $res is the implicit variable. All result properties are directly accessible, e.g. RIQ for %identity on the query, as well as query and subject sequences, e.g. q.ID and s.AC for query identifier and subject accession number respectively.

In all examples below, we assume the BFQL is placed inside of single quotes in a -bfql statement running on lspdb or lspres.

Looking for a specific ID and a word in the description:

id="NC_0000001" AND de="chromosome"

is equivalent to

ID="nc_0000001" aNd De="ChRoMoSoMe"

The above expressions, if used when exploring a sequence databank, will assume the annotations ID and DE are applied to sequences.

It could have been written explicitly as:

$seq.id="NC_0000001" AND $seq.de="chromosome"

with $seq meaning the current sequence in this context.

For results, $res is the implicit variable:

$res.s.id="NC_0000001" AND $res.rs > 300

is equivalent to

s.id="NC_0000001" AND rs > 300

Expressions are tied together with logic operators and parentheses:

(id="NC_0000001" AND NOT ac="ABCDEF") AND (de="chromosome" OR de="contig")

is equivalent to

(id="NC_0000001" ! ac="ABCDEF") && (de="chromosome" || de="contig")

Although comparing annotations to non-constants is perfectly valid, the indexing will not be used. Each sequence (resp. result) will be checked in turn:

id=ac+".1"

In this case, since ac is a sequence annotation field rather than a constant, each sequence must be checked in turn, without the benefit of indices.

Strings and Numbers

Complex String Matching

Strings can be compared in a variety of ways, selected by adding letter modifiers to the = sign. The basic modifiers are:

  • i for case insensitive
  • kw for the special keyword modifier
  • w for word
  • s for substring
  • r for regex
  • m for exact match

For instance:

DE =kwi '(transketolase*, TKT) ! FAD'

The =kwi operator is equivalent to simply writing =. It interprets the right-hand side expression and decomposes it into simpler expressions. In this case, it will look in a case insensitive way for a record containing words starting with transketolase (transketolase, transketolases) or the word TKT, and not containing FAD.

DE =w 'transketolase Formaldehyde'

matches records with both the words transketolase and Formaldehyde in a case sensitive way.

DE =mi 'Formaldehyde transketolase'

The description record must contain exactly the given phrase and nothing more (case insensitive).
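
To make the difference between the word and exact-match modifiers concrete, here is a rough Python emulation of the =w and =mi examples above. It is only a sketch of the behaviour described in this primer, not the GQ Engine's implementation: the function names `match_w` and `match_m` are hypothetical, and the way a description is split into words is an assumption.

```python
import re


def words(text):
    """Split text into words (assumed tokenization: letters, digits, underscore)."""
    return re.findall(r"[A-Za-z0-9_]+", text)


def match_w(field, pattern, ignore_case=False):
    """Emulate =w (or =wi): every word of the pattern occurs as a word in the field."""
    if ignore_case:
        field, pattern = field.lower(), pattern.lower()
    return set(words(pattern)) <= set(words(field))


def match_m(field, pattern, ignore_case=False):
    """Emulate =m (or =mi): the field is exactly the pattern and nothing more."""
    if ignore_case:
        field, pattern = field.lower(), pattern.lower()
    return field == pattern


# Word match is order independent but, without the i modifier, case sensitive.
both_words = match_w("Formaldehyde transketolase 1", "transketolase Formaldehyde")
wrong_case = match_w("formaldehyde transketolase", "transketolase Formaldehyde")

# Exact match ignores case with the i modifier but tolerates no extra text.
exact = match_m("FORMALDEHYDE TRANSKETOLASE", "Formaldehyde transketolase", ignore_case=True)
extra = match_m("Formaldehyde transketolase 1", "Formaldehyde transketolase", ignore_case=True)
```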

Ranges

Ranges apply to sequence databanks or results. A range is one or more intervals (comma separated) allowing access to sequences or results. These numbers are the direct indices of each sequence or result in the associated file, an extremely compact representation for large collections of entities. These intervals are numerical. A range is either associated with a result, with a sequence databank or with nothing.

For sequence databank querying:

[1:1000, 2000:*]

will select the first 1000 sequences and sequences 2000 to the last one included.

For results, there are three types of ranges: ranges of results, ranges over the query database and ranges over the subject database:

[-3]@$resdb

will select the third result from the end.

[-8:-1]@$resdb.qbnk

will select the last eight query sequences.

[*]@$resdb.sbnk

will select all subject sequences (it could have been written as [1:-1]).
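
The way negative numbers and * map to positions can be sketched in a few lines of Python. This only illustrates the convention visible in the examples above (positions are 1-based, -1 is the last item, * is the last item); the function name `resolve` and the databank size used here are assumptions.

```python
def resolve(start, end, size):
    """Turn a BFQL-style interval into absolute 1-based (start, end) bounds.

    Negative numbers count from the end (-1 is the last item) and "*"
    stands for the last item, as in the examples above.
    """
    def absolute(value):
        if value == "*":
            return size
        return size + 1 + value if value < 0 else value

    return absolute(start), absolute(end)


size = 5000  # assumed number of items in the databank or result

third_from_end = resolve(-3, -3, size)    # [-3]
last_eight = resolve(-8, -1, size)        # [-8:-1]
from_2000 = resolve(2000, "*", size)      # [2000:*]
everything = resolve(1, -1, size)         # [1:-1], same as [*]
```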

Ranges can be manipulated with classic set operators:

Union

[1:100, 200:300] | [120:320] == [1:100,120:320]

Intersection

[1:100, 200:300] & [120:320] == [200:300]

Difference

[1:100, 200:300] - [120:320] == [1:100]

Complement

~[1:100, 200:300] == [101:199,301:*]
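
The four identities above can be checked with a small Python sketch that expands each range into a set of indices and collapses the result back into intervals. This is an illustration only: the engine stores ranges compactly rather than as expanded sets, the helper names are hypothetical, and the databank size of 400 standing in for * is an assumption.

```python
def to_set(intervals):
    """Expand inclusive (start, end) intervals into a set of indices."""
    return {i for start, end in intervals for i in range(start, end + 1)}


def to_intervals(indices):
    """Collapse a set of indices back into sorted inclusive intervals."""
    result = []
    for i in sorted(indices):
        if result and i == result[-1][1] + 1:
            result[-1] = (result[-1][0], i)   # extend the current interval
        else:
            result.append((i, i))             # start a new interval
    return result


a = [(1, 100), (200, 300)]
b = [(120, 320)]
size = 400  # assumed databank size, standing in for "*"

union = to_intervals(to_set(a) | to_set(b))
intersection = to_intervals(to_set(a) & to_set(b))
difference = to_intervals(to_set(a) - to_set(b))
complement = to_intervals(to_set([(1, size)]) - to_set(a))
```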

Lists

A list is an ordered bag of miscellaneous items:

{1, 4, 4, 12}
{4, "GQ", {"ABC", [1:1024]}}

The second list above is composed of a number, a string and another list, the latter containing a string and a range.


Files

BFQL can read a file of items and directly compare the content of the file to, for instance, an annotation. Let's suppose there is a file called MYFILE of accession numbers (one per line). It can be directly applied to a sequence databank:

ac=@"MYFILE"
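
Conceptually, the file comparison behaves like a membership test against the lines of the file. The Python sketch below illustrates that idea under stated assumptions: records are modelled as dictionaries with an "ac" key, the accession numbers are made up, and the comparison is an exact string match (how BFQL treats case and whitespace here is not covered by this primer).

```python
import io


def load_accessions(lines):
    """Read one accession number per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def select_by_accession(records, accessions):
    """Keep the records whose accession number appears in the file."""
    return [record for record in records if record["ac"] in accessions]


# io.StringIO stands in for open("MYFILE"); the accessions are invented.
myfile = io.StringIO("AB000001\n\nAB000003\n")
records = [{"ac": "AB000001"}, {"ac": "AB000002"}, {"ac": "AB000003"}]

selected = select_by_accession(records, load_accessions(myfile))
```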

Putting it all together

To find badly annotated transketolases in eukaryotes, one can use such a query:

(de={"transketolase,tkt*"} or gn="tkt*") and oc="eukaryota" and de != "fragment" and l < 400

This will find sequences whose description contains transketolase or a word prefixed by tkt or a gene name prefixed by tkt, from a species classified as eukaryota, with a description not containing the word fragment and with a sequence length smaller than 400. These sequences are probably fragments (but not annotated as such) or wrongly annotated transketolases.

BFQL: Complex Queries

Using the complex features of BFQL disables BFQL indexing, which can make queries slower. Whenever a query can be expressed in simple BFQL, that form will be more efficient. However, there are many useful features in complex BFQL.

The basic workings of BFQL are very similar to C/C++ or Perl:

PRE {
// some commands to run before doing any query.
}
//
//commands to run for each item(result or sequence) in turn.
//
POST {
//some commands to run after selection has been done.
}

Note that PRE and POST blocks are optional. // is used for comments. The central block is not surrounded by { }.

Because we're going to get larger BFQL scripts now, it's useful to point out that you can write BFQL in a file. If you write on the command line, you proceed with:

% lspdb MYDB -bfql '<code here>'

If you instead place your BFQL into a file, you can run with the -fbfql option. The below example assumes you wrote a BFQL program into the file myprogram.bfql:

% lspdb MYDB -fbfql myprogram.bfql

Basic Syntax

Lines and Blocks

Lines are ended with ; and blocks are surrounded by {}.

Variables

Variables are prefixed by a $ sign and the assignment operator is :=

$mylist := {1, 1, 2, 3, 5, 8, 13};
$myvar := 4;

Conditions and loops

A simple if/else.

if ($myvar == 4) {
	$myvar += 10;
}
else {
	$myvar++;
}

A while loop using the echo operator.

$counter := 10;
while ($counter-- > 0) {
  echo $counter;
}

The foreach expression.

$mylist := {"1", "2", "3"};
foreach ($val in $mylist) {
  echo $val;
}

Here we loop over the list of generic names of a virtual database and stop when/if we reach one called “my_gname” using the break statement (note that the continue statement also exists).

$found := false;
$nb := 0;

foreach ($db in $seqdb.phybnk) {
  $nb++;
  if ($db.gname == "my_gname") {
    echo "found and breaks";
    $found := true;
    break;
  }
}

echo "found: " + $found;
echo "nb:    " + $nb;

Functions

BFQL fully supports functions, even lambda functions! Here is just a sampling of some of the more general purpose functions. For further reference, see File:BFQLReference.pdf.

Name       Example                                  Description
funlist    echo funlist();                          outputs the list of available functions
version    echo version();                          prints the version
string     $mystr := string(1); echo $mystr+"3";    transforms a number into a string; prints 13
num        $mynum := num("1024");                   transforms a string into a number
min / max  $ten := min(42, 10);                     returns the min/max number
strlen     echo strlen("012345");                   returns the string length; prints 6
substr     echo substr("012345", 2);                prints 2345
           echo substr("012345", 2, 3);             prints 234
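
The substr examples above imply a 0-based start position followed by an optional length. As a cross-check, the Python sketch below reproduces the documented outputs; it is an analogy inferred from those examples, not a specification of the BFQL functions.

```python
def substr(text, start, length=None):
    """Mimic substr as shown above: 0-based start, optional length."""
    if length is None:
        return text[start:]
    return text[start:start + length]


tail = substr("012345", 2)          # counterpart of substr("012345", 2)
middle = substr("012345", 2, 3)     # counterpart of substr("012345", 2, 3)
joined = str(1) + "3"               # counterpart of string(1) + "3"
length = len("012345")              # counterpart of strlen("012345")
```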

Complex Query Usage

There are two ways to use complex BFQL: either one wants to tell Biofacet which sequences/results are selected, or one wants to output something based on BFQL.

Complex selection of items

One can construct a complex query to extract only the sequences/results wanted. When the main block returns true, the current sequence/result is passed on. Since conditions and loops do not return anything, the result must be stated explicitly outside the loop/condition:

$toreturn := false;
if (ID=="1VS7") {
	$toreturn := true;
}
$toreturn;

Note the last line: it explicitly returns the value of $toreturn (true or false) to the GQ Engine.
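
The selection mechanism can be pictured with a short Python analogy. This is not how the GQ Engine is implemented: `run_query`, `pre`, `main` and `post` are hypothetical names, and records are modelled as dictionaries. The sketch only shows the order in which the optional PRE block, the per-item block and the optional POST block run, and how a true value from the per-item block selects an item.

```python
def run_query(items, main, pre=None, post=None):
    """Mimic the PRE / per-item / POST flow (illustrative only)."""
    state = {}                       # variables shared by all blocks
    if pre is not None:
        pre(state)                   # runs once, before any item is read
    selected = [item for item in items if main(item, state)]
    if post is not None:
        post(state)                  # runs once, after the last item
    return selected, state


def main(record, state):
    """Counterpart of the script above: select the record whose ID is 1VS7."""
    to_return = False
    if record["ID"] == "1VS7":
        to_return = True
    return to_return                 # the last value decides the selection


records = [{"ID": "1VS7"}, {"ID": "2ABC"}]
selected, _ = run_query(records, main)
```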

Complex scripting - Computing GC content

With the echo operator, one can output complex output based on reading the sequence/result database.

In the below example we compute the %GC content for a sequence database:

PRE {
  $globalcount := 0;
}
$count := 0; // reset for each sequence
$ii := 0;
while ($ii < L) {
  if (substr(s, $ii, 1) == "C" || substr(s, $ii, 1) == "G") {
    $count++;
  }
  $ii++;
}
$globalcount := $globalcount + $count;
echo ID + " " + $count + " " + 100.0*$count/l;
POST {
  echo "global " + $globalcount + " " + 100.0*$globalcount/$seqdb.nbresids;
}

Notes:

  • The use of uppercase L (line 6) is just for clarity. A lowercase l could have been written as well.
  • To compute a percentage, the use of 100.0 transforms the expression into floating point-based. So we have inline casting. Using 100 would have kept the expression as an integer operation.
  • Even if only one statement is done in a loop/condition, { } are necessary.
  • The script can be put in a file and called with the -fbfql switch instead of -bfql.
  • To run this with Biofacet, use -printf '' to avoid having extra information printed, for instance:
% lspdb MYDB -fbfql myscriptabove -printf ''
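
To cross-check the output of the script on a small test databank, the same computation can be sketched in Python. The function name `gc_report` and the input format (pairs of identifier and sequence) are assumptions made for this illustration; the sketch counts uppercase G and C only and assumes no sequence is empty.

```python
def gc_report(records):
    """Return per-sequence and global GC lines for (identifier, sequence) pairs."""
    lines = []
    global_count = 0
    total_residues = 0
    for identifier, sequence in records:
        # Per-sequence count, starting from zero for every sequence.
        count = sum(1 for base in sequence if base in "GC")
        global_count += count
        total_residues += len(sequence)
        lines.append(f"{identifier} {count} {100.0 * count / len(sequence)}")
    lines.append(f"global {global_count} {100.0 * global_count / total_residues}")
    return lines


report = gc_report([("a", "GGCC"), ("b", "ATAT"), ("c", "ACGT")])
```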

More Docs

Now that you've read the BFQL Primer, there are a few directions you can go:

As always, we're here to help. Reach out to support@genomequest.com for any question, any time.
