Welcome to the GenomeQuest Documentation Wiki

LSPDB HELP

Option/sub-option Description Examples
-version Prints the version number > lspdb -version
lspdb: BIOFACET v6.0.1.D010 [2009.07.22] (c) GenomeQuest
[Build: LINUX-x86_64-64bit-gcc-libc-2.5-OPT 16:01:26]
-h / -help Prints a long list of all available options > lspdb -help|more
-verbose Obsolete
-nolog Suppresses command log information (see ~/.lassaprc/). > lspdb DB -nolog
-stdoutfile <FILE> Redirects standard out to file <FILE> > lspdb DB -stdoutfile STDOUT
-stderrfile <FILE> Redirects standard error to file <FILE> > lspdb DB -stderrfile STDERR
-gui Obsolete
-licwait Obsolete
-licqueue Obsolete
-sort <sort expression> Old sorting implementation. Use bfqlsort instead > lspdb DB -sort 'H#OS,-L'
-group <group expression> Old grouping implementation. Use bfqlgroup instead. > lspdb DB -group '{H#OS},{-L},{H#OS}'
-bfqlsort <bfqlsort expression> Used to sort. Multiple criteria sorting is allowed, as well as complex sorting expressions. Use - or + to specify sorting order; +, the default, is ascending order. > lspdb DB -bfqlsort '[OS],-[L]'
-bfqlgroup <bfqlgroup expression> Used for grouping sequences. Refer to the complete manual for details. > lspdb DB -bfqlgroup '{[OS]},{[-L]},{[OS]}'
-fbfqlgroup <FILE> Reads <FILE> and uses it as a bfql group expression. > lspdb DB -fbfqlgroup filename
-pergroupid <NUMBER> The maximum number of sequences to keep per group. > lspdb DB -bfqlgroup '{[OS]},{[-L]},{[OS]}' -pergroupid 1
-maxgroup <NUMBER> The maximum number of groups to keep. > lspdb DB -bfqlgroup '{[OS]},{[-L]},{[OS]}' -pergroupid 1 -maxgroup 5
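As a rough mental model of how -pergroupid and -maxgroup trim grouped output, the sketch below caps the number of members per group and the number of groups. It is not lspdb's implementation: the function name cap_groups, the record layout, and the assumption that groups and members are kept in their incoming (already sorted) order are all hypothetical.

```python
from collections import OrderedDict


def cap_groups(records, key, pergroupid=0, maxgroup=0):
    """Keep at most `pergroupid` members per group and `maxgroup` groups.

    Assumption: 0 means "no limit", which is only inferred from the
    maxgroup=0 value printed in the %NBGROUPS example when -maxgroup
    is not given. Records are assumed to arrive already sorted.
    """
    groups = OrderedDict()
    for rec in records:
        groups.setdefault(key(rec), []).append(rec)

    kept = []
    for i, (name, members) in enumerate(groups.items()):
        if maxgroup and i >= maxgroup:
            break  # group limit reached
        kept.append((name, members[:pergroupid] if pergroupid else members))
    return kept


# Hypothetical (organism, length) records, sorted by organism then -length.
records = [("A", 3), ("A", 2), ("A", 1), ("B", 5), ("C", 4)]
capped = cap_groups(records, key=lambda r: r[0], pergroupid=2, maxgroup=2)
```

Here capped holds two groups, the first trimmed to its two leading members.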
-maxkeep <NUMBER> The maximum number of sequences to keep. > lspdb DB -bfql 'OS="homo sapiens"' -maxkeep 5
-select <SELECT STATEMENT> Old selection implementation. Use bfql instead.
-range <RANGE EXPRESSION> Applied before selection to restrict the sequences used. > lspdb DB -range '1:30,42'
-frange <FILE> Uses the ranges in <FILE> as above. First line of <FILE> must contain "Biofacet Range TXT v2.5". Second line contains the ranges. > cat myrange
Biofacet Range TXT v2.5
1,12,100:200
> lspdb DB -frange myrange
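A range file like the one above can be generated from a script. The following is a minimal sketch: the helper name range_file_text and its input convention (integers for single indexes, 2-tuples for start:stop ranges) are assumptions made for illustration. Only the header line and the comma-separated second line come from this page.

```python
def range_file_text(ranges):
    """Return the two-line content of a -frange file.

    `ranges` mixes single indexes (int) and (start, stop) tuples,
    e.g. [1, 12, (100, 200)] -> "1,12,100:200".
    """
    parts = []
    for r in ranges:
        if isinstance(r, tuple):
            start, stop = r
            parts.append(f"{start}:{stop}")
        else:
            parts.append(str(r))
    # First line is the mandatory header, second line holds the ranges.
    return "Biofacet Range TXT v2.5\n" + ",".join(parts) + "\n"


text = range_file_text([1, 12, (100, 200)])
```

Writing text to a file named myrange gives the file used in the -frange example.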
-memctrl <MEMCTRL EXPRESSION> Internal optimisation. Advanced usage only. Takes a comma-separated list of the following key=value expressions:
blk_sized=<val> Memory block size limit in MB. Default 64M
blk_seqnd=<val> Number of sequences
blk_ctbd=<val> Number of sequences in CTB file
> lspdb -memctrl 'blk_sized=128'
-dump <FILENAME> Creates a biofacet sequence database. > lspdb DB -bfql 'OS="homo sapiens"' -dump MyHumanDB
-vdump <FILENAME> As -dump above, except that it keeps the same virtual db logic as the original. > lspdb DB -bfql 'OS="homo sapiens"' -vdump MyHumanDB
-shadow <FILENAME> As -dump, except that only sequences are dumped. This allows the creation of files for cluster nodes. > lspdb DB -bfql 'OS="homo sapiens"' -shadow MyHumanDB_withannot
-split <NUMBER> Splits the original db into <NUMBER> biofacet dbs. Each file is named after the original file, followed by _splt_ and a number from 0 to <NUMBER>-1. A virtual biofacet database named after the original file, followed by _splt.ind, is also created. > lspdb DB -count
db nbseqs = 1000
> lspdb DB -split 10
> ls DB_splt*
-chunk <NUMBER1>
-overlap <NUMBER2>
Splits genome-size sequences into chunks of <NUMBER1> bases with <NUMBER2> overlapping bases. The output should be piped into lspbank. > lspdb DB -chunk 100000 -overlap 50 | lspbank -nuc -F MY_NEW_DB
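To get a feel for how many chunks a given -chunk/-overlap pair produces, the arithmetic can be sketched as below. This only illustrates one plausible reading of the option and does not reproduce lspdb's exact boundaries: it assumes 1-based inclusive coordinates and that each chunk starts <NUMBER1> minus <NUMBER2> bases after the previous one. The function name chunk_coords is hypothetical.

```python
def chunk_coords(seq_len, chunk, overlap):
    """Return 1-based inclusive (start, stop) pairs covering seq_len bases.

    Assumption: consecutive chunks share exactly `overlap` bases.
    """
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap < chunk")
    if seq_len <= 0:
        return []

    coords = []
    start = 1
    while True:
        stop = min(start + chunk - 1, seq_len)
        coords.append((start, stop))
        if stop == seq_len:
            break  # the last chunk reaches the end of the sequence
        start += chunk - overlap
    return coords


coords = chunk_coords(250, 100, 50)
```

With these toy numbers, coords is [(1, 100), (51, 150), (101, 200), (151, 250)].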
-format <FORMATNAME> <FORMATNAME> can be FASTA or DB2. Will output records in FASTA or DB2 format. > lspdb DB -format FASTA
-frame <FRAME EXPRESSION> Can only be used with -format (above). Will translate the sequence where <FRAME EXPRESSION> can be one or more (comma separated) of "for rev all top bot +1 +2 +3 -1 -2 -3", meaning forward, reverse, all 6 frames, top 3 frames, bottom 3 frames and frame +1,... -3. > lspdb DB -format FASTA -frame 'for,rev'
> lspdb DB -format FASTA -frame all |lspbank -prot -F DB_TRANSLATED -T fasta
-crc Obsolete. Use %CRC with printf (see below)
-noseq Does not output the sequence part > lspdb DB -noseq
-motif Obsolete.
-lspid Obsolete
-printf <FORMAT STRING> This is the most practical way to format biofacet output. A separate manual demonstrates its possibilities. The accessors and formatters are listed below. > lspdb DB -printf '%H#OS\n%VOID'
Numerical formatters It is possible to apply C-like numerical formatters in a printf statement. They are written just after the printf variable as .[%10d]. For instance %N displays the sequence index. To display it using 10 characters, simply do -printf '%N.[%10d]\n%VOID' which will display:
         1
         2
...
        10
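The .[%10d] formatter follows the usual C printf width rules, which Python's % operator shares. The short sketch below reproduces the right-aligned output shown above. The helper name render_index is hypothetical, and it only imitates the formatting step, not lspdb itself.

```python
def render_index(n, spec="%10d"):
    """Format an integer with a C-style conversion such as %10d."""
    return spec % n


# Sequence indexes 1, 2 and 10, padded on the left to 10 characters.
lines = [render_index(n) for n in (1, 2, 10)]
for line in lines:
    print(line)
```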
Extra variables (%1 %2 ...) Some printf variables can use extra variables called %1, %2, ... They are described in the relevant rows below. To use these extra variables, enclose them in [] placed after the % sign and before the variable name, after the variable name, or both. For instance -printf '%[%1 ]S[%2 %3\n%VOID]', where %S is the variable.
Modifiers Noted m1,m2,..., modifiers are available for %S and %H. They allow truncation and "chunking" of the output and are used after the dot sign. For instance, to display 10 residues at a time (then a newline) from residue 100 to residue 200, use %S.10.100.200[\n%VOID]
 %VOID Empty reference. Since printf never writes the last constant string, use %VOID to change this behaviour. For instance the first command on the right does not write the last new line, the second command does. > lspdb DB -printf '%H#ID %H#OS\n'

> lspdb DB -printf '%H#ID %H#OS\n%VOID'
 %PRE Preamble separator. Anything before %PRE is done once at the very beginning. (Same as BEGIN in awk) > lspdb DB -bfql 'L>1000' -printf 'Number of sequences selected=%NBSEQS\nout of a total of %ONBSEQS\n%PRE'
 %POST Postamble separator. Anything after %POST is done once at the very end. (Same as END in awk) > lspdb DB -bfql 'L>1000' -printf '%H#ID\t%L\n%VOID%POSTNumber of sequences selected=%NBSEQS\nout of a total of %ONBSEQS\n'
 %DATE Current date > lspdb DB -printf '%DATE\n%PRE'
 %VER Biofacet version > lspdb DB -printf '%VER\n%PRE'
 %CRC CRC string computed for the sequence. For two given sequences of identical length, a different CRC means that the two sequences are different. If the CRCs are the same, there is still a small chance that the two sequences are in fact different. This can be used to remove redundancy in databases. > lspdb DB -range 1:10 -printf '%H#ID %CRC %L\n%VOID'
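The redundancy check described for %CRC can be sketched as a post-processing step on the "ID CRC length" lines printed by the example command. This is an illustration only: the function name unique_by_crc and the sample IDs and CRC strings are made up, and it assumes that none of the three fields contains whitespace.

```python
def unique_by_crc(lines):
    """Keep the first ID seen for each (CRC, length) pair.

    Sequences sharing both CRC and length are treated as duplicates,
    accepting the small chance of a CRC collision noted above.
    """
    seen = set()
    kept = []
    for line in lines:
        seq_id, crc, length = line.split()
        key = (crc, int(length))
        if key not in seen:
            seen.add(key)
            kept.append(seq_id)
    return kept


# Hypothetical output lines; real %CRC strings may look different.
sample = [
    "SEQ1 1A2B3C 120",
    "SEQ2 1A2B3C 120",  # same CRC and length as SEQ1: treated as redundant
    "SEQ3 1A2B3C 98",   # same CRC but different length: kept
    "SEQ4 FFEE00 120",
]
unique_ids = unique_by_crc(sample)
```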
 %NGC Genetic code index. This is an internal index used to identify which genetic code is used. 1 is the standard genetic code.
 %GC Genetic code name. Usually "Standard".
 %NDBTYPE sequence db type index. Internal index.
 %DBTYPE sequence db type. NUC for nucleotide, NUCCS for color space, PRO for protein. > lspdb DB -range 1 -printf '%DBTYPE\n%VOID'
 %NBSEL Advanced master/slave usage. This shows the number of sequences before "reduce" is applied
 %NBSEQS The number of sequences after all filtering. > lspdb DB -bfql 'L>1000' -printf 'Number of sequences selected=%NBSEQS\nout of a total of %ONBSEQS\n%PRE'
 %ONBSEQS The number of sequences before any filtering.
 %NBMASTERS Advanced master/slave usage. master count
 %ONBMASTERS Advanced master/slave usage. original master count
 %NBRESIDS Number of residues before any filtering is applied. > lspdb DB -bfql 'L>1000' -printf 'Number of sequences selected=%NBSEQS\nout of a total of %ONBSEQS and %NBRESIDS residues\n%PRE'
 %MAXLEN The max sequence length in the database before any filtering. > lspdb DB -bfql 'L>1000' -printf 'Number of sequences selected=%NBSEQS\nout of a total of %ONBSEQS and %NBRESIDS residues\nThe longest sequence in the database is %MAXLEN residue long\n%PRE'
 %NBANNOTS The number of annotations in the database. > lspdb DB -range 1 -printf '%NBANNOTS\n%VOID'
%ANNOTS
%1 .. %5
This displays the annotation names and some annotation attributes.
%1 is the attribute (string) and is one of (normal, shared, master).
%2 is the class (string) and is generally "native".
%3 is the type (string) and is one of (int, string, control).
%4 is the index type (string) and is one of (notindexed, btree, hash).
%5 is an advanced parameter called compound annotation fields (list) and is often empty.
> lspdb DB -range 1 -printf '%ANNOTS[ %3 %4\n]\n%VOID'
ID string hash
AC string hash
SV string hash
GI string notindexed
GN string hash
SY string notindexed
D6 int btree
%GNAME
%1 .. %12
Displays the sequence db generic name. Attributes %1 to %12 show respectively:
generic name virtual (string);
generic name physical (string);
current file number (unsigned int);
number of files (unsigned int);
number of sequences in virtual database (unsigned int);
number of sequences in current physical database (unsigned int);
original number of sequences in virtual database (unsigned int);
original number of sequences in current physical database (unsigned int);
file name of current physical database (string);
crypto ident (string);
file offset (unsigned int);
c_len (unsigned int)
 %DBFILE Displays the sequence db file name. Useful also with %DBPATH. See below and example. > lspdb DB -printf '%DBPATH/%DBFILE\n%VOID%PRE'
/opt/MyDBs/DB
 %DBPATH sequence db file path. See example above.
 %STATS sequence db stats. Not yet implemented.
%GROUP (also ^GROUP and $GROUP)
%1 .. %7
The group variable is special in that it can start with ^, % or $. ^GROUP applies at the beginning of a group, $GROUP at the end, and %GROUP to each element in the group. ^GROUP and $GROUP are useful for producing XML, for instance, where a tag is opened before a group and closed after it. Everything below applies to all 3 GROUP variants.
By default %GROUP prints the group index (unsigned int). To suppress it, use %GROUP.[]
Everything to be printed must be placed between [], either after the %, after GROUP, or after GROUP.[].
The extra variables %1 to %7 print the following information:
%1 : The group index (unsigned int)
%2: The group first element (unsigned int)
%3 : The group last element (unsigned int)
%4: Number of elements (unsigned int)
%5: Number of groups (unsigned int)
%6: Number of members before filtering (unsigned int)
%7: Number of groups before filtering (unsigned int)
> lspdb DB -bfqlgroup '{[OS]},{[-L]},{[OS]}' -pergroupid 3 -printf '^GROUP.[][---
%H#OS %4 of %6\n%VOID]%GROUP.[][ %H#ID %L\n%VOID]%POST---
'

---
Borrelia burgdorferi 3 of 56
NP_862626 429
NP_862652 406
YP_783878 371
---
Buchnera aphidicola 3 of 9
NP_047187 516
NP_047189 466
NP_047188 363
---
Leptospirillum ferrooxidans 3 of 10
YP_220399 133
YP_220386 127
YP_220385 100
---
%NBGROUPS
%1 .. %6
Prints the number of groups by default. Can be switched off with .[] (as for %GROUP).
%1 shows the number of groups (unsigned int);
%2, the number of members (unsigned int);
%3, the number of groups before filtering (unsigned int);
%4, the number of members before filtering (unsigned int);
%5, the maxgroup (unsigned int);
and %6, pergroupid (unsigned int)
> lspdb DB -bfqlgroup '{[OS]},{[-L]},{[OS]}' -pergroupid 3 \
-printf '%NBGROUPS.[][%1 groups, %2 members, %3 groups before filtering, %4 members before filtering, maxgroup=%5 pergroupid=%6\n%VOID]%PRE'

5 groups, 14 members, 5 groups before filtering, 100 members before filtering, maxgroup=0 pergroupid=3
 %N Prints the sequence index. Warning: this index is assigned after filters are applied, so %ON (see below) should generally be used instead. The same example is shown run with %N here and with %ON in the next entry. > lspdb DB -range 200:220 -bfql 'L>100' -sort '-L' -printf '%N %L\n%VOID'
1 730
2 603
3 502
4 473
5 422
6 358
7 352
8 345
9 342
10 331
11 266
12 252
13 206
14 143
15 142
 %ON Prints the original sequence index, before any kind of filtering is applied. This is a number. > lspdb DB -range 200:220 -bfql 'L>100' -sort '-L' -printf '%ON %L\n%VOID'
206 730
200 603
217 502
207 473
211 422
216 358
214 352
220 345
210 342
212 331
203 266
202 252
204 206
201 143
215 142
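The difference between %N and %ON in the two listings above can be mimicked in a few lines. In this sketch the original index is fixed before any filtering, while the post-filter index is assigned only after the range restriction, the length selection and the sort. The function name number_records and the toy lengths are hypothetical. It models the behaviour described here, not lspdb's internals.

```python
def number_records(lengths, first, last, min_len):
    """Return (N, ON, L) rows for a range + length filter sorted by -L.

    ON is the 1-based position in the unfiltered database;
    N is the 1-based position in the final, filtered and sorted output.
    """
    selected = [
        (on, length)
        for on, length in enumerate(lengths, start=1)
        if first <= on <= last and length > min_len
    ]
    selected.sort(key=lambda row: -row[1])  # longest first, like -sort '-L'
    return [(n, on, length) for n, (on, length) in enumerate(selected, start=1)]


# Toy database of five sequence lengths (hypothetical values).
rows = number_records([50, 300, 120, 90, 500], first=2, last=5, min_len=100)
```

The N column always runs 1, 2, 3, ..., while the ON column keeps pointing at the original records, which is why %ON is the safer choice for cross-referencing.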
%H
%1 .. %6
m1 and m2
This displays the sequence annotations. Generally, %H is used with a # qualifier. Simply typing %H#ID will display the ID field. The %1 to %6, as well as the m1 and m2 are very rarely used.
%1 is the annotation name (string)
%2, the annotation index (int)
%3, the annotation name if exists (string)
%4, the annotation index if exists (int)
%5, master/slave (string)
%6 the master ord (int)
m1 and m2 are the start and stop positions. For example, to print the first two letters of the ID, type %H#ID.1.2
> lspdb DB -printf '%H#ID %H#DE\n%VOID'
 %L Prints the sequence length. > lspdb DB -printf '%L,%POST\n%VOID'
730,603,502,473,422,358
%S
%1 .. %3
m1, m2 and m3
 %S is used to display the sequence.
%1 is the first position in chunk (unsigned int), %2 is the last position in chunk (unsigned int) and %3 is the chunk length (unsigned int). Those are used to display coordinates. Integer formatting can be used as in C.
m1 is the chunk length limit. Use %S.10 to display the sequence by chunks of 10 residues. 0 means in one chunk.
m2 and m3 are the start and stop positions. Sequence coordinates start at 1. Use negative numbers to start/stop from the end.  %S.0.1.-1 is equivalent to %S
> lspdb DB -range 1 -printf '%S\n%VOID'

> lspdb DB -range 1 -printf '%S.100[\n%VOID]'

> lspdb DB -range 1 -printf '%[%1.[%5d] ]S.5.-10.-1[ %2.[%5d] (%3 base long)%VOID\n]\n%VOID'
448 TAAAA 452 (5 base long)
453 ATGGG 457 (5 base long)
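The m1/m2/m3 rules for %S can be checked against the last example above with a small model. The sketch assumes that a negative position -k maps to length - k + 1, which is consistent with -10 and -1 giving 448 and 457 on a 457-residue sequence. Clamping of out-of-range positions is a further assumption. The function name seq_chunks and the test sequence are hypothetical.

```python
def seq_chunks(seq, m1=0, m2=1, m3=-1):
    """Return (first, last, length, residues) rows, like %1, %2, %3 and %S.

    m1: chunk length limit (0 = one chunk); m2/m3: 1-based start/stop,
    negative values count from the end of the sequence.
    """
    n = len(seq)
    start = m2 if m2 > 0 else n + m2 + 1
    stop = m3 if m3 > 0 else n + m3 + 1
    start, stop = max(start, 1), min(stop, n)  # assumed clamping
    size = m1 if m1 > 0 else max(stop - start + 1, 1)

    rows = []
    pos = start
    while pos <= stop:
        last = min(pos + size - 1, stop)
        rows.append((pos, last, last - pos + 1, seq[pos - 1:last]))
        pos = last + 1
    return rows


# Made-up 457-residue sequence ending with the residues shown above.
seq = "A" * 447 + "TAAAAATGGG"
rows = seq_chunks(seq, 5, -10, -1)
```

rows then contains (448, 452, 5, 'TAAAA') and (453, 457, 5, 'ATGGG'), matching the coordinates in the example output.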
 %IDNAME Value of first annotation. > lspdb DB -range 1:10 -printf '%IDNAME\n%VOID'
-fprintf <FILE> <FILE> contains a printf statement to use > lspdb DB -fprintf myprintf.txt
-win_start <NUMBER> Used for pagination.
-win_stop <NUMBER> See above
-full_sel Forces complete selection count for win_start/win_stop. (Advanced use only)
-xml Outputs XML
-xml_start <NUMBER> Same as win_start, for xml output
-xml_stop <NUMBER> Same as win_stop, for xml output
-xml_full_sel Same as full_sel, for xml output
-xml_seqpos_start <NUMBER> first residue to display in xml
-xml_seqpos_stop <NUMBER> last residue to display in xml
-xml_all_but <STRING> XML display options. Advanced use only.
-count Returns the number of sequences (usually used in combination with a bfql selection). Using -count is incompatible with other output switches such as -printf. > lspdb DB -range 1:100 -count
db nbseqs = 100
db nbres = 20570
db maxlen = 634
db fields = ID AC SV

> lspdb DB -range 1:100 -bfql 'L < 500' -count
db nbseqs = 98
db nbres = 19420
db maxlen = 487
db fields = ID AC SV
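When -count output is consumed by a script, the "db key = value" lines shown above are easy to turn into a dictionary. The parser below is a minimal sketch based only on the four lines visible in these examples. The function name parse_count is hypothetical, and other keys or layouts that lspdb may print are not handled specially.

```python
def parse_count(text):
    """Parse 'db <key> = <value>' lines into a dict.

    Numeric values become int; the 'fields' value becomes a list of names.
    """
    info = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("db "):
            key = key[3:].strip()
        if key == "fields":
            info[key] = value.split()
        elif value.isdigit():
            info[key] = int(value)
        else:
            info[key] = value
    return info


sample = """db nbseqs = 100
db nbres = 20570
db maxlen = 634
db fields = ID AC SV
"""
counts = parse_count(sample)
```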
-ocoll <COLLECTION> NEW. Not yet entirely defined.
-esmdb <ESM DB> Advanced use only.
-dsptmp <TMPDATASPACE> Advanced use only.
expand|count> Advanced use only.
Indexing options are described in another manual.
-indexinit
-indexcreate <annotation list>
-indexdelete <annotation list>
-indexinfo <annotation list>
-indexsinfo
-indexlist <annotation list>
-indexhisto <annotation list>
-wordlist <annotation list>
-indexstats <annotation list>
-indexselect <selection expression>
-indexcontrol <index control>
-bfql Usage is described in another manual. The BioFacet Query Language is used for creating queries. Use single quotes to enclose your bfql. > lspdb DB -bfql 'id="NC_*"'
-fbfql <BFQLFILE> Reads the <BFQLFILE> as a bfql query.
-bfqlcontrol <CONTROL LIST> Advanced usage.

