Welcome to the GenomeQuest Documentation Wiki
Core System Concepts
Quick links to each concept are below, but we recommend reading this entire document linearly, since each section builds on earlier concepts.
- The GQ Engine
- Sequence and Annotation Databases
- Result Databases
- The Hotdrive
- Local
- The Metadata Layer
- Userdata
- Workflow Architecture
- Plugin Architecture
- Using The Array
- GenomeCast
The GQ Engine
The GQ Engine is the underlying technology that enables the entire system to work. It consists of
- a specific representation for sequence databases which is compact, binary, and includes annotation
- a specific representation for comparisons between sequence databases, again, compact, binary, and searchable
- a set of commonly used sequence comparison algorithms
- a programmatic query language, BFQL, best described as "PL/SQL" for sequence and result databases
- innate knowledge of compute clusters
The GQ Engine is implemented as a series of UNIX binaries available at the command line. The complete set of binaries is below. Note that full documentation for these modules is available in the GQ Engine Reference Manual.
| name | inputs | outputs | function |
|---|---|---|---|
| lspbank | text file of sequences and annotation | GQ Engine seqdb | converts an input file into a GQ Engine seqdb |
| lspdb | seqdb, query criteria | a set of records that meet query criteria | queries a sequence database for some properties |
| lspvbank | set of seqdbs | GQ Engine virtual seqdb | creates a virtual seqdb from the set of inputs. The resulting virtual database thereafter works exactly like a normal seqdb. |
| lspcalc | seqdb1, seqdb2, algorithm | GQ Engine resdb | compares seqdb1 to seqdb2 using algorithm, produces resdb |
| lspmul | seqdb1, seqdb2, algorithm | GQ Engine resdb | compares seqdb1 to seqdb2 using a two-phased algorithm: heuristic word matching followed by dynamic programming alignment, produces resdb |
| lspres | resdb, query criteria | a set of records that meet query criteria | queries a result database for some properties, returns those results that meet query conditions |
| lspvres | set of resdbs | GQ Engine virtual resdb | creates a virtual resdb from the set of inputs. The resulting virtual result database thereafter works exactly like a normal resdb. |
| lspextend | resdb | an extended resdb | computes additional properties of a result database to allow for querying on alignment properties |
| lspcalc.TH | seqdb1, seqdb2, algorithm | GQ Engine resdb | exactly like lspcalc but is multi-threaded |
| lspmul.TH | seqdb1, seqdb2, algorithm | GQ Engine resdb | exactly like lspmul but is multi-threaded |
| lspcalc.THA | seqdb1, seqdb2, algorithm | GQ Engine resdb | a fusion of lspcalc.TH and lspmul.TH that is aware of the entire compute cluster |
| lspdb.H | seqdb, query criteria | a set of records that meet query criteria | queries a sequence database for some properties, is aware of the GQ Hotdrive |
Sequence and Annotation Databases
A GenomeQuest Sequence Database is a compact representation of an arbitrarily large number of sequences and associated annotation. It is stored in a binary format in a UNIX filesystem. It is not meant to be edited directly or to be viewed directly as its representation is abstracted from the user.
Sequence types
GenomeQuest supports the following sequence types:
- Nucleic
- Colorspace nucleic
- Nucleic pattern
- Peptide
- Peptide pattern
Sequence annotation
Each sequence in the database can also have corresponding annotation associated with it. This annotation is not positional in nature (see Result Databases for that), but rather global annotation on the sequence itself. GenomeQuest stores annotation values in fields which are indexed by two-character keys. Example annotation fields might be:
- ID the human-readable sequence identifier for the sequence, e.g. "NM_018189.1"
- DE the description of the sequence, e.g., "Homo sapiens hypothetical protein FLJ10713 (FLJ10713), MRNA"
Any two-character string is a valid annotation field in a GenomeQuest sequence database. At the GenomeQuest engine level, there is no semantic meaning attached to any such annotation field. You could easily use the ID field to store something else, or put the sequence identifier in a field called '4R' if you preferred.
The GenomeQuest product itself does in fact apply additional semantic meaning to certain fields, even though the GenomeQuest Engine doesn't care. The list of fields and their semantic associations is available here:
Note that this page returns raw text, machine readable. To view it formatted, choose "View Source" from your web browser.
Making and Working with a Sequence Database
Multiple input formats can be parsed by the GQ Engine to make a sequence database. Typically the most common formats are:
- EMBL-like (a.k.a. DB2 - this is the native format)
- FASTA
EMBL-like format allows for the addition of arbitrary annotation fields, whereas FASTA only allows for a sequence ID and the sequence itself. Below is a sample EMBL-like file:
ID SNP-18
DE SNP AC177813.2 position 1439-1439
GP AC177813.2
OS Zea mays
KW G -> t
CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGtTTTTTCGATCCCTACAAAGGAAAGGATGAGTACGGGATCTT
//
ID SNP-35
DE SNP AC177813.2 position 142409-142409
GP AC177813.2
OO MAIZE
KW G -> a
CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCaTCTGGGACGTCTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//
As you can see, the database specifies two sequences, each representing a single nucleotide polymorphism in the middle of the sequence. The annotation fields being used are:
| field | meaning |
|---|---|
| ID | An identifier associated with the SNP |
| DE | A human readable description |
| GP | The genomic product on which the SNP exists |
| OO | The human-readable organism name |
| KW | Any keywords |
Again, at the GenomeQuest engine level, there is no semantic meaning associated with these annotation fields, but at the level of the GenomeQuest product (the web application), these fields tend to have specific meanings.
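To make the structure of the EMBL-like sample concrete, here is a minimal Python sketch of a reader for it. It is not the parser used by lspbank. It assumes, based only on the sample above, that a field line is a two-character key followed by a space and a value, that a line containing only `//` ends a record, and that every other non-empty line is sequence data. The function name `parse_embl_like` is hypothetical.

```python
SAMPLE = """\
ID SNP-18
DE SNP AC177813.2 position 1439-1439
GP AC177813.2
OS Zea mays
KW G -> t
CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGtTTTTTCGATCCCTACAAAGGAAAGGATGAGTACGGGATCTT
//
ID SNP-35
DE SNP AC177813.2 position 142409-142409
GP AC177813.2
OO MAIZE
KW G -> a
CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCaTCTGGGACGTCTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//
"""


def parse_embl_like(text):
    """Parse EMBL-like text into a list of {"fields": dict, "sequence": str}."""
    records = []
    fields, chunks = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # End of record: store it and start a fresh one.
            records.append({"fields": fields, "sequence": "".join(chunks)})
            fields, chunks = {}, []
        elif len(line) == 2 or (len(line) > 2 and line[2] == " "):
            # Assumed field line: two-character key, then the value.
            key, value = line[:2], line[3:].strip()
            # Repeated keys are joined with a space (an assumption).
            fields[key] = f"{fields[key]} {value}" if key in fields else value
        else:
            # Anything else is treated as sequence data.
            chunks.append(line.replace(" ", ""))
    return records


records = parse_embl_like(SAMPLE)
for record in records:
    print(record["fields"]["ID"], len(record["sequence"]))
```

Because the engine attaches no meaning to the keys, the sketch stores every two-character key it sees without validating it against a fixed list.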
Now, let's assume that this file of "EMBL-like" sequence data is available in my current working directory and is named "data.embl". To make a GenomeQuest engine database, observe the following interaction on the UNIX command line with the GQ Engine installed:
runner@linnaeus:~/doc>ls -l
total 4
-rw-r--r-- 1 runner geneit 344 Jan 5 13:35 data.embl
runner@linnaeus:~/doc>lspbank -dbtype NUC -T EMBL -F myseqdb data.embl
data.embl : sequences 2, residues 162, max seq length 81
runner@linnaeus:~/doc>ls -l
total 16
-rw-r--r-- 1 runner geneit 344 Jan 5 13:35 data.embl
-rw-r--r-- 1 runner geneit 40 Jan 5 13:56 myseqdb.ctb
-rw-r--r-- 1 runner geneit 330 Jan 5 13:56 myseqdb.ind
-rw-r--r-- 1 runner geneit 168 Jan 5 13:56 myseqdb.seq
runner@linnaeus:~/doc>
The command
lspbank -dbtype NUC -T EMBL -F myseqdb data.embl
converts the text file called "data.embl" into a GQ Engine database called "myseqdb". Notice that the physical implementation of the logical myseqdb database is in fact three different files:
-rw-r--r-- 1 runner geneit 40 Jan 5 13:56 myseqdb.ctb
-rw-r--r-- 1 runner geneit 330 Jan 5 13:56 myseqdb.ind
-rw-r--r-- 1 runner geneit 168 Jan 5 13:56 myseqdb.seq
If you now wanted to query that database, you may do so using the command "lspdb":
runner@linnaeus:~/doc>lspdb myseqdb
ID SNP-18
AC
OS Zea mays
DE SNP AC177813.2 position 1439-1439
KW G -> t
CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGTTTTTTCGATC
CCTACAAAGGAAAGGATGAGTACGGGATCTT
//
ID SNP-35
AC
OS Zea mays
DE SNP AC177813.2 position 142409-142409
KW G -> a
CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCATCTGGGACGT
CTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//
Notice first that you refer to the logical name "myseqdb" rather than to the physical names of any of the files. The lspdb command behaves like the UNIX command "cat," except that rather than displaying a text file, it displays the contents of a binary database as text on STDOUT.
There is much power in the lspdb command alone - see the GenomeQuest Engine Primer for much more, or try
lspdb -help
A few undocumented examples to give you some flavor.
Output a GQ Engine Sequence Database as a FASTA file
% lspdb myseqdb -printf '>%H#ID\n%S\n%VOID'
>SNP-18
CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGTTTTTTCGATCCCTACAAAGGAAAGGATGAGTACGGGATCTT
>SNP-35
CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCATCTGGGACGTCTTGAGGGAAGCCGATGATGTCTTGAAGGCT
Returning the record for sequence with ID SNP-35
% lspdb myseqdb -bfql 'ID="SNP-35"'
ID SNP-35
AC
OS Zea mays
DE SNP AC177813.2 position 142409-142409
KW G -> a
CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCATCTGGGACGT
CTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//
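If you call lspdb from a script rather than from an interactive shell, the nested quoting in the BFQL expression is easier to get right when the arguments are passed as a list. The Python sketch below builds exactly the command line shown in the ID lookup example and uses only the `-bfql` option that appears there. The helper names `lspdb_id_query` and `fetch_record` are hypothetical, and `fetch_record` only works on a machine where the GQ Engine is installed and on the PATH.

```python
import subprocess


def lspdb_id_query(db_name, seq_id):
    """Build the argv for: lspdb <db_name> -bfql 'ID="<seq_id>"'."""
    if '"' in seq_id:
        # Keep the sketch simple: refuse values that would break the quoting.
        raise ValueError("seq_id must not contain a double quote")
    # Each list element is one argument, so no shell quoting is needed.
    return ["lspdb", db_name, "-bfql", f'ID="{seq_id}"']


def fetch_record(db_name, seq_id):
    """Run lspdb and return its text output (needs the GQ Engine installed)."""
    result = subprocess.run(
        lspdb_id_query(db_name, seq_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


argv = lspdb_id_query("myseqdb", "SNP-35")
print(argv)
```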
Publishing a Sequence Database to the GenomeQuest Front-End
In order to publish a GenomeQuest Engine sequence database to the GenomeQuest web product, you must use the GenomeQuest tool called admin_db.pl. This tool is part of the GenomeQuest Content Manager suite, and comes as part of the GenomeQuest installation. It encapsulates the functionality of lspbank as described above, along with metadata about the access control related to the sequence database. Full documentation of admin_db.pl is available on the command line, as well as in the GenomeQuest Content Manager Reference Manual.
% $GQ_INSTALL/data/GQdata/external_apps/gqadmindb/admin_db.pl
Please specify action(--action).
Usage:
admin_db.pl --action <convert|index|configure|push|activate|add|delete|update|list|showtree|showfields|lookupfields>
--db_file <input_db_file_or_dir>
--gq_base_dir <GenomeQuest_installation_dir>
--db_id <GQ_database_definition_id, e.g., "LOCAL_MYDB">
--db_format <EMBL+, FASTA, FASTQ>
--map <e.g. "ID|AC|DE">
--db_type <NUC, PRT, PRO(same as PRT), NUCCS>
--db_name <database_name>
--index_fields <e.g. "ID,AC,DE">
--target <hotdrive, local, all>
--release <release_number>
--gq_fields <e.g., "ID,AC,GI,OS,FT...">
--norm_pn
--pattern <e.g., "number*fragment">
--owner <login name of a GenomeQuest user>
--access <the access level for the database. e.g. "private", "public", "group">
Use -h/--help to see detailed instructions for each option.
The GenomeQuest Sequence Database Browser
Once a sequence database is published into the GenomeQuest system, it is available inside the GQ product. Simply go to the MyGQ page, and view "My Uploaded Databases." Your database should be visible there for browsing. Note that all of the annotation fields you provided will be available for querying, sorting, and grouping by the user. In this way you can publish meaningful annotated sequence databases for your users to browse. Examples:
- a sequence database where each sequence represents a SNP and flanking sequence. Annotation fields include positional information about the SNP, its quality, and whether it is involved in a change in an encoded protein
- a sequence database where each sequence is a gene. Annotation fields include expression levels of the gene in a series of tissues
- a sequence database where each sequence is a target associated with a drug. Annotation fields include the name of the drug and other information about the disease / phenotype.
Any sequence database published is also available via the GenomeQuest URL API. For instance, if you publish a sequence database via admin_db.pl and give it the db_id as follows:
db_id=MY_DB
then you should be able to access it via the URL:
Much more on the URL API is available here.
Result Databases
A Result Database is a very specific concept in GenomeQuest. By Result Database, we mean a particular GenomeQuest Engine database format which represents the results of a sequence comparison between two GenomeQuest Engine Sequence Databases.
A Result Database is represented as a set of files in the UNIX filesystem, much as a Sequence Database is.
It is produced by a sequence comparison (or by converting a SAM/BAM file).
There are three core GenomeQuest Engine commands that produce a Result Database:
- lspcalc: takes two Sequence Databases and a sequence comparison algorithm and compares every sequence in database 1 against every sequence in database 2, using the algorithm specified
- lspmul: same as lspcalc, however it first performs a word-based matching to determine which pairs of sequences are likely to have an alignment, and then only runs the comparison algorithm on those pairs
- lspresbank: takes a SAM/BAM file and produces a result database.
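The difference between the exhaustive approach (lspcalc) and the two-phase approach (lspmul) can be illustrated with a small, self-contained Python sketch. It is a conceptual illustration only and not the GQ Engine implementation: the word size, the shared-word threshold, and the toy Smith-Waterman scoring parameters are all assumptions chosen for readability.

```python
def words(seq, k):
    """Return the set of all length-k words (k-mers) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (toy scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def compare_all(queries, subjects):
    """Exhaustive: score every query against every subject."""
    return {
        (q, s): smith_waterman_score(qseq, sseq)
        for q, qseq in queries.items()
        for s, sseq in subjects.items()
    }


def compare_prefiltered(queries, subjects, k=4, min_shared=1):
    """Two-phase: align only pairs that share at least min_shared words."""
    subject_words = {s: words(sseq, k) for s, sseq in subjects.items()}
    results = {}
    for q, qseq in queries.items():
        query_words = words(qseq, k)
        for s, swords in subject_words.items():
            if len(query_words & swords) >= min_shared:
                results[(q, s)] = smith_waterman_score(qseq, subjects[s])
    return results


QUERIES = {"q1": "ACGTACGT"}
SUBJECTS = {"s1": "TTACGTACGTTT", "s2": "GGGGGGGG"}

exhaustive = compare_all(QUERIES, SUBJECTS)
prefiltered = compare_prefiltered(QUERIES, SUBJECTS)
print(len(exhaustive), "pairs aligned exhaustively")
print(len(prefiltered), "pairs aligned after word matching")
```

The prefilter skips the expensive alignment for pairs with no word in common, at the cost of possibly missing weak alignments that share no exact word of length k.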
Example: lspcalc producing a result database
Let's download and convert the SwissProt database into a GQ Engine format, and then run a BLAST comparing all human proteins against all mouse proteins:
% lftp ftp.expasy.ch/.../swissprot/release_compressed/uniprot_sprot.dat.gz
% time gunzip -c uniprot_sprot.dat.gz | lspbank -T embl -prot -F sp
STDIN : sequences 512994, residues 180531504, max seq length 35213
real 0m24.024s
user 0m34.382s
sys 0m4.661s
Now the SwissProt database has been downloaded and converted into a GQ Engine format. The logical identifier of the database is sp and it resides in our current working directory. Next, let's run the comparison:
% lspcalc -M bl2 -mp BLOSUM62 // use the BLAST2 algorithm with the BLOSUM62 matrix
-db sp -bfql 'os="homo sapiens"' // Subject (reference)
-db sp -bfql 'os="mus musculus"' // Query
-o HsMm.res -best 5,{-RS} // output into a Result Database called "HsMm.res", keeping the best 5 hits for each query, sorted by BLAST score (descending order)
In this directory we now have a result file called "HsMm.res" which we can interact with:
% lspres HsMm.res | head
1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 S= 1605 E= 1.37637e-178 Bits= 622.854
3HIDH_HUMAN 1 336 3HIDH_MOUSE 1 335 S= 1562 E= 6.66903e-174 Bits= 606.29
5HT1A_HUMAN 1 421 5HT1A_MOUSE 1 421 S= 1901 E= 4.42149e-213 Bits= 736.873
5HT3B_HUMAN 6 438 5HT3B_MOUSE 1 434 S= 1702 E= 5.52321e-190 Bits= 660.218
5NT3L_HUMAN 1 291 5NT3L_MOUSE 1 291 S= 1390 E= 4.86686e-154 Bits= 540.035
Or perhaps to look at the alignments:
% lspres HsMm.res -a | head
1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 S= 1605 E= 1.37637e-178 Bits= 622.854
Q: 51 EKMLKFQHVIRNQFLQQISQQMQCVPPGDQQCTQTSRKRKKM-GYLLSQMVNFLWSNTVK 109
| | | + |+| |+| + +++ |+ + + + |+ +|+| | |
S: 31 EITLHLQQAMTEHFVQLTSRQGLSLE--ERRHTEAICEHEALLSRLICRMINLLQSGAAS 88
Q: 110 KLKFKVPLPCLDSRCGIKVGHQTLSPWQTGQSRPSLGGFEAALASCTLSKRGAGIYESYH 169
|+ +|||| ||| ++ | + | | | ||| + || || | |
S: 89 GLELQVPLPSEDSRGDVRYGQRAQLSGQP-DPVPQLSDCEAAFVNRDLSIRGIDISVFYQ 147
Of course, all of this is documented in detail in the GQ Engine Primer and associated manuals. And if you ever need more help, never forget:
lspcalc -h
or
lspres -h
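If you want to post-process the one-line-per-hit output of lspres shown in this example, a small parser is enough. The Python sketch below assumes the default layout is exactly the twelve whitespace-separated tokens seen in the sample output: an identifier with start and end positions, a second identifier with start and end positions, then `S=`, `E=`, and `Bits=` followed by their values. The class and field names are hypothetical, and the two sequences are deliberately labelled only "first" and "second".

```python
from dataclasses import dataclass


@dataclass
class Hit:
    first_id: str
    first_start: int
    first_end: int
    second_id: str
    second_start: int
    second_end: int
    score: int
    evalue: float
    bits: float


def parse_hit(line):
    """Parse one summary line of lspres output into a Hit."""
    t = line.split()
    # Check the assumed layout before converting anything.
    if len(t) != 12 or (t[6], t[8], t[10]) != ("S=", "E=", "Bits="):
        raise ValueError(f"unexpected lspres line: {line!r}")
    return Hit(
        t[0], int(t[1]), int(t[2]),
        t[3], int(t[4]), int(t[5]),
        int(t[7]), float(t[9]), float(t[11]),
    )


LINE = ("1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 "
        "S= 1605 E= 1.37637e-178 Bits= 622.854")
hit = parse_hit(LINE)
print(hit.first_id, hit.second_id, hit.score)
```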
Comparison Algorithms
A large number of comparison algorithms exist to create a Result Database:
- Blast (local alignment)
- Needleman & Wunsch (global alignment)
- Kerr (global alignment on the smallest sequence)
- Smith-Waterman
- String, pattern matching
In addition, a Result Database can be created from a SAM/BAM file, thereby allowing any external alignment program to produce GenomeQuest Engine Result Databases.
Additional Operations Relating to Sequence Comparisons
Aside from importing a SAM/BAM file to create a Result database, all of the sequence comparison approaches outlined above are augmented by allowing the user to perform dynamically any of the following operations:
- pre-filtering of query sequences based on word-matching of queries against subject sequences. This process is encapsulated in a GenomeQuest Engine command called lspmul.
- selection on the fly of query and subject sequences that have certain properties
- strategies to select and retain the "best" hit(s) among all hits
- automated dispatch of computation across a set of compute nodes
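As an illustration of the "best hit" idea in the list above, the following Python sketch keeps the N highest-scoring hits for each query, which is what the earlier lspcalc example describes for `-best 5,{-RS}`. It is a sketch of the concept only, not the engine's selection logic. The tuple layout `(query, subject, score)` and the function name `best_hits` are assumptions made for the example.

```python
import heapq
from collections import defaultdict


def best_hits(hits, n):
    """Keep the n highest-scoring hits per query, highest score first."""
    by_query = defaultdict(list)
    for query, subject, score in hits:
        by_query[query].append((query, subject, score))
    # nlargest returns each query's hits sorted by descending score.
    return {
        query: heapq.nlargest(n, group, key=lambda h: h[2])
        for query, group in by_query.items()
    }


HITS = [
    ("q1", "s1", 50),
    ("q1", "s2", 90),
    ("q1", "s3", 70),
    ("q2", "s1", 10),
]
top = best_hits(HITS, 2)
for query, group in top.items():
    print(query, group)
```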
Publishing a Result Database to the GenomeQuest Front-End
Unlike Sequence Databases, which can be published to the GenomeQuest front-end via admin_db.pl, Result Databases are typically products of GenomeQuest Workflows. See the documentation on How to Make Your Own Workflow for details on how to publish Result Databases to the GenomeQuest front end.
The GenomeQuest Result Database Browser
Like the Sequence Database Browser, the GenomeQuest platform has an innate browser built to interactively browse a Result Database. The Sequence Search workflow automatically creates GenomeQuest Result Databases, so to see an example of the Result Browser in action, simply run a Sequence Search using the GenomeQuest platform and then click into the result.
The GenomeQuest URL API provides direct linkable access to a Result Database. It does so through the API call "gqresult." For instance, if a Result has the id 125118, then you should be able to access it via the URL:
https://my.genomequest.com/query?do=gqresult&db=id:125118
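A link of this form can be generated programmatically. The Python sketch below reproduces only the URL pattern shown above. The helper name `result_url` is hypothetical, and the host is the one from the example, so an installed system would presumably use its own base URL.

```python
from urllib.parse import urlencode

# Base URL taken from the example above; replace it for a local installation.
BASE_URL = "https://my.genomequest.com/query"


def result_url(result_id, base_url=BASE_URL):
    """Build a gqresult link for the given Result Database id."""
    # safe=":" keeps the colon in "id:<n>" unescaped, as in the example URL.
    query = urlencode({"do": "gqresult", "db": f"id:{result_id}"}, safe=":")
    return f"{base_url}?{query}"


url = result_url(125118)
print(url)
```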
See the documentation on How To Create Your Own Workflow for more details on how to publish your own Result Databases.
The Hotdrive
Downloading and converting terabytes of reference databases can be a good exercise, but keeping those references up to date is generally a pain and adds little value. GenomeQuest provides this service for you. You can use the GenomeQuest portal at my.genomequest.com or, if you are an installed customer and house GenomeQuest on your local system, the GenomeCast™ database update service: a simple Perl script that brings all reference databases to your organization, converted with carefully chosen, indexed annotation fields. GenomeCast handles updates for you at the frequency that you choose.
We call this the Hotdrive - the world's publicly available reference data. It's available in the GenomeQuest installation at:
$GQ_INSTALL/data/GQdata/content/hotdrive
assuming you have installed GenomeQuest at $GQ_INSTALL.
Hotdrive Channels
Let's take a look inside the Hotdrive, assuming you have installed GenomeQuest on a disk at the location $GQ_INSTALL:
% ls $GQ_INSTALL/data/GQdata/content/hotdrive
configuration  GB_SYN  GQGENE_TR_Glycine_max  PDB_PRT
DRUGBANKPRO_NUC  GB_UNA  GQGENE_TR_Homo_sapiens  PEPTIDE_GQGENE_Glycine_max
DRUGBANKPRO_PRT  GB_VRL  GQGENE_TR_Mus_musculus  PEPTIDE_GQGENE_Sorghum_bicolor
ENSM  GB_VRT  GQGENE_TR_Oryza_sativa  PEPTIDE_GQGENE_Zea_mays
ENSP  GEAA  GQGENE_TR_Rattus_norvegicus  RSG
GB_BCT  GENA  GQGENE_TR_Sorghum_bicolor  RSG_FUNGI
GB_ENV  GENOMIC_GQGENE_Glycine_max  GQGENE_TR_Zea_mays  RSG_INVERTEBRATE
GB_EST  GENOMIC_GQGENE_Sorghum_bicolor  GQGENE_Zea_mays  RSG_MICROBIAL
GB_GSS  GENOMIC_GQGENE_Zea_mays  GQPAT_NUC  RSG_PLANT
GB_HTC  GP  GQPAT_PRT  RSG_PLASMID
GB_HTG  GQGENE  HS_RSG  RSG_PROTOZOA
GB_INV  GQGENE_Arabidopsis_thaliana  IPI  RSG_VERTEBRATE_MAMMALIAN
GB_MAM  GQGENE_Glycine_max  MRNA_GQGENE_Glycine_max  RSG_VERTEBRATE_OTHER
GB_PHG  GQGENE_Homo_sapiens  MRNA_GQGENE_Sorghum_bicolor  RSG_VIRAL
GB_PLN  GQGENE_Mus_musculus  MRNA_GQGENE_Zea_mays  RSM
GB_PRI  GQGENE_Oryza_sativa  NCBI_IGBLAST_NUC  RSP
GB_ROD  GQGENE_Rattus_norvegicus  NCBI_IGBLAST_PRT  UNIPROT
GB_SET  GQGENE_Sorghum_bicolor  NCBI_PROBE  VARIANT_DATA
GB_STS  GQGENE_TR_Arabidopsis_thaliana  PDB_NUC
All of these databases are made available to the GenomeQuest web product via the use of admin_db.pl, described above. So not only are these databases available on the UNIX command line, they are also accessible via the GenomeQuest user interface.
Now, each item in this location is itself a directory with a series of GQ Engine databases. We call each such directory a Channel.
- Hotdrive Channel - a logical grouping of GQ Engine databases which maintains a hierarchy used for handling releases and configuration files for GenomeQuest.
Inside a Hotdrive Channel
Let's look at the Hotdrive Channel for the EST division of Genbank. It shows the following entries:
% ls -ls $GB_INSTALL/disk/GQ/data/GQdata/content/hotdrive/GB_EST
total 16
4 drwxr-xr-x 5 runner geneit 4096 2009-06-11 01:20 171
4 drwxr-xr-x 5 runner geneit 4096 2009-08-27 01:21 172
4 drwxr-xr-x 5 runner geneit 4096 2009-10-15 01:15 173
4 drwxr-xr-x 5 runner geneit 4096 2009-12-18 01:17 174
0 drwxr-xr-x 5 runner geneit 129 2010-01-07 01:18 175
0 drwxr-xr-x 3 runner geneit 147 2009-08-21 04:43 configuration
We can see the past five revisions of GB_EST (171 through 175). Every channel also has a configuration directory. The configuration directory contains information which is provided to the GenomeQuest framework that allows the system to know which version of the database to use. Indeed, let's look at a particular file in this directory that shows this:
% ls -ls $GB_INSTALL/disk/GQ/data/GQdata/content/hotdrive/GB_EST/configuration
total 36
4 -rw-r--r-- 1 runner geneit 735 2005-08-26 17:25 channel-property.rss
4 -rw-rw-r-- 1 runner geneit 105 2009-07-21 13:18 seqdb_definition.seqdbconf
24 -rw-rw-r-- 1 runner geneit 23714 2010-01-06 12:01 seqdb_instance.seqdbconf
0 drwxr-xr-x 2 runner geneit 29 2009-08-21 04:43 templates
% tail $GB_INSTALL/disk/GQ/data/GQdata/content/hotdrive/GB_EST/configuration/seqdb_instance.seqdbconf
GB_EST174, GB_EST, INACTIVE, 20091118144000, GB_EST174_20091118, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091125142442, GB_EST174_20091125, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091202145630, GB_EST174_20091202, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091209150859, GB_EST174_20091209, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091216223935, GB_EST174_20091216, ${HOTDRIVE}/GB_EST/174/
GB_EST175, GB_EST, INACTIVE, 20091224081253, GB_EST175_20091224, ${HOTDRIVE}/GB_EST/175/
GB_EST175, GB_EST, INACTIVE, 20091230152022, GB_EST175_20091230, ${HOTDRIVE}/GB_EST/175/
GB_EST175, GB_EST, ACTIVE, 20100106170155, GB_EST175_20100106, ${HOTDRIVE}/GB_EST/175/
@/seqdb_instance
%
The end of this file shows that GB_EST175 is the active database in the channel.
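To show how the active instance could be located programmatically, here is a minimal Python sketch that reads lines in the format shown above. The column meanings are inferred from the sample only (release identifier, channel, status, timestamp, instance name, directory) and are assumptions, as are the function names. Joining the directory with the instance name follows the pattern of the full database path used in the lspdb example later in this document. This is not necessarily how the Hotdrive-aware tools resolve a channel.

```python
CONF_TAIL = """\
GB_EST175, GB_EST, INACTIVE, 20091224081253, GB_EST175_20091224, ${HOTDRIVE}/GB_EST/175/
GB_EST175, GB_EST, INACTIVE, 20091230152022, GB_EST175_20091230, ${HOTDRIVE}/GB_EST/175/
GB_EST175, GB_EST, ACTIVE, 20100106170155, GB_EST175_20100106, ${HOTDRIVE}/GB_EST/175/
@/seqdb_instance
"""


def parse_instances(text):
    """Return one dict per instance line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 6:
            continue  # e.g. the closing "@/seqdb_instance" marker
        release, channel, status, stamp, instance, directory = parts
        rows.append({
            "release": release,      # assumed meaning of column 1
            "channel": channel,
            "status": status,
            "timestamp": stamp,
            "instance": instance,
            "directory": directory,
        })
    return rows


def active_database_path(text, hotdrive_root):
    """Return the logical path of the single ACTIVE instance."""
    active = [r for r in parse_instances(text) if r["status"] == "ACTIVE"]
    if len(active) != 1:
        raise ValueError(f"expected one ACTIVE instance, found {len(active)}")
    row = active[0]
    directory = row["directory"].replace("${HOTDRIVE}", hotdrive_root.rstrip("/"))
    return directory.rstrip("/") + "/" + row["instance"]


path = active_database_path(CONF_TAIL, "/disk/GQ/data/GQdata/content/hotdrive")
print(path)
```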
We leave it to the reader to explore the hotdrive structure in more detail. Typically you will never need to understand the contents of the hotdrive, as you can access its contents through the GQ Engine abstractly.
Accessing the Hotdrive via GQ Engine commands
There are three primary commands in the GenomeQuest Engine which require references to databases:
- lspdb, our iterator of Sequence Databases
- lspcalc, our Sequence Database Comparison tool
- lspmul, our heuristic (word-based) Sequence Database Comparison tool
Each of these commands accepts one (lspdb) or more (lspcalc, lspmul) sequence databases as input. Normally you specify the full path of the database; with sequence databases in the Hotdrive, however, this is not necessary. We have provided "Hotdrive-aware" tools:
- lspdb.H: the Hotdrive-aware equivalent of lspdb
- lspcalc.THA: a Hotdrive-aware replacement for lspcalc and lspmul
lspdb.H
As an example of the equivalence of lspdb and lspdb.H:
% lspdb /disk/GQ/data/GQdata/content/hotdrive/GB_SYN/175/GB_SYN175_20100106 -count
db nbseqs = 91799
db nbres = 138103596
db maxlen = 1089202
db fields = ID AC SV GI GN SY D6 D7 D1 D2 MT DE KW OS OX OC CC DR RA A1 A2 A3 F4 FB FC FD FE W1 HL W4 FT
% lspdb.H GB_SYN[] -- -count
db nbseqs = 91799
db nbres = 138103596
db maxlen = 1089202
db fields = ID AC SV GI GN SY D6 D7 D1 D2 MT DE KW OS OX OC CC DR RA A1 A2 A3 F4 FB FC FD FE W1 HL W4 FT
The first command requires that you know the precise location of your GQ Engine Sequence Database. The second substitutes that full path with the name of the channel (GB_SYN) followed by empty brackets, symbolizing hotdrive entry. Note that lspdb.H requires that you put all arguments you will send to lspdb after a double dash. Indeed, lspdb.H simply passes all of the information after the -- to the lspdb command.
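The calling convention described here, a channel reference followed by `--` and then the arguments for lspdb, can be sketched in a few lines of Python. This is an illustration of the convention only, not the source of lspdb.H. The function name and the rule that a channel name consists of letters, digits, and underscores are assumptions, and only the empty-bracket form shown in this document is handled.

```python
import re

# Matches a channel reference with empty brackets, e.g. "GB_SYN[]".
CHANNEL_REF = re.compile(r"^([A-Za-z0-9_]+)\[\]$")


def split_wrapper_args(argv):
    """Split wrapper arguments into (channel names, pass-through arguments)."""
    if "--" not in argv:
        raise ValueError("expected '--' before the arguments for lspdb")
    cut = argv.index("--")
    wrapper_args, passthrough = argv[:cut], argv[cut + 1:]
    channels = []
    for arg in wrapper_args:
        match = CHANNEL_REF.match(arg)
        if match:
            channels.append(match.group(1))
    return channels, passthrough


channels, passthrough = split_wrapper_args(["GB_SYN[]", "--", "-count"])
print(channels, passthrough)
```

Everything after the first `--` is returned untouched, mirroring the statement that the wrapper simply hands those arguments on to lspdb.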
lspcalc.THA
lspcalc.THA is equivalent to both lspcalc and lspmul, with added awareness of the Hotdrive (as well as awareness of array-based computing). As an example, consider the following command:
lspcalc.THA -qs SGE -splitrole query -jobname documentation --
-t 16.4 // should be in array options
-M mapnc,kerr
-O "[-fltThreshold 6 -errs 3 -extend -hitmap -r ]" // -r -extend default
-mn NUC.2.1
-db GB_PRI[] -bfql 'de ="Human (alu consensus,Line-1 repeat mrna)" '
-db /home/mystuff/mydatabase
-o res/repeats.res -best '1,single,hitcnt,rmdup,{RS}' // rmdup single {RS} default -progress
While lspcalc.THA is further documented here and in the GQ Engine Primer, the command above illustrates the awareness of the Hotdrive with the two lines that start with -db. The first line indicates that the subject database is the version of GB_PRI that is available in the Hotdrive. The second line describes the query database - a GenomeQuest database available in my home directory.
More documentation on the Hotdrive from the GQ Engine Primer.
Local
Much as the Hotdrive is the single stop for all public reference data, GenomeQuest provides a single location where all users' Sequence Databases live. It is called Local.
- Local: the location of user Sequence Databases, as placed by the GenomeQuest Content Manager.
Local is available at the following location, assuming you installed GenomeQuest at $GQ_INSTALL:
$GQ_INSTALL/data/GQdata/content/local
Notice that it resides in the same parent directory as the GenomeQuest Hotdrive.
The organization of Local is precisely the same as that of the Hotdrive, except that each database name is preceded by the word LOCAL (for example, LOCAL_YORUBA_READS).
How do you put data in Local for users?
Two ways:
- They can do it themselves using the GenomeQuest web interface
- You can do it for them using the GenomeQuest Content Manager.
Accessing Local through GQ Engine commands
Local can be accessed through GQ Engine commands in exactly the same way as a Hotdrive database: by using lspdb.H and lspcalc.THA. For instance, assuming a Local database LOCAL_YORUBA_READS and a Hotdrive database GB_PRI:
% lspdb.H LOCAL_YORUBA_READS[] -- -count
db nbseqs = 24272624
db nbres = 873814464
db maxlen = 36
db fields = ID PV OD W1
% lspdb.H GB_PRI[] -- -count
db nbseqs = 572580
db nbres = 5820225602
db maxlen = 3284914
db fields = ID AC SV GI GN SY D6 D7 D1 D2 MT DE KW OS OX OC CC DR RA A1 A2 A3 F4 FB FC FD FE W1 HL W4 FT
% lspcalc.THA -qs SGE -splitrole query -jobname documentation --
-t 16.4 // should be in array options
-M mapnc,kerr
-O "[-fltThreshold 6 -errs 3 -extend -hitmap -r ]" // -r -extend default
-mn NUC.2.1
-db GB_PRI[]
-db LOCAL_YORUBA_READS[]
-o res/repeats.res -best '1,single,hitcnt,rmdup,{RS}' // rmdup single {RS} default -progress
Again, the other details of lspcalc.THA are documented here, but you can see how databases in the Hotdrive or in Local are referenced through these commands.
The Metadata Layer
In between the GenomeQuest Engine (the lsp* commands) and the GenomeQuest web interface rests a relational database which stores metadata about the system.
For deployed systems, GenomeQuest supports both MySQL and ORACLE. We use MySQL for our own hosted solution, and the remainder of this documentation assumes MySQL as the database engine that stores GenomeQuest metadata.
Each of the tables in the GenomeQuest metadata layer is documented below:
| Table Name | Description |
|---|---|
| USER TABLES - store information pertaining to users | |
| gq_user | Each row corresponds to a single user in the GenomeQuest system |
| accounting_group | Each row corresponds to an accounting group - users cannot share outside of their accounting group |
| preference | Links user ids with application-level preferences (e.g., the user 349 has the preference EXPAND_ALIGNMENT_BY_DEFAULT) |
| preference_value | Links application-level preferences with the values for those preferences (e.g., the preference EXPAND_ALIGNMENT_BY_DEFAULT described as preference.preference_id=1029 has the value TRUE) |
| tb_user | For GenomeQuest Live only, stores those users who have signed up for Free Basic Accounts but who haven't activated their account yet. |
| user_class | The set of legal user classes (typically gold, silver, bronze, admin, etc). Each user is associated with exactly one of these in the gq_user table. |
| SEQUENCE DATABASES - store information pertaining to sequence databases that exist in Hotdrive or Local | |
| physical_seqdb | Each row corresponds to a specific physical sequence database stored in either the Hotdrive or in Local |
| virtual_seqdb | Each row corresponds to a specific virtual sequence database - basically a saved filter on a set of physical sequence databases. Exactly like the physical_seqdb table in other respects. |
| virtual_seqdb_physical_seqdb | The set of physical_seqdb databases that comprise a particular virtual_seqdb. |
| GQ5 SEQUENCE SEARCH - tables to represent sequence searches in GQ5 and below | |
| comparison | A wide table that stores information about each GQ5 search that has been run in the system, as well as links to other tables such as gq_user |
| folder | Stores folders of related GQ5 sequence searches. This concept is not employed in GQ6 and beyond. |
| resource_5_2 | This is for GQ5 only. Unifies sequence databases and other entities (such as features) so that they can be referred to in a single namespace. (It's ok if you don't understand this.) |
| sharepriv | Ancient. Gq5. Ignore. |
| sharepriv_user | Ancient. Gq5. Ignore. |
| view_status | GQ5 only - relates a comparison to a gq_user and indicates whether the gq_user has ever viewed the comparison. |
| WORKFLOWS - tables to manage workflows and workflow runs | |
| plugin | Stores the list of valid workflows in the system (e.g., RNA-Seq, Velvet) and descriptors for their code-level names |
| workflow | Information about each workflow run that has ever occurred. The rows in this table relate to the plugin table to describe which workflow was run, as well as the gq_user table, etc. |
| workflow_params | Relates to the workflow table. Stores each parameter of the workflow for a given run in the workflow table. |
| workflow_transaction | A record is saved here every time a workflow is run. Like the workflow table, except that if a user deletes a workflow from their account, it is removed from the workflow table but not from this table. |
| ACCESS CONTROL - tables to handle access control and sharing | |
| feature | Stores features of the system that can be controlled by access control logic. |
| gq_resource | Internal table, subject to change at a moment's notice. This table homogenizes different resources inside of GQ to allow users to share resources with each other regardless of their type. |
| resource_user_account | Relates gq_resource entries to specific users, and allows for specification of quota and expiration date on a per user basis. |
| physical_seqdb_user | Stores sharing relationships between a physical_seqdb and a gq_user, where the original owner (physical_seqdb.gq_user_id) shared the physical_seqdb with a set of specific users. This table is not used when the original owner shares with a group or with "all". |
| plugin_user | Stores relationships between workflows and users. Typically workflows (such as RNA-Seq) will be defined in the plugin table to have a share_level of ":acl", which implies that the workflow is shared globally. To limit the extent to which a workflow is accessible in the system, the share_level in the plugin table should be set to _____, and the plugin_user table stores the relationship between accessible workflows and the users who can access them. |
| sharee | A generalization of a set of users (either an individual user, an accounting group, or a user class) that can be used to describe a resource which is being shared with that group. |
| sharee_resource | A relationship between a sharee and a gq_resource, indicating that the sharee gets access to the gq_resource. This table also allows for the specification of particular quota and expiries. |
| virtual_seqdb_user | Stores sharing relationships between a virtual_seqdb and a gq_user, where the original owner (virtual_seqdb.gq_user_id) shared the virtual_seqdb with a set of specific users. This table is not used when the original owner shares with a group or with "all". |
| workflow_user | Stores sharing relationships between a workflow that has been run and a gq_user, where the original owner (workflow.gq_user_id) shared the workflow with a set of specific users. This table is not used when the original owner shares with a group or with "all". |
| UTILITY - other tables used by the system | |
| event_log | Stores every event received by the system's dispatcher, by login id. |
| event_param | Stores parameters associated with events in event_log |
| alert | Used to store alerts that users have set up. This capability automatically runs searches for users and alerts them when new hits arrive in public or patent data. |
| link_record | Stores the format of an HTML link-out for data in specific Hotdrive channels. For instance, a record in UNIPROT has a certain link to an external site, whereas a record in Drugbank has a different link-out URL format. |
| tag_pair | Currently unused - designed to store key-value pairs for tagging certain GQ entities. |
| upgrade_history | Stores upgrade events to the GenomeQuest platform |
| SEQUENCES - used to increment primary keys | |
| accounting_group_sequence | A sequence to increment the primary key for the accounting_group table |
| alert_sequence | A sequence to increment the primary key for the alert table |
| comparison_sequence | A sequence to increment the primary key for the comparison table |
| folder_sequence | A sequence to increment the primary key for the folder table |
| resource_sequence | A sequence to increment the primary key for the gq_resource table |
| seqdb_sequence | A sequence to increment the primary key for both the physical_seqdb table and the virtual_seqdb table, preserving the uniqueness of the primary key space across the union of these tables. |
| sharepriv_sequence | A sequence to increment the primary key for the sharepriv table, a legacy table from GQ5 that can be ignored. |
| workflow_sequence | A sequence to increment the primary key for the workflow table. |
Userdata
Where the Metadata layer stores the metadata that describes objects in the system, and the Hotdrive (and its "Local" counterpart) stores the Sequence Databases available in the system, the Userdata layer stores the actual doings of the user - the results of runs of workflows and sequence searches.
- Userdata: the location of the results of workflows run by users.
Userdata is available at the following location, assuming you installed GenomeQuest at $GQ_INSTALL:
$GQ_INSTALL/data/GQdata/userdata
The Userdata layer is implemented as a directory on the UNIX filesystem. Inside this directory you will find one directory for every user on the system who has ever run a workflow. The directory names are keyed not to the user's login name, but to a unique id stored in the directory_name column of the Metadata layer's gq_user table. For instance:
% ls -l $GQ_INSTALL/data/GQdata/userdata | head
total 120
drwxrwxrwx  9 runner geneit 4096 Dec 22 04:50 09080310591634a76fb
drwxrwxrwx  8 runner geneit 4096 Feb  2 04:51 09080407051444a7815
drwxrwxrwx 11 runner geneit 4096 Feb  2 11:38 09080511494254a79aa
drwxrwxrwx 12 runner geneit 4096 Jan 15 05:40 09081105150664a8136
drwxrwxrwx 10 runner geneit 4096 Jan 19 05:21 09081114472374a81bc
drwxrwxrwx  6 runner geneit 4096 Dec  7 15:25 09081814025484a8aec
drwxrwxrwx  6 runner geneit 4096 Dec  7 15:16 090818173145124a8b1
drwxrwxrwx  3 runner geneit 4096 Aug 21 14:40 090821144013134a8ee
drwxrwxrwx 10 runner geneit 4096 Jan 25 16:10 090825115239144a940
%
In the Metadata layer there is a mapping from these directory names to gq_users. It is found in the gq_user table:
mysql> describe gq_user;
+----------------------------+--------------+------+-----+---------+-------+
| Field                      | Type         | Null | Key | Default | Extra |
+----------------------------+--------------+------+-----+---------+-------+
| gq_user_id                 | int(11)      | NO   | PRI | NULL    |       |
| old_unique_id              | varchar(48)  | YES  | UNI | NULL    |       |
| login_name                 | varchar(64)  | NO   | UNI | NULL    |       |
| directory_name             | varchar(128) | NO   | UNI | NULL    |       |
| first_name                 | varchar(64)  | NO   |     | NULL    |       |
| last_name                  | varchar(64)  | NO   |     | NULL    |       |
| email                      | varchar(64)  | YES  |     | NULL    |       |
| active                     | tinyint(1)   | NO   |     | NULL    |       |
| anonymous                  | tinyint(1)   | NO   |     | NULL    |       |
| password                   | varchar(64)  | YES  |     | NULL    |       |
| user_profile_name          | varchar(128) | YES  |     | NULL    |       |
| accounting_group_id        | int(11)      | NO   | MUL | NULL    |       |
| user_class_id              | int(11)      | NO   |     | NULL    |       |
| expiration_date            | int(11)      | YES  |     | NULL    |       |
| creation_date              | int(11)      | YES  |     | NULL    |       |
| last_activity              | int(11)      | YES  |     | NULL    |       |
| access_key                 | varchar(128) | YES  | UNI | NULL    |       |
| access_key_creation_date   | int(11)      | YES  |     | NULL    |       |
| access_key_expiration_date | int(11)      | YES  |     | NULL    |       |
| is_ppu                     | tinyint(1)   | NO   |     | 0       |       |
| is_progressive             | tinyint(1)   | NO   |     | 1       |       |
| current_login_time         | timestamp    | YES  |     | NULL    |       |
| previous_login_time        | timestamp    | YES  |     | NULL    |       |
+----------------------------+--------------+------+-----+---------+-------+
23 rows in set (0.00 sec)
The two fields of interest are login_name and directory_name. For instance:
mysql> select login_name, directory_name from gq_user where login_name = 'test1';
+------------+---------------------+
| login_name | directory_name      |
+------------+---------------------+
| test1      | 090904083727234aa10 |
+------------+---------------------+
1 row in set (0.00 sec)
So to view the actual workflow results for the user test1 you would:
% cd $GQ_INSTALL/data/GQdata/userdata/090904083727234aa10
% ls -l
drwxrwxrwx  3 runner geneit 4096 Feb  2 12:23 1002021216071124561
drwxrwxrwx 10 runner geneit 4096 Feb  4 11:07 seqdbsearch
drwxr-xr-x 31 runner geneit 4096 Feb  4 16:04 workflow
Mapping Between Userdata and Metadata
It can be tedious to open the Metadata relational database every time you want to find a user's working userdata directory. Sometimes you know the directory but not the login name. Sometimes you know the login name but not the directory. On our systems we built two very simple scripts, which we offer here for you to deploy on your own machine, to map quickly between the two without having to open a mysql console.
I Know The Directory
Copy the following code into a file, make it executable, and place it in a directory on your PATH:
#!/bin/bash
# Usage: ud-i-know-dir <mysql-database> <directory_name>
mysql -u runner --password='<the-password>' --execute="select login_name from gq_user where directory_name='$2'" "$1"
If you saved this as a file ud-i-know-dir, you would execute this as:
% ud-i-know-dir gqdb 090904083727234aa10
+------------+
| login_name |
+------------+
| test1      |
+------------+
%
In this case, the identifier gqdb is the mysql database name associated with your installation of GQ.
I Know The Login
Copy the following code into a file, make it executable, and place it in a directory on your PATH:
#!/bin/bash
# Usage: ud-i-know-login <mysql-database> <login_name>
mysql -u runner --password='<the-password>' --execute="select directory_name from gq_user where login_name='$2'" "$1"
If you saved this as a file ud-i-know-login, you would execute this as:
% ud-i-know-login gqdb test1
+---------------------+
| directory_name      |
+---------------------+
| 090904083727234aa10 |
+---------------------+
%
In this case, the identifier gqdb is the mysql database name associated with your installation of GQ.
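The two scripts above differ only in which column is selected and which is matched, so they can be folded into one helper. The following is a minimal sketch, not something shipped with GenomeQuest: the function names ud_query and ud_lookup are hypothetical, and the sketch prompts for the mysql password with -p instead of embedding it. It also refuses values containing quotes or backslashes, which is a basic guard rather than a complete SQL sanitizer.

```shell
# ud_query MODE VALUE
# Build the gq_user lookup query for one direction of the mapping.
#   MODE=dir   : VALUE is a directory_name, select the login_name
#   MODE=login : VALUE is a login_name, select the directory_name
ud_query() {
    local mode="$1" value="$2" want have
    case "$mode" in
        dir)   want=login_name;     have=directory_name ;;
        login) want=directory_name; have=login_name ;;
        *)     echo "usage: ud_query dir|login <value>" >&2; return 2 ;;
    esac
    # Basic guard: the value is pasted into a quoted SQL literal, so reject
    # anything that could close the literal or start an escape sequence.
    case "$value" in
        *\'*|*\\*) echo "ud_query: value must not contain quotes or backslashes" >&2; return 2 ;;
    esac
    printf "select %s from gq_user where %s='%s'\n" "$want" "$have" "$value"
}

# ud_lookup DATABASE MODE VALUE
# Run the query and print only the matching value (no table borders).
ud_lookup() {
    local db="$1" query
    query=$(ud_query "$2" "$3") || return
    mysql -u runner -p --batch --skip-column-names --execute="$query" "$db"
}
```

With these functions sourced into your shell, ud_lookup gqdb login test1 would print the directory name for test1, and ud_lookup gqdb dir 090904083727234aa10 would print the login name.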
Contents of Userdata Directories
Let's have a look inside a given user's userdata directory. Remember, we won't expect to find their Sequence Databases here (those are stored in Local):
% ls -l $GQ_INSTALL/data/GQdata/userdata/090904083727234aa10
total 12
drwxrwxrwx  3 runner geneit 4096 Feb  2 12:23 1002021216071124561
drwxrwxrwx 10 runner geneit 4096 Feb  4 11:07 seqdbsearch
drwxr-xr-x 31 runner geneit 4096 Feb  4 16:04 workflow
There can be a variety of files and directories here but we concern ourselves only with the directory called workflow. All of the other directories either represent some transient state of the user's interactions with the system, or they represent sequence searches from older versions of GenomeQuest (pre 6.0).
Let's take a look at the workflow directory.
% cd $GQ_INSTALL/data/GQdata/userdata/090904083727234aa10/workflow
% ls -lt
total 116
drwxrwxrwx 3 runner geneit 4096 Feb  8 21:16 1690
drwxrwxrwx 3 runner geneit 4096 Feb  4 16:05 1714
drwxrwxrwx 3 runner geneit 4096 Feb  4 11:10 1572
drwxrwxrwx 3 runner geneit 4096 Jan 29 17:40 1612
drwxrwxrwx 6 runner geneit 4096 Jan 29 12:30 1588
drwxrwxrwx 7 runner geneit 4096 Jan 28 19:23 1605
drwxrwxrwx 3 runner geneit 4096 Jan 28 19:05 1604
drwxrwxrwx 7 runner geneit 4096 Jan 27 19:38 1584
drwxrwxrwx 7 runner geneit 4096 Jan 27 19:32 1582
drwxrwxrwx 6 runner geneit 4096 Jan 27 19:31 1571
drwxrwxrwx 7 runner geneit 4096 Jan 27 19:28 1583
drwxrwxrwx 5 runner geneit 4096 Jan 27 18:45 1577
drwxrwxrwx 3 runner geneit 4096 Jan 27 18:41 1576
drwxrwxrwx 4 runner geneit 4096 Jan 27 18:38 1575
drwxrwxrwx 4 runner geneit 4096 Jan 27 18:34 1574
drwxrwxrwx 7 runner geneit 4096 Jan 27 18:31 1570
drwxrwxrwx 5 runner geneit 4096 Jan 27 18:26 1573
drwxrwxrwx 3 runner geneit 4096 Jan 27 17:23 1569
drwxrwxrwx 7 runner geneit 4096 Jan 21 12:02 1546
drwxrwxrwx 3 runner geneit 4096 Jan 20 21:37 1534
drwxrwxrwx 3 runner geneit 4096 Jan  4 14:08 1388
drwxrwxrwx 4 runner geneit 4096 Dec 31 14:46 1358
drwxrwxrwx 4 runner geneit 4096 Dec 31 14:45 1357
drwxrwxrwx 4 runner geneit 4096 Dec 31 14:39 1354
drwxrwxrwx 5 runner geneit 4096 Dec 30 11:46 1348
drwxrwxrwx 7 runner geneit 4096 Dec 30 11:04 1350
drwxrwxrwx 2 runner geneit 4096 Dec 30 10:42 1347
drwxrwxrwx 2 runner geneit 4096 Nov 23 11:55 1122
drwxrwxrwx 2 runner geneit 4096 Sep 14 16:08 676
Each of the directories in this area corresponds to a single workflow run by the user. What are these numbers? They map to rows in the Metadata layer - the MySQL (or ORACLE) database. In particular, they map to a table called workflow. Let's have a closer look:
mysql> describe workflow;
+--------------------+--------------+------+-----+----------+-------+
| Field              | Type         | Null | Key | Default  | Extra |
+--------------------+--------------+------+-----+----------+-------+
| workflow_id        | int(11)      | NO   | PRI | NULL     |       |
| gq_user_id         | int(11)      | NO   |     | NULL     |       |
| directory_name     | varchar(48)  | NO   |     | NULL     |       |
| type               | varchar(32)  | NO   |     | NULL     |       |
| text_label         | varchar(255) | YES  |     | NULL     |       |
| description        | text         | YES  |     | NULL     |       |
| total_nbresults    | int(11)      | YES  |     | NULL     |       |
| submit_time        | int(11)      | YES  |     | NULL     |       |
| finish_time        | int(11)      | YES  |     | NULL     |       |
| status             | varchar(64)  | YES  |     | NULL     |       |
| input1_id          | int(11)      | YES  |     | NULL     |       |
| input1_type        | char(64)     | YES  |     | NULL     |       |
| input2_id          | int(11)      | YES  |     | NULL     |       |
| input2_type        | char(64)     | YES  |     | NULL     |       |
| version            | varchar(3)   | YES  |     | NULL     |       |
| email              | varchar(128) | YES  |     | NULL     |       |
| disk_usage         | int(11)      | YES  |     | NULL     |       |
| credit_cost        | int(11)      | NO   |     | 0        |       |
| credit_paid        | int(11)      | NO   |     | NULL     |       |
| credit_code        | varchar(16)  | YES  |     | NULL     |       |
| parent_workflow_id | int(11)      | YES  |     | NULL     |       |
| base_workflow_id   | int(11)      | NO   |     | NULL     |       |
| share_level        | varchar(16)  | NO   |     | :private |       |
+--------------------+--------------+------+-----+----------+-------+
The field workflow.workflow_id ties the MySQL metadata to one of the directory numbers here. The workflow_id is unique across all users, so you can query on it directly, for instance:
mysql> select workflow_id,directory_name,type,from_unixtime(submit_time), status from workflow where workflow_id=1690;
+-------------+----------------+---------------+----------------------------+----------+
| workflow_id | directory_name | type          | from_unixtime(submit_time) | status   |
+-------------+----------------+---------------+----------------------------+----------+
|        1690 | 1690           | GqWfSeqSearch | 2010-02-04 11:08:26        | FINISHED |
+-------------+----------------+---------------+----------------------------+----------+
1 row in set (0.00 sec)
We can see that in the directory 1690, we should find the results of a Sequence Search workflow (GqWfSeqSearch). Let's have a closer look:
% cd 1690
% ls -l
total 136
-rw-r--r-- 1 runner geneit  2213 Feb  4 11:08 body.script.comparison.sh
-rw-r--r-- 1 runner geneit     0 Feb  4 11:08 comparison.has.finished
-rw-r--r-- 1 runner geneit 13588 Feb  4 11:08 comparison.log
lrwxrwxrwx 1 runner geneit    82 Feb  4 11:08 lastjob -> /disk/GQtest/data/GQdata/userdata/090904083727234aa10/workflow/1690/.tmp_res.21373
lrwxrwxrwx 1 runner geneit    80 Feb  4 11:08 lastquery -> /disk/GQtest/data/GQdata/userdata/090904083727234aa10/workflow/1690/query.db.ind
lrwxrwxrwx 1 runner geneit    82 Feb  4 11:08 lastsubject -> /disk/GQtest/data/GQdata/userdata/090904083727234aa10/workflow/1690/subject.db.ind
-rw-r--r-- 1 runner geneit  2070 Feb  4 11:08 lspcalc.progress
-rw-r--r-- 1 runner geneit   378 Feb  4 11:08 lspextend.progress
-rw-r--r-- 1 runner geneit     7 Feb  4 11:08 overall.progress
-rw-r--r-- 1 runner geneit    24 Feb  4 11:08 query.db.ctb
-rw-r--r-- 1 runner geneit   325 Feb  4 11:08 query.db.ind
-rw-r--r-- 1 runner geneit    22 Feb  4 11:08 query.db.seq
-rw-r--r-- 1 runner geneit 24000 Feb  4 11:08 res_0
-rw-r--r-- 1 runner geneit   823 Feb  4 11:08 res_0.resinfo
-rw-r--r-- 1 runner geneit 10000 Feb  4 11:08 res_0.tab
-rw-r--r-- 1 runner geneit   618 Feb  4 11:08 res.resinfo
-rw-r--r-- 1 runner geneit  4564 Feb  4 11:08 script.comparison.sh
-rw-r--r-- 1 runner geneit  9189 Feb  4 11:08 script.comparison.sh.params
-rw-r--r-- 1 runner geneit  6552 Feb  4 11:08 script.comparison.sh.params.json
-rw-r--r-- 1 runner geneit     6 Feb  4 11:08 script.comparison.sh.pid
-rw-r--r-- 1 runner geneit     0 Feb  4 11:08 script.comparison.sh.stderr
-rw-r--r-- 1 runner geneit     0 Feb  4 11:08 script.comparison.sh.stdout
-rw-r--r-- 1 runner geneit  1187 Feb  4 11:08 subject.db.ind
These are all of the "file droppings" associated with that workflow. As you write your own workflows, the files they produce will be placed in the appropriate userdata directory under the appropriate workflow_id.
Documentation of these files is outside of the scope of this particular documentation. See How Do I Make My Own Workflows for more information.
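As a rough illustration of working with these directories from the shell, the sketch below scans a user's workflow directory and reports which runs contain the zero-byte comparison.has.finished file seen in the listing for workflow 1690. The meaning of that file is an assumption based only on its name: it is not documented here, other workflow types may not produce it, and the status column of the workflow table remains the authoritative source. The function names are hypothetical.

```shell
# wf_is_finished DIR
# Assumption: comparison.has.finished is a marker dropped once the
# comparison step of a sequence search workflow has completed.
wf_is_finished() {
    [ -e "$1/comparison.has.finished" ]
}

# wf_report WORKFLOW_ROOT
# Print one line per workflow directory: "<directory name><TAB><marker state>".
wf_report() {
    local root="$1" dir
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -d "$dir" ] || continue
        dir="${dir%/}"
        if wf_is_finished "$dir"; then
            printf '%s\tmarker present\n' "${dir##*/}"
        else
            printf '%s\tno marker\n' "${dir##*/}"
        fi
    done
}
```

You would call it as wf_report $GQ_INSTALL/data/GQdata/userdata/090904083727234aa10/workflow. Directories are listed in glob (lexical) order, not by workflow age.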
Workflow Architecture
Workflows. GenomeQuest's version of the "App Store." This is where you build arbitrary applications and integrate them into the GenomeQuest architecture. The Workflow Architecture is totally open - you can use whatever programming language you want for most of the workflow. What you get from GenomeQuest is:
- management of all of the underlying sequence data
- the powerful GenomeQuest Engine for doing massive sequence comparison, analysis, and search
- the presence of a compute array behind the scenes
- a simple mechanism to publish results to users
- the GenomeQuest framework allows users to share your workflow results like anything else in the system
See How Do I Write My Own Workflows for the details.
Where Does Workflow Code Go?
There are two kinds of workflows. System Workflows are workflows that are developed and supported by GenomeQuest. They live in $GQ_INSTALL/web/GQ/system_plugins/Workflows.
Your workflows should be placed in $GQ_INSTALL/web/GQ/plugins/Workflows. Let's have a look.
% ls -l $GQ_INSTALL/web/GQ/plugins/Workflows
total 4
lrwxrwxrwx 1 runner geneit   46 Oct 20 10:59 GqWfChipseq -> /home/runner/heush/Services/trunk/GqWfChipseq/
drwxr-xr-x 4 runner geneit 4096 Jul 31  2009 GqWfVariant.old
In this directory we see two different workflows. The first, GqWfChipseq, is a working copy of a ChIP-Seq workflow. (Indeed, we can see that it is a symbolic link to another directory - don't let that worry you, it's just a mechanism to allow you to develop the workflow wherever you want, as long as there is a link to it in $GQ_INSTALL/web/GQ/plugins/Workflows.) The second workflow is an old version of a Variant workflow.
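The symbolic-link arrangement described above can be set up with plain ln -s. The following is a small sketch of a helper that does so defensively; the function name link_workflow is hypothetical and nothing in it is specific to GenomeQuest beyond the destination directory you pass in.

```shell
# link_workflow SRC_DIR DEST_DIR
# Create DEST_DIR/<basename of SRC_DIR> as a symbolic link to SRC_DIR.
link_workflow() {
    local src="$1" dest_dir="$2" name
    [ -d "$src" ] || { echo "link_workflow: $src is not a directory" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "link_workflow: $dest_dir is not a directory" >&2; return 1; }
    # Resolve to an absolute path so the link does not depend on where
    # the command was run from.
    src=$(cd "$src" && pwd) || return 1
    name=$(basename "$src")
    # Refuse to overwrite an existing entry, including a dangling link.
    if [ -e "$dest_dir/$name" ] || [ -L "$dest_dir/$name" ]; then
        echo "link_workflow: $dest_dir/$name already exists" >&2
        return 1
    fi
    ln -s "$src" "$dest_dir/$name"
}
```

For example, link_workflow ~/dev/GqWfChipseq $GQ_INSTALL/web/GQ/plugins/Workflows would produce a link like the GqWfChipseq entry in the listing (the ~/dev path is only an illustration).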
How Do I Write My Own Workflow?
See How Do I Write My Own Workflows for the details.
I'm Confused. What's the Relationship Between Workflow Code, Userdata, Local, and Hotdrive?
Remember, the Hotdrive is where the public reference sequence is kept, in GQ Engine format. It lives at $GQ_INSTALL/data/GQdata/content/hotdrive. We keep it up to date using GenomeCast. You never have to worry about this.
Local is where users' sequence data is kept. It lives at $GQ_INSTALL/data/GQdata/content/local. You can add data to this using the GenomeQuest Content Manager. When users use the web system to upload their own sequence data, the GenomeQuest Content Manager is automatically called in the background for them.
Workflow code is placed in $GQ_INSTALL/web/GQ/plugins/Workflows. Each subdirectory should correspond to a single workflow.
Whenever a workflow is run, it creates a new directory in Userdata in the directory of the user who ran it: $GQ_INSTALL/data/GQdata/userdata. All of the file droppings of your workflow will be placed in this directory.
The key insight to take away is that your workflow code will live in one place, and it will run in another place - in userdata.
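To make the four locations concrete, here is a small sketch that prints them for a given installation root and builds the run directory of a single workflow. The function names are hypothetical; the paths are the ones given in this document, and the run directory layout (userdata/<user directory>/workflow/<workflow directory>) follows the example of workflow 1690 shown earlier.

```shell
# gq_locations GQ_INSTALL
# Print where each kind of data or code lives, one "<name><TAB><path>" per line.
gq_locations() {
    local root="$1"
    printf 'hotdrive\t%s\n'      "$root/data/GQdata/content/hotdrive"
    printf 'local\t%s\n'         "$root/data/GQdata/content/local"
    printf 'workflow-code\t%s\n' "$root/web/GQ/plugins/Workflows"
    printf 'userdata\t%s\n'      "$root/data/GQdata/userdata"
}

# wf_run_dir GQ_INSTALL USER_DIRECTORY WORKFLOW_DIRECTORY
# Print the directory in which one workflow run leaves its files.
wf_run_dir() {
    printf '%s/data/GQdata/userdata/%s/workflow/%s\n' "$1" "$2" "$3"
}
```

For the installation used in the earlier listings, wf_run_dir /disk/GQtest 090904083727234aa10 1690 prints /disk/GQtest/data/GQdata/userdata/090904083727234aa10/workflow/1690, which matches the symbolic link targets in that listing.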
Plugin Architecture
Unlike workflows, which allow you to provide arbitrary levels of customizability, plugins run either in the Sequence Database Browser or the Result Database Browser context. They iterate over sequences or results that have been selected by a user and perform some operation on each of them.
When you develop a plug-in, it becomes available in the Applications menu of the Sequence or Result Database Browser. Indeed, you can decide whether you want it to be available in a Sequence context, a Result context (remember - Results are databases of alignments!), or both.
When Would I Write a Plug-in?
Whenever you want to allow the user to pick a subset of sequences or alignments and process them somehow. Examples:
- clustalw these sequences
- analyze these alignments to see if they overlap
- export these sequences/alignments in a particular format
- plug-in some other application to run on a set of sequences or alignments
We use the plug-in architecture primarily to provide export services (GFF, Genbank/EMBL, Gbrowse, UCSC, etc), to add analytical functionality (Clustalw, EMBOSS, etc.) and to create useful post-analyses for customers (e.g., combining all of the SNPs - sequences, remember - into a table which is grouped by position and shows the haplotype for each of a series of lines or patients).
When Would It Be Better To Write a Workflow?
Plug-ins only run inside the Sequence Database Browser or the Result Database Browser. They have a limited user interface and they don't have the support of the My GenomeQuest page to show things like progress, success/failure, and so on. So if you are building a large application that allows users great specificity and takes hours to run, make a workflow. If you are building a quick way to analyze or export sequences or alignments, use a plugin.
Where does Plug-In Code Live?
There are two kinds of plugins. System Plugins are plugins that are developed and supported by GenomeQuest. They live in $GQ_INSTALL/web/GQ/system_plugins/Exports.
Your plugins should be placed in $GQ_INSTALL/web/GQ/plugins/Exports. Let's have a look.
% ls -l $GQ_INSTALL/web/GQ/plugins/Exports
total 4
lrwxrwxrwx 1 runner geneit 48 Dec 22 15:40 GqAppExtractNoHits -> /home/runner/richard/Services/GqAppExtractNoHits
lrwxrwxrwx 1 runner geneit 49 Dec 17 11:24 GqAppExtractQueries -> /home/runner/richard/Services/GqAppExtractQueries
lrwxrwxrwx 1 runner geneit 50 Dec 21 14:41 GqAppExtractSubjects -> /home/runner/richard/Services/GqAppExtractSubjects
lrwxrwxrwx 1 runner geneit 41 Sep 29 13:11 GqAppHapTbl -> /home/runner/richard/Services/GqAppHapTbl
In this directory we see four different plugins. Each plugin is automatically activated by the system as soon as you place the directory containing the code into this directory. Indeed, these are actually just symbolic links to other directories - this allows developers to develop in whichever directory they want.
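Because plugins here are often symbolic links into a developer's own tree, a link whose target has been moved or deleted will leave a dead entry in this directory. The sketch below lists each entry with its link target and flags dangling links. It is only a filesystem check under that assumption; the function name plugin_report is hypothetical, and how the system itself reacts to a dangling link is not described in this document.

```shell
# plugin_report PLUGIN_DIR
# Print one line per entry: name, then whether it is a real directory,
# a working link, or a link whose target no longer exists.
plugin_report() {
    local dir="$1" entry name
    for entry in "$dir"/*; do
        # -L is needed as well as -e: -e is false for a dangling link.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name="${entry##*/}"
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            printf '%s\tBROKEN -> %s\n' "$name" "$(readlink "$entry")"
        elif [ -L "$entry" ]; then
            printf '%s\tlink -> %s\n' "$name" "$(readlink "$entry")"
        elif [ -d "$entry" ]; then
            printf '%s\tdirectory\n' "$name"
        fi
    done
}
```

A typical call would be plugin_report $GQ_INSTALL/web/GQ/plugins/Exports; the same function works on the Workflows directory.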
How Do I Write My Own Plugin?
See Plug-in My Application for detailed instructions.
Using The Array
Depending on the type of installation, you will generally have an array of compute nodes available to you to launch GQ Engine sequence comparisons. Here's what you don't need to worry about:
- locality of data - we automatically place the entire hotdrive on the local disks of each compute node. We also NFS-mount local on each node. So each node has everything in the hotdrive and in local.
- load balancing - we use Sun Grid Engine to queue up jobs
- splitting searches - we automatically split the search up into smaller pieces and distribute each small piece on the right node
- rejoining results - we automatically rejoin each smaller search on the head node that launched the job
So all you have to do to run a job on the array is use our array-aware comparison engine:
- lspcalc.THA
Full documentation of this application is available on the command line via
% lspcalc.THA -help
Examples using lspcalc.THA
The following is an example of an 'all against all' command (GB_PRI compared with GB_ROD) on the compute nodes:
% lspcalc.THA -progress -shadowrole both -- -t 8.2 -M bl2 -O '[ -E 10 ]' -mn NUC.3.1 -db GB_PRI[] -db GB_ROD[] -o res
- -progress will create a progress file that updates as the search runs
- -shadowrole both tells the system that the actual databases exist locally on the compute nodes (these are both hotdrive databases so they aren't NFS-mounted like local is)
- -- everything past this double-dash is what is passed to the underlying lspcalc command on each node
- -t 8.2 tells the system to use 8 jobs, 2 threads per job
- -M bl2 tells the nodes to use the Blast2 algorithm
- -O '[ -E 10 ]' passes whatever options to Blast2 you desire, in this case, -E 10
- -mn NUC.3.1 uses the NUC.3.1 matrix for Blast2
- -db GB_PRI[] says that the subject database is the primate division of Genbank on the hotdrive
- -db GB_ROD[] says that the query database is the rodent division of Genbank on the hotdrive
- -o res reassembles the results into a Result Database called res
Let's do the same but with both subject and query databases split (8 chunks for subject, 3 for query), which will generate 24 jobs:
% lspcalc.THA -progress -splitrole both -shadowrole both -- -t 8.3.2 -M bl2 -O '[ -E 10 ]' -mn NUC.3.1 -db GB_PRI[] -db GB_ROD[] -o res
- -splitrole both tells the system to split both the query and the subject databases into smaller pieces
- -t 8.3.2 tells the system to split the subject into 8 dbs and the query into 3, and to run each job with two threads.
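The job arithmetic behind the two -t forms can be checked with a few lines of shell. This is only a sketch of the interpretation given in the text (J.T means J jobs with T threads each; S.Q.T means S subject chunks times Q query chunks, with T threads per job); it is not part of lspcalc.THA, the function name tha_jobs is hypothetical, and any other form is rejected rather than guessed at.

```shell
# tha_jobs SPEC
# Print "jobs=<n> threads=<n>" for a -t value in one of the two forms
# described above:
#   J.T    -> J jobs, T threads per job
#   S.Q.T  -> S * Q jobs, T threads per job
tha_jobs() {
    local spec="$1"
    # Accept only dot-separated digit groups with no empty group.
    case "$spec" in
        ''|*[!0-9.]*|.*|*.|*..*)
            echo "tha_jobs: bad -t value: $spec" >&2; return 2 ;;
    esac
    # Split the value on dots into the positional parameters.
    local IFS=.
    set -- $spec
    case $# in
        2) printf 'jobs=%s threads=%s\n' "$1" "$2" ;;
        3) printf 'jobs=%s threads=%s\n' "$(( $1 * $2 ))" "$3" ;;
        *) echo "tha_jobs: expected J.T or S.Q.T, got: $spec" >&2; return 2 ;;
    esac
}
```

Running tha_jobs 8.3.2 prints jobs=24 threads=2, the 24 jobs mentioned above, and tha_jobs 8.2 prints jobs=8 threads=2.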
Some Suggestions On Parameters for NGS
ILLUMINA: short reads, constant quality.
lspcalc.THA -splitrole query -qs all.q -jobname 'my_job' --
-t 16.4 -M mapnc,kerr
-O '[ -errs 2 -extend -fltThreshold 6 -fltCut 300 -fltOverlap 32 -r ]'
-db db1
-db db2
-o res
ABI: longer (but still short) reads, quality decreases at the extremities.
lspcalc.THA -splitrole query -qs all.q -jobname 'my_job' --
-t 8 -M mapnc,sw
-O '[ -fltThreshold 6 -fltCut 300 -r ]'
-mn NUC.2.1 …
// Modified NCBI matrix is better for short reads
-scut 28
454: longer reads, local alignments.
lspcalc.THA -splitrole query -qs all.q -jobname 'my_job' --
-t 8 -M mapnc,bl2
-O '[ -fltThreshold 10 -fltCut 5000 -fltOverlap 200 -r ]' …
// another way to specify the filter (old <LEN,PERID>)
-scut 30
More on this lspmul functionality
You can read more about lspmul and this principle of filtering sequence pairs here.
More on the Array Computing
The GQ Engine Primer has a chapter on array computing.
GenomeCast
Downloading and converting the terabytes of reference databases can be a good exercise, but keeping the references up to date is generally a pain, with little added value. GenomeQuest provides a database update service (GenomeCast™), a simple Perl script that brings to your organization all reference databases, converted with carefully chosen, indexed annotation fields. GenomeCast handles updates for you, at the frequency that you choose. We call this the Hotdrive. With GenomeCast, you get a real-time mirror image of the Hotdrive that GenomeQuest continuously builds on its servers.