Welcome to the GenomeQuest Documentation Wiki

DeveloperAPISystemConcepts

Core System Concepts

Quick links to each concept are listed below, but we recommend reading this entire document in order, because later sections build on earlier concepts.

  1. The GQ Engine
  2. Sequence and Annotation Databases
  3. Result Databases
  4. The Hotdrive
  5. Local
  6. The Metadata Layer
  7. Userdata
  8. Workflow Architecture
  9. Plugin Architecture
  10. Using The Array
  11. GenomeCast

The GQ Engine

The GQ Engine is the underlying technology that enables the entire system to work. It consists of

  • a specific representation for sequence databases which is compact, binary, and includes annotation
  • a specific representation for comparisons between sequence databases, again, compact, binary, and searchable
  • a set of commonly used sequence comparison algorithms
  • a programmatic query language, BFQL, best described as "PL/SQL" for sequence and result databases
  • innate knowledge of compute clusters

The GQ Engine is implemented as a series of UNIX binaries available at the command line. The complete set of binaries is below. Note that full documentation for these modules is available in the GQ Engine Reference Manual.

GQ Engine Modules

name | inputs | outputs | function
lspbank | text file of sequences and annotation | GQ Engine seqdb | converts an input file into a GQ Engine seqdb
lspdb | seqdb, query criteria | a set of records that meet query criteria | queries a sequence database for some properties
lspvbank | set of seqdbs | GQ Engine virtual seqdb | creates a virtual seqdb from the set of inputs. Such a virtual database thereafter works exactly like a normal seqdb.
lspcalc | seqdb1, seqdb2, algorithm | GQ Engine resdb | compares seqdb1 to seqdb2 using algorithm, produces resdb
lspmul | seqdb1, seqdb2, algorithm | GQ Engine resdb | compares seqdb1 to seqdb2 using a two-phased algorithm: heuristic word matching followed by dynamic programming alignment, produces resdb
lspres | resdb, query criteria | a set of records that meet query criteria | queries a result database for some properties, returns those results that meet query conditions
lspvres | set of resdbs | GQ Engine virtual resdb | creates a virtual resdb from the set of inputs. Such a virtual result database thereafter works exactly like a normal resdb.
lspextend | resdb | an extended resdb | computes additional properties of a result database to allow for querying on alignment properties
lspcalc.TH | seqdb1, seqdb2, algorithm | GQ Engine resdb | exactly like lspcalc but is multi-threaded
lspmul.TH | seqdb1, seqdb2, algorithm | GQ Engine resdb | exactly like lspmul but is multi-threaded
lspcalc.THA | seqdb1, seqdb2, algorithm | GQ Engine resdb | a fusion of lspcalc.TH and lspmul.TH that is aware of the entire compute cluster
lspdb.H | seqdb, query criteria | a set of records that meet query criteria | queries a sequence database for some properties, is aware of the GQ Hotdrive

Sequence and Annotation Databases

A GenomeQuest Sequence Database is a compact representation of an arbitrarily large number of sequences and associated annotation. It is stored in a binary format in a UNIX filesystem. It is not meant to be edited directly or to be viewed directly as its representation is abstracted from the user.

Sequence types

GenomeQuest supports the following sequence types:

  • Nucleic
  • Colorspace nucleic
  • Nucleic pattern
  • Peptide
  • Peptide pattern

Sequence annotation

Each sequence in the database can also have corresponding annotation associated with it. This annotation is not positional in nature (see Result Databases for that), but rather global annotation on the sequence itself. GenomeQuest stores annotation values in fields which are indexed by two-character keys. Example annotation fields might be:

ID the human-readable sequence identifier for the sequence, e.g. "NM_018189.1"
DE the description of the sequence, e.g., "Homo sapiens hypothetical protein FLJ10713 (FLJ10713), MRNA"

Any two-character string is a valid annotation field in a GenomeQuest sequence database. At the GenomeQuest engine level, there is no semantic meaning attached to any such annotation field. You could easily use the ID field to store something else, or put the sequence identifier in a field called '4R' if you preferred.

The GenomeQuest product itself does apply additional semantic meaning to certain fields, even though the GenomeQuest Engine does not. The list of fields and their associated semantics is available here:

https://my.genomequest.com/query?do=gqfetch.get_db_field_list

Note that this page returns raw text, machine readable. To view it formatted, choose "View Source" from your web browser.

Making and Working with a Sequence Database

The GQ Engine can parse multiple input formats to make a sequence database. The most common formats are:

  • EMBL-like (a.k.a. DB2 - this is the native format)
  • FASTA

EMBL-like format allows for the addition of arbitrary annotation fields, whereas FASTA only allows for the sequence ID and the sequence itself. Below is a sample EMBL-like file format:

ID      SNP-18
DE      SNP AC177813.2 position 1439-1439
GP      AC177813.2
OS      Zea mays
KW      G -> t
  CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGtTTTTTCGATCCCTACAAAGGAAAGGATGAGTACGGGATCTT
//
ID      SNP-35
DE      SNP AC177813.2 position 142409-142409
GP      AC177813.2
OO      MAIZE
KW      G -> a
  CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCaTCTGGGACGTCTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//

As you can see, the database specifies two sequences, each representing a single nucleotide polymorphism in the middle of the sequence. The annotation fields being used are:

ID An identifier associated with the SNP
DE A human readable description
GP The genomic product on which the SNP exists
OO The human-readable organism name
KW Any keywords

Again, at the GenomeQuest engine level, there is no semantic meaning associated with these annotation fields, but at the level of the GenomeQuest product (the web application), these fields tend to have specific meanings.
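
To make the EMBL-like layout concrete, here is a minimal Python sketch that reads such a file into dictionaries. It is not part of GenomeQuest. The parsing rules are assumptions inferred from the sample above: annotation lines start with a two-character key, sequence lines are indented, and "//" ends a record. The function name parse_embl_like and the "sequence" dictionary key are hypothetical.

```python
from typing import Dict, List

# Sample data copied from the EMBL-like example above.
SAMPLE = """\
ID      SNP-18
DE      SNP AC177813.2 position 1439-1439
GP      AC177813.2
OS      Zea mays
KW      G -> t
  CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGtTTTTTCGATCCCTACAAAGGAAAGGATGAGTACGGGATCTT
//
ID      SNP-35
DE      SNP AC177813.2 position 142409-142409
GP      AC177813.2
OO      MAIZE
KW      G -> a
  CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCaTCTGGGACGTCTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//
"""


def parse_embl_like(text: str) -> List[Dict[str, str]]:
    """Parse EMBL-like text into one dict per record (assumed layout, see lead-in)."""
    records: List[Dict[str, str]] = []
    fields: Dict[str, str] = {}
    sequence: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            # End of record: store the joined sequence and start a new record.
            fields["sequence"] = "".join(sequence)
            records.append(fields)
            fields, sequence = {}, []
        elif line[0].isspace():
            # Indented lines carry sequence residues.
            sequence.append(line.strip())
        else:
            # The first two characters are the annotation key.
            key, value = line[:2], line[2:].strip()
            # Repeated keys are joined with a space (an assumption).
            fields[key] = (fields[key] + " " + value) if key in fields else value
    return records


for record in parse_embl_like(SAMPLE):
    print(record["ID"], len(record["sequence"]))
```

Because the engine attaches no meaning to the keys, the sketch treats every two-character key the same way and leaves interpretation to the caller.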

Now, let's assume that this file of "EMBL-like" sequence data is available in my current working directory and is named "data.embl". To make a GenomeQuest engine database, observe the following interaction on the UNIX command line with the GQ Engine installed:

runner@linnaeus:~/doc>ls -l
total 4
-rw-r--r-- 1 runner geneit 344 Jan  5 13:35 data.embl
runner@linnaeus:~/doc>lspbank -dbtype NUC -T EMBL -F myseqdb data.embl
data.embl : sequences 2, residues 162, max seq length 81
runner@linnaeus:~/doc>ls -l
total 16
-rw-r--r-- 1 runner geneit 344 Jan  5 13:35 data.embl
-rw-r--r-- 1 runner geneit  40 Jan  5 13:56 myseqdb.ctb
-rw-r--r-- 1 runner geneit 330 Jan  5 13:56 myseqdb.ind
-rw-r--r-- 1 runner geneit 168 Jan  5 13:56 myseqdb.seq
runner@linnaeus:~/doc>
The command
lspbank -dbtype NUC -T EMBL -F myseqdb data.embl
converts the text file called "data.embl" into a GQ Engine database called "myseqdb". Notice that physical implementation of the logical myseqdb database is in fact three different files:
-rw-r--r-- 1 runner geneit  40 Jan  5 13:56 myseqdb.ctb
-rw-r--r-- 1 runner geneit 330 Jan  5 13:56 myseqdb.ind
-rw-r--r-- 1 runner geneit 168 Jan  5 13:56 myseqdb.seq
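
Since one logical database maps to several physical files, a script that moves or archives databases may want to check that all of them are present. The following Python sketch does so under one assumption: that the three extensions seen in the listing above (.ctb, .ind, .seq) are the complete set for a sequence database. The helper name missing_seqdb_files is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions observed in the listing above; treating this set as complete is an assumption.
SEQDB_EXTENSIONS = (".ctb", ".ind", ".seq")


def missing_seqdb_files(logical_name: str, directory: str = ".") -> List[str]:
    """Return the physical file names that are absent for a logical seqdb name."""
    base = Path(directory)
    return [
        logical_name + ext
        for ext in SEQDB_EXTENSIONS
        if not (base / (logical_name + ext)).is_file()
    ]


# Demo in a scratch directory where only two of the three files exist.
with tempfile.TemporaryDirectory() as scratch:
    for ext in (".ctb", ".ind"):
        (Path(scratch) / ("myseqdb" + ext)).touch()
    print(missing_seqdb_files("myseqdb", scratch))
```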

If you now wanted to query that database, you may do so using the command "lspdb":

runner@linnaeus:~/doc>lspdb myseqdb
ID	SNP-18
AC	
OS	Zea mays
DE	SNP AC177813.2 position 1439-1439
KW	G -> t

  CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGTTTTTTCGATC
  CCTACAAAGGAAAGGATGAGTACGGGATCTT
//
ID	SNP-35
AC	
OS	Zea mays
DE	SNP AC177813.2 position 142409-142409
KW	G -> a

  CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCATCTGGGACGT
  CTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//

Notice first that you refer to the logical name "myseqdb" rather than to the physical names of any of the files. The lspdb command behaves like the UNIX command "cat," except that rather than displaying a text file, it displays the contents of a binary database as text on STDOUT.

There is much power in the lspdb command alone - see the GenomeQuest Engine Primer for much more, or try
lspdb -help

Here are a few examples, shown without further explanation, to give you some flavor.

Output a GQ Engine Sequence Database as a FASTA file

% lspdb myseqdb -printf '>%H#ID\n%S\n%VOID'
>SNP-18
CGCGACCCGGCTGGTCATGACTATTTCACACGTCGTGAGTTTTTTCGATCCCTACAAAGGAAAGGATGAGTACGGGATCTT
>SNP-35
CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCATCTGGGACGTCTTGAGGGAAGCCGATGATGTCTTGAAGGCT
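
For comparison, the same FASTA rendering can be sketched outside the engine in Python. This is an illustration only: the record layout (a dict with "ID" and "sequence" keys) is hypothetical, and the upper-casing simply mirrors the lspdb output shown above.

```python
from typing import Dict, Iterable


def to_fasta(records: Iterable[Dict[str, str]]) -> str:
    """Render records as FASTA: a '>' header line with the ID, then the sequence."""
    return "".join(
        ">{}\n{}\n".format(record["ID"], record["sequence"].upper())
        for record in records
    )


print(to_fasta([{"ID": "SNP-18", "sequence": "ACGt"}]), end="")
```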

Returning the record for sequence with ID SNP-35

% lspdb myseqdb -bfql 'ID="SNP-35"'
ID	SNP-35
AC	
OS	Zea mays
DE	SNP AC177813.2 position 142409-142409
KW	G -> a

  CGCCTTGTCCTAACCCTTCAGCATATCCTCTAGCTCATCATCTGGGACGT
  CTTGAGGGAAGCCGATGATGTCTTGAAGGCT
//

Publishing a Sequence Database to the GenomeQuest Front-End

In order to publish a GenomeQuest Engine sequence database to the GenomeQuest web product, you must use the GenomeQuest tool called admin_db.pl. This tool is part of the GenomeQuest Content Manager suite, and comes as part of the GenomeQuest installation. It encapsulates the functionality of lspbank as described above, along with metadata about the access control related to the sequence database. Full documentation of admin_db.pl is available on the command line, as well as in the GenomeQuest Content Manager Reference Manual.

% $GQ_INSTALL/data/GQdata/external_apps/gqadmindb/admin_db.pl 
Please specify action(--action).

Usage:
     admin_db.pl    --action <convert|index|configure|push|activate|add|delete|update|list|showtree|showfields|lookupfields> 
                    --db_file <input_db_file_or_dir> 
                    --gq_base_dir <GenomeQuest_installation_dir>
                    --db_id <GQ_database_definition_id, e.g., "LOCAL_MYDB">
                    --db_format <EMBL+, FASTA, FASTQ>
                    --map <e.g. "ID|AC|DE">
                    --db_type <NUC, PRT, PRO(same as PRT), NUCCS>
                    --db_name <database_name>
                    --index_fields <e.g. "ID,AC,DE">
                    --target <hotdrive, local, all>
                    --release <release_number>
                    --gq_fields <e.g., "ID,AC,GI,OS,FT...">
                    --norm_pn
                    --pattern <e.g., "number*fragment">
                    --owner <login name of a GenomeQuest user>
                    --access <the access level for the database. e.g. "private", "public", "group">

     Use -h/--help to see detailed instructions for each option.

The GenomeQuest Sequence Database Browser

Once a sequence database is published into the GenomeQuest system, it is available inside the GQ product. Simply go to the MyGQ page, and view "My Uploaded Databases." Your database should be visible there for browsing. Note that all of the annotation fields you provided will be available for querying, sorting, and grouping by the user. In this way you can publish meaningful annotated sequence databases for your users to browse. Examples:

  • a sequence database where each sequence represents a SNP and flanking sequence. Annotation fields include positional information about the SNP, its quality, and whether it is involved in a change in an encoded protein
  • a sequence database where each sequence is a gene. Annotation fields include expression levels of the gene in a series of tissues
  • a sequence database where each sequence is a target associated with a drug. Annotation fields include the name of the drug and other information about the disease / phenotype.

Any sequence database published is also available via the GenomeQuest URL API. For instance, if you publish a sequence database via admin_db.pl and give it the db_id as follows:

db_id=MY_DB

then you should be able to access it via the URL:

https://my.genomequest.com/query?do=gqfetch&db=MY_DB

Much more on the URL API is available here.
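
As a small illustration, the gqfetch link above can be assembled programmatically. This Python sketch only builds the URL string and does not contact the server. The helper name gqfetch_url is hypothetical, and the base URL is the hosted portal address used in this document; an installed system would use its own host.

```python
from urllib.parse import urlencode

# Hosted portal address from this document; replace with your own host if installed locally.
BASE_URL = "https://my.genomequest.com/query"


def gqfetch_url(db_id: str, base_url: str = BASE_URL) -> str:
    """Build a gqfetch URL for a published sequence database."""
    return base_url + "?" + urlencode({"do": "gqfetch", "db": db_id})


print(gqfetch_url("MY_DB"))
```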

Result Databases

A Result Database is a very specific concept in GenomeQuest. By Result Database, we mean a particular GenomeQuest Engine database format which represents the results of a sequence comparison between two GenomeQuest Engine Sequence Databases.

A Result Database is represented as a set of files in the UNIX filesystem, much as a Sequence Database is.

It is produced by a sequence comparison (or by converting a SAM/BAM file).

There are three core GenomeQuest Engine commands that produce a Result Database:

  1. lspcalc: takes two Sequence Databases and a sequence comparison algorithm and compares every sequence in database 1 against every sequence in database 2, using the algorithm specified
  2. lspmul: same as lspcalc, however it first performs a word-based matching to determine which pairs of sequences are likely to have an alignment, and then only runs the comparison algorithm on those pairs
  3. lspresbank: takes a SAM/BAM file and produces a result database.
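
The difference between the lspcalc and lspmul strategies can be illustrated with a toy Python sketch. This is not the GQ Engine's implementation. It assumes fixed-length exact words (k-mers) for the pre-filter and a plain Smith-Waterman score with a linear gap penalty as the comparison algorithm; all function names and scoring values are hypothetical.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All words of length k in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(queries: Dict[str, str], subjects: Dict[str, str],
                    k: int = 4) -> Set[Tuple[str, str]]:
    """Word-matching phase: keep (query, subject) pairs sharing at least one word."""
    index: Dict[str, Set[str]] = {}
    for sid, seq in subjects.items():
        for word in kmers(seq, k):
            index.setdefault(word, set()).add(sid)
    pairs: Set[Tuple[str, str]] = set()
    for qid, seq in queries.items():
        for word in kmers(seq, k):
            for sid in index.get(word, ()):
                pairs.add((qid, sid))
    return pairs


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_vs_all(queries: Dict[str, str], subjects: Dict[str, str]):
    """lspcalc-style strategy: align every query against every subject."""
    return {(q, s): local_alignment_score(qs, ss)
            for q, qs in queries.items() for s, ss in subjects.items()}


def prefiltered(queries: Dict[str, str], subjects: Dict[str, str], k: int = 4):
    """lspmul-style strategy: align only the pairs that pass the word filter."""
    return {(q, s): local_alignment_score(queries[q], subjects[s])
            for q, s in candidate_pairs(queries, subjects, k)}


queries = {"q1": "ACGTACGT"}
subjects = {"s1": "TTACGTAA", "s2": "GGGGGGGG"}
print(len(all_vs_all(queries, subjects)), "alignments without pre-filtering")
print(len(prefiltered(queries, subjects)), "alignments with pre-filtering")
```

The pre-filter trades sensitivity for speed: pairs with no shared word are never aligned, so a weak alignment without an exact word match would be missed in this toy version.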

Example: lspcalc producing a result database

Let's download and convert the SwissProt database into a GQ Engine format, and then run a BLAST comparing all human proteins against all mouse proteins:

% lftp
ftp.expasy.ch/.../swissprot/release_compressed/uniprot_sprot.dat.gz
% time gunzip -c uniprot_sprot.dat.gz | lspbank -T embl -prot -F sp

STDIN : sequences 512994, residues 180531504, max seq length 35213
real 0m24.024s user 0m34.382s sys 0m4.661s

Now the SwissProt database has been downloaded and converted into a GQ Engine format. The logical identifier of the database is sp and it resides in our current working directory. Next, let's run the comparison:

% lspcalc -M bl2 -mp BLOSUM62      // use the BLAST2 algorithm with the BLOSUM62 matrix
-db sp -bfql 'os="homo sapiens"'	// Subject (reference)
-db sp -bfql 'os="mus musculus"'	// Query
-o HsMm.res -best 5,{-RS}                // output into a Result Database called "HsMm.res", keeping the best 5 hits for each query, sorted by BLAST score (descending order)

In this directory we now have a result file called "HsMm.res" which we can interact with:

% lspres HsMm.res | head
1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 S= 1605 E= 1.37637e-178 Bits= 622.854
3HIDH_HUMAN 1 336 3HIDH_MOUSE 1 335 S= 1562 E= 6.66903e-174 Bits= 606.29
5HT1A_HUMAN 1 421 5HT1A_MOUSE 1 421 S= 1901 E= 4.42149e-213 Bits= 736.873
5HT3B_HUMAN 6 438 5HT3B_MOUSE 1 434 S= 1702 E= 5.52321e-190 Bits= 660.218
5NT3L_HUMAN 1 291 5NT3L_MOUSE 1 291 S= 1390 E= 4.86686e-154 Bits= 540.035

Or perhaps to look at the alignments:

% lspres HsMm.res -a | head
1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 S= 1605 E= 1.37637e-178 Bits= 622.854
Q:	51 EKMLKFQHVIRNQFLQQISQQMQCVPPGDQQCTQTSRKRKKM-GYLLSQMVNFLWSNTVK 109
           |  |  |  +   |+|  |+|   +   +++ |+   + + +   |+ +|+| | |
S:	31 EITLHLQQAMTEHFVQLTSRQGLSLE--ERRHTEAICEHEALLSRLICRMINLLQSGAAS 88

Q:     110 KLKFKVPLPCLDSRCGIKVGHQTLSPWQTGQSRPSLGGFEAALASCTLSKRGAGIYESYH 169
            |+ +||||  |||  ++ | +     |     | |   |||  +  || ||  |    |
S:	89 GLELQVPLPSEDSRGDVRYGQRAQLSGQP-DPVPQLSDCEAAFVNRDLSIRGIDISVFYQ 147

Of course, all of this is documented in detail in the GQ Engine Primer and associated manuals. And if you ever need more help, never forget:

lspcalc -h

or

lspres -h
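
The one-line-per-hit summary printed by lspres above is easy to post-process in a script. The Python sketch below parses such lines. The column layout is inferred only from the sample output in this section (two identifiers, each followed by start and end positions, then S=, E= and Bits= values) and may not cover other lspres output modes. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the sample lspres lines above (an assumption).
HIT_LINE = re.compile(
    r"^(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+"
    r"S=\s*(\S+)\s+E=\s*(\S+)\s+Bits=\s*(\S+)$"
)


@dataclass
class HitSummary:
    first_id: str
    first_start: int
    first_end: int
    second_id: str
    second_start: int
    second_end: int
    score: int
    evalue: float
    bits: float


def parse_hit(line: str) -> Optional[HitSummary]:
    """Parse one summary line; return None for lines that do not match."""
    m = HIT_LINE.match(line.strip())
    if m is None:
        return None
    g = m.groups()
    return HitSummary(g[0], int(g[1]), int(g[2]), g[3], int(g[4]), int(g[5]),
                      int(g[6]), float(g[7]), float(g[8]))


hit = parse_hit("1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 "
                "S= 1605 E= 1.37637e-178 Bits= 622.854")
print(hit)
```

Alignment lines (those beginning with "Q:" or "S:") do not match the pattern and yield None, so the same function can be used to skim summaries out of the -a output.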

Comparison Algorithms

A large number of comparison algorithms exist to create a Result Database:

  • Blast (local alignment)
  • Needleman & Wunsch (global alignment)
  • Kerr (global alignment on the smallest sequence)
  • Smith-Waterman
  • String, pattern matching

In addition, a Result Database can be created from a SAM/BAM file, thereby allowing any external alignment program to produce GenomeQuest Engine Result Databases.


Additional Operations Relating to Sequence Comparisons

Except when a Result Database is created by importing a SAM/BAM file, every sequence comparison approach outlined above also lets the user apply any of the following operations dynamically:

  • pre-filtering of query sequences based on word-matching of queries against subject sequences. This process is encapsulated in a GenomeQuest Engine command called lspmul.
  • selection on the fly of query and subject sequences that have certain properties
  • strategies to select and retain the "best" hit(s) among all hits
  • automated dispatch of computation across a set of compute nodes

Publishing a Result Database to the GenomeQuest Front-End

Unlike Sequence Databases, which can be published to the GenomeQuest front-end via admin_db.pl, Result Databases are typically products of GenomeQuest Workflows. See the documentation on How to Make Your Own Workflow for details on how to publish Result Databases to the GenomeQuest front-end.

The GenomeQuest Result Database Browser

Like the Sequence Database Browser, the GenomeQuest platform has an innate browser built to interactively browse a Result Database. The Sequence Search workflow automatically creates GenomeQuest Result Databases, so to see an example of the Result Browser in action, simply run a Sequence Search using the GenomeQuest platform and then click into the result.

The GenomeQuest URL API provides direct linkable access to a Result Database. It does so through the API call "gqresult." For instance, if a Result has the id 125118, then you should be able to access it via the URL:

https://my.genomequest.com/query?do=gqresult&db=id:125118

See the documentation on How To Create Your Own Workflow for more details on how to publish your own Result Databases.
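
For scripts that generate or consume result links, the following Python sketch builds a gqresult URL of the form shown above and extracts the result id from one. It only manipulates strings and does not contact the server. The helper names are hypothetical, and the "id:" prefix is taken from the single example in this section.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://my.genomequest.com/query"  # hosted portal address from this document


def gqresult_url(result_id: int, base_url: str = BASE_URL) -> str:
    """Build a gqresult URL; safe=':' keeps the colon in 'id:<n>' unescaped."""
    query = urlencode({"do": "gqresult", "db": "id:{}".format(result_id)}, safe=":")
    return base_url + "?" + query


def result_id_from_url(url: str) -> int:
    """Extract the numeric result id from a gqresult URL."""
    db_value = parse_qs(urlsplit(url).query)["db"][0]
    prefix, _, value = db_value.partition(":")
    if prefix != "id":
        raise ValueError("expected a db value of the form id:<number>")
    return int(value)


url = gqresult_url(125118)
print(url, result_id_from_url(url))
```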

The Hotdrive

Downloading and converting the terabytes of reference databases can be a good exercise, but keeping the references up-to-date is generally a pain, with little value-added. GenomeQuest provides this service for you, either through the use of the GenomeQuest portal at my.genomequest.com, or if you are an installed customer and house GenomeQuest on your local system, via a database update service (GenomeCast™), a simple Perl script that brings to your organization all reference databases, converted with carefully chosen, indexed annotation fields. GenomeCast handles updates for you, at the frequency that you choose.

We call this the Hotdrive - the world's publicly available reference data. It's available in the GenomeQuest installation at:

$GQ_INSTALL/data/GQdata/content/hotdrive

assuming you have installed GenomeQuest at $GQ_INSTALL.

Hotdrive Channels

Let's take a look inside the Hotdrive, assuming you have installed GenomeQuest on a disk at the location $GQ_INSTALL:

% ls $GQ_INSTALL/data/GQdata/content/hotdrive
configuration    GB_SYN                          GQGENE_TR_Glycine_max        PDB_PRT
DRUGBANKPRO_NUC  GB_UNA                          GQGENE_TR_Homo_sapiens       PEPTIDE_GQGENE_Glycine_max
DRUGBANKPRO_PRT  GB_VRL                          GQGENE_TR_Mus_musculus       PEPTIDE_GQGENE_Sorghum_bicolor
ENSM             GB_VRT                          GQGENE_TR_Oryza_sativa       PEPTIDE_GQGENE_Zea_mays
ENSP             GEAA                            GQGENE_TR_Rattus_norvegicus  RSG
GB_BCT           GENA                            GQGENE_TR_Sorghum_bicolor    RSG_FUNGI
GB_ENV           GENOMIC_GQGENE_Glycine_max      GQGENE_TR_Zea_mays           RSG_INVERTEBRATE
GB_EST           GENOMIC_GQGENE_Sorghum_bicolor  GQGENE_Zea_mays              RSG_MICROBIAL
GB_GSS           GENOMIC_GQGENE_Zea_mays         GQPAT_NUC                    RSG_PLANT
GB_HTC           GP                              GQPAT_PRT                    RSG_PLASMID
GB_HTG           GQGENE                          HS_RSG                       RSG_PROTOZOA
GB_INV           GQGENE_Arabidopsis_thaliana     IPI                          RSG_VERTEBRATE_MAMMALIAN
GB_MAM           GQGENE_Glycine_max              MRNA_GQGENE_Glycine_max      RSG_VERTEBRATE_OTHER
GB_PHG           GQGENE_Homo_sapiens             MRNA_GQGENE_Sorghum_bicolor  RSG_VIRAL
GB_PLN           GQGENE_Mus_musculus             MRNA_GQGENE_Zea_mays         RSM
GB_PRI           GQGENE_Oryza_sativa             NCBI_IGBLAST_NUC             RSP
GB_ROD           GQGENE_Rattus_norvegicus        NCBI_IGBLAST_PRT             UNIPROT
GB_SET           GQGENE_Sorghum_bicolor          NCBI_PROBE                   VARIANT_DATA
GB_STS           GQGENE_TR_Arabidopsis_thaliana  PDB_NUC

All of these databases are made available to the GenomeQuest web product via the use of admin_db.pl, described above. So not only are these databases available on the UNIX command line, they are also accessible via the GenomeQuest user interface.

Now, each item in this location is itself a directory with a series of GQ Engine databases. We call each such directory a Channel.

Hotdrive Channel - a logical grouping of GQ Engine databases which maintains a hierarchy used for handling releases and configuration files for GenomeQuest.

Inside a Hotdrive Channel

Let's look at the Hotdrive Channel for the EST division of Genbank. It shows the following entries:

% ls -ls $GQ_INSTALL/data/GQdata/content/hotdrive/GB_EST
total 16
4 drwxr-xr-x 5 runner geneit 4096 2009-06-11 01:20 171
4 drwxr-xr-x 5 runner geneit 4096 2009-08-27 01:21 172
4 drwxr-xr-x 5 runner geneit 4096 2009-10-15 01:15 173
4 drwxr-xr-x 5 runner geneit 4096 2009-12-18 01:17 174
0 drwxr-xr-x 5 runner geneit  129 2010-01-07 01:18 175
0 drwxr-xr-x 3 runner geneit  147 2009-08-21 04:43 configuration

We can see the past five revisions of GB_EST (171 through 175). Every channel also has a configuration directory. The configuration directory contains information which is provided to the GenomeQuest framework that allows the system to know which version of the database to use. Indeed, let's look at a particular file in this directory that shows this:

% ls -ls $GQ_INSTALL/data/GQdata/content/hotdrive/GB_EST/configuration
total 36
 4 -rw-r--r-- 1 runner geneit   735 2005-08-26 17:25 channel-property.rss
 4 -rw-rw-r-- 1 runner geneit   105 2009-07-21 13:18 seqdb_definition.seqdbconf
24 -rw-rw-r-- 1 runner geneit 23714 2010-01-06 12:01 seqdb_instance.seqdbconf
 0 drwxr-xr-x 2 runner geneit    29 2009-08-21 04:43 templates
% tail $GQ_INSTALL/data/GQdata/content/hotdrive/GB_EST/configuration/seqdb_instance.seqdbconf
GB_EST174, GB_EST, INACTIVE, 20091118144000, GB_EST174_20091118, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091125142442, GB_EST174_20091125, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091202145630, GB_EST174_20091202, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091209150859, GB_EST174_20091209, ${HOTDRIVE}/GB_EST/174/
GB_EST174, GB_EST, INACTIVE, 20091216223935, GB_EST174_20091216, ${HOTDRIVE}/GB_EST/174/
GB_EST175, GB_EST, INACTIVE, 20091224081253, GB_EST175_20091224, ${HOTDRIVE}/GB_EST/175/
GB_EST175, GB_EST, INACTIVE, 20091230152022, GB_EST175_20091230, ${HOTDRIVE}/GB_EST/175/
GB_EST175, GB_EST, ACTIVE, 20100106170155, GB_EST175_20100106, ${HOTDRIVE}/GB_EST/175/

@/seqdb_instance
%

The end of this file shows that GB_EST175 is the active database in the channel.
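
To show how the active instance can be read from such a file, here is a minimal Python sketch. The meaning of each comma-separated column is an assumption inferred from the excerpt above (release identifier, channel, status, timestamp, instance name, path); the field names, the function names, and the example Hotdrive root are hypothetical.

```python
from string import Template
from typing import List, NamedTuple, Optional


class Instance(NamedTuple):
    release_id: str      # e.g. GB_EST175 (assumed meaning)
    channel: str         # e.g. GB_EST
    status: str          # ACTIVE or INACTIVE
    timestamp: str       # e.g. 20100106170155
    instance_name: str   # e.g. GB_EST175_20100106
    path: str            # e.g. ${HOTDRIVE}/GB_EST/175/


def parse_instances(text: str) -> List[Instance]:
    """Parse instance lines, skipping anything that is not a six-column record."""
    instances = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) == 6 and parts[2] in ("ACTIVE", "INACTIVE"):
            instances.append(Instance(*parts))
    return instances


def active_instance(instances: List[Instance]) -> Optional[Instance]:
    """Return the last instance marked ACTIVE, or None if there is none."""
    active = [inst for inst in instances if inst.status == "ACTIVE"]
    return active[-1] if active else None


def expand_path(instance: Instance, hotdrive_root: str) -> str:
    """Substitute ${HOTDRIVE} in the path column with a concrete directory."""
    return Template(instance.path).substitute(HOTDRIVE=hotdrive_root)


EXCERPT = """\
GB_EST175, GB_EST, INACTIVE, 20091230152022, GB_EST175_20091230, ${HOTDRIVE}/GB_EST/175/
GB_EST175, GB_EST, ACTIVE, 20100106170155, GB_EST175_20100106, ${HOTDRIVE}/GB_EST/175/

@/seqdb_instance
"""

current = active_instance(parse_instances(EXCERPT))
print(current.instance_name, expand_path(current, "/example/hotdrive"))
```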

We leave it to the reader to explore the hotdrive structure in more detail. Typically you will never need to understand the contents of the hotdrive, as you can access its contents through the GQ Engine abstractly.

Accessing the Hotdrive via GQ Engine commands

There are three primary commands in the GenomeQuest Engine which require references to databases:

  1. lspdb, our iterator of Sequence Databases.
  2. lspcalc, our Sequence Database Comparison tool
  3. lspmul, our heuristic (word-based) Sequence Database Comparison tool

Each of these commands accepts one (lspdb) or more (lspcalc, lspmul) sequence databases as input. Normally you specify the full path of the database; for sequence databases in the Hotdrive, this is not necessary. We have provided "Hotdrive-aware" tools:

  • lspdb.H: the equivalent of lspdb, hotdrive aware
  • lspcalc.THA: a hotdrive-aware replacement for lspcalc and lspmul

lspdb.H

As an example of the equivalence of lspdb and lspdb.H:

% lspdb /disk/GQ/data/GQdata/content/hotdrive/GB_SYN/175/GB_SYN175_20100106 -count
		 db nbseqs = 91799
		 db nbres  = 138103596
		 db maxlen = 1089202
		 db fields = ID AC SV GI GN SY D6 D7 D1 D2 MT DE KW OS OX OC CC DR RA A1 A2 A3 F4 FB FC FD FE W1 HL W4 FT 
% lspdb.H GB_SYN[] -- -count
		 db nbseqs = 91799
		 db nbres  = 138103596
		 db maxlen = 1089202
		 db fields = ID AC SV GI GN SY D6 D7 D1 D2 MT DE KW OS OX OC CC DR RA A1 A2 A3 F4 FB FC FD FE W1 HL W4 FT 

The first command requires that you know the precise location of your GQ Engine Sequence Database. The second substitutes that full path with the name of the channel (GB_SYN) followed by empty brackets, symbolizing hotdrive entry. Note that lspdb.H requires that you put all arguments you will send to lspdb after a double dash. Indeed, lspdb.H simply passes all of the information after the -- to the lspdb command.
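
The pass-through convention described above (wrapper arguments before the double dash, lspdb arguments after it) can be sketched as follows. This is only an illustration of the idea, not the actual lspdb.H code. The resolver that maps a channel reference such as "GB_SYN[]" to a path is a hypothetical stand-in supplied by the caller.

```python
from typing import Callable, List, Tuple


def split_at_double_dash(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split arguments into (before '--', after '--')."""
    if "--" not in argv:
        return list(argv), []
    position = argv.index("--")
    return argv[:position], argv[position + 1:]


def build_lspdb_command(argv: List[str], resolve: Callable[[str], str]) -> List[str]:
    """Resolve database references, then append the pass-through arguments."""
    references, passthrough = split_at_double_dash(argv)
    return ["lspdb"] + [resolve(ref) for ref in references] + passthrough


# Hypothetical lookup table standing in for the Hotdrive configuration.
KNOWN = {"GB_SYN[]": "/disk/GQ/data/GQdata/content/hotdrive/GB_SYN/175/GB_SYN175_20100106"}

print(build_lspdb_command(["GB_SYN[]", "--", "-count"], KNOWN.__getitem__))
```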

lspcalc.THA

lspcalc.THA is equivalent to both lspcalc and lspmul, with added awareness of the Hotdrive (as well as awareness of array-based computing). As an example, consider the following command:

lspcalc.THA -qs SGE -splitrole query -jobname documentation --
    -t 16.4	// should be in array options
    -M mapnc,kerr
    -O "[-fltThreshold 6 -errs 3 -extend -hitmap -r ]"  // -r -extend default
    -mn NUC.2.1
    -db GB_PRI[] -bfql 'de ="Human (alu consensus,Line-1 repeat mrna)" '
    -db /home/mystuff/mydatabase 
    -o res/repeats.res -best '1,single,hitcnt,rmdup,{RS}' // rmdup single {RS} default -progress

While lspcalc.THA is further documented here and in the GQ Engine Primer, the command above illustrates the awareness of the Hotdrive with the two lines that start with -db. The first line indicates that the subject database is the version of GB_PRI that is available in the Hotdrive. The second line describes the query database - a GenomeQuest database available in my home directory.

More documentation on the Hotdrive from the GQ Engine Primer.

Local

Much as the Hotdrive is the single stop for all public reference data, GenomeQuest provides a single location where all users' Sequence Databases live. It is called Local.

Local: the location of user Sequence Databases, as placed by the GenomeQuest Content Manager.

Local is available at the following location, assuming you installed GenomeQuest at $GQ_INSTALL:

$GQ_INSTALL/data/GQdata/content/local

Notice that it resides in the same parent directory as the GenomeQuest Hotdrive.

The organization of Local is precisely the same as that of the Hotdrive, except that each database name is preceded by the word LOCAL.

How do you put data in Local for users?

Two ways:

  1. They can do it themselves using the GenomeQuest web interface
  2. You can do it for them using the GenomeQuest Content Manager.

Accessing Local through GQ Engine commands

Local can be accessed through GQ Engine commands in exactly the same way as the Hotdrive: by using lspdb.H and lspcalc.THA. For instance, assuming a Local database LOCAL_YORUBA_READS and a Hotdrive database GB_PRI:

% lspdb.H LOCAL_YORUBA_READS[] -- -count
		 db nbseqs = 24272624
		 db nbres  = 873814464
		 db maxlen = 36
		 db fields = ID PV OD W1 
% lspdb.H GB_PRI[] -- -count
		 db nbseqs = 572580
		 db nbres  = 5820225602
		 db maxlen = 3284914
		 db fields = ID AC SV GI GN SY D6 D7 D1 D2 MT DE KW OS OX OC CC DR RA A1 A2 A3 F4 FB FC FD FE W1 HL W4 FT 
% lspcalc.THA -qs SGE -splitrole query -jobname documentation --
    -t 16.4	// should be in array options
    -M mapnc,kerr
    -O "[-fltThreshold 6 -errs 3 -extend -hitmap -r ]"  // -r -extend default
    -mn NUC.2.1
    -db GB_PRI[]
    -db LOCAL_YORUBA_READS[]
    -o res/repeats.res -best '1,single,hitcnt,rmdup,{RS}' // rmdup single {RS} default -progress

Again, the other details of lspcalc.THA are documented here, but you can see how databases in the Hotdrive or in Local are referenced through these commands.
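
The three kinds of -db reference seen in the examples (a Hotdrive channel, a Local database, and a plain filesystem path) can be told apart with a small helper. The Python sketch below assumes, based only on the names in this document, that bracketed references ending in "[]" are managed databases and that Local names begin with "LOCAL_". The function names are hypothetical.

```python
from typing import List


def classify_db_reference(ref: str) -> str:
    """Classify a -db argument as 'hotdrive', 'local', or 'path' (assumed rules)."""
    if ref.endswith("[]"):
        name = ref[:-2]
        return "local" if name.startswith("LOCAL_") else "hotdrive"
    return "path"


def db_references(args: List[str]) -> List[str]:
    """Collect the values that follow each -db option in an argument list."""
    return [args[i + 1] for i, arg in enumerate(args[:-1]) if arg == "-db"]


args = ["-M", "mapnc,kerr", "-db", "GB_PRI[]", "-db", "LOCAL_YORUBA_READS[]",
        "-o", "res/repeats.res"]
for ref in db_references(args):
    print(ref, classify_db_reference(ref))
```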

The Metadata Layer

In between the GenomeQuest Engine (the lsp* commands) and the GenomeQuest web interface rests a relational database which stores metadata about the system.

For deployed systems, GenomeQuest supports both MySQL and ORACLE. We use MySQL for our own hosted solution, and the remainder of this documentation assumes MySQL as the database engine that stores GenomeQuest metadata.

Each of the tables in the GenomeQuest metadata layer is documented below:

GQ Metadata Tables

Table Name | Description

USER TABLES - store information pertaining to users
gq_user | Each row corresponds to a single user in the GenomeQuest system
accounting_group | Each row corresponds to an accounting group - users cannot share outside of their accounting group
preference | Links user ids with application-level preferences (e.g., the user 349 has the preference EXPAND_ALIGNMENT_BY_DEFAULT)
preference_value | Links application-level preferences with the values for those preferences (e.g., the preference EXPAND_ALIGNMENT_BY_DEFAULT described as preference.preference_id=1029 has the value TRUE)
tb_user | For GenomeQuest Live only, stores those users who have signed up for Free Basic Accounts but who haven't activated their account yet.
user_class | The set of legal user classes (typically gold, silver, bronze, admin, etc). Each user is associated with exactly one of these in the gq_user table.

SEQUENCE DATABASES - store information pertaining to sequence databases that exist in Hotdrive or Local
physical_seqdb | Each row corresponds to a specific physical sequence database stored in either the Hotdrive or in Local
virtual_seqdb | Each row corresponds to a specific virtual sequence database - basically a saved filter on a set of physical sequence databases. Exactly like the physical_seqdb table in other respects.
virtual_seqdb_physical_seqdb | The set of physical_seqdb databases that comprise a particular virtual_seqdb.

GQ5 SEQUENCE SEARCH - tables to represent sequence searches in GQ5 and below
comparison | A wide table that stores information about each GQ5 search that has been run in the system, as well as links to other tables such as gq_user
folder | Stores folders of related GQ5 sequence searches. This concept is not employed in GQ6 and beyond.
resource_5_2 | This is for GQ5 only. Unifies sequence databases and other entities (such as features) so that they can be referred to in a single namespace. (It's ok if you don't understand this.)
sharepriv | Ancient. GQ5. Ignore.
sharepriv_user | Ancient. GQ5. Ignore.
view_status | GQ5 only - relates a comparison to a gq_user and indicates whether the gq_user has ever viewed the comparison.

WORKFLOWS - tables to manage workflows and workflow runs
plugin | Stores the list of valid workflows in the system (e.g., RNA-Seq, Velvet) and descriptors for their code-level names
workflow | Information about each workflow run that has ever occurred. The rows in this table relate to the plugin table to describe which workflow was run, as well as the gq_user table, etc.
workflow_params | Relates to the workflow table. Stores each parameter of the workflow for a given run in the workflow table.
workflow_transaction | A record is saved here every time a workflow is run. Like the workflow table, except that if a user deletes a workflow from his account, it is removed from the workflow table but not from this table.

ACCESS CONTROL - tables to handle access control and sharing
feature | Stores features of the system that can be controlled by access control logic.
gq_resource | Internal table, subject to change at a moment's notice. This table homogenizes different resources inside of GQ to allow users to share resources with each other regardless of their type.
resource_user_account | Relates gq_resource entries to specific users, and allows for specification of quota and expiration date on a per user basis.
physical_seqdb_user | Stores sharing relationships between a physical_seqdb and a gq_user, where the original owner (physical_seqdb.gq_user_id) shared the physical_seqdb with a set of specific users. This table is not used when the original owner shares with a group or with "all".
plugin_user | Stores relationships between workflows and users. Typically workflows (such as RNA-Seq) will be defined in the plugin table to have a share_level of ":acl" which implies that the workflow is shared globally. To limit the extent to which a workflow is accessible in the system, the share_level in the plugin table should be set to _____ and this plugin_user table stores the relationship between accessible workflows and the users who can access them.
sharee | A generalization of a set of users (either an individual user, an accounting group, or a user class) that can be used to describe a resource which is being shared with that group.
sharee_resource | A relationship between a sharee and a gq_resource, indicating that the sharee gets access to the gq_resource. This table also allows for the specification of particular quota and expiries.
virtual_seqdb_user | Stores sharing relationships between a virtual_seqdb and a gq_user, where the original owner (virtual_seqdb.gq_user_id) shared the virtual_seqdb with a set of specific users. This table is not used when the original owner shares with a group or with "all".
workflow_user | Stores sharing relationships between a workflow that has been run and a gq_user, where the original owner (workflow.gq_user_id) shared the workflow with a set of specific users. This table is not used when the original owner shares with a group or with "all".

UTILITY - other tables used by the system
event_log | Stores every event received by the system's dispatcher, by login id.
event_param | Stores parameters associated with events in event_log
alert | Used to store alerts that users have set up. This capability automatically runs searches for users and alerts them when new hits arrive in public or patent data.
link_record | Stores the format of an HTML link-out for data in specific Hotdrive channels. For instance, a record in UNIPROT has a certain link to an external site, whereas a record in Drugbank has a different link-out URL format
tag_pair | Currently unused - designed to store key-value pairs for tagging certain GQ entities.
upgrade_history | Stores upgrade events to the GenomeQuest platform

SEQUENCES - used to increment primary keys
accounting_group_sequence | A sequence to increment the primary key for the accounting_group table
alert_sequence | A sequence to increment the primary key for the alert table
comparison_sequence | A sequence to increment the primary key for the comparison table
folder_sequence | A sequence to increment the primary key for the folder table
resource_sequence A sequence to increment the primary key for the gq_resource table
seqdb_sequence A sequence to increment the primary key for both the physical_seqdb table and the virtual_seqdb table, preserving the uniqueness of the primary key space across the union of these tables.
sharepriv_sequence A sequence to increment the primary key for the sharepriv table, a legacy GQ5 table that can be ignored.
workflow_sequence A sequence to increment the primary key for the workflow table.

Userdata

Where the Metadata layer stores the metadata that describes objects in the system, and the Hotdrive (and its "Local" counterpart) stores the Sequence Databases available in the system, the Userdata layer stores the actual doings of the user: the results of workflow runs and sequence searches.

Userdata: the location of the results of workflows run by users.

Userdata is available at the following location, assuming you installed GenomeQuest at $GQ_INSTALL:

$GQ_INSTALL/data/GQdata/userdata

The Userdata layer is implemented as a directory on the UNIX filesystem. Inside this directory you will find one directory for every user on the system who has ever run a workflow. The directory names are keyed not to their username, but rather to a unique id which is available in the Metadata layer's gq_user table. For instance:

% ls -l $GQ_INSTALL/data/GQdata/userdata | head
total 120
drwxrwxrwx  9 runner geneit 4096 Dec 22 04:50 09080310591634a76fb
drwxrwxrwx  8 runner geneit 4096 Feb  2 04:51 09080407051444a7815
drwxrwxrwx 11 runner geneit 4096 Feb  2 11:38 09080511494254a79aa
drwxrwxrwx 12 runner geneit 4096 Jan 15 05:40 09081105150664a8136
drwxrwxrwx 10 runner geneit 4096 Jan 19 05:21 09081114472374a81bc
drwxrwxrwx  6 runner geneit 4096 Dec  7 15:25 09081814025484a8aec
drwxrwxrwx  6 runner geneit 4096 Dec  7 15:16 090818173145124a8b1
drwxrwxrwx  3 runner geneit 4096 Aug 21 14:40 090821144013134a8ee
drwxrwxrwx 10 runner geneit 4096 Jan 25 16:10 090825115239144a940
% 

In the Metadata layer there is a mapping from these directory names to gq_users. It is found in the gq_user table:

mysql> describe gq_user;
+----------------------------+--------------+------+-----+---------+-------+
| Field                      | Type         | Null | Key | Default | Extra |
+----------------------------+--------------+------+-----+---------+-------+
| gq_user_id                 | int(11)      | NO   | PRI | NULL    |       | 
| old_unique_id              | varchar(48)  | YES  | UNI | NULL    |       | 
| login_name                 | varchar(64)  | NO   | UNI | NULL    |       | 
| directory_name             | varchar(128) | NO   | UNI | NULL    |       | 
| first_name                 | varchar(64)  | NO   |     | NULL    |       | 
| last_name                  | varchar(64)  | NO   |     | NULL    |       | 
| email                      | varchar(64)  | YES  |     | NULL    |       | 
| active                     | tinyint(1)   | NO   |     | NULL    |       | 
| anonymous                  | tinyint(1)   | NO   |     | NULL    |       | 
| password                   | varchar(64)  | YES  |     | NULL    |       | 
| user_profile_name          | varchar(128) | YES  |     | NULL    |       | 
| accounting_group_id        | int(11)      | NO   | MUL | NULL    |       | 
| user_class_id              | int(11)      | NO   |     | NULL    |       | 
| expiration_date            | int(11)      | YES  |     | NULL    |       | 
| creation_date              | int(11)      | YES  |     | NULL    |       | 
| last_activity              | int(11)      | YES  |     | NULL    |       | 
| access_key                 | varchar(128) | YES  | UNI | NULL    |       | 
| access_key_creation_date   | int(11)      | YES  |     | NULL    |       | 
| access_key_expiration_date | int(11)      | YES  |     | NULL    |       | 
| is_ppu                     | tinyint(1)   | NO   |     | 0       |       | 
| is_progressive             | tinyint(1)   | NO   |     | 1       |       | 
| current_login_time         | timestamp    | YES  |     | NULL    |       | 
| previous_login_time        | timestamp    | YES  |     | NULL    |       | 
+----------------------------+--------------+------+-----+---------+-------+
23 rows in set (0.00 sec)

The two fields of interest are login_name and directory_name. For instance:

mysql> select login_name, directory_name from gq_user where login_name = 'test1';
+------------+---------------------+
| login_name | directory_name      |
+------------+---------------------+
| test1      | 090904083727234aa10 | 
+------------+---------------------+
1 row in set (0.00 sec)

So to view the actual workflow results for the user test1, you would run:

% cd $GQ_INSTALL/data/GQdata/userdata/090904083727234aa10
% ls -l
drwxrwxrwx  3 runner geneit 4096 Feb  2 12:23 1002021216071124561
drwxrwxrwx 10 runner geneit 4096 Feb  4 11:07 seqdbsearch
drwxr-xr-x 31 runner geneit 4096 Feb  4 16:04 workflow

Mapping Between Userdata and Metadata

It can be tedious to open the Metadata relational database every time you want to find a user's userdata directory. Sometimes you know the directory but not the login name; sometimes you know the login name but not the directory. On our systems we built two very simple scripts, which we offer here for you to deploy on your own machine, to map quickly between the two without opening a mysql console.

I Know The Directory

Copy the following code into a file, make it executable, and make it available in your PATH variable:

#!/bin/bash
# Usage: ud-i-know-dir <mysql-database> <directory_name>

mysql -u runner --password='<the-password>' --execute="select login_name from gq_user where directory_name='$2'" "$1"

If you saved this as a file ud-i-know-dir, you would execute this as:

% ud-i-know-dir gqdb 090904083727234aa10
+------------+
| login_name |
+------------+
| test1      | 
+------------+
%

In this case, the identifier gqdb is the mysql database name associated with your installation of GQ.

I Know The Login

Copy the following code into a file, make it executable, and make it available in your PATH variable:

#!/bin/bash
# Usage: ud-i-know-login <mysql-database> <login_name>

mysql -u runner --password='<the-password>' --execute="select directory_name from gq_user where login_name='$2'" "$1"

If you saved this as a file ud-i-know-login, you would execute this as:

% ud-i-know-login gqdb test1
+---------------------+
| directory_name      |
+---------------------+
| 090904083727234aa10 | 
+---------------------+
%

In this case, the identifier gqdb is the mysql database name associated with your installation of GQ.
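Both scripts above paste their second argument straight into the SQL string, so a value containing a quote would change the query. The following is a minimal sketch of one way to guard against that, not part of GenomeQuest: the function name gq_lookup_query and the set of characters it accepts are assumptions made for illustration. It only builds the query text; running it is still left to mysql.

```shell
# Hypothetical helper (not shipped with GenomeQuest).
# gq_lookup_query MODE VALUE
#   MODE "dir"   : VALUE is a directory_name, the query returns login_name
#   MODE "login" : VALUE is a login_name, the query returns directory_name
gq_lookup_query() {
    local mode value want have
    mode=$1
    value=$2
    case $mode in
        dir)   want=login_name;     have=directory_name ;;
        login) want=directory_name; have=login_name ;;
        *)     echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # Assumption: legitimate values contain only letters, digits, '.', '_',
    # '@' and '-'. Anything else (quotes, spaces, ...) is refused.
    case $value in
        ''|*[!A-Za-z0-9._@-]*)
            echo "refusing suspicious value: $value" >&2
            return 1 ;;
    esac
    printf "select %s from gq_user where %s='%s'\n" "$want" "$have" "$value"
}
```

With the function loaded in your shell, the lookup for test1 would then read:

% mysql -u runner --password='<the-password>' --execute="$(gq_lookup_query login test1)" gqdb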

Contents of Userdata Directories

Let's have a look inside a given user's userdata directory. Remember, we won't expect to find their Sequence Databases here (those are stored in Local):

% ls -l $GQ_INSTALL/data/GQdata/userdata/090904083727234aa10
total 12
drwxrwxrwx  3 runner geneit 4096 Feb  2 12:23 1002021216071124561
drwxrwxrwx 10 runner geneit 4096 Feb  4 11:07 seqdbsearch
drwxr-xr-x 31 runner geneit 4096 Feb  4 16:04 workflow

There can be a variety of files and directories here but we concern ourselves only with the directory called workflow. All of the other directories either represent some transient state of the user's interactions with the system, or they represent sequence searches from older versions of GenomeQuest (pre 6.0).

Let's take a look at the workflow directory.

% cd $GQ_INSTALL/data/GQdata/userdata/090904083727234aa10/workflow
% ls -lt
total 116
drwxrwxrwx 3 runner geneit 4096 Feb  8 21:16 1690
drwxrwxrwx 3 runner geneit 4096 Feb  4 16:05 1714
drwxrwxrwx 3 runner geneit 4096 Feb  4 11:10 1572
drwxrwxrwx 3 runner geneit 4096 Jan 29 17:40 1612
drwxrwxrwx 6 runner geneit 4096 Jan 29 12:30 1588
drwxrwxrwx 7 runner geneit 4096 Jan 28 19:23 1605
drwxrwxrwx 3 runner geneit 4096 Jan 28 19:05 1604
drwxrwxrwx 7 runner geneit 4096 Jan 27 19:38 1584
drwxrwxrwx 7 runner geneit 4096 Jan 27 19:32 1582
drwxrwxrwx 6 runner geneit 4096 Jan 27 19:31 1571
drwxrwxrwx 7 runner geneit 4096 Jan 27 19:28 1583
drwxrwxrwx 5 runner geneit 4096 Jan 27 18:45 1577
drwxrwxrwx 3 runner geneit 4096 Jan 27 18:41 1576
drwxrwxrwx 4 runner geneit 4096 Jan 27 18:38 1575
drwxrwxrwx 4 runner geneit 4096 Jan 27 18:34 1574
drwxrwxrwx 7 runner geneit 4096 Jan 27 18:31 1570
drwxrwxrwx 5 runner geneit 4096 Jan 27 18:26 1573
drwxrwxrwx 3 runner geneit 4096 Jan 27 17:23 1569
drwxrwxrwx 7 runner geneit 4096 Jan 21 12:02 1546
drwxrwxrwx 3 runner geneit 4096 Jan 20 21:37 1534
drwxrwxrwx 3 runner geneit 4096 Jan  4 14:08 1388
drwxrwxrwx 4 runner geneit 4096 Dec 31 14:46 1358
drwxrwxrwx 4 runner geneit 4096 Dec 31 14:45 1357
drwxrwxrwx 4 runner geneit 4096 Dec 31 14:39 1354
drwxrwxrwx 5 runner geneit 4096 Dec 30 11:46 1348
drwxrwxrwx 7 runner geneit 4096 Dec 30 11:04 1350
drwxrwxrwx 2 runner geneit 4096 Dec 30 10:42 1347
drwxrwxrwx 2 runner geneit 4096 Nov 23 11:55 1122
drwxrwxrwx 2 runner geneit 4096 Sep 14 16:08 676

Each of the directories in this area corresponds to a single workflow run by the user. What are these numbers? They map to rows in the Metadata layer (the MySQL or Oracle database), specifically to rows of a table called workflow. Let's have a closer look:

mysql> describe workflow;
+--------------------+--------------+------+-----+----------+-------+
| Field              | Type         | Null | Key | Default  | Extra |
+--------------------+--------------+------+-----+----------+-------+
| workflow_id        | int(11)      | NO   | PRI | NULL     |       | 
| gq_user_id         | int(11)      | NO   |     | NULL     |       | 
| directory_name     | varchar(48)  | NO   |     | NULL     |       | 
| type               | varchar(32)  | NO   |     | NULL     |       | 
| text_label         | varchar(255) | YES  |     | NULL     |       | 
| description        | text         | YES  |     | NULL     |       | 
| total_nbresults    | int(11)      | YES  |     | NULL     |       | 
| submit_time        | int(11)      | YES  |     | NULL     |       | 
| finish_time        | int(11)      | YES  |     | NULL     |       | 
| status             | varchar(64)  | YES  |     | NULL     |       | 
| input1_id          | int(11)      | YES  |     | NULL     |       | 
| input1_type        | char(64)     | YES  |     | NULL     |       | 
| input2_id          | int(11)      | YES  |     | NULL     |       | 
| input2_type        | char(64)     | YES  |     | NULL     |       | 
| version            | varchar(3)   | YES  |     | NULL     |       | 
| email              | varchar(128) | YES  |     | NULL     |       | 
| disk_usage         | int(11)      | YES  |     | NULL     |       | 
| credit_cost        | int(11)      | NO   |     | 0        |       | 
| credit_paid        | int(11)      | NO   |     | NULL     |       | 
| credit_code        | varchar(16)  | YES  |     | NULL     |       | 
| parent_workflow_id | int(11)      | YES  |     | NULL     |       | 
| base_workflow_id   | int(11)      | NO   |     | NULL     |       | 
| share_level        | varchar(16)  | NO   |     | :private |       | 
+--------------------+--------------+------+-----+----------+-------+

Each directory name here matches a value of the field workflow.workflow_id in the Metadata layer. The workflow_id is unique across all users, so you can query on it directly, for instance:

mysql> select workflow_id,directory_name,type,from_unixtime(submit_time), status from workflow where workflow_id=1690;
+-------------+----------------+---------------+----------------------------+----------+
| workflow_id | directory_name | type          | from_unixtime(submit_time) | status   |
+-------------+----------------+---------------+----------------------------+----------+
|        1690 | 1690           | GqWfSeqSearch | 2010-02-04 11:08:26        | FINISHED | 
+-------------+----------------+---------------+----------------------------+----------+
1 row in set (0.00 sec)

We can see that in the directory 1690, we should find the results of a Sequence Search workflow (GqWfSeqSearch). Let's have a closer look:

% cd 1690
% ls -l
total 136
-rw-r--r-- 1 runner geneit  2213 Feb  4 11:08 body.script.comparison.sh
-rw-r--r-- 1 runner geneit     0 Feb  4 11:08 comparison.has.finished
-rw-r--r-- 1 runner geneit 13588 Feb  4 11:08 comparison.log
lrwxrwxrwx 1 runner geneit    82 Feb  4 11:08 lastjob -> /disk/GQtest/data/GQdata/userdata/090904083727234aa10/workflow/1690/.tmp_res.21373
lrwxrwxrwx 1 runner geneit    80 Feb  4 11:08 lastquery -> /disk/GQtest/data/GQdata/userdata/090904083727234aa10/workflow/1690/query.db.ind
lrwxrwxrwx 1 runner geneit    82 Feb  4 11:08 lastsubject -> /disk/GQtest/data/GQdata/userdata/090904083727234aa10/workflow/1690/subject.db.ind
-rw-r--r-- 1 runner geneit  2070 Feb  4 11:08 lspcalc.progress
-rw-r--r-- 1 runner geneit   378 Feb  4 11:08 lspextend.progress
-rw-r--r-- 1 runner geneit     7 Feb  4 11:08 overall.progress
-rw-r--r-- 1 runner geneit    24 Feb  4 11:08 query.db.ctb
-rw-r--r-- 1 runner geneit   325 Feb  4 11:08 query.db.ind
-rw-r--r-- 1 runner geneit    22 Feb  4 11:08 query.db.seq
-rw-r--r-- 1 runner geneit 24000 Feb  4 11:08 res_0
-rw-r--r-- 1 runner geneit   823 Feb  4 11:08 res_0.resinfo
-rw-r--r-- 1 runner geneit 10000 Feb  4 11:08 res_0.tab
-rw-r--r-- 1 runner geneit   618 Feb  4 11:08 res.resinfo
-rw-r--r-- 1 runner geneit  4564 Feb  4 11:08 script.comparison.sh
-rw-r--r-- 1 runner geneit  9189 Feb  4 11:08 script.comparison.sh.params
-rw-r--r-- 1 runner geneit  6552 Feb  4 11:08 script.comparison.sh.params.json
-rw-r--r-- 1 runner geneit     6 Feb  4 11:08 script.comparison.sh.pid
-rw-r--r-- 1 runner geneit     0 Feb  4 11:08 script.comparison.sh.stderr
-rw-r--r-- 1 runner geneit     0 Feb  4 11:08 script.comparison.sh.stdout
-rw-r--r-- 1 runner geneit  1187 Feb  4 11:08 subject.db.ind

These are all of the "file droppings" associated with that workflow. As you write your own workflows, the files they produce will be placed in the appropriate userdata directory under the appropriate workflow_id.
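Going the other way, from a file path back to the metadata, is a matter of picking the two identifying components out of the path. The sketch below assumes the layout shown in the listings above (userdata/<directory_name>/workflow/<workflow_id>/...); the function name gq_parse_userdata_path is hypothetical and not part of GenomeQuest.

```shell
# Hypothetical helper: print "<directory_name> <workflow_id>" for a path
# inside a workflow run directory; return non-zero for any other path.
gq_parse_userdata_path() {
    local path rest user_dir workflow_id
    path=$1
    case $path in
        */userdata/*/workflow/*) ;;
        *) return 1 ;;
    esac
    rest=${path#*/userdata/}      # drop everything up to and including userdata/
    user_dir=${rest%%/*}          # first component: the user's directory_name
    rest=${rest#*/}               # drop that component
    case $rest in
        workflow/*) rest=${rest#workflow/} ;;
        *) return 1 ;;
    esac
    workflow_id=${rest%%/*}       # next component: the workflow_id
    case $workflow_id in
        ''|*[!0-9]*) return 1 ;;  # workflow ids are numeric in the listings above
    esac
    printf '%s %s\n' "$user_dir" "$workflow_id"
}
```

For the lastquery link target shown above, this prints the pair 090904083727234aa10 1690, which can then be looked up in the gq_user and workflow tables respectively.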

Documentation of these files is outside the scope of this document. See How Do I Make My Own Workflows for more information.

Workflow Architecture

Workflows. GenomeQuest's version of the "App Store." This is where you build arbitrary applications and integrate them into the GenomeQuest architecture. The Workflow Architecture is totally open - you can use whatever programming language you want for most of the workflow. What you get from GenomeQuest is:

  • management of all of the underlying sequence data
  • the powerful GenomeQuest Engine for doing massive sequence comparison, analysis, and search
  • the presence of a compute array behind the scenes
  • a simple mechanism to publish results to users
  • sharing of your workflow results between users, like anything else in the system

See How Do I Write My Own Workflows for the details.

Where Does Workflow Code Go?

There are two kinds of workflows. System Workflows are workflows that are developed and supported by GenomeQuest. They live in $GQ_INSTALL/web/GQ/system_plugins/Workflows.

Your workflows should be placed in $GQ_INSTALL/web/GQ/plugins/Workflows. Let's have a look.

% ls -l $GQ_INSTALL/web/GQ/plugins/Workflows
total 4
lrwxrwxrwx 1 runner geneit   46 Oct 20 10:59 GqWfChipseq -> /home/runner/heush/Services/trunk/GqWfChipseq/
drwxr-xr-x 4 runner geneit 4096 Jul 31  2009 GqWfVariant.old

In this directory we see two different workflows. The first, GqWfChipseq, is a working copy of a ChIP-Seq workflow. (Indeed, we can see that it is a symbolic link to another directory. Don't let that worry you; it's just a mechanism that lets you develop the workflow wherever you want, as long as there is a link to it in $GQ_INSTALL/web/GQ/plugins/Workflows.) The second workflow is an old version of a Variant workflow.
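The symbolic-link mechanism can be sketched as a small shell function. This is an illustration only: the function name gq_link_workflow and the example workflow name GqWfDemo are hypothetical, and the sketch covers nothing beyond creating the link itself.

```shell
# Hypothetical helper: link a development directory into a plugins directory.
# gq_link_workflow SOURCE_DIR PLUGINS_DIR
gq_link_workflow() {
    local src dest_dir
    src=$1
    dest_dir=$2
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "no such directory: $dest_dir" >&2; return 1; }
    # Resolve to an absolute path so the link does not depend on where
    # this function was called from.
    src=$(cd "$src" && pwd) || return 1
    # The link takes the name of the source directory.
    ln -s "$src" "$dest_dir/$(basename "$src")"
}
```

For example, assuming your code lives in ~/dev/GqWfDemo:

% gq_link_workflow ~/dev/GqWfDemo $GQ_INSTALL/web/GQ/plugins/Workflows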

How Do I Write My Own Workflow?

See How Do I Write My Own Workflows for the details.

I'm Confused. What's the Relationship Between Workflow Code, Userdata, Local, and Hotdrive?

Remember, the Hotdrive is where the public reference sequence is kept, in GQ Engine format, at $GQ_INSTALL/data/GQdata/content/hotdrive. We keep it up to date using GenomeCast. You never have to worry about this.

Local is where users' sequence data is kept, at $GQ_INSTALL/data/GQdata/content/local. You can add data to this using the GenomeQuest Content Manager. When users use the web system to upload their own sequence data, the GenomeQuest Content Manager is automatically called in the background for them.

Workflow code is placed in $GQ_INSTALL/web/GQ/plugins/Workflows. Each subdirectory should correspond to a single workflow.

Whenever a workflow is run, it creates a new directory in Userdata, inside the directory of the user who ran it, under $GQ_INSTALL/data/GQdata/userdata. All of the file droppings of your workflow will be placed in this directory.

The key insight to take away is that your workflow code will live in one place, and it will run in another place - in userdata.
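The four locations, and the run directory that results from combining them, can be written down as a pair of shell functions. This is a sketch that only restates the paths given in this document; the function names gq_location and gq_workflow_run_dir are hypothetical.

```shell
# Hypothetical helpers restating the directory layout described above.
# gq_location INSTALL_DIR WHAT
gq_location() {
    local install what
    install=$1
    what=$2
    case $what in
        hotdrive) echo "$install/data/GQdata/content/hotdrive" ;;  # public reference data
        local)    echo "$install/data/GQdata/content/local" ;;     # users' sequence data
        code)     echo "$install/web/GQ/plugins/Workflows" ;;      # where your workflow code lives
        userdata) echo "$install/data/GQdata/userdata" ;;          # where workflows run
        *)        echo "unknown location: $what" >&2; return 1 ;;
    esac
}

# gq_workflow_run_dir INSTALL_DIR USER_DIRECTORY_NAME WORKFLOW_ID
gq_workflow_run_dir() {
    local install user_dir workflow_id
    install=$1
    user_dir=$2
    workflow_id=$3
    echo "$(gq_location "$install" userdata)/$user_dir/workflow/$workflow_id"
}
```

For the user test1 and the workflow 1690 used earlier, gq_workflow_run_dir "$GQ_INSTALL" 090904083727234aa10 1690 prints the directory whose contents were listed in the Userdata section.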

Plugin Architecture

Unlike workflows, which allow arbitrary levels of customization, plugins run either in the Sequence Database Browser or the Result Database Browser context. They iterate over sequences or results that have been selected by a user and perform some operation on each of them.

When you develop a plug-in, it becomes available in the Applications menu of the Sequence or Result Database Browser. Indeed, you can decide whether you want it to be available in a Sequence context, a Result context (remember - Results are databases of alignments!), or both.

When Would I Write a Plug-in?

Whenever you want to allow the user to pick a subset of sequences or alignments and process them somehow. Examples:

  • clustalw these sequences
  • analyze these alignments to see if they overlap
  • export these sequences/alignments in a particular format
  • plug-in some other application to run on a set of sequences or alignments

We use the plug-in architecture primarily to provide export services (GFF, Genbank/EMBL, Gbrowse, UCSC, etc), to add analytical functionality (Clustalw, EMBOSS, etc.) and to create useful post-analyses for customers (e.g., combining all of the SNPs - sequences, remember - into a table which is grouped by position and shows the haplotype for each of a series of lines or patients).

When Would It Be Better To Write a Workflow?

Plug-ins only run inside the Sequence Database Browser or the Result Database Browser. They have a limited user interface and they don't have the support of the My GenomeQuest page to show things like progress, success/failure, and so on. So if you are building a large application that allows users great specificity and takes hours to run, make a workflow. If you are building a quick way to analyze or export sequences or alignments, use a plugin.

Where does Plug-In Code Live?

There are two kinds of plugins. System Plugins are plugins that are developed and supported by GenomeQuest. They live in $GQ_INSTALL/web/GQ/system_plugins/Exports.

Your plugins should be placed in $GQ_INSTALL/web/GQ/plugins/Exports. Let's have a look.

% ls -l $GQ_INSTALL/web/GQ/plugins/Exports
total 4
lrwxrwxrwx 1 runner geneit   48 Dec 22 15:40 GqAppExtractNoHits -> /home/runner/richard/Services/GqAppExtractNoHits
lrwxrwxrwx 1 runner geneit   49 Dec 17 11:24 GqAppExtractQueries -> /home/runner/richard/Services/GqAppExtractQueries
lrwxrwxrwx 1 runner geneit   50 Dec 21 14:41 GqAppExtractSubjects -> /home/runner/richard/Services/GqAppExtractSubjects
lrwxrwxrwx 1 runner geneit   41 Sep 29 13:11 GqAppHapTbl -> /home/runner/richard/Services/GqAppHapTbl

In this directory we see four different plugins. Each plugin is automatically activated by the system as soon as you place the directory containing the code into this directory. Indeed, these are actually just symbolic links to other directories - this allows developers to develop in whichever directory they want.

How Do I Write My Own Plugin?

See Plug-in My Application for detailed instructions.

Using The Array

Depending on the type of installation, you will generally have an array of compute nodes available to you to launch GQ Engine sequence comparisons. Here's what you don't need to worry about:

  • locality of data - we automatically place the entire hotdrive on the local disks of each compute node. We also NFS-mount local on each node. So each node has everything in the hotdrive and in local.
  • load balancing - we use Sun Grid Engine to queue up jobs
  • splitting searches - we automatically split the search up into smaller pieces and distribute each small piece on the right node
  • rejoining results - we automatically rejoin each smaller search on the head node that launched the job

So all you have to do to run a job on the array is use our array-aware comparison engine:

lspcalc.THA

Full documentation of this application is available on the command line via

% lspcalc.THA -help

Examples using lspcalc.THA

The following is an example of an 'all against all' command (GB_PRI compared with GB_ROD) on the compute nodes:

% lspcalc.THA -progress -shadowrole both -- -t 8.2 -M bl2 -O '[ -E 10 ]' -mn NUC.3.1 -db GB_PRI[] -db GB_ROD[] -o res
  • -progress will create a progress file that updates as the search runs
  • -shadowrole both tells the system that the actual databases exist locally on the compute nodes (these are both hotdrive databases so they aren't NFS-mounted like local is)
  • -- everything past this double-dash is what is passed to the underlying lspcalc command on each node
  • -t 8.2 tells the system to use 8 jobs, 2 threads per job
  • -M bl2 tells the nodes to use the Blast2 algorithm
  • -O '[ -E 10 ]' passes options of your choice through to Blast2, in this case -E 10
  • -mn NUC.3.1 uses the NUC.3.1 matrix for Blast2
  • -db GB_PRI[] says that the subject database is the primate division of Genbank on the hotdrive
  • -db GB_ROD[] says that the query database is the rodent division of Genbank on the hotdrive
  • -o res reassembles the results into a Result Database called res

Let's do the same, but with both the subject and query databases split (8 chunks for the subject, 3 for the query), which will generate 8 x 3 = 24 jobs:

% lspcalc.THA -progress -splitrole both -shadowrole both -- -t 8.3.2 -M bl2 -O '[ -E 10 ]' -mn NUC.3.1 -db GB_PRI[] -db GB_ROD[] -o res
  • -splitrole both tells the system to split both the query and the subject databases into smaller pieces
  • -t 8.3.2 tells the system to split the subject into 8 dbs and the query into 3, and to run each job with 2 threads
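The arithmetic behind the two -t forms used above can be sketched as a shell function. This is an illustration of the two examples only, not a reimplementation of how lspcalc.THA parses its arguments: the function name gq_t_summary is hypothetical, and any other form of -t (such as a single number) is deliberately rejected because its meaning is not described here.

```shell
# Hypothetical helper: summarize a -t value of the form JOBS.THREADS or
# SUBJECT_CHUNKS.QUERY_CHUNKS.THREADS, as used in the two examples above.
gq_t_summary() {
    local spec old_ifs
    spec=$1
    case $spec in
        ''|*[!0-9.]*|.*|*.|*..*)
            echo "malformed -t value: $spec" >&2
            return 1 ;;
    esac
    # Split on '.' into the positional parameters of this function.
    old_ifs=$IFS
    IFS=.
    set -- $spec
    IFS=$old_ifs
    case $# in
        2) echo "jobs=$1 threads=$2" ;;
        # Both databases split: one job per (subject chunk, query chunk) pair.
        3) echo "jobs=$(( $1 * $2 )) threads=$3" ;;
        *) echo "unsupported -t form: $spec" >&2; return 1 ;;
    esac
}
```

Called as gq_t_summary 8.2 it reports 8 jobs with 2 threads each, and gq_t_summary 8.3.2 reports the 24 jobs mentioned above, again with 2 threads each.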

Some Suggestions On Parameters for NGS

ILLUMINA: short reads, constant quality.

lspcalc.THA -splitrole query -qs all.q -jobname 'my_job' --
     -t 16.4 -M mapnc,kerr
     -O '[ -errs 2 -extend -fltThreshold 6 -fltCut 300 -fltOverlap 32 -r ]'
     -db db1
     -db db2
     -o res

ABI: longer (but still short) reads; quality decreases at the extremities.

lspcalc.THA -splitrole query -qs all.q -jobname 'my_job' --
     -t 8 -M mapnc,sw
     -O '[ -fltThreshold 6 -fltCut 300 -r ]'
     -mn NUC.2.1 …
         // Modified NCBI matrix is better for short reads
     -scut 28

454: longer reads, local alignments.

lspcalc.THA -splitrole query -qs all.q -jobname 'my_job' --
     -t 8 -M mapnc,bl2
     -O '[ -fltThreshold 10 -fltCut 5000 -fltOverlap 200 -r ]' …
         // another way to specify the filter (old <LEN,PERID>)
     -scut 30

More on this lspmul functionality

You can read more about lspmul and this principle of filtering sequence pairs here.

More on the Array Computing

The GQ Engine Primer has a chapter on array computing.

GenomeCast

Downloading and converting terabytes of reference databases can be a good exercise, but keeping the references up to date is generally a pain and adds little value. GenomeQuest provides a database update service (GenomeCast™), a simple Perl script that brings all reference databases to your organization, converted with carefully chosen, indexed annotation fields. GenomeCast handles updates for you, at the frequency that you choose. The result is what we call the Hotdrive: with GenomeCast, you get a real-time mirror image of the Hotdrive that GenomeQuest continuously builds on its servers.
