Welcome to the GenomeQuest Documentation Wiki

ContentManagerReferenceManual

GenomeQuest Content Manager is a Unix command-line toolkit that lets administrators add, delete, and update local GenomeQuest Engine sequence databases in their GenomeQuest installation.

Full understanding of this Reference Manual may require that you familiarize yourself with the GenomeQuest System Concepts.

Content Manager Technical Overview

The GenomeQuest application provides sequence and keyword search functionalities against genetic sequence databases. The sequence databases need to be formatted and indexed in a specific way so that they are compatible with the GenomeQuest Engine. Preparation of the databases involves these main steps:

  1. Convert the sequence database from flat file FASTA/FASTQ/EMBL+ format into the GenomeQuest Engine format. This enables the database to be searchable via sequence search (e.g. BLAST) and keyword search.
  2. Index the annotation fields of the GenomeQuest Engine database just created. This enables faster keyword search.
  3. Configure the sequence database meta information files so that the GenomeQuest web application recognizes the database.
  4. Activate the newly added/updated database so that it is usable through the GenomeQuest web interface.
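
As a rough sketch, the four steps above correspond to the admin_db.pl actions convert, index, configure, and activate documented later in this manual. The function below only prints the commands (a dry run) and does not execute anything. The function name, the base directory /opt/gq, and the input path are hypothetical placeholders, and EMBL+ is assumed as the input format.

```shell
#!/bin/sh
# Dry run: print one admin_db.pl command per preparation step.
# Nothing is executed; all argument values are site-specific placeholders.
print_prepare_steps() {
    gq_base=$1   # GenomeQuest installation directory
    db_file=$2   # input flat file or directory
    db_id=$3     # database definition id, e.g. LOCAL_MYDB
    db_type=$4   # NUC or PRT
    tool="$gq_base/data/GQdata/external_apps/gqadmindb/admin_db.pl"

    # Step 1: convert the flat file into GenomeQuest Engine format.
    echo "$tool --action convert --gq_base_dir $gq_base --db_file $db_file --db_id $db_id --db_format EMBL+ --db_type $db_type"
    # Step 2: index the annotation fields.
    echo "$tool --action index --gq_base_dir $gq_base --db_id $db_id"
    # Step 3: generate or update the configuration files.
    echo "$tool --action configure --gq_base_dir $gq_base --db_id $db_id --db_type $db_type"
    # Step 4: activate the database for the web interface.
    echo "$tool --action activate --gq_base_dir $gq_base --db_id $db_id"
}

print_prepare_steps /opt/gq /tmp/mydb.embl LOCAL_MYDB NUC
```

In practice the add action chains these steps for you; printing them separately is mainly useful for reviewing what will run.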

While the toolkit attempts to automate the administration of local databases as much as possible, the administrator must know local site-specific information for the commands to complete successfully.

Command Line Overview

The Content Manager is packaged into a script called admin_db.pl, which is installed at the following location assuming you have installed GenomeQuest at $GQ_INSTALL:

$GQ_INSTALL/data/GQdata/external_apps/gqadmindb/admin_db.pl

To get full-blown help in a man-page format for the Content Manager, try:

$GQ_INSTALL/data/GQdata/external_apps/gqadmindb/admin_db.pl -h

admin_db.pl --action <convert,index,configure,push,activate,add,delete,update,list,showtree,showfields,lookupfields>
  --db_file <input_db_file_or_dir>
  --gq_base_dir <GenomeQuest_installation_dir>
  --db_id <GenomeQuest_database_definition_id, e.g., "LOCAL_MYDB">
  --db_format <input_seq_file_format, e.g., "EMBL+", "FASTA">
  --map <e.g. "ID,AC,DE"> --db_type <NUC,PRT, PRO(same as PRT)>
  --db_name <database name> --index_fields <e.g. "ID,AC,DE">
  --target <hotdrive,local,all> --release <release number>
  --gq_fields <e.g. "ID,AC,DE,OS,CC,W1" or "ALL">
  --norm_pn --index_dataspace <e.g. 6>
  --pattern <e.g., "number*fragment">
  --owner <login name of a GenomeQuest user who should own this db>
  --access <the access level for the database. e.g. "private", "public", "group">

Examples

All examples assume that you have installed GenomeQuest at the location $GQ_INSTALL. Inside this directory you should have at least the following two subdirectories:

% ls -ls $GQ_INSTALL
total 28
8 drwxr-xr-x 77 runner geneit 4096 Feb 12 01:54 data
8 drwxr-xr-x 75 runner geneit 4096 Feb 12 01:54 web
...
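
Before running any action, it can help to confirm that the base directory has the expected layout. The following is a minimal sketch; the function name check_gq_install is hypothetical and it only tests for the two subdirectories named above.

```shell
#!/bin/sh
# Return 0 when the given directory contains both "data" and "web"
# subdirectories, and 1 (with a message on stderr) otherwise.
check_gq_install() {
    base=$1
    for sub in data web; do
        if [ ! -d "$base/$sub" ]; then
            echo "missing directory: $base/$sub" >&2
            return 1
        fi
    done
    return 0
}
```

A typical call would be: check_gq_install "$GQ_INSTALL" && echo "layout looks OK"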

Show Fields

Show all currently allowed annotation fields, including GenomeQuest annotation fields and custom annotation fields.

Required fields: --gq_base_dir

% admin_db.pl --action showfields --gq_base_dir $GQ_INSTALL

Lookup Fields

Looks up annotation fields whose title (description) matches a pattern, within all currently allowed annotation fields.

Required fields: --gq_base_dir, --pattern

% admin_db.pl --action lookupfields --pattern "number*fragment" --gq_base_dir $GQ_INSTALL
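
To get a feel for how a shell wildcard such as "number*fragment" behaves, the sketch below matches a pattern against a field title using the shell's own case statement. This illustrates shell wildcard semantics only. Whether admin_db.pl matches the whole title, a substring, or ignores case is not stated here, so treat the whole-string, case-sensitive behavior below as an assumption. The function name title_matches is hypothetical.

```shell
#!/bin/sh
# Return 0 when the whole title matches the shell wildcard pattern.
# Supports *, ?, and [] as in ordinary shell globbing.
title_matches() {
    pattern=$1
    title=$2
    case $title in
        # The unquoted expansion is interpreted as a glob pattern here.
        $pattern) return 0 ;;
        *) return 1 ;;
    esac
}

if title_matches "number*fragment" "number of fragment"; then
    echo "match"
fi
```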

Add a Database

Adds a database to the Hotdrive or Local channel.

Required fields: --gq_base_dir, --db_file, --db_id, --db_format, --db_type

% admin_db.pl --action add --gq_base_dir $GQ_INSTALL --db_file <path-to-db-file-or-dir> \
     --db_id LOCAL_MYDB2 --db_format EMBL+ --db_type NUC --db_name dbName \
     --release 20100211 --gq_fields ':ALL' --index_fields "ID,DE,W1" --norm_pn \
     --owner admin --access public --prop tags.ini

Add chains the operations convert, configure, index, push (only on a cluster with local storage), and activate. For details of those chained operations, see the corresponding examples. If --db_name is omitted, --db_id is used as the database name. If --release is omitted, today’s date in YYYYMMDD format is used. --owner is optional and defaults to admin. --access is optional and defaults to private. Both --gq_fields and --index_fields are optional.
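
The defaulting rules above, together with the --db_id rules from Parameter Details (upper-casing and the LOCAL_ prefix), can be sketched as follows. This is an illustration of the documented behavior and not the script's actual code. The function names are hypothetical, and using the normalized id as the fallback database name is an assumption.

```shell
#!/bin/sh
# Upper-case a db_id and add the LOCAL_ prefix (with a warning) when missing.
normalize_db_id() {
    id=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case $id in
        LOCAL_*) ;;
        *)
            echo "warning: adding LOCAL_ prefix to $id" >&2
            id="LOCAL_$id"
            ;;
    esac
    printf '%s\n' "$id"
}

# Print the effective values for an add action.
# Arguments: db_id [db_name] [release] [owner] [access]; empty means omitted.
resolve_add_defaults() {
    db_id=$(normalize_db_id "$1")
    db_name=${2:-$db_id}             # falls back to the db_id
    release=${3:-$(date +%Y%m%d)}    # falls back to today's date, YYYYMMDD
    owner=${4:-admin}
    access=${5:-private}
    printf 'db_id=%s db_name=%s release=%s owner=%s access=%s\n' \
        "$db_id" "$db_name" "$release" "$owner" "$access"
}

resolve_add_defaults LOCAL_MYDB2 "" 20100211
```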

Update a Database

Updates a database in either the Hotdrive or Local channel.

Required fields: --gq_base_dir, --db_file, --db_id, --db_format

% admin_db.pl --action update --gq_base_dir $GQ_INSTALL --db_file <path-to-db-file-or-dir> \
     --db_id LOCAL_MYDB2 --db_format EMBL+ --db_type NUC --owner admin --access public

Similar to the add operation, but the definition and so-called "treeview" files remain unchanged. Today’s date is always used as the release number; any --release value supplied by the user is ignored. If the database has already been configured, the db type from the definition file is used. Otherwise, the db type is taken from --db_type.

If the type of a database needs to be changed, it is often easier to DELETE and ADD again.

Delete a Database

Deletes a database.

Required fields: --gq_base_dir, --db_id

% admin_db.pl --action delete --gq_base_dir $GQ_INSTALL --db_id LOCAL_MYDB2

The whole database directory $GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB2 will be deleted, and its entries in the metadata layer will be removed.

List Databases

List all active databases and their location (path_dbfilename) in either local, hotdrive, or all (both).

Required fields: --gq_base_dir

% admin_db.pl --action list --gq_base_dir $GQ_INSTALL

Show Tree

Shows the treeview of the requested databases in the console, for either local, hotdrive, or all (both).

Required fields: --gq_base_dir

% admin_db.pl --action showtree --gq_base_dir $GQ_INSTALL --target hotdrive

Convert into a Biofacet Format

For lower-level conversion work you would typically use lspbank directly. The Content Manager provides similar functionality because it also places the converted files in Local.

Required fields: --gq_base_dir, --db_file, --db_id, --db_format, --db_type

% admin_db.pl --action convert --gq_base_dir $GQ_INSTALL  --db_file <path-to-file-or-dir> \
     --db_id LOCAL_MYDB --db_format EMBL+ --db_type NUC --db_name dbName \
     --release 17

If --release is omitted, today’s date in YYYYMMDD will be used. If --db_name is omitted, --db_id will be used as database name.
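
When --db_file points at a directory, files whose names start with a dot are skipped (see --db_file under Parameter Details). The sketch below previews which files would be picked up. It only looks at the top level of the directory, which is an assumption, since this manual does not say whether subdirectories are traversed. The function name list_input_files is hypothetical.

```shell
#!/bin/sh
# Print the input files for a --db_file value.
# A plain file is printed as-is. For a directory, regular files directly
# inside it are printed; dot-files are skipped because "*" does not match them.
list_input_files() {
    input=$1
    if [ -f "$input" ]; then
        printf '%s\n' "$input"
        return 0
    fi
    for f in "$input"/*; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Since all files in the directory should share the same sequence type and format, reviewing this list before running convert can catch stray files early.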

Configure a Database

Generates and/or updates various configuration files in the Hotdrive or Local channel for your database.

Required fields: --gq_base_dir, --db_id, --db_type

% admin_db.pl --action configure --gq_base_dir $GQ_INSTALL --db_id LOCAL_MYDB  --db_type NUC

This command automatically generates the following files if they do not exist:

$GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB/configuration/seqdb_definition.seqdbconf
$GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB/configuration/seqdb_instance.seqdbconf
$GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB/configuration/seqdb_link_definition.seqdbconf
$GQ_INSTALL/data/GQdata/content/local/configuration/local_treeview_nucleotide.seqdbconf
# (local_treeview_protein.seqdbconf for PRT database).

If they already exist, the definition and link definition files remain unchanged, but the instance and treeview files are updated.

A newly generated link definition file, seqdb_link_definition.seqdbconf, contains only the header and footer lines and needs to be edited manually.
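
To check the result of a configure run, the expected file locations can be derived from the base directory, database id, and type. The sketch below prints the four paths listed above for a database in Local. The function name expected_config_files is hypothetical, and only NUC and PRT/PRO are handled because the treeview file name for other types is not given here.

```shell
#!/bin/sh
# Print the configuration files that configure is expected to create
# for a database in the Local channel.
expected_config_files() {
    gq_base=$1
    db_id=$2
    db_type=$3
    local_dir="$gq_base/data/GQdata/content/local"

    # The treeview file is shared per sequence type.
    case $db_type in
        NUC) tree=local_treeview_nucleotide.seqdbconf ;;
        PRT|PRO) tree=local_treeview_protein.seqdbconf ;;
        *) echo "unsupported db_type in this sketch: $db_type" >&2; return 1 ;;
    esac

    for name in seqdb_definition seqdb_instance seqdb_link_definition; do
        printf '%s/%s/configuration/%s.seqdbconf\n' "$local_dir" "$db_id" "$name"
    done
    printf '%s/configuration/%s\n' "$local_dir" "$tree"
}

expected_config_files /opt/gq LOCAL_MYDB NUC
```

Piping the output through a loop with test -f is a quick way to confirm that each file exists after configure completes.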

Index a Database

Builds indices on annotation fields that you specify.

Required fields: --gq_base_dir, --db_id

% admin_db.pl --action index --gq_base_dir $GQ_INSTALL \
     --db_id LOCAL_MYDB \
     --index_fields "ID,AC,DE" \
     --release 20061113 \
     --index_dataspace 6

The above command indexes release 20061113 of database LOCAL_MYDB. Existing indexes are removed prior to indexing. So that indices can be built before the CONFIGURE action, indexing operates on the release number given, which is not necessarily the active instance. Use --action list to see all active databases.

The --index_dataspace option lets you increase the amount of disk space used for indices. Typically the default of 3 is sufficient.

Do not index NGS databases. There is no value in building indices on the ID field of NGS databases, and the disk space required will be enormous.
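
If indexing runs out of dataspace, the advice under Parameter Details is simply to rerun with a larger --index_dataspace. The sketch below wraps that advice in a retry loop. It assumes that a failed indexing run is visible as a non-zero exit status, which this manual does not confirm. The names index_with_retry and run_index are hypothetical, as is the doubling strategy.

```shell
#!/bin/sh
# Example runner: one indexing attempt with the given dataspace value.
# It is only defined here, not invoked; adjust the arguments for your site.
run_index() {
    "$GQ_INSTALL/data/GQdata/external_apps/gqadmindb/admin_db.pl" --action index \
        --gq_base_dir "$GQ_INSTALL" --db_id LOCAL_MYDB \
        --index_fields "ID,AC,DE" --index_dataspace "$1"
}

# Call the runner with a growing dataspace until it succeeds or the
# maximum is exceeded. Arguments: runner [start] [max].
index_with_retry() {
    runner=$1
    dataspace=${2:-3}   # 3 is the documented default
    max=${3:-12}        # arbitrary upper bound for this sketch
    while [ "$dataspace" -le "$max" ]; do
        if "$runner" "$dataspace"; then
            echo "indexed with --index_dataspace $dataspace"
            return 0
        fi
        dataspace=$((dataspace * 2))
    done
    echo "giving up: --index_dataspace would exceed $max" >&2
    return 1
}
```

A call such as index_with_retry run_index 3 12 would try the values 3, 6, and 12 in turn.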

Activate a Database

Activates a database by updating the metadata layer.

Required fields: --gq_base_dir, --db_id

% admin_db.pl --action activate --gq_base_dir $GQ_INSTALL \
     --db_id LOCAL_SOMETHING_UNIQUE \
     --prop property.ini \
     --owner admin \
     --access public

Parameter Details

  • --action required
    • This is the action that the administrator wants to perform on a sequence database. Additional parameters are set based on the action chosen here.
    • convert converts a flat file into GenomeQuest Engine format.
    • index indexes the annotation fields of the GenomeQuest database.
    • configure updates various configuration files, namely the definition, instance, and treeview files which are part of the Hotdrive and Local channels.
    • activate activates the database so it is available inside the GenomeQuest application.
    • add adds a database by chaining the operations convert, index, configure, and activate.
    • delete deletes a database by removing all appropriate files and metadata entries and removing the database from all GenomeQuest listings.
    • update updates an existing database with new data. The definition remains unchanged, but the underlying sequence and annotation content is completely rewritten.
    • list lists all active databases and their location (path and database filename) in the Local and Hotdrive areas.
    • showtree shows the database treeview in the console.
    • showfields shows all currently allowed database annotation fields. This includes the default GenomeQuest annotation fields as well as any custom annotation fields.
    • lookupfields looks up annotation fields with a title (description) matching a pattern, within all currently allowed annotation fields.
  • --db_file required
    • This parameter specifies either an input data file or directory. In the case of a directory, all files under the specified directory (excluding files that start with dot “.”) will be processed. These files should be the same sequence type and format.
  • --gq_base_dir required
    • This is the base directory where the GenomeQuest software is installed. There should be “web” and “data” subdirectories under this directory. We have been referring to this location throughout as $GQ_INSTALL.
  • --db_id required
    • The GenomeQuest database definition id for this database. It will be stored in the Metadata database using this name, and this will also be the name of the directory in Hotdrive or Local under which the database is installed. This is a unique identifier for this particular database, and it is used to define the specific behavior of the database, e.g. where the database should be shown in the database tree, how external links for each record should be formulated, etc.
    • It should be noted that this value will be used within the GenomeQuest software filtering widget in the filtering parameter called “database name”.
    • This should be a string of characters [A-Z-_0-9]. No spaces are allowed, and the characters will be normalized to upper case. The string should have a prefix of “LOCAL_”, so that it can be easily differentiated from the databases provided from GenomeCast in the hotdrive. If the db_id does not have the “LOCAL_” prefix, the program will automatically add it and issue a warning message.
  • --db_format required
    • Input sequence file format
    • Currently supported formats are FASTA, FASTQ, and EMBL+.
    • FASTA formatted files need the additional argument --map, which maps the fields in the FASTA header line. It is assumed that the FASTA header line starts with the “>” character and that additional fields are delimited by the bar (“|”) character. The --map value defines how each field is mapped to a two-character field name used internally by GenomeQuest. For example:
      • if the FASTA header is: >gi|3991100|gb|AAC84527.1|some description
      • the --map could be: --map "XX|ID|DB|AC|DE"
      • This mapping discards the first field in the header (“gi”) and treats the 2nd field as the sequence identifier (ID), the 3rd as the database source (DB), the 4th as the accession number (AC), and the 5th as the description (DE).
      • Default is 'auto', in which case the program attempts to create the map based on the header of the first record. It recognizes two header styles:
        1. basic, e.g. ">12345 description"
           ID => 12345
           DE => description
        2. the NCBI FASTA header, e.g. ">gi|3991100|gb|AAC84527.1|locus description"
           ID => gi|3991100|gb|AAC84527.1|locus
           DE => description
           GI => 3991100
           AC => AAC84527
           SV => 1
    • Most EMBL fields are supported, and there are many additional fields used either for internal business logic or for content outside of traditional sequence databases (e.g. patent related fields). To retrieve a complete list of supported fields, use the URL API's gqfetch.get_db_field_list, e.g., my.genomequest.com/query?do=gqfetch.get_db_field_list. Note that if you do this in a web browser you will need to "View Source" to make the output human readable.
    • EMBL+ format annotation field identifiers consist of a two character abbreviation at the start of a line followed by a minimum of two spaces or tabs. The sequence itself is recognized by a minimum of two spaces or tabs at the beginning of the line. A sequence entry always ends with // on a single line.
    • In the EMBL+ formatted files, if the ID field has more than one word, only the first word will be retained.
    • Every record must have an ID field.
    • Sample EMBL+ file:
ID  002R_IIV3
AC  Q197F8
OA  
SV  1
GN  IIV3-002R
SY  
D6  20090728
D7  00000101
D1  20060000
D2  20090616
MT  protein
DE  RecName: Full=Uncharacterized protein 002R;
KW  Complete proteome; Virus reference strain
OS  Invertebrate iridescent
OX  345201
OC  Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Chloriridovirus.
CC  
DR  EMBL:DQ643392\nRefSeq:YP_654574.1\nGeneID:4156251
RA  <B>1</B>. Delhon G., Tulman E.R., Afonso C.L., Lu Z., Becnel J.J., Moser B.A., Kutish G.F., Rock D.L.\nJ. Virol. 80:8439-8449(2006). PUBMED   16912294\n"Genome of invertebrate iridescent virus type 3 (mosquito iridescent virus)."
W1  SP
HL  [L[RA_PUBMEDid; 142; 8, 16912294]] [L[DR_EMBL; 5; 8, DQ643392]] [L[DR_RefSeq; 22; 11, YP_654574.1]]
W4  
FT  <B>CHAIN</B> 1 458\nUncharacterized protein 002R. /FTId=PRO_0000377938\n--------------------\nCHAIN(1)
  MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNREQALAERYPELQTSEPSEDYSGPVESLELLP
  LEIKLDIMQYLSWEQISWCKHPWLWTRWYKDNVVRVSAITFEDFQREYAFPEKIQEIHFTDTRAEEIKAILETTPNVTRL
  VIRRIDDMNYNTHGDLGLDDLEFLTHLMVEDACGFTDFWAPSLTHLTIKNLDMHPRWFGPVMDGIKSMQSTLKYLYIFET
  YGVNKPFVQWCTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYRVEDNKFHRRYMYSTILHKRDTDWVENNPLKTP
  AQVEMYKFLLRISQLNRDGTGYESDSDPENEHFDDESFSSGEEDSSDEDDPTWAPDSDDSDWETETEEEPSVAARILEKG
  KLTITNLMKSLGFKPKPKKIQSIDRYFCSLDSNYNSEDEDFEYDSDSEDDDSDSEDDC
//
ID  003L_IIV3
AC  Q197F7
OA  
SV  1
GN  IIV3-003L
SY  
D6  20090728
D7  00000101
D1  20060000
D2  20090616
MT  protein
DE  RecName: Full=Uncharacterized protein 003L;
KW  Complete proteome; Virus reference strain
OS  Invertebrate iridescent
OX  345201
OC  Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Chloriridovirus.
CC  
DR  EMBL:DQ643392\nRefSeq:YP_654575.1\nGeneID:4156252
RA  <B>1</B>. Delhon G., Tulman E.R., Afonso C.L., Lu Z., Becnel J.J., Moser B.A., Kutish G.F., Rock D.L.\nJ. Virol. 80:8439-8449(2006). PUBMED   16912294\n"Genome of invertebrate iridescent virus type 3 (mosquito iridescent virus)."
W1  SP
HL  [L[RA_PUBMEDid; 142; 8, 16912294]] [L[DR_EMBL; 5; 8, DQ643392]] [L[DR_RefSeq; 22; 11, YP_654575.1]]
W4  
FT  <B>CHAIN</B> 1 156\nUncharacterized protein 003L. /FTId=PRO_0000377939\n--------------------\nCHAIN(1)
  MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDTSLNARSLTTTPSLTTCTPPSLAACTPPTSL
  GMVDSPPHINPPRRIGTLCFDFGSAKSPQRCECVASDRPSTTSNTAPDTYRLLITNSKTRKNNYGTCRLEPLTYGI
//
  • --db_type required
    • Input sequence type. Valid values: NUC|PRT|PRO(same as PRT)|NUCCS. NUCCS is for Di-Nucleotide (color space) sequences.
  • --db_name optional
    • The human-readable database name. This will show up in the GQ user interface as the name of the database.
    • Allowed characters: [a-zA-Z_-0-9\s’]. All other characters will be stripped.
    • If omitted, the software will use the input from --db_id instead
  • --gq_fields optional comma delimited fields
    • The annotation fields used by action convert/update/add for EMBL+ format.
    • If not specified, the system will infer the GQ annotation fields from the first 100 sequences from the input file (or the first file in the input dir).
    • If the value is ":ALL", the full accepted annotation fields are used.
  • --index_fields optional comma delimited fields
    • The annotation fields to be indexed. Possible values:
      • ':ALL' or '': all annotation fields will be used.
      • ':NONE': no index will be created.
      • 'ID,AC,DE': index the ID, AC, and DE fields. Unmentioned fields will be skipped.
    • Do not index NGS reads. Indexing the IDs brings no benefit, and the amount of additional storage required is enormous.
  • --index_dataspace optional number
    • The size of data space assigned to index the text annotation fields.
    • Numeric
    • Default is 3
    • The size allocated for holding the index for numeric fields is 1/3 of the index_dataspace given.
    • If you run out of index dataspace while this program is running, simply rerun it with a larger dataspace.
  • --norm_pn optional
    • Whether to normalize patent numbers to be consistent with GQPAT. Default is OFF.
  • --pattern optional string
    • This option is used in conjunction with --action lookupfields
    • Supply a search pattern to look up fields
    • It should be a string, and it also supports the common shell wildcards *, ?, and [].
  • --target required string
    • The area to operate on. Valid values: hotdrive, local, or all (both).
  • --release optional number
    • The release number of the database.
    • If omitted, today’s date in YYYYMMDD format will be used
  • --owner optional string, default 'admin'
    • The owner of the database, specified by the user's login name.
    • Used in add/update/activate actions
    • Default is admin
  • --access optional string, default 'private'
    • The access level of the database
    • Used in add/update/activate actions
    • Legal values
      • private: only visible to the user specified in --owner (although they can choose to share it afterwards)
      • public: visible to all users
      • group: visible to all users in the group that --owner is in
      • container.$type.$id: access of this sequence database is subject to its parent object of type $type with id $id. GenomeQuest currently supports only the type workflow, in which case the $id is the id of the workflow. This allows this sequence database to be shared when the user shares the enclosing workflow.
  • --prop optional filename, default none
    • Provide additional properties about this database
    • Properties should be in INI format
    • These data will be loaded into the DBMS layer of GenomeQuest
    • Only certain keys are acceptable - currently undocumented
  • --refresh_channel flag, default OFF
    • During an add action, the program will quit if the target channel directory already exists. If this flag is present, it will first clean up the target channel and then perform the add action.
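
The --map mechanism described under --db_format can be illustrated with a small shell function that splits a FASTA header on the bar character and pairs each field with the name at the same position in the map. This is a sketch of the documented mapping and not the converter's actual code. The function name map_fasta_header and the NAME=value output layout are hypothetical.

```shell
#!/bin/sh
# Pair each bar-delimited field of a FASTA header with the field name at
# the same position in the map; fields mapped to XX are discarded.
map_fasta_header() {
    map=$1
    header=${2#">"}    # drop the leading ">"
    printf '%s\n%s\n' "$map" "$header" | awk -F'|' '
        # First input line: the map, split into field names.
        NR == 1 { n = split($0, names, "|"); next }
        # Second input line: the header, already split on "|" by -F.
        {
            for (i = 1; i <= n; i++) {
                if (names[i] != "XX") print names[i] "=" $i
            }
        }'
}

map_fasta_header 'XX|ID|DB|AC|DE' '>gi|3991100|gb|AAC84527.1|some description'
```

For the header and map from the --db_format example, this prints ID=3991100, DB=gb, AC=AAC84527.1, and DE=some description, one per line.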