Content Manager Reference Manual
GenomeQuest Content Manager is a Unix command-line toolkit that lets administrators add, delete, and update local GenomeQuest Engine sequence databases in their GenomeQuest installation.
Full understanding of this Reference Manual may require that you familiarize yourself with the GenomeQuest System Concepts.
Content Manager Technical Overview
The GenomeQuest application provides sequence and keyword search functionalities against genetic sequence databases. The sequence databases need to be formatted and indexed in a specific way so that they are compatible with the GenomeQuest Engine. Preparation of the databases involves these main steps:
- Convert the sequence database from flat file FASTA/FASTQ/EMBL+ format into the GenomeQuest Engine format. This enables the database to be searchable via sequence search (e.g. BLAST) and keyword search.
- Index the annotation fields of the GenomeQuest Engine database just created. This enables faster keyword search.
- Configure the sequence database meta information files so that the GenomeQuest web application recognizes the database.
- Activate the newly added/updated database so that it is usable through the GenomeQuest web interface.
While the toolkit attempts to automate the administration of local databases as much as possible, the administrator must know local, site-specific information for the commands to complete successfully.
Command Line Overview
The Content Manager is packaged into a script called admin_db.pl, which is installed at the following location assuming you have installed GenomeQuest at $GQ_INSTALL:
$GQ_INSTALL/data/GQdata/external_apps/gqadmindb/admin_db.pl
For full help in man-page format for the Content Manager, run:
% $GQ_INSTALL/data/GQdata/external_apps/gqadmindb/admin_db.pl -h
Synopsis:
admin_db.pl --action <convert|index|configure|push|activate|add|delete|update|list|showtree|showfields|lookupfields>
    --db_file <input_db_file_or_dir>
    --gq_base_dir <GenomeQuest_installation_dir>
    --db_id <GenomeQuest_database_definition_id, e.g., "LOCAL_MYDB">
    --db_format <input_seq_file_format, e.g., "EMBL+", "FASTA">
    --map <e.g. "ID|AC|DE">
    --db_type <NUC|PRT, PRO(same as PRT)>
    --db_name <database name>
    --index_fields <e.g. "ID,AC,DE">
    --target <hotdrive|local|all>
    --release <release number>
    --gq_fields <e.g. "ID,AC,DE,OS,CC,W1" or "ALL">
    --norm_pn
    --index_dataspace <e.g. 6>
    --pattern <e.g., "number*fragment">
    --owner <login name of a GenomeQuest user who should own this db>
    --access <the access level for the database, e.g. "private", "public", "group">
Examples
All examples assume that you have installed GenomeQuest at the location $GQ_INSTALL. Inside this directory you should have at least the following two subdirectories:
% ls -ls $GQ_INSTALL
total 28
8 drwxr-xr-x 77 runner geneit 4096 Feb 12 01:54 data
8 drwxr-xr-x 75 runner geneit 4096 Feb 12 01:54 web
...
Show Fields
Show all currently allowed annotation fields, including GenomeQuest annotation fields and custom annotation fields.
Required fields: --gq_base_dir
% admin_db.pl --action showfields --gq_base_dir $GQ_INSTALL
Lookup Fields
Looks up annotation fields whose title (description) matches a pattern, among all currently allowed annotation fields.
Required fields: --gq_base_dir, --pattern
% admin_db.pl --action lookupfields --pattern "number*fragment" --gq_base_dir $GQ_INSTALL
Add a Database
Adds a database to the hotdrive or local.
Required fields: --gq_base_dir, --db_file, --db_id, --db_format, --db_type
% admin_db.pl --action add --gq_base_dir $GQ_INSTALL --db_file <path-to-db-file-or-dir> \
--db_id LOCAL_MYDB2 --db_format EMBL+ --db_type NUC --db_name dbName \
--release 20100211 --gq_fields ':ALL' --index_fields "ID,DE,W1" --norm_pn \
--owner admin --access public --prop tags.ini
Add chains the operations convert, configure, index, push (only on a cluster with local storage), and activate. For details of each operation, see the corresponding examples. If --db_name is omitted, --db_id is used as the database name. If --release is omitted, today's date in YYYYMMDD format is used. --owner is optional and defaults to admin. --access is optional and defaults to private. Both --gq_fields and --index_fields are optional.
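The chain performed by add can be pictured as separate invocations of the individual actions. The following is a minimal dry-run sketch, not the tool's implementation: gq_add_steps is a hypothetical helper that only prints commands, /opt/gq and /tmp/mydb.embl are placeholder paths, only flags listed in this manual are used, and the push step is left out because it applies only to clusters with local storage.

```shell
# Hypothetical helper: print the admin_db.pl commands that roughly
# correspond to one "add" (dry run; nothing is executed).
gq_add_steps() {
  base=$1 file=$2 id=$3 format=$4 type=$5
  echo "admin_db.pl --action convert --gq_base_dir $base --db_file $file --db_id $id --db_format $format --db_type $type"
  echo "admin_db.pl --action configure --gq_base_dir $base --db_id $id --db_type $type"
  echo "admin_db.pl --action index --gq_base_dir $base --db_id $id"
  echo "admin_db.pl --action activate --gq_base_dir $base --db_id $id"
}

# Placeholder arguments, for illustration only.
gq_add_steps /opt/gq /tmp/mydb.embl LOCAL_MYDB2 EMBL+ NUC
```

Running the steps one at a time can help when a single stage fails, because only that stage needs to be repeated.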
Update a Database
Updates a database in either hotdrive or local.
Required fields: --gq_base_dir, --db_file, --db_id, --db_format
% admin_db.pl --action update --gq_base_dir $GQ_INSTALL --db_file <path-to-db-file-or-dir> \
--db_id LOCAL_MYDB2 --db_format EMBL+ --db_type NUC --owner admin --access public
Similar to the add operation, but the definition and so-called "treeview" files remain unchanged. Today's date is always used as the release number; any --release value supplied is ignored. If the database has already been configured, the db type from the definition file is used; otherwise, the db type is taken from user input.
If the type of a database needs to be changed, it is often easier to DELETE and ADD again.
Delete a Database
Deletes a database.
Required fields: --gq_base_dir, --db_id
% admin_db.pl --action delete --gq_base_dir $GQ_INSTALL --db_id LOCAL_MYDB2
The whole database directory $GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB2 will be deleted, and its entries in the metadata layer will be removed.
List Databases
List all active databases and their location (path_dbfilename) in either local, hotdrive, or all (both).
Required fields: --gq_base_dir
% admin_db.pl --action list --gq_base_dir $GQ_INSTALL
Show Tree
Shows the treeview of the requested databases in the console, for local, hotdrive, or all (both).
Required fields: --gq_base_dir
% admin_db.pl --action showtree --gq_base_dir $GQ_INSTALL --target hotdrive
Convert into a Biofacet Format
For lower-level operations you would typically use lspbank for this. The Content Manager provides similar functionality because it also places the converted databases in Local.
Required fields: --gq_base_dir, --db_file, --db_id, --db_format, --db_type
% admin_db.pl --action convert --gq_base_dir $GQ_INSTALL --db_file <path-to-file-or-dir> \
--db_id LOCAL_MYDB --db_format EMBL+ --db_type NUC --db_name dbName \
--release 17
If --release is omitted, today’s date in YYYYMMDD will be used. If --db_name is omitted, --db_id will be used as database name.
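The two fallbacks described above can be sketched in shell. This is an illustration of the documented behavior under the assumption that an empty value counts as "omitted"; resolve_defaults is a hypothetical name and not part of admin_db.pl.

```shell
# Hypothetical helper showing the documented defaults.
resolve_defaults() {
  db_id=$1 db_name=$2 release=$3
  : "${db_name:=$db_id}"              # --db_name falls back to --db_id
  : "${release:=$(date +%Y%m%d)}"     # --release falls back to today's date (YYYYMMDD)
  echo "$db_name $release"
}

resolve_defaults LOCAL_MYDB "" ""       # name becomes LOCAL_MYDB, release becomes today's date
resolve_defaults LOCAL_MYDB dbName 17   # prints "dbName 17"
```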
Configure a Database
Generates and/or updates various configuration files in the hotdrive or local for your database.
Required fields: --gq_base_dir, --db_id, --db_type
% admin_db.pl --action configure --gq_base_dir $GQ_INSTALL --db_id LOCAL_MYDB --db_type NUC
This command automatically generates the following files if they do not exist:
$GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB/configuration/seqdb_definition.seqdbconf
$GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB/configuration/seqdb_instance.seqdbconf
$GQ_INSTALL/data/GQdata/content/local/LOCAL_MYDB/configuration/seqdb_link_definition.seqdbconf
$GQ_INSTALL/data/GQdata/content/local/configuration/local_treeview_nucleotide.seqdbconf # (local_treeview_protein.seqdbconf for a PRT database)
If they already exist, the definition and link definition files remain unchanged, but the instance and treeview files are updated.
A newly generated link definition file, seqdb_link_definition.seqdbconf, contains only the header and footer lines and needs to be edited manually.
Index a Database
Builds indices on annotation fields that you specify.
Required fields: --gq_base_dir, --db_id
% admin_db.pl --action index --gq_base_dir $GQ_INSTALL \
--db_id LOCAL_MYDB \
--index_fields "ID,AC,DE" \
--release 20061113 \
--index_dataspace 6
The above command indexes release 20061113 of database LOCAL_MYDB. Existing indexes are removed prior to indexing. To allow index building before the configure action, indexing operates on the release number given, which is not necessarily the active instance. Use --action list to see all active databases.
The --index_dataspace option lets you increase the amount of disk space used for indices. Typically the default of 3 is sufficient.
Do not index NGS databases. There is no value in building indices on the ID field of NGS databases, and the disk space required will be enormous.
Activate a Database
Activates a database by updating the metadata layer.
Required fields: --gq_base_dir, --db_id
% admin_db.pl --action activate --gq_base_dir $GQ_INSTALL \
--db_id LOCAL_SOMETHING_UNIQUE \
--prop property.ini \
--owner admin \
--access public
Parameter Details
--action (required)
- This is the action that the developer wants to perform on a sequence database. The additional parameters that apply depend on the action chosen here.
  - convert: converts a flat file into GenomeQuest Engine format.
  - index: indexes the annotation fields of the GenomeQuest database.
  - configure: updates various configuration files, namely the definition, instance, and treeview files that are part of the Hotdrive and Local channels.
  - activate: activates the database so that it is available inside the GenomeQuest application.
  - add: adds a database by chaining the operations convert, index, configure, and activate.
  - delete: deletes a database by removing all appropriate files and metadata entries and removing the database from all GenomeQuest listings.
  - update: updates an existing database with new data. The definition remains unchanged, but the underlying sequence and annotation content is completely rewritten.
  - list: lists all active databases and their location (path and database filename) in the Local and Hotdrive areas.
  - showtree: shows the database treeview in the console.
  - showfields: shows all currently allowed database annotation fields. This includes the default GenomeQuest annotation fields as well as any custom annotation fields.
  - lookupfields: looks up annotation fields whose title (description) matches a pattern, among all currently allowed annotation fields.
--db_file (required)
- This parameter specifies either an input data file or a directory. In the case of a directory, all files under the specified directory (excluding files that start with a dot “.”) will be processed. These files should all be of the same sequence type and format.
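To preview which files a directory passed to --db_file would contribute, a listing that skips dot files can help. This is a sketch under one stated assumption: only the top level of the directory is scanned. Whether admin_db.pl descends into subdirectories is not stated here, and list_input_files is a hypothetical name.

```shell
# Hypothetical helper: list regular files in a directory, skipping dot files.
# Assumption: top level only (-maxdepth 1); the tool's recursion behavior is not documented here.
list_input_files() {
  find "$1" -maxdepth 1 -type f ! -name '.*' | sort
}

# Small demonstration in a throwaway directory.
demo=$(mktemp -d)
touch "$demo/a.fasta" "$demo/b.fasta" "$demo/.hidden"
list_input_files "$demo"    # lists a.fasta and b.fasta, not .hidden
rm -rf "$demo"
```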
--gq_base_dir (required)
- This is the base directory where the GenomeQuest software is installed. There should be a “web” and a “data” directory under it. This manual refers to this location throughout as $GQ_INSTALL.
--db_id (required)
- The GenomeQuest database definition id for this database. It is stored in the Metadata database under this name, which is also the name of the directory in Hotdrive or Local under which the database is installed. It is a unique identifier for this particular database and is used to define database-specific behavior, e.g. where the database is shown in the database tree and how external links for each record are formulated.
- Note that this value is used within the GenomeQuest software filtering widget in the filtering parameter called “database name”.
- This should be a string of characters [A-Z-_0-9]. No spaces are allowed, and the characters are normalized to upper case. The string should have the prefix “LOCAL_” so that it can be easily differentiated from the databases provided by GenomeCast in the hotdrive. If the db_id does not have the “LOCAL_” prefix, the program automatically adds it and issues a warning message.
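The db_id rules (upper-casing, the allowed character set, and the automatic “LOCAL_” prefix) can be sketched as a small shell function. This is an illustration, not the tool's code: normalize_db_id is a hypothetical name, and rejecting ids with other characters is an assumption, since the manual does not say how such input is handled.

```shell
# Hypothetical helper mirroring the documented db_id rules.
normalize_db_id() {
  id=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')   # normalize to upper case
  case $id in
    *[!A-Z0-9_-]*)                                      # assumption: reject anything outside [A-Z-_0-9]
      echo "error: invalid character in db_id: $1" >&2
      return 1 ;;
  esac
  case $id in
    LOCAL_*) ;;                                         # prefix already present
    *) echo "warning: adding LOCAL_ prefix to $id" >&2
       id="LOCAL_$id" ;;
  esac
  printf '%s\n' "$id"
}

normalize_db_id mydb          # prints LOCAL_MYDB (with a warning on stderr)
normalize_db_id local_mydb2   # prints LOCAL_MYDB2
```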
--db_format (required)
- Input sequence file format.
- Currently supported formats are FASTA, FASTQ, and EMBL+.
- FASTA formatted files need the additional argument --map. This allows the mapping of additional fields in the FASTA header line. It is assumed that the FASTA header line starts with the “>” character and that additional fields are delimited by the bar (“|”) character. The --map argument defines how each of the fields is mapped to a two-character field name used internally by GenomeQuest. For example:
  - if the FASTA header is: >gi|3991100|gb|AAC84527.1|some description
  - the --map could be: --map "XX|ID|DB|AC|DE"
  - This mapping would discard the first field in the header (“gi”) and treat the 2nd field as the sequence identifier (ID), the 3rd field as the database source (DB), the 4th field as the Accession Number (AC), and the 5th field as the Description (DE).
- Default is 'auto', in which case the program will attempt to create the map based on the header of the first record. It recognizes two header styles:
  - 1. basic, e.g. ">12345 description"
    - ID => 12345
    - DE => description
  - 2. the NCBI FASTA header, e.g. ">gi|3991100|gb|AAC84527.1|locus description"
    - ID => gi|3991100|gb|AAC84527.1|locus
    - DE => description
    - GI => 3991100
    - AC => AAC84527
    - SV => 1
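The way a --map string lines up with the bar-delimited fields of a FASTA header can be illustrated with a short awk-based sketch. It is only a model of the mapping described above, not the converter itself; apply_fasta_map is a hypothetical name, and treating XX as "discard this field" follows the example in this section.

```shell
# Hypothetical helper: pair each field name in the map with the matching header field.
apply_fasta_map() {
  # $1 = map such as "XX|ID|DB|AC|DE", $2 = FASTA header line
  printf '%s\n' "$2" | awk -v map="$1" '
    {
      sub(/^>/, "")                      # drop the leading ">"
      n = split(map, name, "[|]")        # field names from the map
      m = split($0, value, "[|]")        # fields of the header line
      for (i = 1; i <= n && i <= m; i++)
        if (name[i] != "XX")             # XX marks a field to discard
          print name[i] " => " value[i]
    }'
}

apply_fasta_map "XX|ID|DB|AC|DE" ">gi|3991100|gb|AAC84527.1|some description"
# ID => 3991100
# DB => gb
# AC => AAC84527.1
# DE => some description
```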
- Most EMBL fields are supported, and there are many additional fields used either for internal business logic or for content outside of traditional sequence databases (e.g. patent-related fields). To retrieve a complete list of supported fields, use the URL API's gqfetch.get_db_field_list, e.g., my.genomequest.com/query?do=gqfetch.get_db_field_list. Note that if you do this in a web browser you will need to "View Source" to make the output human-readable.
- EMBL+ format annotation field identifiers consist of a two-character abbreviation at the start of a line followed by a minimum of two spaces or tabs. The sequence itself is recognized by a minimum of two spaces or tabs at the beginning of the line. A sequence entry always ends with // on a single line.
- In EMBL+ formatted files, if the ID field has more than one word, only the first word will be retained.
- Every record must have an ID field.
- Sample EMBL+ file:
ID  002R_IIV3
AC  Q197F8
OA
SV  1
GN  IIV3-002R
SY
D6  20090728
D7  00000101
D1  20060000
D2  20090616
MT  protein
DE  RecName: Full=Uncharacterized protein 002R;
KW  Complete proteome; Virus reference strain
OS  Invertebrate iridescent
OX  345201
OC  Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Chloriridovirus.
CC
DR  EMBL:DQ643392\nRefSeq:YP_654574.1\nGeneID:4156251
RA  <B>1</B>. Delhon G., Tulman E.R., Afonso C.L., Lu Z., Becnel J.J., Moser B.A., Kutish G.F., Rock D.L.\nJ. Virol. 80:8439-8449(2006). PUBMED 16912294\n"Genome of invertebrate iridescent virus type 3 (mosquito iridescent virus)."
W1
SP
HL  [L[RA_PUBMEDid; 142; 8, 16912294]] [L[DR_EMBL; 5; 8, DQ643392]] [L[DR_RefSeq; 22; 11, YP_654574.1]]
W4
FT  <B>CHAIN</B> 1 458\nUncharacterized protein 002R. /FTId=PRO_0000377938\n--------------------\nCHAIN(1)
  MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNREQALAERYPELQTSEPSEDYSGPVESLELLP
  LEIKLDIMQYLSWEQISWCKHPWLWTRWYKDNVVRVSAITFEDFQREYAFPEKIQEIHFTDTRAEEIKAILETTPNVTRL
  VIRRIDDMNYNTHGDLGLDDLEFLTHLMVEDACGFTDFWAPSLTHLTIKNLDMHPRWFGPVMDGIKSMQSTLKYLYIFET
  YGVNKPFVQWCTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYRVEDNKFHRRYMYSTILHKRDTDWVENNPLKTP
  AQVEMYKFLLRISQLNRDGTGYESDSDPENEHFDDESFSSGEEDSSDEDDPTWAPDSDDSDWETETEEEPSVAARILEKG
  KLTITNLMKSLGFKPKPKKIQSIDRYFCSLDSNYNSEDEDFEYDSDSEDDDSDSEDDC
//
ID  003L_IIV3
AC  Q197F7
OA
SV  1
GN  IIV3-003L
SY
D6  20090728
D7  00000101
D1  20060000
D2  20090616
MT  protein
DE  RecName: Full=Uncharacterized protein 003L;
KW  Complete proteome; Virus reference strain
OS  Invertebrate iridescent
OX  345201
OC  Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Chloriridovirus.
CC
DR  EMBL:DQ643392\nRefSeq:YP_654575.1\nGeneID:4156252
RA  <B>1</B>. Delhon G., Tulman E.R., Afonso C.L., Lu Z., Becnel J.J., Moser B.A., Kutish G.F., Rock D.L.\nJ. Virol. 80:8439-8449(2006). PUBMED 16912294\n"Genome of invertebrate iridescent virus type 3 (mosquito iridescent virus)."
W1
SP
HL  [L[RA_PUBMEDid; 142; 8, 16912294]] [L[DR_EMBL; 5; 8, DQ643392]] [L[DR_RefSeq; 22; 11, YP_654575.1]]
W4
FT  <B>CHAIN</B> 1 156\nUncharacterized protein 003L. /FTId=PRO_0000377939\n--------------------\nCHAIN(1)
  MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDTSLNARSLTTTPSLTTCTPPSLAACTPPTSL
  GMVDSPPHINPPRRIGTLCFDFGSAKSPQRCECVASDRPSTTSNTAPDTYRLLITNSKTRKNNYGTCRLEPLTYGI
//
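Because every EMBL+ record must have an ID field and ends with // on its own line, an input file can be sanity-checked before conversion. The following is a rough pre-flight sketch based only on the layout rules stated in this section; check_embl_ids is a hypothetical name and the check is not part of admin_db.pl.

```shell
# Hypothetical pre-flight check: report EMBL+ records that have no ID line.
check_embl_ids() {
  awk '
    /^ID[ \t][ \t]/ { has_id = 1 }       # two-character code, then 2+ spaces/tabs
    /^\/\/[ \t]*$/ {                     # "//" closes a record
      n++
      if (!has_id) { print "record " n ": missing ID"; bad++ }
      has_id = 0
    }
    END {
      printf "%d record(s), %d without ID\n", n, bad
      exit (bad > 0)                     # non-zero exit status if any record lacks an ID
    }
  ' "$1"
}

# Demonstration with a tiny made-up file: the second record has no ID line.
sample=$(mktemp)
printf 'ID  REC1\nAC  A1\n  ACGT\n//\nAC  A2\n  ACGT\n//\n' > "$sample"
check_embl_ids "$sample" || echo "file needs fixing"
rm -f "$sample"
```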
--db_type (required)
- Input sequence type. Valid values: NUC | PRT | PRO (same as PRT) | NUCCS. NUCCS is for di-nucleotide (color space) sequences.
--db_name (optional)
- The human-readable database name. This will show up in the GQ user interface as the name of the database.
- Allowed characters: [a-zA-Z_-0-9\s']. All other characters will be stripped.
- If omitted, the software will use the value of --db_id instead.
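The stripping of disallowed characters from a database name can be previewed with tr. This is an approximation for illustration: sanitize_db_name is a hypothetical name, and \s is interpreted here as space and tab only.

```shell
# Hypothetical helper: keep only letters, digits, underscore, hyphen,
# apostrophe, space, and tab; delete everything else.
sanitize_db_name() {
  printf '%s' "$1" | tr -cd "a-zA-Z0-9_' \t-"
  echo
}

sanitize_db_name 'My DB (v2)!'    # prints: My DB v2
```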
--gq_fields (optional, comma-delimited fields)
- The annotation fields used by the convert, update, and add actions for EMBL+ format.
- If not specified, the system infers the GQ annotation fields from the first 100 sequences of the input file (or of the first file in the input directory).
- If the value is ":ALL", the full set of accepted annotation fields is used.
--index_fields (optional, comma-delimited fields)
- The annotation fields to be indexed. Possible values:
  - ':ALL' or '': all annotation fields will be indexed.
  - ':NONE': no index will be created.
  - 'ID,AC,DE': index the ID, AC, and DE fields. Unmentioned fields will be skipped.
- Please note: you should not index NGS reads. You get no benefit from indexing the IDs, and the amount of additional storage is enormous.
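The three kinds of --index_fields values can be told apart with a simple case statement. This sketch only restates the rules listed above; describe_index_fields is a hypothetical name, not an admin_db.pl feature.

```shell
# Hypothetical helper: explain how an --index_fields value would be read.
describe_index_fields() {
  case $1 in
    ''|:ALL) echo "index all annotation fields" ;;
    :NONE)   echo "create no index" ;;
    *)       echo "index only: $(printf '%s' "$1" | tr ',' ' ')" ;;
  esac
}

describe_index_fields ':ALL'       # index all annotation fields
describe_index_fields ':NONE'      # create no index
describe_index_fields 'ID,AC,DE'   # index only: ID AC DE
```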
--index_dataspace (optional number)
- The size of the data space assigned for indexing the text annotation fields.
- Numeric.
- Default is 3.
- The size allocated for holding the index of numeric fields is 1/3 of the index_dataspace given.
- If you run out of index dataspace while this program is running, simply rerun it with a larger dataspace.
--norm_pn (optional)
- Whether to normalize patent numbers to be consistent with GQPAT. Default is OFF.
--pattern (optional string)
- This option is used in conjunction with --action lookupfields.
- Supply a search pattern to look up fields.
- It should be a string, and it also supports the common shell wildcards *, ?, and [].
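Shell wildcard matching of the kind --pattern accepts can be tried out with a case statement. The sketch below is only a model: match_titles is a hypothetical name, the field titles are made up for the demonstration, and it assumes the pattern must match the whole title and is case-sensitive, which this manual does not specify.

```shell
# Hypothetical helper: print the input lines that match a shell glob pattern.
match_titles() {
  pattern=$1
  while IFS= read -r title; do
    case $title in
      $pattern) printf '%s\n' "$title" ;;   # unquoted, so *, ?, and [] act as wildcards
    esac
  done
}

# Made-up titles, for illustration only.
printf '%s\n' "number of fragment" "fragment number" "number of hits" |
  match_titles "number*fragment"
# prints: number of fragment
```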
--release (optional number)
- The release number of the database.
- If omitted, today’s date in YYYYMMDD format will be used
--owner (optional string, default 'admin')
- The owner of the database, given as the user's login name.
- Used in add/update/activate actions
- Default is admin
--access (optional string, default 'private')
- The access level of the database.
- Used in add/update/activate actions.
- Legal values:
  - private: only visible to the user specified in --owner (although they can choose to share it afterwards)
  - public: visible to all users
  - group: visible to all users in the group that --owner is in
  - container.$type.$id: access to this sequence database is subject to its parent object of type $type with id $id. GenomeQuest currently supports only the type workflow, in which case $id is the id of the workflow. This allows the sequence database to be shared when the user shares the enclosing workflow.
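A wrapper script could check an --access value against the legal forms before calling the tool. This is a minimal sketch: valid_access is a hypothetical name, and the only assumption made about the workflow id is that it is non-empty.

```shell
# Hypothetical helper: succeed only for the documented access levels.
valid_access() {
  case $1 in
    private|public|group)  return 0 ;;
    container.workflow.?*) return 0 ;;   # workflow is the only supported container type
    *)                     return 1 ;;
  esac
}

for a in private public group container.workflow.42 everyone; do
  if valid_access "$a"; then echo "$a: ok"; else echo "$a: rejected"; fi
done
```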
--prop (optional filename, default none)
- Provides additional properties about this database.
- Properties should be in INI format
- These data will be loaded into the DBMS layer of GenomeQuest
- Only certain keys are acceptable - currently undocumented
--refresh_channel (flag, default OFF)
- During an add action, the program will quit if the target channel directory already exists. If this flag is present, it will first clean up the target channel and then perform the add action.