Welcome to the GenomeQuest Documentation Wiki

GQEnginePrimer

From GQ Wiki
Jump to: navigation, search
THE GQ ENGINE PRIMER
Avatar.jpg
Sequence Comparison in 3D
February 2nd 2010 V1.0

Contents

Introduction

Starting in the mid-90s out of the Human Genome Global Physical Mapping Project, the GQ Engine (aka Biofacet) is a multiple databases, multiple-algorithms, high-performance sequence search engine. Initially developed for the needs of large scale comparative genomics studies, the GQ Engine has evolved over years to reach a unprecedented level of expressive power able to conduct fast and sophisticated analysis on any sequence dataset, including NGS data.

The GQ Engine is different from everything that you may have used so far with Blast searches, or more recently with popular analysis packages for short reads in the NGS bioinformatics arena. It is a breakthrough in sequence comparison.

It is different because:

  • It is not yet-another-stack of heterogeneous modules, whose throughput is limited by the thin pipes connecting multiple components, and by the runtimes of parsing huge flat files back and forth
  • It is not an ad-hoc solution for a given problem linked to a sequencing machine vendor technology, a read length, or to a given application. It is a versatile, global solution.

It is a breakthrough because:

  • No limits: GQ Engine handles large volumes by design; it operates from individual objects up to whole datasets, whether they be sequence data, annotations or alignments.
  • No dbms: GQ Engine definitively solves the dichotomy between annotation and sequences. All properties of a sequence record, or of an alignment record, whether they be text or data, whether the records be one or millions, are useable all together simultaneously, instantaneously.
  • No breakpoints: Alignments are not an ending anymore, but first-class objects. Hundred of millions of alignments, coming from one or multiple databases, computed with different algorithms, can be queried, filtered and reused as databases or alignments for the next steps. More generally speaking, objects in the system can be filtered, grouped, sorted, and dumped, from 1 to billions.
  • No parser: the free-formatting module allows you to massively dump sequences or alignments in any format, from simple tab delimited to standard SAM/BAM, including generating data structures directly usable by programs.
  • No compromise: whatever the algorithm chosen, the # of hits and distribution per query is computed, regardless of the options you choose for which hits to keep.
  • No mental barrier: one can decide to use the ultra-fast short-reads mapper algorithms included, gapped or not, or - as read lengths increase – to use the classical but unbeatable Blast algorithm, whichever the technology is (Ilmn, 454, LifeTech, Helicos, PacBio, Complete, etc.), whichever the database is (a reference genome, a genbank division, a protein database, any subset and/or any combination of them).

Ok, let’s forget the hype and the marketing for a second. Ever heard anything like this?

Discard all reads that both mapped to Human and Mouse, and give me the Human-only ones; btw, remove also those who are abnormally repeated (from the search itself I mean; not from known repeats). And, hey sorry to call back, I know it’s done and it’s great, but was wondering whether I could get only short gaps on references, and see how adding Rat would change the picture? Just checking in; nothing urgent, but would be nice to have by the end of the afternoon, thanks.

Pretty familiar request right? Hum... being able to do this quickly seems interesting, but this GQ Engine smells like yet-another “nuclear plant” from some crazy computer scientists, charged with heavy protocols and obscure concepts. You're pretty sure it must be an esoteric/unusable approach disconnected from the biological reality.

Well ... NOT. GQ engine is a light package, a couple of Unix binaries implementing a few but powerful concepts, together with highly optimized core algorithmics, solving problems we have encountered over 10+ years with our customers. Good-sense, market-driven, robust and proven, it is a longstanding commercially supported sequence-problem-driven solution that sticks to genomics.

Of course, while the GQ Web provides you with Google-like ease-of-use, Bioinformatics is not always as simple, and thinking a little bit is required before taking actions. So, yes, to take full advantage of GQ Engine, there are a couple of fundamental concepts that you need to understand.

How to read this document?

The next section presents the core operations of the engine, in a nutshell. A comprehensive description is provided in the more formal companion document: GQ Engine: Principle Of Operations. All along, what is not described in this section will be developed through examples within Section 4.

For “hard core bioinformatics” folks who have been around, Section 3 shows in one+ page a somewhat unusual blast-like of 10M Illumina 100bp reads against SwissProt Human proteins: a way to “map differently”.

Others might skip to Section 4 “Getting Started”, from where we will conduct a progressive analysis on a transcriptome sequencing of Kidney and Liver human cells.

Section 5 shows how to scale it up, and Section 6 shows some of the connections between the GQ Engine and the GenomeQuest web interface. For more on that, you should also check out System Concepts and How To Make My Own Workflows

Hereafter, we do assume BASH as being the Shell interpreter.

Core Operations

The GQ engine mainly performs three types of operations with associated Unix commands.

Primary GQ Engine Operations
Operation Command Description
1. Operations on Sequence Databases lspdb Look up sequence records using keyword (annotation) filtering capabilities.
2. Sequence Searching lspcalc/lspmul Search sequence datasets against sequence datasets, with keyword filtering capability on the fly, using various sequence searching algorithms. Provide results (alignments), that can be further analyzed by...
3. Operations on Results Databases lspres Import/export alignments, using filtering capabilities on alignment properties and annotation content

Sequence Databases and Results Databases (1 and 3)

No miracle: in order to support requests such as Blast my reads > 50nt against Swissprot Human Phosphodiesterases, using 30 compute nodes, 4 threads per node in one command line, GQ Engine handles sequences in its own format. This format is a mix of binary and text, a kind of Blast++ enriched format, handling at once annotations and sequences.

Yet-another-format right, but once you get it, no need to convert it again and again to other specific needs. (Yes, the GQ Engine also provides dynamic libraries that allow Open Source popular packages to read directly the GQ Engine format, instead of FASTA input.)

Converting popular formats into the GQ Engine is straightforward. One can use the low-level lspbank tool or use the GQ admindb.pl, termed the Content Manager, if we want the databases to be GQ web-published and shared with others under the GQ umbrella. See System Concepts for more on that.

On top of the three basic operations, the lspdb and lspres commands provide five types of actions, which apply to sequences or alignments databases, in the order below:

  1. Ranges: provides retrieval of intervals of sequences or results
  2. Filter: select a subset by applying various criteria such as Genbank GIs, keywords, lengths, dates or alignment properties. Applies on both subject and query as well as result databases.
  3. Group: Group objects sharing common properties.
  4. Sort: Sort sets, subsets or groups.
  5. Format: Export/print objects in pre- or user- defined format.

The figure below shows the sequence of actions (actions occur in that order) on a sequence database, on a simple example:

Biofacetprimerfig1.jpg

The fitering, grouping and sorting operations are operated using BFQL (BioFacet Query Language). BFQL allows to express complex queries based on annotation or results properties. Its is also a complete language, and can be seen as the GQ Engine scripting language. For extreme detail on BFQL, see the BFQL Primer and the BFQL Reference Manual at File:BFQLReference.pdf.

The same operations as described for lspdb apply to lspres. Same way, results (alignments) are encoded using a compact proprietary binary data format which does not store full alignments, but only “core alignments” properties (scores, positions, etc.) as well as pointers to query and subject sequences. Alignments are stored using a single data structure, regardless of the algorithm used. This format allows performing queries on alignment properties together with queries and subject annotations, in a single thread of execution, without the need of a dbms support. An alignment takes 48bytes on disk, and millions of alignments can be filtered/grouped within seconds.

Last, a mechanism of Virtual Databases and Virtual Results allows gathering multiple databases and viewing them as one, without any physical concatenation.

Sequence Searching

From its inception, the GQ Engine implements by design an All vs. All schema, with a variety of algorithms: homology-based, string matching, pattern matching, etc. An algorithm is a parameter on the command line. Whichever the algorithm is, frame translations, filtering on annotations and best results options apply on the fly to the databases given as input.

The lspcalc “schematic” can be summarized as follows:

Biofacetprimerfig2.jpg

Computation is multi-threaded, and, using Array Computing (also see Using the array in the System Concepts) companion scripts, computation is "cloud-enabled". The whole generally linearly scales as a function of the number of cores, tested up to thousands.

An Example

For example, let’s download and convert the SwissProt database. Conversion takes about 30secs.

% lftp
ftp.expasy.ch/.../swissprot/release_compressed/uniprot_sprot.dat.gz
% time gunzip uniprot_sprot.dat.gz | lspbank –T embl –prot –F sp
STDIN : sequences 512994, residues 180531504, max seq length 35213
real 0m24.024s user 0m34.382s sys 0m4.661s

(Mod for above by JMK):
The directory path at ftp.expasy.ch to 'get' uniprot_sprot.dat.gz is now:
./databases/uniprot/knowledgebase/uniprot_sprot.dat.gz


Then after:

% lspcalc –M bl2 –mp BLOSUM62
     –db sp –bfql ’os=”homo sapiens”’	// Subject (reference)
     –db sp -bfql ’os=”mus musculus”’	// Query
     –o HsMm.res –best 5,{-RS}

Blast all human versus mouse proteins, storing the 5 best hits according to best Blast scores (descending order), and:

% lspres HsMm.res –bfql ’rp1==1 && rs>200’ –o mouseunique

dumps the alignments whose queries have a unique hit (rp1), and with a Blast score (rs) greater than 200. Of course, you can drill down into alignments:

% lspres res/mouseunique | head
1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 S= 1605 E= 1.37637e-178 Bits= 622.854
3HIDH_HUMAN 1 336 3HIDH_MOUSE 1 335 S= 1562 E= 6.66903e-174 Bits= 606.29
5HT1A_HUMAN 1 421 5HT1A_MOUSE 1 421 S= 1901 E= 4.42149e-213 Bits= 736.873
5HT3B_HUMAN 6 438 5HT3B_MOUSE 1 434 S= 1702 E= 5.52321e-190 Bits= 660.218
5NT3L_HUMAN 1 291 5NT3L_MOUSE 1 291 S= 1390 E= 4.86686e-154 Bits= 540.035

% lspres res/mouseunique -a | head
1A1L2_HUMAN 31 555 1A1L2_MOUSE 51 577 S= 1605 E= 1.37637e-178 Bits= 622.854
Q:	51 EKMLKFQHVIRNQFLQQISQQMQCVPPGDQQCTQTSRKRKKM-GYLLSQMVNFLWSNTVK 109
           |  |  |  +   |+|  |+|   +   +++ |+   + + +   |+ +|+| | |
S:	31 EITLHLQQAMTEHFVQLTSRQGLSLE--ERRHTEAICEHEALLSRLICRMINLLQSGAAS 88

Q:     110 KLKFKVPLPCLDSRCGIKVGHQTLSPWQTGQSRPSLGGFEAALASCTLSKRGAGIYESYH 169
            |+ +||||  |||  ++ | +     |     | |   |||  +  || ||  |    |
S:	89 GLELQVPLPSEDSRGDVRYGQRAQLSGQP-DPVPQLSDCEAAFVNRDLSIRGIDISVFYQ 147

So to say you are up and running in a couple of minutes...

Reference Data

Downloading and converting the terabytes of reference databases can be a good exercise, but keeping the references up-to-date is generally a pain, with little value-added. GenomeQuest provides a database update service (GenomeCast™), a simple Perl script that brings to your organization all reference databases, converted with carefully chosen, indexed annotation fields. GenomeCast handles updates for you, at the frequency that you choose. We call this the Hotdrive. With GenomeCast, you get in real time mirror image of the Hotdrive that GenomeQuest continuously builds on its servers.

Collections

The ranges are an essential mechanism in the GQ Engine. They provide a numerical, low-level, very fast access to subsets of databases. A tuple <Database,Range> defines a subset of a database. A range is can be viewed as a “cumb” - or a vector - on a one-dimensional dataset. This “named range” is called a Collection. Collections allow to dump sequences or results, and resuse them subsequently, without duplicating data.

Simple collections can be dumped using flags on the command line. For example, from the previously mouseunique computed result file, the two command lines:

% lspres mouseunique –oqcoll MouseDb
% lspres mouseunique –oscoll HumanDb

dump the databases of sequences of queries (subject) that had hits. Those collections can be reused as plain GQ Engine databases for further sequence searching.

More complex collections are given using BFQL programming. The example below dumps the collection of mouse sequences that did not align to a human with an alignment length greater than 70% of the query sequence.

% lspres mouseunique
   -bfql ’rl > 0.7 * q.l ;	// filter the ones that align
     POST {
      // $$ contains the resulting collection of the filter
      $r := $$->qcoll() ;	// get the collection of queries
      $r := $r->complement() ; // get the complement
      $r->write (”NonMatchingMmQueries”);
      echo ”Dumped: ” + $r->size() + ” sequences”;
     }’
   -printf ’’

Do those ones would match with Rat ?

% lspcalc –M bl2 –mp BLOSUM62
   –db sp –bfql ’os=”rattus norvegicus”’
   –db NonMatchingMmQueries
   –o RnMm.res –best 5

And so on so forth.

Truly Large Scale

Collections are sets, in the mathematical sense. Together with sequence comparison high-performance module, they allow to write workflows having:

  1. Entire data sets as input and output.
  2. Operations such as union, intersection, complement, etc. as connectors. Those operations are programmed to be fast.

This is the true “large scale” capability of the GQ Engine that makes it different from all other technologies.

In One+ Page

As described in previous section, we assume we have downloaded and converted SwissProt database.

How many Human proteins are there?

% lspdb sp -bfql 'os ="homo sapiens"' –printf ’%POST Tot: %NBSEQS\n’
Tot: 20329

Let’s go now to the following NCBI SRA archive reference:
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=SRR027894

Genome-wide analysis of allelic expression imbalance in human primary cells by high throughput transcriptome resequencing. SRA008367 by Barts and The London on 2009-04-23T09:50:34Z.
Platform: ILLUMINA
% export file=SRR027894_2.fastq.gz ; lftp $file
% gunzip $file | lspbank -nuc -T fastq -A ID -F $file

// What about size?
% du -sh *
992M    SRR027894_2.fastq.gz
4.0K    SRR027894_2.ind     // GQ engine databases are compressed,
117M    SRR027894_2.ctb     // all IUPAC chars left intact
461M    SRR027894_2.seq     // qualities are discarded

% lspdb ilmn/SRR027894_2 -count
                 db nbseqs = 10147832	// # of sequences
                 db nbres  = 1085818024	// # of residues
                 db maxlen = 107		// longest sequence
                 db fields = ID 		// Annotation fields

// check that all reads are all 107nt long
% echo 1085818024/10147832 | bc
107               

The following command line does an equivalent of a blastx with short reads as queries, and Human SwissProt proteins as the subject. With 100nt long reads, this starts to make sense. This example shows the versatility and openness of the GQ Engine, where you are not linked to a predefined “short read mapper” way of thinking. It is of course not meant to be biologically significant.

We apply our fast filtering module lspmul based on the number of words in common between two sequences, (“a la BLAT” in some way). For proteins, words are composed of Amino Acids, and default length is 3.

% time lspmul.TH -t 4                     // we run with 4 threads
   -M mappc,bl2    // word filtering followed by Blast (blastx like)
   -O "[-fltThreshold 10]"                // 10 3AA words in common
   -mp BLOSUM62                           // classical scoring matrix
   -db sp -bfql 'os ="homo sapiens"'  // sp human proteins on the fly
   -db -f a ilmn/SRR027894_2      // query translated on all 6 frames
   -best '1,single,hitcnt,{-RS}'  // keep one best result,
   -scut 75              // based on score (RS) and among score >= 75
   -o res/SRR027894_2.res                        // output as it says
   –progress 2>ILMN.err 1>ILMN.out     // progress and Unix redirects

% tail ILMN.out
lspmul.TH:[206295276728/206295276728]  [4370447]    [r:1922.7s, u:6983.8s, s:42.5s]

206,295,276,728 comparisons (x6 frames) Total real time (r:) was 1992 secs (30mns), and 6983 secs user in cpu (u:) Numbers of alignments kept: 4,370,447 So about 44% of reads aligned.

% lspres res/SRR027894_2.res –count     // indeed
Nb Selected = 4370447

% lspres res/SRR027894_2.res -bfql 'rp1==1' –count
Nb Selected = 3959724  // alignments that had a single hit above scut

% time lspres res/SRR027894_2.res
      -bfql 'rp1==1'
      -bfqlgroup '{[s.id]}'    // grouping alignments by subject ids
      -printf '%POST Ret: %NBGROUPS\n'     // how many groups
Ret: 10768
real    0m12.545s    // 12 secs to group 4M alignments

Half of Human proteins (10768 out of 20329) have been mapped.
We can finally export to SAM format.

% gq2sam.sh res/SRR027894_2.res -o res/SRR027894_2.sam

Getting Started

The Project

Let’s start with the following article:

An assessment of technical reproducibility and comparison with gene expression arrays. John C. Marioni, Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. Genome Res. 2008 18: 1509-1517 - http://genome.cshlp.org/content/18/9/1509

This very good paper describes an RNA-Seq analysis between two samples of transcriptome sequencing of Kidney and Liver human cells. Sequencing technology is Illumina short reads of 32bp. We’ll take about 65M of reads of each tissue, coming from 4 different lanes, and see how we can mine this from our own.

We are going to perform the following operations:

  • Preparing data;
  • Do a quick Quality Check on reads; suspicious repeats, vectors around?
  • Get rid of the repeats, and align remaining reads to human transcripts extracted on the fly from RefSeq mRNA;
  • Recheck for unique locations of those reads against genes;
  • Then provide accurate reads counts, and global classification on gene names or tissues, or lanes, or any combination of those.
  • Export results to third-party packages such as Expressionist or GeneSpring.

Preparing Data

Input generally comes from two sides:

  • Reads from my experiments, or NCBI SRA archive, or whatever could it be from my own needs.
  • Reference, general purpose public domain databases (Genbank, RefSeq, etc …) whether plain or partial - subsets; or references from my own.

Reads

We assume having downloaded various fastq files from NCBI SRA archive, searching with “gilad liver human transcript”. If we pick one of them (e.g. SRX000604) then the following command:

% lspbank –nuc –T fastq –K ID SRX000604

will convert the fastq database into GQ Engine format, producing 3 files on the disk. For efficiency reasons, we have chosen to not store Quality Values (-K ID), as they will not be used for this study. This query database is directly useable with the lspdb command. If we repeat this process on all individual fastq files from this SRA archive, we end up with a list of files, say SRX000*, that are GQ Engine individual so-called physical databases. By using lspvbank we can now group them altogether and view them as a single GQ Engine database.

% lspvbank –o livkid SRX000*.ind

In GQ Engine terminology, livkid is a Virtual Database. It is only defined by the list of all physical databases that made it. Livkid consumes 3Kb on disk. It is directly usuable by lspdb. Let’s have a look:

% lspdb livkid -count
                 db nbseqs = 136022708
                 db nbres  = 4352726656
                 db maxlen = 32
                 db fields = ID

We have selected for this study 136M reads, half liver, half kidney. The origin of each individual physical file is kept. That means we can at any time extract from the virtual database any physical database by various means, including of course their filenames. For clarity, we have selected some runs, and some lanes inside each run, and renamed the physical databases. They can be roughly be viewed using:

% cat livkid.ind
dbformat = [LSPvDb] [3.0]
dbtype = NUC
dbtitle = livkid
date = [Fri Nov 27 09:14:09 EST 2009]
nbseq = 136022708
dbtotlen = 4352726656
maxseqlen = 32
nbfiles = 10
files = [LOCAL_GILADKIDNEY/20090609/Kidney_R3:2] [Kidney_R3:2] [13687929] [seq,373630636,bff7e1b47d75c2d296cf1622053cd1c2] [ctb,164255156,5a4988521ce7189bb0c58dafb68d875e]
files = [LOCAL_GILADKIDNEY/20090609/Kidney_R3:6] [Kidney_R3:6] [13449864] [seq,377881545,b7fbba34db58ea7e227b190718055afb] [ctb,161398376,d32d0c0cb2d9678355617ffb1bba061b]
…
files = [LOCAL_GILADLIVER/20090609/Liver_R7:2] [Liver_R7:2] [14003322] [seq,382597323,3da8c4a3b0c2efdc4c68331ab6c267bd] [ctb,168039872,a6112853daae102fc2c0b0054bb4f14d]
files = [LOCAL_GILADLIVER/20090609/Liver_R7:4] [Liver_R7:4] [14230879] [seq,399992169,7a9f81296cdd0388b556dcf44c683e1d] [ctb,170770556,98e452bb896a0f2d15bd7080dad4bb61]
…
annots = [1] [1,0,0,0] [ID]
VdbName = [10] [livkid] [livkid]
VdbFile = [livkid] [LOCAL_GILADKIDNEY/20090609/Kidney_R3:2] [1]
VdbFile = [livkid] [LOCAL_GILADKIDNEY/20090609/Kidney_R3:6] [2]
…
VdbFile = [livkid] [LOCAL_GILADLIVER/20090609/Liver_R3:3] [6]
VdbFile = [livkid] [LOCAL_GILADLIVER/20090609/Liver_R7:2] [7]
…
CryptoIdent = [136022708] [ind,-,ba2c8cde566045c1419c38d14d496357]

Livkid contains 10 physical databases, 5 for Kidney, 5 for Liver. The name of a physical databases encodes the Run Number and the lane (e.g. R3:2). We will use this information when we will group aligned reads by tissue origin or by lanes. Each line describing a physical database contains the # of reads, # of nucleotides (per run/lane) as well as some other magic numbers (checksum) used for database integrity retrieval purposes.

Reference Databases - Hotdrive Basics

One generally associates short read with a single genome, or a single reference, something static. Therefore, it’s not so easy to for example align all reads to the Genbank Synthetic or Bacterial division, or a subset of it (e.g. Human L1-line repeats). This is nevertheless a pretty “mandatory” operation when doing primary analysis, such as checking for some contamination or unwanted experimental effects. So let’s see how we perform such operations from the references standpoint.

We assume you have a GQ engine installed locally with a populated Hotdrive located in:

$GQ_INSTALL/data/GQdata/content/hotdrive

(Much more on the Hotdrive)

For fast set up, we do not assume that the Array-Computing layer, operating the engine on Linux clusters, is installed. This layer provides also some naming abstraction of the Hotdrive that we will skip for now, at the “cost” of manipulating databases at the Unix file-system level. Once computation needs will become heavier, we will switch to using this layer (see Section 5, Scaling it Up).

At the lowest level, the Hotdrive is organized with channels (GB_PRI, GB_BCT, etc.), and each channel maintains a hierarchy used for handling releases and configuration files for GQ. e.g.

% ls -ls $GQ_INSTALL/data/GQdata/content/hotdrive/GB_SYN
total 12
4 drwxr-xr-x 5 runner geneit 4096 Oct  7 04:54 173
4 drwxr-xr-x 5 runner geneit 4096 Nov 19 08:24 174  // Release 174
4 drwxr-xr-x 3 runner geneit 4096 Aug 21 04:56 configuration

A given release contains the main Release, Updates, some other files used to store pre-computed indices (esmdb directory), and “master” entry points, labeled with a date, and having a .ind extension. They are GQ Engine virtual databases handling the latest dataset of all files defining a Release and Updates. The one used here is GB_SYN174_20091118.ind:

% ls -ls $GQ_INSTALL/data/GQdata/content/hotdrive/GB_SYN/174
total 36
4 drwxrwxrwx 2 runner geneit 4096 Nov 19 08:24 esmdb
4 -rw-r--r-- 1 runner geneit  841 Oct 29 21:06 GB_SYN174_20091029.ind
4 -rw-r--r-- 1 runner geneit 1056 Nov 11 09:48 GB_SYN174_20091111.ind
4 -rw-r--r-- 1 runner geneit 1270 Nov 18 09:38 GB_SYN174_20091118.ind
4 drwxr-xr-x 2 runner geneit 4096 Nov  2 15:05 release
4 drwxr-xr-x 2 runner geneit 4096 Nov 19 08:25 upd

So, the file:

$GQ_ENGINE/data/GQdata/content/hotdrive/GB_SYN/174/GB_SYN174_20091118.ind

is a GQ Engine file (a Virtual DB) that you can use.

Let’s use a syntactic shortcut:

% export syn=$GQ_ENGINE/data/GQdata/content/hotdrive/GB_SYN/174/GB_SYN174_20091118

What can we do on a db?

% lspdb $syn –count 						// counting
% lspdb $syn | less						// displaying
% lspdb $syn –printf ’%H#AC\t%L\n%VOID’  			// printing
% lspdb $syn –bfql ’os=”homo sapiens”’ –count 		        // querying

The default display is an EMBL-like format. We call it internally db2 (Historical, sorry for that.) db2 is an extended version of EMBL (any two letters field allowed, etc.). It is referred as EMBL+ with the admindb.pl command.

lspdb is one of the GQ Engine modules.

GQ Engine Modules

For some historical reasons, all GQ Engine commands start with lsp. The following table displays the most frequently used modules. The .TH suffix refers to the multi-threaded version of a module. db refers to a GQ Engine database, res refers to a result database, which is our compact form of storing alignments.

The third column shows equivalent commands using Array-Computing/Hotdrive modules (see Section 5).

Module Name Short Description Array
lspbank
lspvbank
Create a physical db
Create a virtual db from multiple physical db
lspdb Operations on db (querying, counting, printing,..) lspdb.H
lspmul,lspmul.TH
lspcalc, lspcalc.TH
Compute: takes one or two dbs as input and produces a physical res. lspcalc.THA
lspres

lspvres

lspextend
lspextend.TH
Operations on res (querying, counting, printing,…)

Create a virtual res from multiple physical res

Computes additional alignment properties on res (lengths, etc.)

Note: For any of these commands, extended online help can be obtained with:

lsp* -help

Experienced users will refer to it pretty often. The Wiki forms are coming available as well - for instance, lspdb.

For those who know or guess enough on what can be done without willing to know more details, please skip to Subsection 4.4 (Mapping against repeats); others might read the following subsection who gives more insights on Printing (-printf), Filtering (-bfql) and Grouping (-bfqlgroup) actions.

Printing, Filtering, Grouping

-printf

The printf statement is an output formatter. It prints out text and values of variables, (variables are prefixed with a % sign), using a mix of C and Perl syntax. Each record is processed in a sequential and optimized way that so far saturates any I/O subsequent operation. There are more than 50 variables available, from databases names down to individual annotations. For example, Fasta format can be simulated as:

% lspdb $syn –printf ’>%H#AC\n%S\n%VOID’

Records with its sequence on one line will be printed. %H#AC means AC field in the Header (annotation) section of the record. (This syntax, a bit esoteric, will gradually be replaced with a more BFQL-compliant one.)

Some variables are called complex variables and perform nested operations, with optional PRE and POST processing. For example, %S (the sequence) is a complex object that has subfields. Subfields are labeled with numbers. For example:

% lspdb $syn –printf ’>%H#AC\n%[%2\t]S.60[\t%3]\n%VOID’

will print sequences split every 60 nucleotides, and print with tabs current starting and ending nucleotide position (%2 and %3) surrounding each 60nt line.

Printf is an extremely powerful module that allows any kind of free formatting. For example, all PhP data structures needed for our GQ Web Server are generated directly that way, avoiding costly parsers.

-bfql

BFQL is the Scripting language of the Engine. BFQL is a complete language allowing complex queries (filtering) on sequences and results databases. Its complete capabilities are provided in the BFQL Reference Manual: File:BFQLReference.pdf. A primer can be found in BFQL Primer.

We describe here some basics of filtering. Collections have been introduced above in Section 2.4 - we assume your familiarity with them.

The basic workings of BFQL are somewhat similar to awk:

-bfql ’PRE { some commands to run before doing any filtering  }

             FILTER: a suite of instructions applied 
             for each record (result or sequence db) 

       POST { some commands to run post filtering }’

PRE and POST blocks are optional. // is used for comments. The central block is not surrounded by { }. BFQL contains special preset variables $& and $$ that are used when deeper programming is necessary (scripting). They respectively contain PRE and POST contents of a filter.

Let’s describe the instructions used for filtering. For basic keyword searching, a single instruction is very often sufficient.

Sequences

A filtering instruction is made of:

<annotation> <query_op> <value>

where <annotation> is a database annotation field name, <query_op> a filtering operator, and <value> the data (can be an annotation).

For example,

ac = ”NM_12345”

performs a ‘word case insensitive equality’ comparison (query_op = ’=’) to retrieve records such as Accession Number (ac) equals NM_12345.

ac = id

will return all records whose Accessions and Identifiers are identical.

Annotation field names are database dependant. The list of annotations available for a given database can be retrieved using printf variables, or using the –count flag.

The table below displays the most commonly used fields as prepared in the GQ Hotdrive.

NOTE: ALL and ANNOT are so-called aliases and play a special role.

Field Name Description Example of content
ALL All ‘important’ indexed fields ALL:=DE+KW+SV+OS+MT
ANNOT All fields, indexed or not For performance reasons, use with caution.
ID or AC Identifier or Accession Number NM_201656
SV Sequence version 4
GI Genbank Identifier 42570622
DE Description Arabidopsis thaliana transcription factor (AT2G01060) mRNA, complete cds
KW Keyword Coupled; receptor
MT Molecular Type DNA, mRNA, RNA, PRT. …
OS Organism and extended species Arabidopsis thaliana (thale cress)
OX Organism taxonomy ID (NCBI) 9606
GN Gene Name Cox2
DT Date of entry, in string format 23-Nov-2006
D2 Date of entry, numerical sortable 20061023

When multiple words are provided, and when the operator is ‘=’, an implicit logical AND is performed. For example:

de=”coupled receptor”

will return all records whose definition fields (de) contains the words coupled and receptor in any order - like Google.

Many operators are available, case sensitive or case insensitive, performing keyword or phrase searching, including regex. A table of extended searching capabilities can be found by issuing the command:

lspdb -help

The annotation fields are composed of annotation field names, plus three dedicated fields:

  • L: denotes the length of the sequence
  • N,ON: is the GQ Engine numerical IDs of a record; ON varies from 1 to the number of sequences in the database. It is an absolute index, the one used for ranges. N is the relative numerical ID after –range and –bfql have been applied.
  • S: the sequence itself

Other fields can contain numerical data, such as version numbers or numerical dates. When indexed, type is implicit (although type can also be enforced by explicit casting), and regular arithmetic operators are directly available (>, <, >=, etc.). For filtering on sequence lengths, simply do:

L > 100

BFQL also provides for lists of values. A list is a set of elements. The operator returns true if ONE OF the elements matches. It is a logical OR. For example:

de = {”coupled”, ”receptor”}

returns all records that contain the word “coupled” or the word “receptor”.

Expressions can be expressed with AND and OR logical operators.

de = ”coupled” AND os=”musculus”

They can be arbitrarily nested, with usual AND/OR priorities.

de = ”coupled” AND (os=”musculus” OR os=”sapiens”)

Note: && and || can be used in place of AND and OR.

Results

A result database contains alignments. The GQ Engine stores alignments in a compact way. The table below displays the most frequently used alignment properties (case insensitive):

Field Name Description Example
OBQ Offset Begin position on the Query obq >= 10
OEQ Query End position on the Query oeq <= 500
OBD Offset Begin Position on the Database obd >= 20
OED Offset End Position on the Database oed <= 750
RL Alignment Length rl >= 500
RE Expect value (Blast) re <= 1e-6
RS Score (e.g. Blast Score or # of errs) rs > 250
RP P-value (Blast) rp >= 1e-6
RI %identity over the alignment (depends on lspextend) ri > 90
q.ac Accession # of the query sequences q.ac = “NM_12345”
s.ac Accession # of the subject sequences s.ac = “NM_12345”

Then, BFQL statements are expressed using those alignment “annotations”. For example:

re < 1e-6

would return alignments whose e-value is lower than 1e-6. Same,

rl >= 500 AND (re < 1e-6 OR ri >= 90)

would return alignments greater than or equal to 500 (nt or aa) and having either an expect value lower than 1e-6, or greater than or equal to 90% identity on the alignment.

The GQ Engine does not store alignments per se, but rather maintains alignment core properties together with references to the query and subject records (a GQE E numerical ID, a range). Therefore, annotations of those sequences are simply accessible through the q. and s. accessors. For example:

rl >= 100 and s.de = ”kinase” // de means description

will return alignments whose lengths are greater than or equal to 100 and that contain “kinase” in the description field of the subject sequence. Similarly,

s.pa != ”CompanyX” AND s.d1 > 20070101

will return alignments whose subject sequences have been published after 01-JAN-2007 (d1) and whose Patent Assignee (pa) is not from the "CompanyX".

-bfqlgroup

Grouping is a powerful concept primarily based on “Identity Rules”. For example:

  • All alignments with same E-value
  • All sequences from the same species

belong to the same group.

Say I did a NGS search of all my reads against all divisions of Genbank (metagenomics run). A typical need would be:

  1. I want to see all alignments falling into from the same taxonomy id grouped all together.
  2. Inside each of the groups above, I want to see the hits sorted by decreasing number of errors.
  3. I want then the groups to be lexically sorted by species.

Grouping does this with a single instruction that specifies the three different criteria. The following figure displays the general mechanism of grouping:

Groupingschematic.jpg

The Rules are Identity Rules. More precisely, they are sorting instructions: all objects that belong to the same sort key will be grouped together. For example, if the Rule is {ND} – which is the Biofacet numerical ID of the subject sequence in an alignment - then all alignments having same ND will be grouped together.

Syntax:

% lsp[db|res] file
     –bfqlgroup '{Grouping Rule},{Leader Rule},{Sort Group Rule}'

A Rule is defined by BFQL Sorting Conventions and supports multiple keys sort criteria. For the example described previously, the group statement would simply be:

-bfqlgroup ’{[s.ox]},{-[rs]},{[s.os]}’

Let’s consider another example: we want Biofacet BLAST Alignments ordered the NCBI way. This is:

  • Group together all HSPs between two sequences.
    • Criteria: {ND,NQ}
  • Within each group, alignment of best E-val on top.
    • Criteria: {RE}
  • Groups are sorted by best E-val as well.
    • Criteria: {RE}

Which is:

% lspres file -bfqlgroup '{[NQ],[ND]},{[RE]},{[RE]}'

Quality Check - Mapping against repeats

Now that we have downloaded reads and got familiar with the Hotdrive, let’s prepare a “repeat” reference database. We’ll focus on a pretty simple definition of repeats: Alu and Line-1 mRNA repeat.

As we have introduced lspvbank command in Section 4.2.2, let’s build the vdb of synthetic and primate divisions.

% export pri=$GQ_INSTALL/data/GQdata/content/hotdrive/GB_PRI/174/ GB_PRI174_20091118
% lspvbank –o syn_pri $syn $pri

The following bfql filter returns 9 sequences, as printed out with lengths and descriptions.

% lspdb syn_pri
   -bfql 'de ="Human" and ("alu consensus" or "Line-1 repeat mrna")'
   -printf '%H#AC\t%L\t%H#DE\n%VOID'

M19503 5976 Human Line-1 repeat mRNA with 2 open reading frames.
U14570 296 ***ALU WARNING: Human Alu-Sb2 subfamily consensus sequence
U14567 290 ***ALU WARNING: Human Alu-J subfamily consensus sequence.
U14568 288 ***ALU WARNING: Human Alu-Sb subfamily consensus sequence.
U14569 288 ***ALU WARNING: Human Alu-Sb1 subfamily consensus sequence
U14571 287 ***ALU WARNING: Human Alu-Sc subfamily consensus sequence.
U14572 291 ***ALU WARNING: Human Alu-Sp subfamily consensus sequence.
U14573 291 ***ALU WARNING: Human Alu-Sq subfamily consensus sequence.
U14574 290 ***ALU WARNING: Human Alu-Sx subfamily consensus sequence.

We can now check for reads aligning to those repeat sequences. We look for reads aligning to repeats with up to 4 errors – mismatches or indels, keeping one result, the best minimizing the # of errs. More details on the lspmul mapping technology and the best module can be found here.

% lspmul.TH –t 4
    -M mapnc,kerr
    -O "[-fltThreshold 1 -errs 4 -r ]"   
    -mn NUC.2.1
    -db  syn_pri 
    -bfql 'de =="Human" and("alu consensus" or "Line-1 repeat mrna")'
    -db livkid
    -o res/repeats.res
    -best 1
    -progress 

lspmul.TH:[1224204372/1224204372]	[2028731]	[r:373.594s, u:1008.428s, s:70.406s]

Using 4 threads, this took about 6 minutes, and we got 2M reads that we can consider as being repeats.

% lspres res/repeats.res –count
Nb Selected = 2028731

2M reads are involved.

Discarding repeats, mapping against transcripts

Building collections

Let’s take now all reads that we do not consider as repeats, and align them against Human transcripts. We then build from the previous run the collection of non-matching reads.

A reminder that BFQL contains special preset variables $& and $$ that respectively contain PRE and POST contents of a filter (see the File:BFQLReference.pdf and the BFQL Primer). Please note that the printf ’’ statement forces the output to be empty.

% lspres res/repeats.res
   -bfql 'true; POST {
	    // $$ contains what is returned by the filter; false by default
            $r := $$->qcoll() ;
            echo "# of reads aligned:  " + $r->Size();
          }'
   -printf ''
# of reads aligned:  2028731

Similarly, we can now dump the collection on Non-Matching reads, but simply taking the complement of $r, and then write on disk the resulting collection of name LK.nonrep (LiverKidney.nonrep).

% lspres res/repeats.res
   -bfql 'true; POST {
          $r := $$->qcoll()=>complement();
          $r->write("LK.nonrep");
          }’

How many of reads are left?

% lspdb LK.nonrep -printf '%POST%NBSEQS\n'
133993977

We can check that collections, as they only dump numerical IDs, take little space on disk. Those 134M reads are packed within 32Mb.

% du -shc LK.nonrep*
3.4M	LK.nonrep_10.scol
2.7M	LK.nonrep_1.scol
…
3.6M	LK.nonrep_9.scol
8.0K	LK.nonrep.ind
32M	total

We can now align this the collection against Human transcripts. We’ll grab them on the fly from RefSeq mRNA database in Hotdrive. We tune a little bit the –O parameters. Again, please refer here for more explanations on lspmul algorithms and parameters.

export rsm=$GQ_INSTALL/data/GQdata/content/hotdrive/RSM/38/RSM38_20091228

% time lspmul.TH -t 8
   -M mapnc,kerr
   -O '[-fltCut 720 -fltThreshold 15 -errs 2 -r ]'
   -db $rsm -bfql 'os="homo sapiens"'
   -db LK.nonrep
   -best 1
   -o res/rsm.res
   -progress

This search took 2 hours using 8 threads, and we got 34M reads aligning to Human transcripts, with up to 2 errors.

% lspres res/rsm.res -count
Nb Selected = 34500382

Transcript Counting

From then, we can do transcript counting by building groups of alignments having the same subject (individual transcripts).

% time lspres res/rsm.res
    -bfqlgroup '{[nd]}'
    -printf '%POST%NBGROUPS\n'
30301
real	2m1.2s 

We can refine this further by doing gene counting instead of transcript counting, which is grouping alignments falling into the same gene. We then group by the gn field property available in the RefSeq database (we use s.gn BFQL accessor); therefore, all alignments from different isoforms of the same gene will be grouped together into one.

% time lspres res/rsm.res
    -bfqlgroup '{[s.gn]}'
    -printf '%POST%NBGROUPS\n'
23495
real	5m30.864s

Runtime is longer because we access and compare sequence database annotation there, instead of numerical ids (s.gn) contained in the alignments themselves (nd). The number of genes is obviously less than the number of transcripts. If we want to see the distribution of number of reads per gene, we use special printf instructions that display only one element per group, and not all elements.

% time lspres res/rsm.res
    -bfqlgroup '{[s.gn]}'
    -printf '^[%HD#GN\t# reads: %6\n]GROUP.[]\n' | head
A1BG		# reads: 3924
A1CF		# reads: 1076
A2BP1		# reads: 15
A2LD1		# reads: 109
A2M		# reads: 25512
A2ML1		# reads: 21
A4GALT	# reads: 460
A4GNT		# reads: 14
…

As we are not 100% sure that we don’t measure reads coming from elsewhere on the genome, and/or whether alignments do not contain kind of low complexity or repeats, let’s refine and eliminate reads that are multiply located on the genome. For practical issues, let’s do it on genes.

Remapping against genes

We first build the collection the reads that matched to transcripts. We use $$ in this example, knowing that an empty filtering statement is implemented as false.

lspres res/rsm.res
  -bfql 'true; POST {$r := $$->qcoll->write("MatchRSM");}'
  -printf ''

We can now checking for uniqueness of those reads against genes. We use the GQgene database of Human genes (more info on GQgene). Here again, without Array Computing layer, we use the raw GQgene Human Virtual db on disk.

% export genes=$GQ_INSTALL/data/GQdata/content/hotdrive/GQGENE_Homo_sapiens/20090515/GQGENE

We run now the

% time lspmul.TH
   -t 8 -M mapnc,kerr
   -O '[-hitmap -fltCut 1000 -fltThreshold 15 -errs 2 -r ]'
   -db $genes -bfql 'os="homo sapiens"'
   -db MatchRSM
   -o res/rsmgenes.res
   -bfqlest 1
   -progress 1>res/rsmgenes.out 2>res/rsmgenes.err 

Please note the use of the hitmap flag in the options of the algorithm. This flag computes on the fly the hits distribution per number of errors. Still on the fly, a weighted score is computed: it is available as rp2 variable in BFQL. Above a given rp2 threshold, the read is flagged as “ambiguous” and is no more processed. For example, two hits with 0 errors leads to a non-decidable location of where the reads maps. This allows raising multiply-located reads ambiguities, and drops-off runtime for repeats. Let’s exclude the reads that are considered mutilply-located, and build the final file of results.

% lspres res/rsmgenes.res
    -bfql 'rp2 >= 128 ;
     POST { $r := $$->qcoll();
         echo "Count: "+ $r->size(); $r->write("ToExcludeFromRsm");}'
    -printf ''
Count: 4513082

% lspres res/rsm.res
    -bfql 'PRE {$r:=new collection("ToExcludeFromRsm.ind");
                $& := ~$r->rescoll();} 
    -dump res/Final.res

% lspres res/Final.res -count
Nb Selected = 29987300

We got 4,513,082 alignments that are considered as ubiquitous. This leads to a total of 34,500,382 – 4,513,082 = 29,987,300 unique alignments (= 29,987,300 reads since one alignment is kept per read).

As a final exercise, let’s characterize a bit further where those differences come from, which kind of reads were discarded by using this mechanisms. One can roughly look at discarded reads. Nothing better than an expert eye, but hey, there is a lot. Another and more interesting way is for example to build the table of most expressed genes and see how the filtering affected gene count. So we use bfqlgroup on both results sets: Final.res and original RefSeq rsm.res, sorting by decreasing group sizes.

% time lspres res/rsm.res
    -bfqlgroup '{[s.gn]},{[nq]},{-[$group->size()]}'
    -printf '^[%HD#GN\t# reads: %6\n]GROUP.[]\n' > C2.txt

% time lspres res/Final.res
    -bfqlgroup '{[s.gn]},{[nq]},{-[$group->size()]}'
    -printf '^[%HD#GN\t# reads: %6\n]GROUP.[]\n' > C1.txt

By comparing the two tables, we have:

C1.txt - FinalRes C2.txt - RefSeq
Gene # of reads Gene # of reads
LOC100288578 3777669 LOC100288578 3839038
LOC100008589 2474742 LOC100008589 2867761
LOC100288418 2149650 LOC100288418 2149866
LOC100293090 1321447 LOC100293090 1322458
LOC100131754 1014125 LOC100131754 1014201
LOC100008588 727402 LOC100008588 728958
LOC100293563 655902 LOC100293563 656142
MALAT1 637903 MALAT1 637903
ALB 441580 ALB 443769
SAA2 144712 KIAA1704 265813
GPX3 135650 SAA2 244370
HP 120084 HPR 232040
SERPINA1 119062 ORM1 176731
SAA1 117878 GPX3 135904
ALDOB 114177 HP 127423
FGB 103461 SAA1 120626
LOC100288871 97519 SERPINA1 119207
FTL 89241 FGA 114676
CRP 87044 ALDOB 114451
FGG 85340 FGB 103674
LOC100293593 80066 LOC100288871 97876

The genes (in green) are clearly featuring in the top RefSeq, while absent in the Final tops. Gene counts are very different, and could lead to wrong measures. Let’s look at one.

% lspres res/rsm.res -bfql 's.gn="KIAA1704"' –count
265813

Let’s see the distribution of start mapping positions on the transcript:

% lspres res/rsm.res -bfql 's.gn="KIAA1704"'
   -bfqlgroup '{[OBD]},{[OBD]},{-[$group->size()]}'
   -printf 'ˆ[Trans. Pos: %OBD\t%6\n]GROUP.[]'  | more
Trans. Pos: 1400	242275
Trans. Pos: 1399	14716
Trans. Pos: 1398	6678
Trans. Pos: 1390	1173
Trans. Pos: 1389	459
Trans. Pos: 1388	256
Trans. Pos: 1387	9
Trans. Pos: 1386	8
Trans. Pos: 547	7
Trans. Pos: 1016	7
Trans. Pos: 1027	4
Trans. Pos: 532	4
Trans. Pos: 646	4
Trans. Pos: 1380	4

The coverage around positions 1400 is obviously outrageous. Let’s pick one around:

% lspres res/rsm.res -bfql 's.gn="KIAA1704" && obd==1390' -a NM_018559  1388	1421	SRR002320.18947408     1  32	Errs= 2 
Q:          1  CAAAAAAAAAAAAAAAAAAAAAAAAACACAAA  32
               |||||||||||||||||||||||||| | |||
S:       1390  CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  1421

It is clearly a poly-A tail that has not been cleaned up in RefSeq, as it can be checked with: http://www.ncbi.nlm.nih.gov/nuccore/33354278

Grouping by lanes, by gene names

Finally, we can have a quick look at the distribution of reads aligned among lanes, by simply grouping on the query initial data set, the Virtual database livkid. We use the ngname property in BFQL grouping: this allows to group efficiently by physical datasets (a lane). We print out the generic name of the query physical dataset (GNAMEQ) and the number of members of a group with the sub-variable %6 in printf.

% time lspres res/Final.res
      -bfqlgroup '{[q.ngname]}'
      -printf '^[%GNAMEQ\t%6\n]GROUP.[]'
Kidney_R3:2	3178836
Kidney_R3:6	3308273
Kidney_R7:1	3076119
Kidney_R7:3	3182951
Kidney_R7:7	3024836
Liver_R3:3	2893762
Liver_R7:2	2893501
Liver_R7:4	2892333
Liver_R7:6	2823059
Liver_R7:8	2713630

real	5m13.438s

The GNAMEQ variable has many sub-fields. They can be displayed using

lspres -help | less <search-for-printf>

We can then for example use the %8 variable, which is the total number of reads in the physical dataset. Therefore, by doing the following, we can observe the total number of uniquely reads mapped together with the original number of reads.

% time lspres res/Final.res
      -bfqlgroup '{[q.ngname]}'
      -printf '^[%GNAMEQ[ %8]\t%6\n]GROUP.[]' > RPM.txt

% cat RPM.txt
Kidney_R3:2 13687929	3178836
Kidney_R3:6 13449864	3308273
Kidney_R7:1 13017169	3076119
Kidney_R7:3 13401343	3182951
Kidney_R7:7 12848201	3024836
Liver_R3:3 14761931	2893762
Liver_R7:2 14003322	2893501
Liver_R7:4 14230879	2892333
Liver_R7:6 13525355	2823059
Liver_R7:8 13096715	2713630

So, we can compute a say RPLM: Reads mapped Per Lane per Millions.

% awk '{printf ("%s :: RPM %s\n", $0, $3/$2);}' RPM.txt 
Kidney_R3:2 13687929	3178836 :: RPM 0.232236
Kidney_R3:6 13449864	3308273 :: RPM 0.245971
Kidney_R7:1 13017169	3076119 :: RPM 0.236312
Kidney_R7:3 13401343	3182951 :: RPM 0.23751
Kidney_R7:7 12848201	3024836 :: RPM 0.235429
Liver_R3:3 14761931	2893762 :: RPM 0.196029
Liver_R7:2 14003322	2893501 :: RPM 0.20663
Liver_R7:4 14230879	2892333 :: RPM 0.203243
Liver_R7:6 13525355	2823059 :: RPM 0.208723
Liver_R7:8 13096715	2713630 :: RPM 0.207199

By grouping by <lane, gene-names>, and knowing the transcripts lengths, we compute RPKM in our web-based GenomeQuest RNA-Seq workflow that way.

Finally, exporting to third-party packages such GeneSpring or Expressionist is grouping with a tab-delimited printf format. This has been previously addressed.

What about reads that did not map?

Remapping non-aligned reads is simply building the collection of non matching reads, and use it as the next query. In our case, there are some subtleties depending on whether one wants to include the reads that did match with repeats, or not. The following give the two ways of doing it.

Collection of remaining versus total; includes reads that hit repeats:

% lspres res/rsm.res
   -bfql 'true; POST {$r := ~$$->qcoll();
                $r->write("Remaining");}'
   -printf '%POST%NBSEQS\n''
101522326

Collection of remaining ; repeats thrown-away:

% lspdb LK.nonrep   // we start from reads – repeats.
   -bfql 'PRE {
    $r :=new collection("MatchRSM.ind");
    $& := $&-$r;  // substract hits to RefSeq
    }
    true;
          POST{$$->write("Remaining");}'
   -printf '%POST%NBSEQS\n'
99493595

Conclusion

We have downloaded and converted two samples of transcriptome sequencing of Kidney and Liver human cells. 130M Illumina short reads of 32bp have been quickly gathered into a GQE Virtual database of the 10 associated lines, 5 for Kidney, 5 for Liver.

We did quick Quality Check on reads against a simple dataset of 9 Human repeats extracted from Genbank on the Fly.

Using collections, we did the following operations:

  • Get rid of the repeats, and align remaining reads to human transcripts extracted on the fly from RefSeq mRNA;
  • Recheck for unique locations of those reads against genes;

We took a look of the discarded repeats and show how we could isolate other types of repeats. In our case, it was a poly-A tail. Although this is a trivial case, the method is generalist and avoids to (1) either do heavy pre-processing that will anyway be limited by a priori knowledge (2) either generate wrong gene counts.

We ended by providing grouping on lanes, and showed how these examples can easily compute RPKM.

Scaling It Up: Array Computing

The “GQ Engine Array Computing” is an extension of the GQ Engine that provides:

  • Hotdrive naming abstraction
  • Cluster Computing

The two following Perl modules implement the Array Computing:

Module Name Short Description
lspdb.H Operations on db (querying, counting, printing, ..)
lspcalc.THA Compute: takes one or two dbs as input and produces a physical res.

lspdb.H

lspdb.H is the extension of lspdb with an abstraction for Hotdrive naming. This means that all the physical databases we have been using in SubSection 4.2.2 can be addressed using symbolic names instead of Unix filenames. A symbolic name is simply a channel name. For example GB_PRI is the virtual database in the Hotdrive of the latest Genbank Primate Division files.

To differentiate such a name with a Unix filenames (which can of course still be used with lspdb.H), the convention GB_PRI[] is taken. Multiple channel names can be given, separated by “,”. Last, as there are multiple technical options specific to lspdb.H, all options belonging to lspdb feature-up after the -- option.

For example, the command line used page 19:

% lspdb syn_pri
   -bfql 'de ="Human" and ("alu consensus" or "Line-1 repeat mrna")'
   -printf '%H#AC\t%L\t%H#DE\n%VOID'

which was “painfully” built with physical file names, and bash variables, can directly be written as:

lspdb.H GB_SYN,GB_PRI[] --
   -bfql 'de ="Human" and ("alu consensus" or "Line-1 repeat mrna")'
   -printf '%H#AC\t%L\t%H#DE\n%VOID'

More information on lspdb.H can be found in the System Concepts article.

lspcalc.THA

lspcalc.THA is the Array Computing extension of lspcalc.TH and lspmul.TH. In a nutschell, it provides execution of GQ Engine sequence searching commands on a Linux Cluster through the use of a queuing system. It inherits the hotdrive naming abstraction of lspdb.H and provides flags to control the queues and the mode of execution.

The modes of execution include various schemas. Briefly:

  • Split the “search-space” by dividing jobs among (a) query-space (b) subject space (c) both. (a) is mainly used for NGS. (b) is well-suited when interactive, fast-answers are required. (c) is on your own.
  • Associated with the above, the splitting can be done based on the number of sequences, but also based on the number of residues – while slow at start, this strategy ensures much better load balancing, especially on Subjects.
  • Specify the number of jobs, and the number of threads per job (equivalent in most cases as number of nodes and number of cores per node).
  • Specify the way jobs are scheduled on queues. Related to the database-splitting strategies, the Array can maintain a sophisticated schema where any virtual database can always remap the same ranges of all split physical databases a job can be concerned with on a node. This “in-memory” mode ensures no more disk access, whether NFS or local.
  • Can take or not advantage of local stripped (sequence-only) Hotdrive staged on each of the nodes.

The general synopsis is:

% lspcalc.THA <specific-options> -- <GQE options>

For example, the multithreaded run of reads against RefSeq transcripts we presented above can be performed in 10 mns using 12 nodes, 4 threads per node. Please note the use of Hotdrive naming (RSM[]), as well as BFQL on the fly and the use of collections.

lspcalc.THA -qs SGE –qn priv.q -splitrole query -jobname jj –j 12 --
   -t 4
   -M mapnc,kerr
   -O '[-fltCut 720 -fltThreshold 15 -errs 2 -r ]'
   -db RSM[] -bfql 'os="homo sapiens"'
   -db LK.nonrep
   -o res/rsm.res
   -best 1
   -progress

Multiple queuing systems are supported (SGE, LSF, …).

More on lspcalc.THA can be found in here.

Up to GQ

All the concepts that have been presented in this GQ Engine primer ultimately end up with a Unix script producing results, whether they be databases, alignments, or tab-delimited files, or a combination of those. Once enough iterations have been performed, and the script looks good enugh to go into production, the time has come to open it for wider consumption than bioinformatics developers only.

The GQ Workflow APIs are designed to smoothly integrate this script, and unlock the GQ Engine power – and script’s author sharpness ☺ - for use by Scientists. Below is a hint of what you can do inside of GenomeQuest to turn your GQ Engine work into consumable reports by scientists, taken from the GenomeQuest RNA-Seq workflow report page.

Rnaseqreport.png

More Docs

Now that you've read the GQ Engine Primer, there are a few directions you can go:

As always, we're here to help. Reach out to support@genomequest.com for any question, any time.

Personal tools