Welcome to the GenomeQuest Documentation Wiki

SAM / BAM Import



What does it do?


  • The SAM / BAM Import workflow uploads pre-mapped reads (reads that have already been aligned) onto GenomeQuest servers for downstream analyses through the GenomeQuest software application.


Samsum.jpg


A Map of how to Analyze your NGS data through the SAM / BAM Import Workflow:


Sambam3.jpg



What does it produce?


  • This workflow takes your aligned reads in SAM or BAM format and converts them into GenomeQuest's internal format.
  • It produces a reads database, viewable in the Sequence Database Browser, and a mapping results database, viewable in the Results Database Browser.
  • The results can also be used in all downstream workflows, as if the raw reads had been run through any of GenomeQuest's workflows.


Reppage.jpg



Next Steps:


  • The next logical step after running this workflow is either to call variants (Variant Calling workflow) or to look for protein binding sites (ChIP-Seq workflow).
  • Variants can be further annotated using the Variants Annotation Workflow.
  • Reads can be remapped using GenomeQuest's alignment algorithm GASSST, or digital expression can be determined through GenomeQuest's RNA-Seq (DigEx) workflow.
  • Following further analyses, your reads can be exported using GenomeQuest's SAM / BAM Export plugin for further use in your research (below):


Export2.jpg


  • The following next steps are accessible directly from the report page (below):

Nextsteps.jpg


Important Parameters that are used in the Workflow:


  • The launch page lists the following parameters:


Launchsam.jpg

  • File Format Type:
    • The format of the input file (SAM or BAM).


  • Reference:
    • The reference the reads were aligned to.


  • Library Type:
    • Paired End
    • Single End
    • Mate Pair


  • Filter mapped reads below 'x':
    • Remove reads whose mapping quality falls below threshold 'x'.



Report Page Description:


  • The report of a finished workflow run shows the following things:


  • Number of reads pruned (filtered) because they fell below a mapping quality of 'x'.
  • Number of uniquely mapped (informative) reads.
  • Number of ambiguously mapped reads.
  • Number of duplicate reads.


  • Coverage metrics (genome-wide):
    • Average coverage, and standard deviation.
    • Median Coverage.
    • Bar chart of reads binned by their respective Mapping qualities.


  • Paired End/Mate Pair specific:
    • Number of paired reads.
    • Number of singletons.
    • Unmapped reads (if applicable).
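
  • The genome-wide coverage metrics listed above could be computed along the following lines. This is only a minimal sketch, not GenomeQuest's implementation: the function name coverage_metrics, the use of the population standard deviation, and the choice to count uncovered positions as depth 0 are all assumptions made for illustration.

```python
import statistics


def coverage_metrics(intervals, reference_length):
    """Mean, population standard deviation, and median of per-base depth.

    `intervals` holds 1-based inclusive (begin, end) pairs on one reference.
    Assumption: positions without any alignment count as depth 0.
    """
    depth = [0] * reference_length
    for begin, end in intervals:
        # Convert the 1-based inclusive interval to 0-based list indices.
        for position in range(begin - 1, end):
            depth[position] += 1
    return (
        statistics.mean(depth),
        statistics.pstdev(depth),
        statistics.median(depth),
    )


if __name__ == "__main__":
    # Two overlapping alignments on a hypothetical 8-base reference.
    print(coverage_metrics([(1, 4), (3, 6)], 8))
```

  • A per-base list is the simplest way to show the idea; for a real genome a difference array or interval sweep would use far less memory.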


Stats.jpg


A High-Level Algorithmic Description of this Workflow:


  • The Alignment Import algorithm works in the following way:


  • If the file is SAM, convert it to BAM format.
  • If the file is BAM, proceed: convert it to GenomeQuest's internal format.


  • For each alignment:
    • Find corresponding pair (if applicable).
    • Filter out the alignment if its mapping quality is below threshold 'x'.
    • Count unique / informative, ambiguous, and duplicate reads.


  • Compute the average, standard deviation, and median coverage to return as genome-wide metrics for quality control.
  • Keep track of the reads' mapping qualities in order to return the bar chart on the report page.
  • Tally unmapped reads.
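
  • The per-alignment counting described above can be sketched as follows. This is an illustrative sketch working on SAM text lines, not the actual import program. The function name summarize_alignments and the rule "MAPQ 0 means ambiguous" are assumptions; the wiki does not say how GenomeQuest separates unique from ambiguous reads. The FLAG bits 0x4 (unmapped) and 0x400 (duplicate) come from the SAM specification.

```python
from collections import Counter

# SAM FLAG bits defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


def summarize_alignments(sam_lines, min_mapq):
    """Tally import-style statistics from SAM text lines.

    Assumption: MAPQ 0 marks an ambiguous placement; the real workflow
    may use a different definition.
    """
    stats = Counter()
    mapq_histogram = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & FLAG_UNMAPPED:
            stats["unmapped"] += 1
            continue
        if mapq < min_mapq:
            stats["filtered"] += 1  # below threshold 'x'
            continue
        mapq_histogram[mapq] += 1  # data for the mapping-quality bar chart
        if flag & FLAG_DUPLICATE:
            stats["duplicate"] += 1
        elif mapq == 0:
            stats["ambiguous"] += 1
        else:
            stats["unique"] += 1
    return stats, mapq_histogram
```

  • Calling summarize_alignments(open("reads.sam"), 10) on a hypothetical file reads.sam would return the counters for a threshold of 10.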



Miscellaneous Information, and GenomeQuest Recommendations:


  • The processed alignment database in GenomeQuest's internal format will contain one alignment per record.


  • An alignment record shows information about a single read's alignment.


  • Each alignment record contains the following information:
    • Identifier - The read name.
    • Begin - Start position.
    • End - Stop position.
    • Mapping Quality - The quality score of the read's mapping.
    • Pair location - The location of the read's corresponding pair (if applicable).
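
  • To make the record fields concrete, here is a small sketch that derives them from one SAM line. It is not GenomeQuest's internal format: the class name AlignmentRecord, the function record_from_sam_line, and the "reference:position" form of the pair location are hypothetical. The End position is computed from the CIGAR string using the operations that consume reference bases (M, D, N, =, X) according to the SAM specification.

```python
import re
from dataclasses import dataclass
from typing import Optional

# CIGAR operations that consume reference bases (per the SAM specification).
REFERENCE_CONSUMING = set("MDN=X")


@dataclass
class AlignmentRecord:
    identifier: str               # read name
    begin: int                    # 1-based start position
    end: int                      # 1-based stop position (inclusive)
    mapping_quality: int
    pair_location: Optional[str]  # hypothetical "reference:position" form


def record_from_sam_line(line):
    """Build an AlignmentRecord from one mapped SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    name, rname, pos, mapq, cigar = f[0], f[2], int(f[3]), int(f[4]), f[5]
    rnext, pnext = f[6], int(f[7])
    # Number of reference bases covered by the alignment.
    ref_span = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REFERENCE_CONSUMING
    )
    pair = None
    if rnext != "*":
        # "=" means the mate lies on the same reference as this read.
        mate_ref = rname if rnext == "=" else rnext
        pair = f"{mate_ref}:{pnext}"
    return AlignmentRecord(name, pos, pos + ref_span - 1, mapq, pair)
```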



Troubleshooting the SAM / BAM Import Workflow:


  • There are some important factors to consider when importing your SAM or BAM file into GenomeQuest.


  • Recall that GenomeQuest has its own proprietary internal database format, sometimes referred to as biofacet.
  • To get your aligned reads from SAM / BAM into GenomeQuest, an internal script is run that takes your reads and places them precisely onto a GenomeQuest-supported reference.
  • The reference that the reads were mapped to externally must precisely match the internal reference the reads are being placed on within GenomeQuest. This means:
    • The lengths of the reference sequences must be the same; i.e., the references must be exactly the same.
    • The reference (chromosome) names that the reads were aligned to externally must also be the same as those used in the GenomeQuest reference.
    • NOTE - GenomeQuest uses standard names for contigs, references, chromosomes, etc., and your reads are imported only if their reference names are identical to these. For example, if your reads were mapped to chr1 in human, the reference name for all reads mapped to this chromosome in your SAM / BAM file must be chr1, as that is the name used in GenomeQuest. If the reference names are formatted differently (chr_1 instead of chr1, or CHR1 instead of chr1), those reads will be skipped. This is because an internal map, created during the workflow, tells GenomeQuest's internal conversion program which reads go where.
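
  • Before uploading, you could check the reference names and lengths in your file yourself. The sketch below reads the @SQ lines of a SAM header (the SN and LN tags are defined by the SAM specification) and compares them to a table of expected references. The function names and the expected table are assumptions for illustration; the real names and lengths are whatever the chosen GenomeQuest reference uses.

```python
def parse_sq_header(sam_header_lines):
    """Return {reference name: length} from the @SQ lines of a SAM header."""
    references = {}
    for line in sam_header_lines:
        if not line.startswith("@SQ"):
            continue
        # Each remaining field has the form TAG:value, e.g. SN:chr1.
        tags = dict(
            field.split(":", 1) for field in line.rstrip("\n").split("\t")[1:]
        )
        references[tags["SN"]] = int(tags["LN"])
    return references


def check_references(file_refs, expected_refs):
    """Report names unknown to the expected reference and length mismatches."""
    unknown = sorted(name for name in file_refs if name not in expected_refs)
    wrong_length = sorted(
        name
        for name, length in file_refs.items()
        if name in expected_refs and expected_refs[name] != length
    )
    return unknown, wrong_length


if __name__ == "__main__":
    # Illustrative expected table: chr1 of GRCh37/hg19 is 249,250,621 bases long.
    expected = {"chr1": 249250621}
    header = ["@HD\tVN:1.0", "@SQ\tSN:CHR1\tLN:249250621"]
    print(check_references(parse_sq_header(header), expected))
```

  • The comparison is deliberately exact and case-sensitive, mirroring the requirement that CHR1 or chr_1 does not count as chr1.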


  • Why is this required?
    • GenomeQuest doesn't require identical reference names to make the user's life more difficult; consider all the variations that could exist. One user uses chr_1, another uses CHR1, another uses just 1, and another decides to go with an accession ID like NC_000001. It would be very difficult to anticipate and match every possibility. It is therefore much simpler to require that SAM / BAM reference names match the standard IDs that GenomeQuest uses. You can be certain that GenomeQuest will use a standard format, and that this format will not change. This makes importing alignments much smoother on both the client side and the server side.


  • What does this map look like?
  • The map that GenomeQuest's internal conversion program uses has the following space-separated format:
[runner@tesla 21548]$ cat READY.txt 
chr7 chr7
chr20 chr20
chr22 chr22
chr14 chr14
chrY chrY
chr19 chr19
chr8 chr8
chr1 chr1
chr11 chr11
chr6 chr6
chr17 chr17
chr21 chr21
chr16 chr16
chr3 chr3
chr18 chr18
chr12 chr12
chr15 chr15
chrX chrX
chr4 chr4
chr2 chr2
chr9 chr9
chr13 chr13
chr10 chr10
chr5 chr5
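
  • A map in this layout is easy to read programmatically. The sketch below is only an illustration: the function name load_reference_map is hypothetical, and treating the first column as the name found in the SAM / BAM file and the second as the GenomeQuest name is an assumption, since the listing above shows identical names in both columns.

```python
def load_reference_map(text):
    """Parse space-separated 'name name' pairs, one pair per line.

    Assumption: column 1 is the reference name in the SAM / BAM file and
    column 2 is the matching GenomeQuest reference name.
    """
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        external, internal = parts
        mapping[external] = internal
    return mapping


if __name__ == "__main__":
    reference_map = load_reference_map("chr7 chr7\nchr20 chr20\nchrY chrY\n")
    # A name that is absent from the map (here CHR1) has no target reference.
    print(reference_map.get("chrY"), reference_map.get("CHR1"))
```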


  • What does an internal reference look like in GenomeQuest's internal format?
  • Below is an example of an internal reference maintained by GenomeQuest. This excerpt is from Homo sapiens, reference hg19 (build GRCh37):


ID	chr1
AC	NC_000001.10
GI	224589800
D6	20101025
MT	genomic DNA
CH	1
DE	Homo sapiens chromosome 1, GRCh37.p2 primary reference assembly.
OS	Homo sapiens (human)
OX	9606
OC	Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;Catarrhini; Hominidae; Homo.
CC	... truncated for brevity...

  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...
  • Notice that ID is the field of GenomeQuest's reference that must match the reference (chromosome) name field in the SAM or BAM file being imported through the workflow. If the two do not match exactly, the workflow skips the reads mapped to that reference and moves on to the next reference name in the file, continuing until it finds one that matches a GenomeQuest reference.
  • Below is an example of properly formatted reference names within a SAM or BAM file. These reads will be converted successfully into GenomeQuest's internal format through the SAM / BAM Import Workflow:
SRR098401.42290474	161	chrY	10003	0	65M	chr13	111894063	0	ACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAAAGTGGACCTATCAGCAGGATGTGG	7)7'.(;8.0@6A6A1/*9-2+427>;BB51-111:><<,2+7-6+79==DB?BB.?99<D@;DC	XT:A:R	NM:i:1	SM:i:0	AM:i:0	X0:i:2	X1:i:0	XM:i:1	XO:i:0	XG:i:0	MD:Z:1A63	XA:Z:chrX,+60003,65M,1;
SRR098401.93377528	99	chrY	10267	0	76M	=	10297	106	AGACCACAACCCCACCAGAAAGAAGAAACTCAGAACACATCTGAACATCAGAAGAAACAAACTCCGGACGCGCCAC	HHHHHHHHHHHHHHHHHGHGHHHHHHHHHHHFHHHHHHHHHHFHHHHHFFHDEG?FHHHEGHBHFHEHHHG@FEFG	XT:A:R	NM:i:0	SM:i:0	AM:i:0	X0:i:2	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:76	XA:Z:chrY,+10267,76M,0;
SRR098401.93377528	147	chrY	10297	0	76M	=	10267	-106	CAGAACACATCTGAACATCAGAAGAAACAAACTCCGGACGCGCCACCTTTAAGAACTGTAACACTCACCGCGAGGT	6EHDEHEBECCFFFA6HHHGHGGGGEBFEEEHE>HHFEFFBHFFBFHFHFHHHHFHHHHBFFEHHHCHHHHHHHHH	XT:A:R	NM:i:0	SM:i:0	AM:i:0	X0:i:2	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:76	XA:Z:chrY,-10297,76M,0;
  • Note that the reference in this instance is chromosome Y. Three alignment records map to this chromosome, and the reference name, which is the third field of each record, is correctly labeled chrY. The first record is repeated below; look at its third field:


SRR098401.42290474 161 chrY 10003 0 65M chr13 111894063 0 ACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAAAGTGGACCTATCAGCAGGATGTGG 7)7'.(;8.0@6A6A1/*9-2+427>;BB51-111:><<,2+7-6+79==DB?BB.?99<D@;DC XT:A:R NM:i:1 SM:i:0 AM:i:0 X0:i:2 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:1A63 XA:Z:chrX,+60003,65M,1;

  • Some common reasons for errors:
    • Files are mislabeled - A SAM file was improperly named 'file.bam' rather than 'file.sam', or vice versa. This will cause the workflow to fail.
    • Invalid extensions - Only .SAM | .sam and .BAM | .bam are allowed suffixes.
    • Truncation - The SAM or BAM file was truncated prematurely while it was being created, so it has no End of File (EOF) record. This will crash the workflow.
    • Mislabeled references - If the chromosome names that reads are mapped to in the SAM / BAM file do not match the chromosome names of the reference within GenomeQuest, a warning is issued in the workflow log and all those reads fail to be processed. The workflow may or may not fail.
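
  • Several of these problems can be caught locally before upload. The following sketch is a set of inexpensive checks and is not part of GenomeQuest; the function name check_alignment_file and its messages are hypothetical. It relies on two facts from the SAM specification: BAM files are BGZF (gzip-compatible) compressed, so they start with the bytes 1f 8b, and a complete BAM file ends with a fixed 28-byte empty BGZF block that serves as the EOF marker.

```python
import os

# 28-byte empty BGZF block defined by the SAM specification as the BAM EOF marker.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)


def check_alignment_file(path):
    """Return a list of problems found by cheap, format-level checks."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".sam", ".bam"):
        return [f"unsupported extension: {ext or '(none)'}"]
    with open(path, "rb") as handle:
        head = handle.read(2)
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - len(BGZF_EOF)))
        tail = handle.read()
    looks_compressed = head == b"\x1f\x8b"  # gzip/BGZF magic bytes
    if ext == ".bam" and not looks_compressed:
        problems.append("named .bam but content is not BGZF-compressed")
    if ext == ".sam" and looks_compressed:
        problems.append("named .sam but content looks compressed (BAM?)")
    if looks_compressed and tail != BGZF_EOF:
        problems.append("missing BGZF EOF marker; file may be truncated")
    return problems
```

  • An empty list means none of these particular checks failed; it does not guarantee that the import will succeed. A truncated plain-text SAM file cannot be detected this way, because SAM text has no end marker to look for.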



FAQs:


  • What are SAM / BAM files?
    • SAM stands for Sequence Alignment/Map format. It was created through the cooperation of a number of sequencing cores to provide a very dense data format that preserves alignment information while remaining plain text. BAM is the highly compressed binary version of SAM, and is much quicker to process due to its binary nature.


  • Will all alignments be kept once the workflow is run?
    • Reads whose mapping quality falls below the threshold 'x' specified on the launch page will be removed.
    • Reads without an alignment will be removed.


