Welcome to the GenomeQuest Documentation Wiki

Var import

What does it do?

  • The SNPs and Indels Import workflow allows the upload of variants in VCF, TSV, or SVA formats.

Supported Formats

VCF Format:

  • Variant Call Format (VCF) was created by the 1000 Genomes Project and is growing in popularity as the de facto variant format.
  • A VCF file looks like this:


##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
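The layout of the example above can be read with a few lines of Python. This is only a minimal sketch, not GenomeQuest's importer: the function name parse_vcf and the returned dictionary keys are hypothetical. It assumes the eight fixed columns plus FORMAT come first, and it falls back to whitespace splitting because the example above is space-aligned, whereas real VCF files are tab-delimited.

```python
def parse_vcf(lines):
    """Return (sample_names, records) from an iterable of VCF lines.

    Hypothetical helper for illustration; not part of GenomeQuest.
    """
    samples, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        # Real VCF is tab-delimited; the wiki example is space-aligned.
        fields = line.split("\t") if "\t" in line else line.split()
        if fields[0] == "#CHROM":
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
        keys = fmt.split(":")
        # Trailing per-sample fields may be omitted, so zip() simply stops early.
        genotypes = {
            name: dict(zip(keys, call.split(":")))
            for name, call in zip(samples, fields[9:])
        }
        records.append({
            "chrom": chrom, "pos": int(pos), "id": vid,
            "ref": ref, "alt": alt.split(","),
            "qual": qual, "filter": flt, "info": info,
            "genotypes": genotypes,
        })
    return samples, records
```

Calling parse_vcf(open(path)) on a file like the one above would yield the three sample names and one dictionary per data line, with each sample's GT, GQ, DP, and HQ values keyed by sample name.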




SVA Format:



Complete Genomics Format:

  • Complete Genomics variant files are tab-separated .tsv files.
  • A .tsv file looks like this:


#ASSEMBLY_ID	GS19238-1100-36-ASM
#COSMIC	COSMIC v48
#DBSNP_BUILD	dbSNP build 130
#FORMAT_VERSION	1.5
#GENERATED_AT	2010-Nov-20 05:29:46.334776
#GENERATED_BY	dbsnptool
#GENOME_REFERENCE	NCBI build 36
#SAMPLE	GS00028-DNA_A01
#SOFTWARE_VERSION	1.10.0.22
#TYPE	VAR-ANNOTATION
 
>locus	ploidy	allele	chromosome	begin	end	varType	reference	alleleSeq	totalScore	hapLink	xRef
1	2	all	chr1	0	901	no-call	=	?			
2	2	all	chr1	901	906	ref	=	=			
3	2	all	chr1	906	959	no-call	=	?			
4	2	all	chr1	959	972	ref	=	=			
5	2	all	chr1	972	1005	no-call	=	?			
6	2	all	chr1	1005	1013	ref	=	=			
7	2	all	chr1	1013	1033	no-call	=	?			
8	2	all	chr1	1033	1084	ref	=	=			
9	2	1	chr1	1084	1096	ref	AGGGCGCCCCCT	AGGGCGCCCCCT	25		
9	2	1	chr1	1096	1098	no-call-rc	GC	NN	41		
9	2	1	chr1	1098	1106	ref	TGGCGACT	TGGCGACT	23		
9	2	2	chr1	1084	1106	no-call	AGGGCGCCCCCTGCTGGCGACT	?
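The structure of the example above (metadata lines starting with "#", a column header line starting with ">", then tab-separated rows) can be read as follows. This is a minimal sketch under those assumptions only; the function name parse_cg_var is hypothetical and this is not GenomeQuest's own conversion code.

```python
def parse_cg_var(lines):
    """Return (metadata, columns, rows) from Complete Genomics .tsv lines.

    Hypothetical helper for illustration; not part of GenomeQuest.
    """
    metadata, columns, rows = {}, [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank or whitespace-only separator lines
        if line.startswith("#"):
            # Metadata line: "#KEY<TAB>value"
            key, _, value = line[1:].partition("\t")
            metadata[key] = value
        elif line.startswith(">"):
            # Column header line: ">locus<TAB>ploidy<TAB>..."
            columns = line[1:].split("\t")
        else:
            values = line.split("\t")
            # Rows may omit trailing empty fields; pad them to the header width.
            values += [""] * (len(columns) - len(values))
            rows.append(dict(zip(columns, values)))
    return metadata, columns, rows
```

Each row comes back as a dictionary keyed by the column names, so a field such as varType or alleleSeq can be looked up by name. All values are kept as strings, including begin and end.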




What does it produce?

  • This workflow takes your variants in VCF, TSV, or SVA format and converts them into GenomeQuest's biofacet format. Once the workflow is done, the variants can be viewed and queried in GenomeQuest's sequence database browser. A single VCF file can contain multiple samples; each sample is identified and imported accordingly.



Next Steps:

  • The next logical step after running this workflow is either to annotate the variants (Whole Genome Annotator workflow) or to compare them against multiple samples using the Multi-Sample Variant workflows.



Important Parameters that are used in the Workflow:

  • The launch page lists the following parameters:
    • Database Name: An intuitive name to give your run
    • Reference Genome: The genome that was used to call variants
    • Variant File Type: The file format of the variants
    • Sample Information:
      • Patient/Organism ID
      • Ethnicity/Population
      • Sample ID
      • Experiment ID


  • An example of the parameters used in the workflow can be found on the launch page:


[Image: Launch.jpg]



Report Page:

  • The report of a finished workflow run shows the following:
    • Import Parameters: The parameters used to identify your variants.
    • Imported Variant Database(s): A list of the variant databases created by your import. There are several if your file contained multiple samples (VCF only).


[Image: Report page.jpg]


  • The processed variant database in GenomeQuest's biofacet form will contain one variant call per record.
    • Each variant record describes the alternate allele and contains the following information:
      • Identifier
      • Description
      • Database Name
      • Link out (if applicable)
      • Begin/End positions
      • Variant type: mismatch or reference
      • Variant score: the score given to the variant
      • Description: a description of the variant
      • Chromosome and Position
      • Chromosome
      • Chromosomal Location
      • Variant length
      • Haplotype phase
      • Chromosome Number
      • Sequence length
      • Database name


  • An example of the processed database in biofacet form, viewed in the sequence database browser:


[Image: Seqdb.jpg]



A High-level and Algorithmic Description of this workflow:

  • The Variant Import algorithm works in the following way:
    • Convert the input file (VCF, SVA, or TSV) into GenomeQuest's internal form.
    • If the file is VCF, identify all samples (a single VCF file may contain several) and put each sample into its own biofacet database.
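The per-sample split described above can be illustrated with a short Python sketch. It is an illustration only, not GenomeQuest's implementation: the function name split_by_sample and the tuple layout are hypothetical, the biofacet database is stood in for by a plain list per sample, and every call is kept (including reference genotypes), since the wiki does not say which calls are retained. It also assumes each FORMAT column includes a GT key.

```python
from collections import defaultdict


def split_by_sample(vcf_lines):
    """Group genotype calls by sample name.

    Returns {sample: [(chrom, pos, id, ref, alt, genotype), ...]}.
    Hypothetical helper for illustration; not part of GenomeQuest.
    """
    samples = []
    per_sample = defaultdict(list)
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t") if "\t" in line else line.split()
        if fields[0] == "#CHROM":
            samples = fields[9:]  # sample names are the columns after FORMAT
            continue
        site = tuple(fields[:5])  # CHROM, POS, ID, REF, ALT
        gt_index = fields[8].split(":").index("GT")  # assumes GT is present
        for name, call in zip(samples, fields[9:]):
            per_sample[name].append(site + (call.split(":")[gt_index],))
    return dict(per_sample)
```

For a VCF file with three sample columns, the result has three keys, each holding one entry per variant line. A single-sample file produces a single key.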



FAQs:

  • Frequently Asked Questions:
    • What specific Complete Genomics formats are supported?
      • GenomeQuest supports the Complete Genomics .tsv format corresponding to variants called in the Assembly (ASM) phase of the project.
    • Can multi-sample VCF files be imported?
      • Yes, GenomeQuest can import multi-sample VCF files in which each sample has its own column after the FORMAT column.


