Welcome to the GenomeQuest Documentation Wiki
Var import
What does it do?
- The SNPs and Indels Import workflow allows the upload of variants in VCF, TSV, or SVA formats.
Supported Formats
VCF Format:
- Variant Call Format (VCF) was created by the 1000 Genomes Project and is growing in popularity as the de facto variant format.
- VCF format appears as such:
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
- Read more about VCF here:
SVA Format:
- Sequence Variant Analyzer (SVA) format was created by the SVA Project at Duke University.
- Read more about SVA here:
Complete Genomics Format:
- Complete Genomics formatted variant files come in the form of .tsv files.
- .tsv files appear as such:
#ASSEMBLY_ID GS19238-1100-36-ASM
#COSMIC COSMIC v48
#DBSNP_BUILD dbSNP build 130
#FORMAT_VERSION 1.5
#GENERATED_AT 2010-Nov-20 05:29:46.334776
#GENERATED_BY dbsnptool
#GENOME_REFERENCE NCBI build 36
#SAMPLE GS00028-DNA_A01
#SOFTWARE_VERSION 1.10.0.22
#TYPE VAR-ANNOTATION
>locus ploidy allele chromosome begin end varType reference alleleSeq totalScore hapLink xRef
1 2 all chr1 0 901 no-call = ?
2 2 all chr1 901 906 ref = =
3 2 all chr1 906 959 no-call = ?
4 2 all chr1 959 972 ref = =
5 2 all chr1 972 1005 no-call = ?
6 2 all chr1 1005 1013 ref = =
7 2 all chr1 1013 1033 no-call = ?
8 2 all chr1 1033 1084 ref = =
9 2 1 chr1 1084 1096 ref AGGGCGCCCCCT AGGGCGCCCCCT 25
9 2 1 chr1 1096 1098 no-call-rc GC NN 41
9 2 1 chr1 1098 1106 ref TGGCGACT TGGCGACT 23
9 2 2 chr1 1084 1106 no-call AGGGCGCCCCCTGCTGGCGACT ?
- Read more about Complete Genomics .tsv format here:
What does it produce?
- This workflow takes your variants in VCF, TSV, or SVA format, and converts them into GenomeQuest's biofacet format. Once the workflow is done, variants can be viewed and queried in GenomeQuest's sequence database browser. For VCF, multiple samples can be contained in the same file, and they will be identified accordingly.
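To give a feel for the kind of input the workflow consumes, here is a minimal Python sketch that reads a Complete Genomics-style .tsv file into header metadata and per-row records. It is an illustration only, not GenomeQuest's converter: the function name read_cg_var is hypothetical, and the sketch assumes tab-delimited input in which header lines begin with "#" and the column line begins with ">".

```python
def read_cg_var(lines):
    """Return (metadata, records) from Complete Genomics-style var text lines.

    Hypothetical helper for illustration; assumes tab-delimited input.
    """
    metadata, columns, records = {}, [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            # Header lines look like "#KEY<TAB>value".
            key, _, value = line[1:].partition("\t")
            metadata[key] = value
        elif line.startswith(">"):
            # The column-name line starts with ">".
            columns = line[1:].split("\t")
        else:
            # Rows may omit trailing columns; zip stops at the shorter list.
            records.append(dict(zip(columns, line.split("\t"))))
    return metadata, records


lines = [
    "#ASSEMBLY_ID\tGS19238-1100-36-ASM",
    "#GENOME_REFERENCE\tNCBI build 36",
    ">locus\tploidy\tallele\tchromosome\tbegin\tend\tvarType"
    "\treference\talleleSeq\ttotalScore\thapLink\txRef",
    "1\t2\tall\tchr1\t0\t901\tno-call\t=\t?",
    "9\t2\t1\tchr1\t1084\t1096\tref\tAGGGCGCCCCCT\tAGGGCGCCCCCT\t25",
]
metadata, records = read_cg_var(lines)
```

Each record is a plain dictionary keyed by column name, so a row such as the no-call interval above simply lacks the columns it does not supply.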
Next Steps:
- The next logical step after running this workflow is either to annotate the variants (Whole Genome Annotator workflow) or to compare them across multiple samples using the Multi-Sample Variant workflows.
Important Parameters that are used in the Workflow:
- The launch page lists the following parameters:
- Database Name: An intuitive name to give your run
- Reference Genome: The genome that was used to call variants
- Variant file type: The file format of the variants (VCF, TSV, or SVA)
- Sample Information:
- Patient/Organism ID
- Ethnicity/Population
- Sample ID
- Experiment ID
- An example of the parameters used in the workflow can be found on the launch page:
Report Page:
- The report of a finished workflow run shows the following things:
- Import Parameters: Parameters used to identify your variants.
- Imported Variant Database(s): A list of the variant databases created by your import. There will be more than one if your file contained multiple samples (VCF only).
- The processed variant database in GenomeQuest's biofacet form will contain one variant call per record.
- A variant record shows information about the nature of the alternate allele:
- Each variant record contains the following information:
- Identifier
- Description
- Database Name
- Link out (if applicable)
- Begin/End positions
- Variant type: mismatch or reference
- Variant score: the score given to the variant
- Description: a description of the variant
- Chromosome and Position
- Chromosome
- Chromosomal Location
- Variant length
- Haplotype phase
- Chromosome Number
- Sequence length
- Database name
- An example of the processed database in biofacet form, as viewed in the sequence database browser:
A High-level and Algorithmic Description of this workflow:
- The Variant Import algorithm works in the following way:
- If the file is VCF, SVA, or TSV, perform proper conversion into GenomeQuest's internal form
- If the file is VCF, identify all samples (there may be multiple samples in a single VCF file) and put each one into its own biofacet database
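The per-sample step for VCF can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not GenomeQuest's implementation: the function name split_vcf_by_sample is hypothetical, input is assumed to be tab-delimited VCF text, and an in-memory dictionary of records stands in for the per-sample biofacet databases.

```python
from collections import defaultdict


def split_vcf_by_sample(lines):
    """Group genotype calls from VCF text lines by sample name.

    Hypothetical helper for illustration; assumes tab-delimited VCF.
    """
    samples = []
    per_sample = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names are the columns that follow FORMAT (column 9).
            samples = fields[9:]
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        keys = fields[8].split(":") if len(fields) > 8 else []
        for name, call in zip(samples, fields[9:]):
            # Trailing sub-fields may be dropped; zip stops at the shorter list.
            genotype = dict(zip(keys, call.split(":")))
            per_sample[name].append(
                {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, **genotype}
            )
    return dict(per_sample)


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tNA00001\tNA00002\tNA00003",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
    "\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.",
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017"
    "\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3",
]
databases = split_vcf_by_sample(example)
```

Here databases has one entry per sample column (NA00001, NA00002, NA00003), each holding that sample's calls at both positions.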
FAQs:
- Frequently Asked Questions:
- What specific Complete Genomics formats are supported?
- GenomeQuest supports the Complete Genomics .tsv format corresponding to variants called in the Assembly (ASM) phase of the project.
- Can multi-sample VCF files be imported?
- Yes, GenomeQuest can import multi-sample VCF files in which each sample has its own column after the FORMAT column.
References:
- Complete Genomics data formats:
- VCF format:
- SVA format:


