Welcome to the GenomeQuest Documentation Wiki

Newbler Assembler


What does it do?

The 454 Newbler Assembler is a de novo assembler that assembles your raw reads into contigs. Newbler was designed for use with 454 data.


A Map of how to Analyze your NGS data through the Newbler Workflow:


Newb.jpg



What does it produce?

  • This workflow takes as input a read sequence database created by uploading your reads and running them through the Read Processing Workflow. The Newbler workflow then assembles your selected reads and returns interactive results databases and flat-file reports.


  • INPUT:




Next Steps:



Important Parameters that are used in the Workflow:


  • NOTE: When accessed from 'Launch Workflows' > 'Assembly' > 'Newbler', if "Large / Complex Genome" is selected then workflow parameters will be set by default:


Newb launch.jpg

  • NOTE: When accessed from the sequence or results database browsers through the 'Applications' > 'Assembly' > 'Newbler' plugin, the advanced parameters listed below are accessible through "Advanced Options":


Newbler
Newbler parameters

  • Result title:
    • Give your run a descriptive name. The default will be "Rapid annotation DATE", where DATE is today's date.
  • Minimum Overlap Length:
    • The minimum length of overlaps used by the assembler (>1). Two sequences need to have an overlap of at least this value to be assembled into contigs.
  • Minimum Overlap Identity:
    • The minimum percent identity of overlaps used by the assembler (0-100). Two sequences need to have an overlap identity of at least this value to be assembled into contigs.
  • Seed Step:
    • The number of bases between seed generation locations used in the exact k-mer matching part of the overlap detection (>1).
  • Seed Length:
    • The number of bases used for each seed in the exact k-mer matching part of the overlap detection (i.e. the k value of the k-mer matching) (6-16).
  • Seed Count:
    • The number of seeds required in a window before an extension is made (>1).
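
To make the seed and overlap parameters above more concrete, here is a minimal sketch in Python. It is not Newbler's implementation: the function names are hypothetical, the seed count is applied to a whole read rather than to a window, and the overlap check assumes two already-aligned, gap-free strings of equal length.

```python
def extract_seeds(read, seed_length, seed_step):
    """Return (offset, k-mer) pairs taken every `seed_step` bases along `read`."""
    last_start = len(read) - seed_length
    return [(pos, read[pos:pos + seed_length])
            for pos in range(0, last_start + 1, seed_step)]


def shared_seed_count(read_a, read_b, seed_length, seed_step):
    """Count seeds of read_a that occur exactly (k-mer match) anywhere in read_b."""
    kmers_b = {read_b[i:i + seed_length]
               for i in range(len(read_b) - seed_length + 1)}
    return sum(1 for _, seed in extract_seeds(read_a, seed_length, seed_step)
               if seed in kmers_b)


def percent_identity(aligned_a, aligned_b):
    """Percent of identical positions between two equal-length aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y)
    return 100.0 * matches / len(aligned_a)


def overlap_passes(aligned_a, aligned_b, min_overlap_length, min_overlap_identity):
    """Apply the Minimum Overlap Length and Minimum Overlap Identity thresholds."""
    return (len(aligned_a) >= min_overlap_length
            and percent_identity(aligned_a, aligned_b) >= min_overlap_identity)
```

In this simplified picture, a pair of reads would only be considered for extension when `shared_seed_count(...)` reaches the Seed Count, and the resulting overlap would only be used when `overlap_passes(...)` is true.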



Report Page Description

  • Statistics:


Newbler Statistics Page

  • This section displays global statistics for the assembly, such as the total number of sequences, the number of contigs, and the number of assembled and unassembled sequences:
    • Total number of sequences: the number of sequences used in the assembly computation
    • Total number of contigs: the number of large contigs identified
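    • Contigs size avg / longest / N50: average contig size, longest contig size, and N50 size (the contig length at which contigs of that length or longer contain at least half of all assembled bases; this is not the median contig length)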
    • Total number of repeat sequences: the number of sequences deemed to be from repeat regions
    • Total number of aligned sequences: the number of sequences that aligned with other sequences in the pairwise alignment step.
    • Total number of assembled sequences: the number of sequences fully assembled into the contigs
    • Total number of unassembled sequences: the number of sequences that did not overlap with other sequences because the sequences were too short, or were repeats, etc; see "Read status" below.
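
The contig size statistics above can be reproduced from a list of contig lengths. The following is a minimal sketch, assuming the usual definition of N50; the function names are hypothetical and the lengths in the usage note are made-up illustration values, not output from a real run.

```python
import statistics


def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def contig_size_summary(lengths):
    """Average, longest, and N50 contig size, as listed on the statistics page."""
    return {
        "avg": sum(lengths) / len(lengths),
        "longest": max(lengths),
        "N50": n50(lengths),
        # The median is included only to show that it generally differs from N50.
        "median": statistics.median(lengths),
    }
```

For example, for the illustrative lengths `[100, 80, 50, 30, 20, 20]` the two largest contigs already cover more than half of the 300 total bases, so the N50 is 80 while the median length is 40.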


  • Assembled Sequences:
    • This section contains a pie chart showing the percentage of assembled and unassembled sequences, as well as links to several files and databases described below.


  • Browse contigs:
    • Click this link to browse the database of contigs produced by Newbler and available in GenomeQuest.
    • Once clicked, GQ will open the contigs, as shown below:



Newbler contigs as a sequence database

  • Browse unassembled sequences:
    • GenomeQuest will automatically add the unassembled sequences to the resulting database. This link will lead you to the list of sequences that are not a part of any contig; see below for a description of why sequences can be excluded from contigs.
    • Important: The number of singletons can differ from the number of unassembled sequences. Singletons are sequences that are taken into account in the assembly but are not assembled into a contig, whereas unassembled sequences can be sequences that were too short, too long, etc.


  • Browse all:
    • Click this link to browse the database of contigs produced by Newbler, as well as the sequences that have not been assembled into contigs.


  • Launch a sequence search with the contigs:
    • Click this link to preload the Sequence Search page with the contigs as query.


  • Download Contigs (FASTA):
    • Click this link to download the contigs in FASTA format.
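
Once downloaded, the contigs file can be inspected with a few lines of Python. This is a minimal sketch that assumes a standard FASTA layout (a `>` header line followed by one or more sequence lines); the function name and the contig names in the usage note are hypothetical.

```python
def read_fasta_lengths(lines):
    """Map each FASTA record name (first word of the header) to its sequence length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence lines may be wrapped, so lengths are accumulated per record.
            lengths[name] += len(line)
    return lengths
```

The function accepts any iterable of lines, so it can be called on an open file, for example `with open("contigs.fasta") as handle: lengths = read_fasta_lengths(handle)`, where `contigs.fasta` stands for whatever name the downloaded file was saved under.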



  • For example, you can open ACE files with Consed, or Geneious (download trial version) to visualize the consensus sequence, or the tiling of the sequences into contigs, etc.


  • Below is an example of a contig visualized in Geneious, and composed of 3 sequences (c3_r2, c3_r3, and c3_r4):


Contig visualization in Geneious

  • Download read IDs list and status:
    • Click this link to download the read status in the Statistics section of the report.
    • This file, generated by Newbler, contains the status identifiers for all the reads used in the assembly computation, listed one per line, and in tab-delimited format.
    • The status string describes the read’s fate in the assembly, and can be one of the following six values:
      • Assembled – the read is fully incorporated into the assembly
      • PartiallyAssembled – only part of the read was included in the assembly, the rest was deemed to have diverged sufficiently to not be included
      • Singleton – the read did not overlap with any other reads in the input
      • Repeat – the read was identified by the assembler as likely coming from a repeat region, and so was excluded from the final contigs
      • Outlier – the read was identified by the GS De Novo Assembler as problematic, and was excluded from the final contigs; one explanation for these outliers is chimeric sequences, but sequences may also be identified as outliers simply as an assembler artifact.
      • TooShort – the trimmed read was too short to be used in the computation (i.e., shorter than 50 bases, unless 454 Paired End Reads are included in the dataset, in which case, shorter than 15 bases).


  • Example:
Accno   Read Status
AB000098        Singleton
AB000107        Singleton
AB000121        Singleton
AB000492        Singleton
AB000500        PartiallyAssembled
AB000502        PartiallyAssembled
AB000503        PartiallyAssembled
AB000507        Singleton
AB000545        PartiallyAssembled
AB000546        PartiallyAssembled
AB000547        PartiallyAssembled
AB000548        Assembled
AB000549        Assembled
AB000550        PartiallyAssembled
AB000551        PartiallyAssembled
AB000552        PartiallyAssembled
AB000629        Singleton
AB000677        PartiallyAssembled
AB000710        PartiallyAssembled
AB001351        TooShort
AB001352        TooShort
AB001353        TooShort
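
A file in the format shown above can be summarized by counting how many reads fall into each status. The following is a minimal sketch, assuming the accession number is the first field and the status is the second field on each line; the function name is hypothetical.

```python
from collections import Counter


def count_read_statuses(lines):
    """Count reads per status from lines of the form '<Accno><TAB><Read Status>'."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and the 'Accno / Read Status' header line.
        if len(fields) < 2 or fields[0] == "Accno":
            continue
        counts[fields[1]] += 1
    return counts
```

Passing an open file handle, or the lines of the example above, returns a `Counter` keyed by status (for instance `Singleton` or `TooShort`), which makes it easy to see why reads were left out of the contigs.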



A High-level and Algorithmic Description of this workflow:

  • The Newbler Assembler Workflow works in the following way:
    • With a sequence database generated by the Read Processing Workflow, assemble raw reads into contigs via Newbler.



Miscellaneous Information and GenomeQuest Recommendations:


  • A Video Describing Newbler:





FAQs:

  • What is Newbler?
    • Newbler is a de novo sequence assembler developed by 454/Roche for use with 454 sequencing data.


  • Why use Newbler?
    • Newbler assembles reads into contiguous sequences (contigs). This is beneficial when no reference or 'build' is available to map to, or when the sample is suspected to contain Copy Number Variations (CNVs) or other structural variations that would reduce the accuracy of ordinary read-to-reference mapping.


  • What to do with Newbler results?
    • GenomeQuest currently allows the searching of assembled contigs and/or unassembled reads through the Sequence Search Workflows.
    • Also, Copy Number Variation and Structural Variation detection using de novo assembly results is currently in development, and is expected in the near future.



References:


  • Understanding Sequence Assembly:
    • Sequence Assembly
    • Quinn et al. Assessing the feasibility of GS FLX Pyrosequencing for sequencing the Atlantic salmon genome. BMC Genomics. 2008 Aug 28;9:404.



