Welcome to the GenomeQuest Documentation Wiki
Newbler Assembler
What does it do?
The 454 Newbler Assembler is a de novo assembler that assembles your raw reads into contigs. Newbler was designed for use with 454 data.
A Map of how to Analyze your NGS data through the Newbler Workflow:
What does it produce?
- This workflow takes as input the read sequence database created by uploading your reads and running them through the Read Processing Workflow. The Newbler workflow then assembles your selected reads and returns interactive results databases and flat file reports.
- INPUT:
- A sequence database of your uploaded and processed reads. This is created by running the Read Processing Workflow.
Next Steps:
- Once you have your Newbler Workflow results, a number of downstream steps are available, including:
- Sequence Searching for Biology users: Search using your assembled contigs.
- Sequence Searching for IP users: Search using your assembled contigs.
- Results Download.
Important Parameters that are used in the Workflow:
- NOTE: When accessed from 'Launch Workflows' > 'Assembly' > 'Newbler', if "Large / Complex Genome" is selected then workflow parameters will be set by default:
- NOTE: When accessed from the sequence or results database browsers through the 'Applications' > 'Assembly' > 'Newbler' plugin, the following (below) advanced parameters will be accessible through "Advanced Options":
- Result title:
- Give your run an appropriate name. The default is "Rapid annotation DATE", where DATE is today's date.
- Minimum Overlap Length:
- The minimum length of overlaps used by the assembler (>1). Two sequences need to have an overlap of at least this value to be assembled into contigs.
- Minimum Overlap Identity:
- The minimum percent identity of overlaps used by the assembler (0-100). Two sequences need to have an overlap identity of at least this value to be assembled into contigs.
- Seed Step:
- The number of bases between seed generation locations used in the exact k-mer matching part of the overlap detection (>1).
- Seed Length:
- The number of bases used for each seed in the exact k-mer matching part of the overlap detection (i.e. the k value of the k-mer matching) (6-16).
- Seed Count:
- The number of seeds required in a window before an extension is made (>1).
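As a rough illustration of how the seed and overlap parameters above relate to each other, here is a minimal Python sketch. It is not Newbler's implementation: the function names `generate_seeds` and `overlap_passes` are hypothetical, the identity calculation assumes two already-aligned, equal-length overlap strings, and the values in the usage note are arbitrary rather than GenomeQuest defaults.

```python
def generate_seeds(read, seed_step, seed_length):
    """Return (position, k-mer) pairs taken every `seed_step` bases.

    Illustrates "Seed Step" and "Seed Length": each seed is an exact
    k-mer of `seed_length` bases. Hypothetical helper, not Newbler code.
    """
    if seed_step < 1 or seed_length < 1:
        raise ValueError("seed_step and seed_length must be positive")
    last_start = len(read) - seed_length
    return [
        (pos, read[pos:pos + seed_length])
        for pos in range(0, last_start + 1, seed_step)
    ]


def overlap_passes(aligned_a, aligned_b, min_overlap_length, min_overlap_identity):
    """Check an overlap against "Minimum Overlap Length" and
    "Minimum Overlap Identity" (a percentage between 0 and 100).

    Assumes the two strings are the aligned overlap region, column by
    column, so they must have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned overlap strings must have equal length")
    length = len(aligned_a)
    if length == 0:
        return False
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b)
    identity = 100.0 * matches / length
    return length >= min_overlap_length and identity >= min_overlap_identity
```

For example, `generate_seeds("ACGTACGTAC", seed_step=4, seed_length=6)` yields seeds starting at positions 0 and 4, and an overlap of 10 aligned bases with one mismatch passes a 90% identity threshold only if the minimum overlap length is 10 or less.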
Report Page Description
- Statistics:
- This section displays some global statistics such as the total number of reads, the total number of non-redundant reads, and the total number of assigned reads; the ones with at least one hit.
- Total number of sequences: the number of sequences used in the assembly computation
- Total number of contigs: the number of large contigs identified
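- Contigs size avg / longest / N50: average size of the contigs, size of the longest contig, and N50 size (the contig length such that contigs of that length or longer contain at least half of all assembled bases; a length-weighted median rather than a simple median)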
- Total number of repeat sequences: the number of sequences deemed to be from repeat regions
- Total number of aligned sequences: the number of sequences that aligned with other sequences in the pairwise alignment step.
- Total number of assembled sequences: the number of sequences fully assembled into the contigs
- Total number of unassembled sequences: the number of sequences that did not overlap with other sequences because the sequences were too short, or were repeats, etc; see "Read status" below.
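To make the contig size statistics concrete, the following Python sketch computes the number of contigs, the average and longest contig size, and the N50 from FASTA text such as the file offered under "Download Contigs (FASTA)". It uses the usual definition of N50 (the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of all assembled bases). The function names are hypothetical, and this is only a sketch of the arithmetic, not the code GenomeQuest uses to build the report.

```python
from statistics import mean


def contig_lengths_from_fasta(fasta_text):
    """Return the length of each record in a FASTA-formatted string."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def contig_stats(lengths):
    """Summarize contig sizes: count, average, longest, and N50."""
    return {
        "count": len(lengths),
        "average": mean(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }
```

With contig lengths 2, 2, 2, 3, 3, 4, 8, 8 (32 bases in total), the two longest contigs already account for 16 bases, so the N50 is 8 even though the simple median length is 3.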
- Assembled Sequences:
- This section contains a pie chart showing the percentage of assembled and unassembled sequences, as well as links to several files and databases described below.
- Browse contigs:
- Click this link to browse the database of contigs produced by Newbler and available in GenomeQuest.
- Once clicked, GQ opens the Newbler contigs as a sequence database.
- Browse unassembled sequences:
- GenomeQuest will automatically add the unassembled sequences to the resulting database. This link will lead you to the list of sequences that are not a part of any contig; see below for a description of why sequences can be excluded from contigs.
- Important: The number of singletons can differ from the number of unassembled sequences. Singletons are sequences that are taken into account in the assembly but are not assembled into a contig, whereas unassembled sequences can be sequences that are too short, too long, etc.
- Browse all:
- Click this link to browse the database of contigs produced by Newbler, as well as the sequences that have not been assembled into contigs.
- Launch a sequence search with the contigs:
- Click this link to preload the Sequence Search page with the contigs as query.
- Download Contigs (FASTA):
- Click this link to download the contigs in FASTA format.
- Download Contigs (ACE):
- Click this link to download the contig descriptions in ACE format.
- This format is used by many assemblers. Click here for a complete description of the ACE format.
- For example, you can open ACE files with Consed, or Geneious (download trial version) to visualize the consensus sequence, or the tiling of the sequences into contigs, etc.
- Below is an example of a contig visualized in Geneious, and composed of 3 sequences (c3_r2, c3_r3, and c3_r4):
- Download read IDs list and status:
- Click this link to download the read status in the Statistics section of the report.
- This file, generated by Newbler, contains the status identifiers for all the reads used in the assembly computation, listed one per line, and in tab-delimited format.
- The status string describes the read’s fate in the assembly, and can be one of the following six values:
- Assembled – the read is fully incorporated into the assembly
- PartiallyAssembled – only part of the read was included in the assembly, the rest was deemed to have diverged sufficiently to not be included
- Singleton – the read did not overlap with any other reads in the input
- Repeat – the read was identified by the assembler as likely coming from a repeat region, and so was excluded from the final contigs
- Outlier – the read was identified by the GS De Novo Assembler as problematic, and was excluded from the final contigs; one explanation for such outliers is chimeric sequences, but sequences may also be identified as outliers simply as an assembler artifact.
- TooShort – the trimmed read was too short to be used in the computation (i.e., shorter than 50 bases, unless 454 Paired End Reads are included in the dataset, in which case, shorter than 15 bases).
- Example:
Accno       Read Status
AB000098    Singleton
AB000107    Singleton
AB000121    Singleton
AB000492    Singleton
AB000500    PartiallyAssembled
AB000502    PartiallyAssembled
AB000503    PartiallyAssembled
AB000507    Singleton
AB000545    PartiallyAssembled
AB000546    PartiallyAssembled
AB000547    PartiallyAssembled
AB000548    Assembled
AB000549    Assembled
AB000550    PartiallyAssembled
AB000551    PartiallyAssembled
AB000552    PartiallyAssembled
AB000629    Singleton
AB000677    PartiallyAssembled
AB000710    PartiallyAssembled
AB001351    TooShort
AB001352    TooShort
AB001353    TooShort
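Once downloaded, the read status file can be summarized with a few lines of Python. The sketch below assumes a tab-delimited file whose first two columns are the accession and the status, with a header line starting with "Accno" as in the example; any further columns are ignored. The function names are hypothetical, and the percentages are simple per-status shares, not necessarily the figures shown in the report's pie chart.

```python
import csv
from collections import Counter


def count_read_statuses(lines):
    """Count reads per status from tab-delimited read status lines.

    `lines` is any iterable of strings, e.g. an open file object.
    Only the first two columns (accession, status) are used.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        accno, status = row[0].strip(), row[1].strip()
        if accno == "Accno":
            continue  # skip the header line
        counts[status] += 1
    return counts


def status_percentages(counts):
    """Convert status counts into percentages of all listed reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: 100.0 * n / total for status, n in counts.items()}
```

A typical use would be `with open(path, newline="") as handle: counts = count_read_statuses(handle)`, where `path` points to the downloaded file; comparing `counts["Singleton"]` with the number of unassembled sequences in the report shows how the two figures can differ.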
A High-level and Algorithmic Description of this workflow:
- The Newbler Assembler Workflow works in the following way:
- With a sequence database generated by the Read Processing Workflow, assemble raw reads into contigs via Newbler.
Miscellaneous Information and GenomeQuest Recommendations:
- A Video Describing Newbler:
FAQs:
- What is Newbler?
- Newbler is a de novo sequence assembler developed by 454/Roche for use with 454 sequencing data.
- Why use Newbler?
- Newbler assembles your reads into contiguous sequences (contigs). This is useful when no reference or 'build' is available to map to, or when the sample is suspected to contain Copy Number Variations (CNVs) or other structural variations that would reduce the accuracy of normal read-to-reference mapping.
- What to do with Newbler results?
- GenomeQuest currently allows the searching of assembled contigs and/or unassembled reads through the Sequence Search Workflows.
- Also, Copy Number Variation and Structural Variation detection using de novo assembly results is currently in development, and is expected in the near future.
References:
- Newbler Params in Detail through Bio-Perl docs:
- Understanding Sequence Assembly:
- Sequence Assembly
- Quinn et al. Assessing the feasibility of GS FLX Pyrosequencing for sequencing the Atlantic salmon genome. BMC Genomics. 2008 Aug 28;9:404.
- Understanding read length, and its correlation with de novo assembler accuracy:
- Understanding Newbler:




