Welcome to the GenomeQuest Documentation Wiki

ChipSeq Workflow

From GQ Wiki

Jump to: navigation, search

Use the ChIP-seq work flow to analyze protein interactions with DNA. ChIP-Seq combines chromatin immunoprecipitation (ChIP) with Next Generation Sequencing (NGS) to identify binding sites of DNA-associated proteins. It can be used to precisely map global binding sites for any protein of interest.


Contents

[edit] Example use cases

  • Your goal is to study STAT1 targets in HeLA S3 cells. After enrichment of DNA binding STAT1 (the ChIP step) the resulting DNA library was sequenced using next generation sequencing technology. You now want to find all potential STAT1 binding sites and know which genes are in the surrounding area.
  • Your goal is to study gene targets in TNF1 stimulated cells that are not present in untreated control cells. Both stimulated and control cells have undergone the same experimental ChIP step, and the resulting DNA libraries are sequenced. You now want to find all TNF binding sites present in stimulated but not in the control cells.


[edit] Recommendations

  • We recommend that you understand and tweak the peakfinder parameters for your data set. Once the reads have been aligned to the reference genome you can do multiple peakfinder runs on it and compare the outcome.
  • We recommend that you read Zhang et al. (2008) for a detailed explanation of the MACS peak finding algorithm. Here we provide a brief summary only.
  • We strongly recommend that you always look at the MACS logfile to see how well MACS did on your data set (download logfile from the peakfinder report page). Failures to complete the MACS analysis are often related to the experimental data and/or the analysis parameters used. The GenomeQuest report will warn you, but the real reasons can only be seen in the logfile.
  • We recommend that you use a real control in your experimental setup. The experiment and control samples should have a comparable (high) number of reads. MACS simply linearly scales (normalizes) the number of reads and therefore noise will be scaled in the same way as signal.


[edit] The work flow

The alignment step is typically something that you only want to do once. This step is more or less independent from the specifics of your experiment, and when done right doesn't require any further tweaking. The peak identification step is different. Here it does makes sense to do multiple iterations to tweak the parameters (finding those that will work for your particular experiment). Therefore, these two steps are separated in the interface. Allowing you to align once, and analyze as many times as you would like.


[edit] Align reads to the reference genome

[edit] Run name

Give the database of alignments a meaningful name. This way you, and anyone you might share this with, will be able to find it back at a later date.

[edit] Experimental Read Data set

This is where you choose the reads coming from the genomic DNA sequencing. For your reads to show up here, you need to upload them first. Currently the ChIP-seq work flow supports Illumina and Solid (colorspace) reads up to 128 nucleotides length in fastq format. Because the work flow needs to cleanup the reads prior to alignment, base call quality values are mandatory.

[edit] Reference Genome

Either select from the list of GenomeQuest supported reference databases, or choose one of your own reference databases that have been uploaded.

  • GenomeQuest supported reference databases are usually coupled to information that can be used in the downstream analysis of the work flow. Like for example the description and position of known genes in the vicinity of ChIP-seq peaks, or the effect of an identified SNP on a protein it falls into.
  • If you use one of your own uploaded reference databases, you will only get the basic information on identified peaks (like chromosome, position, statistics, sequence of the peak region, etc).


[edit] Identify the significant peaks

The ChIP-seq peaks are identified using the MACS v1.3.6 software (Zhang et al., 2008). See the description of this method below to get a better understanding of the parameters used here. If you require other peak-finding software, or want to tweak parameters that are currently not exposed, then read this page to see what we can do for you.

[edit] Run name

Give the database of peaks a meaningful name. This way you, and anyone you might share this with, will be able to find it back at a later date.

[edit] Experiment Alignment Database

This is database of reads coming from the experimental sample aligned to the reference genome in the previous work flow step.

[edit] Control Alignment Database

This is database of reads coming from the control sample aligned to the reference genome in the previous work flow step. If your experiment doesn't have a control sample you can indicate that here as well (in that case the experimental sample will be used to estimate the background noise).

Note that MACS can also be applied to differential binding between two conditions by treating one of the experiment samples as the control.

[edit] Mappable genome size

Mappable genome size (or effective genome size) is the size part of the reference genome that can be sequenced and analyzed. Because of repetitive features, the effective genome size will be smaller than the full genome size. The default of 2,700,000,000 nucleotides is recommended for the human genome. For other species you may simply regard 70% to 90% of the full genome size as mappable. This parameter corresponds to the MACS --gsize option.

[edit] Sequence read size

The average size of the reads in nucleotides. This parameter corresponds to the MACS --tsize option.

[edit] Sheared genomic fragment size

The average size of the sheared / sonication fragments in your experiment. This is used to build a model that estimates the real fragment size used to call the peaks (using the experimental alignment data). This parameter corresponds to the MACS --bw option, but the size entered in the web interface will be divided by two before it is passed on to MACS.

[edit] P-value cutoff

P-value cutoff for peak detection. This parameter corresponds to the MACS --pvalue option.

[edit] Background enrichment for peak model

This parameter determines which genomic regions are selected to build a experiment-specific peak model (used to find all the peaks). A region is used for the model if it contains more than mfold reads than the region surrounding it. This parameter corresponds to the MACS --mfold option.

[edit] Background noise lambda set

These three parameters indicate the regions around a potential peak that are used to estimate the local background noise.


[edit] Work Flow Details

This work flow consists of two major steps:

  1. Align the reads coming from the ChIP DNA libraries to the reference genome.
  2. Find the significant peaks using the alignments from the first step.


[edit] Align Reads to the reference genome

[edit] Alignment algorithm and parameters

This is how the alignment algorithm works

  • The algorithm used is the HS3 alignment suite, combined with the GenePAST best fit alignment algorithm (Dudoignon et al., 2002).
  • the algorithm will try to fit the entire (small) read into the (large) reference sequence
  • the algorithm allows mismatches and gaps
  • for colorspace reads (Solid) the comparison is done directly in colorspace
  • reads that align well to multiple positions in the reference genome are considered non-informative and removed from further analysis

In more detail:

  • The reference genome is cut up in overlapping pieces of X nucleotides (with Y nucleotides overlap between the pieces)
  • For each genomic piece all different words of 10 consecutive nucleotides are counted (words are shifted one nucleotide position at a time)
  • For each read in the reads database all different words of 10 consecutive nucleotides are counted
  • If a genomic piece and a read have six words in common the algorithm will attempt to create a real alignment
  • This alignment is made using the GenePAST algorithm that will try to fit the entire (small) read into the (large) reference sequence
  • The GenePAST algorithm allows mismatches and gaps.
  • The number of mismatches and gaps depend on the average length of the reads (2 mismatches/gaps for reads up to 38 nucleotides, 3 for reads up to 64 nucleotides, 4 for reads up to 93 nucleotides and 5 for reads up to 128 nucleotides)
  • Alignments are scored using a simple scoring scheme: matches +1, mismatches -1, gaps -1 (gap opening and extension are counted the same way)
  • A read is flagged as non-informative (and removed from further analysis) if the two best alignments have the same score, or if there are a single best alignment and four similar second best alignments (similar meaning with a score that is the best alignment score minus 1).


[edit] Identify the significant peaks

  • A main issue with peak finding is that reads typically represent the end of the ChIP fragments, making it difficult to pinpoint the precise protein-DNA binding sites. MACS addresses this issue by building a model that uses the distance between reads aligned to the forward and reverse reference strand to estimate where the real binding site should be. More details on how this model is build and used below.
  • Not all genomic regions are equal. Things like sequencing biases, mapping biases, chromatin and copy number variations, and repeat structures create regional differences in ChIP-Seq data. MACS addresses these issues by looking at the background noise in a control sample (if present), and the direct surroundings of a potential peak.

[edit] Building a peak model

  • MACS will try to identify 1000 high-quality peaks in the data. These high-quality peaks are used to build an accurate peak model. This peak model is then used to analyze the entire data set again and call the final peaks.
  • To find the 1000 high-quality peaks MACS will slide a window over the reference genome to identify regions that have more reads than can be expected from a random read distribution.
    • The size of this window is equal to two times the sheared genomic fragment size, or bandwidth parameter.
    • Before an enriched region is considered a high-quality peak, it should have at least mfold more reads aligned to it than can be expected from a random read distribution.
  • Getting the bandwidth and mfold parameters right for your data set is important. If these parameters are set too stringently, MACS is unable to find enough high-quality peaks and will exit with an error. GenomeQuest will issue a warning on the report page, and recommend you look at the log file.
  • The idea is that reads on the forward strand and reads on the reverse strand form separate peaks that surround the real protein-DNA binding site.
  • For each of the 1000 high-quality peaks MACS computes the distance between these strand-specific peaks using their modes. This distance is referred to as d in the article. The actual protein-DNA binding site is predicted to be in the middle, at a distance of d/2 from the strand-specific peaks.

[edit] Data normalization and cleanup

  • If your experiment has a control MACS will linearly scale (normalize) the read counts for the control and the experiment sample.
  • MACS handles biases introduced in the ChIP-DNA amplification and sequencing library preparation steps by removing duplicate reads. Duplicate reads are removed if their count is higher than can be expected from the sequencing depth (binomial distribution with a p-value smaller than 10e-5).

[edit] Identification of final peaks

  • All aligned read positions are shifted by d/2 nucleotides towards the 3' ends of the reads (so depending on whether they are aligned to the forward or reverse reference strand they go left, or right respectively).
  • MACS slides a window of length 2d nucleotides across the genome to find candidate peaks with significant enrichment (compared to the background noise)
    • MACS models the read distribution on the reference genome using a Poisson distribution. In such a model one parameter lambda captures both mean and variance of the read distribution.
    • Significant enrichment is called using this Poisson model with a default p-value of 10e-5. This value can be changed in the launch page.
    • The background noise comes either from the entire genome, or locally around the candidate peak. Whichever method indicates the biggest background noise is the method that is used.
      • When a control data set is present this is used to measure background noise (both entire genome and local background noise). If a control data set is not available then the experiment data set itself is used.
      • Local background noise around the candidate peak is measured over three different (overlapping) regions. By default these regions are 1000, 5000 and 10000 nucleotides long. The region with the highest background noise is used.

[edit] False Discovery Rate (FDR)

  • The FDR is only computed for experiments where there is a control present
  • The FDR is empirically estimated by doing a sample swap (swapping the experiment and control sample) and calling peaks using the same parameters. The FDR is the ratio between the number of peaks in the control and the experiment.
  • Note that MACS can also be applied to differential binding between two conditions by treating one of the experiment samples as the control. In that case the FDR sample swap method does not apply, and the quality of the individual samples needs to be evaluated against a real control (in a separate peakfinder work flow run).

[edit] Annotation of identified peaks

  • If the reference genome allows this, the resulting peaks will be annotated against a database of known genes (GQ gene). When a peak falls within a gene region, the gene name and description will be added to the peak information.

[edit] Workflow Output

[edit] Alignment Report

The report of the aligment step will look like this

File:MWSnap022.gif

[edit] About your read database

This section contains the following information:

  • Original read database, number of reads you have uploaded
  • Reads after cleanup procedure, number of reads left after low quality and repetitive reads have been removed
  • Reads aligned to the reference, number of reads that can be aligned somewhere on the reference
  • Aligned reads selected as informative, number of reads aligned to a unique position on the reference
  • Aligned reads that are not informative, number of reads that aligned well to multiple positions on the reference
  • Reads that could not be aligned to the reference, number of reads that could not be aligned

Clicking on the numbers will bring you to the reads as a GenomeQuest sequence database. Here you can browse, filter, sort, group and export the reads.


File:MWSnap020.gif

[edit] Interactive reports & analysis

Clicking on the "browse the alignments" link will bring you to the alignments as a GenomeQuest alignment database. Here you can browse, filter, sort, group and export the alignments to other formats.


File:MWSnap021.gif

[edit] Peakfinder Report

The report of the peakfinder step will look like this

File:MWSnap023.gif

[edit] Workflow Statistics

The workflow statistics part will show you the name of the run, the name of the experiment and control data set (if any), and a link to the complete MACS logfile.

[edit] Interactive Reports

This part will offer

  • Download of the peaks and associated information in Excel format.
  • Browsing the peaks and associated information as a GenomeQuest sequence database.
  • Download of BED and Wiggle files that can be used to display the peaks in external genome browsers like the one at UCSC.

[edit] Excel Peak Report

The excel spreadsheet is directly generated by the MACS software and contains the following columns:

  1. chr, chromosome
  2. start, the start position of the peak in nucleotides
  3. end, the stop position of the peak in nucleotides
  4. length, the length of the peak in nucleotides
  5. summit, the position of the highest point in nucleotides from the start
  6. tags, the number of reads that are aligned within the peak region
  7. -10*10log(pvalue), this is an indication for the quality of the peak
  8. fold_enrichment, the height of the peak expressed in number of times background noise for that peak
  9. FDR(%), False Discovery Rate or the ratio between the number of peaks in the control and the experiment when the two are swapped. This column is empty when the experiment has no control.


File:MWSnap025.gif

[edit] GenomeQuest Peak Sequence Database

You can browse the peaks as a GenomeQuest sequence database. Every record holds the following information:

  1. Identifier, unique name for the peak
  2. Gene name, if this field is present the peak falls within the gene boundaries. This field is only if the reference organism is supported by GQ Gene.
  3. Description, if this field is present the peak falls within the gene boundaries.. Only if the reference organism is supported by GQ Gene.
  4. Chromosome, the chromosome on which the peak resides
  5. GQ Genomic Begin Pos / GQ Genomic End Pos, the start and end positions for the peak region
  6. Sequence length, the length of the peak region
  7. ChIP-seq peak summit, the position of the highest point in nucleotides from the GQ Genomic Begin Pos
  8. ChIP-seq peak number of reads, number of reads aligned within the peak region
  9. ChIP-seq peak enrichment, the height of the peak expressed in number of times background noise for that peak
  10. ChIP-seq peak quality, -10*10log(pvalue) this is an indication for the quality of the peak
  11. ChIP-seq peak FDR, False Discovery Rate or the ratio between the number of peaks in the control and the experiment when the two are swapped. This column is empty when the experiment has no control.
  12. Database cross-references, when applicable there is a link to the GQ Gene database and a link that will directly display the genomic region within the UCSC genome browser.
  13. Sequence of the peak region itself


File:MWSnap026.gif


Like any sequence database within the GenomeQuest web interface, all the information that is displayed can also be used to create a filter. You could use this filtering to bring up the peaks that interest you most. For example all highly significant peaks that fall within a certain class of genes.

File:MWSnap029.gif


[edit] Visualization in external genome browsers

  • BED file (zipped) with coverage every 10 nucleotides (see here for a description of the format).
  • WIG file (zipped) with separate files for every chromosome and experiment and control samples (see here for more information on the format). Go here to upload these files as a custom track in the UCSC browser (or here for more information on custom tracks).


[edit] References

  • Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137 (Medline, Article full text, Article PDF)
  • Dudoignon L, Glemet E, Heus HC, Raffinot M. High similarity sequence comparison in clustering large sequence databases. Proc IEEE Comput Soc Bioinform Conf. 2002;1:228-36.
Personal tools