Welcome to the GenomeQuest Documentation Wiki
ChIP-Seq Workflow
What does it do?
- The ChIP-Seq workflow analyzes reads from a ChIP-Seq library by either aligning the reads to their reference genome or, if the reads are already aligned, calling peaks to detect the binding sites of the immunoprecipitated protein.
A Map of how to Analyze your NGS data through the ChIP-Seq Workflow:
Example use cases:
- Your goal is to study STAT1 targets in HeLa S3 cells. After enrichment of DNA-bound STAT1 (the ChIP step), the resulting DNA library was sequenced using next generation sequencing technology. You now want to find all potential STAT1 binding sites and know which genes are in the surrounding area.
- Your goal is to study gene targets in TNF1-stimulated cells that are not present in untreated control cells. Both stimulated and control cells have undergone the same experimental ChIP step, and the resulting DNA libraries have been sequenced. You now want to find all TNF binding sites present in the stimulated cells but not in the control cells.
- Use the ChIP-Seq workflow to analyze protein interactions with DNA. ChIP-Seq combines chromatin immunoprecipitation (ChIP) with Next Generation Sequencing (NGS) to identify binding sites of DNA-associated proteins. It can be used to precisely map global binding sites for any protein of interest.
What does it produce?
- This workflow calls peaks using mapped reads coming from a ChIP-Seq library. Once peaks are called, they can be browsed and filtered using the sequence database browser.
Next Steps:
- The next logical step after running this workflow is to analyze the results through the sequence database browser in order to find peaks of interest, or of high or low significance. Results can then be exported into Excel or Word format.
Important Parameters that are used in the Workflow:
- The launch page lists the following parameters:
- If chosen- "Align the reads coming from the ChIP DNA libraries to the reference genome." *This requires that the ChIP-Seq workflow be re-accessed in order to call peaks using the mapped reads.
- Run Name: A description of your run
- Dataset: the dataset you want mapped
- Choose either a GQ reference database or one of your own reference databases: a public reference database maintained by GenomeQuest, or a reference database uploaded by the user may be chosen.
- If chosen- "Find the significant peaks using the alignments from the first step."
- Run Name: A description of your run
- Alignment data: Previously mapped reads
- Use a control data set to estimate noise/No control, use experimental data set to estimate noise: Did this experiment contain a control? If yes, the control data set should be chosen. Otherwise, the treated sample being analyzed will be used to estimate noise.
- Peakfind Parameters:
- Mappable genome size is 'x' nucleotides: the mappable genome size
- Sequence read size is 'x' nucleotides: the length of the sequence reads
- Sheared genomic fragment size is 'x' nucleotides: the estimated sonication size
- P-value cutoff: the p-value threshold used to filter peaks
- 'x' times background enrichment for peak model (mfold): enrichment must be 'x' or 'mfold' times the background to detect a binding site
- lambdaset: the three levels of nearby regions in basepairs to calculate dynamic lambda (background)
Understanding key Parameters:
The ChIP-seq peaks are identified using the MACS v1.3.6 software (Zhang et al., 2008). See the description of this method below to get a better understanding of the parameters used here.
Run name
Give the database of peaks a meaningful name. This way you, and anyone you share it with, will be able to find it again at a later date.
Experiment Alignment Database
This is the database of reads coming from the experimental sample, aligned to the reference genome in the previous workflow step.
Control Alignment Database
This is the database of reads coming from the control sample, aligned to the reference genome in the previous workflow step. If your experiment doesn't have a control sample you can indicate that here as well (in that case the experimental sample will be used to estimate the background noise).
Note that MACS can also be applied to differential binding between two conditions by treating one of the experiment samples as the control.
Mappable genome size
Mappable genome size (or effective genome size) is the size of the part of the reference genome that can be sequenced and analyzed. Because of repetitive features, the effective genome size will be smaller than the full genome size. The default of 2,700,000,000 nucleotides is recommended for the human genome. For other species you may simply regard 70% to 90% of the full genome size as mappable. This parameter corresponds to the MACS --gsize option.
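The 70% to 90% rule of thumb above can be sketched as a small helper. This is only an illustration: the function name and the example genome size are hypothetical, and the result is a rough range rather than a measured value.

```python
def mappable_size_range(full_genome_size, low_pct=70, high_pct=90):
    """Return a (low, high) estimate of the mappable genome size in nucleotides.

    Uses the 70%-90% rule of thumb from the text; integer arithmetic keeps
    the result exact. The function name is hypothetical.
    """
    if full_genome_size <= 0:
        raise ValueError("full_genome_size must be positive")
    low = full_genome_size * low_pct // 100
    high = full_genome_size * high_pct // 100
    return low, high


# Hypothetical organism with a 1,000,000,000 nt genome.
low, high = mappable_size_range(1_000_000_000)
print(f"Try a mappable genome size between {low:,} and {high:,} nucleotides")
```

Any value in the returned range could be entered as the mappable genome size on the launch page.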
Sequence read size
The average size of the reads in nucleotides. This parameter corresponds to the MACS --tsize option.
Sheared genomic fragment size
The average size of the sheared (sonicated) fragments in your experiment. This is used to build a model that estimates the real fragment size used to call the peaks (using the experimental alignment data). This parameter corresponds to the MACS --bw option, but the size entered in the web interface will be divided by two before it is passed on to MACS.
P-value cutoff
P-value cutoff for peak detection. This parameter corresponds to the MACS --pvalue option.
Background enrichment for peak model
This parameter determines which genomic regions are selected to build an experiment-specific peak model (used to find all the peaks). A region is used for the model if it contains at least mfold times more reads than the region surrounding it. This parameter corresponds to the MACS --mfold option.
Background noise lambda set
These three parameters indicate the regions around a potential peak that are used to estimate the local background noise.
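To make the mapping from launch-page parameters to MACS options concrete, here is a minimal sketch that assembles an option list. It is not GenomeQuest's actual implementation. The options --gsize, --tsize, --bw, --pvalue and --mfold are the ones named in this section; the -t, -c, --name and --lambdaset spellings, the file names, and all default values are assumptions that should be checked against the MACS version in use.

```python
def build_macs_args(treatment, control=None, run_name="chipseq_run",
                    gsize=2_700_000_000, tsize=25, fragment_size=600,
                    pvalue=1e-5, mfold=32, lambdaset=(1000, 5000, 10000)):
    """Assemble a MACS-style argument list from launch-page style values.

    Hypothetical helper. `fragment_size` is the value as entered in the web
    interface; it is halved before being passed as --bw, as described above.
    """
    if len(lambdaset) != 3:
        raise ValueError("lambdaset needs exactly three region sizes")
    args = ["macs", "-t", treatment]        # -t: assumed treatment flag
    if control is not None:
        args += ["-c", control]             # -c: assumed control flag
    args += [
        f"--name={run_name}",               # assumed flag spelling
        f"--gsize={gsize}",
        f"--tsize={tsize}",
        f"--bw={fragment_size // 2}",       # web value divided by two
        f"--pvalue={pvalue}",
        f"--mfold={mfold}",
        "--lambdaset=" + ",".join(str(n) for n in lambdaset),  # assumed flag
    ]
    return args


print(" ".join(build_macs_args("experiment.bed", control="control.bed")))
```

Leaving out `control` corresponds to the "No control" choice, in which case the experimental sample itself is used to estimate noise.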
Report Page Description:
- The report (above) of a finished workflow run shows the following things:
- The report page displays a description of run information, and the MACS log output from peak-calling.
- Additionally, peaks may be viewed in Excel Format, browsed through the GenomeQuest results database browser, or even downloaded in BED or WIG format.
- The processed results database in GenomeQuest's biofacet form will contain one peak call per record.
Browsing your Peaks in the Sequence Database Browser:
You can browse the peaks as a GenomeQuest sequence database. Every record holds the following information:
- Identifier- the unique name for the peak.
- Gene name- if this field is present the peak falls within the gene boundaries. This field is only present if the reference organism is supported by GQ Gene.
- Description- if this field is present the peak falls within the gene boundaries. This field is only present if the reference organism is supported by GQ Gene.
- Chromosome- the chromosome on which the peak resides.
- GQ Genomic Begin Pos / GQ Genomic End Pos- the start and end positions of the peak region.
- Sequence length- the length of the peak region.
- ChIP-seq peak summit- the position of the highest point in nucleotides from the GQ Genomic Begin Pos.
- ChIP-seq peak number of reads- number of reads aligned within the peak region.
- ChIP-seq peak enrichment- the height of the peak expressed in number of times background noise for that peak.
- ChIP-seq peak quality- -10*log10(pvalue); this is an indication of the quality of the peak (a higher value means a smaller p-value).
- ChIP-seq peak FDR- False Discovery Rate or the ratio between the number of peaks in the control and the experiment when the two are swapped. This column is empty when the experiment has no control.
- Database cross-references- when applicable there is a link to the GQ Gene database and a link that will directly display the genomic region within the UCSC genome browser.
- Sequence of the peak region itself.
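The peak quality score is a log-transformed p-value, so the two can be converted into each other. A minimal sketch, assuming the score is -10 times the base-10 logarithm of the p-value (the function names are hypothetical):

```python
import math


def peak_quality(pvalue):
    """Convert a p-value into the -10*log10(pvalue) quality score."""
    if not 0.0 < pvalue <= 1.0:
        raise ValueError("pvalue must be in (0, 1]")
    return -10.0 * math.log10(pvalue)


def pvalue_from_quality(score):
    """Convert a quality score back into a p-value."""
    return 10.0 ** (-score / 10.0)


# A p-value cutoff of 1e-5 corresponds to a quality score of about 50.
print(peak_quality(1e-5))
print(pvalue_from_quality(100.0))
```

This is useful when filtering in the browser: a minimum quality of 100, for example, keeps only peaks with a p-value of about 1e-10 or smaller.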
Interactive Reports:
Interactive reports include the following:
- Download of the peaks and associated information in Excel format.
- Browsing the peaks and associated information as a GenomeQuest sequence database.
- Download of BED and Wiggle files that can be used to display the peaks in external genome browsers like the one at UCSC.
Excel Peak Report:
The Excel spreadsheet is generated directly by the MACS software and contains the following columns:
- chr- chromosome
- start- the start position of the peak in nucleotides
- end- the stop position of the peak in nucleotides
- length- the length of the peak in nucleotides
- summit- the position of the highest point in nucleotides from the start
- tags- the number of reads that are aligned within the peak region
- -10*log10(pvalue)- this is an indication of the quality of the peak
- fold_enrichment- the height of the peak expressed in number of times background noise for that peak
- FDR (%)- False Discovery Rate or the ratio between the number of peaks in the control and the experiment when the two are swapped. This column is empty when the experiment has no control.
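The columns above lend themselves to simple scripted filtering once the report has been downloaded. The sketch below assumes the report is a tab-delimited text file whose comment lines start with "#" and whose quality column is labelled -10*log10(pvalue); both are assumptions to verify against your own download. The sample rows are made up for illustration and are not real results.

```python
import csv
import io

# Hypothetical sample in the assumed tab-delimited layout (not real data).
SAMPLE = (
    "# hypothetical example rows\n"
    "chr\tstart\tend\tlength\tsummit\ttags\t-10*log10(pvalue)\tfold_enrichment\tFDR(%)\n"
    "chr1\t1000\t1500\t501\t250\t40\t120.5\t15.2\t0.5\n"
    "chr2\t2000\t2300\t301\t100\t12\t55.0\t4.1\t8.0\n"
)


def read_peaks(handle):
    """Read peak rows from a tab-delimited report, skipping comments and blanks."""
    rows = (line for line in handle
            if line.strip() and not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))


def filter_peaks(peaks, min_score=0.0, min_fold=0.0,
                 score_col="-10*log10(pvalue)"):
    """Keep peaks whose quality score and fold enrichment reach the minimums."""
    return [p for p in peaks
            if float(p[score_col]) >= min_score
            and float(p["fold_enrichment"]) >= min_fold]


peaks = read_peaks(io.StringIO(SAMPLE))
strong = filter_peaks(peaks, min_score=100, min_fold=10)
for p in strong:
    print(p["chr"], p["start"], p["end"], p["fold_enrichment"])
```

To use it on a real download, replace `io.StringIO(SAMPLE)` with an opened file handle and adjust `score_col` if the column header differs.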
A High-level and Algorithmic Description of this workflow:
- The ChIP-Seq algorithm works in the following way:
- If the reads are not yet mapped, map them to the reference genome for downstream peak-calling.
- If the reads are already mapped, call peaks on the mapped reads using the MACS algorithm.
- Convert results to GenomeQuest's biofacet form.
Miscellaneous Information and GenomeQuest Recommendations:
- We recommend that you understand and tweak the peakfinder parameters for your data set. Once the reads have been aligned to the reference genome you can do multiple peak finding runs on it and compare the outcome.
- We recommend that you read Zhang et al. (2008) for a detailed explanation of the MACS peak finding algorithm. Here we provide a brief summary only.
- We strongly recommend that you always look at the MACS logfile to see how well MACS did on your data set (download logfile from the peakfinder report page). Failures to complete the MACS analysis are often related to the experimental data and/or the analysis parameters used. The GenomeQuest report will warn you, but the real reasons can only be seen in the logfile.
- We recommend that you use a real control in your experimental setup. The experiment and control samples should have a comparable (high) number of reads. MACS simply linearly scales (normalizes) the number of reads and therefore noise will be scaled in the same way as signal.
Identify the significant peaks:
- An important issue with peak finding is that reads typically represent the ends of the ChIP fragments, making it difficult to pinpoint the precise protein-DNA binding sites. MACS addresses this issue by building a model that uses the distance between reads aligned to the forward and reverse reference strand to estimate where the real binding site should be. More details on how this model is built and used are given below.
- Not all genomic regions are equal. Things like sequencing biases, mapping biases, chromatin and copy number variations, and repeat structures create regional differences in ChIP-Seq data. MACS addresses these issues by looking at the background noise in a control sample (if present), and the direct surroundings of a potential peak.
Building a peak model:
- MACS will try to identify 1000 high-quality peaks in the data. These high-quality peaks are used to build an accurate peak model. This peak model is then used to analyze the entire data set again and call the final peaks.
- To find the 1000 high-quality peaks MACS will slide a window over the reference genome to identify regions that have more reads than can be expected from a random read distribution.
- The size of this window is equal to two times the sheared genomic fragment size, or bandwidth parameter.
- Before an enriched region is considered a high-quality peak, it should have at least mfold more reads aligned to it than can be expected from a random read distribution.
- Getting the bandwidth and mfold parameters right for your data set is important. If these parameters are set too stringently, MACS is unable to find enough high-quality peaks and will exit with an error. GenomeQuest will issue a warning on the report page, and recommend you look at the log file.
- The idea is that reads on the forward strand and reads on the reverse strand form separate peaks that surround the real protein-DNA binding site.
- For each of the 1000 high-quality peaks MACS computes the distance between these strand-specific peaks using their modes. This distance is referred to as d in the article. The actual protein-DNA binding site is predicted to be in the middle, at a distance of d/2 from the strand-specific peaks.
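The idea of measuring d between the strand-specific modes can be illustrated with a simplified sketch. This is not the MACS implementation: it takes the most populated position bin on each strand as the mode, and combines the per-peak distances with a median, both of which are simplifying assumptions. All names and the example positions are hypothetical.

```python
from collections import Counter
from statistics import median


def mode_position(positions, bin_size=10):
    """Return the centre of the most populated bin (leftmost bin on ties)."""
    counts = Counter(p // bin_size for p in positions)
    best_bin = max(sorted(counts), key=counts.get)
    return best_bin * bin_size + bin_size // 2


def estimate_d(peaks, bin_size=10):
    """Estimate d from (forward_positions, reverse_positions) pairs.

    For each high-quality peak, d is the distance from the forward-strand
    mode to the reverse-strand mode; the median over peaks is returned.
    """
    distances = []
    for forward, reverse in peaks:
        dist = mode_position(reverse, bin_size) - mode_position(forward, bin_size)
        if dist > 0:  # ignore peaks where the strands are not ordered as expected
            distances.append(dist)
    if not distances:
        raise ValueError("no usable peaks to estimate d")
    return median(distances)


# One hypothetical peak: forward reads pile up near 105, reverse reads near 205.
forward = [100, 102, 105, 108, 150]
reverse = [200, 203, 205, 207, 260]
d = estimate_d([(forward, reverse)])
binding_site = mode_position(forward) + d // 2
print(d, binding_site)
```

The predicted binding site sits d/2 downstream of the forward-strand mode, which is the midpoint between the two strand-specific peaks.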
Data normalization and cleanup:
- If your experiment has a control MACS will linearly scale (normalize) the read counts for the control and the experiment sample.
- MACS handles biases introduced in the ChIP-DNA amplification and sequencing library preparation steps by removing duplicate reads. Duplicate reads are removed if their count is higher than can be expected from the sequencing depth (binomial distribution with a p-value smaller than 1e-5).
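Both steps can be sketched in a few lines. The sketch assumes that the control is scaled to the sequencing depth of the experiment, and that each read lands on any given genome position with probability 1/genome size; the threshold of 1e-5, the function names and the example numbers are likewise assumptions for illustration, not the exact MACS code.

```python
import math


def control_scale_factor(n_experiment_reads, n_control_reads):
    """Linear factor that scales control read counts to the experiment's depth."""
    if n_control_reads <= 0:
        raise ValueError("control must contain reads")
    return n_experiment_reads / n_control_reads


def max_reads_per_position(n_reads, genome_size, pvalue=1e-5):
    """Largest number of identical reads at one position that is still plausible.

    X ~ Binomial(n_reads, 1/genome_size). Returns the smallest k for which
    P(X > k) < pvalue; reads beyond k at one position count as excess duplicates.
    """
    p = 1.0 / genome_size
    pmf = math.exp(n_reads * math.log1p(-p))  # P(X = 0)
    cdf = pmf
    k = 0
    while 1.0 - cdf >= pvalue and k < n_reads:
        k += 1
        # Binomial recurrence: P(X = k) from P(X = k - 1).
        pmf *= (n_reads - k + 1) / k * p / (1.0 - p)
        cdf += pmf
    return max(k, 1)  # always keep at least one read per position


# Hypothetical run: 10 million reads on a 2.7 Gb mappable genome.
print(control_scale_factor(10_000_000, 5_000_000))
print(max_reads_per_position(10_000_000, 2_700_000_000))
```

With these example numbers only a single read per position is kept; deeper sequencing raises the allowed count.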
Identification of final peaks:
- All aligned read positions are shifted by d/2 nucleotides towards the 3' ends of the reads (so reads aligned to the forward reference strand move right, and reads aligned to the reverse reference strand move left).
- MACS slides a window of length 2d nucleotides across the genome to find candidate peaks with significant enrichment (compared to the background noise).
- MACS models the read distribution on the reference genome using a Poisson distribution. In such a model one parameter lambda captures both mean and variance of the read distribution.
- Significant enrichment is called using this Poisson model with a default p-value of 1e-5. This value can be changed on the launch page.
- The background noise comes either from the entire genome, or locally around the candidate peak. Whichever method indicates the biggest background noise is the method that is used.
- When a control data set is present this is used to measure background noise (both entire genome and local background noise). If a control data set is not available then the experiment data set itself is used.
- Local background noise around the candidate peak is measured over three different (overlapping) regions. By default these regions are 1000, 5000 and 10000 nucleotides long. The region with the highest background noise is used.
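The peak-calling steps above can be sketched as follows: shift each read towards its 3' end, take the largest of the genome-wide and local background estimates as lambda, and test the window count against a Poisson distribution. This is a simplified illustration with hypothetical names and made-up counts. It assumes that each background estimate is rescaled to the window length, and that moving towards the 3' end means higher coordinates for forward-strand reads and lower coordinates for reverse-strand reads.

```python
import math


def shift_read(position, strand, d):
    """Shift a read position by d/2 towards the read's 3' end."""
    half = d // 2
    return position + half if strand == "+" else position - half


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))  # P(X = k)
    total = 0.0
    i = k
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
    return min(total, 1.0)


def window_lambda(window, genome_size, total_reads, local_counts):
    """Largest expected read count per window over all background estimates.

    `local_counts` maps a region length (e.g. 1000, 5000, 10000) to the
    number of background reads observed in that region around the candidate.
    """
    candidates = [total_reads * window / genome_size]
    candidates += [count * window / length
                   for length, count in local_counts.items()]
    return max(candidates)


def is_enriched(reads_in_window, lam, pvalue=1e-5):
    """True if the window holds significantly more reads than lambda predicts."""
    return poisson_upper_tail(reads_in_window, lam) < pvalue


lam = window_lambda(200, 2_700_000_000, 10_000_000,
                    {1000: 10, 5000: 20, 10000: 30})
print(lam, is_enriched(20, lam), is_enriched(3, lam))
```

Taking the maximum makes the test conservative: a candidate in a noisy region must clear the local background, not just the genome-wide average.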
False Discovery Rate (FDR):
- The FDR is only computed for experiments where there is a control present.
- The FDR is empirically estimated by doing a sample swap (swapping the experiment and control sample) and calling peaks using the same parameters. The FDR is the ratio between the number of peaks in the control and the experiment.
- Note that MACS can also be applied to differential binding between two conditions by treating one of the experiment samples as the control. In that case the FDR sample swap method does not apply, and the quality of the individual samples needs to be evaluated against a real control (in a separate peakfinder workflow run).
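A minimal sketch of the sample-swap FDR follows. It assumes that the FDR of a peak is computed from the peaks that score at least as well as that peak: the number of such peaks found in the swapped (control) run divided by the number found in the experiment run, expressed as a percentage. The function name and the scores are hypothetical.

```python
def per_peak_fdr(experiment_scores, control_scores):
    """Empirical FDR (%) for each experiment peak.

    `experiment_scores` are quality scores of peaks called on the experiment;
    `control_scores` are scores of peaks called after swapping the samples.
    """
    fdrs = []
    for score in experiment_scores:
        n_experiment = sum(1 for s in experiment_scores if s >= score)
        n_control = sum(1 for s in control_scores if s >= score)
        fdrs.append(min(100.0, 100.0 * n_control / n_experiment))
    return fdrs


# Hypothetical scores: four experiment peaks, two peaks from the swapped run.
fdr = per_peak_fdr([100, 80, 60, 50], [70, 55])
print(fdr)
```

The strongest peaks get an FDR of 0% because no swapped-run peak scores as high, while weaker peaks pick up a higher FDR.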
Visualization in external genome browsers:
- BED file (zipped) with coverage every 10 nucleotides (see here for a description of the format).
- WIG file (zipped) with separate files for every chromosome and experiment and control samples (see here for more information on the format). Go here to upload these files as a custom track in the UCSC browser (or here for more information on custom tracks).
FAQs:
- Frequently Asked Questions:
- What algorithm does GenomeQuest use to call peaks?
- GenomeQuest utilizes MACS for peak-calling. More information on MACS can be obtained from the references section.
- What should I filter my peaks on?
- Filtering for significant peaks isn't necessarily trivial. Intimate knowledge of the ChIP experiment and of the binding patterns of the protein is necessary. When filtering, keep in mind the size of the protein, its binding patterns, and how efficiently the protein binds chromatin. Methods utilized upstream and downstream of your ChIP, such as antibody/cross-linker selection as well as PCR, can and will significantly affect the throughput and the ability to detect binding events.
References:
- MACS reference: Zhang Y, Liu T, Meyer CA, et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.
- Dudoignon L, Glemet E, Heus HC, Raffinot M. High similarity sequence comparison in clustering large sequence databases. Proc IEEE Comput Soc Bioinform Conf. 2002;1:228-36.



