加载中…
个人资料
Jenny
Jenny
  • 博客等级:
  • 博客积分:0
  • 博客访问:398,652
  • 关注人气:45
  • 获赠金笔:0支
  • 赠出金笔:0支
  • 荣誉徽章:
相关博文
推荐博文
谁看过这篇博文
加载中…
正文 字体大小:

【MACS】(nature-HANDBOOK)Identifying ChIP-seq enrichment using MACS

(2014-07-30 05:10:20)
标签:

macs

生物信息

分类: 【转】NGS算法
http://www.nature.com/nprot/journal/v7/n9/full/nprot.2012.101.html

Identifying ChIP-seq enrichment using MACS

J Feng, T LiuB QinY ZhangXS Liu - Nature protocols, 2012 - nature.com

Procedure

Expand All

Steps 7 - 10: Loading results generated by MACS into IGV

Procedure

Collapse all

Steps 1 - 4: Installing MACS

Timing: 10 min

  1. Set up the necessary operating system and computing environment as listed in the Equipment section.

  2. Download the MACS source code from http://github.com/downloads/taoliu/MACS/MACS-1.4.2-1.tar.gz. Locate the directory containing the downloaded source code package, and unpack the package using the following command:

    > tar xvzf MACS-1.4.2-1.tar.gz

    Critical step:A precompiled MACS package for Debian or Ubuntu Linux is available for download from the above link.

  3. Change the working directory to MACS-1.4.2 and use the standard installation command for Python packages as follows:

    > cd MACS-1.4.2

    > python setup.py install

    The second command will install MACS globally, which requires root or administrator privileges. Alternatively, a user can install MACS to a specified directory in which the user has write privileges by using the following command:

    > python setup.py install --prefix /your_directory/

    Caution:Do not install MACS in the source code directory.

    Troubleshooting

  4. Configure the shell environment variable PATH (such as the Unix shell Bash) as shown below. If MACS is installed in a user-specified directory (in Step 3), then add the following lines to the user's configuration file .bashrc in the home directory:

    > export PATH=/your_directory/bin:$PATH

    > export PYTHONPATH=/your_directory/lib/python2.X/site-packages/:$PYTHONPATH

    Here, python2.X represents the version of Python used for the setup script in Step 3. To determine the current version of Python, type the following:

    > python --version

    For example, if the output is Python 2.7.1, then 2.X must be replaced by 2.7. Load the configuration file by typing either source ∼ /.bashrc or bash on the command line to reload Bash. To temporarily change the environment variables, type the above two exportcommands in Bash.

    Troubleshooting

Step 5: Installing optional software

Timing: 30 min

  1. Download Bowtie from http://www.bowtie-bio.sourceforge.net/manual.shtml,
     a prebuilt index of hg19 for Bowtie from  ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/hg19.ebwt.zip,
    SAMtools from http://www.samtools.sourceforge.net/,
    R from http://www.cran.r-project.org/,
    PeakSplitter_Cpp_1.0.tar.gz from http://www.ebi.ac.uk/bertone/software.html and
    IGV from http://www.broadinstitute.org/igv/.
     Install each software package according to the corresponding instructions.


(根据网站http://beijicy.blog.sohu.com/281623164.html得到R(Linux版)可以从下面网站下载:http://lib.stat.cmu.edu/R/CRAN/src/base/R-3/R-3.0.0.tar.gz  )

Step 6: Running MACS to call peaks

  1. We use four different ChIP-seq data sets to illustrate how to run MACS using varying parameters: use option A to call FoxA1 peaks; option B to call H3K4me3 peaks with fragment size estimation turned on; option C to call H3K4me3 peaks with a specified DNA fragment size; or option D to call H3K36me3 peaks. Please note that some of these data sets are very large and may take an hour or more to download.

    1. Calling FoxA1 peaks

      Timing: 90 min

      1. Locate the downloaded prebuilt index for Bowtie, and unpack the package using the following command:

        > unzip hg19.ebwt.zip

        This command will generate several files with names prefixed by 'hg19' in the current directory.

      2. Download the HudsonAlpha Institute FoxA1 raw reads fromhttp://cistrome.dfci.harvard.edu/MACSNatureProtocol/HAIB_T47D_FoxA1.tar.gz, locate the download directory, unpack the compressed file and map the raw reads to the reference genome using Bowtie by entering the following two commands:

        > tar xvzf HAIB_T47D_FoxA1.tar.gz

        > bowtie –m 1 -S -q /path_to/hg19 HAIB_T47D_FoxA1.fastq HAIB_T47D_FoxA1.sam

        In these commands,

        -m 1 specifies that reads with only one hit on the genome are retained;

        -S specifies the output format as SAM;

        -q specifies the input format as FASTQ;

        /path_to/ is the directory containing the unzipped bowtie prebuilt indexes; and

        HAIB_T47D_FoxA1.fastq contains the downloaded raw reads for FoxA1.

        Please refer to the Bowtie manual for more information.

      3. Run MACS in the same directory by entering the following command:

        > macs14 -t HAIB_T47D_FoxA1.sam -n HAIB_T47D_FoxA1 –g hs -B -S --call-subpeaks

        The meanings of the parameters in this command are as follows (see also Box 1 for further parameters that could be used):

        Box 1: Additional MACS parameters

         

        -t specifies the file name for the ChIP-seq sample read alignment. MACS supports and can automatically detect any of the following file formats: SAM, BAM, BED, ELAND, ELANDMULTI, ELANDMULTIPET, ELANDEXPORT and BOWTIE. The user-specified parameter --format can override the automatic format detection.

        -g specifies the genome size. The hs parameter is a shortcut for the approximate effective genome size of humans, which equals 2.7e9.

        -n applies the prefix 'HAIB_T47D_FoxA1' to the output file names.

        -B generates signal files in the bedGraph format containing the extended read pileup at every base pair. This step is very time-consuming and memory-intensive; therefore, only specify -B if bedGraph output files are needed.

        -S generates a single bedGraph file for the whole genome; otherwise, signal files will be generated for each chromosome separately.

        --call-subpeaks asks MACS to call PeakSplitter automatically after peak calling, if the latter has been installed properly.

        Caution:Ensure that the character '/' does not appear in the specified file prefix after the -n option, as MACS will interpret the string before '/' as a directory (causing an error if this directory does not exist).

        Troubleshooting

      4. Check the screen output for the running status of MACS in the terminal. MACS generates warnings and progress reports similar to the following:

        INFO @ Sun, 03 Jun 2012 23:36:03:

        # ARGUMENTS LIST:

        # name = HAIB_T47D_FoxA1

        # format = AUTO

        # ChIP-seq file = HAIB_T47D_FoxA1.sam

        # control file = None

        # effective genome size = 2.70e+09

        # band width = 300

        # model fold = 10,30

        # pvalue cutoff = 1.00e-05

        # Large dataset will be scaled towards smaller dataset.

        # Range for calculating regional lambda is: 10000 bps

        INFO #1 read tag files.

        INFO #1 read treatment tags.

        INFO Detected format is: SAM

         

        INFO #2 Build Peak Model.

        INFO #2 number of paired peaks: 16586

        INFO #2 finished!

        INFO #2 predicted fragment length is 114 bps

        INFO #2.2 Generate R script for model : HAIB_T47D_FoxA1_model.r

        INFO #3 Call peaks.

        INFO #3 shift treatment data

        INFO #3 merge +/- strand of treatment data

        INFO #3 save the shifted and merged tag counts into bedGraph file.

        INFO write to a bedGraph file

         

        INFO #3 call peak candidates

        INFO #3 use self to calculate local lambda and filter peak candidates.

        INFO #3 Finally, 74761 peaks are called!

        INFO #4 Write output xls file. HAIB_T47D_FoxA1_peaks.xls

        INFO #4 Write peak bed file. HAIB_T47D_FoxA1_peaks.bed

        INFO #4 Write summits bed file. HAIB_T47D_FoxA1_summits.bed

        INFO #5 Done! Check the output files!

        INFO #6 Try to invoke PeakSplitter.

        INFO #6 Please check HAIB_T47D_FoxA1_peaks.subpeaks.bed file for PeakSplitter output!

        The messages provide information such as the date (the first line), key parameters (lines starting with '#') and the run progress (lines starting with 'INFO'). For lines indicating run progress, we have removed the date information and several lines to make the screen output more concise. If MACS encounters exceptions (e.g., if MACS estimates a fragment size that is too small), then warning messages appear in the list.

        Caution:Although warning messages do not affect the success of a MACS run, the majority should still be carefully evaluated. For example, the warning message 'unbalanced reads between treatment and control' means that the FDR of the resulting peaks will be overestimated when the control sample has more reads and will be underestimated when the ChIP-seq sample is sequenced more deeply. The message 'Fewer paired peaks X than 1,000' means that MACS only identified X model peaks and may indicate potential data quality issues because 1,000 model peaks are needed to robustly estimate ChIP-DNA fragment size. The message 'missing chromosome X data' might suggest that the raw input file for that chromosome is incomplete.

      5. Generate a PDF figure for the peak model using the following command (assuming that R has been installed properly):

        > Rscript HAIB_T47D_FoxA1_model.r

        This command will produce a PDF image named HAIB_T47D_FoxA1_model.pdf in the current working directory. This image illustrates the distribution of reads on positive and negative strands in the model peaks and the estimated fragment size.

      6. Verify the existence of the files listed in Table 1 in the current directory. Details of the output files are described in the ANTICIPATED RESULTS section.

        Table 1: Files generated by MACS for the HudsonAlpha Institute FoxA1 data set.

         

    2. Calling H3K4me3 peaks with fragment size estimation turned on

      Timing: 90 min

      1. Download the University of Washington H3K4me3 data set fromhttp://cistrome.dfci.harvard.edu/MACSNatureProtocol/UW_K562_H3K4me3.tar.gz. This data set contains one control replicate and one ChIP-seq replicate. Locate the directory in which the downloaded file has been stored. Extract the bundle using the following command:

        > tar xvzf UW_K562_H3K4me3.tar.gz

      2. In the same directory, run MACS as follows:

        > macs14 -t UW_K562_H3K4me3.bam -c UW_K562_H3K4me3_Control.bam -g hs -n UW_K562_H3K4me3 -B -S --call-subpeaks

        The parameter -c specifies the file name for the control sample read alignment. The other parameters follow the same convention as described in Step 6A(iii); see Box 1.

        Troubleshooting

      3. Check the screen output generated by MACS as described in Step 6A(iv). MACS will report a successful model build by displaying the following messages:

         

        INFO : #2 Build Peak Model.

        INFO : #2 number of paired peaks: 12267

        INFO : #2 finished!

        INFO : #2 predicted fragment length is 156 bps

         

        INFO : #3 use control data to filter peak candidates.

        INFO : #3 Finally, 20632 peaks are called!

        INFO : #3 find negative peaks by swapping treat and control

        INFO : #3 Finally, 4006 peaks are called!

         

      4. Generate a PDF figure named UW_K562_H3K4me3_model.pdf for the read distribution in model peaks and the estimation of fragment size by applying the following command:

        > Rscript UW_K562_H3K4me3_model.r

      5. Verify that all output files are present as described in Step 6A(vi), except that they should have the file name prefix 'UW_K562_H3K4me3' instead of 'HAIB_T47D_FoxA1'. Because a control sample is available in this case, another file ('UW_K562_H3K4me3_negative_peaks.xls') is generated that contains the peaks called by comparing the control sample with the ChIP-seq sample using the same parameters. These peaks are used by MACS for estimating the FDR of each reported ChIP-seq peak.

    3. Calling H3K4me3 peaks with a specified DNA fragment size

      Timing: 90 min

      1. Download the Broad Institute H3K4me3 data set fromhttp://cistrome.dfci.harvard.edu/MACSNatureProtocol/BROAD_GM12878_H3K4me3.tar.gz. This data set contains one ChIP-seq replicate and two control replicates. MACS runs either on a single ChIP-seq sample or on a single ChIP-seq sample having a single control; in this case, the two control replicates must be concatenated. Extract the bundle, and merge the two control replicates using the following two commands:

        > tar xvzf BROAD_GM12878_H3K4me3.tar.gz

        > samtools merge BROAD_GM12878_H3K4me3_Control.bam

        BROAD_GM12878_H3K4me3_Control_1.bam

        BROAD_GM12878_H3K4me3_Control_2.bam

      2. Run MACS as follows:

        > macs14 -t BROAD_GM12878_H3K4me3.bam -c

        BROAD_GM12878_H3K4me3_Control.bam -g hs -n BROAD_GM12878_H3K4me3 -B

        -S --call-subpeaks

      3. Check the screen output of MACS; it should contain the following lines:

         

        INFO : #2 number of paired peaks: 31077

        INFO : #2 finished!

        INFO : #2 predicted fragment length is 53 bps

         

        The model built by MACS has a fragment length of 53, which is unusually short in a typical ChIP-seq experiment. Therefore, it is preferable to rerun MACS with modified parameters, as described in Step 6C(v).

      4. If MACS is still running, terminate it by typing Ctrl + C (hold the Control key and press C). Then, remove the directory BROAD_GM12878_H3K4me3_MACS_bedGraph from the previous MACS run using the following command:

        > rm -rf BROAD_GM12878_H3K4me3_MACS_bedGraph

      5. Rerun MACS using modified parameters, as follows:

        > macs14 -t BROAD_GM12878_H3K4me3.bam -c

        BROAD_GM12878_H3K4me3_Control.bam -g hs -n BROAD_GM12878_H3K4me3 --

        nomodel --shiftsize 73 -B -S --call-subpeaks

        This command uses --nomodel to instruct MACS not to estimate the fragment size. --shiftsize 73 tells MACS to use a fixed DNA fragment size of 146 = 73 × 2.

        Critical step:The fragment size is set to 146 because a nucleosome is wrapped in a DNA sequence that is ∼146 bp in length, extending reads mapped to either DNA strand in the 3′ direction by 146 bp. Users can also specify the fragment size according to their sequencing library preparation, often in the range of 150–200 bp.

      6. Check the MACS screen output. The following messages relay that MACS did not build the model but instead used shiftsize 73:

         

        INFO : #2 Build Peak Model.

        INFO : #2 Skipped.

        INFO : #2 Use 73 as shiftsize, 146 as fragment length

         

      7. Verify all the generated files except the R script as in Step 6A(vi), where file names must contain the prefix 'BROAD_GM12878_H3K4me3' instead of 'HAIB_T47D_FoxA1'. The fileBROAD_GM12878_H3K4me3_negative_peaks.xls contains the peaks identified in the control sample over the ChIP-seq sample. In this case, because MACS did not build the peak model, no R script is generated.

    4. Calling H3K36me3 peaks

      Timing: 90 min

      1. Download the Broad Institute H3K36me3 data set fromhttp://cistrome.dfci.harvard.edu/MACSNatureProtocol/BROAD_GM12878_H3K36me3.tar.gz. This data set contains two control replicates and two ChIP-seq replicates. Extract the bundle and merge the replicates using the following commands:

        > tar xvzf BROAD_GM12878_H3K36me3.tar.gz

        > samtools merge BROAD_GM12878_H3K36me3.bam

        BROAD_GM12878_H3K36me3_1.bam BROAD_GM12878_H3K36me3_2.bam

        > samtools merge BROAD_GM12878_H3K36me3_Control.bam

        BROAD_GM12878_H3K36me3_Control_1.bam

        BROAD_GM12878_H3K36me3_Control_2.bam

      2. Run MACS to call the peaks using the following command:

        > macs14 -t BROAD_GM12878_H3K36me3.bam -c

        BROAD_GM12878_H3K36me3_Control.bam -g hs -n BROAD_GM12878_H3K36me3 --

        nomodel --shiftsize 73 -B -S --pvalue 1e-3 --call-subpeaks

        Compared with Step 6C(v), this command sets a less stringent P value cutoff (--pvalue 1e-3) than the default (1e-5). Because H3K36me3 ChIP-seq data often form broader but less-enriched regions, the parameters --nomodel and --shiftsize 73 are preferred.

        Troubleshooting

      3. Verify all the generated files as in Step 6C(vii) using the file name prefix 'BROAD_GM12878_H3K36me3' rather than 'BROAD_GM12878_H3K4me3'.

Steps 7 - 10: Loading results generated by MACS into IGV

Timing: 10 min

  1. To load a bedGraph generated by MACS into IGV, first decompress and rename the bedGraph file. As an example, consider the results of MACS on the FoxA1 data set. First, locate the bedGraph in the directory HAIB_T47D_FoxA1_MACS_bedGraph/treat.

  2. Unzip the file as follows:

    > gzip -d HAIB_T47D_FoxA1_treat_afterfiting_all.bdg.gz

  3. Change the extension of the file to 'bedGraph', which can be recognized by IGV, via the following command:

    > mv HAIB_T47D_FoxA1_treat_afterfiting_all.bdg HAIB_T47D_FoxA1_treat_afterfiting_all.bedGraph

  4. Load HAIB_T47D_FoxA1_treat_afterfiting_all.bedGraph into IGV following the IGV manual.

0

阅读 评论 收藏 转载 喜欢 打印举报/Report
  • 评论加载中,请稍候...
发评论

    发评论

    以上网友发言只代表其个人观点,不代表新浪网的观点或立场。

      

    新浪BLOG意见反馈留言板 欢迎批评指正

    新浪简介 | About Sina | 广告服务 | 联系我们 | 招聘信息 | 网站律师 | SINA English | 会员注册 | 产品答疑

    新浪公司 版权所有