TopHat的使用以及输出文件_LeoZhang

http://blog.sina.com.cn/u/1748429300

首页博文目录关于我

个人资料

微博

加好友发纸条

写留言加关注

博客等级：
博客积分：

博客访问：
关注人气：
获赠金笔：0支
赠出金笔：0支
荣誉徽章：

正文字体大小：大中小

TopHat的使用以及输出文件

(2014-10-09 10:20:19)

分类： Bioinformatics

TopHat现在的版本主要是Tophat.2.0.x以上的版本，相比于Tophat.1.x.x，主要的不同点：

Tophat2能够同时兼容bowtie1和bowtie2，但是Tophat1只能使用bwotie1.
Tophat2能够产生insertions.bed, delections.bed, 而Tophat1没有这个功能。

Tophat2最大的修改是能够兼容bowtie2，它能使用bwotie2来配对reads数据。Tophat2从2.0.2升级到2.0.13，每次升级主要是修改bug或者增加一些参数的设置，核心原理是没有变化的。

Tophat一些主要参数的设置：

-r/--mate-inner-dist ,int> 和 --mate-std-dev ：这组参数是非常重要的，用来设置paired-end reads中间的insert长度，默认平均长度是50bp和标准差20bp，但是实际情况来说，平均长度50bp通常太小，因为frangment的长度通常是200～300bp，测序的reads通常是50～75bp，若设置为50bp会导致匹配的效果很差，即properly paired非常低。通常我们可以设置为100～120bp。
--no-mixed，对于paired-end reads, 在使用bowtie2的时候，only report read alignments if both reads in a pair can be mapped.
--no-discordant，对于paired-end reads，report only concordant mappings.
--no-convert-bam, Tophat默认的输出结果是.bam文件格式，设置这个参数后将以.sam格式输出。
-G/--GTF ，使用注释文件。 If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
--transcriptome-index，当提供GTF注释文件后，需要生成transcript序列的索引文件，这是非常耗时的，但数据包含多个samples时，我们就只需要第一次生成即可。
-T/--transcriptome-only， Only align the reads to the transcriptome and report only those mappings as genomic mappings. 此命令需要在命令5或者命令6使用情况下才能使用。
--no-novel-juncs， Only look for reads across junctions indicated in the supplied GFF or junctions file. (ignored without -G/-j)，不去寻找新的junction区域，只考虑annotation中出现的，当没有-G/-j设置下，此设置是无效的。
--no-coverage-search，默认是开的，当reads大于75bp，此命令是关闭。通过converage能搜索junctions。

Example：

无GTF注释文件：

tophat -p 4 -r 120 --mate-std-dev 40 --no-mixed --no-coverage-search --no-discordant -o ./tophat_output/ ./bowtie.index/Homo_sapiens.ref_genome sample_1.fasta sample_2.fasta

有GTF注释文件

tophat -p 4 -G ensGene.gtf --no-mixed --no-coverage-search --no-discordant -o ./tophat_output/ ./bowtie.index/Homo_sapiens.ref_genome sample_1.fasta sample_2.fasta

有GTF注释文件，有多个samples，通过transcript-index减少耗时

For sample1：

tophat -p 4 -G ensGene.gtf --transcriptome-index=transcriptome_data/ensGene -o /sample1/ ./bowtie.index/Homo_sapiens.ref_genome sample1_1.fasta sample1_2.fasta

在第一次运行，transcript的index就存在transcriptome_data文件内，下次直接指向它即可，不再需要每次通过GTF文件来生成transcript的index，所以-G参数都可以省略。

For sample2：

tophat -p 4 --transcriptome-index=transcriptome_data/ensGene ./bowtie.index/Homo_sapiens.ref_genome -o /sample2/ sample2_1.fasta sample2_2.fasta

需要注意的在最新的版本Tophat.2.0.13和bowtie2.v2.2.3中，设置参数--no-discordant会在写结果的时候出错。当我降到Tophat.2.0.8b和Bowtie2.v2.1.0时，这个bug就不在出现。对于我们的方法，需要设置--no-mixed和--no-discordant参数，所以我们给Tophat和Bowtie2降版本来避免bug。

Output输出文件：

在Tophat输出文件中，我们关心accepted_hits.bam 和junctions.bed 两个文件。

accept_hits.bam 包含reads匹配的结果，匹配记录按照reads在genome序列上的位置排序，和读段名字无关。若采用bowtie2匹配，结果中连续两条记录分别是paired-end reads的两个reads的匹配位置，他和匹配的起始位置无关。

通过命令：

samtools flagstat accepted_hits.bam

可查看匹配的结果，其中 properly paired 是满足条件（双末端都匹配成功，且中间距离满足设置）的匹配读段。

当我们使用：

samtools view accepted_hits.bam|cut -f 1|sort | uniq |wc -l

会发现获得读段数目与properly paired的数目不一致，因为这里包括很多insert长度超过设置的，这部分应该是跨内含子的读段，可以用来判断junctions。（疑问？）

选择properly paired读段：

samtools view -f 0x2 accepted_hits.bam 选出properly paired读段

samtools view -F 0x2 accepted_hits.bam 排除properly paired读段

加上-c参数，可以直接获得reads的数目

junction.bed中定义的junction就是中间是intron两边是exon的这种序列结构, JUNC00000001包含两个exon和中间的intron

比如：

track name=junctions description="TopHat junctions"

test_chromosome 177 400 JUNC00000001 61 + 177 400 255,0,0 2 73,50 0,173

test_chromosome 350 550 JUNC00000002 45 + 350 550 255,0,0 2 50,50 0,150

http://images.cnitblog.com/blog/13318/201310/03085530-c735b1965cad49bf93c511652957925b.jpg

junction.bed文件格式：

chrom(test_chromosome): The name of the chromosome. 这里就是test_chromosome
chromStart(177): The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. chromosome上feature（在这里就是junction）的起始位置
chromEnd(400): The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.对应的终止位置
name(JUNC00000001): Defines the name of the BED line.
score(61): A score between 0 and 1000.
strand(+): Defines the strand: either '+' or '-'.
thickStart(177): The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
thickEnd(400): The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
itemRgb(255,0,0): An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RBG value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
blockCount(2): The number of blocks (exons) in the BED line. 外显子个数，2个
blockSizes(73,50): A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.外显子大小，两个数字用逗号隔开
blockStarts(0,173): A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount. 外显子起始位置，不从chromosome的起始开始，而是从chromStart开始数，也就是junction的起始位置开始计算，第一个碱基是0

用bed_to_junc转换后得到.juncs其中只保留了junction中intron的部分

bed_to_junc new_list.juncs

cat new_list.juncs

test_chromosome 249 350 +

test_chromosome 399 500 +

<+/->

比较.bed格式文件和gene tracker格式： refGene.bed and refGene.txt

refGene.bed:

chr1 66999824 67210768 NM_032291 0 + 67000041 67208778 0 25 227,64,25,72,57,55,176,12,12,25,52,86,93,75,501,128,127,60,112,156,133,203,65,165,2013, 0,91705,98928,101802,105635,108668,109402,126371,133388,136853,137802,139139,142862,145536,147727,155006,156048,161292,185152,195122,199606,205193,206516,207130,208931,

refGene.txt:

0 NM_032291 chr1 + 66999824 67210768 67000041 67208778 25 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768, 0 SGIP1 cmplcmpl 0,1,2,0,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,

两种格式是可以转换的：

25表示此isoform有25个exon
227 表示第一个exon的长度
【66999824+0， 6699824+227】第一个exon的起始位置，对应与refGene.txt中第一个eoxn
【66999824+91705， 6699824+91705+64】第二个exon的起始位置，对应与refGene.txt中第二个exon

阅读┊ 收藏 ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：Next-generation transcriptome assembly 应用第二代测序技术的转录组组装

后一篇：samtools常用命令详解

新浪BLOG意见反馈留言板　欢迎批评指正