Summary of Common File Formats for GWAS Analysis

(2017-06-17 14:20:42)

Category: Bioinformatics resources

Reposted from: http://www.cnblogs.com/freemao/p/6414898.html

I. HapMap Format

The first 11 columns are SNP attributes; the remaining columns hold the nucleotides observed at each SNP for each individual. The file is tab-delimited.

The first row is the header.

Each subsequent row represents one SNP.

Genotype data can be double-bit or single-bit (IUPAC code).

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219103000785-1506057908.png

Missing data is coded as NN for double-bit or N for single-bit.

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219102640457-761716220.png

http://images2015.cnblogs.com/blog/635312/201702/635312-20170220102701148-2065643518.png
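To make the HapMap layout concrete, here is a minimal Python sketch that splits tab-delimited HapMap text into the 11 attribute columns and the per-individual calls. The function names `parse_hapmap` and `is_missing`, the sample names and the demo rows are assumptions made for illustration; the header names in the demo follow the commonly seen HapMap header and are not taken from the original post.

```python
def parse_hapmap(text):
    """Split tab-delimited HapMap text into (sample_names, records).

    Each record is a pair (attrs, genotypes): attrs maps the first 11
    header names to their values, genotypes maps sample name to call.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    attr_names = header[:11]      # the 11 SNP attribute columns
    samples = header[11:]         # one column per individual
    records = []
    for ln in lines[1:]:          # one row per SNP
        fields = ln.split("\t")
        attrs = dict(zip(attr_names, fields[:11]))
        genotypes = dict(zip(samples, fields[11:]))
        records.append((attrs, genotypes))
    return samples, records


def is_missing(call):
    """NN marks missing double-bit data, N missing single-bit data."""
    return call in ("N", "NN")


# Illustrative data only (hypothetical SNP and samples).
demo = (
    "rs#\talleles\tchrom\tpos\tstrand\tassembly#\tcenter\tprotLSID\t"
    "assayLSID\tpanelLSID\tQCcode\tS1\tS2\tS3\n"
    "snp1\tA/G\t1\t100\t+\tNA\tNA\tNA\tNA\tNA\tNA\tAA\tAG\tNN\n"
)
samples, records = parse_hapmap(demo)
attrs, genotypes = records[0]
print(samples, attrs["rs#"], genotypes)
```

Reading the attribute names from the header rather than hard-coding them keeps the sketch independent of the exact spelling used by a given file.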

II. Numeric Format

Because the genotype file (GD file) contains no SNP position information, a separate map file (GM file) is required. The SNPs in the GM file must be in the same order as in the GD file.

GD file:

The first row is the header, which contains the SNP names.

Each row represents an individual rather than a SNP, the opposite of HapMap. 0 denotes homozygous 00, 1 heterozygous 01, and 2 homozygous 11.

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219103728847-166156095.png

GM file:

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219103901738-624797437.png
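The 0/1/2 coding and the GD/GM ordering requirement can be sketched in Python as follows. This is only an illustration: the helper names `to_numeric` and `same_snp_order` are hypothetical, which allele counts as "0" is an assumption passed in by the caller, and the GD header is assumed to start with one cell labelling the individual column.

```python
def to_numeric(call, allele0):
    """Convert a double-bit call such as 'AG' to 0, 1 or 2.

    The result is the number of alleles that differ from allele0
    (the allele assumed to be coded as 0). Returns None for 'NN'.
    Assumes a biallelic SNP.
    """
    if call == "NN":
        return None
    return sum(1 for base in call if base != allele0)


def same_snp_order(gd_header, gm_snp_names):
    """Check that SNPs appear in the same order in the GD and GM files.

    Assumes the first GD header cell labels the individual column.
    """
    return list(gd_header[1:]) == list(gm_snp_names)


print([to_numeric(c, "A") for c in ["AA", "AG", "GA", "GG", "NN"]])
print(same_snp_order(["taxa", "snp1", "snp2"], ["snp1", "snp2"]))
```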

III. PLINK PED File Format

1. ped and map

A ped file must be accompanied by a map file.

The ped file has no header line.

Each line has 6 + 2V fields, where V is the number of SNPs.

The first 6 columns:

Family ID ('FID')
Within-family ID (sample ID) ('IID'; cannot be '0')
Within-family ID of father (Paternal ID)('0' if father isn't in dataset)
Within-family ID of mother (Maternal ID)('0' if mother isn't in dataset)
Sex code ('1' = male, '2' = female, '0' = unknown)
Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

Columns 7 and 8 hold the alleles of the first SNP, columns 9 and 10 those of the second SNP, and so on. 0 0 denotes missing data.

Tab-delimited.

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219112326816-1100513411.png
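The 6 + 2V layout of a ped line can be sketched in Python as below. The function name `parse_ped_line`, the dictionary keys and the demo line are hypothetical and serve only to illustrate the column layout described above.

```python
def parse_ped_line(line):
    """Split one ped line into its 6 leading fields and V allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("a ped line must have 6 + 2V fields")
    fid, iid, father, mother, sex, pheno = fields[:6]
    alleles = fields[6:]
    # Pair up consecutive allele columns: (7th, 8th), (9th, 10th), ...
    genotypes = [(alleles[i], alleles[i + 1])
                 for i in range(0, len(alleles), 2)]
    return {
        "fid": fid, "iid": iid, "father": father, "mother": mother,
        "sex": sex, "pheno": pheno, "genotypes": genotypes,
    }


# Illustrative line: one individual, two SNPs, the second one missing.
sample = parse_ped_line("FAM1\tS1\t0\t0\t1\t2\tA\tG\t0\t0")
missing = [g == ("0", "0") for g in sample["genotypes"]]
print(sample["iid"], sample["genotypes"], missing)
```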

The map file has no header.

Each line corresponds to one SNP and has 4 fields:

Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
Variant identifier
Position in morgans or centimorgans (optional; also safe to use dummy value of '0')
Base-pair coordinate

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219112432222-1909949474.png
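A map line and its relationship to the ped file can be sketched in Python as follows. The helper names `parse_map_line` and `ped_matches_map` and the demo values are hypothetical; the check simply restates that a ped line carries 6 + 2V fields when the map file lists V SNPs.

```python
def parse_map_line(line):
    """Split one map line into its 4 fields."""
    chrom, snp_id, genetic_pos, bp = line.split()
    return {
        "chrom": chrom,                     # chromosome code
        "id": snp_id,                       # variant identifier
        "genetic_pos": float(genetic_pos),  # morgans or centimorgans, 0 if unknown
        "bp": int(bp),                      # base-pair coordinate
    }


def ped_matches_map(ped_line, n_map_snps):
    """True if the ped line has exactly 6 + 2V fields for V map SNPs."""
    return len(ped_line.split()) == 6 + 2 * n_map_snps


# Illustrative data only.
map_lines = ["1\trs123\t0\t1000", "1\trs456\t0\t2000"]
snps = [parse_map_line(ln) for ln in map_lines]
ok = ped_matches_map("FAM1\tS1\t0\t0\t1\t2\tA\tG\t0\t0", len(snps))
print(snps, ok)
```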

2. bed (binary genotype table), bim and fam

The bed format here is completely different from the UCSC Genome Browser BED format.

To convert ped and map files to bed, bim and fam:

plink --noweb --file PedMapPrefix --make-bed --out BedBimFamPrefix
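If the conversion is driven from a script, the same command can be assembled in Python. This is a sketch under stated assumptions: the function names are hypothetical, only the flags shown in the command above are used, and `plink` is assumed to be on the PATH when `run_make_bed` is actually called.

```python
import subprocess


def make_bed_command(in_prefix, out_prefix, plink="plink"):
    """Build the argument list for the ped/map to bed/bim/fam conversion."""
    return [plink, "--noweb", "--file", in_prefix,
            "--make-bed", "--out", out_prefix]


def expected_outputs(out_prefix):
    """The three files that --make-bed writes for a given output prefix."""
    return [out_prefix + ext for ext in (".bed", ".bim", ".fam")]


def run_make_bed(in_prefix, out_prefix):
    """Run the conversion; raises CalledProcessError if plink fails."""
    subprocess.run(make_bed_command(in_prefix, out_prefix), check=True)


cmd = make_bed_command("PedMapPrefix", "BedBimFamPrefix")
print(" ".join(cmd))
print(expected_outputs("BedBimFamPrefix"))
```

Passing the arguments as a list avoids shell quoting problems when prefixes contain spaces.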

3. Phenotype format

Three columns; the header is optional.

FID IID pheno

http://images2015.cnblogs.com/blog/635312/201702/635312-20170220130038976-958156751.png
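Reading such a three-column phenotype file can be sketched in Python as below. The function name `read_pheno`, its parameters and the demo values are hypothetical; treating `-9` as the missing-value token is an assumption based on the phenotype coding listed for the ped file.

```python
def read_pheno(text, has_header=False, missing="-9"):
    """Parse FID IID pheno rows into {(FID, IID): value or None}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if has_header:
        lines = lines[1:]          # skip the optional header row
    table = {}
    for ln in lines:
        fid, iid, value = ln.split()[:3]
        if value == missing:
            table[(fid, iid)] = None
            continue
        try:
            table[(fid, iid)] = float(value)
        except ValueError:
            # Values that float() cannot parse are treated as missing here.
            table[(fid, iid)] = None
    return table


# Illustrative data only.
pheno = read_pheno("FID IID pheno\nFAM1 S1 2.5\nFAM1 S2 -9\nFAM2 S3 NA\n",
                   has_header=True)
print(pheno)
```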

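Common GWAS software:

I. GAPIT

Zhiwu Zhang lab (http://www.zzlab.net/GAPIT/index.html)

Phenotypic data:

The first row is the header.

The first column is the sample name and the second column is the phenotype value.

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219101027035-1799311983.png

Genotypic data:

Can be in hmp format, in which case only the rs (SNP name), chrom and pos columns are used; the rest of the first 11 columns can be filled with NA.

The numeric format can also be used.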

II. GEMMA

Xiang Zhou lab (http://www.xzlab.org/index.html)

Genotypes must be imputed first; missing data is not allowed.

Genotype and phenotype data use the PLINK binary PED file format.

III. FarmCPU

Zhiwu Zhang lab (http://www.zzlab.net/FarmCPU/)

Genotypic data uses the numeric format.

Phenotypic data is the same as for GAPIT.

A PCA file is optional; it can be obtained by running GAPIT first. The first row is the header.

Each line holds one individual's principal component values. The example below has only three components.

http://images2015.cnblogs.com/blog/635312/201702/635312-20170219115718472-1322383764.png

IV. TASSEL 5

Buckler Lab (http://www.maizegenetics.net/tassel)

https://bitbucket.org/tasseladmin/tassel-5-source/wiki/UserManual/Load/Load#markdown-header-hapmap

For SNP kinship and PCA, use the hmp format; single-bit is recommended. N denotes missing data.

For TASSEL to correctly read Hapmap data, the data must be in order of position within each chromosome, and the file should be TAB delimited (example below is in Excel only for easy viewing). If some of the data is missing the correct number of TABs must still be present, so that TASSEL can properly assign data to columns.

Use -h to specify that the imported file is in hmp format. Do not use -importGuess.

-h *.hmp

V. LDAK

http://dougspeed.com/

Phenotypic data uses the PLINK phenotype format:

--pheno *.txt

UNL

Chenyong

cmiao@huskers.unl.edu
