Splitting the nr_nt database_fanyucai

http://blog.sina.com.cn/u/2214034580

Splitting the nr_nt database

(2016-10-07 23:07:26)

Category: Bioinformatics

Link: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

1. Required files

File 1: nodes.dmp

---------

This file represents taxonomy nodes. The description for each node includes the following fields:

tax_id -- node id in GenBank taxonomy database

parent tax_id -- parent node id in GenBank taxonomy database

rank -- rank of this node (superkingdom, kingdom, ...)

embl code -- locus-name prefix; not unique

division id -- see division.dmp file

inherited div flag (1 or 0) -- 1 if node inherits division from parent

genetic code id -- see gencode.dmp file

inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent

mitochondrial genetic code id -- see gencode.dmp file

inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent

GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage

hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet

comments
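As an illustration of the field layout above, here is a minimal Python sketch that reads nodes.dmp rows into a lookup table. It assumes the usual taxonomy dump layout, in which fields are separated by a tab, a vertical bar and a tab, and that no field in nodes.dmp contains a vertical bar itself. The names `Node`, `dmp_fields` and `read_nodes` are made up for this example.

```python
from typing import Dict, Iterable, List, NamedTuple


class Node(NamedTuple):
    """The three nodes.dmp columns this sketch keeps."""
    parent: int       # parent tax_id
    rank: str         # e.g. "species", "genus", "no rank"
    division_id: int  # key into division.dmp


def dmp_fields(line: str) -> List[str]:
    """Split one .dmp row on '|' and strip the surrounding tabs/spaces.

    Assumption: no field contains a literal '|'.
    """
    return [field.strip() for field in line.rstrip("\n").split("|")]


def read_nodes(lines: Iterable[str]) -> Dict[int, Node]:
    """Map tax_id -> Node(parent, rank, division_id)."""
    nodes = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        f = dmp_fields(line)
        # Column order: 0 tax_id, 1 parent tax_id, 2 rank,
        # 3 embl code, 4 division id, ...
        nodes[int(f[0])] = Node(int(f[1]), f[2], int(f[4]))
    return nodes
```

With the real file this would be called as `read_nodes(open("nodes.dmp"))`; any iterable of lines works, which makes it easy to try on a few pasted rows first.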

File 2: division.dmp

------------

Divisions file has these fields:

division id -- taxonomy database division id

division cde -- GenBank division code (three characters)

division name -- e.g. BCT, PLN, VRT, MAM, PRI...

comments

File 3: gi_taxid_nucl.dmp

------------

The file gi_taxid_nucl.dmp contains two columns: the first (left) column is the GenBank identifier (gi) of nucleotide record, the second (right) column is taxonomy identifier (taxid).

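From these three files you can build the mapping division_name -> division_id -> tax_id -> gene_id (the gi), and from it generate one list of gene_ids per division_name. (For the protein nr database, use the corresponding gi_taxid_prot.dmp instead of gi_taxid_nucl.dmp.) The gene_ids can then be split according to the divisions listed in division.dmp: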

0 | BCT | Bacteria | |

1 | INV | Invertebrates | |

2 | MAM | Mammals | |

3 | PHG | Phages | |

4 | PLN | Plants and Fungi | |

5 | PRI | Primates | |

6 | ROD | Rodents | |

7 | SYN | Synthetic and Chimeric | |

8 | UNA | Unassigned | No species nodes should inherit this division assignment |

9 | VRL | Viruses | |

10 | VRT | Vertebrates | |

11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |
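The chain division_name -> division_id -> tax_id -> gi can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the .dmp fields are separated by vertical bars with no bar inside a field, the gi/taxid file has two whitespace-separated columns, and the output is one `<division code>.list` file per division (the `.list` suffix simply mirrors the `BCT.list` name used in this post). All function names are made up for the example.

```python
from collections import defaultdict


def dmp_fields(line):
    """Split one .dmp row on '|' and strip the padding around each field."""
    return [field.strip() for field in line.rstrip("\n").split("|")]


def taxid_to_division_code(nodes_lines, division_lines):
    """Map tax_id -> three-letter division code (BCT, PLN, ...)."""
    code_by_id = {}
    for line in division_lines:
        if line.strip():
            f = dmp_fields(line)
            code_by_id[f[0]] = f[1]      # division id -> division code
    mapping = {}
    for line in nodes_lines:
        if line.strip():
            f = dmp_fields(line)
            if f[4] in code_by_id:       # column 4 is the division id
                mapping[f[0]] = code_by_id[f[4]]
    return mapping


def split_gis(gi_taxid_lines, taxid_division):
    """Group GIs by division code; GIs with an unknown taxid are skipped."""
    lists = defaultdict(list)
    for line in gi_taxid_lines:
        parts = line.split()
        if len(parts) != 2:
            continue                     # ignore malformed rows
        gi, taxid = parts
        code = taxid_division.get(taxid)
        if code is not None:
            lists[code].append(gi)
    return lists


def write_lists(lists, suffix=".list"):
    """Write one GI per line to e.g. BCT.list, PLN.list, ..."""
    for code, gis in lists.items():
        with open(code + suffix, "w") as out:
            out.write("\n".join(gis) + "\n")
```

The real gi/taxid dump is very large, so holding every GI in memory as `split_gis` does may not be practical; writing each GI straight to its division's open file handle would be the obvious variation.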

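2. Next, download the fourth file, the nr database itself, from ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Once you have the split gene_id lists, the blastdbcmd program from BLAST can extract the part of nr you want. For example: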

blastdbcmd -db /database/nr/nr -dbtype 'prot' -entry_batch BCT.list -out nr_BCT

If you get an "OID not found" error, it is probably because some of the GIs you supplied are not in the non-redundant nr database.

3. There is another way: once you have the gene list, use the blastdb_aliastool program to do the split:

blastdb_aliastool -db nr -dbtype 'prot' -gilist BCT.list -out nr_BCT

Files generated:

nr_BCT.pal

nr_BCT.p.gil

Example search:

blastx -query query.fasta -db nr_BCT -num_threads 20 -evalue 1e-10 -outfmt 5 -out blast.out

#############################

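To get the list for a particular group of species, you need its taxonomy ID. For example, the taxonomy ID of Archaea is 2157, and with it you can obtain the GI list for all Archaea.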
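One way to go from a single taxonomy ID such as 2157 to a GI list is to collect every tax_id below it in nodes.dmp and then keep the GIs whose taxid falls in that set. The Python sketch below shows the idea. It assumes the same file layouts as above (bar-separated nodes.dmp, two-column gi/taxid file), and the function names are invented for this example.

```python
from collections import defaultdict, deque


def subtree_taxids(nodes_lines, root):
    """Return the set of tax_ids at or below `root` (all as strings)."""
    children = defaultdict(list)
    for line in nodes_lines:
        if not line.strip():
            continue
        f = [field.strip() for field in line.split("|")]
        tax_id, parent = f[0], f[1]
        if tax_id != parent:             # the root node is its own parent
            children[parent].append(tax_id)
    seen = {root}
    queue = deque([root])
    while queue:                         # breadth-first walk down the tree
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def gis_in_taxa(gi_taxid_lines, taxids):
    """Yield the GIs whose taxid is in `taxids`."""
    for line in gi_taxid_lines:
        parts = line.split()
        if len(parts) == 2 and parts[1] in taxids:
            yield parts[0]
```

For Archaea this would be called with `subtree_taxids(open("nodes.dmp"), "2157")`, and the GIs yielded by `gis_in_taxa` written one per line to a list file for blastdbcmd or blastdb_aliastool.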
