Splitting the nr/nt databases

1: Files you need
File 1: nodes.dmp
---------
This file represents taxonomy nodes. The description for each node includes, among other fields, the node's tax_id, its parent tax_id, and its division id.
File 2: division.dmp
------------
Divisions file has these fields:
division id -- taxonomy database division id
division cde -- GenBank division code (three characters)
division name -- e.g. BCT, PLN, VRT, MAM, PRI...
comments
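File 3: gi_taxid_nucl.dmp
------------
The file gi_taxid_nucl.dmp contains two columns: the first (left) column is the GenBank identifier (gi) of the nucleotide record, the second (right) column is the taxonomy identifier (taxid). (The protein counterpart, gi_taxid_prot.dmp, has the same layout and is the one to use when splitting the protein database nr.)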
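From these three files you can build the mapping division_name -----> division_id -----> tax_id -----> gene_id (gi) and generate one list of gene_ids per division_name. The gene_ids can then be split along the divisions listed in division.dmp: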
0 | BCT | Bacteria | |
1 | INV | Invertebrates | |
2 | MAM | Mammals | |
3 | PHG | Phages | |
4 | PLN | Plants and Fungi | |
5 | PRI | Primates | |
6 | ROD | Rodents | |
7 | SYN | Synthetic and Chimeric | |
8 | UNA | Unassigned | No species nodes should inherit this division assignment |
9 | VRL | Viruses | |
10 | VRT | Vertebrates | |
11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |
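
The mapping chain above can be sketched in Python. This is only a minimal sketch, not the script actually used for this post. It assumes the standard taxdump layout (fields separated by "|", tax_id as the first and division id as the fifth field of nodes.dmp) and a whitespace-separated gi/taxid file. The function names are hypothetical.

```python
from collections import defaultdict


def parse_dmp_line(line):
    """Split one taxdump line such as '0 | BCT | Bacteria | |' into stripped fields."""
    line = line.strip()
    if line.endswith("|"):
        line = line[:-1]  # drop the terminating '|'
    return [field.strip() for field in line.split("|")]


def load_division_ids(division_file):
    """Map division code (e.g. 'BCT') to division id (e.g. '0')."""
    ids = {}
    for line in division_file:
        if line.strip():
            fields = parse_dmp_line(line)
            ids[fields[1]] = fields[0]
    return ids


def load_taxids_by_division(nodes_file):
    """Map division id to the set of tax_ids assigned to it in nodes.dmp."""
    by_division = defaultdict(set)
    for line in nodes_file:
        if line.strip():
            fields = parse_dmp_line(line)
            # Assumed layout: fields[0] is tax_id, fields[4] is division id.
            by_division[fields[4]].add(fields[0])
    return by_division


def gis_for_division(code, division_file, nodes_file, gi_taxid_file):
    """Yield every gi whose taxid belongs to the division with the given code."""
    division_id = load_division_ids(division_file)[code]
    taxids = load_taxids_by_division(nodes_file)[division_id]
    for line in gi_taxid_file:  # streamed, because this file is very large
        parts = line.split()
        if len(parts) == 2 and parts[1] in taxids:
            yield parts[0]
```

With the real files, open division.dmp, nodes.dmp and the gi/taxid file with open(...), call gis_for_division("BCT", ...), and write the yielded GIs one per line to BCT.list.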
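2: Next, download the fourth file, the nr database itself, from ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Once you have the per-division gene_id files, the blastdbcmd program from BLAST can extract the subset of nr you want, for example: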
blastdbcmd -db /database/nr/nr -dbtype 'prot' -entry_batch BCT.list -out nr_BCT
If you hit an "OID not found" error, it is because a gi you supplied may not be present in the non-redundant nr database.
3: There is another way: once you have the gene list, use the blastdb_aliastool program to do the split:
blastdb_aliastool -db nr -dbtype 'prot' -gilist BCT.list -out nr_BCT
Files generated:
nr_BCT.pal
nr_BCT.p.gil
Alignment example:
blastx -query query.fasta -db nr_BCT -num_threads 20 -evalue 1e-10 -outfmt 5 -out blast.out
#############################
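To get the list for a particular group of species instead, you need that group's taxonomy ID. For example, the taxonomy ID of Archaea is 2157, and with it you can obtain the gi list of all Archaea.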
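
One way to do this is to walk nodes.dmp from the chosen taxonomy ID down to all of its descendants, then keep the GIs whose taxid falls in that set. The following is a rough sketch under the same assumptions as before ("|"-separated nodes.dmp with tax_id and parent tax_id as the first two fields, whitespace-separated gi/taxid file). The function names are hypothetical and this is not necessarily how the list was produced for this post.

```python
from collections import defaultdict


def descendant_taxids(root_taxid, nodes_file):
    """Return root_taxid plus every tax_id below it in nodes.dmp."""
    children = defaultdict(list)
    for line in nodes_file:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        tax_id, parent_id = fields[0], fields[1]
        if tax_id != parent_id:  # the root node lists itself as its parent
            children[parent_id].append(tax_id)

    # Iterative depth-first walk, so deep lineages cannot hit the recursion limit.
    found = {root_taxid}
    stack = [root_taxid]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def gis_for_taxids(taxids, gi_taxid_file):
    """Yield every gi whose taxid is in the given set."""
    for line in gi_taxid_file:
        parts = line.split()
        if len(parts) == 2 and parts[1] in taxids:
            yield parts[0]
```

For Archaea, descendant_taxids("2157", open("nodes.dmp")) gives the taxid set, and the GIs yielded by gis_for_taxids can be written one per line to a list file for blastdbcmd or blastdb_aliastool.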