Basic Usage of IDBA-UD for Genome Assembly_ZhongjieWang

http://blog.sina.com.cn/u/1812096841


(2016-11-24 20:34:04)

Tags: linux

Category: Bioinformatics

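Until now I have always assembled genomes with another tool, SPAdes, and the results have been quite good. I had long heard of IDBA, though, so with sequencing data for two bacterial strains on hand I assembled each genome with both tools to compare the results. I had seen a comparison chart of several assemblers on the SPAdes website: unsurprisingly, SPAdes ranked first, but IDBA came second, which suggests its assemblies are decent too.

1. Usage

Installation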

If you use the release package, extract it, then run configure and make to compile the source code.

$ ./configure
$ make

Introduction

IDBA is the basic iterative de Bruijn graph assembler for second-generation sequencing reads.

The package has three main components:

IDBA-UD, an extension of IDBA, is designed to utilize paired-end reads to assemble low-depth regions and to use progressive depth on contigs to reduce errors in high-depth regions. It is a general-purpose assembler and especially good for single-cell and metagenomic sequencing data.

IDBA-Hybrid is an updated version of IDBA-UD that can make use of a similar reference genome to improve the assembly result.

IDBA-Tran is an iterative de Bruijn graph assembler for RNA-Seq data.

The basic IDBA is included only for comparison; you should use one of the more specific assemblers for your data.

If you are assembling genomic data without reference, please use IDBA-UD.

If you are assembling genomic data with a similar reference genome, please use IDBA-Hybrid. If you are assembling transcriptome data, please use IDBA-Tran.

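Format conversion: FASTQ to FASTA

Note that IDBA accepts input only in FASTA format, and the forward and reverse reads must be in a single file. Helpfully, the package ships with its own format conversion tool.

IDBA series assemblers accept FASTA format reads. FASTQ format reads can be converted with the fq2fa program in the package.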

$ bin/fq2fa read.fq read.fa

IDBA-UD, IDBA-Hybrid and IDBA-Tran require paired-end reads stored in a single FASTA file, with the two reads of each pair in consecutive records. If your reads are in two separate FASTQ files, use fq2fa to merge them into a single file.

$ bin/fq2fa --merge --filter read_1.fq read_2.fq read.fa

Or, if both reads of each pair are already in one FASTQ file, convert that file to FASTA.

$ bin/fq2fa --paired --filter read.fq read.fa

These tools assume the paired-end reads are in the order (->, <-). If your data is in the order (<-, ->), please convert it yourself.

2. Parameters
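To make the "paired reads in consecutive records" layout concrete, here is a minimal Python sketch of interleaving two FASTQ files into one FASTA stream. It is not how fq2fa is implemented; the function names are hypothetical, and it assumes plain 4-line FASTQ records with both files listing the pairs in the same order. It does no filtering.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) from 4-line FASTQ records (assumed layout)."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4))
        if len(chunk) < 4:  # end of input or truncated record
            return
        header, seq = chunk[0].strip(), chunk[1].strip()
        yield header[1:], seq  # drop the leading '@'


def interleave_fastq_to_fasta(fq1: Iterable[str], fq2: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with read 1 and read 2 of each pair back to back."""
    for (name1, seq1), (name2, seq2) in zip(fastq_records(fq1), fastq_records(fq2)):
        yield f">{name1}"
        yield seq1
        yield f">{name2}"
        yield seq2


# Toy input: one pair, given as in-memory lines instead of real files.
reads_1 = ["@pair1/1", "ACGT", "+", "IIII"]
reads_2 = ["@pair1/2", "TTGA", "+", "IIII"]
print("\n".join(interleave_fastq_to_fasta(reads_1, reads_2)))
```

For real data, fq2fa with --merge remains the simpler choice; the sketch only shows what the merged file is expected to look like.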

Note that IDBA assemblers are designed for short reads (around 100bp). If you want to assemble paired-end reads with longer read length, please modify the constant kMaxShortSequence in src/sequence/short_sequence.h to support longer read length.

Please find the manual by running the assembler without any parameters. For example:

$ bin/idba_ud




IDBA-UD - Iterative de Bruijn Graph Assembler for sequencing data with highly uneven depth.

Usage: idba_ud -r read.fa -o output_dir

Allowed Options:

  -o, --out arg (=out)                   output directory
  -r, --read arg                         fasta read file (<=128)
      --read_level_2 arg                 paired-end reads fasta for second level scaffolds
      --read_level_3 arg                 paired-end reads fasta for third level scaffolds
      --read_level_4 arg                 paired-end reads fasta for fourth level scaffolds
      --read_level_5 arg                 paired-end reads fasta for fifth level scaffolds
  -l, --long_read arg                    fasta long read file (>128)
      --mink arg (=20)                   minimum k value (<=124)
      --maxk arg (=100)                  maximum k value (<=124)
      --step arg (=20)                   increment of k-mer of each iteration
      --inner_mink arg (=10)             inner minimum k value
      --inner_step arg (=5)              inner increment of k-mer
      --prefix arg (=3)                  prefix length used to build sub k-mer table
      --min_count arg (=2)               minimum multiplicity for filtering k-mer when building the graph
      --min_support arg (=1)             minimum supoort in each iteration
      --num_threads arg (=0)             number of threads
      --seed_kmer arg (=30)              seed kmer size for alignment
      --min_contig arg (=200)            minimum size of contig
      --similar arg (=0.95)              similarity for alignment
      --max_mismatch arg (=3)            max mismatch of error correction
      --min_pairs arg (=3)               minimum number of pairs
      --no_bubble                        do not merge bubble
      --no_local                         do not use local assembly
      --no_coverage                      do not iterate on coverage
      --no_correct                       do not do correction
      --pre_correction                   perform pre-correction before assembly
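The --mink, --maxk and --step options together determine which k values the iterative assembly visits. The Python sketch below lists them for the defaults shown in the help text. The function name is hypothetical, and it assumes the schedule is a plain arithmetic progression that includes maxk when the step divides the range evenly; how the assembler treats a step that does not divide evenly is not covered here.

```python
from typing import List


def k_schedule(mink: int = 20, maxk: int = 100, step: int = 20) -> List[int]:
    """List the k values from mink up to maxk in increments of step.

    Defaults are taken from the help text above; the progression itself
    is an assumption for illustration.
    """
    if mink > maxk or step <= 0:
        raise ValueError("need mink <= maxk and step > 0")
    return list(range(mink, maxk + 1, step))


print(k_schedule())
```

With the defaults this gives five iterations; a smaller step means more iterations and a longer run.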



After converting the FASTQ data and merging it into a single FASTA file, I used the -l option because the reads are longer than 128 bp. The final command was:

$ idba_ud -l ***.fasta --pre_correction --min_contig 500 --num_threads 50 -o /home/mydata/idba-output
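The choice between -r and -l depends on the longest read in the file, so it can be worth checking that before running the assembler. The following is a rough Python sketch: the function names are hypothetical, the limit of 128 is taken from the help text, and treating a modified kMaxShortSequence as the new limit is an assumption.

```python
from typing import Iterable


def max_read_length(fasta_lines: Iterable[str]) -> int:
    """Return the length of the longest sequence in FASTA-formatted lines."""
    longest = 0
    current = 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            longest = max(longest, current)
            current = 0
        else:
            current += len(line)  # sequences may span several lines
    return max(longest, current)


def suggest_read_option(longest: int, short_limit: int = 128) -> str:
    """Suggest -r for reads within the short-read limit, otherwise -l."""
    return "-r" if longest <= short_limit else "-l"


# Toy example with a 150 bp read, as in this dataset.
toy_fasta = [">read1", "A" * 150, ">read2", "C" * 100]
longest = max_read_length(toy_fasta)
print(longest, suggest_read_option(longest))
```

For a real file, pass an open file handle, for example `max_read_length(open("reads.fa"))`, where `reads.fa` is a placeholder name.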

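As it turned out, SPAdes assembled the two genomes into 35 and 67 scaffolds respectively. IDBA, even after I raised kMaxShortSequence from its default of 128 to 256 (our reads are 150 bp each), still ended up with 142 and 561 scaffolds. Fewer scaffolds is generally better, since each scaffold is then longer; the ideal outcome is a single sequence, which would be the complete genome.

By this measure IDBA produced roughly 4 and 8 times as many scaffolds as SPAdes (142/35 ≈ 4.1 and 561/67 ≈ 8.4). The two tools use different algorithms, which directly leads to different results, and read length matters as well: IDBA is better suited to assembling short reads. For this dataset, IDBA's assemblies were clearly not as good as those from SPAdes.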
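The comparison above rests on scaffold counts, and the ratios are easy to check. The Python sketch below recomputes them from the four counts reported in this post and adds a small N50 helper, since scaffold length is the other thing being argued about. N50 is a standard assembly metric that I did not compute for these data; the function names and the toy lengths are made up for illustration.

```python
from typing import Iterable, List


def fold_increase(idba_count: int, spades_count: int) -> float:
    """How many times more scaffolds IDBA produced than SPAdes."""
    return round(idba_count / spades_count, 1)


def n50(lengths: Iterable[int]) -> int:
    """Length L such that scaffolds of length >= L cover half the assembly."""
    ordered: List[int] = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Scaffold counts reported above: (IDBA, SPAdes) for each genome.
print(fold_increase(142, 35))
print(fold_increase(561, 67))

# Toy scaffold lengths, not from the real assemblies.
print(n50([10, 8, 5, 3, 2]))
```

Scaffold lengths for a real assembly can be collected by summing sequence line lengths per record in the output FASTA.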
