
Basic usage of IDBA-UD for genome assembly

(2016-11-24 20:34:04)
Tags: linux

Category: Bioinformatics

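       Until now I have assembled genomes with a different program, SPAdes, and the results have been quite good. IDBA is also well known, though, so having just received data for two bacterial strains, I took the opportunity to assemble both with each program and compare the results. The SPAdes website shows a comparison chart of several assemblers. Unsurprisingly SPAdes ranks first there, but IDBA comes second, which suggests that IDBA assembles reasonably well.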

1. Usage notes

Installation

If you use the release package, extract it, then run configure and make to compile the source code.

$ ./configure
$ make

Introduction

IDBA is the basic iterative de Bruijn graph assembler for second-generation sequencing reads.

The package has three main components:

 

IDBA-UD, an extension of IDBA, is designed to utilize paired-end reads to assemble low-depth regions and to use progressive depth on contigs to reduce errors in high-depth regions. It is a general-purpose assembler and especially good for single-cell and metagenomic sequencing data.

 

IDBA-Hybrid is an updated version of IDBA-UD that can make use of a similar reference genome to improve the assembly result.

 

IDBA-Tran is an iterative de Bruijn graph assembler for RNA-Seq data.

 

The basic IDBA is included for comparison only; you should use a more specific assembler for your data.

If you are assembling genomic data without reference, please use IDBA-UD.

If you are assembling genomic data with a similar reference genome, please use IDBA-Hybrid. If you are assembling transcriptome data, please use IDBA-Tran.
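
The selection rule above can be written down as a small lookup. This is only an illustrative sketch: the function name pick_assembler and the data-type labels are made up here, and the binary names idba_hybrid and idba_tran are assumed to follow the same naming as idba_ud.

```shell
# Hypothetical helper (not part of IDBA): map a data type to the
# assembler recommended in the text above.
pick_assembler() {
    case "$1" in
        genomic)           echo "idba_ud" ;;      # genomic data, no reference
        genomic-reference) echo "idba_hybrid" ;;  # assumed binary name
        transcriptome)     echo "idba_tran" ;;    # assumed binary name
        *)
            echo "unknown data type: $1" >&2
            return 1
            ;;
    esac
}
```

For the bacterial genomes in this post, which have no reference, the lookup gives idba_ud.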

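Converting FASTQ to FASTA

Note that IDBA only accepts input in FASTA format, and the forward and reverse reads must be placed in a single file. Conveniently, the package ships with its own format conversion tool.

IDBA series assemblers accept FASTA format reads. FASTQ format reads can be converted with the fq2fa program in the package.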

$ bin/fq2fa read.fq read.fa

IDBA-UD, IDBA-Hybrid and IDBA-Tran require paired-end reads to be stored in a single FastA file, with the two reads of each pair in consecutive records. If yours are not, use fq2fa to merge two FastQ read files into a single file.

$ bin/fq2fa --merge --filter read_1.fq read_2.fq read.fa

Or convert a single paired-end FastQ read file to a FastA file:

$ bin/fq2fa --paired --filter read.fq read.fa

These tools assume the paired-end reads are in the order (->, <-). If your data is in the order (<-, ->), please convert it yourself.
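
To make the expected layout concrete, here is a minimal sketch of what merging two FastQ files into one interleaved FastA file amounts to. It is not fq2fa's implementation: the function name interleave_to_fasta is made up for illustration, it assumes plain 4-line FastQ records with both files in the same read order, and it does not drop reads containing N the way --filter is meant to.

```shell
# Hypothetical helper (not part of IDBA): interleave two FastQ files with
# 4 lines per record into one FastA stream, mate 1 followed by mate 2.
interleave_to_fasta() {
    awk -v mate="$2" '
        NR % 4 == 1 { header = ">" substr($0, 2) }   # "@name" becomes ">name"
        NR % 4 == 2 {
            # Read the matching 4-line record from the second file.
            if ((getline m_header < mate) <= 0) exit 1
            getline m_seq  < mate
            getline m_plus < mate
            getline m_qual < mate
            print header
            print $0
            print ">" substr(m_header, 2)
            print m_seq
        }
    ' "$1"
}
```

Under these assumptions, running interleave_to_fasta read_1.fq read_2.fq > read.fa should give a file in which each forward read is immediately followed by its mate. For real data the bundled fq2fa remains the safer choice.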

 

2. Parameters

Note that IDBA assemblers are designed for short reads (around 100bp). If you want to assemble paired-end reads with longer read length, please modify the constant kMaxShortSequence in src/sequence/short_sequence.h to support longer read length.

Please find the manual by running the assembler without any parameters. For example:

$ bin/idba_ud


IDBA-UD - Iterative de Bruijn Graph Assembler for sequencing data with highly uneven depth.
Usage: idba_ud -r read.fa -o output_dir
Allowed Options:
  -o, --out arg (=out)                   output directory
  -r, --read arg                         fasta read file (<=128)
      --read_level_2 arg                 paired-end reads fasta for second level scaffolds
      --read_level_3 arg                 paired-end reads fasta for third level scaffolds
      --read_level_4 arg                 paired-end reads fasta for fourth level scaffolds
      --read_level_5 arg                 paired-end reads fasta for fifth level scaffolds
  -l, --long_read arg                    fasta long read file (>128)
      --mink arg (=20)                   minimum k value (<=124)
      --maxk arg (=100)                  maximum k value (<=124)
      --step arg (=20)                   increment of k-mer of each iteration
      --inner_mink arg (=10)             inner minimum k value
      --inner_step arg (=5)              inner increment of k-mer
      --prefix arg (=3)                  prefix length used to build sub k-mer table
      --min_count arg (=2)               minimum multiplicity for filtering k-mer when building the graph
      --min_support arg (=1)             minimum supoort in each iteration
      --num_threads arg (=0)             number of threads
      --seed_kmer arg (=30)              seed kmer size for alignment
      --min_contig arg (=200)            minimum size of contig
      --similar arg (=0.95)              similarity for alignment
      --max_mismatch arg (=3)            max mismatch of error correction
      --min_pairs arg (=3)               minimum number of pairs
      --no_bubble                        do not merge bubble
      --no_local                         do not use local assembly
      --no_coverage                      do not iterate on coverage
      --no_correct                       do not do correction
      --pre_correction                   perform pre-correction before assembly
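
As a rough illustration of the iteration controlled by --mink, --maxk and --step, the sketch below lists the k values obtained by starting at the minimum and adding the step until the maximum is passed. The function name list_k_values is made up here, and the sketch makes no claim about how the assembler itself treats a maximum that is not reached exactly.

```shell
# Hypothetical helper (not part of IDBA): list k = mink, mink + step, ...
# up to and including maxk. Defaults mirror the help text (20, 100, 20).
list_k_values() {
    lk_k=${1:-20}
    lk_max=${2:-100}
    lk_step=${3:-20}
    while [ "$lk_k" -le "$lk_max" ]; do
        echo "$lk_k"
        lk_k=$((lk_k + lk_step))
    done
}
```

With the defaults this prints 20, 40, 60, 80 and 100, one per line.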

After converting the FASTQ data and merging it into a single FASTA file, I used the -l option because the reads are longer than 128 bp. The final command was:
$ idba_ud -l ***.fasta --pre_correction --min_contig 500 --num_threads 50 -o /home/mydata/idba-output
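
The choice between -r and -l follows from the 128 bp limits printed in the help text. A minimal sketch of that check is shown here; the function names max_read_length and read_flag are made up for illustration, and the optional second argument is only a guess at how one would account for a rebuilt binary with a larger kMaxShortSequence.

```shell
# Hypothetical helper: print the longest sequence length in a FASTA file
# (sequence lines may be wrapped over several lines).
max_read_length() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; next }
             { len += length($0) }
        END  { if (len > max) max = len; print max + 0 }
    ' "$1"
}

# Hypothetical helper: print -r if every read fits the short-read limit
# (128 by default, as in the help text), otherwise print -l.
read_flag() {
    rf_limit=${2:-128}
    if [ "$(max_read_length "$1")" -le "$rf_limit" ]; then
        echo "-r"
    else
        echo "-l"
    fi
}
```

For 150 bp reads, read_flag prints -l with the default limit and -r when called with a limit of 256.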
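As it turned out, SPAdes assembled the two genomes into 35 and 67 scaffolds respectively, whereas IDBA produced 142 and 561 scaffolds, even after I raised kMaxShortSequence from its default of 128 to 256 (our reads are 150 bp long). Fewer scaffolds is generally better, because each scaffold is then longer. The ideal outcome is a single sequence, which would be the complete genome. Measured by scaffold count, IDBA did roughly 4 times and 8 times worse on the two genomes (142/35 ≈ 4.1, 561/67 ≈ 8.4). Different algorithms naturally lead to different results, and read length matters too: IDBA is better suited to assembling short reads. For this data set, IDBA's assembly was clearly not as good as SPAdes'.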
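
The comparison of scaffold counts can be reproduced with two small helpers. This is a sketch only: the function names count_scaffolds and fold_difference are made up, and the file you pass in (for example a scaffold FASTA in each assembler's output directory) is an assumption about where your results live.

```shell
# Hypothetical helper: count sequences in a FASTA file by counting headers.
count_scaffolds() {
    grep -c '^>' "$1"
}

# Hypothetical helper: ratio of two counts, rounded to one decimal place.
fold_difference() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}
```

With the counts reported in this post, fold_difference 142 35 prints 4.1 and fold_difference 561 67 prints 8.4.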
