Third-Generation Genome Assembly Study Notes, Part 1 (Literature Reading)

(2017-06-01 08:53:00)

Category: Recommended literature

• From: 2017, Best Practices for Whole Genome Sequencing Using the Sequel System

For plant and animal genomes, one SMRT cell with the latest Sequel chemistry yields 5-8 Gb of data. For shearing and library construction, SMRTbell libraries of >30 kb are recommended. Read lengths averaging 10-15 kb can be routinely achieved, with the longest reads >60 kb. Furthermore, 50% of usable bases are in reads greater than 20 kb.
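The per-cell yield above translates directly into a rough cell count for a project. A minimal sketch of that arithmetic follows; the 1 Gb genome and 60x target are made-up inputs for illustration, not figures from the document.

```python
import math

def cells_needed(genome_size_gb, target_coverage, yield_per_cell_gb):
    """Rough number of SMRT cells needed to reach a target coverage."""
    total_gb = genome_size_gb * target_coverage
    return math.ceil(total_gb / yield_per_cell_gb)

# Hypothetical 1 Gb genome at 60x, bracketed by the 5 and 8 Gb per-cell yields.
cells_at_low_yield = cells_needed(1.0, 60, 5.0)
cells_at_high_yield = cells_needed(1.0, 60, 8.0)
print(cells_at_low_yield, cells_at_high_yield)
```

Real yields vary with library quality and loading, so a plan based on this estimate should lean toward the lower yield.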

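• From: 2016, Long-read sequence assembly of the gorilla genome

Libraries had 20-30 kbp inserts, prepared from DNA sheared to 35 kbp, and the run (movie) time was 6 hours. A total of 236 SMRT cells were sequenced on the RS II, giving 83.7-fold coverage. The mean subread length was 12.9 kbp.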

• From: 2015, Long-read sequencing and de novo assembly of a Chinese genome

Average read length was 7 kb at 103x coverage, and the assembly had 5,843 contigs (N50 = 8.3 Mb). The assembler was a modified FALCON (https://github.com/WGLab/EnhancedFALCON). Two of its key parameters are:

# The length cutoff used for seed reads used for initial mapping

length_cutoff = 6000

# The length cutoff used for seed reads used for pre-assembly

length_cutoff_pr = 12000

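• From: 2016, De novo assembly and phasing of a Korean human genome

Assembly used FALCON and Quiver and produced 3,128 contigs with a contig N50 of 17.9 Mb. The average read length was 7 kb. Combined with the advice of the software's authors: length_cutoff can be set fairly low, for example close to the average read length, while for length_cutoff_pr it is worth trying several values. In this paper length_cutoff was set to 10 kb (see Extended Data Figure 2 of the paper).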
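One way to compare candidate length_cutoff values is to ask how much coverage the seed reads still provide at each cutoff. The sketch below is only an illustration of that idea, not FALCON's own logic; the read lengths and genome size are toy values.

```python
def seed_coverage(read_lengths, genome_size, length_cutoff):
    """Coverage contributed by reads at least length_cutoff long."""
    seed_bases = sum(n for n in read_lengths if n >= length_cutoff)
    return seed_bases / genome_size

def mean_length(read_lengths):
    """Average read length, a starting point for length_cutoff."""
    return sum(read_lengths) / len(read_lengths)

# Toy data (assumed): six reads over a 10 kb "genome".
reads = [3000, 5000, 7000, 9000, 12000, 20000]
genome = 10000

print("mean read length:", round(mean_length(reads)))
for cutoff in (6000, 10000, 12000):
    print(cutoff, seed_coverage(reads, genome, cutoff))
```

A cutoff that leaves too little seed coverage starves the pre-assembly step, which is why trying several values is reasonable.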

pa_concurrent_jobs controls how many jobs are submitted to the cluster at one time.

pa_DBsplit_option: for large genomes you should use -s400 (400 Mb of sequence per block). Increasing this value reduces the number of jobs. For typical cases it can be set to 200 or 50.
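To see why a larger block size reduces the job count, here is a rough estimate, assuming blocks are filled to the -s size and that every block is compared against every other block and itself. The 100 Gb read total is a made-up input, and the real numbers depend on read trimming and on how comparisons are grouped into jobs.

```python
import math

def block_count(total_bases, block_size_mb):
    """Approximate number of database blocks for a given -s value (in Mb)."""
    return math.ceil(total_bases / (block_size_mb * 1_000_000))

def block_pairs(n_blocks):
    """Number of block-versus-block comparisons, including self-comparisons."""
    return n_blocks * (n_blocks + 1) // 2

total = 100_000_000_000  # hypothetical 100 Gb of raw reads
for size_mb in (50, 200, 400):
    n = block_count(total, size_mb)
    print(f"-s{size_mb}: {n} blocks, {block_pairs(n)} block pairs")
```

Because the number of pairs grows roughly with the square of the block count, doubling the block size cuts the comparisons by about a factor of four, at the cost of more memory per job.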

pa_HPCdaligner_option: the option previously named dal has been replaced by B.

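--max_diff is generally set to twice the average coverage. It filters out reads whose coverage differs greatly between the two ends, since these are likely to be repeats.

--min_cov: setting it to 5 is generally safe.

--max_cov is generally set to three times the average sequencing depth.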
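The three rules of thumb above can be turned into an option string mechanically. This is a minimal sketch of the arithmetic only; the helper name and the 40x example coverage are assumptions for illustration.

```python
def overlap_filter_options(avg_coverage, min_cov=5):
    """Build overlap-filter options from the rules of thumb in the notes."""
    max_diff = round(2 * avg_coverage)  # twice the average coverage
    max_cov = round(3 * avg_coverage)   # three times the average coverage
    return f"--max_diff {max_diff} --max_cov {max_cov} --min_cov {min_cov}"

# Hypothetical average coverage of 40x.
options = overlap_filter_options(40)
print(options)
```

The coverage plugged in here should be the one actually seen by the overlap filter, so it is worth re-deriving these values whenever the length cutoffs change.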

• From: 2015, Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum

If the genome being assembled is small, HGAP gives a better assembly than Falcon.

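Metrics that can be guaranteed for third-generation sequencing projects:

All-long-read assembly	Simple genome, de novo sequencing	Complex genome, de novo sequencing
Sequencing strategy and depth	Long reads (60x) + 10x Genomics / BioNano / Chicago	Long reads (80x) + 10x Genomics / BioNano / Chicago
Guaranteed metrics	contig N50 ≥ 1 Mb; scaffold N50 ≥ 3 Mb	contig N50 ≥ 500 kb; scaffold N50 ≥ 1 Mb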
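Since most of the figures in these notes are N50 values, a small sketch of how N50 is computed and checked against a threshold such as those in the table may help. The contig lengths are toy values.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

# Toy contig lengths in bp (assumed for illustration).
contigs = [2_000_000, 1_500_000, 500_000, 300_000]
contig_n50 = n50(contigs)
print(contig_n50, contig_n50 >= 1_000_000)
```

The same function applies to scaffold lengths when checking the scaffold N50 column.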

• From: 2017, Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data

Sequencing chemistry was P6-C4. The mean subread lengths were 8.5 kb, 6.9 kb and 7.9 kb, respectively.

Three species were sequenced in this paper, at average depths of 86x, 47x and 54x. Two assembly methods were used: Falcon (v0.3.0) and PBcR (with Celera Assembler 8.3rc2).

The long-read data were first quality-filtered with SMRT Analysis software (v2.3), removing subreads shorter than 500 bp or with quality (QV) below 80. Some of the scripts used in the paper: https://github.com/wen-biao/OM-HiC-scaffolding

• From: 2015, BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

BUSCO is used to assess the completeness of a genome assembly.

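• From: 2017, A High-Quality Genome Assembly of SMRT Sequences Reveals Long-Range Haplotype Structure in the Diploid Mosquito Aedes aegypti

Library insert size was 20-30 kb, run (movie) time was 6 hours, and sequencing coverage was 110x. Sequencing was done on the RS II and the assembler was FALCON-Unzip.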

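• From: 2017, Long-read assembly of the Aedes aegypti genome reveals the nature of heritable adaptive immunity sequences

Sequencing depth was 76x, with a mean read length of 15.5 kb and a mean subread length of 13.2 kb. The genome was assembled with FALCON and Quiver. A total of 116 cells were sequenced: 84 cells at >15 kb and 32 cells at >17 kb.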

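• From: 2017, High-quality, highly contiguous re-assembly of the pig genome

Sequencing depth was 65x. Quiver was used to error-correct the sequencing data and Falcon to assemble it. The 65x data were then used with PBJelly to fill gaps. The assembly was polished with Arrow (using the 65x data) and Pilon (40x paired-end Illumina).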

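• From: 2017, The genome of Chenopodium quinoa

Quinoa was sequenced with P6-C4 chemistry, with a mean read length of 12,444 bp. The assembly workflow was smrtmake (https://github.com/PacificBiosciences/smrtmake). Reads were filtered with filter='MinReadScore=0.80,MinSRL=500,MinRL=100', the assembler was Celera Assembler, and final polishing used Quiver.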
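To make the filter string concrete, the sketch below parses it and applies it to a single read. It assumes the three keys mean minimum read score, minimum subread length and minimum read length; the helper names and the example read values are hypothetical and are not part of smrtmake.

```python
def parse_filter(spec):
    """Turn 'Key=value,Key=value' into a dict of numeric thresholds."""
    thresholds = {}
    for item in spec.split(","):
        key, value = item.split("=")
        thresholds[key.strip()] = float(value)
    return thresholds

def passes(read_score, subread_len, read_len, thresholds):
    """True if a read meets all three minimum thresholds."""
    return (read_score >= thresholds["MinReadScore"]
            and subread_len >= thresholds["MinSRL"]
            and read_len >= thresholds["MinRL"])

limits = parse_filter("MinReadScore=0.80,MinSRL=500,MinRL=100")
print(passes(0.85, 1200, 5000, limits))  # made-up read that clears every threshold
print(passes(0.75, 1200, 5000, limits))  # made-up read with a low read score
```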
