MIRA-related (An Automated Genome and EST Assembler;
author: Bastien Chevreux; latest version: V2.9.46, 22/06/2009)
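MIRA is a multi-pass assembler, used mainly for genome or EST (expressed sequence tag) data.
It can handle mixed data types such as Solexa, 454 and 3730 reads, and it can be used to detect repeats and SNPs.
However, MIRA takes a fairly long time to run, especially when it processes mate-pair (paired-end) information.
MIRA refines the assembled contigs step by step. For every base in each consensus sequence it builds a repeat-ratio
histogram, then scores and tags the base.
MIRA handles SNPs with a strategy similar to that of DNPTrapper.
MIRA gets confused if your input DNA sequences come from different individuals,
because they carry very different SNP sites. In that case you must tell MIRA that the sequences come from the genomes of different individuals.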
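Some comments:
With the GERALD (Solexa) pipeline, quality scores can be obtained by aligning reads against a reference genome with ELAND (Illumina's alignment software). Other software (such as the R package ROLEXA) can also convert the raw base probabilities supplied by the sequencing center into base calls and quality values.
gsAssembler (Newbler, the 454 analysis software) V2.3 computes read overlaps and multiple alignments in nucleotide space, while consensus basecalling and quality value determination for contigs are performed in flow space. The GSAssembly
v.2.3 manual states: “Read overlaps and multiple alignments are made in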
“nucleotide” space while the consensus basecalling and quality
value determination for contigs are performed in “flowspace”. Work
in flowspace allows the quality-weighted averaging of processed
flow signals (a continuous variable) at each nucleotide flow of the
sequencing Run(s) and allows the use of information from the
“negative flows”, i.e. flows where no nucleotide incorporation is
detected. The use of flowspace in determining the properties of the
consensus sequence results in an improved accuracy for the final
basecalls.” (Reference:
Practical tips:
* Overview
Illumina and Sanger FASTQ files use different ASCII quality encodings, as shown below:
FASTQ format:
@HWI-EAS59:1:1:0:899#0/1
NAGTAAATCCCTACTTGAATTCGAGCACTGCAACAAACTA
+HWI-EAS59:1:1:0:899#0/1
DNWWZYPYZMKNYS[SBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS59:1:1:1:449#0/1
NGGGAATGATTAATTCCACAAAACAAAAGAAAGAAGTCGG
+HWI-EAS59:1:1:1:449#0/1
DIZZZ[YXS[U[XTWYTXXXRQLGY[YWPPZSXY[BBBBB
@HWI-EAS59:1:1:1:1018#0/1
NGTAGATAAAAAAATAAACTCAAATAAATTAAAGAG
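The difference between the two encodings comes down to the ASCII offset added to each quality score. The sketch below is only an illustration and rests on an assumption: that the reads above use the Illumina Phred+64 encoding (pipeline 1.3 or later), so that subtracting 64 gives the Phred score and adding 33 gives the Sanger character. Older Solexa files store a different kind of score and would need a different formula. The function names are hypothetical and do not come from MIRA or any other tool.

```python
def illumina64_to_sanger33(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    scores = [ord(ch) - 64 for ch in quality]
    if any(q < 0 for q in scores):
        # Characters below '@' cannot occur in Phred+64 data.
        raise ValueError("input does not look like Phred+64")
    return "".join(chr(q + 33) for q in scores)


def phred_scores(quality, offset):
    """Decode a quality string into integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


# Quality line of the first read shown above (assumed to be Phred+64).
illumina_quality = "DNWWZYPYZMKNYS[SBBBBBBBBBBBBBBBBBBBBBBBB"
sanger_quality = illumina64_to_sanger33(illumina_quality)

print(phred_scores(illumina_quality, 64)[:5])  # [4, 14, 23, 23, 26]
print(sanger_quality[:5])                      # %/88;
```

Decoding the converted string with offset 33 gives the same scores as decoding the original with offset 64, which is a quick way to check a conversion before passing reads to an assembler.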

(2009-12-19 01:53)
To make epigenetics easier to understand, I have put a few illustrations here. If you want more background on epigenetics, you can visit the University
of Utah’s Epigenetics training site or the almighty Wikipedia

How to become a bioinformatics expert
Why become a bioinformatics
expert?
Recent years have seen an explosive growth in biological data.
Large sequencing projects are producing increasing quantities of
nucleotide sequences. The contents of nucleotide databases are
doubling in size approximately every 14 months. The latest release
of GenBank (V.102) exceeded one billion base pairs. Not only is the
volume of sequence data increasing rapidly, but the number of
characterized genes from many organisms, and of protein structures,
doubles about every two years. To cope with this great quantity of
data, a new scientific discipline has emerged:
bioinformatics, biocomputing or computational
biology.
But how to become a bioinformatics
expert?
Bioinformatics combines the tools and techniques of mathematics,
computer science and biology in order to understand the biological
significance
A reasonably thorough table of next-gen-seq software
available in the commercial and public domain
Source: http://seqanswers.com/forums/showthread.php?t=43
Integrated solutions
1. CLCbio Genomics Workbench - de novo and reference assembly
of Sanger, Roche FLX, Illumina, Helicos, and SOLiD data. Commercial
next-gen-seq software that extends the CLCbio Main Workbench
software. Includes SNP detection, ChIP-seq, browser and other
features. Commercial. Windows, Mac OS X and Linux.
2. Galaxy - Galaxy = interactive and reproducible genomics. A
job webportal.
3. Genomatix - Integrated Solutions for Next Generation
Sequencing data analysis.
4. JMP Genomics - Next gen visualization and statistics tool
from SAS. They are working with NCGR to refine this tool and
produce others.
5. NextGENe - de novo and reference assembly of
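This year I treated myself to a bicycle after all, mainly to get some exercise. Apart from the 30-minute walk to work each morning, I hardly know what this area looks like (I usually head home at one or two at night, so all I ever see is the starry sky).
Yesterday I set off right after lunch with no plan, for a short trip with no aim and no direction. First I rode through the neighboring residential compound, which has a large, lush green lawn where, naturally, well-off young women were walking their dogs. I thought it would be a nice place to read, though it is not quite peaceful enough. Out on the main road I saw that the Central Academy of Fine Arts was already just across the way, and I was drawn to the neat willow-lined lane beside it. The willows were all trimmed perfectly even, like the bowl cut I had as a child. As I rode, my head was just below the leaves, and when the wind blew it felt like a strand of green silk brushing past. I reached the end, turned back, and went looking elsewhere. With no other choice I took a long, roundabout stretch of highway until a village dirt road appeared. It was all earth and stones, but I did not fancy breathing exhaust on the highway, so I turned in. After a bumpy ride I was in the village. Villages around Beijing seem much like those in the south, only with lower walls. Inside, everything is just as run-down, yet there is always a sense of warmth. Here and there you see festive streamers and scraps of paper from fireworks and firecrackers, chickens and dogs calling to each other, and women chattering away, though it is much quieter here than in the south. Before long I had passed through. The village is small but very peaceful.
Coming off the village road, I saw in the distance that there were several more very pretty residential compounds, I guess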
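Most developers hold the view that assembly language and C are the right choice for programs with very demanding performance requirements, while C++ is mainly suited to programs that are highly complex but whose performance requirements are less strict. In practice this is often not the case. Very often a program's speed is largely fixed once its architecture has been designed, and it is not the use of C++ that keeps it from reaching the expected speed. So when a program's performance needs to improve, the first step is to use a profiling tool to measure accurately how its running time is distributed, find the critical path and the real bottleneck, and then analyze and optimize that bottleneck, rather than blindly blaming the language for the poor performance. In fact, if the architecture is left unchanged, rewriting the program in C or assembly language is not guaranteed to improve overall performance.
So when you run into a performance problem, first examine and rethink the program's overall architecture. Then use a profiling tool to measure its actual execution accurately, and analyze and optimize the bottlenecks. This is the right approach.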
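The "measure first" advice can be illustrated with a small timing harness. This is only a sketch under stated assumptions: it is written in Python rather than C++, the two stages are hypothetical stand-ins for parts of a real program, and a wall-clock timer stands in for a real profiling tool. It shows only the workflow of measuring the time distribution and then picking the stage that dominates.

```python
import time


def time_stages(stages):
    """Run each (name, function) pair once and return {name: seconds}."""
    timings = {}
    for name, func in stages:
        start = time.perf_counter()
        func()
        timings[name] = time.perf_counter() - start
    return timings


def report(timings):
    """Print each stage's share of the total, slowest first; return the slowest name."""
    total = sum(timings.values())
    ranked = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    for name, seconds in ranked:
        print(f"{name:<18}{seconds:9.4f} s {100 * seconds / total:6.1f} %")
    return ranked[0][0]


# Hypothetical workload: two stand-in stages of very different cost.
def parse_input():
    return [int(text) for text in ("42",) * 1_000]


def quadratic_search():
    values = list(range(1_500))
    # Deliberately quadratic: compares every pair of values.
    return sum(1 for a in values for b in values if a + b == 1_500)


timings = time_stages([("parse_input", parse_input),
                       ("quadratic_search", quadratic_search)])
bottleneck = report(timings)
print("optimize first:", bottleneck)
```

Here the quadratic stage dominates the run time, so replacing its algorithm would help far more than rewriting the cheap parsing stage in a faster language.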
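Undeniably, though, certain operations and certain C++ language features are more likely than other factors to become a program's bottleneck. The generally acknowledged ones are as follows.
(1) Page faults: as described in Chapter 4, a page fault often means that external storage has to be accessed. Access to external storage is orders of magnitude slower than memory access or code execution. Therefore, wherever possible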
(2009-10-01 19:56)
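Last night I worked until 11 and strolled back alone to my place in the residential compound. A fine drizzle was falling and the fog (or perhaps haze) was thick. As I neared the entrance, something white darted with a squeak from the road into the grass and on to a sheltered spot beneath a ground-floor flat. The night was very quiet, with even the lights as sparse as stars, so I instinctively froze for a moment. Looking down for it, I heard a soft, shy “meow”. My curiosity was piqued: I like small animals, especially felines. So I meowed back and held out my hand to beckon it, and to my surprise it came over and gently rubbed its whiskers and face against my fingertips and the side of my hand. I stroked the fur on its head and played with it for a while, then walked on. Unexpectedly, when I went upstairs it followed me. I opened the door and it hesitated. I remembered that I still had some ham sausage left over from eating instant noodles two days earlier, so I quickly tore one open to tempt it, and sure enough it came inside. After I shut the door it became a little nervous, scouted around the flat, and then hid in the sofa. I broke the sausage into small pieces and called it a few times. It came right over every time, so we ended up playing together. It looks a lot like a white cat I once had, called “Dabai” (Big White). I tried several names and it ignored them all, so I simply called it “Xiaobai” (Little White), and, perhaps because of the food, it got used to that.
I lifted it onto the sofa and we played. Before long it fell asleep. Its only mischief is that whenever you grab its paw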
Introductory Books
1. Introduction
to Bioinformatics: A Theoretical and Practical Approach by Stephen
A. Krawetz and David D.
2. Introduction to Computational Molecular Biology by Joao
Carlos Setubal, Joao Meidanis
3. Bioinformatics for Dummies by Jean-Michel Claverie, Cedric
Notredame
4. Developing Bioinformatics Computer Skills by Cynthia
Gibas, Per Jambeck
5. Beginning Perl for Bioinformatics by James Tisdall
General Bioinformatics Books (Including Genomics)
1. A Primer of Genome Science by Gibson G and Muse SV
2. Bioinformatics: A Biologist's Guide to Biocomputing and
the Internet by Stuart M. Brown
3. Bioinformatics Basics: Applications in Biological Science
and Medicine by Hooman H. Rashidi, Lukas K. Buehler
4. Bioinformatics: Methods and Protocols by Stephen Misener,
Stephen A. Krawetz
5. Bioinformatics : Sequence