How to convert SRA format to FASTQ format_胖小妖

http://blog.sina.com.cn/u/2510103112


How to convert SRA format to FASTQ format

(2013-07-21 15:20:05)

Category: Academic

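SRA is the format NCBI introduced for storing high-throughput sequencing data, whereas most day-to-day work uses the FASTQ format.

FASTA/FASTQ text files hold the raw reads of a transcriptome experiment, and all downstream analysis starts from these files.

NCBI SRA stands for Sequence Read Archive (formerly the Short Read Archive). Next-generation sequencing data are generally deposited in this database, so you can upload data you sequenced yourself, and if you want to analyze data that others have already generated, you can download it directly from the database. The quality values in FASTQ files obtained from SRA are encoded with an ASCII offset of 33 (Phred+33).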
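To make the ASCII-33 encoding concrete, here is a minimal Python sketch that decodes a quality string into Phred scores. The function names and the four-line record are made up for illustration; only the offset of 33 comes from the text above.

```python
def phred33_to_scores(quality_line):
    """Convert a FASTQ quality string (ASCII offset 33) into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong: 10^(-Q/10)."""
    return 10 ** (-score / 10)


# A made-up FASTQ record (header, sequence, separator, qualities), for illustration only.
record = ["@example_read_1", "GATTACA", "+", "II5+!#I"]

scores = phred33_to_scores(record[3])
print(scores)  # [40, 40, 20, 10, 0, 2, 40]
```

A score of 20 corresponds to an error probability of 0.01, i.e. roughly one wrong call in a hundred.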

To convert an SRA file to FASTQ, download the appropriate software from:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

The toolkit documentation, including the page for fastq-dump, is at:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

Tool: fastq-dump

Name:

fastq-dump - dump sra data in fastq format

Usage:

fastq-dump [options] [ -A ] <accession>

fastq-dump [options] <path [path...]>
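As a rough illustration of the second usage form (options followed by a path), the following Python sketch assembles a fastq-dump command line and runs it only on request. The helper names and the file name example.sra are hypothetical placeholders; the flags (--outdir, --split-3, --gzip) are taken from the option list in this help text, and the sketch assumes fastq-dump is installed and on the PATH.

```python
import shutil
import subprocess


def build_fastq_dump_cmd(sra_path, outdir=".", paired=True, gzip=False):
    """Assemble an argument list for fastq-dump (hypothetical helper).

    Only flags listed in the fastq-dump help text are used.
    """
    cmd = ["fastq-dump", "--outdir", outdir]
    if paired:
        # Write mates to _1.fastq / _2.fastq, unpaired reads to a third file.
        cmd.append("--split-3")
    if gzip:
        cmd.append("--gzip")
    cmd.append(sra_path)
    return cmd


def run_fastq_dump(cmd):
    """Run the command, failing early if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " not found on PATH")
    subprocess.run(cmd, check=True)


# "example.sra" is a placeholder file name, not a real download.
cmd = build_fastq_dump_cmd("example.sra", outdir="fastq", gzip=True)
print(" ".join(cmd))  # fastq-dump --outdir fastq --split-3 --gzip example.sra
```

Building the argument list separately from running it makes it easy to inspect the exact command before starting a long conversion; call run_fastq_dump(cmd) once the command looks right.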

Options:

Input:
-A	\|	--accession <accession>	Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
		--table <table-name>	Table name within cSRA object, default is "SEQUENCE"
Processing:
Read Splitting. Sequence data may be used in raw form or split into individual reads
		--split-spot	Split spots into individual reads
Full Spot Filters. Applied to the full spot independently of --split-spot
-N	\|	--minSpotId <rowid>	Minimum spot id
-X	\|	--maxSpotId <rowid>	Maximum spot id
		--spot-groups <[list]>	Filter by SPOT_GROUP (member): name[,...]
-W	\|	--clip	Apply left and right clips
Common Filters. Applied to spots when --split-spot is not set, otherwise - to individual reads
-M	\|	--minReadLen <len>	Minimum read length to output, default is 25
-R	\|	--read-filter <[filter]>	Split into files by READ_FILTER value optionally filter by value: pass\|reject\|criteria\|redacted
-E	\|	--qual-filter	Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
Filters based on alignments. Filters are active when alignment data are present
		--aligned	Dump only aligned sequences
		--unaligned	Dump only unaligned sequences
		--aligned-region <name[:from-to]>	Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: "chr1" or "1"). "from" and "to" are 1-based coordinates
		--matepair-distance <from-to\|unknown>	Filter by distance between matepairs. Use "unknown" to find matepairs split between the references. Use from-to to limit matepair distance on the same reference
Filters for individual reads. Applied only with --split-spot set
		--skip-technical	Applied only with --split-spot set. Dump only biological reads
Output:
-O	\|	--outdir <path>	Output directory, default is working directory '.'
-Z	\|	--stdout	Output to stdout, all split data become joined into single stream
		--gzip	Compress output using gzip
		--bzip2	Compress output using bzip2
Multiple File Options. Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria
		--split-files	Dump each read into separate file. Files will receive suffix corresponding to read number
		--split-3	Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored
-G	\|	--spot-group	Split into files by SPOT_GROUP (member name)
-R	\|	--read-filter <[filter]>	Split into files by READ_FILTER value. Optionally filter by value: pass\|reject\|criteria\|redacted
-T	\|	--group-in-dirs	Split into subdirectories instead of files
-K	\|	--keep-empty-files	Do not delete empty files
Formatting:
Sequence
-C	\|	--dumpcs <[cskey]>	Formats sequence using color space (default for SOLiD), "cskey" may be specified for translation
-B	\|	--dumpbase	Formats sequence using base space (default for other than SOLiD)
Quality
-Q	\|	--offset <integer>	Offset to use for quality conversion, default is 33
		--fasta <[line width]>	FASTA only, no qualities, optional line wrap width (set to zero for no wrapping)
Defline
-F	\|	--origfmt	Defline contains only original sequence name
-I	\|	--readids	Append read id after spot id as "accession.spot.readid" on defline
		--helicos	Helicos style defline
		--defline-seq <fmt>	Defline format specification for sequence
		--defline-qual <fmt>	Defline format specification for quality. <fmt> is string of characters and/or variables. The variables can be one of: $ac - accession, $si - spot id, $sn - spot name, $sg - spot group (barcode), $sl - spot length in bases, $ri - read number, $rn - read name, $rl - read length in bases. '[]' could be used for an optional output: if all vars in [] yield empty values whole group is not printed. Empty value is empty string or 0 for numeric variables. Ex: @$sn[_$rn]/$ri - '_$rn' is omitted if name is empty
Other:
-h	\|	--help	Output brief explanation of program usage
-V	\|	--version	Display the version of the program
-L	\|	--log-level <level>	Logging level as number or enum string. One of (fatal\|sys\|int\|err\|warn\|info) or (0-5). Current/default is warn
-v	\|	--verbose	Increase the verbosity level of the program. Use multiple times for more verbosity
		--report	Control program execution environment report generation (if implemented). One of (never\|error\|always). Default is error

fastq-dump v2.2.2
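The template rules given for --defline-seq and --defline-qual in the option list can be hard to picture, so here is a small Python emulation of how such a template might expand. This is only an illustrative sketch and not the toolkit's own implementation: the function names are hypothetical, and it assumes that an "empty" value means an empty string, or 0 for the numeric variables (taken here to be $si, $sl, $ri and $rl).

```python
import re

# Variables named in the help text: $ac, $si, $sn, $sg, $sl, $ri, $rn, $rl.
VAR_PATTERN = re.compile(r"\$(ac|si|sn|sg|sl|ri|rn|rl)")
# An optional group is anything between '[' and ']' (no nesting in this sketch).
GROUP_PATTERN = re.compile(r"\[([^\[\]]*)\]")
# Assumption: these variables are numeric, and 0 counts as empty for them.
NUMERIC_VARS = {"si", "sl", "ri", "rl"}


def is_empty(name, values):
    """Return True if the variable has no usable value."""
    value = values.get(name, "")
    if name in NUMERIC_VARS:
        return value in ("", 0)
    return value == ""


def substitute(text, values):
    """Replace every $variable in text with its value."""
    return VAR_PATTERN.sub(lambda m: str(values.get(m.group(1), "")), text)


def expand_defline(fmt, values):
    """Expand a defline template, dropping [] groups whose variables are all empty."""
    def expand_group(match):
        body = match.group(1)
        names = VAR_PATTERN.findall(body)
        if names and all(is_empty(n, values) for n in names):
            return ""
        return substitute(body, values)

    return substitute(GROUP_PATTERN.sub(expand_group, fmt), values)


# The example template from the help text, with made-up values.
print(expand_defline("@$sn[_$rn]/$ri", {"sn": "spot7", "rn": "", "ri": 1}))     # @spot7/1
print(expand_defline("@$sn[_$rn]/$ri", {"sn": "spot7", "rn": "fwd", "ri": 1}))  # @spot7_fwd/1
```

In the first call the read name is empty, so the whole `[_$rn]` group disappears, which matches the behaviour the help text describes for its own example.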
