
How to convert SRA format to FASTQ format

(2013-07-21 15:20:05)
Category: Academic

SRA is the format NCBI introduced for storing high-throughput sequencing data, whereas the format most of us use in day-to-day work is FASTQ.

FASTA/FASTQ text files are the starting data for transcriptome work; all downstream analysis is run on these files.

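NCBI SRA stands for Sequence Read Archive (originally Short Read Archive). Second-generation sequencing data are generally deposited in this database, so you can upload your own runs, and if you want to analyze data that others have already sequenced, you should be able to download it directly from the database. The quality values in FASTQ files obtained from it are all encoded as ASCII with an offset of 33.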
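As a rough illustration of what an ASCII offset of 33 means, the sketch below decodes a FASTQ quality line into numeric scores. The four-line record and the function names are made up for this example and do not come from any real SRA run.

```python
def phred33_to_scores(quality_line):
    """Convert a quality string with ASCII offset 33 into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(score):
    """Convert a Phred score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical FASTQ record: defline, bases, separator, qualities.
record = ["@read1", "GATTACA", "+", "II5!#+I"]

scores = phred33_to_scores(record[3])
print(scores)  # [40, 40, 20, 0, 2, 10, 40]
```

A score of 40 (the character `I`) corresponds to an error probability of about 1 in 10,000, while `!` is the lowest possible score, 0.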

 

To convert SRA to FASTQ, download the software for your platform from:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

The toolkit documentation, including the page for fastq-dump, is at:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

Tool: fastq-dump

Name:
fastq-dump - dump sra data in fastq format
Usage:
fastq-dump [options] [ -A ] <accession>
fastq-dump [options] <path [path...]>
Options:
Input:
-A | --accession <accession> Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
--table <table-name> Table name within cSRA object, default is "SEQUENCE"
Processing:
Read Splitting. Sequence data may be used in raw form or split into individual reads
--split-spot Split spots into individual reads
Full Spot Filters. Applied to the full spot independently of --split-spot
-N | --minSpotId <rowid> Minimum spot id
-X | --maxSpotId <rowid> Maximum spot id
--spot-groups <[list]> Filter by SPOT_GROUP (member): name[,...]
-W | --clip Apply left and right clips
Common Filters. Applied to spots when --split-spot is not set, otherwise - to individual reads
-M | --minReadLen <len> Minimum read length to output, default is 25
-R | --read-filter <[filter]> Split into files by READ_FILTER value optionally filter by value: pass|reject|criteria|redacted
-E | --qual-filter Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
Filters based on alignments. Filters are active when alignment data are present
--aligned Dump only aligned sequences
--unaligned Dump only unaligned sequences
--aligned-region <name[:from-to]> Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: "chr1" or "1"). "from" and "to" are 1-based coordinates
--matepair-distance <from-to|unknown> Filter by distance between matepairs. Use "unknown" to find matepairs split between the references. Use from-to to limit matepair distance on the same reference
Filters for individual reads. Applied only with --split-spot set
--skip-technical Applied only with --split-spot set. Dump only biological reads
Output:
-O | --outdir <path> Output directory, default is working directory '.'
-Z | --stdout Output to stdout, all split data become joined into single stream
--gzip Compress output using gzip
--bzip2 Compress output using bzip2
Multiple File Options. Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria
--split-files Dump each read into separate file. Files will receive suffix corresponding to read number
--split-3 Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored
-G | --spot-group Split into files by SPOT_GROUP (member name)
-R | --read-filter <[filter]> Split into files by READ_FILTER value. Optionally filter by value: pass|reject|criteria|redacted
-T | --group-in-dirs Split into subdirectories instead of files
-K | --keep-empty-files Do not delete empty files
Formatting:
Sequence
-C | --dumpcs <[cskey]> Formats sequence using color space (default for SOLiD), "cskey" may be specified for translation
-B | --dumpbase Formats sequence using base space (default for other than SOLiD)
Quality
-Q | --offset <integer> Offset to use for quality conversion, default is 33
--fasta <[line width]> FASTA only, no qualities, optional line wrap width (set to zero for no wrapping)
Defline
-F | --origfmt Defline contains only original sequence name
-I | --readids Append read id after spot id as "accession.spot.readid" on defline
--helicos Helicos style defline
--defline-seq <fmt> Defline format specification for sequence
--defline-qual <fmt> Defline format specification for quality
<fmt> is a string of characters and/or variables. The variables can be one of: $ac - accession, $si - spot id, $sn - spot name, $sg - spot group (barcode), $sl - spot length in bases, $ri - read number, $rn - read name, $rl - read length in bases. '[]' can be used for optional output: if all vars in [] yield empty values, the whole group is not printed. An empty value is an empty string, or 0 for numeric variables. Ex: @$sn[_$rn]/$ri ('_$rn' is omitted if the name is empty)
Other:
-h | --help Output brief explanation of program usage
-V | --version Display the version of the program
-L | --log-level <level> Logging level as number or enum string One of (fatal|sys|int|err|warn|info) or (0-5). Current/default is warn
-v | --verbose Increase the verbosity level of the program. Use multiple times for more verbosity
--report Control program execution environment report generation (if implemented). One of (never|error|always). Default is error
fastq-dump v2.2.2
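
Putting a few of the options above together, here is a minimal sketch of how a conversion could be scripted. It only uses flags that appear in the help text (`--split-3`, `--gzip`, `--outdir`); the helper names, the `fastq` output directory, and the accession `SRR000001` are placeholders chosen for illustration, and the script only prints the command rather than running it.

```python
import shlex
import shutil
import subprocess


def build_fastq_dump_cmd(accession, outdir=".", split3=True, gzip=True):
    """Assemble a fastq-dump argument list from options in the help text."""
    cmd = ["fastq-dump"]
    if split3:
        cmd.append("--split-3")   # paired reads go to *_1.fastq and *_2.fastq
    if gzip:
        cmd.append("--gzip")      # compress the output with gzip
    cmd += ["--outdir", outdir]   # the tool's own default is the working directory
    cmd.append(accession)
    return cmd


def run_fastq_dump(accession, **kwargs):
    """Run fastq-dump if it is on PATH; fail with a clear message otherwise."""
    if shutil.which("fastq-dump") is None:
        raise FileNotFoundError("fastq-dump not found on PATH; install the SRA Toolkit first")
    subprocess.run(build_fastq_dump_cmd(accession, **kwargs), check=True)


# Dry run: show the command that run_fastq_dump() would execute.
print(shlex.join(build_fastq_dump_cmd("SRR000001", outdir="fastq")))
```

Calling `run_fastq_dump("SRR000001", outdir="fastq")` would then perform the actual conversion, assuming the toolkit is installed and the accession is one you really want to fetch.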
