How to convert SRA format to FASTQ format
(2013-07-21 15:20:05) | Category: Academic
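SRA is the format NCBI introduced for storing high-throughput sequencing data, whereas the format we mostly work with day to day is FASTQ. FASTA/FASTQ text files are the raw data of a transcriptome project, and all downstream analysis is run on these files.
NCBI SRA stands for Sequence Read Archive (originally the Short Read Archive). Second-generation sequencing data are generally deposited in this database, so you can submit your own runs to it, and if you want to analyze data that others have already sequenced, you can download it directly from the database. Quality values in the FASTQ output are encoded with an ASCII offset of 33 (Phred+33).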
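To make the ASCII 33 encoding concrete, here is a minimal Python sketch of how a FASTQ quality string maps back to Phred scores. The function names and the sample quality string are made up for illustration; only the offset of 33 comes from the text above.

```python
from typing import List


def decode_phred33(quality: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into integer Phred scores.

    Each character's ASCII code minus the offset (33 here) is the score.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical quality string, used only to show the mapping.
scores = decode_phred33("II5+!")
print(scores)
```

With an offset of 33, the character `!` (ASCII 33) is the lowest possible score, 0, and `I` (ASCII 73) is a score of 40.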
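To convert SRA to FASTQ, download the appropriate software (the SRA Toolkit) from:
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
The toolkit documentation, including the page for fastq-dump, is at:
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump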
Tool: fastq-dump
Name:
fastq-dump - dump sra data in fastq format
Usage:
fastq-dump [options] [ -A ] <accession>
fastq-dump [options] <path [path...]>
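As a rough illustration of the second usage form, the Python sketch below assembles a fastq-dump command line and runs it only if the tool is found on PATH. The helper names and the file name `SRR000001.sra` are placeholders I made up. The flags `--split-3`, `--gzip` and `--outdir` are the ones listed in the fastq-dump help on this page.

```python
import shutil
import subprocess
from typing import List, Optional


def build_fastq_dump_cmd(source: str, outdir: str = ".",
                         split3: bool = True, gzip: bool = False) -> List[str]:
    """Build an argument list of the form: fastq-dump [options] <path>."""
    cmd = ["fastq-dump"]
    if split3:
        cmd.append("--split-3")   # paired reads go to *_1.fastq / *_2.fastq
    if gzip:
        cmd.append("--gzip")      # compress the output with gzip
    cmd += ["--outdir", outdir, source]
    return cmd


def run_if_available(cmd: List[str]) -> Optional[int]:
    """Run the command if the executable is installed; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True).returncode


# "SRR000001.sra" is a placeholder path, not a file that ships with this post.
cmd = build_fastq_dump_cmd("SRR000001.sra", outdir="fastq", gzip=True)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the path contains spaces.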
Options:
| Input: |
| -A, --accession <accession> | Replaces accession derived from <path> in filename(s) and deflines (only for single table dump) |
| --table <table-name> | Table name within cSRA object, default is "SEQUENCE" |
| Processing: |
| Read Splitting. Sequence data may be used in raw form or split into individual reads |
| --split-spot | Split spots into individual reads |
| Full Spot Filters. Applied to the full spot independently of --split-spot |
| -N, --minSpotId <rowid> | Minimum spot id |
| -X, --maxSpotId <rowid> | Maximum spot id |
| --spot-groups <[list]> | Filter by SPOT_GROUP (member): name[,...] |
| -W, --clip | Apply left and right clips |
| Common Filters. Applied to spots when --split-spot is not set, otherwise to individual reads |
| -M, --minReadLen <len> | Minimum read length to output, default is 25 |
| -R, --read-filter <[filter]> | Split into files by READ_FILTER value; optionally filter by value: pass, reject, criteria, or redacted |
| -E, --qual-filter | Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N |
| Filters based on alignments. Filters are active when alignment data are present |
| --aligned | Dump only aligned sequences |
| --unaligned | Dump only unaligned sequences |
| --aligned-region <name[:from-to]> | Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: "chr1" or "1"). "from" and "to" are 1-based coordinates |
| --matepair-distance <from-to|unknown> | Filter by distance between matepairs. Use "unknown" to find matepairs split between the references. Use from-to to limit matepair distance on the same reference |
| Filters for individual reads. Applied only with --split-spot set |
| --skip-technical | Dump only biological reads |
| Output: |
| -O, --outdir <path> | Output directory, default is working directory '.' |
| -Z, --stdout | Output to stdout, all split data become joined into single stream |
| --gzip | Compress output using gzip |
| --bzip2 | Compress output using bzip2 |
| Multiple File Options. Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria |
| --split-files | Dump each read into separate file. Files will receive suffix corresponding to read number |
| --split-3 | Legacy 3-file splitting for mate-pairs: the first two biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Any further biological reads are ignored |
| -G, --spot-group | Split into files by SPOT_GROUP (member name) |
| -R, --read-filter <[filter]> | Split into files by READ_FILTER value; optionally filter by value: pass, reject, criteria, or redacted |
| -T, --group-in-dirs | Split into subdirectories instead of files |
| -K, --keep-empty-files | Do not delete empty files |
| Formatting: |
| Sequence |
| -C, --dumpcs <[cskey]> | Formats sequence using color space (default for SOLiD); "cskey" may be specified for translation |
| -B, --dumpbase | Formats sequence using base space (default for other than SOLiD) |
| Quality |
| -Q, --offset <integer> | Offset to use for quality conversion, default is 33 |
| --fasta <[line width]> | FASTA only, no qualities, optional line wrap width (set to zero for no wrapping) |
| Defline |
| -F, --origfmt | Defline contains only original sequence name |
| -I, --readids | Append read id after spot id as "accession.spot.readid" on defline |
| --helicos | Helicos style defline |
| --defline-seq <fmt> | Defline format specification for sequence |
| --defline-qual <fmt> | Defline format specification for quality. <fmt> is a string of characters and/or variables. The variables can be one of: $ac - accession, $si - spot id, $sn - spot name, $sg - spot group (barcode), $sl - spot length in bases, $ri - read number, $rn - read name, $rl - read length in bases. '[]' can be used for optional output: if all variables in [] yield empty values, the whole group is not printed. An empty value is an empty string, or 0 for numeric variables. Ex: @$sn[_$rn]/$ri ('_$rn' is omitted if the name is empty) |
| Other: |
| -h, --help | Output brief explanation of program usage |
| -V, --version | Display the version of the program |
| -L, --log-level <level> | Logging level as number or enum string. One of (fatal, sys, int, err, warn, info) or (0-5). Current/default is warn |
| -v, --verbose | Increase the verbosity level of the program. Use multiple times for more verbosity |
| --report | Control program execution environment report generation (if implemented). One of (never, error, always). Default is error |
fastq-dump v2.2.2
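The defline format rules for --defline-seq and --defline-qual are easier to follow with a small example. The Python sketch below is my own reading of the help text (variables are substituted, and a `[]` group is dropped when every variable inside it is empty or 0). It is not fastq-dump's actual implementation, and the function name and sample values are hypothetical.

```python
import re
from typing import Dict, Union

Value = Union[str, int]

# The eight variables named in the help text.
VAR = re.compile(r"\$(ac|si|sn|sg|sl|ri|rn|rl)")
# Either an optional [...] group or a bare variable.
TOKEN = re.compile(r"\[([^\[\]]*)\]|\$(ac|si|sn|sg|sl|ri|rn|rl)")


def is_empty(value: Value) -> bool:
    """Empty means an empty string, or 0 for numeric variables."""
    return value == "" or value == 0


def expand_defline(fmt: str, values: Dict[str, Value]) -> str:
    """Expand a defline format string under the assumptions stated above."""
    def substitute(match: "re.Match[str]") -> str:
        return str(values.get(match.group(1), ""))

    def token(match: "re.Match[str]") -> str:
        if match.group(2) is not None:          # bare variable such as $sn
            return str(values.get(match.group(2), ""))
        inner = match.group(1)                  # contents of a [...] group
        names = VAR.findall(inner)
        if names and all(is_empty(values.get(n, "")) for n in names):
            return ""                           # whole group is not printed
        return VAR.sub(substitute, inner)

    return TOKEN.sub(token, fmt)


# Hypothetical spot: a spot name, no read name, read number 1.
print(expand_defline("@$sn[_$rn]/$ri", {"sn": "spot1", "rn": "", "ri": 1}))
```

For the example format from the help text, `@$sn[_$rn]/$ri`, the `_$rn` part disappears when the read name is empty and appears when it is set.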