BLAST Terms and Parameters_じ☆ve涛的天空

http://blog.sina.com.cn/u/1271331561

BLAST Terms and Parameters

(2011-03-09 18:59:51)

Tags: Miscellaneous

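Alignment: Sequence alignment. The process of lining up two or more sequences to achieve maximal identity (and, for amino acid sequences, conservation), so that the degree of similarity and the possibility of homology can be assessed.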

Algorithm: A fixed procedure embodied in a computer program.

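Bit score: The bit score S' is derived from the raw alignment score S by taking into account the statistical properties of the scoring system used. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.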

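BLOSUM: Blocks Substitution Matrix. A substitution matrix in which the score for each position is derived from the observed frequencies of substitutions in blocks of local alignments of related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the scores were derived from an alignment of sequences sharing no more than 62% identity. Sequences more than 62% identical are represented by a single sequence in the alignment, to avoid over-weighting closely related family members.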

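Conservation: A change at a specific position of an amino acid sequence (or, less commonly, a DNA sequence) that preserves the physico-chemical properties of the original residue.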

Domain: A discrete portion of a protein that is assumed to fold independently of the rest of the protein and that has its own function.

DUST: A program for filtering low-complexity regions from nucleotide sequences.

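E value: Expectation value. The number of different alignments with a score equal to or greater than S that are expected to occur by chance in a database search. The lower the E value, the more significant the score.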
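To see how an E value scales, here is a minimal sketch assuming the commonly quoted Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the effective query and database lengths. The function name and the example lengths are made up for illustration.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Assumes E = m * n * 2**(-S'), with m and n taken as effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 250-residue query against 1,000,000 database residues.
e_50_bits = expect_value(50.0, query_len=250, db_len=1_000_000)
e_51_bits = expect_value(51.0, query_len=250, db_len=1_000_000)
print(e_50_bits, e_51_bits)
```

Under this relation, each additional bit of score halves E, and doubling the database doubles it.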

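Filtering: Also known as masking. The process of hiding regions of a nucleotide or amino acid sequence whose characteristics frequently lead to spurious high scores.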

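Gap: A space introduced into the query or subject sequence of an alignment to represent an insertion or deletion. To prevent too many gaps from accumulating in an alignment, introducing a gap deducts a fixed amount (the gap score) from the alignment score. Extending a gap to cover additional nucleotides or amino acids is also penalized in the alignment score.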

Global Alignment: The alignment of two nucleotide or protein sequences over their entire length.

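H: Relative entropy. H is the relative entropy of the target and background residue frequencies. It can be thought of as the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance; at lower values of H, a longer alignment may be necessary.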
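A small sketch of the calculation, assuming the usual form H = sum over residue pairs of q_ij * log2(q_ij / (p_i * p_j)), where q_ij are the target pair frequencies and p_i the background frequencies. The two-letter alphabet and all frequencies below are invented purely to keep the arithmetic visible.

```python
import math


def relative_entropy_bits(target, background):
    """H = sum_ij q_ij * log2(q_ij / (p_i * p_j)), in bits per aligned position."""
    h = 0.0
    for (a, b), q in target.items():
        if q > 0:
            h += q * math.log2(q / (background[a] * background[b]))
    return h


# Hypothetical two-letter alphabet in which aligned pairs are mostly identical.
background = {"A": 0.5, "B": 0.5}
target = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
h = relative_entropy_bits(target, background)
print(round(h, 3))
```

If the target frequencies were simply the product of the background frequencies, H would be zero: such an alignment carries no information that separates it from chance.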

Homology: Similarity attributed to descent from a common ancestor.

HSP: High-scoring segment pair. An ungapped local alignment that achieves one of the top alignment scores in a given search.

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant in an alignment.
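Identity is usually reported as a percentage. The sketch below computes it for two already-aligned strings, assuming that gap columns count toward the alignment length but never count as matches; other tools may use a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns holding the same residue in both rows.

    Columns containing a gap ('-') count toward the length but never match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Four of the six columns are identical.
pid = percent_identity("ACGT-A", "ACCTGA")
print(round(pid, 2))
```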

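K: A statistical parameter used in calculating BLAST scores. It can be thought of as a natural scale for the size of the search space. K is used in converting a raw score S to a bit score S'.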

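Lambda: λ. A statistical parameter used in calculating BLAST scores. It can be thought of as a natural scale for the scoring system. λ is used in converting a raw score S to a bit score S'.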
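The conversion that K and λ take part in can be sketched as follows, assuming the standard relation S' = (λS - ln K) / ln 2. The parameter values in the example are illustrative numbers of the kind BLAST prints in its statistics footer, not values taken from this post.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score S to a bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameters; real values depend on the matrix and gap costs.
s_prime = bit_score(100, lam=0.267, k=0.041)
print(round(s_prime, 1))
```

Because the result no longer depends on the scoring system, bit scores from searches run with different matrices can be compared directly.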

Local Alignment: The alignment of some portion of two nucleotide or protein sequences.
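For a concrete picture of local alignment, here is a minimal Smith-Waterman scoring sketch. It is the textbook dynamic-programming method, not the heuristic that BLAST itself uses, and the match, mismatch and linear gap scores are arbitrary values chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Only the shared "ACG" segment contributes: three matches at +2 each.
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

A global alignment would instead force the unrelated flanking residues into the alignment and lower the score.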

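Low Complexity Region (LCR): A region of biased composition, including homopolymeric runs, short-period repeats, and more subtle over-representation of one or a few residues. The SEG program is used to mask or filter low-complexity regions in amino acid sequences. The DUST program is used to mask or filter low-complexity regions in nucleotide sequences.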

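Masking: Also known as filtering. The removal of repeated or low-complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence.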

Motif: A sequence pattern. A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains.

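Multiple Sequence Alignment: An alignment of three or more sequences, with gaps inserted so that residues occupying the same structural position and/or descended from the same ancestral residue are placed in the same column. ClustalW is one of the most widely used multiple sequence alignment programs.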

Optimal alignment: The alignment of two sequences that has the highest possible score.

Orthologous: Homologous sequences in different species that arose from a common ancestral sequence during speciation. They may or may not have a similar function.

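P value: The probability of an alignment occurring with the score in question or better. The P value is calculated by relating the observed alignment score S to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition against the database. The most significant P values are those close to zero. P values and E values are different ways of expressing the significance of an alignment.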
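The two measures are linked. Assuming the number of chance hits follows a Poisson distribution with mean E, the probability of seeing at least one such hit is P = 1 - e^(-E). A minimal sketch (the function name is made up):

```python
import math


def p_value_from_e(e_value: float) -> float:
    """P = 1 - exp(-E), the chance of at least one hit, assuming Poisson counts."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-e_value)


for e in (0.001, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For small E the two values are nearly identical, which is why E values below about 0.01 can be read as probabilities.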

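PAM: Percent Accepted Mutation. A unit for quantifying the amount of evolutionary change in a protein sequence. One PAM unit is the amount of evolution that changes, on average, 1% of the amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which the score for each amino acid substitution is calculated from the frequency of that substitution in closely related proteins that have undergone x units of evolutionary divergence.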

Paralogous: Homologous sequences within a single species that arose by gene duplication.

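Profile: A table listing the frequency of each amino acid at each position of a protein sequence. The frequencies are calculated from multiple alignments of sequences that contain the domain of interest. See also PSSM.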

PSSM: Position-specific scoring matrix. A PSSM gives the log-odds score for finding a particular matching amino acid at each position of a target sequence. See also Profile.
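As a rough sketch of how a profile turns into a PSSM, the code below counts residues per column of an ungapped alignment and takes log2(observed frequency / background frequency). The four-letter alphabet, the uniform background and the pseudocount of 1 are simplifying assumptions; tools such as PSI-BLAST use more elaborate sequence weighting and pseudocounts.

```python
import math
from collections import Counter


def build_pssm(sequences, background, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2(observed / background)."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    alphabet = sorted(background)
    total = len(sequences) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in sequences)
        pssm.append({
            r: math.log2((counts[r] + pseudocount) / total / background[r])
            for r in alphabet
        })
    return pssm


def pssm_score(pssm, candidate):
    """Sum the column scores of the residues in `candidate`."""
    return sum(column[res] for column, res in zip(pssm, candidate))


# Hypothetical toy alignment over a four-letter alphabet.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(["ACGT", "ACGA", "ACGT", "TCGT"], background)
print(pssm_score(pssm, "ACGT"), pssm_score(pssm, "TTTT"))
```

A candidate that follows the conserved columns scores well above zero, while one that ignores them scores below zero.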

Query: The input sequence (or other type of search term) against which all entries in a database are compared.

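Raw Score: The alignment score S, calculated as the sum of the substitution scores and the gap scores. Substitution scores are given by a look-up table. Gap scores are calculated from the gap opening penalty G and the gap extension penalty L: for a gap of length n, the gap penalty is G + Ln. The choice of G and L is entirely empirical; it is customary to choose a high value for G (10 to 15) and a low value for L (1 to 2). See also PAM, BLOSUM.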
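The G + Ln rule is easy to check by hand. The sketch below scores an already-aligned pair of strings; the simple match/mismatch function stands in for a real look-up table such as BLOSUM62, and G = 11, L = 1 are picked from within the customary ranges given above.

```python
def raw_score(aligned_a, aligned_b, substitution, gap_open=11, gap_extend=1):
    """Sum substitution scores and subtract G + L*n for each gap of length n."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gapped_row = None  # which row the current gap run sits in, if any
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != gapped_row:
                total -= gap_open   # G, charged once per gap
                gapped_row = row
            total -= gap_extend     # L, charged for every gap column
        else:
            total += substitution(x, y)
            gapped_row = None
    return total


def identity_scores(x, y):
    """Hypothetical stand-in for a substitution table: +5 match, -4 mismatch."""
    return 5 if x == y else -4


# 5 matches (+25), 1 mismatch (-4), one gap of length 2 (-(11 + 1*2) = -13).
s = raw_score("ACGT--AC", "ACCTGGAC", identity_scores)
print(s)
```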

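Similarity: The extent to which nucleotide or protein sequences are related. The similarity between two sequences can be expressed as percent sequence identity and/or conservation. In BLAST, similarity refers to a positive score in the substitution matrix.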

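SEG: A program for filtering low-complexity regions in amino acid sequences. Residues that have been masked are shown as "X" in the alignment. SEG filtering is performed by default in the blastp subroutine of BLAST 2.0.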

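Substitution: The presence of a non-identical amino acid at a given position in an alignment. If the aligned residues have similar physico-chemical properties, the substitution is said to be conservative.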

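Substitution Matrix: A matrix whose values are proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of pairwise amino acid alignments. If the sample is large enough to be statistically significant, the resulting matrix should reflect the true probabilities of mutation over a period of evolution.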

Unitary Matrix: Also known as the identity matrix. A scoring system in which only identical characters receive a positive score.

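Subsequence: Sets the sub-range of the query sequence to be used in the comparison.

Descriptions: Descriptions of the nucleotide or protein sequences.

Alignments: The alignment results.

Query Number: The number of query sequences.

Job ID: A serial number generated automatically by the program during a BLAST search, which uniquely identifies that search. With the Job ID you can quickly retrieve the results of a search you ran earlier.

Query ID: The ID of the query sequence.

Subject ID: The ID of the sequence aligned with the query sequence.

Length: The length of the aligned sequence.

Identities: Identity. The extent to which the two (nucleotide or amino acid) sequences are invariant in the alignment.

Q.start: The start position on the query sequence.

Q.end: The end position on the query sequence.

Q.Length: The length of the query sequence.

S.start: The start position on the sequence aligned with the query.

S.end: The end position on the sequence aligned with the query.

S.Length: The length of the sequence aligned with the query.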
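Most of these fields show up as columns in BLAST's tabular output. As a sketch, the parser below reads one tab-separated line in the default column order that the blastp help text lists for -outfmt 6 (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore). The example line is invented, not real search output.

```python
STD_FIELDS = ("qseqid sseqid pident length mismatch gapopen "
              "qstart qend sstart send evalue bitscore").split()
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_tabular_line(line):
    """Turn one default-format tabular line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(STD_FIELDS):
        raise ValueError(f"expected {len(STD_FIELDS)} columns, got {len(values)}")
    hit = {}
    for name, value in zip(STD_FIELDS, values):
        if name in INT_FIELDS:
            hit[name] = int(value)
        elif name in FLOAT_FIELDS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


# Hypothetical line, for illustration only.
example = "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t370"
hit = parse_tabular_line(example)
print(hit["sseqid"], hit["qstart"], hit["qend"], hit["evalue"])
```

Here qstart/qend and sstart/send correspond to Q.start/Q.end and S.start/S.end above.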

Below is the parameter usage for BLAST 2.2.23+ (copied over verbatim, to be read through at leisure).
USAGE
blastp [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-negative_gilist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-lcase_masking] [-query_loc range]
    [-parse_deflines] [-outfmt format] [-show_gis]
    [-num_descriptions int_value] [-num_alignments int_value] [-html]
    [-max_target_seqs num_sequences] [-num_threads int_value] [-ungapped]
    [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
Protein-Protein BLAST 2.2.23+

OPTIONAL ARGUMENTS
-h
   Print USAGE and DESCRIPTION; ignore other arguments
-help
   Print USAGE, DESCRIPTION and ARGUMENTS description; ignore other arguments
-version
   Print version number; ignore other arguments

*** Input query options
-query <File_In>
   Input file name
   Default = `-'
-query_loc <String>
   Location on the query sequence (Format: start-stop)

*** General search options
-task <String, Permissible values: 'blastp' 'blastp-short' >
   Task to execute
   Default = `blastp'
-db <String>
   BLAST database name
    * Incompatible with: subject, subject_loc
-out <File_Out>
   Output file name
   Default = `-'
-evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = `10'
-word_size <Integer, >=2>
   Word size for wordfinder algorithm
-gapopen <Integer>
   Cost to open a gap
-gapextend <Integer>
   Cost to extend a gap
-matrix <String>
   Scoring matrix name (normally BLOSUM62)
-threshold <Real, >=0>
   Minimum word score such that the word is added to the BLAST lookup table
-comp_based_stats <String>
   Use composition-based statistics for blastp / tblastn:
       D or d: default (equivalent to 2)
       0 or F or f: no composition-based statistics
       1: Composition-based statistics as in NAR 29:2994-3005, 2001
       2 or T or t : Composition-based score adjustment as in Bioinformatics
       21:902-911, 2005, conditioned on sequence properties
       3: Composition-based score adjustment as in Bioinformatics 21:902-911,
       2005, unconditionally
   For programs other than tblastn, must either be absent or be D, F or 0
   Default = `2'

*** BLAST-2-Sequences options
-subject <File_In>
   Subject sequence(s) to search
    * Incompatible with: db, gilist, negative_gilist, db_soft_mask
-subject_loc <String>
   Location on the subject sequence (Format: start-stop)
    * Incompatible with: db, gilist, negative_gilist, db_soft_mask, remote

*** Formatting options
-outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values

   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
        qseqid means Query Seq-id
           qgi means Query GI
          qacc means Query accesion
        sseqid means Subject Seq-id
   sallseqid means All subject Seq-id(s), separated by a ';'
           sgi means Subject GI
        sallgi means All subject GIs
          sacc means Subject accession
       sallacc means All subject accessions
        qstart means Start of alignment in query
          qend means End of alignment in query
        sstart means Start of alignment in subject
          send means End of alignment in subject
          qseq means Aligned part of query sequence
          sseq means Aligned part of subject sequence
        evalue means Expect value
      bitscore means Bit score
         score means Raw score
        length means Alignment length
        pident means Percentage of identical matches
        nident means Number of identical matches
      mismatch means Number of mismatches
      positive means Number of positive-scoring matches
       gapopen means Number of gap openings
          gaps means Total number of gaps
          ppos means Percentage of positive-scoring matches
        frames means Query and subject frames separated by a '/'
        qframe means Query frame
        sframe means Subject frame
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'
-show_gis
   Show NCBI GIs in deflines?
-num_descriptions <Integer, >=0>
   Number of database sequences to show one-line descriptions for
   Default = `500'
-num_alignments <Integer, >=0>
   Number of database sequences to show alignments for
   Default = `250'
-html
   Produce HTML output?

*** Query filtering options
-seg <String>
   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
   'no' to disable)
   Default = `no'
-soft_masking <Boolean>
   Apply filtering locations as soft masks
   Default = `false'
-lcase_masking
   Use lower case filtering in query and subject sequence(s)?

*** Restrict search or results
-gilist <String>
   Restrict search of database to list of GI's
    * Incompatible with: negative_gilist, remote, subject, subject_loc
-negative_gilist <String>
   Restrict search of database to everything except the listed GIs
    * Incompatible with: gilist, remote, subject, subject_loc
-entrez_query <String>
   Restrict search with the given Entrez query
    * Requires: remote
-db_soft_mask <Integer>
   Filtering algorithm ID to apply to the BLAST database as soft masking
    * Incompatible with: subject, subject_loc
-culling_limit <Integer, >=0>
   If the query range of a hit is enveloped by that of at least this many
   higher-scoring hits, delete the hit
    * Incompatible with: best_hit_overhang, best_hit_score_edge
-best_hit_overhang <Real, (>=0 and =<0.5)>
   Best Hit algorithm overhang value (recommended value: 0.1)
    * Incompatible with: culling_limit
-best_hit_score_edge <Real, (>=0 and =<0.5)>
   Best Hit algorithm score edge value (recommended value: 0.1)
    * Incompatible with: culling_limit
-max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep

*** Statistical options
-dbsize <Int8>
   Effective length of the database
-searchsp <Int8, >=0>
   Effective length of the search space

*** Search strategy options
-import_search_strategy <File_In>
   Search strategy to use
    * Incompatible with: export_search_strategy
-export_search_strategy <File_Out>
   File name to record the search strategy used
    * Incompatible with: import_search_strategy

*** Extension options
-xdrop_ungap <Real>
   X-dropoff value (in bits) for ungapped extensions
-xdrop_gap <Real>
   X-dropoff value (in bits) for preliminary gapped extensions
-xdrop_gap_final <Real>
   X-dropoff value (in bits) for final gapped alignment
-window_size <Integer, >=0>
   Multiple hits window size, use 0 to specify 1-hit algorithm
-ungapped
   Perform ungapped alignment only?

*** Miscellaneous options
-parse_deflines
   Should the query and subject defline(s) be parsed?
-num_threads <Integer, >=1>
   Number of threads to use in the BLAST search
   Default = `1'
    * Incompatible with: remote
-remote
   Execute search remotely?
    * Incompatible with: gilist, negative_gilist, subject_loc, num_threads
-use_sw_tback
   Compute locally optimal Smith-Waterman alignments?
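
To tie a few of these options together, here is a small sketch that assembles a blastp command line from Python. It uses only flags that appear in the listing above (-query, -db, -out, -evalue, -outfmt, -num_threads); the file and database names are placeholders, and the command is only printed, not executed.

```python
import shlex


def build_blastp_command(query, db, out, evalue=10.0, outfmt=6, num_threads=1):
    """Assemble an argument list for blastp from options in the usage text."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),
        "-outfmt", str(outfmt),
        "-num_threads", str(num_threads),
    ]


# Placeholder names; replace them with a real FASTA file and BLAST database.
cmd = build_blastp_command("query.fasta", "mydb", "hits.tsv",
                           evalue=1e-5, num_threads=4)
print(shlex.join(cmd))
```

On a machine with BLAST+ installed, the list could be passed to subprocess.run(cmd, check=True). Note that -num_threads is incompatible with -remote, as the listing states.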
