Repost: Why FST < 0? A note on a method for analyzing population differentiation_胖小妖

http://blog.sina.com.cn/u/2510103112

(2015-09-09 21:24:19)

Category: Natural selection at the molecular level

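Reposter's note on the cause: in my view, the subgroups within the large group are highly polymorphic, and sample sizes differ greatly and are unevenly distributed among the subgroups. This raises heterozygosity in some local subgroups, so that the mean subgroup heterozygosity is biased downward. In this situation the groups can probably be separated, but the sample sizes differ too much for the comparison to be meaningful.

Why did I get FST < 0? A study note on a method for estimating population differentiation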

Abstract

Thanks to its comprehensive set of options, Arlequin is one of the most widely used programs in population genetics today. Its convenience, however, means that many users overlook how the estimators they apply differ from those they learned from textbooks. Hudson et al. (1992) developed a method for estimating pairwise differentiation among populations, and this method is widely used in Arlequin to estimate pairwise differentiation (FST) and gene flow (Nm) from pairwise comparisons among populations. Under some circumstances the method returns FST values below zero. Such results confuse many beginners, because the FST estimators taught in most courses are based on Wright's model, in which FST < 0 is impossible. Here I give a brief description of the method of Hudson et al. and of how it differs from the methods developed by Wright and by Nei (1982). Two examples are included to show the conditions under which the method can yield FST < 0: the two populations differ greatly in size and are highly similar in genetic composition.

Since DNA sequences came into use in population genetic studies, estimating differentiation among populations has no longer been limited to comparing alleles and has extended to comparing sequences. Unlike earlier methods, the newer methods aim to make full use of the information contained in the differences among DNA sequences, namely the differences (polymorphisms) at each nucleotide position.

For example, Hudson et al. (1992) estimate population differentiation with the following formula:

FST = 1 - Hw / Hb, where
FST is the population differentiation parameter;
Hw is the sum of the differences found in all pairwise comparisons of sequences within the same population;
Hb is the sum of the differences found in all pairwise comparisons of sequences from different populations.
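The definitions above can be sketched in R by brute-force pairwise comparison. This is only a minimal illustration of Hw and Hb as sums, as this note defines them. The function names (n_diff, sum_within, sum_between, fst_sums) and the toy sequences are made up for the example and do not come from Arlequin or from Hudson et al.

```r
# Number of differing positions between two aligned sequences of equal length
n_diff <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hw contribution of one population: sum over all unordered pairs within it
sum_within <- function(seqs) {
  if (length(seqs) < 2) return(0)
  pairs <- combn(length(seqs), 2)
  sum(apply(pairs, 2, function(p) n_diff(seqs[p[1]], seqs[p[2]])))
}

# Hb: sum over all pairs that take one sequence from each population
sum_between <- function(seqs1, seqs2) {
  sum(outer(seq_along(seqs1), seq_along(seqs2),
            Vectorize(function(i, j) n_diff(seqs1[i], seqs2[j]))))
}

fst_sums <- function(seqs1, seqs2) {
  hw <- sum_within(seqs1) + sum_within(seqs2)
  hb <- sum_between(seqs1, seqs2)
  c(Hw = hw, Hb = hb, FST = 1 - hw / hb)
}

# Hypothetical toy alignment (made-up data, for illustration only)
pop1 <- c("ACGT", "ACGT", "ACGA")
pop2 <- c("TCGT", "TCGT", "TCGA")
res <- fst_sums(pop1, pop2)
print(res)
```

Because every pair is compared explicitly, the sketch weights each comparison by the number of differing nucleotide positions, which is the point of working with sequences rather than with alleles.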

This method differs in two ways from the calculation presented in traditional textbooks, which derives from Wright (FST = 1 - Hs / Ht). First, Hudson et al. treat the sequences of a single subpopulation as a whole and take the differences between every pair of sequences within it as the numerator Hw, whereas Wright's method considers the frequency of heterozygotes in a diploid population. Second, the denominator Hb of Hudson et al. contains only the differences between pairs of sequences "from different populations", whereas Wright's denominator is the heterozygote frequency expected from the allele frequencies of the total population. Because Wright's method does not consider haploid populations made up of haplotypes, it cannot in theory be applied directly to DNA sequence data from organelles (e.g. chloroplasts or mitochondria). Moreover, Wright's model considers only whether alleles differ, not by how much, while the method of Hudson et al. weights the difference between two sequences by the number of nucleotide positions at which they differ. This makes it better suited to estimating differentiation among haplotype populations.
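For comparison, Wright's FST = 1 - Hs / Ht can be sketched for the simplest textbook situation. The sketch assumes a single biallelic locus, Hardy-Weinberg proportions within each subpopulation, and equal weights for all subpopulations. The function name wright_fst and the frequencies are hypothetical.

```r
# Wright-style FST from the allele-A frequency of each subpopulation.
# Assumptions: one biallelic locus, Hardy-Weinberg proportions, equal weights.
wright_fst <- function(p) {
  hs <- mean(2 * p * (1 - p))     # mean expected heterozygosity within subpopulations
  p_bar <- mean(p)                # allele frequency in the total population
  ht <- 2 * p_bar * (1 - p_bar)   # expected heterozygosity of the total population
  c(Hs = hs, Ht = ht, FST = 1 - hs / ht)
}

print(wright_fst(c(0.5, 0.75)))   # hypothetical frequencies
print(wright_fst(c(0.6, 0.6)))    # identical subpopulations give FST = 0
```

Under these assumptions Ht is never smaller than Hs, because 2p(1 - p) is concave in p, so this form of FST cannot drop below zero.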

However, Hudson et al. made some assumptions in developing this method, and these lead to surprising results under some conditions.

The following two examples show such results when the method of Hudson et al. (1992) is applied. In both examples the difference (number of nucleotides) between any two haplotypes is assumed to be 1.

Case 1:
Haplotype       A   B   C   Total
population 1    4   3   1   8
population 2    3   1   0   4

Hw = Hw1 + Hw2 = (12 + 3 + 4) + 3 = 22
Hb = Hb(AB) + Hb(AC) + Hb(BC) = (4 + 9) + 3 + 1 = 17
FST = 1 - Hw / Hb = 1 - 22 / 17 = -0.29411... ~ -0.294

Case 2:
Haplotype       A    B   C   D   E   F   G   Total
population 1    16   1   1   1   0   0   0   19
population 2    29   2   0   0   1   1   1   34

Hw = Hw1 + Hw2 = (16 * 3 + 3) + (29 * 5 + 2 * 3 + 3) = 51 + 154 = 205
Hb = Hb(AB) + Hb(AC) + Hb(AD) + Hb(AE) + Hb(AF) + Hb(AG) + Hb(BC) + Hb(BD) + Hb(BE)
+ Hb(BF) + Hb(BG) + Hb(CE) + Hb(CF) + Hb(CG) + Hb(DE) + Hb(DF) + Hb(DG)
= (32 + 29) + 29 * 2 + 16 * 3 + 2 * 2 + 9 = 180
FST = 1 - Hw / Hb = 1 - 205 / 180 = -0.138888... ~ -0.139
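Both cases can be checked from the haplotype counts alone. The sketch below relies on the same assumption as the examples (any two different haplotypes differ by exactly 1), so the sums reduce to counting pairs of unlike haplotypes. The function names are made up for this illustration.

```r
# Sum of pairwise differences within one population, from haplotype counts,
# assuming any two different haplotypes differ by exactly 1
within_sum <- function(n) (sum(n)^2 - sum(n^2)) / 2

# Sum of differences over all between-population pairs, same assumption:
# all pairs minus the pairs that carry the same haplotype
between_sum <- function(n1, n2) sum(n1) * sum(n2) - sum(n1 * n2)

fst_from_counts <- function(n1, n2) {
  hw <- within_sum(n1) + within_sum(n2)
  hb <- between_sum(n1, n2)
  c(Hw = hw, Hb = hb, FST = 1 - hw / hb)
}

# Haplotype counts copied from the two tables above
case1 <- fst_from_counts(c(A = 4, B = 3, C = 1), c(A = 3, B = 1, C = 0))
case2 <- fst_from_counts(c(16, 1, 1, 1, 0, 0, 0), c(29, 2, 0, 0, 1, 1, 1))
print(round(case1, 3))
print(round(case2, 3))
```

The two count vectors must list the haplotypes in the same order, with 0 for a haplotype that is absent from a population.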

Both examples involve a pair of populations that differ greatly in size (a ratio close to 1:2) and that have very similar haplotype classes and frequencies. Under these conditions the Hw computed by the method of Hudson et al. can exceed Hb, which gives FST < 0. In the original article of Hudson et al. (1992), every subpopulation in the simulations had a size of 16 (sequences). With equal sizes like these, FST does not fall below zero even when the haplotype classes and frequencies are very similar in both populations:

Case 3, in which the frequency of each haplotype within each population is the same as in Case 1:

Haplotype       A    B   C   Total
population 1    8    6   2   16
population 2    12   4   0   16

Hw = Hw1 + Hw2 = (48 + 16 + 12) + 48 = 124
Hb = Hb(AB) + Hb(AC) + Hb(BC) = (32 + 72) + 24 + 8 = 136
FST = 1 - Hw / Hb = 1 - 124 / 136 = 0.088235... ~ 0.088
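To see how the sample sizes alone move the result, the haplotype frequencies of Case 1 (and Case 3) can be held fixed while the sizes change. This is a rough sketch under the same assumption that different haplotypes differ by 1. The function name fst_for_sizes and the sizes tried at the end are hypothetical.

```r
# Haplotype frequencies shared by Case 1 and Case 3
p1 <- c(A = 4, B = 3, C = 1) / 8
p2 <- c(A = 3, B = 1, C = 0) / 4

# Sum-based FST for sample sizes n1 and n2 with the frequencies held fixed,
# assuming every pair of different haplotypes differs by 1
fst_for_sizes <- function(n1, n2) {
  hw <- n1^2 * (1 - sum(p1^2)) / 2 + n2^2 * (1 - sum(p2^2)) / 2
  hb <- n1 * n2 * (1 - sum(p1 * p2))
  1 - hw / hb
}

print(fst_for_sizes(8, 4))     # sizes of Case 1
print(fst_for_sizes(16, 16))   # sizes of Case 3

# Hypothetical sizes for population 1, with population 2 fixed at 16
n1 <- c(8, 16, 32, 64)
ratio_table <- data.frame(n1 = n1, n2 = 16, FST = round(fst_for_sizes(n1, 16), 3))
print(ratio_table)
```

With Hw and Hb taken as sums, the value depends only on the ratio n1 / n2 and not on the absolute sizes, which is why Case 1 and Case 3 differ although their frequencies are identical.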

FST < 0 can be interpreted as variation within populations exceeding variation between populations. Fundamentally, though, the result arises only because the calculation of Hudson et al. neither samples with replacement (so that a sampled sequence may be compared with itself) nor treats all subpopulations as one total population (so that comparisons between sequences from the same subpopulation enter the denominator). Both of these assumptions are adopted in Nei's (1982) formula for δst:

δst = πT - πS, where
πT is the nucleotide diversity of the total population
(which does not exclude the contributions from the nucleotide diversity within each subpopulation);
πS is the average of the nucleotide diversities of the subpopulations.
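The effect of the two assumptions can be sketched with the counts of Case 1. This is a simplified reading of δst, not necessarily the exact estimator of Nei (1982) or of DNAsp: it assumes sampling with replacement (so diversity is 1 - sum(p^2) when different haplotypes differ by 1), pools the subpopulations by their sample sizes, and averages the subpopulation diversities with the same size weights. The function names pi_div and delta_st are made up.

```r
# Nucleotide diversity when sequences are drawn with replacement and any two
# different haplotypes differ by 1
pi_div <- function(p) 1 - sum(p^2)

delta_st <- function(n1, n2) {
  # Size weights for the two subpopulations (an assumption of this sketch)
  w <- c(sum(n1), sum(n2)) / (sum(n1) + sum(n2))
  pi_s <- w[1] * pi_div(n1 / sum(n1)) + w[2] * pi_div(n2 / sum(n2))
  # Total population: both subpopulations pooled into one
  pi_t <- pi_div((n1 + n2) / sum(n1 + n2))
  c(piT = pi_t, piS = pi_s, delta_st = pi_t - pi_s)
}

res <- delta_st(c(4, 3, 1), c(3, 1, 0))   # haplotype counts of Case 1
print(round(res, 4))
```

With these choices πT cannot be smaller than πS, so the estimate stays at or above zero for the same data that gave a negative FST under the sum-based calculation.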

In practice, therefore, the simplest way to deal with negative estimates of genetic differentiation between some populations is to use the method of Nei (1982) instead of that of Hudson et al. (1992). This method is available in the software DnaSP.

Further reading:

Hudson RR, Slatkin M, Maddison WP (1992) Estimation of levels of gene flow from DNA sequence data. Genetics 132, 583-589.

Nei M (1982) Evolution of human races at the gene level, pp. 167-181. In: Bonne-Tamir B, Cohen T, Goodman RM (eds.), Human Genetics, Part A: The Unfolding Genome. Alan R. Liss, New York.

Wright S (1951) The genetical structure of populations. Annals of Eugenics 15, 323-354.

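Reposter's addition: based on Hudson's FST, suppose a site has two alleles and allele A occurs at frequency x in one group and y in the other. Then fst = (x^2 + y^2 - 2*x*y) / (x + y - 2*x*y). A simulation follows: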

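library(car)  # scatter3d() is provided by car (which Rcmdr loads); display needs rgl

# Frequency of allele A in each group; stop at 0.99 because x = y = 1 gives 0/0
freq <- seq(0.01, 0.99, 0.01)
grid <- expand.grid(x = freq, y = freq)

# fst for every (x, y) combination, computed in one vectorized step
grid$fst <- with(grid, (x^2 + y^2 - 2 * x * y) / (x + y - 2 * x * y))

# 3D scatter plot with the default fitted surface, then points only in colour 2
scatter3d(grid$x, grid$y, grid$fst)
scatter3d(grid$x, grid$y, grid$fst, surface = FALSE, point.col = 2)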

[Figures: three plots produced by the code above; the images from the original post are not available.]

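Conclusion: when the two subgroups are of similar size, Fst grows as the frequency compositions of the two groups become more different, and it is smallest when the compositions are identical.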
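The conclusion can be checked numerically without plotting. Since x^2 + y^2 - 2*x*y equals (x - y)^2, the formula is zero when x = y and cannot be negative. The sketch below fixes y and moves x away from it. The function name fst_xy and the chosen frequencies are hypothetical.

```r
# fst for two equal-sized groups with allele-A frequencies x and y
fst_xy <- function(x, y) (x - y)^2 / (x + y - 2 * x * y)

y <- 0.2
x <- c(0.2, 0.4, 0.6, 0.8)   # hypothetical frequencies, moving away from y
tab <- data.frame(x = x, y = y, fst = fst_xy(x, y))
print(tab)
```

The fst column starts at 0 for x = y and increases as x moves further from y.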

The case in which the sizes of the two groups differ by a factor of C will be discussed later.
