一致性hash算法和lvs sh调度算法改进

标签:
一致性hashlvs调度算法杂谈 |
分类: 算法 |
1 LVS-sh调度算法(souce address hash)
本节是LVS中sh调度算法的实现分析;
ip_vs_sh_init_svc() - 创建256个hash bucket的hash table,并将rs映射到这256个bucket中,增加rs的引用计数器;
ip_vs_sh_done_svc() - 释放rs引用计数器,并销毁hash table;
ip_vs_sh_update_svc - 销毁hash table,重新创建hash table并映射rs;
ip_vs_sh_schedule – 计算hash key,找到hash bucket对应的rs;
注意:sh调度算法中rs数目必须<=256;
该算法存在的问题:当一台rs down了之后,rs映射到hash table的顺序会被重置,从而导致当前连接100%损失。
为了解决上述问题,我们第一步自然想到了一致性hash;
下面我们来分析一下haproxy中一致性hash的实现方法。
2 Consistent hash
本节我们将分析haproxy1.4.5中一致性hash的实现方法。
Haproxy官网:http://haproxy.1wt.eu/;
2.1 Consistent HASH
一致性hash用于解决集群(N台)中server发生DOWN/UP(x台)时,保证只有x/N的业务会受到影响,详细参考:http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/;
http://www.codeproject.com/Articles/56138/Consistent-hashing;
http://blog.csdn.net/sparkliang/article/details/5279393;
以cache为例,实现思想:
1.
2.
Haproxy利用EBtree来管理node,实现一致性hash。
1.
虚拟节点的分配在chash_init_server_tree()中实现;
虚拟节点个数srv->lb_nodes_tot = srv->uweight(32) * BE_WEIGHT_SCALE(16)=512个;
即1台server会有512个虚拟节点;
2.
每个虚拟节点的hash key按如下公式计算:
for (node = 0; node < srv->lb_nodes_tot; node++) {
其中,puid为server配置的id号,SRV_EWGHT_RANGE=4K;
chash_hash()是一个32位hash散列算法,参见http://burtleburtle.net/bob/hash/integer.html;
3.
采用EBtree管理节点,添加删除节点在chash_queue_dequeue_srv()中实现;
chash_queue_dequeue_srv()相应操作函数为eb32_insert()和eb32_delete();
EBtree原理参见:http://1wt.eu/articles/ebtree/;
2.2 EBtree
an elastic binary tree (or EB tree or EBtree) is a binary search tree, a fast radix tree(参见http://1wt.eu/articles/ebtree/ )
EBtree采用entry包含node和leaf的方法,解决了radix tree中entry个数过多的问题;
Complexity
-
lookup
: min=1, max=log(P), avg=log(N) -
insertion from
root
: min=1, max=log(P), avg=log(N) -
insertion of dups
: min=1, max=log(D), avg=log(D)/2 after lookup -
deletion
: min=1, max=1, avg=1 -
prev/next
: min=1, max=log(P), avg=2 :
数据结构
struct eb_root {
};
struct eb_node {
};
The node part itself is composed of a parent pointer (node_p), a level (bit), and a root, which itself links to lower nodes constituting a subtree. The level part represents the lowest bit position in the key, above which all bits are equal for all keys in the lower subtrees.
A root only contains two links called branches, which represent the value of the bit below current node's. There is always a left and a right branch, except for the top of the tree. Left branch represents bit value 0 while right branch reprensents value 1.
Branches are typed. This means that a node knows whether its branches point to other nodes or to leaves. Parent pointers are typed too. This means that each node or leaf knows if it is attached to the left or the right branches of the upper root.
2.2.1 Branches and Parent pointers are typed
#define
EB_LEFT
#define
EB_RGHT
#define
EB_NODE
#define
EB_LEAF
所有的地址是4字节对齐的,因此采用lowest bit来作标记。
- a node pointer will be stored verbatim (with lowest bit cleared)
- a leaf pointer will be stored with lowest bit set (which equals pointer + 1)
- a parent pointer in a left branch will be stored verbatim (with lowest bit cleared)
- a parent pointer in a right branch will be stored with lowest bit set (which equals pointer + 1)
2.2.2 Eb32tree
1.
a)
if (unlikely(troot == NULL)) {
b)
c)
if (eb_gettag(troot) == EB_LEAF) {
……
d)
if ((old_node_bit < 0) ||
2.
remove leaf node and marks the node as unlinked;
a)
b)
c)
d)
e)
parent->node_p = NULL;
if (!node->node_p)
f)
g)
h)
3.
遍历查找指定key的node;
a)
troot = node->node.branches.b[(x >> node_bit) & EB_NODE_BRANCH_MASK];
b)
c)
4.
a)
b)
5.
a)
b)
6.
a)
b)
7.
a)
b)
3 LVS-scsh调度算法(simple consistent souce address hash)
通过上述分析,一致性hash实现比较复杂,对于LVS调度来说,显得较重;
为此,我们设计了一个简化的一致性调度算法,本算法可以保证rs down/up时的一致性,无法保证add new rs时的一致性;
本算法适用于CDN业务,因为CDN节点添加新rs频度几乎为0。
本算法其实现原理如下:
ip_vs_sh_init_svc() – 同sh,但分配total_rs个hash bucket;
ip_vs_sh_done_svc() – 同sh;
ip_vs_sh_update_svc – 同sh,但分配total_rs个hash bucket;同时,更新avalid_rs个数;
ip_vs_sh_schedule – 计算hash key,key mod total_rs找到hash bucket对应的rs;如果rs不可用,则key mod avalid_rs,跳过weight=0的rs,找到对应的rs;
我们主要实现一次hash,二次取模;之所以采取二次取模,而不是取下一个可用rs,是为了分散热点;
其中:
total_rs=svc-> num_dests;
avalid_rs存储在svc->sched_data中;
注:二次hash算法有2个注意点
a)
b)