加载中…
个人资料
benjiaming
benjiaming
  • 博客等级:
  • 博客积分:0
  • 博客访问:1,089
  • 关注人气:68
  • 获赠金笔:0支
  • 赠出金笔:0支
  • 荣誉徽章:
正文 字体大小:

一致性hash算法和lvs sh调度算法改进

(2013-01-17 16:55:19)
标签:

一致性hash

lvs

调度算法

杂谈

分类: 算法

1 LVS-sh调度算法souce address hash

本节是LVSsh调度算法的实现分析;

ip_vs_sh_init_svc() - 创建256hash buckethash table,并将rs映射到这256bucket,增加rs的引用计数器;

ip_vs_sh_done_svc() - 释放rs引用计数器,并销毁hash table

ip_vs_sh_update_svc - 销毁hash table,重新创建hash table并映射rs

ip_vs_sh_schedule – 计算hash key,找到hash bucket对应的rs

注意:sh调度算法中rs数目必须<=256;

 

该算法存在的问题:当一台rs down了之后,rs映射到hash table的顺序会被重置,从而导致当前连接100%损失。

 

为了解决上述问题,我们第一步自然想到了一致性hash

下面我们来分析一下haproxy中一致性hash的实现方法。

2 Consistent hash

本节我们将分析haproxy1.4.5中一致性hash的实现方法。

Haproxy官网:http://haproxy.1wt.eu/

2.1 Consistent HASH

一致性hash用于解决集群(N台)中server发生DOWN/UPx台)时,保证只有x/N的业务会受到影响,详细参考:http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

http://www.codeproject.com/Articles/56138/Consistent-hashing

http://blog.csdn.net/sparkliang/article/details/5279393

cache为例,实现思想:

1.       根据servercache)某个参数计算出的key,将server映射到2^31-1的环形空间中;object对象根据自身的key存储到相近的cache server中;当cache A server down之后,只会影响环中其key附近的object

2.       由于cache server无法均匀的映射到环形空间中(不均衡问题),所以引入虚拟节点的概念,每台cache server分配X个虚拟节点映射到环中。


Haproxy利用EBtree来管理node,实现一致性hash

1.         虚拟节点个数

虚拟节点的分配在chash_init_server_tree()中实现;

虚拟节点个数srv->lb_nodes_tot = srv->uweight(32) * BE_WEIGHT_SCALE(16)=512;

1server会有512个虚拟节点;

 

2.         hash key的计算

每个虚拟节点的hash key按如下公式计算:

for (node = 0; node < srv->lb_nodes_tot; node++) {

               ……

           srv->lb_nodes[node].node.key=chash_hash(srv->puid*SRV_EWGHT_RANGE + node);

         }

其中,puidserver配置的id号,SRV_EWGHT_RANGE=4K;

chash_hash()是一个32hash散列算法,参见http://burtleburtle.net/bob/hash/integer.html

 

3.         添加/删除节点

采用EBtree管理节点,添加删除节点在chash_queue_dequeue_srv()中实现;

chash_queue_dequeue_srv()相应操作函数为eb32_insert()eb32_delete()

EBtree原理参见:http://1wt.eu/articles/ebtree/

2.2 EBtree

an elastic binary tree (or EB tree or EBtree) is a binary search tree, a fast radix tree(参见http://1wt.eu/articles/ebtree/

EBtree采用entry包含nodeleaf的方法,解决了radix treeentry个数过多的问题;

Complexity :

  • lookup  : min=1, max=log(P), avg=log(N)
  • insertion from root : min=1, max=log(P), avg=log(N)
  • insertion of dups  : min=1, max=log(D), avg=log(D)/2 after lookup
  • deletion  : min=1, max=1, avg=1
  • prev/next  : min=1, max=log(P), avg=2 :

 

数据结构

 

struct eb_root {

         eb_troot_t    *b[EB_NODE_BRANCHES];

};

 

struct eb_node {

         struct eb_root branches;

         eb_troot_t    *node_p; 

         eb_troot_t    *leaf_p; 

         short int      bit;    

         short int      pfx;    

};

The node part itself is composed of a parent pointer (node_p), a level (bit), and a root, which itself links to lower nodes constituting a subtree. The level part represents the lowest bit position in the key, above which all bits are equal for all keys in the lower subtrees.

A root only contains two links called branches, which represent the value of the bit below current node's. There is always a left and a right branch, except for the top of the tree. Left branch represents bit value 0 while right branch reprensents value 1.

Branches are typed. This means that a node knows whether its branches point to other nodes or to leaves. Parent pointers are typed too. This means that each node or leaf knows if it is attached to the left or the right branches of the upper root.

2.2.1 Branches and Parent pointers are typed

#define EB_LEFT     0

#define EB_RGHT     1

#define EB_NODE    0

#define EB_LEAF     1

 

所有的地址是4字节对齐的,因此采用lowest bit来作标记。

  1. a node pointer will be stored verbatim (with lowest bit cleared)
  2. a leaf pointer will be stored with lowest bit set (which equals pointer + 1)
  3. a parent pointer in a left branch will be stored verbatim (with lowest bit cleared)
  4. a parent pointer in a right branch will be stored with lowest bit set (which equals pointer + 1)

2.2.2 Eb32tree

1.         eb32_insert

a)         Tree is empty, insert the leaf part below the left branch

if (unlikely(troot == NULL)) {

         ……

b)         Walk down the tree by newnode->key and oldnode’s bit;

                    root = &old->node.branches;

                    side = (newkey >> old_node_bit) & EB_NODE_BRANCH_MASK;

                        troot = root->b[side];

c)         if we have reached a leaf node, insert above a leaf;

if (eb_gettag(troot) == EB_LEAF) {

……

d)         if we don't have common bits anymore/ duplicate tree, insert above the node;

if ((old_node_bit < 0) ||

                       (((new->key ^ old->key) >> old_node_bit) >= EB_NODE_BRANCHES))

                   ……

 

2.         eb32_delete

remove leaf node and marks the node as unlinked

a)         获得父节点parent = eb_root_to_node(eb_untag(node->leaf_p, pside));

b)         获得祖父节点gparent = eb_untag(parent->node_p, gpside);

c)         祖父节点分支ç父节点分支节点gparent->b[gpside] = parent->branches.b[!pside];

d)         父节点分支节点 node_p/leaf_p è祖父节点分支;

e)         Mark the parent unusedif the parent is our own nodesafely revmoce it;

parent->node_p = NULL

if (!node->node_p)

                   goto delete_unlink;

f)          New nodeLeaf-parent instead node

      parent->node_p = node->node_p;

      parent->branches = node->branches;

         parent->bit = node->bit;

g)         Update new node’s parent

      gpside = eb_gettag(parent->node_p);

      gparent = eb_untag(parent->node_p, gpside);

              gparent->b[gpside] = eb_dotag(&parent->branches, EB_NODE);

h)         Update new node’s branches

      for (pside = 0; pside <= 1; pside++) {

               if (eb_gettag(parent->branches.b[pside]) == EB_NODE) {

                         eb_root_to_node(eb_untag(parent->branches.b[pside], EB_NODE))->node_p =

                                  eb_dotag(&parent->branches, pside);

               } else {

                         eb_root_to_node(eb_untag(parent->branches.b[pside], EB_LEAF))->leaf_p =

                                  eb_dotag(&parent->branches, pside);

               }

         }

 

3.         eb32_lookup

遍历查找指定keynode

a)         根据keyx)和node->bit 查找

troot = node->node.branches.b[(x >> node_bit) & EB_NODE_BRANCH_MASK];

b)         如果是leaf,则判断key是否相等;

               if ((eb_gettag(troot) == EB_LEAF)) {

                         node = container_of(eb_untag(troot, EB_LEAF),

                                               struct eb32_node, node.branches);

                         if (node->key == x)

                                  return node;

                         else

                                  return NULL;

                   }

c)         如果是node,则判断是否是dup-tree,或者 no more common bitsreturn NULL);

 

4.         eb32_first

a)         遍历左子树,直到叶子节点;

b)         实现:eb_walk_down(root->b[0], EB_LEFT);

 

5.         eb32_last

a)         遍历右子树,直到叶子节点;

b)         实现:eb_walk_down(root->b[0], EB_RGHT);

 

6.         eb32_prev

a)         挂在左子树上,当前指针指向父节点,继续执行a

b)         挂在右子树上,eb32_last遍历父节点左子树;

 

7.         eb32_next

a)         挂在右子树上,当前指针指向父节点,继续执行a

b)         挂在左子树上,eb32_first遍历父节点右子树;

3 LVS-scsh调度算法simple consistent souce address hash

通过上述分析,一致性hash实现比较复杂,对于LVS调度来说,显得较重;

为此,我们设计了一个简化的一致性调度算法,本算法可以保证rs down/up时的一致性,无法保证add new rs时的一致性;

本算法适用于CDN业务,因为CDN节点添加新rs频度几乎为0

本算法其实现原理如下:

ip_vs_sh_init_svc() – sh,但分配total_rshash bucket

ip_vs_sh_done_svc() – sh

ip_vs_sh_update_svc – sh,但分配total_rshash bucket;同时,更新avalid_rs个数;

ip_vs_sh_schedule – 计算hash keykey mod total_rs找到hash bucket对应的rs;如果rs不可用,则key mod avalid_rs,跳过weight=0rs,找到对应的rs

我们主要实现一次hash,二次取模;之所以采取二次取模,而不是取下一个可用rs,是为了分散热点;

其中:

total_rs=svc-> num_dests;

avalid_rs存储在svc->sched_data中;

 

注:二次hash算法有2个注意点

a)         vip必须配置inhibit-on-failure,以保证rs down时,只是修改weight=0,而不是重新map rs

b)         新添rs时,会重新map rs,导致hash不一致;

0

阅读 收藏 喜欢 打印举报/Report
  

新浪BLOG意见反馈留言板 欢迎批评指正

新浪简介 | About Sina | 广告服务 | 联系我们 | 招聘信息 | 网站律师 | SINA English | 产品答疑

新浪公司 版权所有