加载中…
个人资料
whutgy
whutgy
  • 博客等级:
  • 博客积分:0
  • 博客访问:3,227
  • 关注人气:0
  • 获赠金笔:0支
  • 赠出金笔:0支
  • 荣誉徽章:
相关博文
推荐博文
谁看过这篇博文
加载中…
正文 字体大小:

Chapter 11-1: Hash Tables

(2009-09-07 16:12:06)
标签:

杂谈

分类: DataStructures&Algorith笔记

(1)
Hash tables do have several disadvantages. They're based on arrays, and arrays are difficult to expand once they've been created. For some kinds of hash tables, performance may degrade catastrophically when the table becomes too full, so the programmer needs to have a fairly accurate idea of how many data items will need to be stored (or be prepared to periodically transfer data to a larger hash table, a timeconsuming process).
哈希表也有一些缺点:它是基于数组的,数组创建后难于扩展。某些哈希表被填满时,性能下降的非常严重,所以程序员必须要清楚表中将要存储多少数据(或者准备好定期地把数据转移到更大的哈希表中,这是个费时的过程)


A similar widely used application for hash tables is in computer-language compilers, which maintain a symbol table in a hash table. The symbol table holds all the variable and function names made up by the programmer, along with the addresses where they can be found in memory. The program needs to access these names very quickly, so a hash table is the preferred data structure.
哈希表还在另一个类似的领域得到广泛应用,这就是高级计算机语言的编译器,它们通常用哈希表保留符号表。符号表记录了程序员声明的所有变量和函数名,以及它们在内存中的地址。程序需要快速地访问这些名字,所以哈希表是理想的数据结构。


This is an example of a hash function. It hashes (converts) a number in a large range into a number in a smaller range. This smaller range corresponds to the index numbers in an array. An array into which data is inserted using a hash function is called a hash table.
这就是一种哈希函数。它把一个大范围的数字哈希(转化)成一个小范围的数字。这个小的范围对应着数组的下标。使用哈希函数向数组插入数据后,这个数组就称为哈希表。

 

(2)
冲突(Collisions)
  把巨大的数字空间压缩成较小的数字空间,必然要付出代价,即不能保证,每个单词都映射到数组的空白单元。

 

  Remember that we've specified an array with twice as many cells as data items. Thus perhaps half the cells are empty. One approach, when a collision occurs, is to search the array in some systematic way for an empty cell, and insert the new item there, instead of at the index specified by the hash function. This approach is called open addressing. If cats hashes to 5,421, but this location is already occupied by parsnip, then we might try to insert cats in 5,422, for example.


  冲突的可能性会导致哈希化方案无法实施,实际上,可以通过另外的方式解决这个问题。当冲突发生时,一个方法是通过系统的方法找到数组的一个空位,并把这个单词填入,而不再用哈希函数得到数组下标,这个方法叫做开放地址法。

 

  A second approach (mentioned earlier) is to create an array that consists of linked lists of words instead of the words themselves. Then when a collision occurs, the new item is simply inserted in the list at that index. This is called separate chaining.
  第二种方法是创建一个存放单词链表的数组,数组内不直接存储单词。这样,当发生冲突时,新的数据项直接接到这个数组下标所指的链表中。这种方法叫做链地址法。

 

(3)
开放地址法

  In open addressing, when a data item can't be placed at the index calculated by the hash function, another location in the array is sought. We'll explore three methods of open addressing, which vary in the method used to find the next vacant cell. These methods are linear probing, quadratic probing, and double hashing.


  在开放地址法中,若数据不能直接放在由哈希函数计算出来的数组下标所指的单元时,就要寻找数组的其他位置。下面要探索开放地址法的三种方法,它们在找下一个空白单元时使用的方法不同。这三种方法分别是线性探测、二次探测和再哈希法。

  在线性探测中,线性地查找空白单元。如果5421是要插入数据的位置,它已经被占用了,那么就使用5422,然后是5423,依此类推,数组下标一直递增,直到找到空位。这就叫做线性探测,因为它沿着数组的下标一步一步顺序地查找空白单元。

 

  Deletion isn't accomplished by simply removing a data item from a cell, leaving it empty. Why not? Remember that during insertion the probe process steps along a series of cells, looking for a vacant one. If a cell is made empty in the middle of this sequence of full cells, the Find routine will give up when it sees the empty cell, even if the desired cell can eventually be reached. For this reason a deleted item is replaced by an item with a special key value that identifies it as deleted.

  If there are many deletions, the hash table fills up with these ersatz *Del* data items, which makes it less efficient. For this reason many hash table implementations don't allow deletion.


  哈希表中,一串连续的已填充单元叫做填充序列。增加越来越多的数据项时,填充序列变的越来越长。这叫做聚集。

  当哈希表变得越来越满时,聚集变得越来越严重。这导致产生非常长的探测长度。意味着存取序列最后的单元会非常耗时。

  数组填的越满,聚集越可能发生。数组有一半数据项时,这通常不是问题,当三分之二满的时候,情况也不会太坏,然而,如果超过这个界限,随着聚集越来越严重,性能下降也很严重。因此,设计哈希表的关键是确保它不会超过整个数组容量的一半,最多到三分之二。

 

import java.io.*;
////////////////////////////////////////////////////////////////
class DataItem
                                 // (could have more data)
   private int iData;               // data item (key)
//--------------------------------------------------------------
   public DataItem(int ii)          // constructor
      { iData = ii; }
//--------------------------------------------------------------
   public int getKey()
      { return iData; }
//--------------------------------------------------------------
   // end class DataItem
////////////////////////////////////////////////////////////////
class HashTable
   {
   private DataItem[] hashArray;    // array holds hash table
   private int arraySize;
   private DataItem nonItem;        // for deleted items
// -------------------------------------------------------------
   public HashTable(int size)       // constructor
      {
      arraySize = size;
      hashArray = new DataItem[arraySize];
      nonItem = new DataItem(-1);   // deleted item key is -1
      }
// -------------------------------------------------------------
   public void displayTable()
      {
      System.out.print("Table: ");
      for(int j=0; j<arraySize; j++)
         {
         if(hashArray[j] != null)
            System.out.print(hashArray[j].getKey() + " ");
         else
            System.out.print("** ");
         }
      System.out.println("");
      }
// -------------------------------------------------------------
   public int hashFunc(int key)
      {
      return key % arraySize;       // hash function
      }
// -------------------------------------------------------------
   public void insert(DataItem item) // insert a DataItem
   // (assumes table not full)
      {
      int key = item.getKey();      // extract key
      int hashVal = hashFunc(key);  // hash the key
                                    // until empty cell or -1,
      while(hashArray[hashVal] != null &&
                      hashArray[hashVal].getKey() != -1)
         {
         ++hashVal;                 // go to next cell
         hashVal %= arraySize;      // wraparound if necessary
         }
      hashArray[hashVal] = item;    // insert item
      // end insert()
// -------------------------------------------------------------
   public DataItem delete(int key)  // delete a DataItem
      {
      int hashVal = hashFunc(key);  // hash the key

      while(hashArray[hashVal] != null)  // until empty cell,
                                      // found the key?
         if(hashArray[hashVal].getKey() == key)
            {
            DataItem temp = hashArray[hashVal]; // save item
            hashArray[hashVal] = nonItem;       // delete item
            return temp;                        // return item
            }
         ++hashVal;                 // go to next cell
         hashVal %= arraySize;      // wraparound if necessary
         }
      return null;                  // can't find item
      // end delete()
// -------------------------------------------------------------
   public DataItem find(int key)    // find item with key
      {
      int hashVal = hashFunc(key);  // hash the key

      while(hashArray[hashVal] != null)  // until empty cell,
                                      // found the key?
         if(hashArray[hashVal].getKey() == key)
            return hashArray[hashVal];   // yes, return item
         ++hashVal;                 // go to next cell
         hashVal %= arraySize;      // wraparound if necessary
         }
      return null;                  // can't find item
      }
// -------------------------------------------------------------
   // end class HashTable
////////////////////////////////////////////////////////////////


(4)
扩展数组

  One option when a hash table becomes too full is to expand its array. In Java, arrays have a fixed size and can't be expanded. Your program could create a new, larger array, and then rehash the contents of the old small array into the new large one. However, this is a time-consuming process.


  当哈希表变得太满时,一个选择是扩展数组。在Java中,数组有固定的大小,而且不能扩展。编程时只能另外创建一个新的更大的数组,然后把旧数组的所有内容插入到新的数组中。

记住,哈希函数根据数组大小计算给定数据项的位置,所以,这些数据项不能再放在新数组中和老数组相同的位置上。因此,不能简单的从一个数组向另一个数组拷贝数据。需要按顺序遍历老数组,用insert()方法向新数组中插入每个数据项。这叫做重新哈希化,这是一个耗时的过程,但如果要进行扩展,这个过程就是必要的。

  扩展后的数组容量通常是原来的两倍。实际上,因为数组容量应该是一个质数,所以新数组要比两倍的容量多一点。计算新数组的容量是重新哈希化的一部分。

 

  Java offers a class Vector that is an array-like data structure that can be expanded. However, it's not much help because of the need to rehash all data items when the table changes size. Expanding the array is only practical when there's plenty of time available to carry it out.


(5)
二次探测(Quadratic Probing)

  We've seen that clusters can occur in the linear probe approach to open addressing. Once a cluster forms, it tends to grow larger. Items that hash to any value in the range of the cluster will step along and insert themselves at the end of the cluster, thus making it even bigger. The bigger the cluster gets, the faster it grows.


The ratio of the number of items in a table, to the table's size, is called the load factor.
  已填入哈希表的数据项和表长的比率叫做装填因子。有10000个单元的哈希表填入6667个数据后,它的装填因子是2/3.

  当装填因子不太大时,聚集分布得比较连贯。哈希表的某个部分可能包含大量的聚集,而另一个部分还很稀疏。聚集降低了哈希表的性能。

 

  Quadratic probing is an attempt to keep clusters from forming. The idea is to probe more widely separated cells, instead of those adjacent to the primary hash site.
二次探测是防止聚集产生的一种尝试。思想是探测相隔较远的单元,而不是和原始位置相邻的单元。

(步骤是步数的平方)


  In a linear probe, if the primary hash index is x, subsequent probes go to x+1, x+2, x+3, and so on. In quadratic probing, probes go to x+1, x+4, x+9, x+16, x+25, and so on. The distance from the initial probe is the square of the step number.

(有些像网络中以太网处理冲突的一个方法。)

 

二次探测的问题
  二次探测消除了在线性探测中产生的聚集问题,这种聚集问题叫做原始聚集。然而,二次探测产生了另外一种,更细的聚集问题。之所以会发生,是因为所有映射到同一个位置的关键字在寻找空位时,探测的单元都是一样的。

  比如将184,302,420和544依次插入到表中,它们都映射到7.那么302需要以一为步长的探测,420需要以四为步长的探测,544需要以九为步长的探测。只要有一项,其关键字映射到7,就需要更长步长的探测。这个现象叫做二次聚集。

  二次聚集不是一个严重的问题,但是,二次探测不会经常使用,因为还有稍微好些的解决方案。

 

(6)
再哈希法(Double Hashing)
  为了消除原始聚集和二次聚集,可以使用另外的一个方法:再哈希法。二次聚集产生的原因是,二次探测的算法产生的探测序列步长总是固定的:1,4,9,16,依此类推。

  现在需要的一种方法是产生一种依赖关键字的探测序列,而不是每个关键字都一样。那么,不同的关键字即使映射到相同的数组下标,也可以使用不同的探测序列。

  方法是把关键字用不同的哈希函数再做一遍哈希化,用这个结果作为步长。对指定的关键字,步长在整个探测中是不变的,不过不同的关键字使用不同的步长。

 

  经验说明,第二个哈希函数必须具备如下特点:
1. 和第一个哈希函数不同。
2. 不能输出0(否则,将没有步长;每次探测都是原地踏步,算法将陷入死循环)。

 

Experts have discovered that functions of the following form work well:
stepSize = constant-(key%constant);
where constant is prime and smaller than the array size.


Table Size a Prime Number(表的容量是一个质数)
  Double hashing requires that the size of the hash table is a prime number. To see why, imagine a situation where the table size is not a prime number. For example, suppose the array size is 15 (indices from 0 to 14), and that a particular key hashes to an initial index of 0 and a step size of 5. The probe sequence will be 0, 5, 10, 0, 5, 10, and so on, repeating endlessly. Only these three cells are ever examined, so the algorithm will never find the empty cells that might be waiting at 1, 2, 3, and so on. The algorithm will crash and burn.
(再哈希法要求哈希表的容量是一个质数。为了考察为什么会有这个限制,假设表的容量不是质数。例如表长是15(下标从0到14),有一个特定关键字映射到0,步长为5.探测序列是0,5,10,0,5,10,依此类推,一直循环下去。算法只尝试着三个单元,所以不可能找到某些空白单元,例如位置1,2,3或其他位置。算法最终会导致崩溃。)

 

  If the array size were 13, which is prime, the probe sequence eventually visits every cell. It's 0, 5, 10, 2, 7, 12, 4, 9, 1, 6, 11, 3, and so on and on. If there is even one empty cell, the probe will find it. Using a prime number as the array size makes it impossible for any number to divide it evenly, so the probe sequence will eventually check every cell.
(如果数组容量是13,即一个质数,探测序列最终会访问所有单元。即0,5,10,2,7,12,4,9,1,6,11,3,一直下去。只要表中有一个空位,就可以探测到它。用质数作为数组容量使得任何数想整除它都是不可能的,因此探测序列会最终检查所有单元。)

 

  A similar effect occurs using the quadratic probe. In that case, however, the step size gets larger with each step, and will eventually overflow the variable holding it, thus preventing an endless loop.

(类似的影响在二次探测中也存在。然而,由于每步的步长都在变化,且最终会超出变量的范围,所以避免了无限的循环。)

 

  In general, double hashing is the probe sequence of choice when open addressing is used.

(使用开放地址策略时,探测序列通常用再哈希法生成。)


import java.io.*;
////////////////////////////////////////////////////////////////
class DataItem
                                  // (could have more items)
   private int iData;                // data item (key)
//--------------------------------------------------------------
   public DataItem(int ii)           // constructor
      { iData = ii; }
//--------------------------------------------------------------
   public int getKey()
      { return iData; }
//--------------------------------------------------------------
   // end class DataItem
////////////////////////////////////////////////////////////////
class HashTable
   {
   private DataItem[] hashArray;     // array is the hash table
   private int arraySize;
   private DataItem nonItem;         // for deleted items
// -------------------------------------------------------------
   HashTable(int size)               // constructor
      {
      arraySize = size;
      hashArray = new DataItem[arraySize];
      nonItem = new DataItem(-1);
      }
// -------------------------------------------------------------
   public void displayTable()
      {
      System.out.print("Table: ");
      for(int j=0; j<arraySize; j++)
         {
         if(hashArray[j] != null)
            System.out.print(hashArray[j].getKey()+ " ");
         else
            System.out.print("** ");
         }
      System.out.println("");
      }
// -------------------------------------------------------------
   public int hashFunc1(int key)
      {
      return key % arraySize;
      }
// -------------------------------------------------------------
   public int hashFunc2(int key)
      {
      // non-zero, less than array size, different from hF1
      // array size must be relatively prime to 5, 4, 3, and 2
      return 5 - key % 5;
      }
// -------------------------------------------------------------
                                     // insert a DataItem
   public void insert(int key, DataItem item)
   // (assumes table not full)
      {
      int hashVal = hashFunc1(key);  // hash the key
      int stepSize = hashFunc2(key); // get step size
                                     // until empty cell or -1
      while(hashArray[hashVal] != null &&
                      hashArray[hashVal].getKey() != -1)
         {
         hashVal += stepSize;        // add the step
         hashVal %= arraySize;       // for wraparound
         }
      hashArray[hashVal] = item;     // insert item
      // end insert()
// -------------------------------------------------------------
   public DataItem delete(int key)   // delete a DataItem
      {
      int hashVal = hashFunc1(key);      // hash the key
      int stepSize = hashFunc2(key);     // get step size

      while(hashArray[hashVal] != null)  // until empty cell,
                                      // is correct hashVal?
         if(hashArray[hashVal].getKey() == key)
            {
            DataItem temp = hashArray[hashVal]; // save item
            hashArray[hashVal] = nonItem;       // delete item
            return temp;                        // return item
            }
         hashVal += stepSize;            // add the step
         hashVal %= arraySize;           // for wraparound
         }
      return null;                   // can't find item
      // end delete()
// -------------------------------------------------------------
   public DataItem find(int key)     // find item with key
   // (assumes table not full)
      {
      int hashVal = hashFunc1(key);      // hash the key
      int stepSize = hashFunc2(key);     // get step size

      while(hashArray[hashVal] != null)  // until empty cell,
                                      // is correct hashVal?
         if(hashArray[hashVal].getKey() == key)
            return hashArray[hashVal];   // yes, return item
         hashVal += stepSize;            // add the step
         hashVal %= arraySize;           // for wraparound
         }
      return null;                   // can't find item
      }
// -------------------------------------------------------------
   // end class HashTable
////////////////////////////////////////////////////////////////

0

阅读 评论 收藏 转载 喜欢 打印举报/Report
  • 评论加载中,请稍候...
发评论

    发评论

    以上网友发言只代表其个人观点,不代表新浪网的观点或立场。

      

    新浪BLOG意见反馈留言板 欢迎批评指正

    新浪简介 | About Sina | 广告服务 | 联系我们 | 招聘信息 | 网站律师 | SINA English | 会员注册 | 产品答疑

    新浪公司 版权所有