# 加载中...

whutgy
• 博客等级：
• 博客积分：0
• 博客访问：3,189
• 关注人气：0
• 获赠金笔：0支
• 赠出金笔：0支
• 荣誉徽章：

## Chapter 11-2: Hash Tables

(2009-09-07 16:17:44)

### 杂谈

(1)

In open addressing, collisions are resolved by looking for an open cell in the hash table. A different approach is to install a linked list at each index in the hash table. A data item's key is hashed to the index in the usual way, and the item is inserted into the linked list at that index. Other items that hash to the same index are simply added to the linked list; there's no need to search for empty cells in the primary array.

The load factor (the ratio of the number of items in a hash table to its size) is typically different in separate chaining than in open addressing. In separate chaining it's normal to put N or more items into an N-cell array; thus the load factor can be 1 or greater. There's no problem with this; some locations will simply contain two or more items in their lists.

In open addressing, performance degrades badly as the load factor increases above one half or two thirds. In separate chaining the load factor can rise above 1 without hurting performance very much. This makes separate chaining a more robust mechanism, especially when it's hard to predict in advance how much data will be placed in the hash table.
（在开放地址法中，当装填因子超过二分之一或三分之二后，性能下降的很快。在链地址法制，装填因子可以达到1以上，且对性能影响不大。因此，链地址法是更健壮的机制，特别是当事先难以确定哈希表要存储多少数据时更是如此。）

Deletion
In separate chaining, deletion poses no special problems as it does in open addressing. The algorithm hashes to the proper list and then deletes the item from the list. Because probes aren't used, it doesn't matter if the list at a particular cell becomes empty. （链地址法中，删除并没有像开放地址那样的问题。算法找到正确的链表，从链表中删除数据项。因为不需要探测，无所谓链表中的某一项是否为空。）

Table Size（表的容量）
With separate chaining it's not so important to make the table size a prime number, as it is with quadratic probes and double hashing. There are no probes in separate chaining, so there's no need to worry that a probe will go into an endless sequence because the step size divides evenly into the array size.
（如果用链地址法的话，表容量是质数的要求就不像在二次探测和再哈希法制显得那么重要。这里没有探测，所以不需要担心探测会因为表容量能被步长整除，而陷入无限的循环中。）

On the other hand, certain kinds of key distributions can cause data to cluster when the array size is not a prime number. We'll have more to say about this when we discuss hash functions.
（但是，当数组容量不是质数时，这种关键字的分布还是能够导致数据的聚集。在讨论哈希函数的时候还会谈到这个问题。）

import java.io.*;
////////////////////////////////////////////////////////////////
// (could be other items)
private int iData;                  // data item
// -------------------------------------------------------------
{ iData= it; }
// -------------------------------------------------------------
public int getKey()
{ return iData; }
// -------------------------------------------------------------
{ System.out.print(iData + " "); }
////////////////////////////////////////////////////////////////
class SortedList
{
private Link first;               // ref to first list item
// -------------------------------------------------------------
public void SortedList()          // constructor
{ first = null; }
// -------------------------------------------------------------
{
Link previous = null;          // start at first
// until end of list,
while( current != null && key > current.getKey() )
// or current > key,
previous = current;
current = current.next;     // go to next item
}
if(previous==null)             // if beginning of list,
else                           // not at beginning,
// end insert()
// -------------------------------------------------------------
public void delete(int key)       // delete link
// (assumes non-empty list)
Link previous = null;          // start at first
// until end of list,
while( current != null && key != current.getKey() )
// or key == current,
previous = current;
current = current.next;     // go to next link
}
if(previous==null)             //   if beginning of list
first = first.next;         //      delete first link
else                           //   not at beginning
previous.next = current.next; //    delete current link
// end delete()
// -------------------------------------------------------------
{
Link current = first;          // start at first
// until end of list,
while(current != null &&  current.getKey() <= key)
// or key too small,
if(current.getKey() == key)    // is this the link?
return current;          // found it, return link
current = current.next;     // go to next item
}
return null;                   // didn't find it
// end find()
// -------------------------------------------------------------
public void displayList()
{
System.out.print("List (first-->last): ");
Link current = first;       // start at beginning of list
while(current != null)      // until end of list,
{
current = current.next;  // move to next link
}
System.out.println("");
}
// end class SortedList
////////////////////////////////////////////////////////////////
class HashTable
{
private SortedList[] hashArray;   // array of lists
private int arraySize;
// -------------------------------------------------------------
public HashTable(int size)        // constructor
{
arraySize = size;
hashArray = new SortedList[arraySize];  // create array
for(int j=0; j<arraySize; j++)          // fill array
hashArray[j] = new SortedList();     // with lists
}
// -------------------------------------------------------------
public void displayTable()
{
for(int j=0; j<arraySize; j++) // for each cell,
{
System.out.print(j + ". "); // display cell number
hashArray[j].displayList(); // display list
}
}
// -------------------------------------------------------------
public int hashFunc(int key)      // hash function
{
return key % arraySize;
}
// -------------------------------------------------------------
{
int hashVal = hashFunc(key);   // hash the key
// end insert()
// -------------------------------------------------------------
public void delete(int key)       // delete a link
{
int hashVal = hashFunc(key);   // hash the key
// end delete()
// -------------------------------------------------------------
{
int hashVal = hashFunc(key);   // hash the key
}
// -------------------------------------------------------------
// end class HashTable
////////////////////////////////////////////////////////////////

(2)
Hash Functions

1. Quick Computation
A good hash function is simple, so it can be computed quickly. The major advantage of hash tables is their speed. If the hash function is slow, this speed will be degraded. A hash function with many multiplications and divisions is not a good idea. (The bitmanipulation facilities of Java or C++, such as shifting bits right to divide a number by a multiple of 2, can sometimes be used to good advantage.)

The purpose of a hash function is to take a range of key values and transform them into index values in such a way that the key values are distributed randomly across all the indices of the hash table. Keys may be completely random or not so random.

2. Random Keys（随机关键字）
A so-called perfect hash function maps every key into a different table location. This is only possible for keys that are unusually well behaved, and whose range is small enough to be used directly as array indices (as in the employee-number example at the beginning of this chapter).
(所谓完美的哈希函数把每个关键字都映射到表中不同的位置。只有在关键字组织得异乎寻常的好，且它的范围足够小，可以直接用于数组下标的时候，这种情况才出现)

In most cases neither of these situations exist, and the hash function will need to compress a larger range of keys into a smaller range of index numbers.
（大多数情况下，这种情况不会发生，哈希函数需要把较大的关键字值范围压缩成较小的数组下标的范围。）

The distribution of key values in a particular database determines what the hash function needs to do. In this chapter we've assumed that the data was randomly distributed over its entire range. In this situation the hash function
index = key % arraySize;
is satisfactory. It involves only one mathematical operation, and if the keys are truly random the resulting indices will be random too, and therefore well distributed.

3.Non-Random Keys（非随机关键字）
However, data is often distributed non-randomly. Imagine a database that uses car-part numbers as keys. Perhaps these numbers are of the form 033-400-03-94-05-0-535

Every part of the key (except non-data, as described above) should contribute to the hash function. Don't just use the first 4 digits or some such expurgation. The more data that contributes to the key, the more likely it is that the keys will hash evenly into the entire range of indices.

4. Use a Prime Number for the Modulo Base（使用质数作为取模的基数）
Often the hash function involves using the modulo operator (%) with the table size. We've already seen that it's important for the table size to be prime number when using a quadratic probe or double hashing. However, if the keys themselves may not be randomly distributed, it's important for the table size to be a prime number no matter what hashing system is used.（通常，哈希函数包含对数组容量的取模操作。前面已经看到，使用二次探测和再哈希法时，要求数组容量为质数是多么的重要。然而，如果关键字本身不是随机分布的，不论使用什么哈希化系统，都应该要求数组容量是质数。）

This is because, if many keys share a divisor with the array size, they may tend to hash to the same location, causing clustering. Using a prime table size eliminates this possibility. For example, if the table size is a multiple of 50 in our car part example, the category codes will all hash to index numbers that are multiples of 50. However, with a prime number such as 53, you are guaranteed that no keys will divide into the table size.
（这个论述是正确的，因为如果许多关键字共享一个数组容量作为除数，它们会趋向于映射到相同的位置，这会导致聚集。使用质数，可以消除这种可能性。）

The moral is to examine your keys carefully, and tailor your hash algorithm to remove any irregularity in the distribution of the keys.
（所以，应该仔细检查关键字，并修正哈希算法，删除任何关键字分布不规则的地方。）

(3)
Hashing Strings
We saw at the beginning of this chapter how to convert short strings to key numbers by multiplying digit codes by powers of a constant. In particular, we saw that the four-letter word cats could turn into a number by calculating
key = 3*27^3 + 1*27^2 + 20*27^1 + 19*27^0
This approach has the desirable attribute of involving all the characters in the input string. The calculated key value can then be hashed into an array index in the usual way:
index = (key) % arraySize;

Here's a Java method that finds the key value of a word:

public static int hashFunc1(String key)
{
int hashVal = 0;
int pow27 = 1; // 1, 27, 27*27, etc
for(int j=key.length()-1; j>=0; j--) // right to left
{
int letter = key.charAt(j) - 96; // get char code
hashVal += pow27 * letter; // times power of 27
pow27 *= 27; // next power of 27
}
return hashVal % arraySize;
} // end hashFunc1()

The loop starts at the rightmost letter in the word. If there are N letters, this is N–1. The numerical equivalent of the letter, according to the code we devised at the beginning of this chapter (a=1 and so on), is placed in letter. This is then multiplied by a power of 27, which is 1 for the letter at N–1, 27 for the letter at N–2, and so on.

The hashFunc1() method is not as efficient as it might be. Aside from the character conversion, there are two multiplications and an addition inside the loop. We can eliminate a multiplication by taking advantage of a mathematical identity called Horner's method. (Horner was an English mathematician, 1773–1827.) This states that an expression_r like
var4*n 4 + var3*n 3 + var2*n 2 + var1*n 1 + var0*n 0
can be written as
(((var4*n + var3)*n + var2)*n + var1)*n + var0
To uate this, we can start inside the innermost parentheses and work outward. If we translate this to a Java method we have the following code:

public static int hashFunc2(String key)
{
int hashVal = 0;
for(int j=0; j<key.length(); j++) // left to right
{
int letter = key.charAt(j) - 96; // get char code
hashVal = hashVal * 27 + letter; // multiply and add
}
return hashVal % arraySize; // mod
} // end hashFunc2()

Here we start with the leftmost letter of the word (which is somewhat more natural than starting on the right), and we have only one multiplication and one addition each time through the loop (aside from extracting the character from the string).

The hashFunc2() method unfortunately can't handle strings longer than about 7 letters. Longer strings cause the value of hashVal to exceed the size of type int. (If we used type long, the same problem would still arise for somewhat longer strings.)

Can we modify this basic approach so we don't overflow any variables? Notice that the key we eventually end up with is always less than the array size, because we apply the modulo operator. It's not the final index that's too big; it's the intermediate key values.

It turns out that with Horner's formulation we can apply the modulo (%) operator at each step in the calculation. This gives the same result as applying the modulo operator once at the end, but avoids overflow. (It does add an operation inside the loop.) The hashFunc3() method shows how this looks:

public static int hashFunc3(String key)
{
int hashVal = 0;
for(int j=0; j<key.length(); j++) // left to right
{
int letter = key.charAt(j) - 96; // get char code
hashVal = (hashVal * 27 + letter) % arraySize; // mod
}
return hashVal; // no mod
} // end hashFunc3()

This approach or something like it is normally taken to hash a string. Various bitmanipulation tricks can be played as well, such as using a base of 32 (or a larger power of 2) instead of 27, so that multiplication can be effected using the shift (>>) operator, which is faster than the modulo (%) operator.

You can use an approach similar to this to convert any kind of string to a number suitable for hashing. The strings can be words, names, or any other concatenation of characters.

0

• 评论加载中，请稍候...

发评论

以上网友发言只代表其个人观点，不代表新浪网的观点或立场。

新浪BLOG意见反馈留言板　欢迎批评指正

新浪公司 版权所有