An integer hash function accepts an integer hash key, and returns an integer hash result with uniform distribution. In this article, we will be discussing the construction of integer hash functions.

Hash table is an important data structure. All elementary data structure text books contain some algorithms of hash table. However, all too often the treatment of hash function is discussed as an after-thought. Thus examples abound in systems where the poor choice of the hash function resulted in inferior performance.

Certainly the integer hash function is the most basic form of the hash function. The integer hash function transforms an integer hash key into an integer hash result. For a hash function, the distribution should be uniform. This implies when the hash result is used to calculate hash bucket address, all buckets are equally likely to be picked. In addition, similar hash keys should be hashed to very different hash results. Ideally, a single bit change in the hash key should influence all bits of the hash result.

A good mixing function must be reversible. A hash function has form h(x) -> y. If the input word size and the output word size are identical, and in addition the operations in h() are reversible, then the following properties are true.

1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1

2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.

The case of h(x1) == y1, and h(x2) == y1 is called a collision. Using only reversible operations in a hash function makes collisions impossible.

Beside reversibility, the operations must use a chain of computations to achieve avalanche. Avalanche means that a single bit of difference in the input will make about 1/2 of the output bits be different. At a point in the chain, a new result is obtained by a computation involving earlier results.

For example, the operation a = a + b is reversible if we know the value of 'b', and the after value of 'a'. The before value of 'a' is obtained by subtracting the after value of 'a' with the value of 'b'.

In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.

Since 2654435761 and 2^32 has no common factors in common, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.

As a work around, one possible solution would be to multiply with different prime numbers, each time with the key's bit orders reversed. This will ensure we will scramble the bits sufficiently. However, the performance will be slowed down by the multiplication operations, and bit reversal operations. This can be a good work-around if performance is not a concern.

Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift. The following is the mixing part of the hash function.

#define mix(a,b,c) \ { \ a=a-b; a=a-c; a=a^(c>>13); \ b=b-c; b=b-a; b=b^(a<<8); \ c=c-a; c=c-b; c=c^(b>>13); \ a=a-b; a=a-c; a=a^(c>>12); \ b=b-c; b=b-a; b=b^(a<<16); \ c=c-a; c=c-b; c=c^(b>>5); \ a=a-b; a=a-c; a=a^(c>>3); \ b=b-c; b=b-a; b=b^(a<<10); \ c=c-a; c=c-b; c=c^(b>>15); \ }

Variable 'c' contains the input key. When the mixing is complete, variable 'c' also contains the hash result. Variable 'a', and 'b' contain initialized random bits. Notice the total number of internal state is 96 bits, much larger than the final output of 32 bits. Also notice the sequence of subtractions rolls through variable 'a' to variable 'c' three times. Each row will act on one variable, mixing in information from the other two variables, followed by a shift operation.

Subtraction is similar to multiplication in that changes in upper bits of the key do not influence lower bits of the addition. The 9 bit shift operations in Robert Jenkins' mixing algorithm shifts the key to the right 61 bits in total, and shifts the key to the left 34 bits in total. As the calculation is chained, each exclusive-or doubles the number of states. There are at least 2^9 different combined versions of the original key, shifted by various amounts. That is why a single bit change in the key can influence widely apart bits in the hash result.

The uniform distribution of the hash function can be determined from the nature of the subtraction operation. Look at a single bit subtraction operation between a key, and a random bit. If the random bit is 0, then the key remains unchanged. If the random bit is 1, then the key will be flipped. A carry will occur in the case where both the key bit and the random bit are 1. Subtracting the random bits will cause about half of the key bits to be flipped. So even if the key is not uniform, subtracting the random bits will result in uniform distribution.

In e-mail communication with Robert Jenkins, he mentioned that the following mix function may be more suitable for an integer hash function.

unsigned int inthash(unsigned int key) { key += (key << 12); key ^= (key >> 22); key += (key << 4); key ^= (key >> 9); key += (key << 10); key ^= (key >> 2); key += (key << 7); key ^= (key >> 12); return key; }

By keeping the internal state the same size as the output state, the function is shorter than the 96 bit state version. This function served as a starting point for my research. The number of operations are a little high; also notice the problem of h(0) = 0 in this hash function.

I have since searched for a faster version of the integer hash function. This is my latest version.

int inthash(int key) { key += ~(key << 15); key ^= (key >>> 10); key += (key << 3); key ^= (key >>> 6); key += ~(key << 11); key ^= (key >>> 16); return key; }

By taking advantages of the native instructions such as 'add complement', and 'shift & add', the above hash function runs in 11 machine cycles on HP 9000 workstations.

Having more rounds will strengthen the hash function by making the result more random looking, but performance will be slowed down accordingly. Simulation seems to prefer small shift amounts for inner rounds, and large shift amounts for outer rounds.

long longhash1(long key) { key += ~(key << 32); key ^= (key >>> 22); key += ~(key << 13); key ^= (key >>> 8); key += (key << 3); key ^= (key >>> 15); key += ~(key << 27); key ^= (key >>> 31); return key; }

The longer width of 64 bits require more mixing than the 32 bit version.

As previously mentioned, using multiplication requires a mechanism to transport changes from high bit positions to low bit positions. Bit reversal is best, but is slow to implement. The alternatives are left shifts, and Substitution box lookup.

This example hash function uses multiplication with odd constants, and left shifts. For more strength, one can simply add more rounds of multiplication, alternating between c1 and c2.

long longhash2(long key) { long c1 = 0x6e5ea73858134343L; long c2 = 0xb34e8f99a2ec9ef5L; key ^= ((c1 ^ key) >>> 32); key *= c1; key ^= ((c2 ^ key) >>> 31); key *= c2; key ^= ((c1 ^ key) >>> 32); return key; }

A substitution box is useful, because its output can impact all bit positions. We take the high bit position of the multiplication as input, and exclusive-or the value of the substitution box with the value of the multiplication result. Care need to be considered to make the operation reversible, by making sure the high bit is untouched.

int samplehash(int key) { int c1 = 0xd2d84a61; int c2 = 0x7832c9f4; key *= c1; key ^= (key < 0) ? c2 : (0x7ffffffff ^ c2); // 1 -> 32 Sbox ... }

If a CPU can dispatch multiple instructions in the same clock cycle, one can consider adding more parallelism in the formula.

For example, for the following formula, the two shift operations can be performed in parallel.

key ^= (key << 17) | (key >>> 16);

For 32 bit word operations, only certain pairs of shift amounts will be reversible. The valid pairs include: (17,16) (16,17) (14,19) (19,14) (13,20) (20,13) (10,23) (23,10) (8,25) (25,8)

Multiplication can be computed in parallel. Any multiplication with odd number is reversible.

key += (key << 4) + (key << 9); // key = key * 529

This is a test program testing the choices of the shift amounts with regard to the resulting avalanche property. The program detects if a certain bit position has both changes and no changes, based on a single bit flip.

Programmer uses hash table size that is power of 2 because address calculation can be performed very quickly. The integer hash function can be used to post condition the output of a marginal quality hash function before the final address calculation is done.

addr = inthash(marginal_hash_value) & (tablesize - 1);

Using the inlined version of the integer hash function may end up faster than doing a remaindering operation with a prime number! An integer remainder operation may take up to 18 cycles or longer to complete, depending on machine architecture.

In this article, we have examined a number of hash function construction algorithms. Knuth's multiplicative method is the simplest, but has some known defects. Robert Jenkins' 96 bit mix function can be used as an integer hash function, but is more suitable for hashing long keys. A dedicated hash function is well suited for hashing an integer number.

We have also presented an application of the integer hash function to improve the quality of a hash value.

Give feedbacks to wang@cup.hp.com