How likely are hash collisions?

Bitcache by default uses the SHA-1 algorithm for fingerprinting data.
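
As a concrete illustration, here is a minimal Python sketch of SHA-1 fingerprinting (the fingerprint helper is hypothetical, not Bitcache's actual API):

    import hashlib

    # Hypothetical helper: fingerprint a block of data by its SHA-1
    # digest, as Bitcache does by default.
    def fingerprint(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    print(fingerprint(b""))
    # => da39a3ee5e6b4b0d3255bfef95601890afd80709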

Here follows a brief analysis of hash collision probabilities by the designers of the Plan 9 Venti file system, which also relies on SHA-1 for content addressability:

[Our design] requires a hash function that generates a unique fingerprint for every data block that a client may want to store. Obviously, if the size of the fingerprint is smaller than the size of the data blocks, such a hash function cannot exist since there are fewer possible fingerprints than blocks. If the fingerprint is large enough and randomly distributed, this problem does not arise in practice. For a server of a given capacity, the likelihood that two different blocks will have the same hash value, also known as a collision, can be determined. If the probability of a collision is vanishingly small, we can be confident that each fingerprint is unique.

It is desirable that Venti employ a cryptographic hash function. For such a function, it is computationally infeasible to find two distinct inputs that hash to the same value. This property is important because it prevents a malicious client from intentionally creating blocks that violate the assumption that each block has a unique fingerprint. As an additional benefit, using a cryptographic hash function strengthens a client's integrity check, preventing a malicious server from fulfilling a read request with fraudulent data. If the fingerprint of the returned block matches the requested fingerprint, the client can be confident the server returned the original data.
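
That integrity check is straightforward to express; a minimal Python sketch, assuming the client knows the fingerprint it requested (the verify helper is hypothetical):

    import hashlib

    # The returned block is genuine only if its SHA-1 digest matches the
    # fingerprint the client originally requested.
    def verify(requested_fingerprint: str, block: bytes) -> bool:
        return hashlib.sha1(block).hexdigest() == requested_fingerprint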

Venti uses the [SHA-1] hash function developed by the US National Institute of Standards and Technology (NIST). SHA-1 is a popular hash algorithm for many security systems and, to date, there are no known collisions. The output of SHA-1 is a 160 bit (20 byte) hash value. Software implementations of SHA-1 are relatively efficient; for example, a 700 MHz Pentium III can compute the SHA-1 hash of 8 Kbyte data blocks in about 130 microseconds, a rate of 60 Mbytes per second.
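
A rough equivalent of that measurement can be sketched in Python with the standard hashlib module (the numbers will of course differ on modern hardware, which typically far exceeds 60 Mbytes per second):

    import hashlib
    import time

    block = b"\x00" * 8192            # one 8 Kbyte data block
    iterations = 100_000

    start = time.perf_counter()
    for _ in range(iterations):
        hashlib.sha1(block).digest()
    elapsed = time.perf_counter() - start

    mb_per_s = iterations * len(block) / 1e6 / elapsed
    print(f"SHA-1 throughput: {mb_per_s:.0f} Mbytes per second")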

Are the 160 bit hash values generated by SHA-1 large enough to ensure the fingerprint of every block is unique? Assuming random hash values with a uniform distribution, a collection of n different data blocks and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.

    p ≤ n(n − 1)/2 × 1/2^b

Today, a large storage system may contain a petabyte (10^15 bytes) of data. Consider an even larger system that contains an exabyte (10^18 bytes) stored as 8 Kbyte blocks (~10^14 blocks). Using the SHA-1 hash function, the probability of a collision is less than 10^-20. Such a scenario seems sufficiently unlikely that we ignore it and use the SHA-1 hash as a unique identifier for a block. Obviously, as storage technology advances, it may become feasible to store much more than an exabyte, at which point it may be necessary to move to a larger hash function. NIST has already proposed variants of SHA-1 that produce 256, 384, and 512 bit results. For the immediate future, however, SHA-1 is a suitable choice for generating the fingerprint of a block.
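
The arithmetic is easy to verify; a small Python sketch evaluating the bound above for the exabyte scenario, using exact rational arithmetic:

    from fractions import Fraction

    n = 10**18 // 8192             # ~10^14 blocks of 8 Kbytes in an exabyte
    b = 160                        # SHA-1 output size in bits

    # p <= n(n-1)/2 × 1/2^b
    p = Fraction(n * (n - 1), 2) / 2**b
    print(f"p <= {float(p):.1e}")  # => p <= 5.1e-21, below 10^-20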