Hash Functions

Hash functions have two principal applications: cryptographic hash functions, and non-cryptographic hash functions.

Cryptographic hash functions must have a much lower probability of collision, since a collision could result in identity theft or forged documents.

Non-cryptographic hash functions are mainly used for mapping keys to values, so a collision is an inconvenience rather than a problem.

Non-Crypto Hash Functions

Trivial Hash Functions

Trivial hash functions use the data itself as the hash value. This works if the data are small in value or limited in length. (Example: hashing integers below 100. Can just use the integers themselves as their own index in the hash collection.)

Checksums

For longer sequences of data, be they strings or bytes, the approach is to break the sequence into small parts, and to make the output dependent on each piece, and each piece dependent on the others. This can be done with an iterative algorithm that maintains a "state" (integer or set of integers) modified by each chunk of the stream.

For example:

# Start by initializing the state
S = S0

# For each piece of data,
for k in range(1,m):
    # fold the current chunk of data into the state
    S = modifyHash(S, b[k])
# Extract the hash value from the state
return returnHash(S, n)

Perfect Hashing

Perfect hash functions, as indicated by the term "perfect," fulfill everything that hash functions must do. Specifically, perfect hash functions are injective and map each valid input to a different valid output.

(If the map were a function, we would say the function is a one-to-one function.)

Minimal Perfect Hashing

A minimal perfect hash maps every valid input to a different valid output, but do so in a way that requires minimal space, so that all of the outputs are consecutive.

A minimal perfect hash is very challenging to find.

Avoiding Collisions

If you're using a simpler hash function, it is possible to determine the effect that similar inputs will have on the output. For example, if hashing phone numbers, most of the leading digits will be similar. Any method that depends heavily on the leading digits is likely to lead to higher collisions.

Therefore, the distribution of data can have an affect on the hash function and make different hash functions better or worse choices. Ultimately it is important to test your hash function of interest on your data set of interest.

Flags