Digest Hashing Algorithm

A digest hashing algorithm, or "hash function" (aka "checksum"), takes an input (e.g. the contents of a file) and generates an output value, referred to as a digest or hash.

A well-designed hash function has the following properties, which are valuable to digital preservation because they help document that no data has been lost and that a file is unique:

  • Any change to the input data, however small, should result in a different output value.
  • Regardless of the size of the input, the output value is always the same length. For example, the SHA-512 algorithm generates a 64-byte output, typically displayed to users as a string of 128 hexadecimal digits.
  • It should be effectively impossible to recreate the input from the output value.
  • It should be effectively impossible for two different inputs to generate the same output value (i.e., collisions should not occur in practice).
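
The first two properties can be observed directly. The following is a minimal sketch using Python's standard hashlib module; the input values are arbitrary illustrations, not data from any real collection.

```python
import hashlib

# Two inputs that differ by a single character (illustrative values only).
original = b"digital preservation"
altered = b"digital preservatioN"

digest_original = hashlib.sha512(original).hexdigest()
digest_altered = hashlib.sha512(altered).hexdigest()

# A small change to the input yields a different digest.
print(digest_original != digest_altered)

# The digest length is fixed regardless of input size:
# 64 bytes, displayed as 128 hexadecimal digits.
large_input = b"x" * 1_000_000
print(len(hashlib.sha512(b"").hexdigest()))
print(len(hashlib.sha512(large_input).hexdigest()))
```

The last two properties (one-wayness and collision resistance) cannot be demonstrated by a short script; they are design goals of the algorithm.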

These properties make such hash functions ideal tools for fixity checking.
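
As an illustration of fixity checking, the sketch below computes a file's SHA-512 digest and compares it with a digest recorded earlier. The function names sha512_of_file and fixity_ok, and the chunk size, are assumptions made for this example rather than part of any standard tool.

```python
import hashlib


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-512 hex digest of the file at `path`.

    The file is read in chunks so that large files need not fit in memory.
    """
    hasher = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def fixity_ok(path, recorded_digest):
    """Return True if the file's current digest matches the recorded one.

    The recorded digest is normalised (whitespace stripped, lower-cased)
    because stored hex digests vary in letter case.
    """
    return sha512_of_file(path) == recorded_digest.strip().lower()
```

In use, the digest would be recorded when a file is first ingested and fixity_ok would be run at later intervals; a False result indicates that the file's contents no longer match what was recorded.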

Examples of hash functions are:

  • MD5 — Generates a 128-bit output value. MD5 was originally designed to be a cryptographic (secure) hash function, but has since been found to suffer from extensive vulnerabilities.
  • SHA-512 — A more robust cryptographic hash function.  Part of the "SHA-2" family of hash functions.
  • C4 — A hash function based on SHA-512, but with some modifications and additional properties.
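
The output sizes of the first two algorithms can be checked with Python's standard hashlib module, as in the short sketch below. C4 is omitted because hashlib does not provide it, and MD5 may be unavailable on some restricted (e.g. FIPS-configured) Python builds.

```python
import hashlib

# Digest size in bits for each algorithm, as reported by hashlib.
digest_bits = {
    name: hashlib.new(name).digest_size * 8
    for name in ("md5", "sha512")
}

for name, bits in digest_bits.items():
    print(name, bits, "bits")
```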