43

I have heard the word "hash" being used in different contexts (all within the world of computing) with different meanings. For example, in the book Learn Python the Hard Way, the chapter on dictionaries says: "Python calls them 'dicts.' Other languages call them 'hashes.'" So, are hashes dictionaries?

The other common usage of the word is in relation to encryption. I have also heard (and read) people using the word "hash" for a specific function within high-level programming.

So, what exactly is it?

Can anyone (with time and who is knowledgeable) kindly explain the nitty-gritty of "hash" (or "hashes")?

Basil Ajith

4 Answers

47

The Wikipedia article on hash functions is very good, but I will here give my take.


What is a hash?

"Hash" is really a broad term with different formal meanings in different contexts. There is not a single perfect answer to your question. I will explain the general underlying concept and mention some of the most common usages of the term.

A "hash" can mean a function $h$, more precisely called a hash function, that takes objects as input and outputs a string or a number. The inputs are usually members of basic data types such as strings or integers, or larger objects composed of other objects, such as user-defined structures. The output is typically a number or a string. The noun "hash" often refers to this output. The verb "hash" often means "apply a hash function". The main properties that a hash function should have are:

  1. It should be easy to compute and
  2. The outputs should be relatively small.

Example:

Say we want to hash numbers in the range from 0 to 999,999,999 to numbers between 0 and 99. One simple hash function is $h(x) = x \bmod 100$.
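That example translates directly into Python. This is just a sketch of the formula; the sample inputs are arbitrary choices of mine.

```python
def h(x: int) -> int:
    """Hash an integer in 0..999,999,999 down to a number in 0..99."""
    return x % 100


print(h(123456789))  # 89
print(h(500))        # 0
print(h(999999999))  # 99
```

Note that many different inputs share an output (for instance 89, 189, and 123456789 all hash to 89), which is unavoidable when the outputs are fewer than the inputs.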

Common additional properties:

Depending on use case we might want the hash function to satisfy additional properties. Here are some common additional properties:

  1. Uniformity: Often we want the hashes of objects to be distinct. Moreover, we may want the hashes to be spread out. If I want to hash some objects down into 100 buckets (so the output of my hash function is a number from 0-99), then I am usually hoping that about 1/100 of the objects land in bucket 0, about 1/100 land in bucket 1, and so on.

  2. Cryptographic collision resistance: Sometimes this is taken even further. For instance, in cryptography I may want a hash function such that it is computationally difficult for an adversary to find two different inputs that map to the same output.

  3. Compression: I often want to hash arbitrarily-large inputs down into a constant-size output or fixed number of buckets.

  4. Determinism: I may want a hash function whose output doesn't change between runs, i.e. the output of the hash function on the same object will always remain the same. This may seem to conflict with uniformity above, but one solution is to choose the hash function randomly once, and not change it between runs.
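To get a feel for uniformity and determinism, here is a rough sketch that hashes 100,000 made-up keys into 100 buckets and looks at how full the emptiest and fullest buckets are. The `bucket` helper, the `user-{i}` key format, and the choice of SHA-256 as the underlying function are all my own assumptions for illustration. I use `hashlib` rather than Python's built-in `hash()` because the built-in hash of a string is randomized per process by default, so it is not deterministic between runs.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 100  # hypothetical number of buckets


def bucket(key: str) -> int:
    """Map a string to a bucket number in 0..NUM_BUCKETS-1."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an integer, then reduce.
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


counts = Counter(bucket(f"user-{i}") for i in range(100_000))

# With a roughly uniform hash, every bucket should hold close to 1,000 keys.
print("emptiest bucket:", min(counts.values()))
print("fullest bucket: ", max(counts.values()))

# Determinism: the same key always lands in the same bucket.
print(bucket("user-42") == bucket("user-42"))  # True
```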


Some applications

One common application is in data structures such as a hash table, which is one way to implement a dictionary. Here, you allocate some memory, say, 100 "buckets"; then, when asked to store a (key, value) pair in the dictionary, you hash the key into a number 0-99, and store the pair in the corresponding bucket in memory. Then, when you are asked to look up a key, you hash the key into a number 0-99 with the same hash function and check that bucket to see if that key is in there. If so, you return its value.
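The store-and-look-up procedure just described can be sketched in a few lines of Python. The class name `BucketDict` and its `put`/`get` methods are hypothetical names of mine, and keeping a small list per bucket (so that two keys landing in the same bucket can coexist) is one common way to handle collisions, not the only one.

```python
class BucketDict:
    """A toy dictionary backed by a fixed number of buckets."""

    def __init__(self, num_buckets: int = 100) -> None:
        # Each bucket is a list of (key, value) pairs.
        self._buckets = [[] for _ in range(num_buckets)]

    def _index(self, key) -> int:
        # Hash the key, then reduce it to a bucket number 0..num_buckets-1.
        return hash(key) % len(self._buckets)

    def put(self, key, value) -> None:
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        # Only the one bucket the key hashes to needs to be searched.
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


d = BucketDict()
d.put("apple", 3)
d.put(1, "one")
d.put(101, "one hundred and one")  # lands in the same bucket as 1
print(d.get("apple"))  # 3
print(d.get(1))        # one
print(d.get(101))      # one hundred and one
```

Real hash tables also grow the number of buckets as they fill up; this sketch leaves that out.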

Note that you could also implement dictionaries in other ways, such as with a binary search tree (if your objects are comparable).

Another practical application is checksums, which are ways to check that two files are the same (for example, that a file was not corrupted from its previous version). Because good hash functions are very unlikely to map two different inputs to the same output, you compute and store a hash of the first file, usually represented as a string. This hash is very small, maybe only a few dozen ASCII characters. Then, when you get the second file, you hash that and check that the output is the same. If so, almost certainly it is the exact same file byte-for-byte.
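As a small illustration of the checksum idea, here is a sketch using Python's standard `hashlib` module. The `checksum` helper and the sample byte strings are made up for the example, and SHA-256 is simply one reasonable choice of hash function; with real files you would read the bytes from disk instead.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


original = b"hello world\n"
stored = checksum(original)  # computed once and kept alongside the file

received_ok = b"hello world\n"
received_bad = b"hello w0rld\n"  # one byte differs

print(len(stored))                       # 64
print(checksum(received_ok) == stored)   # True
print(checksum(received_bad) == stored)  # False
```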

Another application is in cryptography, where these hashes should be hard to "invert" -- that is, given the output and the hash function, it should be computationally hard to figure out the input(s) that led to that output. One use of this is for passwords: Instead of storing the password itself, you store a cryptographic hash of the password (maybe with some other ingredients). Then, when a user enters a password, you compute its hash and check that it matches the correct hash; if so, you say the password is correct. (Now even someone who can look and find out the hash saved on the server does not have such an easy time pretending to be the user.) This application can be a case where the output is just as long or longer than the input, since the input is so short.
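The password scheme above might look roughly like this in Python. Treat it as a sketch under assumptions: the function names `make_record` and `verify` are hypothetical, the random salt plays the role of the "other ingredients", and PBKDF2 with 100,000 iterations is an illustrative choice of mine rather than a recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value, not a security recommendation


def make_record(password: str) -> tuple:
    """Return (salt, digest) to store instead of the password itself."""
    salt = os.urandom(16)  # random per-user salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify(password: str, salt: bytes, digest: bytes) -> bool:
    """Hash the attempt with the stored salt and compare to the stored digest."""
    attempt = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(attempt, digest)


salt, digest = make_record("hunter2")
print(len(digest))                       # 32 (bytes), longer than the password
print(verify("hunter2", salt, digest))   # True
print(verify("hunter3", salt, digest))   # False
```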

usul
11

A hash function is a function that takes an input and produces a value of fixed size. For example you might have a hash function stringHash that accepts a string of any length and produces a 32-bit integer.
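To make that concrete, here is one way such a `stringHash` could be written in Python (spelled `string_hash` to follow Python naming). The algorithm is 32-bit FNV-1a, which I picked only because it is short; it is an assumption for illustration, not something the definition above requires.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 32-bit FNV prime


def string_hash(text: str) -> int:
    """Hash a string of any length to a 32-bit integer using FNV-1a."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep only the low 32 bits
    return h


print(hex(string_hash("")))   # 0x811c9dc5
print(hex(string_hash("a")))  # 0xe40c292c
print(string_hash("a much longer string " * 1000) < 2**32)  # True
```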

Typically it is correct to say that the output of a hash function is a hash (also known as a hash value or a hash sum). However, sometimes people refer to the function itself as a hash. This is technically incorrect, but usually overlooked as it is generally understood (in context) that the person meant hash function.

The typical usage of a hash function is to implement a hash table. A hash table is a data structure that associates values with other values typically referred to as keys. It does this by using a hash function on the key to produce a fixed-sized hash value that it can use for fast look-up of the data it stores. I won't go into the full detail as to how it does that, but the key fact here is that it is called a hash table because it relies upon a hash function to produce hash values (hashes).

This is where some of the confusion comes in, because some people (again, somewhat incorrectly) refer to a hash table as a hash. As stated in other answers, sometimes a given language's implementation of a hash table refers to the hash table as a hash (notably Perl does this, though I expect other languages do as well). Other languages choose to refer to their implementation of a hash table as a dictionary. Python is one of these languages, but owing to how ingrained in the language they are, many Python users shorten the term dictionary to 'dict'.
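A short Python session shows the connection between dicts and hashing. The `prices` dictionary is a made-up example; the behaviour relies on CPython's `dict` being implemented as a hash table, which is why keys must be hashable.

```python
prices = {"apple": 3, "pear": 5}  # a dict associates keys with values
print(prices["apple"])            # 3

# The built-in hash() returns the hash value (an integer) of a key.
print(isinstance(hash("apple"), int))  # True
print(hash("apple") == hash("apple"))  # True within a single run

# Objects without a hash value, such as lists, cannot be used as keys.
try:
    {["a", "list"]: 1}
except TypeError as exc:
    print("rejected as a key:", exc)
```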

So whilst the correct use of the term hash is to refer to the hash value produced by a hash function, people also sometimes use the term informally to refer to hash functions and hash tables, hence creating the confusion.

Pharap
2

A hash function is broadly any function where the image is smaller than the domain. The output of such a function f(x) can be referred to as "the hash of x".

In computer science we typically encounter two applications of hash functions.

The first is for data structures such as hash tables, where we want to map the key domain (e.g. 32-bit integers or arbitrary-length strings) to an array index (e.g. integer between 0 and 100). The goal here is to maximise the performance of the data structure; properties of the hash function that are typically desirable are simplicity and uniform output distribution.

Perl calls its built-in associative array type a "hash", which appears to be what is causing your confusion here. I don't know of any other languages that do this. Loosely the data structure could be seen as a hash function itself (where the domain is the current set of keys), but is also implemented as a hash table.

The second is for cryptography: message authentication, password/signature verification, etc. The domain is typically arbitrary byte strings. Here we are concerned with security - which sometimes means deliberately low performance - where useful properties are collision and pre-image resistance.

OrangeDog
-1

I'll try just to add a short summary of what others say.

Hash function

There is a special kind of function called a hash function.

"SHA256 is a well-known hash function that is cryptographically secure"

Three main applications are:

  • hash tables,
  • checksums (data integrity checks, e.g. in hard drives or ADSL protocols), and
  • cryptography (various forms of cryptographic authentication including but not limited to digital signatures and secure password storage).

Hash table

A hash table is a data structure for fast search. It uses hash functions internally, hence the name.

"Databases use hash tables and search trees internally to speed up execution of search requests"

Hash

  1. a dictionary abstract data type

"Hash" is the official name of built-in dictionaries in Perl. They are hash tables internally, hence the name. "This subroutine accepts a hash as its first argument". These days the term can be used for any associative array, not necessarily a hash table.

  2. the result of applying a hash function to some input

"MD5 hashes of the .iso images are provided to check their integrity after downloading".

nponeccop