What is a hash function ?
Let’s start gently. A hash function is a mathematical function which takes a string of any size in input and gives us a string of static size in output. Here is an example :
The output size depend on which hash algorithm you choose to use, in the above example we use the algorithm SHA-256 which produce a 256 bits output (64 characters because 256 bits = 32 bytes and 1 byte is encoded with 2 characters).
Property #1 : It hides your ass very well
As you just seen above, “John” became a bunch of letter representing a big number in hexadecimal. The hash function used a lot of maths, and roughly chopped “John” several times in a way nobody can recognize or reconstruct it. Trying to find “John” only from his hash is like trying to reconstruct your fruits after having mixing them together to make a smoothie. It is impossible. This kind of function have a name: we call it a one way function.
Property #2 : It can be used as a footprint
Hash functions are deterministic, which mean that the hash for “John” will always be the same. You can try one hundred times, the output won’t change. However, if you change just one little thing, the hash will be TOTALLY different.
This is a very interesting property of hash functions because not only it shows that there is no logic way to guess an input from its hash, but it also give a fixed size identity to this input. As we said in the hash function definition, the input size can be as long as possible. For example, you can put a whole book’s content as a hash function input : it still give you a hash of the same size and with the same properties.
It is for example possible to check if a file was successfully received by a recipient by comparing the hash before the transfer with the hash after.
Property #3 : It is strong against collision attacks
There is a lot of different hash functions, and they all have a fixed size output. Let’s keep using the SHA-256 hash function in our examples, because it is the one used by the Bitcoin protocol. SHA-256 produces a 256 bits output, which is a number between 0 and 2²⁵⁶.
2²⁵⁶ is a freaking big number, but as we said the input can be of any size, so there is an infinity of possible inputs. This lead us to something problematic about hash functions : there is less possible outputs than inputs, so a same hash can in theory be produced by two or more different inputs. This is what we call a collision.
So okay, we know it should be possible to find something different than “John” which give the same hash, so how do we find it ?
The good new is, there is a way to find it with a 99.8% probability.
The bad new is, we will all be dead before we can solve it.
Why that ? Because of the hiding property (#1), the only possible way to find a collision is to try random inputs until we find it. All good hash functions are designed to avoid this issue. If a hash function takes too much time to give a result, it can be inefficient for honest users, but if it’s too fast, it can helps attackers to quickly find a collision. Therefore, hash functions are designed to be not too fast, not too slow.
Finding a collision is in theory possible, but in practice infeasible. According to maths, we have to try 2¹³⁰ inputs before being 99.8% certain to find a collision. To give you a hint about the time it would take, if all computers in the world would try together to find a collision since the beginning of the Universe, the probability to find a collision would still be negligible.
I hope this post helped you understand hash functions, and why they are heavily used to hide information. Please don’t hesitate if you have any question !