Breaking CityHash64, MurmurHash2/3, wyhash, and more...
Hash functions are incredibly neat mathematical objects. They can map arbitrary data to a small fixed-size output domain such that the mapping is deterministic, yet appears to be random. This “deterministic randomness” is incredibly useful for a variety of purposes, such as hash tables, checksums, monte carlo algorithms, communication-less distributed algorithms, etc, the list goes on.
In this article we will take a look at the dark side of hash functions: when things go wrong. Luckily this essentially never happens due to unlucky inputs in the wild (for good hash functions, at least). However, people exist, and some of them may be malicious. Thus we must look towards computer security for answers. I will quickly explain some of the basics of hash function security and then show how easy it is to break this security for some commonly used non-cryptographic hash functions.
As a teaser, this article explains how you can generate strings such as these, thousands per second:
cityhash64("orlp-cityhash64-D-:K5yx*zkgaaaaa") == 1337
murmurhash2("orlp-murmurhash64-bkiaaa&JInaNcZ") == 1337
murmurhash3("orlp-murmurhash3_x86_32-haaaPa*+") == 1337
farmhash64("orlp-farmhash64-/v^CqdPvziuheaaa") == 1337
I also show how you can create some really funky pairs of strings that can be concatenated arbitrarily such that when concatenating $k$ strings together any of the $2^k$ combinations all have the same hash output, regardless of the seed used for the hash function:
a = "xx0rlpx!xxsXъВ"
b = "xxsXъВxx0rlpx!"
murmurhash2(a + a, seed) == murmurhash2(a + b, seed)
murmurhash2(a + a, seed) == murmurhash2(b + a, seed)
murmurhash2(a + a, seed) == murmurhash2(b + b, seed)
a = "!&orlpՓ"
b = "yǏglp$X"
murmurhash3(a + a, seed) == murmurhash3(a + b, seed)
murmurhash3(a + a, seed) == murmurhash3(b + a, seed)
murmurhash3(a + a, seed) == murmurhash3(b + b, seed)
Hash function security basics
Hash functions play a critical role in computer security. Hash functions are used not only to verify messages over secure channels, they are also used to identify trusted updates as well as known viruses. Virtually every signature scheme ever used starts with a hash function.
If a hash function does not behave randomly, we can break the above security constructs. Cryptographic hash functions thus take the randomness aspect very seriously. The ideal hash function would choose an output completely at random for each input, remembering that choice for future calls. This is called a random oracle.
The problem is that a random oracle requires a true random number generator, and more problematically, a globally accessible infinite memory bank. So we approximate it using deterministic hash functions instead. These compute their output by essentially shuffling their input really, really well, in such a way that it is not feasible to reverse.
To help quantify whether a specific function does a good job of approximating a random oracle, cryptographers came up with a variety of properties that a random oracle would have. The three most important and well-known properties a secure cryptographic hash function should satisfy are:
-
Pre-image resistance. For some constant $c$ it should be hard to find some input $m$ such that $h(m) = c$.
-
Second pre-image resistance. For some input $m_1$ it should be hard to find another input $m_2$ such that $h(m_1) = h(m_2)$.
-
Collision resistance. It should be hard to find inputs $m_1, m_2$ such that $h(m_1) = h(m_2)$.
We generally consider one of these properties broken if there exists a method that produces a collision or pre-image faster than simply trying random inputs (also known as a brute force attack). However, there are definitely gradations in breakage, as some methods are only several orders of magnitude faster than brute force. That may sound like a lot, but a method taking $2^{110}$ steps instead of $2^{128}$ are still both equally out of reach for today’s computers.
MD5 used to be a common hash function, and SHA-1 is still in common use today. While both were considered cryptographically secure at one point, generating MD5 collisions now takes less than a second on a modern PC. In 2017 a collaboration of researchers from CWI and Google and announced the first SHA-1 collision. However, as far as I’m aware, neither MD5 nor SHA-1 have practical (second) pre-image attacks, only theoretical ones.
Non-cryptographic hash functions
Cryptographically secure hash functions tend to have a small problem: they’re slow. Modern hash functions such as BLAKE3 resolve this somewhat by heavily vectorizing the hash using SIMD instructions, as well as parallelizing over multiple threads, but even then they require large input sizes before reaching those speeds.
A lot of problems don’t necessarily require secure hash functions, and people would much prefer a faster hash speed. Especially when we are computing many small hashes, such as in a hash table. Let’s take a look what common hash table implementations actually use as their hash for strings:
-
C++: there are multiple standard library implementations, but 64-bit
clang
13.0.0 on Apple M1 shipsCityHash64
. Currentlylibstdc++
shipsMurmurHash64A
, a variant of Murmur2 for 64-bit platforms. -
Java: OpenJDK uses an incredibly simple hash algorithm, which essentially just computes
h = 31 * h + c
for each characterc
. -
PHP: the Zend engine uses essentially the same algorithm as Java, just using unsigned integers and
33
as its multiplier. -
Nim: it used to use
MurmurHash3_x86_32
. While writing this article they appeared to have switched to use farmhash by default. -
Javascript: in V8 they use a custom weak string hash, with a randomly initialized seed.
There were some that used stronger hashes by default as well:
-
Go uses an AES-based hash if hardware acceleration is available on x86-64. Even though its construction is custom and likely not full-strength cryptographically secure, breaking it is too much effort and quite possibly beyond my capabilities.
If not available, it uses an algorithm inspired by wyhash.
-
Python and Rust use SipHash by default, which is a cryptographically secure pseudorandom function. This is effectively a hash function where you’re allowed to use a secret key during hashing, unlike a hash like SHA-2 where everyone knows all information involved.
This latter concept is actually really important, at least for protecting against HashDoS in hash tables. Even if a hash function is perfectly secure over its complete output, hash tables further reduce the output to only a couple bits to find the data it is looking for. For a static hash function without any randomness it’s possible to produce large lists of hashes that collide post-reduction, just by brute force. But for non-cryptographic hashes as we’ll see here we often don’t need brute force and can generate collisions at high speed for the full output, if not randomized by a random seed.
Interlude: inverse operations
Before we get to breaking some of the above hash functions, I must explain a basic technique I will use a lot: the inverting of operations. We are first exposed to this in primary school, where we might get faced by a question such as “$2 + x = 10$”. There we learn subtraction is the inverse of addition, such that we may find $x$ by computing $10 - 2 = 8$.
Most operations on the integer registers in computers are also invertible, despite the integers being reduced modulo $2^{w}$ in the case of overflow. Let us study some:
-
Addition can be inverted using subtraction. That is,
x += y
can be inverted usingx -= y
. Seems obvious enough. -
Multiplication by a constant $c$ is not inverted by division. This would not work in the case of overflow. Instead, we calculate the modular multiplicative inverse of $c$. This is an integer $c^{-1}$ such that $c \cdot c^{-1} \equiv 1 \pmod {m}$. Then we invert multiplication by $c$ simply by multiplying by $c^{-1}$.
This constant exists if and only if $c$ is coprime with our modulus $m$, which for us means that $c$ must be odd as $m = 2^n$. For example, multiplication by $2$ is not invertible, which is easy to see as such, as it is equivalent to a bit shift to the left by one position, losing the most significant bit forever.
Without delving into the details, here is a snippet of Python code that computes the modular multiplicative inverse of an integer using the extended Euclidean algorithm by calculating $x, y$ such that $$cx + my = \gcd(c, m).$$ Then, because $c$ is coprime we find $\gcd(c, m) = 1$, which means that $$cx + 0 \equiv 1 \pmod m,$$ and thus $x = c^{-1}$.
def egcd(a, b): if a == 0: return (b, 0, 1) g, y, x = egcd(b % a, a) return (g, x - (b // a) * y, y) def modinv(c, m): g, x, y = egcd(c, m) assert g == 1, "c, m must be coprime" return x % m
Using this we can invert modular multiplication:
>>> modinv(17, 2**32) 4042322161 >>> 42 * 17 * 4042322161 % 2**32 42
Magic!
-
XOR can be inverted using… XOR. It is its own inverse. So
x ^= y
can be inverted usingx ^= y
. -
Bit shifts can not be inverted, but two common operations in hash functions that use bit shifts can be. The first is bit rotation by a constant. This is best explained visually, for example a bit rotation to the left by 3 places on a 8-bit word, where each bit is shown as a letter:
abcdefghi defghiabc
The formula for a right-rotation of
k
places is(x >> k) | (x << (w - k))
, wherew
is the width of the integer type. Its inverse is a left-rotation, which simply swaps the direction of both shifts. Alternatively, the inverse of a right-rotation ofk
places is another right-rotation ofw-k
places. -
Another common operation in hash functions is the “xorshift”. It is an operation of one of the following forms, with $k > 0$:
x ^= x << k // Left xorshift. x ^= x >> k // Right xorshift.
How to invert it is entirely analogous between the two, so I will focus on the left xorshift.
An important observation is that the least significant $k$ bits are left entirely untouched by the xorshift. Thus by repeating the operation, we recover the least significant $2k$ bits, as the XOR will invert itself for the next $k$ bits. Let’s take a look at the resulting value to see how we should proceed:
v0 = (x << k) ^ x // Apply first step of inverse v1 = v0 ^ (v0 << k). v1 = (x << 2*k) ^ (x << k) ^ (x << k) ^ x // Simplify using self-inverse (x << k) ^ (x << k) = 0. v1 = (x << 2*k) ^ x
From this we can conclude the following identity: $$\operatorname{xorshift}(\operatorname{xorshift}(x, k), k) = \operatorname{xorshift}(x, 2k)$$ Now we only need one more observation to complete our algorithm: a xorshift of $k \geq w$ where $w$ is the width of our integer is a no-op. Thus we repeatedly apply our doubling identity until we reach large enough $q$ such that $\operatorname{xorshift}(x, 2^q \cdot k) = x$.
For example, to invert a left xorshift by 13 for 64-bit integers we apply the following sequence:
x ^= x << 13 // Left xorshift by 13. x ^= x << 13 // Inverse step 1. x ^= x << 26 // Inverse step 2. x ^= x << 52 // Inverse step 3. // x ^= x << 104 // Next step would be a no-op.
Armed with this knowledge, we can now attack.
Breaking CityHash64
Let us take a look at (part of) the source code of
CityHash64
from libcxx
that’s used for hashing strings on 64-bit platforms:
static const uint64_t mul = 0x9ddfea08eb382d69ULL;
static const uint64_t k0 = 0xc3a5c85c97cb3127ULL;
static const uint64_t k1 = 0xb492b66fbe98f273ULL;
static const uint64_t k2 = 0x9ae16a3b2f90404fULL;
static const uint64_t k3 = 0xc949d7c7509e6557ULL;
template<class T>
T loadword(const void* p) {
T r;
std::memcpy(&r, p, sizeof(r));
return r;
}
uint64_t rotate(uint64_t val, int shift) {
if (shift == 0) return val;
return (val >> shift) | (val << (64 - shift));
}
uint64_t hash_len_16(uint64_t u, uint64_t v) {
uint64_t x = u ^ v;
x *= mul;
x ^= x >> 47;
uint64_t y = v ^ x;
y *= mul;
y ^= y >> 47;
y *= mul;
return y;
}
uint64_t hash_len_17_to_32(const char *s, uint64_t len) {
const uint64_t a = loadword<uint64_t>(s) * k1;
const uint64_t b = loadword<uint64_t>(s + 8);
const uint64_t c = loadword<uint64_t>(s + len - 8) * k2;
const uint64_t d = loadword<uint64_t>(s + len - 16) * k0;
return hash_len_16(
rotate(a - b, 43) + rotate(c, 30) + d,
a + rotate(b ^ k3, 20) - c + len
);
}
To break this, let’s assume we’ll always give length 32 inputs. Then the
implementation will always call hash_len_17_to_32
, and we have full control
over variables a
, b
, c
and d
by changing our input.
Note that d
is only used once, in the final expression. This makes it a
prime target for attacking the hash. We will choose a
, b
and c
arbitrarily,
and then solve for d
to compute a desired hash outcome.
Using the above modinv
function we first compute the necessary modular multiplicative
inverses of mul
and k0
:
>>> 0x9ddfea08eb382d69 * 0xdc56e6f5090b32d9 % 2**64
1
>>> 0xc3a5c85c97cb3127 * 0x81bc9c5aa9c72e97 % 2**64
1
We also note that in this case the xorshift is easy to invert, as x ^= x >> 47
is simply its own inverse. Having all the components ready, we can invert
the function step by step.
We first load a
, b
and c
like in the hash function, and compute
uint64_t v = a + rotate(b ^ k3, 20) - c + len;
which is the second parameter to hash_len_16
. Then, starting from our
desired return value of hash_len_16(u, v)
we work backwards step by step, inverting
each operation to find the function argument u
that would result in our target hash
.
Then once we have found such the unique u
we compute our required input d
.
Putting it all together:
static const uint64_t mul_inv = 0xdc56e6f5090b32d9ULL;
static const uint64_t k0_inv = 0x81bc9c5aa9c72e97ULL;
void cityhash64_preimage32(uint64_t hash, char *s) {
const uint64_t len = 32;
const uint64_t a = loadword<uint64_t>(s) * k1;
const uint64_t b = loadword<uint64_t>(s + 8);
const uint64_t c = loadword<uint64_t>(s + len - 8) * k2;
uint64_t v = a + rotate(b ^ k3, 20) - c + len;
// Invert hash_len_16(u, v). Original operation inverted
// at each step is shown on the right, note that it is in
// the inverse order of hash_len_16.
uint64_t y = hash; // return y;
y *= mul_inv; // y *= mul;
y ^= y >> 47; // y ^= y >> 47;
y *= mul_inv; // y *= mul;
uint64_t x = y ^ v; // uint64_t y = v ^ x;
x ^= x >> 47; // x ^= x >> 47;
x *= mul_inv; // x *= mul;
uint64_t u = x ^ v; // uint64_t x = u ^ v;
// Find loadword<uint64_t>(s + len - 16).
uint64_t d = u - rotate(a - b, 43) - rotate(c, 30);
d *= k0_inv;
std::memcpy(s + len - 16, &d, sizeof(d));
}
The chance that a random uint64_t
forms 8 printable ASCII bytes is
$\left(94/256\right)^8 \approx 0.033%$. Not great, but cityhash64_preimage32
is so fast that having to repeat it on average ~3000 times to get a purely
ASCII result isn’t so bad.
For example, the following 10 strings all hash to 1337
using CityHash64, generated
using this code:
orlp-cityhash64-D-:K5yx*zkgaaaaa
orlp-cityhash64-TXb7;1j&btkaaaaa
orlp-cityhash64-+/LM$0 ;msnaaaaa
orlp-cityhash64-u'f&>I'~mtnaaaaa
orlp-cityhash64-pEEv.LyGcnpaaaaa
orlp-cityhash64-v~~bm@,Vahtaaaaa
orlp-cityhash64-RxHr_&~{miuaaaaa
orlp-cityhash64-is_$34#>uavaaaaa
orlp-cityhash64-$*~l\{S!zoyaaaaa
orlp-cityhash64-W@^5|3^:gtcbaaaa
Breaking MurmurHash2
We can’t let libstdc++
get away after targetting libc++
, can we?
The default string hash
calls
an implementation of MurmurHash2
with seed 0xc70f6907
. The hash—simplified to only handle strings whose
lengths are multiples of 8—is as follows:
uint64_t murmurhash64a(const char* s, size_t len, uint64_t seed) {
const uint64_t mul = 0xc6a4a7935bd1e995ULL;
uint64_t hash = seed ^ (len * mul);
for (const char* p = s; p != s + len; p += 8) {
uint64_t data = loadword<uint64_t>(p);
data *= mul;
data ^= data >> 47;
data *= mul;
hash ^= data;
hash *= mul;
}
hash ^= hash >> 47;
hash *= mul;
hash ^= hash >> 47;
return hash;
}
We can take a similar approach here as before. We note that the modular
multiplicative inverse of 0xc6a4a7935bd1e995
mod $2^{64}$ is
0x5f7a0ea7e59b19bd
. As an example, we can choose the first 24 bytes
arbitrarily, and solve for the last 8 bytes:
void murmurhash64a_preimage32(uint64_t hash, char* s, uint64_t seed) {
const uint64_t mul = 0xc6a4a7935bd1e995ULL;
const uint64_t mulinv = 0x5f7a0ea7e59b19bdULL;
// Compute the hash state for the first 24 bytes as normal.
uint64_t state = seed ^ (32 * mul);
for (const char* p = s; p != s + 24; p += 8) {
uint64_t data = loadword<uint64_t>(p);
data *= mul;
data ^= data >> 47;
data *= mul;
state ^= data;
state *= mul;
}
// Invert target hash transformation.
// return hash;
hash ^= hash >> 47; // hash ^= hash >> 47;
hash *= mulinv; // hash *= mul;
hash ^= hash >> 47; // hash ^= hash >> 47;
// Invert last iteration for last 8 bytes.
hash *= mulinv; // hash *= mul;
uint64_t data = state ^ hash; // hash = hash ^ data;
data *= mulinv; // data *= mul;
data ^= data >> 47; // data ^= data >> 47;
data *= mulinv; // data *= mul;
std::memcpy(s + 24, &data, 8); // data = loadword<uint64_t>(s);
}
The following 10 strings all hash to 1337
using MurmurHash64A with the
default seed 0xc70f6907
, generated using this code:
orlp-murmurhash64-bhbaaat;SXtgVa
orlp-murmurhash64-bkiaaa&JInaNcZ
orlp-murmurhash64-ewmaaa(%J+jw>j
orlp-murmurhash64-vxpaaag"93\Yj5
orlp-murmurhash64-ehuaaafa`Wp`/|
orlp-murmurhash64-yizaaa1x.zQF6r
orlp-murmurhash64-lpzaaaZphp&c F
orlp-murmurhash64-wsjbaa771rz{z<
orlp-murmurhash64-rnkbaazy4X]p>B
orlp-murmurhash64-aqnbaaZ~OzP_Tp
Universal collision attack on MurmurHash64A
In fact, MurmurHash64A is so weak that Jean-Philippe Aumasson, Daniel J. Bernstein and Martin Boßlet published an attack that creates sets of strings which collide regardless of the random seed used.
To see how it works, let’s take a look at the core loop of MurmurHash64A:
uint64_t data = loadword<uint64_t>(p);
data *= mul; // Trivially invertible.
data ^= data >> 47; // Trivially invertible.
data *= mul; // Trivially invertible.
state ^= data;
state *= mul;
We know we can trivially invert the operations done on data
regardless of what the
current state is, so we might as well have had the following body:
state ^= data;
state *= mul;
Now the hash starts looking rather weak indeed. The clever trick they employ is by creating two strings simultaneously, such that they differ precisely in the top bit in each 8-byte word. Why the top bit?
>>> 1 << 63
9223372036854775808
>>> (1 << 63) * mul % 2**64
9223372036854775808
Since mul
is odd, its least significant bit is set. Multiplying 1 << 63
by
it is equivalent to shifting that bit 63 places to the left, which is once again
1 << 63
. That is, 1 << 63
is a fixed point for the state *= mul
operation.
We also note that for the top bit XOR is equivalent to addition, as the overflow
from addition is removed mod $2^{64}$.
So if we have two input strings, one starting with the 8 bytes data
, and the
other starting with data ^ (1 << 63) == data + (1 << 63)
(after doing the
trivial inversions). We then find that the two states, regardless of seed,
differ exactly in the top bit after state ^= data
. After multiplication we
find we have two states x * mul
and (x + (1 << 63)) * mul == x * mul + (1 << 63)
… which again differ exactly in the top bit! We are now back to state ^= data
in our iteration, for the next 8 bytes. We can now use this moment to
cancel our top bit difference, by again feeding two 8-byte strings that
differ in the top bit (after inverting).
In fact, we only have to find one pair of such strings that differ in the top
bit, which we can then repeat twice (in either order) to cancel our difference
again. When represented as a uint64_t
if we choose the first string as x
we
can derive the second string as
x *= mul; // Forward transformation...
x ^= x >> 47; // ...
x *= mul; // ...
x ^= 1 << 63; // Difference in top bit.
x *= mulinv; // Backwards transformation...
x ^= x >> 47; // ...
x *= mulinv; // ...
I was unable to find a printable ASCII string that has another printable ASCII string as its partner. But I was able to find the following pair of 8-byte UTF-8 strings that differ in exactly the top bit after the Murmurhash64A input transformation:
xx0rlpx!
xxsXъВ
Combining them as such gives two 16-byte strings that when fed through the hash algorithm manipulate the state in the same way: a collision.
xx0rlpx!xxsXъВ
xxsXъВxx0rlpx!
But it doesn’t stop there. By concatenating these two strings we can create
$2^n$ different colliding strings each $16n$ bytes long. With the current
libstdc++
implementation the following prints the same number eight times:
std::hash<std::u8string> h;
std::u8string a = u8"xx0rlpx!xxsXъВ";
std::u8string b = u8"xxsXъВxx0rlpx!";
std::cout << h(a + a + a) << "\n";
std::cout << h(a + a + b) << "\n";
std::cout << h(a + b + a) << "\n";
std::cout << h(a + b + b) << "\n";
std::cout << h(b + a + a) << "\n";
std::cout << h(b + a + b) << "\n";
std::cout << h(b + b + a) << "\n";
std::cout << h(b + b + b) << "\n";
Even if the libstdc++
would randomize the seed used by MurmurHash64a, the
strings would still collide.
Breaking MurmurHash3
Nim uses
used to use
MurmurHash3_x86_32
,
so let’s try to break that.
If we once again simplify to strings whose lengths are a multiple of 4 we get the following code:
uint32_t rotl32(uint32_t x, int r) {
return (x << r) | (x >> (32 - r));
}
uint32_t murmurhash3_x86_32(const char* s, int len, uint32_t seed) {
const uint32_t c1 = 0xcc9e2d51;
const uint32_t c2 = 0x1b873593;
const uint32_t c3 = 0x85ebca6b;
const uint32_t c4 = 0xc2b2ae35;
uint32_t h = seed;
for (const char* p = s; p != s + len; p += 4) {
uint32_t k = loadword<uint32_t>(p);
k *= c1;
k = rotl32(k, 15);
k *= c2;
h ^= k;
h = rotl32(h, 13);
h = h * 5 + 0xe6546b64;
}
h ^= len;
h ^= h >> 16;
h *= c3;
h ^= h >> 13;
h *= c4;
h ^= h >> 16;
return h;
}
I think by now you should be able to get this function to spit out any
value you want if you know the seed.
The inverse of
rotl32(x, r)
is rotl32(x, 32-r)
and the inverse of h ^= h >> 16
is
once again just h ^= h >> 16
. Only h ^= h >> 13
is a bit different, it’s the first time
we’ve seen that a xorshift’s inverse has more than one step:
h ^= h >> 13
h ^= h >> 26
Compute the modular inverses
of c1
through c4
as well as 5
mod $2^{32}$, and go to town.
If you want to cheat or check your answer, you can check out the code
I’ve used to generate the following ten strings that all hash to 1337 when
fed to MurmurHash3_x86_32
with seed 0
:
orlp-murmurhash3_x86_32-haaaPa*+
orlp-murmurhash3_x86_32-saaaUW&<
orlp-murmurhash3_x86_32-ubaa/!/"
orlp-murmurhash3_x86_32-weaare]]
orlp-murmurhash3_x86_32-chaa5@/}
orlp-murmurhash3_x86_32-claaM[,5
orlp-murmurhash3_x86_32-fraaIx`N
orlp-murmurhash3_x86_32-iwaara&<
orlp-murmurhash3_x86_32-zwaa]>zd
orlp-murmurhash3_x86_32-zbbaW-5G
Nim uses 0
as a fixed seed.
Universal collision attack on MurmurHash3
Suppose that Nim didn’t use 0
as a fixed seed, but chose a randomly generated
one. Can we do a similar attack as the one done to MurmurHash2 to still generate
universal multicollisions?
Yes we can. Let’s take another look at that core loop body:
uint32_t k = loadword<uint32_t>(p);
k *= c1; // Trivially invertable.
k = rotl32(k, 15); // Trivially invertable.
k *= c2; // Trivially invertable.
h ^= k;
h = rotl32(h, 13);
h = h * 5 + 0xe6546b64;
Once again we can ignore the first three trivially invertable instructions as we
can simply choose our input so that we get exactly the k
we want.
Remember from last time that we want to introduce a difference in exactly the
top bit of h
, as the multiplication will leave this difference in place.
But here there is a bit rotation between the XOR and the multiplication.
The solution? Simply place our bit difference such that rotl32(h, 13)
shifts
it into the top position.
Does the addition of 0xe6546b64
mess things up? No. Since only the top bit
between the two states will be different, there is a difference of exactly
$2^{31}$ between the two states. This difference is maintained by the addition.
Since two 32-bit numbers with the same top bit can be at most $2^{31} - 1$
apart, we can conclude that the two states still differ in the top bit after
the addition.
So we want to find two pairs of 32-bit ints, such that after applying the first
three instructions the first pair differs in bit 1 << (31 - 13) == 0x00040000
and the second pair in bit 1 << 31 == 0x80000000
.
After some brute-force searching I found some cool pairs (again forced to use UTF-8), which when combined give the following collision:
a = "!&orlpՓ"
b = "yǏglp$X"
As before, any concatenation of a
s and b
s of length n
collides with all
other combinations of length n
.
Breaking FarmHash64
Nim switched to
farmhash
since I started writing this post. To break it we can notice that its structure
is very similar to CityHash64, so we can use those same techniques again. In
fact, the only changes between the two for lengths 17-32 bytes is that a few
operators were changed from subtraction/XOR to addition, a rotation operator had
its constant tweaked, and some k
constants are slightly tweaked in usage. The
process of breaking it is so similar that it’s entirely analogous, so we can
skip straight to the result.
These 10 strings all hash to 1337 with FarmHash64:
orlp-farmhash64-?VrJ@L7ytzwheaaa
orlp-farmhash64-p3`!SQb}fmxheaaa
orlp-farmhash64-pdt'cuI\gvxheaaa
orlp-farmhash64-IbY`xAG&ibkieaaa
orlp-farmhash64-[_LU!d1hwmkieaaa
orlp-farmhash64-QiY!clz]bttieaaa
orlp-farmhash64-&?J3rZ_8gsuieaaa
orlp-farmhash64-LOBWtm5Szyuieaaa
orlp-farmhash64-Mptaa^g^ytvieaaa
orlp-farmhash64-B?&l::hxqmfjeaaa
Trivial fixed-seed wyhash multicollisions
Zig uses wyhash with a fixed seed of zero. While I was unable to do seed-independent attacks against wyhash, using it with a fixed seed makes generating collisions trivial. Wyhash is built upon the folded multiply, which takes two 64-bit inputs, multiplies them to a 128-bit product before XORing together the two halves:
uint64_t folded_multiply(uint64_t a, uint64_t b) {
__uint128_t full = __uint128_t(a) * __uint128_t(b);
return uint64_t(full) ^ uint64_t(full >> 64);
}
It’s easy to immediately see a critical flaw with this: if one of the two sides is zero, the output will also always be zero. To protect against this, wyhash always uses a folded multiply in the following form:
out = folded_multiply(input_a ^ secret_a, input_b ^ secret_b);
where secret_a
and secret_b
are determined by the seed, or outputs of
previous iterations which are influenced by the seed. However, when your seed is
constant… With a bit of creativity
we can use the start of our string to prepare a ‘secret’ value which we can
perfectly cancel with another ASCII string later in the input.
So, without further ado, every 32-byte string of the form
orlp-wyhash-oGf_________tWJbzMJR
hashes to the same value with Zig’s default hasher.
Zig uses a different set of parameters than the defaults found in the wyhash
repository, so for good measure, this pattern provides arbitrary multicollisions
for the default parameters found in wyhash when using seed == 0
:
orlp-wyhash-EUv_________NLXyytkp
Conclusion
We’ve seen that a lot of the hash functions in common use in hash tables today are very weak, allowing fairly trivial attacks to produce arbitrary amounts of collisions if not randomly initialized. Using a randomly seeded hash table is paramount if you don’t wish to become a victim of a hash flooding attack.
We’ve also seen that some hash functions are vulnerable to attack even if randomly seeded. These are completely broken and should not be used if attacks are a concern at all. Luckily I was unable to find such attacks against most hashes, but the possibility of such an attack existing is quite unnerving.
With universal hashing it’s possible to construct hash functions for which such an attack is provably impossible, last year I published a hash function called polymur-hash that has this property. Your HTTPS connection to this website also likely uses a universal hash function for authenticity of the transferred data, both Poly1305 and GCM are based on universal hashing for their security proofs.
Of course, if your data is not user-controlled, or there is no reasonable security model where your application would face attacks, you can get away with faster and insecure hashes.
More to come on the subject of hashing and hash tables and how it can go right or wrong, but for now this article is long enough as-is…