Breaking CityHash64, MurmurHash2/3, wyhash, and more...

2024-11-02

Hash functions are incredibly neat mathematical objects. They can map arbitrary data to a small fixed-size output domain such that the mapping is deterministic, yet appears to be random. This “deterministic randomness” is incredibly useful for a variety of purposes, such as hash tables, checksums, monte carlo algorithms, communication-less distributed algorithms, etc, the list goes on.

In this article we will take a look at the dark side of hash functions: when things go wrong. Luckily this essentially never happens due to unlucky inputs in the wild (for good hash functions, at least). However, people exist, and some of them may be malicious. Thus we must look towards computer security for answers. I will quickly explain some of the basics of hash function security and then show how easy it is to break this security for some commonly used non-cryptographic hash functions.

As a teaser, this article explains how you can generate strings such as these, thousands per second:

 cityhash64("orlp-cityhash64-D-:K5yx*zkgaaaaa") == 1337
murmurhash2("orlp-murmurhash64-bkiaaa&JInaNcZ") == 1337
murmurhash3("orlp-murmurhash3_x86_32-haaaPa*+") == 1337
 farmhash64("orlp-farmhash64-/v^CqdPvziuheaaa") == 1337

I also show how you can create some really funky pairs of strings that can be concatenated arbitrarily such that when concatenating $k$ strings together any of the $2^k$ combinations all have the same hash output, regardless of the seed used for the hash function:

a = "xx0rlpx!xxsXъВ"
b = "xxsXъВxx0rlpx!"
murmurhash2(a + a, seed) == murmurhash2(a + b, seed)
murmurhash2(a + a, seed) == murmurhash2(b + a, seed)
murmurhash2(a + a, seed) == murmurhash2(b + b, seed)

a = "!&orlpՓ"
b = "yǏglp$X"
murmurhash3(a + a, seed) == murmurhash3(a + b, seed)
murmurhash3(a + a, seed) == murmurhash3(b + a, seed)
murmurhash3(a + a, seed) == murmurhash3(b + b, seed)

Hash function security basics

Hash functions play a critical role in computer security. Hash functions are used not only to verify messages over secure channels, they are also used to identify trusted updates as well as known viruses. Virtually every signature scheme ever used starts with a hash function.

If a hash function does not behave randomly, we can break the above security constructs. Cryptographic hash functions thus take the randomness aspect very seriously. The ideal hash function would choose an output completely at random for each input, remembering that choice for future calls. This is called a random oracle.

The problem is that a random oracle requires a true random number generator, and more problematically, a globally accessible infinite memory bank. So we approximate it using deterministic hash functions instead. These compute their output by essentially shuffling their input really, really well, in such a way that it is not feasible to reverse.

To help quantify whether a specific function does a good job of approximating a random oracle, cryptographers came up with a variety of properties that a random oracle would have. The three most important and well-known properties a secure cryptographic hash function should satisfy are:

Pre-image resistance. For some constant $c$ it should be hard to find some input $m$ such that $h(m) = c$.
Second pre-image resistance. For some input $m_1$ it should be hard to find another input $m_2$ such that $h(m_1) = h(m_2)$.
Collision resistance. It should be hard to find inputs $m_1, m_2$ such that $h(m_1) = h(m_2)$.

We generally consider one of these properties broken if there exists a method that produces a collision or pre-image faster than simply trying random inputs (also known as a brute force attack). However, there are definitely gradations in breakage, as some methods are only several orders of magnitude faster than brute force. That may sound like a lot, but a method taking $2^{110}$ steps instead of $2^{128}$ are still both equally out of reach for today’s computers.

MD5 used to be a common hash function, and SHA-1 is still in common use today. While both were considered cryptographically secure at one point, generating MD5 collisions now takes less than a second on a modern PC. In 2017 a collaboration of researchers from CWI and Google and announced the first SHA-1 collision. However, as far as I’m aware, neither MD5 nor SHA-1 have practical (second) pre-image attacks, only theoretical ones.

Non-cryptographic hash functions

Cryptographically secure hash functions tend to have a small problem: they’re slow. Modern hash functions such as BLAKE3 resolve this somewhat by heavily vectorizing the hash using SIMD instructions, as well as parallelizing over multiple threads, but even then they require large input sizes before reaching those speeds.

A lot of problems don’t necessarily require secure hash functions, and people would much prefer a faster hash speed. Especially when we are computing many small hashes, such as in a hash table. Let’s take a look what common hash table implementations actually use as their hash for strings:

C++: there are multiple standard library implementations, but 64-bit clang 13.0.0 on Apple M1 ships CityHash64. Currently libstdc++ ships MurmurHash64A, a variant of Murmur2 for 64-bit platforms.
Java: OpenJDK uses an incredibly simple hash algorithm, which essentially just computes h = 31 * h + c for each character c.
PHP: the Zend engine uses essentially the same algorithm as Java, just using unsigned integers and 33 as its multiplier.
Nim: it used to use MurmurHash3_x86_32. While writing this article they appeared to have switched to use farmhash by default.
Zig: it uses wyhash by default, with 0 as seed.
Javascript: in V8 they use a custom weak string hash, with a randomly initialized seed.

There were some that used stronger hashes by default as well:

Go uses an AES-based hash if hardware acceleration is available on x86-64. Even though its construction is custom and likely not full-strength cryptographically secure, breaking it is too much effort and quite possibly beyond my capabilities.

If not available, it uses an algorithm inspired by wyhash.
Python and Rust use SipHash by default, which is a cryptographically secure pseudorandom function. This is effectively a hash function where you’re allowed to use a secret key during hashing, unlike a hash like SHA-2 where everyone knows all information involved.

This latter concept is actually really important, at least for protecting against HashDoS in hash tables. Even if a hash function is perfectly secure over its complete output, hash tables further reduce the output to only a couple bits to find the data it is looking for. For a static hash function without any randomness it’s possible to produce large lists of hashes that collide post-reduction, just by brute force. But for non-cryptographic hashes as we’ll see here we often don’t need brute force and can generate collisions at high speed for the full output, if not randomized by a random seed.

Interlude: inverse operations

Before we get to breaking some of the above hash functions, I must explain a basic technique I will use a lot: the inverting of operations. We are first exposed to this in primary school, where we might get faced by a question such as “$2 + x = 10$”. There we learn subtraction is the inverse of addition, such that we may find $x$ by computing $10 - 2 = 8$.

Most operations on the integer registers in computers are also invertible, despite the integers being reduced modulo $2^{w}$ in the case of overflow. Let us study some:

Addition can be inverted using subtraction. That is, x += y can be inverted using x -= y. Seems obvious enough.
Multiplication by a constant $c$ is not inverted by division. This would not work in the case of overflow. Instead, we calculate the modular multiplicative inverse of $c$. This is an integer $c^{-1}$ such that $c \cdot c^{-1} \equiv 1 \pmod {m}$. Then we invert multiplication by $c$ simply by multiplying by $c^{-1}$.

This constant exists if and only if $c$ is coprime with our modulus $m$, which for us means that $c$ must be odd as $m = 2^n$. For example, multiplication by $2$ is not invertible, which is easy to see as such, as it is equivalent to a bit shift to the left by one position, losing the most significant bit forever.

Without delving into the details, here is a snippet of Python code that computes the modular multiplicative inverse of an integer using the extended Euclidean algorithm by calculating $x, y$ such that $$cx + my = \gcd(c, m).$$ Then, because $c$ is coprime we find $\gcd(c, m) = 1$, which means that $$cx + 0 \equiv 1 \pmod m,$$ and thus $x = c^{-1}$.
```
def egcd(a, b):
    if a == 0: return (b, 0, 1)
    g, y, x = egcd(b % a, a)
    return (g, x - (b // a) * y, y)

def modinv(c, m):
    g, x, y = egcd(c, m)
    assert g == 1, "c, m must be coprime"
    return x % m
```
Using this we can invert modular multiplication:
```
>>> modinv(17, 2**32)
4042322161
>>> 42 * 17 * 4042322161 % 2**32
42
```
Magic!
XOR can be inverted using… XOR. It is its own inverse. So x ^= y can be inverted using x ^= y.
Bit shifts can not be inverted, but two common operations in hash functions that use bit shifts can be. The first is bit rotation by a constant. This is best explained visually, for example a bit rotation to the left by 3 places on a 8-bit word, where each bit is shown as a letter:
```
abcdefghi
defghiabc
```
The formula for a right-rotation of k places is (x >> k) | (x << (w - k)), where w is the width of the integer type. Its inverse is a left-rotation, which simply swaps the direction of both shifts. Alternatively, the inverse of a right-rotation of k places is another right-rotation of w-k places.
Another common operation in hash functions is the “xorshift”. It is an operation of one of the following forms, with $k > 0$:
```
x ^= x << k  // Left xorshift.
x ^= x >> k  // Right xorshift.
```
How to invert it is entirely analogous between the two, so I will focus on the left xorshift.

An important observation is that the least significant $k$ bits are left entirely untouched by the xorshift. Thus by repeating the operation, we recover the least significant $2k$ bits, as the XOR will invert itself for the next $k$ bits. Let’s take a look at the resulting value to see how we should proceed:
```
v0 = (x << k) ^ x
// Apply first step of inverse v1 = v0 ^ (v0 << k).
v1 = (x << 2*k) ^ (x << k) ^ (x << k) ^ x
// Simplify using self-inverse (x << k) ^ (x << k) = 0.
v1 = (x << 2*k) ^ x
```
From this we can conclude the following identity: $$\operatorname{xorshift}(\operatorname{xorshift}(x, k), k) = \operatorname{xorshift}(x, 2k)$$ Now we only need one more observation to complete our algorithm: a xorshift of $k \geq w$ where $w$ is the width of our integer is a no-op. Thus we repeatedly apply our doubling identity until we reach large enough $q$ such that $\operatorname{xorshift}(x, 2^q \cdot k) = x$.

For example, to invert a left xorshift by 13 for 64-bit integers we apply the following sequence:
```
x ^= x << 13  // Left xorshift by 13.

x ^= x << 13  // Inverse step 1.
x ^= x << 26  // Inverse step 2.
x ^= x << 52  // Inverse step 3.
// x ^= x << 104  // Next step would be a no-op.
```

Armed with this knowledge, we can now attack.

Breaking `CityHash64`

Let us take a look at (part of) the source code of CityHash64 from libcxx that’s used for hashing strings on 64-bit platforms:

static const uint64_t mul = 0x9ddfea08eb382d69ULL;
static const uint64_t k0 = 0xc3a5c85c97cb3127ULL;
static const uint64_t k1 = 0xb492b66fbe98f273ULL;
static const uint64_t k2 = 0x9ae16a3b2f90404fULL;
static const uint64_t k3 = 0xc949d7c7509e6557ULL;

template<class T>
T loadword(const void* p) {
    T r;
    std::memcpy(&r, p, sizeof(r));
    return r;
}

uint64_t rotate(uint64_t val, int shift) {
    if (shift == 0) return val;
    return (val >> shift) | (val << (64 - shift));
}

uint64_t hash_len_16(uint64_t u, uint64_t v) {
    uint64_t x = u ^ v;
    x *= mul;
    x ^= x >> 47;
    uint64_t y = v ^ x;
    y *= mul;
    y ^= y >> 47;
    y *= mul;
    return y;
}

uint64_t hash_len_17_to_32(const char *s, uint64_t len) {
    const uint64_t a = loadword<uint64_t>(s) * k1;
    const uint64_t b = loadword<uint64_t>(s + 8);
    const uint64_t c = loadword<uint64_t>(s + len - 8) * k2;
    const uint64_t d = loadword<uint64_t>(s + len - 16) * k0;
    return hash_len_16(
        rotate(a - b, 43) + rotate(c, 30) + d,
        a + rotate(b ^ k3, 20) - c + len
    );
}

To break this, let’s assume we’ll always give length 32 inputs. Then the implementation will always call hash_len_17_to_32, and we have full control over variables a, b, c and d by changing our input.

Note that d is only used once, in the final expression. This makes it a prime target for attacking the hash. We will choose a, b and c arbitrarily, and then solve for d to compute a desired hash outcome.

Using the above modinv function we first compute the necessary modular multiplicative inverses of mul and k0:

>>> 0x9ddfea08eb382d69 * 0xdc56e6f5090b32d9 % 2**64
1
>>> 0xc3a5c85c97cb3127 * 0x81bc9c5aa9c72e97 % 2**64
1

We also note that in this case the xorshift is easy to invert, as x ^= x >> 47 is simply its own inverse. Having all the components ready, we can invert the function step by step.

We first load a, b and c like in the hash function, and compute

uint64_t v = a + rotate(b ^ k3, 20) - c + len;

which is the second parameter to hash_len_16. Then, starting from our desired return value of hash_len_16(u, v) we work backwards step by step, inverting each operation to find the function argument u that would result in our target hash. Then once we have found such the unique u we compute our required input d. Putting it all together:

static const uint64_t mul_inv = 0xdc56e6f5090b32d9ULL;
static const uint64_t k0_inv  = 0x81bc9c5aa9c72e97ULL;

void cityhash64_preimage32(uint64_t hash, char *s) {
    const uint64_t len = 32;
    const uint64_t a = loadword<uint64_t>(s) * k1;
    const uint64_t b = loadword<uint64_t>(s + 8);
    const uint64_t c = loadword<uint64_t>(s + len - 8) * k2;
    
    uint64_t v = a + rotate(b ^ k3, 20) - c + len;
    
    // Invert hash_len_16(u, v). Original operation inverted
    // at each step is shown on the right, note that it is in
    // the inverse order of hash_len_16.
    uint64_t y = hash;    // return y;
    y *= mul_inv;         // y *= mul;
    y ^= y >> 47;         // y ^= y >> 47;
    y *= mul_inv;         // y *= mul;
    uint64_t x = y ^ v;   // uint64_t y = v ^ x;
    x ^= x >> 47;         // x ^= x >> 47;
    x *= mul_inv;         // x *= mul;
    uint64_t u = x ^ v;   // uint64_t x = u ^ v;
    
    // Find loadword<uint64_t>(s + len - 16).
    uint64_t d = u - rotate(a - b, 43) - rotate(c, 30);
    d *= k0_inv;
    std::memcpy(s + len - 16, &d, sizeof(d));
}

The chance that a random uint64_t forms 8 printable ASCII bytes is $\left(94/256\right)^8 \approx 0.033%$. Not great, but cityhash64_preimage32 is so fast that having to repeat it on average ~3000 times to get a purely ASCII result isn’t so bad.

For example, the following 10 strings all hash to 1337 using CityHash64, generated using this code:

orlp-cityhash64-D-:K5yx*zkgaaaaa
orlp-cityhash64-TXb7;1j&btkaaaaa
orlp-cityhash64-+/LM$0 ;msnaaaaa
orlp-cityhash64-u'f&>I'~mtnaaaaa
orlp-cityhash64-pEEv.LyGcnpaaaaa
orlp-cityhash64-v~~bm@,Vahtaaaaa
orlp-cityhash64-RxHr_&~{miuaaaaa
orlp-cityhash64-is_$34#>uavaaaaa
orlp-cityhash64-$*~l\{S!zoyaaaaa
orlp-cityhash64-W@^5|3^:gtcbaaaa

Breaking MurmurHash2

We can’t let libstdc++ get away after targetting libc++, can we? The default string hash calls an implementation of MurmurHash2 with seed 0xc70f6907. The hash—simplified to only handle strings whose lengths are multiples of 8—is as follows:

uint64_t murmurhash64a(const char* s, size_t len, uint64_t seed) {
    const uint64_t mul = 0xc6a4a7935bd1e995ULL;

    uint64_t hash = seed ^ (len * mul);
    for (const char* p = s; p != s + len; p += 8) {
        uint64_t data = loadword<uint64_t>(p);
        data *= mul;
        data ^= data >> 47;
        data *= mul;
        hash ^= data;
        hash *= mul;
    }

    hash ^= hash >> 47;
    hash *= mul;
    hash ^= hash >> 47;
    return hash;
}

We can take a similar approach here as before. We note that the modular multiplicative inverse of 0xc6a4a7935bd1e995 mod $2^{64}$ is 0x5f7a0ea7e59b19bd. As an example, we can choose the first 24 bytes arbitrarily, and solve for the last 8 bytes:

void murmurhash64a_preimage32(uint64_t hash, char* s, uint64_t seed) {
    const uint64_t mul = 0xc6a4a7935bd1e995ULL;
    const uint64_t mulinv = 0x5f7a0ea7e59b19bdULL;
    
    // Compute the hash state for the first 24 bytes as normal.
    uint64_t state = seed ^ (32 * mul);
    for (const char* p = s; p != s + 24; p += 8) {
        uint64_t data = loadword<uint64_t>(p);
        data *= mul;
        data ^= data >> 47;
        data *= mul;
        state ^= data;
        state *= mul;
    }
    
    // Invert target hash transformation.
                        // return hash;
    hash ^= hash >> 47; // hash ^= hash >> 47;
    hash *= mulinv;     // hash *= mul;
    hash ^= hash >> 47; // hash ^= hash >> 47;

    // Invert last iteration for last 8 bytes.
    hash *= mulinv;                // hash *= mul;
    uint64_t data = state ^ hash;  // hash = hash ^ data;
    data *= mulinv;                // data *= mul;
    data ^= data >> 47;            // data ^= data >> 47;
    data *= mulinv;                // data *= mul;
    std::memcpy(s + 24, &data, 8); // data = loadword<uint64_t>(s);
}

The following 10 strings all hash to 1337 using MurmurHash64A with the default seed 0xc70f6907, generated using this code:

orlp-murmurhash64-bhbaaat;SXtgVa
orlp-murmurhash64-bkiaaa&JInaNcZ
orlp-murmurhash64-ewmaaa(%J+jw>j
orlp-murmurhash64-vxpaaag"93\Yj5
orlp-murmurhash64-ehuaaafa`Wp`/|
orlp-murmurhash64-yizaaa1x.zQF6r
orlp-murmurhash64-lpzaaaZphp&c F
orlp-murmurhash64-wsjbaa771rz{z<
orlp-murmurhash64-rnkbaazy4X]p>B
orlp-murmurhash64-aqnbaaZ~OzP_Tp

Universal collision attack on MurmurHash64A

In fact, MurmurHash64A is so weak that Jean-Philippe Aumasson, Daniel J. Bernstein and Martin Boßlet published an attack that creates sets of strings which collide regardless of the random seed used.

To see how it works, let’s take a look at the core loop of MurmurHash64A:

uint64_t data = loadword<uint64_t>(p);
data *= mul;          // Trivially invertible.
data ^= data >> 47;   // Trivially invertible.
data *= mul;          // Trivially invertible.
state ^= data;
state *= mul;

We know we can trivially invert the operations done on data regardless of what the current state is, so we might as well have had the following body:

state ^= data;
state *= mul;

Now the hash starts looking rather weak indeed. The clever trick they employ is by creating two strings simultaneously, such that they differ precisely in the top bit in each 8-byte word. Why the top bit?

>>> 1 << 63
9223372036854775808
>>> (1 << 63) * mul % 2**64
9223372036854775808

Since mul is odd, its least significant bit is set. Multiplying 1 << 63 by it is equivalent to shifting that bit 63 places to the left, which is once again 1 << 63. That is, 1 << 63 is a fixed point for the state *= mul operation. We also note that for the top bit XOR is equivalent to addition, as the overflow from addition is removed mod $2^{64}$.

So if we have two input strings, one starting with the 8 bytes data, and the other starting with data ^ (1 << 63) == data + (1 << 63) (after doing the trivial inversions). We then find that the two states, regardless of seed, differ exactly in the top bit after state ^= data. After multiplication we find we have two states x * mul and (x + (1 << 63)) * mul == x * mul + (1 << 63)… which again differ exactly in the top bit! We are now back to state ^= data in our iteration, for the next 8 bytes. We can now use this moment to cancel our top bit difference, by again feeding two 8-byte strings that differ in the top bit (after inverting).

In fact, we only have to find one pair of such strings that differ in the top bit, which we can then repeat twice (in either order) to cancel our difference again. When represented as a uint64_t if we choose the first string as x we can derive the second string as

x *= mul;        // Forward transformation...
x ^= x >> 47;    // ...
x *= mul;        // ...
x ^= 1 << 63;    // Difference in top bit.
x *= mulinv;     // Backwards transformation...
x ^= x >> 47;    // ...
x *= mulinv;     // ...

I was unable to find a printable ASCII string that has another printable ASCII string as its partner. But I was able to find the following pair of 8-byte UTF-8 strings that differ in exactly the top bit after the Murmurhash64A input transformation:

xx0rlpx!
xxsXъВ

Combining them as such gives two 16-byte strings that when fed through the hash algorithm manipulate the state in the same way: a collision.

xx0rlpx!xxsXъВ
xxsXъВxx0rlpx!

But it doesn’t stop there. By concatenating these two strings we can create $2^n$ different colliding strings each $16n$ bytes long. With the current libstdc++ implementation the following prints the same number eight times:

std::hash<std::u8string> h;
std::u8string a = u8"xx0rlpx!xxsXъВ";
std::u8string b = u8"xxsXъВxx0rlpx!";
std::cout << h(a + a + a) << "\n";
std::cout << h(a + a + b) << "\n";
std::cout << h(a + b + a) << "\n";
std::cout << h(a + b + b) << "\n";
std::cout << h(b + a + a) << "\n";
std::cout << h(b + a + b) << "\n";
std::cout << h(b + b + a) << "\n";
std::cout << h(b + b + b) << "\n";

Even if the libstdc++ would randomize the seed used by MurmurHash64a, the strings would still collide.

Breaking MurmurHash3

Nim ~~uses~~ used to use MurmurHash3_x86_32, so let’s try to break that.

If we once again simplify to strings whose lengths are a multiple of 4 we get the following code:

uint32_t rotl32(uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

uint32_t murmurhash3_x86_32(const char* s, int len, uint32_t seed) {
    const uint32_t c1 = 0xcc9e2d51;
    const uint32_t c2 = 0x1b873593;
    const uint32_t c3 = 0x85ebca6b;
    const uint32_t c4 = 0xc2b2ae35;

    uint32_t h = seed;
    for (const char* p = s; p != s + len; p += 4) {
        uint32_t k = loadword<uint32_t>(p);
        k *= c1;
        k = rotl32(k, 15);
        k *= c2;

        h ^= k;
        h = rotl32(h, 13);
        h = h * 5 + 0xe6546b64;
    }

    h ^= len;
    h ^= h >> 16;
    h *= c3;
    h ^= h >> 13;
    h *= c4;
    h ^= h >> 16;
    return h;
}

I think by now you should be able to get this function to spit out any value you want if you know the seed. The inverse of rotl32(x, r) is rotl32(x, 32-r) and the inverse of h ^= h >> 16 is once again just h ^= h >> 16. Only h ^= h >> 13 is a bit different, it’s the first time we’ve seen that a xorshift’s inverse has more than one step:

h ^= h >> 13
h ^= h >> 26

Compute the modular inverses of c1 through c4 as well as 5 mod $2^{32}$, and go to town. If you want to cheat or check your answer, you can check out the code I’ve used to generate the following ten strings that all hash to 1337 when fed to MurmurHash3_x86_32 with seed 0:

orlp-murmurhash3_x86_32-haaaPa*+
orlp-murmurhash3_x86_32-saaaUW&<
orlp-murmurhash3_x86_32-ubaa/!/"
orlp-murmurhash3_x86_32-weaare]]
orlp-murmurhash3_x86_32-chaa5@/}
orlp-murmurhash3_x86_32-claaM[,5
orlp-murmurhash3_x86_32-fraaIx`N
orlp-murmurhash3_x86_32-iwaara&<
orlp-murmurhash3_x86_32-zwaa]>zd
orlp-murmurhash3_x86_32-zbbaW-5G

Nim uses 0 as a fixed seed.

Universal collision attack on MurmurHash3

Suppose that Nim didn’t use 0 as a fixed seed, but chose a randomly generated one. Can we do a similar attack as the one done to MurmurHash2 to still generate universal multicollisions?

Yes we can. Let’s take another look at that core loop body:

uint32_t k = loadword<uint32_t>(p);
k *= c1;            // Trivially invertable.
k = rotl32(k, 15);  // Trivially invertable.
k *= c2;            // Trivially invertable.

h ^= k;
h = rotl32(h, 13);
h = h * 5 + 0xe6546b64;

Once again we can ignore the first three trivially invertable instructions as we can simply choose our input so that we get exactly the k we want. Remember from last time that we want to introduce a difference in exactly the top bit of h, as the multiplication will leave this difference in place. But here there is a bit rotation between the XOR and the multiplication. The solution? Simply place our bit difference such that rotl32(h, 13) shifts it into the top position.

Does the addition of 0xe6546b64 mess things up? No. Since only the top bit between the two states will be different, there is a difference of exactly $2^{31}$ between the two states. This difference is maintained by the addition. Since two 32-bit numbers with the same top bit can be at most $2^{31} - 1$ apart, we can conclude that the two states still differ in the top bit after the addition.

So we want to find two pairs of 32-bit ints, such that after applying the first three instructions the first pair differs in bit 1 << (31 - 13) == 0x00040000 and the second pair in bit 1 << 31 == 0x80000000.

After some brute-force searching I found some cool pairs (again forced to use UTF-8), which when combined give the following collision:

a = "!&orlpՓ"
b = "yǏglp$X"

As before, any concatenation of as and bs of length n collides with all other combinations of length n.

Breaking FarmHash64

Nim switched to farmhash since I started writing this post. To break it we can notice that its structure is very similar to CityHash64, so we can use those same techniques again. In fact, the only changes between the two for lengths 17-32 bytes is that a few operators were changed from subtraction/XOR to addition, a rotation operator had its constant tweaked, and some k constants are slightly tweaked in usage. The process of breaking it is so similar that it’s entirely analogous, so we can skip straight to the result. These 10 strings all hash to 1337 with FarmHash64:

orlp-farmhash64-?VrJ@L7ytzwheaaa
orlp-farmhash64-p3`!SQb}fmxheaaa
orlp-farmhash64-pdt'cuI\gvxheaaa
orlp-farmhash64-IbY`xAG&ibkieaaa
orlp-farmhash64-[_LU!d1hwmkieaaa
orlp-farmhash64-QiY!clz]bttieaaa
orlp-farmhash64-&?J3rZ_8gsuieaaa
orlp-farmhash64-LOBWtm5Szyuieaaa
orlp-farmhash64-Mptaa^g^ytvieaaa
orlp-farmhash64-B?&l::hxqmfjeaaa

Trivial fixed-seed wyhash multicollisions

Zig uses wyhash with a fixed seed of zero. While I was unable to do seed-independent attacks against wyhash, using it with a fixed seed makes generating collisions trivial. Wyhash is built upon the folded multiply, which takes two 64-bit inputs, multiplies them to a 128-bit product before XORing together the two halves:

uint64_t folded_multiply(uint64_t a, uint64_t b) {
    __uint128_t full = __uint128_t(a) * __uint128_t(b);
    return uint64_t(full) ^ uint64_t(full >> 64);
}

It’s easy to immediately see a critical flaw with this: if one of the two sides is zero, the output will also always be zero. To protect against this, wyhash always uses a folded multiply in the following form:

out = folded_multiply(input_a ^ secret_a, input_b ^ secret_b);

where secret_a and secret_b are determined by the seed, or outputs of previous iterations which are influenced by the seed. However, when your seed is constant… With a bit of creativity we can use the start of our string to prepare a ‘secret’ value which we can perfectly cancel with another ASCII string later in the input.

So, without further ado, every 32-byte string of the form

orlp-wyhash-oGf_________tWJbzMJR

hashes to the same value with Zig’s default hasher.

Zig uses a different set of parameters than the defaults found in the wyhash repository, so for good measure, this pattern provides arbitrary multicollisions for the default parameters found in wyhash when using seed == 0:

orlp-wyhash-EUv_________NLXyytkp

Conclusion

We’ve seen that a lot of the hash functions in common use in hash tables today are very weak, allowing fairly trivial attacks to produce arbitrary amounts of collisions if not randomly initialized. Using a randomly seeded hash table is paramount if you don’t wish to become a victim of a hash flooding attack.

We’ve also seen that some hash functions are vulnerable to attack even if randomly seeded. These are completely broken and should not be used if attacks are a concern at all. Luckily I was unable to find such attacks against most hashes, but the possibility of such an attack existing is quite unnerving.

With universal hashing it’s possible to construct hash functions for which such an attack is provably impossible, last year I published a hash function called polymur-hash that has this property. Your HTTPS connection to this website also likely uses a universal hash function for authenticity of the transferred data, both Poly1305 and GCM are based on universal hashing for their security proofs.

Of course, if your data is not user-controlled, or there is no reasonable security model where your application would face attacks, you can get away with faster and insecure hashes.

More to come on the subject of hashing and hash tables and how it can go right or wrong, but for now this article is long enough as-is…