<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<title>orlp.net - Blog Archive</title>
	<link href="https://orlp.net/blog/atom.xml" rel="self" type="application/atom+xml"/>
	<link href="https://orlp.net/blog/"/>
	<generator uri="https://www.getzola.org/">Zola</generator>
	<updated>2025-12-31T00:00:00+00:00</updated>
	<id>https://orlp.net/blog/atom.xml</id>
	<entry xml:lang="en">
		<title>Sorting with Fibonacci Numbers and a Knuth Reward Check</title>
		<author><name>Orson R. L. Peters</name></author>
		<published>2025-12-31T00:00:00+00:00</published>
		<updated>2025-12-31T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/fibonacci-sort/" type="text/html"/>
		<id>https://orlp.net/blog/fibonacci-sort/</id>
		<content type="html">&lt;p&gt;The following incredibly small sorting algorithm has an $O(n^{4&#x2F;3})$ worst-case runtime:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;fibonacci_sort(v):
&lt;&#x2F;span&gt;&lt;span&gt;    a, b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;a * b &amp;lt; len(v):
&lt;&#x2F;span&gt;&lt;span&gt;        a, b = b, a + b
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;a &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;        a, b = b - a, a
&lt;&#x2F;span&gt;&lt;span&gt;        g = a * b
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(g, len(v)):
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;i &amp;gt;= g and v[i - g] &amp;gt; v[i]:
&lt;&#x2F;span&gt;&lt;span&gt;                v[i], v[i - g] = v[i - g], v[i]
&lt;&#x2F;span&gt;&lt;span&gt;                i -= g
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As the name implies, it uses the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fibonacci_sequence&quot;&gt;Fibonacci
sequence&lt;&#x2F;a&gt; ($1, 1, 2, 3, 5, \dots$) to sort the
elements. In this article I will explain how it works, show an interesting divisibility property of
the Fibonacci numbers and use that to prove its complexity. I will also explain how I ended up with
one of &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Donald_Knuth&quot;&gt;Donald Knuth&lt;&#x2F;a&gt;’s coveted &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Knuth_reward_check&quot;&gt;reward checks&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;shellsort&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#shellsort&quot; aria-label=&quot;Anchor link for: shellsort&quot;&gt;Shellsort&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The above sorting algorithm is an
instance of the more generic &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Shellsort&quot;&gt;Shellsort&lt;&#x2F;a&gt;.
Shellsort, named after Donald L. Shell, does not directly sort all the elements
in one step. Rather, it first only sorts &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Subsequence&quot;&gt;subsequences&lt;&#x2F;a&gt;
of the array in a process known as $k$-sorting.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;k-sorting&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#k-sorting&quot; aria-label=&quot;Anchor link for: k-sorting&quot;&gt;$k$-sorting&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;aside&gt;
&lt;p&gt;A subsequence of an array can be any of its elements, but they must remain in the original order.
Unlike a substring, the subsequence can have gaps—the elements need not be contiguous.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;When $k$-sorting an array you split its elements into groups, and sort those
groups &lt;strong&gt;independently from each other&lt;&#x2F;strong&gt;. Each group is formed by taking every
$k$th value ($k$ is also referred to as the &lt;em&gt;gap&lt;&#x2F;em&gt;), differing only by their starting point. For example, $3$-sorting
an array with eight elements looks like this:&lt;&#x2F;p&gt;
&lt;div class=&quot;small&quot;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;fibonacci-sort&#x2F;sorting-subsequences.png&quot; alt=&quot;Sorting the subsequences independently.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;After this step we say the array is $k$-ordered. This means that for all $i$ we have ${A[i] \leq A[i + k]}$.
Nothing is (directly) known about the relative order of elements which aren’t separated by a multiple
of $k$. Since everything is a multiple of $1$, we see that $1$-ordered is just… sorted.&lt;&#x2F;p&gt;
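&lt;p&gt;As a minimal Python sketch of these two notions (the helper names &lt;code&gt;k_sort&lt;&#x2F;code&gt; and &lt;code&gt;is_k_ordered&lt;&#x2F;code&gt; and the example values are my own, not the ones from the figure), each group can be written as the slice &lt;code&gt;v[start::k]&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def k_sort(v, k):
    # Sort each of the k interleaved groups v[start], v[start + k], ... on its own.
    for start in range(k):
        v[start::k] = sorted(v[start::k])


def is_k_ordered(v, k):
    # v is k-ordered exactly when every such group is already sorted.
    return all(v[start::k] == sorted(v[start::k]) for start in range(k))


v = [5, 7, 2, 4, 1, 8, 3, 6]
k_sort(v, 3)
print(v)                   # [3, 1, 2, 4, 6, 8, 5, 7]
print(is_k_ordered(v, 3))  # True
print(is_k_ordered(v, 1))  # False: 3-ordered does not mean sorted
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The built-in sort is used on each group purely for brevity; it only illustrates what the result of a $k$-sort looks like, not how Shellsort gets there.&lt;&#x2F;p&gt;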
&lt;aside&gt;
&lt;p&gt;Confusingly, there appears to be another definition for &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;K-sorted_sequence&quot;&gt;$k$-sorted&lt;&#x2F;a&gt;
which states that for all $i$ and all $j \geq k$ we have ${A[i] \leq A[i + j]}$, not
just at the multiples of $k$. That is &lt;em&gt;not&lt;&#x2F;em&gt; the definition used here; in the context of
Shellsort all literature uses the earlier definition.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;There is a fascinating property of $k$-sorting that is key to how Shellsort operates:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Theorem 1.&lt;&#x2F;strong&gt; If an array is $h$-ordered and then it is $k$-sorted, it remains
$h$-ordered as well.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;It is surprisingly tricky to prove this simple statement. A proof sketch paraphrased
from a formal proof due to Knuth (&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;The_Art_of_Computer_Programming&quot;&gt;TAOCP&lt;&#x2F;a&gt;,
Vol. 3, Chapter 5.2.1, Theorem K) goes as follows:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lemma 1.&lt;&#x2F;strong&gt; If the last $r$ elements of array $Y$ are greater than or equal to,
respectively, the first $r$ elements of array $X$, then this remains true after
sorting both arrays.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Proof sketch.&lt;&#x2F;strong&gt; There are at least $r$ elements in $X$ which are smaller than or
equal to elements in $Y$, thus the maximum element in $Y$ dominates at least
$r$ elements, and thus the last element of $Y$ dominates the $r$th element of $X$ after sorting both.
Apply a similar argument for $r - 1$ and the second largest element, etc.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I believe a visual representation of this lemma really helps in understanding it:&lt;&#x2F;p&gt;
&lt;div class=&quot;medium&quot;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;fibonacci-sort&#x2F;sorting-lemma.png&quot; alt=&quot;Sorting X and Y maintaining their respective relationships.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Now we can use this lemma to prove our original theorem:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Proof sketch of Theorem 1.&lt;&#x2F;strong&gt; We have some array $A$ which is $h$-ordered, and thus
we have $A[i] \leq A[i + h]$ for valid $i$. Then we $k$-sort it and now have
$A[i] \leq A[i + k]$ instead. For any particular choice of $i$, define $X$ as
$A[i + sk]$ for all valid integer $s$, and $Y$ as $A[i + tk + h]$ for all valid
integer $t$. Then you can apply Lemma 1 to show that $A[i] \leq A[i + h]$ still
remains true after $k$-sorting.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Once again I think a visual representation really helps here, especially to see
the parallels with Lemma 1. Suppose we have a $3$-ordered array
and then $5$-sort it. The proof sketch of Theorem 1 for $i = 2$ (and in fact for any $i \equiv 2 \pmod 5$) that we still
have $A[i] \leq A[i + 3]$ can then be visualized as such:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;fibonacci-sort&#x2F;k-sorting-theorem.png&quot; alt=&quot;Theorem 1 visualized.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I chose to only show a proof sketch rather than a full formal proof in this blog
post, as I quote:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“This is much harder to write down than to understand.”&lt;&#x2F;em&gt; — Donald Knuth&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
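&lt;p&gt;Theorem 1 is also easy to check empirically. The sketch below uses two small hypothetical helpers built on Python slices, &lt;code&gt;k_sort&lt;&#x2F;code&gt; and &lt;code&gt;is_k_ordered&lt;&#x2F;code&gt;, together with an arbitrary random input and seed. It first $3$-sorts and then $5$-sorts an array, and confirms the array is still $3$-ordered afterwards:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import random


def k_sort(v, k):
    # Sort every group of elements that lie k positions apart, independently.
    for start in range(k):
        v[start::k] = sorted(v[start::k])


def is_k_ordered(v, k):
    # True when each of those groups is in sorted order.
    return all(v[start::k] == sorted(v[start::k]) for start in range(k))


random.seed(0)
v = [random.randrange(100) for _ in range(40)]
k_sort(v, 3)
k_sort(v, 5)
print(is_k_ordered(v, 5))  # True, we just 5-sorted
print(is_k_ordered(v, 3))  # True as well, as Theorem 1 promises
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;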
&lt;p&gt;The result of this theorem is that as you apply more and more steps of $k$-sorting, all the work
compounds into a larger overall ordering. Even with just two steps we can see
very powerful results:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lemma 2.&lt;&#x2F;strong&gt; If an array is both $h$-ordered and $k$-ordered where $k$ and $h$ are
&lt;em&gt;relatively prime&lt;&#x2F;em&gt; (they don’t share any divisors other than 1), then two
elements in the array which are $s \geq (h-1)(k-1)$ steps apart are in a
correct relative order.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Proof.&lt;&#x2F;strong&gt; It is possible to write $s = \alpha h + \beta k$ with integer $\alpha, \beta \geq 0$
(for a proof of that see &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Coin_problem#n_=_2&quot;&gt;here&lt;&#x2F;a&gt;).
This means we can do $\alpha$ steps of $h$ (each time using $A[i] \leq A[i + h]$
since we’re $h$-ordered)
and similarly $\beta$ steps of $k$ to see that $A[i] \leq A[i + \alpha h + \beta k] = A[i + s]$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;For example, here is a visualization of the relative order implied by the combination
of $4$-ordering and $9$-ordering:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;fibonacci-sort&#x2F;frobenius.png&quot; alt=&quot;4-ordering and 9-ordering combined.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We see that indeed starting from offset $(4 - 1)(9 - 1) = 24$ each element is ordered relative
to the element at 0.&lt;&#x2F;p&gt;
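&lt;p&gt;The offset can be double-checked by brute force. This sketch (the function name &lt;code&gt;representable&lt;&#x2F;code&gt; and the search range are my own choices) lists every distance that cannot be written as $4\alpha + 9\beta$ with non-negative integers. The largest one is $23$, so everything from $24$ onwards is covered:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def representable(s, h, k):
    # True if s = alpha * h + beta * k for some non-negative integers alpha, beta.
    # m runs over the multiples of k up to s; the rest, s - m, must be a multiple of h.
    return any((s - m) % h == 0 for m in range(0, s + 1, k))


missing = [s for s in range(100) if not representable(s, 4, 9)]
print(missing)  # [1, 2, 3, 5, 6, 7, 10, 11, 14, 15, 19, 23]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;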
&lt;p&gt;More generally, we find that if we $k$-sort with some set $\{k_i, \dots, k_j\}$
and $\gcd(k_i, \dots, k_j) = 1$, then there exists some upper bound such that
all numbers greater than it are representable as sums of non-negative multiples of $k_i, \dots, k_j$,
meaning all elements beyond that point are ordered correctly relative to the
starting point. Finding this upper bound is known as the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Coin_problem&quot;&gt;Frobenius problem&lt;&#x2F;a&gt;
or coin problem, and it is a hard number theoretic problem. Some formulae like
the above one for two coprime integers are known but the general problem for
arbitrary sets of $k$ is NP-hard. It is this problem that forms the surprising
link between Shellsort and number theory which will ultimately lead us towards
the Fibonacci numbers.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;insertion-sort&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#insertion-sort&quot; aria-label=&quot;Anchor link for: insertion-sort&quot;&gt;Insertion sort&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Shellsort uses &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Insertion_sort&quot;&gt;insertion sort&lt;&#x2F;a&gt;
to $k$-sort the subsequences. This sort repeatedly swaps the last
unsorted element with the element before it until it falls into place before
continuing with the next unsorted element. This is generally speaking a
rather inefficient algorithm for large arrays, as it has a worst-case runtime of $O(n^2)$.
However, this worst-case complexity can be refined. If each element is at most
$m$ steps away from its final sorted position, insertion sort takes $O(nm)$
time.&lt;&#x2F;p&gt;
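&lt;p&gt;One way to see where the $O(nm)$ bound comes from (a short sketch of my own, assuming distinct elements for simplicity): insertion sort performs one swap per inversion, that is, per pair $i &amp;lt; j$ with $A[i] &amp;gt; A[j]$. Write $p_i$ for the final sorted position of the element that starts at index $i$ (my notation) and assume $|p_i - i| \leq m$ for all $i$. Then any inversion satisfies
$$j - m \leq p_j &amp;lt; p_i \leq i + m,$$
so $j - i &amp;lt; 2m$. Each element therefore takes part in fewer than $2m$ inversions with elements to its left, giving fewer than $2nm$ swaps in total.&lt;&#x2F;p&gt;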
&lt;p&gt;And this is the key to Shellsort’s subquadratic worst-case complexity.
You choose a clever &lt;em&gt;gap sequence&lt;&#x2F;em&gt; such that gaps start off very large leading
to small subsequences and thus small $O(n^2)$ terms. Then, when Shellsort starts
$k$-sorting with smaller gaps, you can use the fact that no element is very far
from its final position to prove that it is still efficient. For example, based
on our earlier observations, if the array is $4$-ordered and $9$-ordered and we
do a $1$-sort (that is, a regular sort with no gaps) we know insertion sort
cannot run slower than $O(24n) = O(n)$ because each element is no further than $m = 24$ steps
away from its final position.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;gap-sequence&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#gap-sequence&quot; aria-label=&quot;Anchor link for: gap-sequence&quot;&gt;Gap sequence&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Choosing the gap sequence is thus key to Shellsort’s performance. The Wikipedia
page lists &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Shellsort#Gap_sequences&quot;&gt;many known gap sequences&lt;&#x2F;a&gt;
with different complexities. For example Hibbard’s 1963 sequence $k_i = 2^{i} - 1$
with $O(n^{3&#x2F;2})$ complexity, or Pratt’s from 1971 which chooses all numbers of the form
$2^p3^q$ for a complexity of $O(n\,(\log n)^2)$ (still the best known sequence to date, at least asymptotically).&lt;&#x2F;p&gt;
&lt;p&gt;However, all ‘new’ sequences listed after 1986 are &lt;em&gt;empirically&lt;&#x2F;em&gt; established; their
worst-case complexities are unknown. They perform very well in practice, but they could
possibly have much slower than expected performance for some inputs. A search
on Google Scholar reveals no interesting new sequences either. With that in mind
I’m happy to announce that Shellsort with the gap sequence&lt;&#x2F;p&gt;
&lt;p&gt;$$k_i = F_i \cdot F_{i+1} = (1, 2, 6, 15, 40, 104, 273, \dots)$$&lt;&#x2F;p&gt;
&lt;p&gt;where $F_n$ is the $n$th Fibonacci number has a worst-case runtime of $O(n^{4&#x2F;3})$.&lt;&#x2F;p&gt;
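&lt;p&gt;For reference, the first few gaps can be generated with a few lines of Python (a sketch; &lt;code&gt;fibonacci_gaps&lt;&#x2F;code&gt; is a name made up for illustration):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def fibonacci_gaps(count):
    # Returns the first count products of consecutive Fibonacci numbers.
    gaps = []
    a, b = 1, 1
    for _ in range(count):
        gaps.append(a * b)
        a, b = b, a + b
    return gaps


print(fibonacci_gaps(7))  # [1, 2, 6, 15, 40, 104, 273]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;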
&lt;h2 id=&quot;an-interesting-fibonacci-property&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#an-interesting-fibonacci-property&quot; aria-label=&quot;Anchor link for: an-interesting-fibonacci-property&quot;&gt;An interesting Fibonacci property&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Before we can prove the worst-case runtime of this gap sequence we’re going to need
to take a look at the Fibonacci numbers in more detail:&lt;&#x2F;p&gt;
&lt;p&gt;$$F_0 = 0, \quad F_1 = 1, \quad F_n = F_{n-1} + F_{n-2}$$&lt;&#x2F;p&gt;
&lt;p&gt;This very well-known series of numbers $0, 1, 1, 2, 3, 5, 8, 13, \dots$ has all
kinds of interesting properties, but today we’ll need a very curious
property where for $a, b &amp;gt; 0$,
$$\gcd(F_a, F_b) = F_{\gcd(a, b)},$$
where $\gcd$ is the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Greatest_common_divisor&quot;&gt;greatest
common divisor&lt;&#x2F;a&gt; function.&lt;&#x2F;p&gt;
&lt;p&gt;This property in plain words states that the greatest common divisor of the
$a$th and $b$th Fibonacci number can be found by computing the greatest common
divisor of $a$ and $b$, and looking up that index in the Fibonacci series.&lt;&#x2F;p&gt;
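&lt;p&gt;Before proving it, the property is easy to test numerically. The sketch below (with a hypothetical helper &lt;code&gt;fib&lt;&#x2F;code&gt; and an index range picked arbitrarily) compares both sides using &lt;code&gt;math.gcd&lt;&#x2F;code&gt; from the Python standard library:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;from math import gcd


def fib(n):
    # Iteratively computes F_n with F_0 = 0 and F_1 = 1.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(gcd(fib(12), fib(18)), fib(gcd(12, 18)))  # 8 8
print(all(gcd(fib(a), fib(b)) == fib(gcd(a, b))
          for a in range(1, 60) for b in range(1, 60)))  # True
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;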
&lt;h3 id=&quot;addition-formula&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#addition-formula&quot; aria-label=&quot;Anchor link for: addition-formula&quot;&gt;Addition formula&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;To prove that we first need to prove the addition formula of the Fibonacci
numbers, where $n, m &amp;gt; 0$,
$$F_{n+m} = F_{n+1}F_m + F_nF_{m-1}.$$
A fairly simple way to do this is using the matrix power form of the Fibonacci
sequence,
$$\begin{pmatrix}F_{n+1}&amp;amp;F_n\\F_n&amp;amp;F_{n-1}\end{pmatrix} = {\begin{pmatrix}1&amp;amp;1\\1&amp;amp;0\end{pmatrix}}^n.$$&lt;&#x2F;p&gt;
&lt;aside&gt;
Proving the formula for $n = 1, 2$ directly using substitution and then using strong induction is
probably simpler; I just wanted an excuse to show the matrix power form.
&lt;&#x2F;aside&gt;
&lt;p&gt;This form can be easily proven as correct using induction. With $n = 1$ you can directly see
the equality holds, and to prove it holds for $n + 1$ assuming it holds for $n$ we have&lt;&#x2F;p&gt;
&lt;p&gt;\begin{align*}
{\begin{pmatrix}1&amp;amp;1\\1&amp;amp;0\end{pmatrix}}^{n + 1} &amp;amp;= {\begin{pmatrix}1&amp;amp;1\\1&amp;amp;0\end{pmatrix}}{\begin{pmatrix}1&amp;amp;1\\1&amp;amp;0\end{pmatrix}}^n = {\begin{pmatrix}1&amp;amp;1\\1&amp;amp;0\end{pmatrix}}\begin{pmatrix}F_{n+1}&amp;amp;F_n\\F_n&amp;amp;F_{n-1}\end{pmatrix}\\
&amp;amp;= \begin{pmatrix}F_{n+1}+F_{n}&amp;amp;F_n+F_{n-1}\\F_{n+1}&amp;amp;F_{n}\end{pmatrix} = \begin{pmatrix}F_{n+2}&amp;amp;F_{n+1}\\F_{n+1}&amp;amp;F_{n}\end{pmatrix}.
\end{align*}&lt;&#x2F;p&gt;
&lt;p&gt;Then, using the fact that in matrix exponentiation $A^{n+m} = A^n \times A^m$ we find:&lt;&#x2F;p&gt;
&lt;p&gt;\begin{align*}
\begin{pmatrix}F_{n+m+1}&amp;amp;F_{n+m}\\F_{n+m}&amp;amp;F_{n+m-1}\end{pmatrix} &amp;amp;= \begin{pmatrix}F_{n+1}&amp;amp;F_n\\F_n&amp;amp;F_{n-1}\end{pmatrix} \times \begin{pmatrix}F_{m+1}&amp;amp;F_m\\F_m&amp;amp;F_{m-1}\end{pmatrix}\\
&amp;amp;= \begin{pmatrix}F_{n+1}F_{m+1}+F_nF_m&amp;amp;F_{n+1}F_m+F_nF_{m-1}\\F_nF_{m+1}+F_{n-1}F_m&amp;amp;F_nF_m + F_{n-1}F_{m-1}\end{pmatrix}.
\end{align*}&lt;&#x2F;p&gt;
&lt;p&gt;Our desired addition formula can then be read from the top right element of both
sides of the matrix equation.&lt;&#x2F;p&gt;
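&lt;p&gt;As a quick sanity check of the addition formula, take $n = 4$ and $m = 3$ (values picked only for illustration):
$$F_{4+3} = F_{5}F_{3} + F_{4}F_{2} = 5 \cdot 2 + 3 \cdot 1 = 13 = F_7.$$&lt;&#x2F;p&gt;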
&lt;p&gt;If we generalize the Fibonacci formula a bit to allow negative
indices we find that $F_{-1} = 1$ (which is the only choice preserving
$F_{n} = F_{n-1} + F_{n-2}$), and that the above addition formula also holds for
$n &amp;gt; 0, m = 0$ which we’ll need in a bit.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;divisibility-and-coprimality&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#divisibility-and-coprimality&quot; aria-label=&quot;Anchor link for: divisibility-and-coprimality&quot;&gt;Divisibility and coprimality&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;We also need to prove that $F_{kn} \equiv 0 \bmod F_{n}$ for all $k, n &amp;gt; 0$.
If we do induction on $k$ it obviously holds true for $k = 1$, and we see that if it holds for $k$ we have
\begin{align*}
F_{(k + 1)n} \equiv F_{kn + n} &amp;amp;\equiv F_{kn+1}F_{n} + F_{kn}F_{n - 1}\\
&amp;amp;\equiv F_{kn+1} \cdot 0 + 0 \cdot F_{n - 1} \equiv 0 &amp;amp;&amp;amp;\mod F_{n}.
\end{align*}
In other words, $F_n$ divides $F_{kn}$, also written $F_{n} \mathrel{|} F_{kn}$.&lt;&#x2F;p&gt;
&lt;p&gt;Finally we need to prove that $\gcd(F_n, F_{n+1}) = 1$, meaning consecutive Fibonacci numbers don’t
share any prime factors (they are
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Coprime_integers&quot;&gt;coprime&lt;&#x2F;a&gt;). This is once again
proven through induction: the property holds for $n = 1$, and since $\gcd(a, b) =
\gcd(a, a + b)$, we have $$\gcd(F_n, F_{n+1}) = \gcd(F_n + F_{n+1} , F_{n+1}) =
\gcd(F_{n+1}, F_{n+2}).$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;euclidean-algorithm&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#euclidean-algorithm&quot; aria-label=&quot;Anchor link for: euclidean-algorithm&quot;&gt;Euclidean algorithm&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Euclidean_algorithm&quot;&gt;Euclidean algorithm&lt;&#x2F;a&gt; is
an algorithm to compute the greatest common divisor based on the identity
$$\gcd(a, b) = \gcd(a, b - a),$$ where $0 &amp;lt; a &amp;lt; b$. By repeatedly applying this
identity (swapping $a, b$ when needed to keep $a &amp;lt; b$) until $a = b$ you’ll find
the greatest common divisor, which is that common value. You can speed up the algorithm by replacing the
repeated subtraction with the modulo operator: $$\gcd(a, b) = \gcd(a, b \bmod
a),$$ stopping once the remainder is zero, since $\gcd(a, 0) = a$.&lt;&#x2F;p&gt;
&lt;p&gt;Now, suppose we are computing $\gcd(F_a, F_b)$ with $0 &amp;lt; a &amp;lt; b$. We can
write $b = qa + r$ with $q \geq 1$ and $0 \leq r &amp;lt; a$, this is just splitting up $b$
into the quotient $q$ and remainder $r$ when dividing by $a$. Then, using the
addition formula we find&lt;&#x2F;p&gt;
&lt;p&gt;$$\gcd(F_a, F_b) = \gcd(F_a, F_{qa + r}) = \gcd(F_a, F_{qa + 1}F_r + F_{qa}F_{r-1})$$&lt;&#x2F;p&gt;
&lt;p&gt;We proved earlier that $F_{qa} \equiv 0 \mod F_a$, so applying the above modular identity
we can eliminate the $F_{qa}F_{r-1}$ term,
$$\gcd(F_a, F_b) = \gcd(F_a, F_{qa + 1}F_r).$$
We also know that $F_{qa+1}$ and $F_{qa}$ are coprime. Since all factors
of $F_a$ are factors of $F_{qa}$ we can conclude that $F_a$
and $F_{qa+1}$ share no (prime) factors either. This property lets us eliminate
that factor from the greatest common divisor calculation entirely:
$$\gcd(F_a, F_b) = \gcd(F_a, F_r).$$&lt;&#x2F;p&gt;
&lt;p&gt;Since $r = b \bmod a$ we note the following symmetry when $0 &amp;lt; a &amp;lt; b$:&lt;&#x2F;p&gt;
&lt;p&gt;$$\gcd(a, b) = \gcd(a, b \bmod a),$$
$$\gcd(F_a, F_b) = \gcd(F_a, F_{b \bmod a}).$$&lt;&#x2F;p&gt;
&lt;p&gt;In other words, by repeatedly applying the above identity we can perform the
Euclidean algorithm on the &lt;em&gt;indices&lt;&#x2F;em&gt; of the Fibonacci numbers, giving
$$\gcd(F_a, F_b) = F_{\gcd(a, b)}.$$&lt;&#x2F;p&gt;
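&lt;p&gt;For instance, with $a = 12$ and $b = 18$ (an arbitrary pair chosen for illustration), and using $F_0 = 0$ for the final step, the Euclidean steps on the indices are
\begin{align*}
\gcd(F_{12}, F_{18}) &amp;amp;= \gcd(F_{12}, F_{18 \bmod 12}) = \gcd(F_{12}, F_{6})\\
&amp;amp;= \gcd(F_{6}, F_{12 \bmod 6}) = \gcd(F_{6}, F_{0}) = \gcd(8, 0) = 8 = F_{\gcd(12, 18)},
\end{align*}
which agrees with computing $\gcd(144, 2584) = 8$ directly.&lt;&#x2F;p&gt;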
&lt;h2 id=&quot;fibonacci-sort-s-complexity&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#fibonacci-sort-s-complexity&quot; aria-label=&quot;Anchor link for: fibonacci-sort-s-complexity&quot;&gt;Fibonacci sort’s complexity&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Now, for the actual proof I have to stand on the shoulders of giants. Recall
that our gap sequence was as follows:&lt;&#x2F;p&gt;
&lt;p&gt;$$k_i = F_{i} \cdot F_{i + 1}.$$&lt;&#x2F;p&gt;
&lt;p&gt;In the 1986 paper &lt;em&gt;A New Upper Bound for Shellsort&lt;&#x2F;em&gt; by Robert Sedgewick, a formula due to Johnson is shown
(Theorem 4) which lets us bound the Frobenius term $g(a, b, c)$ for three numbers $a, b, c$, assuming that they are &lt;em&gt;independent&lt;&#x2F;em&gt;.
The Frobenius term $g(a, b, c)$ is what we discussed earlier in our analysis of Shellsort. It means that
for all $n \geq g(a, b, c)$ there exists some combination of non-negative integers $\alpha, \beta, \gamma$ such that
$$\alpha a + \beta b + \gamma c = n.$$&lt;&#x2F;p&gt;
&lt;p&gt;Independence, on the other hand, means none
of the numbers can be written as a linear combination of the others with non-negative integer coefficients.
Assuming $i \geq 3$ (so that $F_i &amp;gt; 1$), for $\{k_i, k_{i+1}, k_{i+2}\}$ it is easy to prove the independence using our earlier GCD formula. Since
$\gcd(i, i+2)$ is either $1$ or $2$ and $F_1 = F_2 = 1$, we establish that
$$\gcd(F_{i}, F_{i+1}) = F_{\gcd(i, i+1)} = F_{1} = 1,$$
$$\gcd(F_{i}, F_{i+2}) = F_{\gcd(i, i+2)} = 1,$$
and thus
$$\gcd(k_{i}, k_{i+1}) = \gcd(F_{i}\cdot F_{i+1}, F_{i+1}\cdot F_{i+2}) = F_{i+1}\cdot \gcd(F_{i}, F_{i+2}) = F_{i+1}.$$
This is strictly smaller than $k_i = F_iF_{i+1}$, which directly proves that $k_{i+1}$ is not a multiple of $k_{i}$. This
leaves one other possibility, that $k_{i+2}$ is dependent on $\{k_i, k_{i+1}\}$:
$$\alpha k_{i} + \beta k_{i+1} = k_{i+2}$$
$$\alpha F_i F_{i+1} + \beta F_{i+1} F_{i+2} = F_{i+2}F_{i+3}$$
$$F_{i+1}(\alpha F_i + \beta F_{i+2}) = F_{i+2}F_{i+3}$$
This can only be true if $F_{i+2}F_{i+3}$ is a multiple of $F_{i+1}$, but $F_{i+1}$ is coprime with both $F_{i+2}$ and $F_{i+3}$ as we saw earlier, and thus
this rules out any dependence.&lt;&#x2F;p&gt;
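&lt;p&gt;The independence claim can also be brute-forced for small indices. In the sketch below all function names are hypothetical, and the range of triples checked (starting at $k_3 = 6$) is an arbitrary choice. The helper &lt;code&gt;is_combination&lt;&#x2F;code&gt; tests whether a target is a non-negative integer combination of two other gaps:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def fibonacci_gaps(count):
    # First count products of consecutive Fibonacci numbers: 1, 2, 6, 15, 40, ...
    gaps, a, b = [], 1, 1
    for _ in range(count):
        gaps.append(a * b)
        a, b = b, a + b
    return gaps


def is_combination(target, a, b):
    # True if target = alpha * a + beta * b with non-negative integers alpha, beta.
    return any((target - m) % a == 0 for m in range(0, target + 1, b))


def independent(x, y, z):
    # None of the three values may be a combination of the other two.
    return not (is_combination(x, y, z)
                or is_combination(y, x, z)
                or is_combination(z, x, y))


gaps = fibonacci_gaps(15)
# gaps[2] is k_3 = 6, so this checks the triples starting at k_3 up to k_12.
print(all(independent(*gaps[i:i + 3]) for i in range(2, 12)))  # True
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;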
&lt;h3 id=&quot;frobenius-bound&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#frobenius-bound&quot; aria-label=&quot;Anchor link for: frobenius-bound&quot;&gt;Frobenius bound&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Since we have proven independence we can use Johnson’s formula from the above paper.
It states that
$$g(a, b, c) = d \cdot g\left(\frac{a}{d}, \frac{b}{d}, c\right) + (d-1)c$$
where $d = \gcd(a, b)$. This lets us use the formula for coprime
$a, b$ we saw in Lemma 2, because when $a, b &amp;lt; c$ and $a, b$ are coprime we have
$$g(a, b, c) \leq g(a, b) = (a - 1)(b - 1) \leq ab.$$
In our case this means that
$$g(k_{i+1}, k_{i+2}, k_{i+3}) \leq F_{i+2} \cdot g(F_{i+1}, F_{i+3}) + F_{i+2}F_{i+3}F_{i+4},$$
$$g(k_{i+1}, k_{i+2}, k_{i+3}) \leq F_{i+1}F_{i+2}F_{i+3} + F_{i+2}F_{i+3}F_{i+4}.$$
If we now use the asymptotic approximation $F_n = \Theta(\phi^n)$ where ${\phi = (\sqrt{5} + 1)&#x2F;2}$ we have&lt;&#x2F;p&gt;
&lt;p&gt;$$k_i = \Theta(\phi^{2i}), \quad g(k_{i+1}, k_{i+2}, k_{i+3}) = O(\phi^{3i}).$$&lt;&#x2F;p&gt;
&lt;p&gt;In other words, $g(k_{i+1}, k_{i+2}, k_{i+3}) = O(k_i^{3&#x2F;2})$.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;See my &lt;a href=&quot;&#x2F;blog&#x2F;magical-fibonacci-formulae&#x2F;#an-interlude-by-binet&quot;&gt;previous blog post&lt;&#x2F;a&gt;
on magical Fibonacci formulae to see how the asymptotic approximation is derived.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
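&lt;p&gt;To get a feeling for how loose this bound is, one can brute-force the threshold for a small triple. The sketch below (function name and search limit are my own assumptions) finds the first value from which every integer is a non-negative combination of $k_3, k_4, k_5 = 6, 15, 40$, and prints it next to the bound $F_3F_4F_5 + F_4F_5F_6$ obtained from the formula above with $i = 2$:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def threshold(coins, limit):
    # Collect every value up to limit that is a non-negative integer combination of coins.
    reach = {0}
    for s in range(1, limit + 1):
        if any((s - c) in reach for c in coins):
            reach.add(s)
    # One past the largest unreachable value; assumes limit is comfortably above it.
    return max(s for s in range(limit + 1) if s not in reach) + 1


print(threshold((6, 15, 40), 500))  # 90
print(2 * 3 * 5 + 3 * 5 * 8)        # 150
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The true threshold sits well below the bound here, which is fine: the analysis only needs an upper bound of the right order.&lt;&#x2F;p&gt;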
&lt;h3 id=&quot;back-to-shellsort&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#back-to-shellsort&quot; aria-label=&quot;Anchor link for: back-to-shellsort&quot;&gt;Back to Shellsort&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;We proved just above that after having processed gaps $k_{i+3}, k_{i+2}, k_{i+1}$
the maximum distance for an element from its true location is no more than $O(k_i^{3&#x2F;2})$.
Then if we perform a $k_i$-sort next, we have $k_i$ subsequences of length
$n &#x2F; k_i$ in which the maximum distance of an element from its sorted location
within that subsequence is no more than $O(k_i^{3&#x2F;2} &#x2F; k_i) = O(k_i^{1&#x2F;2})$.&lt;&#x2F;p&gt;
&lt;p&gt;Recall from earlier that insertion sort on a subsequence of length $n$
where each element is at most $m$ places from its sorted location has
complexity $O(nm)$. This means the $k_i$-sort step for all subsequences combined takes at most
$O(k_i \cdot (n &#x2F; k_i) \cdot k_i^{1&#x2F;2}) = O(n \cdot k_i^{1&#x2F;2})$ time.
This bound is good when $k_i$ is small, but we also have a different upper bound
which is good when $k_i$ is big, using the fact that insertion sort takes at
most $O(n^2)$ time. This gives us the bound $O(k_i \cdot (n &#x2F; k_i)^2) = O(n^2 &#x2F; k_i)$
as well.&lt;&#x2F;p&gt;
&lt;p&gt;We note that both bounds are equal when $k_i = \Theta(n^{2&#x2F;3})$, giving us $O(n^{4&#x2F;3})$
overall by consistently choosing the smaller of the two bounds. To make this a
bit more formal, let $t$ be the maximum index such that $k_t \leq n$, and let
$s$ represent the index on which we switch our bound. Then we have
$$cT \leq \sum_{i=1}^{s - 1} \left(n \cdot k_i^{1&#x2F;2}\right) + \sum_{i=s}^t \left(n^2 &#x2F; k_i\right),$$
where $T$ is our total runtime and $c &amp;gt; 0$ is some ‘constant’ irrelevant for the
asymptotic analysis (varying from equation to equation). Then, substituting the asymptotic approximation $k_i \approx \phi^{2i}$ we get
$$cT \leq n \sum_{i=1}^{s - 1} \phi^{i} + n^2\sum_{i=s}^t \phi^{-2i},$$
$$cT \leq n \cdot \frac{\phi^s - \phi}{\phi - 1} + n^2 \cdot \frac{\phi^{2 - 2s} - \phi^{-2t}}{\phi^2 - 1}.$$
Removing constant factors from terms and substituting $s = 2t&#x2F;3$ gives us
$$cT \leq n \cdot (\phi^{2t&#x2F;3} - 1) + n^2 \cdot (\phi^{- 4t&#x2F;3} - \phi^{-2t}).$$
Finally we note that within a constant factor $\phi^{2t} \approx k_t \approx n$ to simplify to
$$cT \leq n \cdot (n^{1&#x2F;3} - 1) + n^2 \cdot (n^{- 2&#x2F;3} - 1&#x2F;n),$$
$$cT \leq n^{4&#x2F;3} + n^{4&#x2F;3} - 2n,$$
giving our overall bound of $O(n^{4&#x2F;3})$.&lt;&#x2F;p&gt;
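&lt;p&gt;The crossover point used above comes from equating the two per-gap bounds:
$$n \cdot k_i^{1&#x2F;2} = \frac{n^2}{k_i} \iff k_i^{3&#x2F;2} = n \iff k_i = n^{2&#x2F;3},$$
at which point both evaluate to $n \cdot n^{1&#x2F;3} = n^{4&#x2F;3}$. As a rough numeric illustration (numbers chosen by me, ignoring constant factors), for $n = 10^6$ the crossover lies near $k_i \approx 10^4$: the gap $k_{10} = F_{10}F_{11} = 55 \cdot 89 = 4895$ is handled by the first bound and $k_{11} = F_{11}F_{12} = 89 \cdot 144 = 12816$ by the second.&lt;&#x2F;p&gt;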
&lt;h3 id=&quot;generalizing&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#generalizing&quot; aria-label=&quot;Anchor link for: generalizing&quot;&gt;Generalizing&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The above proof is analogous to the one found in &lt;em&gt;A New Upper Bound for Shellsort&lt;&#x2F;em&gt;
by Sedgewick; I don’t claim credit for it. Ultimately there is nothing fundamental
about the choice of the Fibonacci numbers here, any sequence which starts with $1$ and:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;grows like $\Theta(b^{2n})$ for some base $b$,&lt;&#x2F;li&gt;
&lt;li&gt;has a greatest common divisor on the order of $\Theta(b^n)$ for consecutive terms, and&lt;&#x2F;li&gt;
&lt;li&gt;where any three consecutive terms are independent,&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;should have an $O(n^{4&#x2F;3})$ runtime when used as a gap sequence in Shellsort.
For example, Sedgewick himself used (among others) the sequence
$$k_1 = 1, \quad k_i = (2^i - 3)(2^{i+1} - 3),$$
but it is pretty neat that the product of consecutive Fibonacci numbers can be
used here as well.&lt;&#x2F;p&gt;
&lt;p&gt;In fact, we can generalize our Fibonacci sequence to a simple way of
constructing sequences that match the above set of criteria. Simply start with a
sequence $A_n$ where any three consecutive elements of $A$ are pairwise coprime
and grow like $\Theta(b^n)$, then define the gap sequence as $$k_i = A_i \cdot
A_{i+1}.$$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;real-world-performance&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#real-world-performance&quot; aria-label=&quot;Anchor link for: real-world-performance&quot;&gt;Real-world performance&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;So… is Fibonacci sort any good in practice? As far as shellsorts go, it is middle-of-the-pack. I
also found a similarly structured sequence (with a similar $O(n^{4&#x2F;3})$ proof) which performs
quite a bit better:&lt;&#x2F;p&gt;
&lt;p&gt;$$A_1 = 1, \quad A_2 = 1, \quad A_3 = 2,\quad A_n = 2A_{n-2} + 1 \text{ for } n \geq 4$$
$$k_i = A_i \cdot A_{i+1} = (1, 2, 6, 15, 35, 77, 165, \dots)$$&lt;&#x2F;p&gt;
&lt;p&gt;The advantage of this gap sequence, like the Fibonacci one, is that it is a
simple reversible recurrence, meaning the implementation does not need a lookup table and
can compute the gaps on-the-fly. This is useful, because in my opinion Shellsort only truly has one
niche where it shines: tiny code size. I think it’s a solid choice for places where every byte of
code matters, while still giving a decent algorithm.&lt;&#x2F;p&gt;
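&lt;p&gt;A sketch of generating this second sequence follows. The name &lt;code&gt;orlp_gaps&lt;&#x2F;code&gt; is hypothetical, and it builds a list for readability rather than stepping the recurrence forwards and backwards in place, as a size-optimized implementation would:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def orlp_gaps(count):
    # A_1, A_2, A_3 are given; every later term is 2 * A_(n-2) + 1.
    seq = [1, 1, 2]
    for _ in range(count):
        seq.append(2 * seq[-2] + 1)
    # The gaps are products of consecutive terms of the sequence.
    return [seq[i] * seq[i + 1] for i in range(count)]


print(orlp_gaps(7))  # [1, 2, 6, 15, 35, 77, 165]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;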
&lt;p&gt;I ran &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;orlp&#x2F;fibonacci-sort-bench&quot;&gt;a simple benchmark&lt;&#x2F;a&gt; for the number of
comparisons performed by the following gap sequences:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Algorithm&lt;&#x2F;th&gt;&lt;th&gt;Formula&lt;&#x2F;th&gt;&lt;th&gt;Complexity&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;hibbard63&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k_i = 2^{i} - 1$&lt;&#x2F;td&gt;&lt;td&gt;$O(n^{3&#x2F;2})$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;pratt71&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k = \{2^p3^q\mid p, q \in \mathbb{N}_0\}$&lt;&#x2F;td&gt;&lt;td&gt;$O(n\,(\log n)^2)$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;sedge86a&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k_1 = 1, k_i = 4^{i} + 3\cdot 2^{i-1}+ 1$&lt;&#x2F;td&gt;&lt;td&gt;$O(n^{4&#x2F;3})$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;sedge86b&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k_1 = 1, k_i = (2^i - 3)(2^{i+1} - 3)$&lt;&#x2F;td&gt;&lt;td&gt;$O(n^{4&#x2F;3})$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;sedge86i&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k = \{4^j - 3\cdot 2^j + 1\} \cup {\{9\cdot 4^j - 9\cdot 2^j + 1\}}$&lt;&#x2F;td&gt;&lt;td&gt;&lt;strong&gt;Unknown&lt;&#x2F;strong&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;lee21&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k_i = \lceil\frac{\gamma^i - 1}{\gamma - 1}\rceil, \gamma = 2.243609061420001\dots$&lt;&#x2F;td&gt;&lt;td&gt;Unknown&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;fib25&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k_i = F_i \cdot F_{i+1}$&lt;&#x2F;td&gt;&lt;td&gt;$O(n^{4&#x2F;3})$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;orlp25&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;$k_i = A_i \cdot A_{i+1}$&lt;&#x2F;td&gt;&lt;td&gt;$O(n^{4&#x2F;3})$&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
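&lt;p&gt;To make the formulas in the table a bit more concrete, here is a minimal sketch that
generates two of the sequences. The helper names and the cut-off (only gaps smaller than
$n$ are kept) are my own assumptions for illustration, as is the indexing $F_1 = F_2 = 1$;
the benchmark itself may differ in these details.&lt;&#x2F;p&gt;

```python
def hibbard63_gaps(n):
    """Gaps 2^i - 1 that are smaller than n, in increasing order."""
    gaps = []
    i = 1
    while n > 2**i - 1:
        gaps.append(2**i - 1)
        i += 1
    return gaps


def fib25_gaps(n):
    """Products of consecutive Fibonacci numbers that are smaller than n.

    Assumes F_1 = F_2 = 1, so the sequence starts 1, 2, 6, 15, 40, ...
    """
    gaps = []
    a, b = 1, 1
    while n > a * b:
        gaps.append(a * b)
        a, b = b, a + b
    return gaps


print(hibbard63_gaps(100))  # [1, 3, 7, 15, 31, 63]
print(fib25_gaps(100))      # [1, 2, 6, 15, 40]
```

&lt;p&gt;A Shellsort would then walk such a list from the largest gap down to 1.&lt;&#x2F;p&gt;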
&lt;p&gt;I also added &lt;code&gt;heapsort&lt;&#x2F;code&gt; as an $O(n \log n)$ comparison datapoint, as it too is a
small in-place unstable sort. The results are as follows, with a log scale for size $n$
on the X-axis and the number of comparisons divided by $n \log_2 n$ on the Y-axis:&lt;&#x2F;p&gt;
&lt;div class=&quot;big&quot;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;fibonacci-sort&#x2F;comparison-bench.svg&quot; alt=&quot;Comparison benchmark.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;As you can see, &lt;code&gt;fib25&lt;&#x2F;code&gt; is neither great nor terrible. My other sequence, &lt;code&gt;orlp25&lt;&#x2F;code&gt;,
performs almost as well as the best experimentally derived sequences, but still
provides a worst-case guarantee.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;knuth-reward-check&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#knuth-reward-check&quot; aria-label=&quot;Anchor link for: knuth-reward-check&quot;&gt;Knuth Reward Check&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The astute observer may have seen &lt;code&gt;sedge86i&lt;&#x2F;code&gt; marked as having an unknown
complexity in bold. However, the Wikipedia page for &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Shellsort&quot;&gt;Shellsort&lt;&#x2F;a&gt;
at the time of writing lists it as $O(n^{4&#x2F;3})$. While writing this
article I did quite a bit of reading of Sedgewick’s 1986 paper &lt;em&gt;A New Upper Bound for Shellsort&lt;&#x2F;em&gt;;
in fact, I originally found the paper through a reference on the Wikipedia page
for that sequence.&lt;&#x2F;p&gt;
&lt;p&gt;While reading the paper I noticed that he proves two theorems, both
of which lead to $O(n^{4&#x2F;3})$ behavior: Theorem 5 and Theorem 6. Theorem 5 requires as a
key property that $k_i, k_{i+1}$, and $k_{i+2}$ are pairwise coprime for all $i$.
Theorem 6 requires as a key property that the greatest common divisor of
consecutive terms $k_i, k_{i+1}$ is always on the order of $\sqrt{k_i}$ (that’s
the theorem I used above).&lt;&#x2F;p&gt;
&lt;p&gt;But &lt;code&gt;sedge86i&lt;&#x2F;code&gt;, which consists of the merger of two sequences, satisfies neither
property, nor do the individual sequences before merging!&lt;&#x2F;p&gt;
&lt;p&gt;\begin{align*}
\{4^j - 3\cdot 2^j + 1\} &amp;amp;= 5, 41, 209, 929, 3905, 16001, 64769, 260609, \dots\\
\{9\cdot 4^j - 9\cdot 2^j + 1\} &amp;amp;= 1, 19, 109, 505, 2161, 8929, 36289, 146305, 587521, \dots
\end{align*}&lt;&#x2F;p&gt;
&lt;p&gt;As counterexamples in the pre-merged sequences we find $\gcd(209, 3905) = \gcd(36289, 587521) = 11$,
and $\gcd(16764929, 37730305) = 29$ in the merged sequence. In fact, if you
read the paper, Sedgewick only describes the merged sequence in his conclusion
section, as such:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The particular sequence used here is a merge of the sequences […]. These
occasionally have triples that are not relatively prime, but the combination
does better on random inputs than the sequence of Theorem 5 because it has
more smaller increments.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
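&lt;p&gt;The counterexamples above are easy to check mechanically. The following is a small
sketch using Python’s &lt;code&gt;math.gcd&lt;&#x2F;code&gt;; the helper names &lt;code&gt;seq_a&lt;&#x2F;code&gt; and
&lt;code&gt;seq_b&lt;&#x2F;code&gt; and the index ranges are assumptions chosen for illustration, not
something taken from Sedgewick’s paper.&lt;&#x2F;p&gt;

```python
from math import gcd


def seq_a(j):
    # First sequence from the text: 4^j - 3*2^j + 1 (equals 5 at j = 2).
    return 4**j - 3 * 2**j + 1


def seq_b(j):
    # Second sequence from the text: 9*4^j - 9*2^j + 1 (equals 1 at j = 0).
    return 9 * 4**j - 9 * 2**j + 1


# Counterexamples inside the individual sequences.
assert gcd(seq_a(4), seq_a(6)) == gcd(209, 3905) == 11
assert gcd(seq_b(6), seq_b(8)) == gcd(36289, 587521) == 11

# Build the merged sequence and list neighbours that share a factor.
a_terms = [seq_a(j) for j in range(2, 14)]
b_terms = [seq_b(j) for j in range(0, 13)]
merged = sorted(a_terms + b_terms)
shared = [(x, y, gcd(x, y)) for x, y in zip(merged, merged[1:]) if gcd(x, y) > 1]

# The pair from the text shows up as direct neighbours with gcd 29.
assert (16764929, 37730305, 29) in shared
print(merged[:6])  # [1, 5, 19, 41, 109, 209]
```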
&lt;p&gt;Sedgewick never claims this particular merged sequence leads to a worst-case of
$O(n^{4&#x2F;3})$. So why is it listed as such on the Wikipedia page? After more
digging I realized that the sequence is also found in Knuth’s
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;The_Art_of_Computer_Programming&quot;&gt;TAOCP&lt;&#x2F;a&gt;, in
Volume 3, Sorting and Searching, Section 5.2.1. At the
time Knuth wrote:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The final examples in Table 6 come from another sequence devised by Sedgewick,
based on slightly different heuristics.
When these increments $(h_0, h_1, h_2, \dots) = 1, 5, 19, 41, 109, 209, \dots$
are used, Sedgewick proved that the worst-case running time is $O(N^{4&#x2F;3})$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Except Sedgewick did no such thing (nor did he claim to). I guess whoever edited
the Wikipedia article read Knuth’s book and assumed that statement was correct.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;contacting-knuth&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#contacting-knuth&quot; aria-label=&quot;Anchor link for: contacting-knuth&quot;&gt;Contacting Knuth&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;I emailed Knuth with my findings, and a good few months later I received a
physical letter containing:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;my email printed out with a few pencil scribbles,&lt;&#x2F;li&gt;
&lt;li&gt;a reward check for finding an error.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The errata for Volume 3 now read (emphasis mine):&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;These increments […] combine two sequences that &lt;em&gt;resemble&lt;&#x2F;em&gt; increments
for which Sedgewick proved the worst-case time bound $O(N^{4&#x2F;3})$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;aside&gt;
&lt;p&gt;I could not resist including a brief synopsis of this blog post with my email,
explaining the Fibonacci gap sequence. Knuth wrote back (still scribbled in
pencil on my email), “You should experiment with this, and if it performs well or average
please publish that fact ASAP!”&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;div class=&quot;big&quot;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;fibonacci-sort&#x2F;knuth-check.jpg&quot; alt=&quot;Knuth reward check.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Why Bad AI Is Here to Stay</title>
		<author>Orson R. L. Peters</author>
		<published>2025-02-02T00:00:00+00:00</published>
		<updated>2025-02-02T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/bad-ai/" type="text/html"/>
		<id>https://orlp.net/blog/bad-ai/</id>
		<content type="html">&lt;p&gt;It seems that in 2025 a lot of people fall into one of two camps when it comes
to AI: skeptic or fanatic. The skeptic thinks AI sucks, that it’s overhyped, it
only ever parrots nonsense and it will all blow over soon. The fanatic thinks
general human-level intelligence is just around the corner, and that AI will
solve almost all our problems. I hope my title is sufficiently ambiguous to
attract both camps. The fanatic will be outraged, being ready to jump into the
fray to point out why AI isn’t or won’t stay bad. The skeptic will feel
validated, and will be eager to read more reasons as to why AI sucks.&lt;&#x2F;p&gt;
&lt;p&gt;I’m neither a skeptic nor a fanatic. I see AI more neutrally, as a
tool, and from that viewpoint I make the following two observations:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;AI is bad&lt;&#x2F;strong&gt;. It is often incorrect, expensive, racist, trained on data
without knowledge or consent, environmentally unfriendly, disruptive to
society, etc.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;AI is useful&lt;&#x2F;strong&gt;. Despite the above shortcomings there are tasks for which
AI is cheap and effective.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;I’m no seer. Perhaps AI will improve, become more accurate, less biased,
cheaper, trained on open access data, use less electricity, etc. Or perhaps we
have plateaued in performance, and there is no political or economic goodwill
to address any of the other issues, nor will there be.&lt;&#x2F;p&gt;
&lt;p&gt;However, even if AI does not improve in any of the above metrics, it will still
be useful, and I hope to show you in this article why. Hence my point: bad AI is
here to stay. If you agree with me on this, I hope you’ll also agree with me
that we have to stop pretending AI is useless and start taking it and its
problems seriously.&lt;&#x2F;p&gt;
&lt;aside&gt;
I&#x27;m going to be saying the word AI a lot in this article. It&#x27;s an incredibly
overloaded term with definitions covering and&#x2F;or excluding everything from a
Tic-Tac-Toe bot to super-human intelligence. For the sake of argument you can
assume I mean &quot;large language model anno 2025&quot; by AI, but I feel my points
can generalize quite a bit beyond that as well.
&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;a-formula-for-query-cost&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#a-formula-for-query-cost&quot; aria-label=&quot;Anchor link for: a-formula-for-query-cost&quot;&gt;A formula for query cost&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Suppose I am a human with some kind of question that can be answered. I know
AI could potentially help me with this question, but I wonder if it’s worth it
or if I should not use it at all. To help with this we can quantify the risk
associated with any potential method of answering the question:&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathrm{Risk}_\mathrm{AI} = \mathrm{Cost(query)} + (1 - P(\mathrm{success})) \cdot \mathrm{Cost(bad)}$$&lt;&#x2F;p&gt;
&lt;p&gt;That is, the risk of using any particular method is the cost associated with the
method plus the cost of the consequences of a bad answer multiplied by the
probability of failure. Here ‘Cost’ is a highly multidimensional object, which
can consist of but is not limited to: time, money, environmental impact, ethical
concerns, etc.&lt;&#x2F;p&gt;
&lt;p&gt;In a lot of cases however we don’t have to blindly trust the answer, and we can
verify it. In these cases the consequence of a bad answer is that you’re left
in the exact same scenario as before trying, except knowing that the AI is of no use.
In some scenarios, when the AI is non-deterministic, it might be worth it to try
again as well, but let’s assume for now that you’d have to switch methods. In this
case the risk is:&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathrm{Risk}_\mathrm{AI} = \mathrm{Cost(query)} + \mathrm{Cost(verify)} + (1 - P(\mathrm{success})) \cdot {\mathrm{Risk}}_\mathrm{Other}$$&lt;&#x2F;p&gt;
&lt;p&gt;The cost of a query is usually fairly fixed and known, and although verification
cost can vary drastically from task to task, I’d argue that in most cases the
cost of verification is also fairly predictable and known. This makes the risk
formula applicable in a lot of scenarios, if you have a good idea of the chance
of success. The latter, however, can usually only be established empirically, so
for one-shot queries without having done any similar queries in the past it can
be hard to evaluate whether trying AI is a good idea before doing so.&lt;&#x2F;p&gt;
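&lt;p&gt;As a rough sketch, the two risk formulas above can be written down as plain functions.
Note the big simplifying assumption: I collapse ‘Cost’ into a single number here (say,
minutes of your time), whereas in reality it is multidimensional. The function names and
the example numbers are made up for illustration.&lt;&#x2F;p&gt;

```python
def risk_unverified(cost_query, p_success, cost_bad):
    """Risk when the answer is trusted blindly."""
    return cost_query + (1 - p_success) * cost_bad


def risk_verified(cost_query, cost_verify, p_success, risk_other):
    """Risk when the answer is verified and a failure means switching methods."""
    return cost_query + cost_verify + (1 - p_success) * risk_other


# Hypothetical numbers, all in minutes: a 1 minute query, 2 minutes to verify,
# a 50% chance of success, and a 30 minute fallback method.
risk_ai = risk_verified(cost_query=1, cost_verify=2, p_success=0.5, risk_other=30)
print(risk_ai)  # 18.0, compared to 30 for going straight to the fallback
```

&lt;p&gt;With these made-up numbers trying the AI first is worth it, but the conclusion flips
once the chance of success drops low enough or verification becomes expensive.&lt;&#x2F;p&gt;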
&lt;aside&gt;
A big exception to the known cost of queries is that of hidden biases based on
race, gender, sexual orientation, caste or other traits one does not control
which ought to be irrelevant to the judgement call. Please, do not use black-box
AI to judge humans, whether that is for jobs, healthcare insurance, justice or
otherwise. It is immoral and dangerous.
&lt;&#x2F;aside&gt;
&lt;p&gt;There is one more expansion to the formula I’d like to make before we can look
at some examples, and that is to the definition of a successful answer:&lt;&#x2F;p&gt;
&lt;p&gt;$$P(\mathrm{success}) = P(\mathrm{correct} \cap \mathrm{relevant})$$&lt;&#x2F;p&gt;
&lt;p&gt;I define a successful answer as one that is both correct and relevant. For
example “1 + 1 = 2” might be a correct answer, but irrelevant if we asked about
anything else. Relevance is always subjective, but often the correctness of an
answer is as well - I’m not assuming here that all questions are about objective
facts.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cheap-and-effective-ai-queries&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#cheap-and-effective-ai-queries&quot; aria-label=&quot;Anchor link for: cheap-and-effective-ai-queries&quot;&gt;Cheap and effective AI queries&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Because AIs are fallible, usually the biggest cost is in fact the time needed
for a human to verify the answer as correct and relevant (or the cost of
consequences if left unverified). However, I’ve noticed a real asymmetry between
these two properties when it comes to AIs:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AIs often give incorrect answers&lt;&#x2F;strong&gt;. Worse, they will do so confidently, forcing
you to waste time checking their answer instead of them simply stating that
they don’t know for sure.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AIs almost never give irrelevant answers&lt;&#x2F;strong&gt;. If I ask about cheese, the
probability a modern AI starts talking about cars is very low.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;With this in mind I identify five general categories of query for which even bad
AI is useful, either by massively reducing or eliminating this verification cost
or by leaning on the strong relevance of AI answers:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Inspiration&lt;&#x2F;strong&gt;, where $\operatorname{Cost}(\mathrm{bad}) \approx 0$,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Creative&lt;&#x2F;strong&gt;, where $P(\mathrm{correct}) \approx 1$,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Planning&lt;&#x2F;strong&gt;, where $P(\mathrm{correct}) = P(\mathrm{relevant}) = 1$,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval&lt;&#x2F;strong&gt;, where $P(\mathrm{correct}) \approx P(\mathrm{relevant})$, and&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Objective&lt;&#x2F;strong&gt;, where $P(\mathrm{relevant}) = 1$ and correctness verification cost is low.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Let’s go over them one by one and look at some examples.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;inspiration-operatorname-cost-mathrm-bad-approx-0&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#inspiration-operatorname-cost-mathrm-bad-approx-0&quot; aria-label=&quot;Anchor link for: inspiration-operatorname-cost-mathrm-bad-approx-0&quot;&gt;Inspiration ($\operatorname{Cost}(\mathrm{bad}) \approx 0$)&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;In this category are the queries where the consequences of a wrong answer are
(near) zero. Informally speaking, “it can’t hurt to try”. In my experience these
kinds of queries tend to be the ones where you are looking for &lt;em&gt;something&lt;&#x2F;em&gt; but
don’t know exactly what; you’ll know it when you see it. For example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;“I have leeks, eggs and minced meat in the fridge, as well as a stocked pantry
with non-perishable staples. Can you suggest me some dishes I can make with this for a dinner?”&lt;&#x2F;li&gt;
&lt;li&gt;“What kind of fun activities can I do with a budget of $100 in New York?”&lt;&#x2F;li&gt;
&lt;li&gt;“Suggest some names for a Python function that finds the smallest non-negative number in a list.”&lt;&#x2F;li&gt;
&lt;li&gt;“The user wrote this partial paragraph on their phone, suggest three words that are most
likely to follow for a quick typing experience.”&lt;&#x2F;li&gt;
&lt;li&gt;“Give me 20 synonyms of or similar words to ‘good’.”&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I think the last query highlights where AI shines or falls for this kind of
query. The more localized and personalized your question is, the better the AI
will do compared to an alternative. For simple synonyms you can usually just
look up the word on a dedicated synonym site, as millions of other people have
also wondered the same thing. But the exact contents of your fridge or the
exact Python function you’re writing are rather unique to you.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;creative-p-mathrm-correct-approx-1&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#creative-p-mathrm-correct-approx-1&quot; aria-label=&quot;Anchor link for: creative-p-mathrm-correct-approx-1&quot;&gt;Creative ($P(\mathrm{correct}) \approx 1$)&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;In this category are the queries where there are (almost) no wrong answers.
The only thing that really matters is the relevance of the answer, and as I
mentioned before, I think AIs are pretty good at being relevant.&lt;&#x2F;p&gt;
&lt;p&gt;Examples of queries like these are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;“Draw me an image of a polar bear using a computer.”&lt;&#x2F;li&gt;
&lt;li&gt;“Write and perform for me a rock ballad about gnomes on tiny bicycles.”&lt;&#x2F;li&gt;
&lt;li&gt;“Rephrase the following sentence to be more formal.”&lt;&#x2F;li&gt;
&lt;li&gt;“Write a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sinterklaas#Saint_Nicholas&amp;#x27;_Eve_and_Saint_Nicholas&amp;#x27;_Day:~:text=Poems%20from%20Sinterklaas%20usually%20accompany%20gifts%2C%20bearing%20a%20personal%20message%20for%20the%20receiver.&quot;&gt;poem to accompany my Sinterklaas gift&lt;&#x2F;a&gt;.”&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This category does have a controversial aspect to it: it is ‘soulless’, inhuman.
Usually if there are no wrong answers we expect the creator to use this
opportunity to express their inner thoughts, ideas, experiences and emotions to
evoke them in others. If an AI generates art it is not viewed as genuine, even
if it evokes the same emotions in those ignorant of the art’s source, because
the human to human connection is lost.&lt;&#x2F;p&gt;
&lt;p&gt;Current AI models have no inner thoughts, ideas, experiences or emotions,
at least not in a way I recognize them. I think it’s fine to use AI art in
places where it would otherwise be meaningless (e.g. your corporate presentation
slides), fine for humans to use AI-assisted art tools to express themselves, but
ultimately defeating the point of art if used as a direct substitute.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;In the Netherlands we celebrate &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sinterklaas&quot;&gt;Sinterklaas&lt;&#x2F;a&gt;
which is, roughly speaking, Santa Claus (except we also have Santa Claus, so our
children double-dip during the gifting season). Traditionally, gifts from Sinterklaas come accompanied by
poems describing the gift and the receiver in a humorous way.&lt;&#x2F;p&gt;
&lt;p&gt;It is quite common
nowadays, albeit viewed as lazy, to generate such a poem using AI. What’s
interesting is that this practice long predates modern LLMs–the poems have such
a fixed structure that poem generators have existed for a long time. The earliest
reference I can find is the 1984 MS-DOS program “Sniklaas”. So people being lazy
in supposedly heartfelt art is nothing new.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;planning-p-mathrm-correct-p-mathrm-relevant-1&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#planning-p-mathrm-correct-p-mathrm-relevant-1&quot; aria-label=&quot;Anchor link for: planning-p-mathrm-correct-p-mathrm-relevant-1&quot;&gt;Planning ($P(\mathrm{correct}) = P(\mathrm{relevant}) = 1$)&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;This is a more restrictive form of creativity, where irrelevant
answers are absolutely impossible. This often requires some modification of the
AI output generation method, where you restrict the output to the valid
subdomain (for example yes &#x2F; no, or binary numbers, etc). However, this is
often trivial if you actually have access to the raw model by e.g. masking out
invalid outputs, or you are working with a model which outputs the answer
directly rather than in natural language or a stream of tokens.&lt;&#x2F;p&gt;
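&lt;p&gt;To illustrate what masking out invalid outputs could look like, here is a minimal
sketch. It assumes you can get a score per candidate output from the model; the
dictionary of scores and the function name &lt;code&gt;masked_argmax&lt;&#x2F;code&gt; are hypothetical
and not tied to any particular library.&lt;&#x2F;p&gt;

```python
import math


def masked_argmax(scores, allowed):
    """Pick the highest-scoring output, ignoring everything outside `allowed`."""
    if not any(token in allowed for token in scores):
        raise ValueError("no allowed output has a score")
    # Invalid outputs get a score of minus infinity, so they can never win.
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in scores.items()
    }
    return max(masked, key=masked.get)


# Hypothetical raw scores: the model would prefer to answer "maybe".
scores = {"maybe": 3.2, "yes": 1.1, "no": 0.4}
print(masked_argmax(scores, allowed={"yes", "no"}))  # yes
```

&lt;p&gt;Whatever the model would have liked to say, the result is always inside the valid
subdomain, which is what makes irrelevant answers impossible here.&lt;&#x2F;p&gt;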
&lt;p&gt;One might think in such a restrictive scenario there would be no useful
queries, but this isn’t true. The &lt;em&gt;quality&lt;&#x2F;em&gt; of the answer with respect to some
(complex) metric might still vary, and AIs might be far better than traditional
methods at navigating such domains. For example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;“Here is the schema of my database, a SQL query, a small sample of the data
and 100 possible query plans. Which query plan seems most likely to execute
the fastest? Take into account likely assumptions based on column names and
these small data samples.”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“What follows is a piece of code. Reformat the code, placing whitespace to
maximize readability, while maintaining the exact same syntax tree as
per this EBNF grammar.”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“Re-order this set of if-else conditions in my code based on your intuition to
minimize the expected number of conditions that need to be checked.”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“Simplify this math expression using the following set of rewrite rules.”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;retrieval-p-mathrm-correct-approx-p-mathrm-relevant&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#retrieval-p-mathrm-correct-approx-p-mathrm-relevant&quot; aria-label=&quot;Anchor link for: retrieval-p-mathrm-correct-approx-p-mathrm-relevant&quot;&gt;Retrieval ($P(\mathrm{correct}) \approx P(\mathrm{relevant})$)&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;I define retrieval queries as those where the correctness of the answer
depends (almost) entirely on its relevance. I’m including classification tasks
in this category as well, as one can view it as retrieval of the class from the
set of classes (or for binary classification, retrieval of positive samples from a larger set).&lt;&#x2F;p&gt;
&lt;p&gt;Then, as long as the cost of verification is low (e.g. a quick glance at a
result by a human to see if it interests them), or the consequences of not
verifying an irrelevant answer are minimal, AIs can be excellent at this. For
example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;“Here are 1000 reviews of a restaurant, which ones are overall positive?
Which ones mention unsanitary conditions?”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“Find me pictures of my dog in my photo collection.”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“What are good data structures for maintaining a list of events with dates
and quickly counting the number of events in a specified period of time?”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“I like Minecraft, can you suggest me some similar games?”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“Summarise this 200 page government proposal.”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“Which classical orchestral piece starts like ‘da da da daaaaa’”?&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;objective-p-mathrm-relevant-1-low-verification-cost&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#objective-p-mathrm-relevant-1-low-verification-cost&quot; aria-label=&quot;Anchor link for: objective-p-mathrm-relevant-1-low-verification-cost&quot;&gt;Objective ($P(\mathrm{relevant}) = 1$, low verification cost)&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;If a problem has an objective answer which can be verified, the relevance of the
answer doesn’t really matter or arguably even make sense as a concept. Thus
in these cases I’ll define $P(\mathrm{relevant}) = 1$ and leave the cost of
verification entirely to correctness.&lt;&#x2F;p&gt;
&lt;p&gt;AIs are often incorrect, but not always, so if the primary cost is verification
and verification can be done very cheaply or entirely automatically without
error, AIs can still be useful despite their fallibility.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;“What is the mathematical property where a series of numbers can only go up called?”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“Identify the car model in this photo.”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“I have a list of all Unicode glyphs which are commonly confused with other
letters. Can you write an efficient function returning a boolean value that returns true
for values in the list but false for all other code points?”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;“I formalized this mathematical conjecture in Lean. Can you help me write a proof for it?”&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
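&lt;p&gt;Taking the Unicode example above, cheap and automatic verification could look
something like the following sketch: simply compare the generated function against the
list for every code point. The three-element list and the two candidate functions are
stand-ins I made up for illustration, not a real confusables table or real AI output.&lt;&#x2F;p&gt;

```python
def verify_predicate(candidate, confusables, limit=0x110000):
    """Return True if candidate(cp) matches list membership for every code point below limit."""
    expected = set(confusables)
    return all(candidate(cp) == (cp in expected) for cp in range(limit))


# Tiny illustrative list (Cyrillic a, e and o), not a real confusables table.
CONFUSABLES = [0x0430, 0x0435, 0x043E]


def good_candidate(cp):
    # Stand-in for a generated function that happens to be correct.
    return cp in (0x0430, 0x0435, 0x043E)


def bad_candidate(cp):
    # Stand-in for a generated function that misses two entries.
    return cp == 0x0430


print(verify_predicate(good_candidate, CONFUSABLES))  # True
print(verify_predicate(bad_candidate, CONFUSABLES))   # False
```

&lt;p&gt;The check is exhaustive over all code points, so a wrong answer costs nothing more than
the time it takes to run it.&lt;&#x2F;p&gt;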
&lt;p&gt;In a way this category is reminiscent of the $P = NP$ problem. If you have an
efficient verification algorithm, is finding solutions still hard in general?
The answer seems to be yes, yet the proof eludes us. However, this is only true
&lt;em&gt;in general&lt;&#x2F;em&gt;. For specific problems it might very well be possible to use
AI to generate provably correct solutions with high probability, even though
the search space is far too large or too complicated for a traditional algorithm.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Out of the five identified categories, I consider inspiration and retrieval
queries to be the strongest use-case for AI where often there is no alternative
at all, besides an expensive and slow human that would rather be doing something
else. Relevance is highly subjective, complex and fuzzy, which AI handles much
better than traditional algorithms. Planning and objective queries are more
niche, but absolutely will see use-cases for AI that are hard to replace.&lt;&#x2F;p&gt;
&lt;p&gt;Creative queries are something I think AI is really good at, while
simultaneously being the most dangerous and useless category. Art, creativity
and human-to-human connections are in my opinion some of the most fundamental
aspects of human society, and I think it is incredibly dangerous to mess with
them.&lt;&#x2F;p&gt;
&lt;p&gt;So dangerous, in fact, that I consider many such queries useless. I wanted the above
examples all to be useful queries, so I did not list the following four examples
in the “Creative” section despite them belonging there:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;“Write me ten million personalized spam emails including these links
based on the following template.”&lt;&#x2F;li&gt;
&lt;li&gt;“Emulate being the perfect girlfriend for me–never disagree with me or
challenge my world views like real women do.”&lt;&#x2F;li&gt;
&lt;li&gt;“Here is a feed of Reddit threads discussing the upcoming election. Post a
comment in each thread, making up a personal anecdote how you are affected by
immigrants in a negative way.”&lt;&#x2F;li&gt;
&lt;li&gt;“A customer sent in this complaint. Try to help them with any questions they
have but if your help is insufficient explain that you are sorry but can not
help them any further. Do not reveal you are an AI.”&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Why do I consider these queries useless, despite them being potentially very
profitable or effective? Because their cost function includes such a large
detriment to society that only those who ignore its cost to society would ever
use them. However, since the cost is “to society” and not to any particular
individual, the only way to address this problem is with legislation, as
otherwise bad actors are free to harm society for (temporary) personal gain.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;I wrote this article because I noticed that there are a lot of otherwise
intelligent people out there who still believe (or hope) that all AI is useless
garbage and that it and its problems will go away by themselves. They will not.
If you know someone who still believes so, please share this article with them.&lt;&#x2F;p&gt;
&lt;p&gt;AI is bad, yes, but bad AI is still useful. Therefore, bad AI is here to stay,
and we must deal with it.&lt;&#x2F;p&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Breaking CityHash64, MurmurHash2&#x2F;3, wyhash, and more...</title>
		<author>Orson R. L. Peters</author>
		<published>2024-11-02T00:00:00+00:00</published>
		<updated>2024-11-02T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/breaking-hash-functions/" type="text/html"/>
		<id>https://orlp.net/blog/breaking-hash-functions/</id>
		<content type="html">&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_function&quot;&gt;Hash functions&lt;&#x2F;a&gt; are incredibly
neat mathematical objects. They can map arbitrary data to a small fixed-size
output domain such that the mapping is deterministic, yet appears to be random.
This “deterministic randomness” is incredibly useful for a variety of purposes,
such as &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_table&quot;&gt;hash tables&lt;&#x2F;a&gt;,
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Checksum&quot;&gt;checksums&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Monte_Carlo_algorithm&quot;&gt;Monte Carlo
algorithms&lt;&#x2F;a&gt;,
communication-less &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Distributed_algorithm&quot;&gt;distributed
algorithms&lt;&#x2F;a&gt;, etc. The list
goes on.&lt;&#x2F;p&gt;
&lt;p&gt;In this article we will take a look at the dark side of hash functions: when
things go wrong. Luckily this essentially never happens due to unlucky inputs in
the wild (for good hash functions, at least). However, people exist, and some of
them may be malicious. Thus we must look towards computer security for answers.
I will quickly explain some of the basics of hash function security and then
show how easy it is to break this security for some commonly used
non-cryptographic hash functions.&lt;&#x2F;p&gt;
&lt;p&gt;As a teaser, this article explains how you can generate strings
such as these, thousands per second:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt; cityhash64(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;orlp-cityhash64-D-:K5yx*zkgaaaaa&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;) == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1337
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash2(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;orlp-murmurhash64-bkiaaa&amp;amp;JInaNcZ&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;) == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1337
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash3(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;orlp-murmurhash3_x86_32-haaaPa*+&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;) == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1337
&lt;&#x2F;span&gt;&lt;span&gt; farmhash64(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;orlp-farmhash64-&#x2F;v^CqdPvziuheaaa&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;) == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1337
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I also show how you can create some really funky pairs of strings that can be
concatenated arbitrarily such that when concatenating $k$ strings together
any of the $2^k$ combinations all have the same hash output, regardless of the
seed used for the hash function:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;a = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;xx0rlpx!xxsXъВ&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;xxsXъВxx0rlpx!&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash2(a + a, seed) == murmurhash2(a + b, seed)
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash2(a + a, seed) == murmurhash2(b + a, seed)
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash2(a + a, seed) == murmurhash2(b + b, seed)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;a = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;!&amp;amp;orlpՓ&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;yǏglp$X&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash3(a + a, seed) == murmurhash3(a + b, seed)
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash3(a + a, seed) == murmurhash3(b + a, seed)
&lt;&#x2F;span&gt;&lt;span&gt;murmurhash3(a + a, seed) == murmurhash3(b + b, seed)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;hash-function-security-basics&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#hash-function-security-basics&quot; aria-label=&quot;Anchor link for: hash-function-security-basics&quot;&gt;Hash function security basics&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Hash functions play a critical role in computer security. Hash
functions are used not only to verify messages over secure channels, they are
also used to identify trusted updates as well as known viruses. Virtually every
signature scheme ever used starts with a hash function.&lt;&#x2F;p&gt;
&lt;p&gt;If a hash function does not behave randomly, we can break the above security
constructs. &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Cryptographic_hash_function&quot;&gt;Cryptographic hash
functions&lt;&#x2F;a&gt; thus take
the randomness aspect very seriously. The ideal hash function would choose an
output completely at random for each input, remembering that choice for future
calls. This is called a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Random_oracle&quot;&gt;random
oracle&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The problem is that a random oracle requires a true random number generator, and
more problematically, a globally accessible infinite memory bank. So we
approximate it using deterministic hash functions instead. These compute their
output by essentially shuffling their input really, really well, in such a way
that it is not feasible to reverse.&lt;&#x2F;p&gt;
&lt;p&gt;To help quantify whether a specific function does a good job of approximating a
random oracle, cryptographers came up with a variety of properties that a random
oracle would have. The three most important and well-known properties a secure
cryptographic hash function should satisfy are:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pre-image resistance.&lt;&#x2F;strong&gt; For some constant $c$ it should be hard to find
some input $m$ such that $h(m) = c$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Second pre-image resistance.&lt;&#x2F;strong&gt; For some input $m_1$ it should be hard to
find another input $m_2$ such that $h(m_1) = h(m_2)$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Collision resistance.&lt;&#x2F;strong&gt; It should be hard to find inputs $m_1, m_2$ such
that $h(m_1) = h(m_2)$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;aside&gt;Note that collision resistance implies second pre-image resistance
which in turn implies pre-image resistance. Conversely, a pre-image attack breaks all three
properties.&lt;&#x2F;aside&gt;
&lt;p&gt;We generally consider one of these properties &lt;em&gt;broken&lt;&#x2F;em&gt; if there exists a method
that produces a collision or pre-image faster than simply trying random
inputs (also known as a &lt;em&gt;brute force attack&lt;&#x2F;em&gt;). However, there are definitely
gradations in breakage, as some methods are only several orders of magnitude
faster than brute force. That may sound like a lot, but a method taking
$2^{110}$ steps instead of $2^{128}$ is still just as far out of reach for
today’s computers.&lt;&#x2F;p&gt;
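&lt;p&gt;To put those numbers in perspective, here is a back-of-the-envelope sketch. The rate of $10^{12}$ hash evaluations per second is purely an assumption picked for illustration, not a measurement of any real machine, and &lt;code&gt;years_needed&lt;&#x2F;code&gt; is a made-up helper name:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# Rough estimate of how long an attack with a given step count would take.
# ASSUMPTION: 10**12 hash evaluations per second (illustrative only).
RATE_PER_SECOND = 10**12
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def years_needed(log2_steps):
    # Number of years needed to perform 2**log2_steps hash evaluations.
    return 2**log2_steps / (RATE_PER_SECOND * SECONDS_PER_YEAR)

for bits in (110, 128):
    print(f"2^{bits} steps: about {years_needed(bits):.1e} years")
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Under that assumed rate both figures come out many orders of magnitude beyond any practical time span, which is the sense in which the two attacks are equally infeasible.&lt;&#x2F;p&gt;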
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;MD5&quot;&gt;MD5&lt;&#x2F;a&gt; used to be a common hash function, and
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;SHA-1&quot;&gt;SHA-1&lt;&#x2F;a&gt; is still in common use today. While
both were considered cryptographically secure at one point, generating MD5
collisions now takes less than a second on a modern PC. In 2017 a collaboration
of researchers from CWI and Google announced &lt;a href=&quot;https:&#x2F;&#x2F;shattered.io&#x2F;&quot;&gt;the first SHA-1
collision&lt;&#x2F;a&gt;. However, as far as I’m aware, neither MD5 nor
SHA-1 has a practical (second) pre-image attack, only theoretical ones.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;non-cryptographic-hash-functions&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#non-cryptographic-hash-functions&quot; aria-label=&quot;Anchor link for: non-cryptographic-hash-functions&quot;&gt;Non-cryptographic hash functions&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Cryptographically secure hash functions tend to have a small problem: they’re
slow. Modern hash functions such as &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;BLAKE3-team&#x2F;BLAKE3&quot;&gt;BLAKE3&lt;&#x2F;a&gt;
resolve this somewhat by heavily vectorizing the hash using
SIMD instructions, as well as parallelizing over multiple threads, but even then
they require large inputs before they reach their peak speeds.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;One particular use-case for hash functions is deriving a secret key from a
password: a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Key_derivation_function&quot;&gt;key derivation
function&lt;&#x2F;a&gt;. Unlike regular
hash functions, being slow is actually a safety feature here to protect against
brute forcing passwords. Modern ones such as
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Argon2&quot;&gt;Argon2&lt;&#x2F;a&gt; also intentionally use a lot of
memory for protection against specialized hardware such as &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Application-specific_integrated_circuit&quot;&gt;ASICs&lt;&#x2F;a&gt;
or &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Field-programmable_gate_array&quot;&gt;FPGAs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;A lot of problems don’t necessarily require secure hash functions, and people
would much prefer a faster hash, especially when computing many
small hashes, such as in a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_table&quot;&gt;hash table&lt;&#x2F;a&gt;.
Let’s take a look at what common hash table implementations actually use as their
hash for strings:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;C++: there are multiple standard library implementations, but 64-bit
&lt;code&gt;clang&lt;&#x2F;code&gt; 13.0.0 on Apple M1 &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm-mirror&#x2F;libcxx&#x2F;blob&#x2F;78d6a7767ed57b50122a161b91f59f19c9bd0d19&#x2F;include&#x2F;utility#L977&quot;&gt;ships&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;cityhash&quot;&gt;&lt;code&gt;CityHash64&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.
Currently &lt;code&gt;libstdc++&lt;&#x2F;code&gt; &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gcc-mirror&#x2F;gcc&#x2F;blob&#x2F;20d790aa3ea5b0d240032cab997b8e0938cac62c&#x2F;libstdc%2B%2B-v3&#x2F;libsupc%2B%2B&#x2F;hash_bytes.cc#L136&quot;&gt;ships&lt;&#x2F;a&gt;
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;aappleby&#x2F;smhasher&#x2F;blob&#x2F;master&#x2F;src&#x2F;MurmurHash2.cpp&quot;&gt;&lt;code&gt;MurmurHash64A&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
a variant of Murmur2 for 64-bit platforms.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Java: OpenJDK uses an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;openjdk&#x2F;zgc&#x2F;blob&#x2F;ccf2f5837b31cddd24ec81f7f67107d9fc03c294&#x2F;src&#x2F;java.base&#x2F;share&#x2F;classes&#x2F;jdk&#x2F;internal&#x2F;util&#x2F;ArraysSupport.java#L212&quot;&gt;incredibly simple hash algorithm&lt;&#x2F;a&gt;, which essentially just computes
&lt;code&gt;h = 31 * h + c&lt;&#x2F;code&gt; for each character &lt;code&gt;c&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;PHP: the Zend engine uses &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;php&#x2F;php-src&#x2F;blob&#x2F;master&#x2F;Zend&#x2F;zend_string.h#L431&quot;&gt;essentially the same algorithm&lt;&#x2F;a&gt;
as Java, just using unsigned integers and &lt;code&gt;33&lt;&#x2F;code&gt; as its multiplier.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Nim: it &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;nim-lang&#x2F;Nim&#x2F;blob&#x2F;46d2161c23c2aa1905571512b9a1ef7d61ae670e&#x2F;lib&#x2F;pure&#x2F;hashes.nim#L386&quot;&gt;used to use&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PeterScott&#x2F;murmur3&#x2F;blob&#x2F;master&#x2F;murmur3.c&quot;&gt;&lt;code&gt;MurmurHash3_x86_32&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;. While writing this article they appeared to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;nim-lang&#x2F;Nim&#x2F;blob&#x2F;46bb47a444bd377860d832fc1c62b262343f36a2&#x2F;lib&#x2F;pure&#x2F;hashes.nim#L537&quot;&gt;have switched&lt;&#x2F;a&gt; to use
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;farmhash&quot;&gt;farmhash&lt;&#x2F;a&gt; by default.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Zig: it &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ziglang&#x2F;zig&#x2F;blob&#x2F;904f414e7eab7bc0f7ea00f616831bfc3c1f18a4&#x2F;lib&#x2F;std&#x2F;hash_map.zig#L31&quot;&gt;uses&lt;&#x2F;a&gt;
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;wangyi-fudan&#x2F;wyhash&#x2F;blob&#x2F;master&#x2F;wyhash.h&quot;&gt;&lt;code&gt;wyhash&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; by default, with &lt;code&gt;0&lt;&#x2F;code&gt; as seed.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;JavaScript: in V8 they use &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;v8&#x2F;v8&#x2F;blob&#x2F;b3776d5dea2f7858e9903a014b63ea86ef30c04f&#x2F;src&#x2F;strings&#x2F;string-hasher-inl.h#L114&quot;&gt;a custom&lt;&#x2F;a&gt;
weak string hash, with a randomly initialized seed.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There were some that used stronger hashes by default as well:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Go &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;golang&#x2F;go&#x2F;blob&#x2F;d12fe60004ae5e4024c8a93f4f7de7183bb61576&#x2F;src&#x2F;runtime&#x2F;asm_amd64.s#L1117&quot;&gt;uses&lt;&#x2F;a&gt; an
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Advanced_Encryption_Standard&quot;&gt;AES&lt;&#x2F;a&gt;-based hash
if hardware acceleration is available on x86-64. Even though its construction is custom
and likely not full-strength cryptographically secure, breaking it is too
much effort and quite possibly beyond my capabilities.&lt;&#x2F;p&gt;
&lt;p&gt;If hardware acceleration is not available, it uses an algorithm &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;golang&#x2F;go&#x2F;blob&#x2F;d12fe60004ae5e4024c8a93f4f7de7183bb61576&#x2F;src&#x2F;runtime&#x2F;hash64.go#L25&quot;&gt;inspired by wyhash&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Python and Rust use &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;SipHash&quot;&gt;SipHash&lt;&#x2F;a&gt; by
default, which is a cryptographically secure &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pseudorandom_function_family&quot;&gt;pseudorandom function&lt;&#x2F;a&gt;.
This is effectively a hash function where you’re allowed to use a secret key
during hashing, unlike a hash like SHA-2 where everyone knows all information
involved.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This latter concept is actually really important, at least for protecting
against &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Collision_attack&quot;&gt;HashDoS&lt;&#x2F;a&gt; in hash
tables. Even if a hash function is perfectly secure over its complete output,
hash tables further reduce the output to only a few bits to find the data they
are looking for. For a static hash function without any randomness it’s possible
to produce large lists of inputs whose hashes collide post-reduction, just by brute
force. But as we’ll see here, for non-cryptographic hashes we often don’t need
brute force and can generate collisions for the full output at high speed, if
the hash is not randomized by a random seed.&lt;&#x2F;p&gt;
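&lt;p&gt;As a small illustration of how cheap such collisions can be, here is a sketch using a Python re-implementation of the unseeded &lt;code&gt;h = 31 * h + c&lt;&#x2F;code&gt; scheme listed above, reduced modulo $2^{32}$. The function name &lt;code&gt;java_style_hash&lt;&#x2F;code&gt; is made up for this example, and the snippet only mimics the arithmetic rather than any particular runtime:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# Sketch of the h = 31 * h + c string hash, reduced modulo 2**32.
# java_style_hash is a hypothetical name chosen for this illustration.
from itertools import product

def java_style_hash(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) % 2**32
    return h

# "Aa" and "BB" collide: 65 * 31 + 97 == 66 * 31 + 66 == 2112.
blocks = ["Aa", "BB"]
k = 4

# Every way of concatenating k blocks, i.e. 2**k different strings.
colliding = ["".join(p) for p in product(blocks, repeat=k)]
hashes = {java_style_hash(s) for s in colliding}
print(len(colliding), "strings,", len(hashes), "distinct hash value(s)")
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both blocks have the same length and the same hash, so appending either one leaves the running hash in the same state. Each extra block therefore doubles the number of colliding strings without any search at all.&lt;&#x2F;p&gt;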
&lt;h2 id=&quot;interlude-inverse-operations&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#interlude-inverse-operations&quot; aria-label=&quot;Anchor link for: interlude-inverse-operations&quot;&gt;Interlude: inverse operations&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Before we get to breaking some of the above hash functions, I must explain a basic
technique I will use a lot: inverting operations. We are first exposed to
this in primary school, where we might be faced with a question such as “$2 + x = 10$”.
There we learn &lt;em&gt;subtraction&lt;&#x2F;em&gt; is the &lt;em&gt;inverse&lt;&#x2F;em&gt; of addition, such that we may find
$x$ by computing $10 - 2 = 8$.&lt;&#x2F;p&gt;
&lt;p&gt;Most operations on the integer registers in computers are also invertible, despite
the integers being reduced modulo $2^{w}$ in the case of overflow. Let
us study some:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Addition can be inverted using subtraction. That is, &lt;code&gt;x += y&lt;&#x2F;code&gt; can be inverted
using &lt;code&gt;x -= y&lt;&#x2F;code&gt;. Seems obvious enough.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Multiplication by a constant $c$ is &lt;em&gt;not&lt;&#x2F;em&gt; inverted by division. This would
not work in the case of overflow. Instead, we calculate the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Modular_multiplicative_inverse&quot;&gt;modular
multiplicative
inverse&lt;&#x2F;a&gt; of
$c$. This is an integer $c^{-1}$ such that $c \cdot c^{-1} \equiv 1 \pmod
{m}$. Then we invert multiplication by $c$ simply by multiplying by $c^{-1}$.&lt;&#x2F;p&gt;
&lt;p&gt;This constant exists if and only if $c$ is &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Coprime_integers&quot;&gt;coprime&lt;&#x2F;a&gt; with our modulus $m$, which for
us means that $c$ must be odd as $m = 2^w$. For example, multiplication by $2$ is not
invertible, which is easy to see as it is equivalent to a bit shift
to the left by one position, losing the most significant bit forever.&lt;&#x2F;p&gt;
&lt;p&gt;Without delving into the details, here is a snippet of Python code that computes
the modular multiplicative inverse of an integer using the
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Extended_Euclidean_algorithm&quot;&gt;extended Euclidean algorithm&lt;&#x2F;a&gt;
by calculating $x, y$ such that
$$cx + my = \gcd(c, m).$$
Then, because $c$ is coprime with $m$ we find $\gcd(c, m) = 1$, which, since $my \equiv 0 \pmod m$, means that
$$cx + 0 \equiv 1 \pmod m,$$
and thus $x = c^{-1}$.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;egcd(a, b):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;a == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;(b, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    g, y, x = egcd(b % a, a)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;(g, x - (b &#x2F;&#x2F; a) * y, y)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;modinv(c, m):
&lt;&#x2F;span&gt;&lt;span&gt;    g, x, y = egcd(c, m)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;assert &lt;&#x2F;span&gt;&lt;span&gt;g == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;c, m must be coprime&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;x % m
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using this we can invert modular multiplication:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; modinv(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;17&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4042322161
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;42 &lt;&#x2F;span&gt;&lt;span&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;17 &lt;&#x2F;span&gt;&lt;span&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4042322161 &lt;&#x2F;span&gt;&lt;span&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;42
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Magic!&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;XOR can be inverted using… XOR. It is its own inverse. So &lt;code&gt;x ^= y&lt;&#x2F;code&gt; can be
inverted using &lt;code&gt;x ^= y&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Bit shifts can not be inverted, but two common operations in hash functions
that use bit shifts can be. The first is bit &lt;em&gt;rotation&lt;&#x2F;em&gt; by a constant. This
is best explained visually, for example a bit rotation to the left by 3
places on an 8-bit word, where each bit is shown as a letter:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;abcdefgh
&lt;&#x2F;span&gt;&lt;span&gt;defghabc
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The formula for a right-rotation of &lt;code&gt;k&lt;&#x2F;code&gt; places is &lt;code&gt;(x &amp;gt;&amp;gt; k) | (x &amp;lt;&amp;lt; (w - k))&lt;&#x2F;code&gt;, where &lt;code&gt;w&lt;&#x2F;code&gt; is the width of the integer type. Its inverse is a
left-rotation, which simply swaps the direction of both shifts.
Alternatively, the inverse of a right-rotation of &lt;code&gt;k&lt;&#x2F;code&gt; places is another
right-rotation of &lt;code&gt;w-k&lt;&#x2F;code&gt; places.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Another common operation in hash functions is the “xorshift”. It is an operation
of one of the following forms, with $k &amp;gt; 0$:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;x ^= x &amp;lt;&amp;lt; k  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Left xorshift.
&lt;&#x2F;span&gt;&lt;span&gt;x ^= x &amp;gt;&amp;gt; k  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Right xorshift.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;How to invert it is entirely analogous between the two, so I will focus on the
left xorshift.&lt;&#x2F;p&gt;
&lt;p&gt;An important observation is that the least
significant $k$ bits are left entirely untouched by the xorshift.
Thus by repeating the operation, we recover the least significant $2k$ bits,
as the XOR will invert itself for the next $k$ bits.
Let’s take a look at the resulting value to see how we should proceed:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;v0 = (x &amp;lt;&amp;lt; k) ^ x
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Apply first step of inverse v1 = v0 ^ (v0 &amp;lt;&amp;lt; k).
&lt;&#x2F;span&gt;&lt;span&gt;v1 = (x &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;*k) ^ (x &amp;lt;&amp;lt; k) ^ (x &amp;lt;&amp;lt; k) ^ x
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Simplify using self-inverse (x &amp;lt;&amp;lt; k) ^ (x &amp;lt;&amp;lt; k) = 0.
&lt;&#x2F;span&gt;&lt;span&gt;v1 = (x &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;*k) ^ x
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;From this we can conclude the following identity:
$$\operatorname{xorshift}(\operatorname{xorshift}(x, k), k) = \operatorname{xorshift}(x, 2k)$$
Now we only need one more observation to complete our algorithm: a xorshift by $k \geq w$, where $w$ is the width of our integer, is
a no-op. Thus we repeatedly apply our doubling identity until we reach
a large enough $q$ such that $2^q \cdot k \geq w$, and thus $\operatorname{xorshift}(x, 2^q \cdot k) = x$.&lt;&#x2F;p&gt;
&lt;p&gt;For example, to invert a left xorshift by 13 for 64-bit integers we apply the following sequence:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;x ^= x &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Left xorshift by 13.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;x ^= x &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Inverse step 1.
&lt;&#x2F;span&gt;&lt;span&gt;x ^= x &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;26  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Inverse step 2.
&lt;&#x2F;span&gt;&lt;span&gt;x ^= x &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;52  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Inverse step 3.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; x ^= x &amp;lt;&amp;lt; 104  &#x2F;&#x2F; Next step would be a no-op.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
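&lt;p&gt;The doubling trick for the xorshift can be written as a short loop. The sketch below handles the right xorshift, which the walkthrough above leaves as the analogous case. The helper names &lt;code&gt;right_xorshift&lt;&#x2F;code&gt; and &lt;code&gt;invert_right_xorshift&lt;&#x2F;code&gt; are made up for this example, and inputs are assumed to be non-negative integers that fit in &lt;code&gt;w&lt;&#x2F;code&gt; bits:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# Minimal sketch: undo y = x ^ (x >> k) on w-bit words by repeated doubling.
# Both helper names are hypothetical and not taken from any library.
def right_xorshift(x, k):
    return x ^ (x >> k)

def invert_right_xorshift(y, k, w=64):
    assert k > 0, "a shift of zero would clear the value and cannot be undone"
    shift = k
    while w > shift:       # a shift of w or more places would be a no-op
        y ^= y >> shift    # turns xorshift(x, shift) into xorshift(x, 2 * shift)
        shift *= 2
    return y

x = 0x0123456789ABCDEF
assert invert_right_xorshift(right_xorshift(x, 13), 13) == x  # steps 13, 26, 52
assert invert_right_xorshift(right_xorshift(x, 47), 47) == x  # a single step
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because a right shift never produces bits above the word, no masking is needed here. A left-xorshift variant in Python would additionally have to truncate each intermediate result to &lt;code&gt;w&lt;&#x2F;code&gt; bits.&lt;&#x2F;p&gt;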
&lt;p&gt;Armed with this knowledge, we can now attack.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;breaking-cityhash64&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#breaking-cityhash64&quot; aria-label=&quot;Anchor link for: breaking-cityhash64&quot;&gt;Breaking &lt;code&gt;CityHash64&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Let us take a look at (part of) &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm-mirror&#x2F;libcxx&#x2F;blob&#x2F;78d6a7767ed57b50122a161b91f59f19c9bd0d19&#x2F;include&#x2F;utility#L977&quot;&gt;the source code&lt;&#x2F;a&gt; of
&lt;code&gt;CityHash64&lt;&#x2F;code&gt; from &lt;code&gt;libcxx&lt;&#x2F;code&gt; that’s used for hashing strings on 64-bit platforms:&lt;&#x2F;p&gt;
&lt;aside&gt;C++ standard library code goes through a process known as &#x27;uglification&#x27;,
which prepends underscores to all identifiers. This is because those identifiers
are reserved by the standard to only be used in the standard library, and thus
won&#x27;t be replaced by macros from standards-compliant code. For your sanity&#x27;s
sake I removed them here.&lt;&#x2F;aside&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;static const uint64_t mul = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x9ddfea08eb382d69&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;static const uint64_t k0 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xc3a5c85c97cb3127&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;static const uint64_t k1 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xb492b66fbe98f273&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;static const uint64_t k2 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x9ae16a3b2f90404f&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;static const uint64_t k3 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xc949d7c7509e6557&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;class&lt;&#x2F;span&gt;&lt;span&gt; T&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;T loadword(const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span&gt;* p) {
&lt;&#x2F;span&gt;&lt;span&gt;    T r;
&lt;&#x2F;span&gt;&lt;span&gt;    std::memcpy(&amp;amp;r, p, sizeof(r));
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; r;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;uint64_t rotate(uint64_t val, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span&gt;shift) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(shift == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; val;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;(val &amp;gt;&amp;gt; shift) | (val &amp;lt;&amp;lt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64 &lt;&#x2F;span&gt;&lt;span&gt;- shift));
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;uint64_t hash_len_16(uint64_t u, uint64_t v) {
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t x = u ^ v;
&lt;&#x2F;span&gt;&lt;span&gt;    x *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    x ^= x &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t y = v ^ x;
&lt;&#x2F;span&gt;&lt;span&gt;    y *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    y ^= y &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    y *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; y;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;uint64_t hash_len_17_to_32(const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char &lt;&#x2F;span&gt;&lt;span&gt;*s, uint64_t len) {
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t a = loadword&amp;lt;uint64_t&amp;gt;(s) * k1;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t b = loadword&amp;lt;uint64_t&amp;gt;(s + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t c = loadword&amp;lt;uint64_t&amp;gt;(s + len - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;) * k2;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t d = loadword&amp;lt;uint64_t&amp;gt;(s + len - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;) * k0;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;hash_len_16(
&lt;&#x2F;span&gt;&lt;span&gt;        rotate(a - b, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;43&lt;&#x2F;span&gt;&lt;span&gt;) + rotate(c, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span&gt;) + d,
&lt;&#x2F;span&gt;&lt;span&gt;        a + rotate(b ^ k3, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;) - c + len
&lt;&#x2F;span&gt;&lt;span&gt;    );
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To break this, let’s assume we’ll always give length 32 inputs. Then the
implementation will always call &lt;code&gt;hash_len_17_to_32&lt;&#x2F;code&gt;, and we have full control
over variables &lt;code&gt;a&lt;&#x2F;code&gt;, &lt;code&gt;b&lt;&#x2F;code&gt;, &lt;code&gt;c&lt;&#x2F;code&gt; and &lt;code&gt;d&lt;&#x2F;code&gt; by changing our input.&lt;&#x2F;p&gt;
&lt;p&gt;Note that &lt;code&gt;d&lt;&#x2F;code&gt; is only used once, in the final expression. This makes it a
prime target for attacking the hash. We will choose &lt;code&gt;a&lt;&#x2F;code&gt;, &lt;code&gt;b&lt;&#x2F;code&gt; and &lt;code&gt;c&lt;&#x2F;code&gt; arbitrarily,
and then solve for &lt;code&gt;d&lt;&#x2F;code&gt; to compute a desired hash outcome.&lt;&#x2F;p&gt;
&lt;p&gt;Using the above &lt;code&gt;modinv&lt;&#x2F;code&gt; function we first compute the necessary modular multiplicative
inverses of &lt;code&gt;mul&lt;&#x2F;code&gt; and &lt;code&gt;k0&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x9ddfea08eb382d69 &lt;&#x2F;span&gt;&lt;span&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xdc56e6f5090b32d9 &lt;&#x2F;span&gt;&lt;span&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xc3a5c85c97cb3127 &lt;&#x2F;span&gt;&lt;span&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x81bc9c5aa9c72e97 &lt;&#x2F;span&gt;&lt;span&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
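&lt;p&gt;As an optional cross-check, Python 3.8 and newer let the built-in &lt;code&gt;pow&lt;&#x2F;code&gt; take a negative exponent together with a modulus, which computes the same modular inverse without any helper code. A minimal sketch, with variable names chosen here to mirror the constants above:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# Cross-check of the modular inverses using three-argument pow (Python 3.8+).
M = 2**64
mul = 0x9ddfea08eb382d69
k0 = 0xc3a5c85c97cb3127

mul_inv = pow(mul, -1, M)
k0_inv = pow(k0, -1, M)

# Multiplying by the inverse must give 1 modulo 2**64.
assert mul * mul_inv % M == 1
assert k0 * k0_inv % M == 1
print(hex(mul_inv), hex(k0_inv))
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since the inverse modulo $2^{64}$ is unique, the printed values can be compared directly against the constants used in the rest of this section.&lt;&#x2F;p&gt;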
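&lt;p&gt;We also note that in this case the xorshift is easy to invert, as &lt;code&gt;x ^= x &amp;gt;&amp;gt; 47&lt;&#x2F;code&gt;
is simply its own inverse: the shift of 47 is more than half the 64-bit word, so it only
changes the low 17 bits, and those are derived from top bits that stay untouched.
Having all the components ready, we can invert the function step by step.&lt;&#x2F;p&gt;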
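&lt;p&gt;Both building blocks are easy to sanity-check. The following is a minimal Python sketch,
not part of the attack itself; the names &lt;code&gt;xorshift47&lt;&#x2F;code&gt; and &lt;code&gt;modinv64&lt;&#x2F;code&gt; are
made up for illustration, and the inverse is computed with the built-in three-argument
&lt;code&gt;pow&lt;&#x2F;code&gt; (Python 3.8 or newer) rather than the &lt;code&gt;modinv&lt;&#x2F;code&gt; function from earlier:&lt;&#x2F;p&gt;

```python
M64 = 2**64  # all arithmetic is modulo 2^64, like uint64_t


def xorshift47(x):
    # x >> 47 only lands in the low 17 bits, and it is built from the
    # top 17 bits, which this step never modifies. Applying the step a
    # second time therefore XORs the same value in again and cancels it.
    return x ^ (x >> 47)


def modinv64(x):
    # Multiplicative inverse modulo 2^64; it only exists for odd x.
    return pow(x, -1, M64)


mul = 0x9ddfea08eb382d69
mul_inv = modinv64(mul)

x = 0x0123456789abcdef
assert xorshift47(xorshift47(x)) == x            # xorshift undoes itself
assert (x * mul % M64) * mul_inv % M64 == x      # multiplication is undone
```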
&lt;p&gt;We first load &lt;code&gt;a&lt;&#x2F;code&gt;, &lt;code&gt;b&lt;&#x2F;code&gt; and &lt;code&gt;c&lt;&#x2F;code&gt; like in the hash function, and compute&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;uint64_t v = a + rotate(b ^ k3, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;) - c + len;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which is the second parameter to &lt;code&gt;hash_len_16&lt;&#x2F;code&gt;. Then, starting from our
desired return value of &lt;code&gt;hash_len_16(u, v)&lt;&#x2F;code&gt; we work backwards step by step, inverting
each operation to find the function argument &lt;code&gt;u&lt;&#x2F;code&gt; that would result in our target &lt;code&gt;hash&lt;&#x2F;code&gt;.
Once we have found this unique &lt;code&gt;u&lt;&#x2F;code&gt; we compute our required input &lt;code&gt;d&lt;&#x2F;code&gt;.
Putting it all together:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;static const uint64_t mul_inv = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xdc56e6f5090b32d9&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;static const uint64_t k0_inv  = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x81bc9c5aa9c72e97&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span&gt;cityhash64_preimage32(uint64_t hash, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char &lt;&#x2F;span&gt;&lt;span&gt;*s) {
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t len = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t a = loadword&amp;lt;uint64_t&amp;gt;(s) * k1;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t b = loadword&amp;lt;uint64_t&amp;gt;(s + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t c = loadword&amp;lt;uint64_t&amp;gt;(s + len - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;) * k2;
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t v = a + rotate(b ^ k3, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;) - c + len;
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Invert hash_len_16(u, v). Original operation inverted
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; at each step is shown on the right, note that it is in
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; the inverse order of hash_len_16.
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t y = hash;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; return y;
&lt;&#x2F;span&gt;&lt;span&gt;    y *= mul_inv;         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    y ^= y &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y ^= y &amp;gt;&amp;gt; 47;
&lt;&#x2F;span&gt;&lt;span&gt;    y *= mul_inv;         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t x = y ^ v;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; uint64_t y = v ^ x;
&lt;&#x2F;span&gt;&lt;span&gt;    x ^= x &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; x ^= x &amp;gt;&amp;gt; 47;
&lt;&#x2F;span&gt;&lt;span&gt;    x *= mul_inv;         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; x *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t u = x ^ v;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; uint64_t x = u ^ v;
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Find loadword&amp;lt;uint64_t&amp;gt;(s + len - 16).
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t d = u - rotate(a - b, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;43&lt;&#x2F;span&gt;&lt;span&gt;) - rotate(c, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    d *= k0_inv;
&lt;&#x2F;span&gt;&lt;span&gt;    std::memcpy(s + len - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;, &amp;amp;d, sizeof(d));
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The chance that a random &lt;code&gt;uint64_t&lt;&#x2F;code&gt; forms 8 printable ASCII bytes is
$\left(94&#x2F;256\right)^8 \approx 0.033%$. Not great, but &lt;code&gt;cityhash64_preimage32&lt;&#x2F;code&gt;
is so fast that having to repeat it on average ~3000 times to get a purely
ASCII result isn’t so bad.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the following 10 strings all hash to &lt;code&gt;1337&lt;&#x2F;code&gt; using CityHash64, generated
using &lt;a href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;orlp&#x2F;8debf0047e7735b43887aafb041c9a01&quot;&gt;this code&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;I’ve noticed there are variants of CityHash64 with subtle differences in the wild. I chose to
attack the variant shipped with &lt;code&gt;libc++&lt;&#x2F;code&gt;, so it should work for &lt;code&gt;std::hash&lt;&#x2F;code&gt; there, for example.
I also assume a little-endian machine throughout this article; your mileage
may vary on a big-endian machine depending on the hash implementation.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;orlp-cityhash64-D-:K5yx*zkgaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-TXb7;1j&amp;amp;btkaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-+&#x2F;LM$0 ;msnaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-u&amp;#39;f&amp;amp;&amp;gt;I&amp;#39;~mtnaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-pEEv.LyGcnpaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-v~~bm@,Vahtaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-RxHr_&amp;amp;~{miuaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-is_$34#&amp;gt;uavaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-$*~l\{S!zoyaaaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-cityhash64-W@^5|3^:gtcbaaaa
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
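&lt;p&gt;If you want to convince yourself that the inversion of &lt;code&gt;hash_len_16&lt;&#x2F;code&gt; is right
without compiling anything, here is a small Python round-trip sketch. The forward function
is reconstructed from the right-hand comments in the C++ listing above, so treat it as an
assumption about the &lt;code&gt;libc++&lt;&#x2F;code&gt; variant rather than a reference implementation;
&lt;code&gt;invert_hash_len_16&lt;&#x2F;code&gt; is a name I made up for this example.&lt;&#x2F;p&gt;

```python
M64 = 2**64
MUL = 0x9ddfea08eb382d69
MUL_INV = pow(MUL, -1, M64)  # listed above as 0xdc56e6f5090b32d9


def hash_len_16(u, v):
    # Forward direction, read off the comments of the C++ listing.
    x = u ^ v
    x = x * MUL % M64
    x ^= x >> 47
    y = v ^ x
    y = y * MUL % M64
    y ^= y >> 47
    y = y * MUL % M64
    return y


def invert_hash_len_16(target, v):
    # The same steps, undone in reverse order: find u for a known v.
    y = target
    y = y * MUL_INV % M64
    y ^= y >> 47
    y = y * MUL_INV % M64
    x = y ^ v
    x ^= x >> 47
    x = x * MUL_INV % M64
    return x ^ v


v = 0x1122334455667788          # arbitrary second argument
u = invert_hash_len_16(1337, v)
assert hash_len_16(u, v) == 1337
```

&lt;p&gt;Since every step is a bijection on 64-bit values, there is exactly one such
&lt;code&gt;u&lt;&#x2F;code&gt; for each choice of &lt;code&gt;v&lt;&#x2F;code&gt; and target.&lt;&#x2F;p&gt;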
&lt;h2 id=&quot;breaking-murmurhash2&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#breaking-murmurhash2&quot; aria-label=&quot;Anchor link for: breaking-murmurhash2&quot;&gt;Breaking MurmurHash2&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;We can’t let &lt;code&gt;libstdc++&lt;&#x2F;code&gt; get away after targeting &lt;code&gt;libc++&lt;&#x2F;code&gt;, can we?
The &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gcc-mirror&#x2F;gcc&#x2F;blob&#x2F;97a36b466ba1420210294f0a1dd7002054ba3b7e&#x2F;libstdc%2B%2B-v3&#x2F;include&#x2F;bits&#x2F;basic_string.h#L4402&quot;&gt;default string hash&lt;&#x2F;a&gt;
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gcc-mirror&#x2F;gcc&#x2F;blob&#x2F;97a36b466ba1420210294f0a1dd7002054ba3b7e&#x2F;libstdc%2B%2B-v3&#x2F;include&#x2F;bits&#x2F;functional_hash.h#L206&quot;&gt;calls&lt;&#x2F;a&gt;
an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gcc-mirror&#x2F;gcc&#x2F;blob&#x2F;97a36b466ba1420210294f0a1dd7002054ba3b7e&#x2F;libstdc%2B%2B-v3&#x2F;libsupc%2B%2B&#x2F;hash_bytes.cc#L138&quot;&gt;implementation of MurmurHash2&lt;&#x2F;a&gt;
with seed &lt;code&gt;0xc70f6907&lt;&#x2F;code&gt;. The hash—simplified to only handle strings whose
lengths are multiples of 8—is as follows:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;uint64_t murmurhash64a(const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt;* s, size_t len, uint64_t seed) {
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t mul = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xc6a4a7935bd1e995&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t hash = seed ^ (len * mul);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt;* p = s; p != s + len; p += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        uint64_t data = loadword&amp;lt;uint64_t&amp;gt;(p);
&lt;&#x2F;span&gt;&lt;span&gt;        data *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;        data ^= data &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        data *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;        hash ^= data;
&lt;&#x2F;span&gt;&lt;span&gt;        hash *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    hash ^= hash &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    hash *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    hash ^= hash &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; hash;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can take a similar approach here as before. We note that the modular
multiplicative inverse of &lt;code&gt;0xc6a4a7935bd1e995&lt;&#x2F;code&gt; mod $2^{64}$ is
&lt;code&gt;0x5f7a0ea7e59b19bd&lt;&#x2F;code&gt;. As an example, we can choose the first 24 bytes
arbitrarily, and solve for the last 8 bytes:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span&gt;murmurhash64a_preimage32(uint64_t hash, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt;* s, uint64_t seed) {
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t mul = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xc6a4a7935bd1e995&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint64_t mulinv = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x5f7a0ea7e59b19bd&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Compute the hash state for the first 24 bytes as normal.
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t state = seed ^ (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32 &lt;&#x2F;span&gt;&lt;span&gt;* mul);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt;* p = s; p != s + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span&gt;; p += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        uint64_t data = loadword&amp;lt;uint64_t&amp;gt;(p);
&lt;&#x2F;span&gt;&lt;span&gt;        data *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;        data ^= data &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        data *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;        state ^= data;
&lt;&#x2F;span&gt;&lt;span&gt;        state *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Invert target hash transformation.
&lt;&#x2F;span&gt;&lt;span&gt;                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; return hash;
&lt;&#x2F;span&gt;&lt;span&gt;    hash ^= hash &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; hash ^= hash &amp;gt;&amp;gt; 47;
&lt;&#x2F;span&gt;&lt;span&gt;    hash *= mulinv;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; hash *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    hash ^= hash &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; hash ^= hash &amp;gt;&amp;gt; 47;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Invert last iteration for last 8 bytes.
&lt;&#x2F;span&gt;&lt;span&gt;    hash *= mulinv;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; hash *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    uint64_t data = state ^ hash;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; hash = hash ^ data;
&lt;&#x2F;span&gt;&lt;span&gt;    data *= mulinv;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; data *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    data ^= data &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; data ^= data &amp;gt;&amp;gt; 47;
&lt;&#x2F;span&gt;&lt;span&gt;    data *= mulinv;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; data *= mul;
&lt;&#x2F;span&gt;&lt;span&gt;    std::memcpy(s + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span&gt;, &amp;amp;data, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; data = loadword&amp;lt;uint64_t&amp;gt;(p);
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The following 10 strings all hash to &lt;code&gt;1337&lt;&#x2F;code&gt; using MurmurHash64A with the
default seed &lt;code&gt;0xc70f6907&lt;&#x2F;code&gt;, generated using &lt;a href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;orlp&#x2F;59470263c1e2b05b035719f3121bcc45&quot;&gt;this code&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;orlp-murmurhash64-bhbaaat;SXtgVa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-bkiaaa&amp;amp;JInaNcZ
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-ewmaaa(%J+jw&amp;gt;j
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-vxpaaag&amp;quot;93\Yj5
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-ehuaaafa`Wp`&#x2F;|
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-yizaaa1x.zQF6r
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-lpzaaaZphp&amp;amp;c F
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-wsjbaa771rz{z&amp;lt;
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-rnkbaazy4X]p&amp;gt;B
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash64-aqnbaaZ~OzP_Tp
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
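&lt;p&gt;The same preimage computation can be sketched in a few lines of Python, which is handy
for checking the logic. This is a transliteration of the simplified hash and inversion
above under the same assumptions (lengths that are multiples of 8, little-endian loads);
the helper names &lt;code&gt;mix&lt;&#x2F;code&gt;, &lt;code&gt;unmix&lt;&#x2F;code&gt;, &lt;code&gt;absorb&lt;&#x2F;code&gt; and
&lt;code&gt;preimage32&lt;&#x2F;code&gt; are my own.&lt;&#x2F;p&gt;

```python
M64 = 2**64
MUL = 0xc6a4a7935bd1e995
MUL_INV = pow(MUL, -1, M64)  # listed above as 0x5f7a0ea7e59b19bd


def mix(data):
    # Per-word input transformation of the main loop.
    data = data * MUL % M64
    data ^= data >> 47
    return data * MUL % M64


def unmix(data):
    # Inverse of mix: the same steps with the inverse multiplier.
    data = data * MUL_INV % M64
    data ^= data >> 47
    return data * MUL_INV % M64


def absorb(state, chunk):
    # Run the main loop over a byte string whose length is a multiple of 8.
    for i in range(0, len(chunk), 8):
        word = int.from_bytes(chunk[i:i + 8], "little")
        state = (state ^ mix(word)) * MUL % M64
    return state


def murmurhash64a(s, seed):
    assert len(s) % 8 == 0
    h = absorb(seed ^ (len(s) * MUL % M64), s)
    h ^= h >> 47
    h = h * MUL % M64
    return h ^ (h >> 47)


def preimage32(target, prefix, seed):
    # Choose the first 24 bytes freely, then solve for the last 8.
    assert len(prefix) == 24
    state = absorb(seed ^ (32 * MUL % M64), prefix)
    h = target ^ (target >> 47)    # undo the finalizer...
    h = h * MUL_INV % M64
    h ^= h >> 47
    h = h * MUL_INV % M64          # ...and the last state *= mul
    word = unmix(state ^ h)
    return prefix + word.to_bytes(8, "little")


seed = 0xc70f6907
s = preimage32(1337, b"orlp-murmurhash64-abcdef", seed)
assert murmurhash64a(s, seed) == 1337
```

&lt;p&gt;The solved 8 bytes will usually not be printable, which is why the real generator
keeps trying different prefixes until they are.&lt;&#x2F;p&gt;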
&lt;h3 id=&quot;universal-collision-attack-on-murmurhash64a&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#universal-collision-attack-on-murmurhash64a&quot; aria-label=&quot;Anchor link for: universal-collision-attack-on-murmurhash64a&quot;&gt;Universal collision attack on MurmurHash64A&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;In fact, MurmurHash64A is so weak that Jean-Philippe Aumasson, Daniel J.
Bernstein and Martin Boßlet published &lt;a href=&quot;https:&#x2F;&#x2F;cr.yp.to&#x2F;talks&#x2F;2012.12.29&#x2F;slides.pdf&quot;&gt;an
attack&lt;&#x2F;a&gt; that creates sets of
strings which collide &lt;strong&gt;regardless of the random seed used&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;To be fair to CityHash64… just kidding they found &lt;a href=&quot;http:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20140731141732&#x2F;https:&#x2F;&#x2F;131002.net&#x2F;siphash&#x2F;citycollisions-20120730.tar.gz&quot;&gt;universal collisions&lt;&#x2F;a&gt;
against it as well, regardless of seed used.
CityHash64 is actually much easier to break in this way, as simply
doing the above pre-image attack targeting &lt;code&gt;0&lt;&#x2F;code&gt; as hash
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;cityhash&#x2F;blob&#x2F;f5dc54147fcce12cefd16548c8e760d68ac04226&#x2F;src&#x2F;city.cc#L410&quot;&gt;makes the output purely dependent on the seed&lt;&#x2F;a&gt;,
and thus a universal collision.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;To see how it works, let’s take a look at the core loop of MurmurHash64A:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;uint64_t data = loadword&amp;lt;uint64_t&amp;gt;(p);
&lt;&#x2F;span&gt;&lt;span&gt;data *= mul;          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Trivially invertible.
&lt;&#x2F;span&gt;&lt;span&gt;data ^= data &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Trivially invertible.
&lt;&#x2F;span&gt;&lt;span&gt;data *= mul;          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Trivially invertible.
&lt;&#x2F;span&gt;&lt;span&gt;state ^= data;
&lt;&#x2F;span&gt;&lt;span&gt;state *= mul;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We know we can trivially invert the operations done on &lt;code&gt;data&lt;&#x2F;code&gt; regardless of what the
current state is, so we might as well have had the following body:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;state ^= data;
&lt;&#x2F;span&gt;&lt;span&gt;state *= mul;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now the hash starts looking rather weak indeed. The clever trick they
employ is to create two strings simultaneously, such that they
differ precisely in the top bit of each 8-byte word. Why the top bit?&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;63
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;63&lt;&#x2F;span&gt;&lt;span&gt;) * mul % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since &lt;code&gt;mul&lt;&#x2F;code&gt; is odd, its least significant bit is set. Multiplying &lt;code&gt;1 &amp;lt;&amp;lt; 63&lt;&#x2F;code&gt; by
it is equivalent to shifting that bit 63 places to the left, which is once again
&lt;code&gt;1 &amp;lt;&amp;lt; 63&lt;&#x2F;code&gt;. That is, &lt;code&gt;1 &amp;lt;&amp;lt; 63&lt;&#x2F;code&gt; is a fixed point for the &lt;code&gt;state *= mul&lt;&#x2F;code&gt; operation.
We also note that for the top bit XOR is equivalent to addition, as the overflow
from addition is removed mod $2^{64}$.&lt;&#x2F;p&gt;
&lt;p&gt;So suppose we have two input strings, one starting with the 8 bytes &lt;code&gt;data&lt;&#x2F;code&gt;, and the
other starting with &lt;code&gt;data ^ (1 &amp;lt;&amp;lt; 63) == data + (1 &amp;lt;&amp;lt; 63)&lt;&#x2F;code&gt; (after doing the
trivial inversions). We then find that the two states, regardless of seed,
differ exactly in the top bit after &lt;code&gt;state ^= data&lt;&#x2F;code&gt;. After multiplication we
find we have two states &lt;code&gt;x * mul&lt;&#x2F;code&gt; and
&lt;code&gt;(x + (1 &amp;lt;&amp;lt; 63)) * mul == x * mul + (1 &amp;lt;&amp;lt; 63)&lt;&#x2F;code&gt;… which again differ
exactly in the top bit! We are now back at &lt;code&gt;state ^= data&lt;&#x2F;code&gt; in our iteration,
for the next 8 bytes. We can use this moment to
cancel our top bit difference, by again feeding two 8-byte strings that
differ in the top bit (after inverting).&lt;&#x2F;p&gt;
&lt;p&gt;In fact, we only have to find one pair of such strings that differ in the top
bit, which we can then use twice (in either order) to cancel our difference
again. Representing each 8-byte string as a &lt;code&gt;uint64_t&lt;&#x2F;code&gt;, if we choose the first string as &lt;code&gt;x&lt;&#x2F;code&gt; we
can derive the second string as&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;x *= mul;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Forward transformation...
&lt;&#x2F;span&gt;&lt;span&gt;x ^= x &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;x *= mul;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;x ^= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;ULL &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;63&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Difference in top bit.
&lt;&#x2F;span&gt;&lt;span&gt;x *= mulinv;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Backwards transformation...
&lt;&#x2F;span&gt;&lt;span&gt;x ^= x &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;47&lt;&#x2F;span&gt;&lt;span&gt;;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;x *= mulinv;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I was unable to find a printable ASCII string that has another printable
ASCII string as its partner. But I was able to find the following pair of 8-byte
UTF-8 strings that differ in exactly the top bit after the MurmurHash64A input
transformation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;xx0rlpx!
&lt;&#x2F;span&gt;&lt;span&gt;xxsXъВ
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Combining them as such gives two 16-byte strings that when fed through the hash
algorithm manipulate the state in the same way: a collision.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;xx0rlpx!xxsXъВ
&lt;&#x2F;span&gt;&lt;span&gt;xxsXъВxx0rlpx!
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But it doesn’t stop there. By concatenating $n$ of these 16-byte strings, picking
either one at each position, we can create $2^n$ different colliding strings each
$16n$ bytes long. With the current
&lt;code&gt;libstdc++&lt;&#x2F;code&gt; implementation the following prints the same number eight times:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;std::hash&amp;lt;std::u8string&amp;gt; h;
&lt;&#x2F;span&gt;&lt;span&gt;std::u8string a = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;xx0rlpx!xxsXъВ&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::u8string b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;xxsXъВxx0rlpx!&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(a + a + a) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(a + a + b) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(a + b + a) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(a + b + b) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(b + a + a) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(b + a + b) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(b + b + a) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; h(b + b + b) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even if &lt;code&gt;libstdc++&lt;&#x2F;code&gt; were to randomize the seed used by MurmurHash64A, the
strings would &lt;em&gt;still&lt;&#x2F;em&gt; collide.&lt;&#x2F;p&gt;
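&lt;p&gt;To make the seed independence concrete, here is a Python sketch of the whole construction.
It re-implements the simplified MurmurHash64A from above (multiples of 8 bytes,
little-endian), derives the partner of an arbitrary 8-byte word by flipping the top bit
after the input transformation, and checks that all $2^3$ combinations of the two 16-byte
blocks collide under several seeds. The helper names and the seed list are made up for
illustration.&lt;&#x2F;p&gt;

```python
from itertools import product

M64 = 2**64
MUL = 0xc6a4a7935bd1e995
MUL_INV = pow(MUL, -1, M64)
TOP = 2**63  # the top bit of a 64-bit word


def mix(data):
    data = data * MUL % M64
    data ^= data >> 47
    return data * MUL % M64


def unmix(data):
    data = data * MUL_INV % M64
    data ^= data >> 47
    return data * MUL_INV % M64


def murmurhash64a(s, seed):
    assert len(s) % 8 == 0
    h = seed ^ (len(s) * MUL % M64)
    for i in range(0, len(s), 8):
        word = int.from_bytes(s[i:i + 8], "little")
        h = (h ^ mix(word)) * MUL % M64
    h ^= h >> 47
    h = h * MUL % M64
    return h ^ (h >> 47)


def partner(word):
    # Forward transformation, flip the top bit, backward transformation.
    return unmix(mix(word) ^ TOP)


w1 = b"xx0rlpx!"
w2 = partner(int.from_bytes(w1, "little")).to_bytes(8, "little")
blocks = (w1 + w2, w2 + w1)  # two 16-byte blocks with the same effect

for seed in (0, 0xc70f6907, 0x0123456789abcdef, M64 - 1):
    hashes = {murmurhash64a(b"".join(combo), seed)
              for combo in product(blocks, repeat=3)}
    assert len(hashes) == 1  # all 8 strings collide for this seed
```

&lt;p&gt;Note that the partner computed here is raw bytes; finding a pair where both halves are
valid, readable text is the part that takes searching.&lt;&#x2F;p&gt;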
&lt;h2 id=&quot;breaking-murmurhash3&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#breaking-murmurhash3&quot; aria-label=&quot;Anchor link for: breaking-murmurhash3&quot;&gt;Breaking MurmurHash3&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Nim &lt;del&gt;uses&lt;&#x2F;del&gt;
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;nim-lang&#x2F;Nim&#x2F;blob&#x2F;46d2161c23c2aa1905571512b9a1ef7d61ae670e&#x2F;lib&#x2F;pure&#x2F;hashes.nim#L386&quot;&gt;used to use&lt;&#x2F;a&gt;
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;PeterScott&#x2F;murmur3&#x2F;blob&#x2F;master&#x2F;murmur3.c&quot;&gt;&lt;code&gt;MurmurHash3_x86_32&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
so let’s try to break that.&lt;&#x2F;p&gt;
&lt;aside&gt;
Nim switched to use farmhash by default while I was procrastinating finishing this article.
Please pretend that it still uses MurmurHash3 while reading this section, so my words
make sense. Then, afterwards, we&#x27;ll break farmhash too.
&lt;&#x2F;aside&gt;
&lt;p&gt;If we once again simplify to strings whose lengths are a multiple of 4 we get the following code:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;uint32_t rotl32(uint32_t x, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span&gt;r) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;(x &amp;lt;&amp;lt; r) | (x &amp;gt;&amp;gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32 &lt;&#x2F;span&gt;&lt;span&gt;- r));
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;uint32_t murmurhash3_x86_32(const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt;* s, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span&gt;len, uint32_t seed) {
&lt;&#x2F;span&gt;&lt;span&gt;    const uint32_t c1 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xcc9e2d51&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint32_t c2 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x1b873593&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint32_t c3 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x85ebca6b&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    const uint32_t c4 = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xc2b2ae35&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    uint32_t h = seed;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt;* p = s; p != s + len; p += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        uint32_t k = loadword&amp;lt;uint32_t&amp;gt;(p);
&lt;&#x2F;span&gt;&lt;span&gt;        k *= c1;
&lt;&#x2F;span&gt;&lt;span&gt;        k = rotl32(k, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        k *= c2;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        h ^= k;
&lt;&#x2F;span&gt;&lt;span&gt;        h = rotl32(h, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        h = h * &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5 &lt;&#x2F;span&gt;&lt;span&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xe6546b64&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    h ^= len;
&lt;&#x2F;span&gt;&lt;span&gt;    h ^= h &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    h *= c3;
&lt;&#x2F;span&gt;&lt;span&gt;    h ^= h &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    h *= c4;
&lt;&#x2F;span&gt;&lt;span&gt;    h ^= h &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; h;
&lt;&#x2F;span&gt;&lt;span&gt;} 
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I think by now you should be able to get this function to spit out any
value you want if you know the seed.
The inverse of
&lt;code&gt;rotl32(x, r)&lt;&#x2F;code&gt; is &lt;code&gt;rotl32(x, 32-r)&lt;&#x2F;code&gt; and the inverse of &lt;code&gt;h ^= h &amp;gt;&amp;gt; 16&lt;&#x2F;code&gt; is
once again just &lt;code&gt;h ^= h &amp;gt;&amp;gt; 16&lt;&#x2F;code&gt;. Only &lt;code&gt;h ^= h &amp;gt;&amp;gt; 13&lt;&#x2F;code&gt; is a bit different: it’s the first time
we’ve seen a xorshift whose inverse has more than one step:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;h ^= h &amp;gt;&amp;gt; 13
&lt;&#x2F;span&gt;&lt;span&gt;h ^= h &amp;gt;&amp;gt; 26
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
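&lt;p&gt;The two steps above are an instance of a general pattern, which can be sketched in Python as follows. The helper names are mine and purely illustrative. Each application of &lt;code&gt;y ^= y &amp;gt;&amp;gt; shift&lt;&#x2F;code&gt; cancels the nearest shifted copy of &lt;code&gt;h&lt;&#x2F;code&gt; and pushes the leftover term twice as far to the right, so we keep doubling the shift until it falls off the end of the word:&lt;&#x2F;p&gt;

```python
def xorshift_right(h, s):
    """The forward step: h ^= h >> s."""
    return h ^ (h >> s)


def invert_xorshift_right(y, s, bits=32):
    """Undo xorshift_right for a value that fits in `bits` bits."""
    shift = s
    while bits > shift:
        # If y == h ^ (h >> shift), then y ^ (y >> shift) == h ^ (h >> 2 * shift).
        y ^= y >> shift
        shift *= 2
    return y


samples = [0, 1, 1337, 0xdeadbeef, 0xffffffff]
for s in (13, 16):
    for h in samples:
        assert invert_xorshift_right(xorshift_right(h, s), s) == h
```

&lt;p&gt;For a shift of 13 the loop runs with shifts 13 and 26, and for a shift of 16 it runs exactly once, matching the inverses derived by hand.&lt;&#x2F;p&gt;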
&lt;p&gt;Compute the modular inverses
of &lt;code&gt;c1&lt;&#x2F;code&gt; through &lt;code&gt;c4&lt;&#x2F;code&gt; as well as &lt;code&gt;5&lt;&#x2F;code&gt; mod $2^{32}$, and go to town.
If you want to cheat or check your answer, you can check out &lt;a href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;orlp&#x2F;0c33157a0971053b60ac1da84b021bea&quot;&gt;the code&lt;&#x2F;a&gt;
I’ve used to generate the following ten strings that all hash to 1337 when
fed to &lt;code&gt;MurmurHash3_x86_32&lt;&#x2F;code&gt; with seed &lt;code&gt;0&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;orlp-murmurhash3_x86_32-haaaPa*+
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-saaaUW&amp;amp;&amp;lt;
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-ubaa&#x2F;!&#x2F;&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-weaare]]
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-chaa5@&#x2F;}
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-claaM[,5
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-fraaIx`N
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-iwaara&amp;amp;&amp;lt;
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-zwaa]&amp;gt;zd
&lt;&#x2F;span&gt;&lt;span&gt;orlp-murmurhash3_x86_32-zbbaW-5G
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
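&lt;p&gt;For readers who want to see the inversion end to end, here is a minimal Python sketch of the simplified function above together with its inverse. It is not the linked gist, and the names &lt;code&gt;absorb&lt;&#x2F;code&gt;, &lt;code&gt;unfinalize&lt;&#x2F;code&gt; and &lt;code&gt;forge_last_block&lt;&#x2F;code&gt; are my own. It assumes little-endian word loads, input lengths that are a multiple of 4, and Python 3.8 or newer for &lt;code&gt;pow(x, -1, m)&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```python
MOD = 2**32
C1, C2, C3, C4 = 0xcc9e2d51, 0x1b873593, 0x85ebca6b, 0xc2b2ae35
# Modular inverses exist because all five multipliers are odd.
C1_INV, C2_INV, C3_INV, C4_INV = (pow(c, -1, MOD) for c in (C1, C2, C3, C4))
FIVE_INV = pow(5, -1, MOD)


def rotl32(x, r):
    # x * 2**r is a left shift by r bits; the modulo keeps the low 32 bits.
    return ((x * 2**r) | (x >> (32 - r))) % MOD


def absorb(h, data):
    """Run the core loop over data, starting from state h."""
    for i in range(0, len(data), 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (rotl32((k * C1) % MOD, 15) * C2) % MOD
        h = rotl32(h ^ k, 13)
        h = (h * 5 + 0xe6546b64) % MOD
    return h


def finalize(h, length):
    h ^= length
    h ^= h >> 16
    h = (h * C3) % MOD
    h ^= h >> 13
    h = (h * C4) % MOD
    return h ^ (h >> 16)


def unfinalize(h, length):
    """Run the finalizer backwards, one inverted step at a time."""
    h ^= h >> 16
    h = (h * C4_INV) % MOD
    h ^= h >> 13
    h ^= h >> 26
    h = (h * C3_INV) % MOD
    h ^= h >> 16
    return h ^ length


def murmurhash3_x86_32(data, seed=0):
    return finalize(absorb(seed, data), len(data))


def forge_last_block(prefix, target, seed=0):
    """Return 4 bytes that make prefix + those bytes hash to target."""
    h_before = absorb(seed, prefix)
    h_after = unfinalize(target, len(prefix) + 4)
    # Undo h = rotl32(h ^ k, 13) * 5 + 0xe6546b64 and solve for k.
    mixed = rotl32(((h_after - 0xe6546b64) * FIVE_INV) % MOD, 32 - 13)
    k = mixed ^ h_before
    # Undo k *= c1; k = rotl32(k, 15); k *= c2 to recover the input word.
    word = (rotl32((k * C2_INV) % MOD, 32 - 15) * C1_INV) % MOD
    return word.to_bytes(4, "little")


prefix = b"orlp-murmurhash3_x86_32-demo"
tail = forge_last_block(prefix, 1337)
assert murmurhash3_x86_32(prefix + tail) == 1337
```

&lt;p&gt;The forged tail is four arbitrary bytes. Getting a printable suffix, as in the strings above, takes an extra search over the preceding characters until the solved-for block happens to be printable.&lt;&#x2F;p&gt;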
&lt;p&gt;Nim uses &lt;code&gt;0&lt;&#x2F;code&gt; as a fixed seed.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;You might wonder about the ethics of publishing functions for generating
arbitrary amounts of collisions for hash functions actually in use today. I did
consider holding back. But HashDoS has been a known attack for almost two
decades, and the universal hash collisions I’ve shown were also published
more than a decade ago. At some point you’ve had enough time
to, uh, fix your shit.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h3 id=&quot;universal-collision-attack-on-murmurhash3&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#universal-collision-attack-on-murmurhash3&quot; aria-label=&quot;Anchor link for: universal-collision-attack-on-murmurhash3&quot;&gt;Universal collision attack on MurmurHash3&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Suppose that Nim didn’t use &lt;code&gt;0&lt;&#x2F;code&gt; as a fixed seed, but chose a randomly generated
one. Can we mount an attack similar to the one on MurmurHash2 and still generate
universal multicollisions?&lt;&#x2F;p&gt;
&lt;p&gt;Yes we can. Let’s take another look at that core loop body:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;uint32_t k = loadword&amp;lt;uint32_t&amp;gt;(p);
&lt;&#x2F;span&gt;&lt;span&gt;k *= c1;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Trivially invertible.
&lt;&#x2F;span&gt;&lt;span&gt;k = rotl32(k, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span&gt;);  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Trivially invertible.
&lt;&#x2F;span&gt;&lt;span&gt;k *= c2;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Trivially invertible.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;h ^= k;
&lt;&#x2F;span&gt;&lt;span&gt;h = rotl32(h, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;h = h * &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5 &lt;&#x2F;span&gt;&lt;span&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xe6546b64&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Once again we can ignore the first three trivially invertible instructions, as we
can simply choose our input so that we get exactly the &lt;code&gt;k&lt;&#x2F;code&gt; we want.
Remember from last time that we want to introduce a difference in exactly the
top bit of &lt;code&gt;h&lt;&#x2F;code&gt;, as the multiplication will leave this difference in place.
But here there is a bit rotation between the XOR and the multiplication.
The solution? Simply place our bit difference such that &lt;code&gt;rotl32(h, 13)&lt;&#x2F;code&gt; shifts
it into the top position.&lt;&#x2F;p&gt;
&lt;p&gt;Does the addition of &lt;code&gt;0xe6546b64&lt;&#x2F;code&gt; mess things up? No. Since only the top bit
between the two states will be different, there is a difference of exactly
$2^{31}$ between the two states. This difference is maintained by the addition.
Since two 32-bit numbers with the same top bit can be at most $2^{31} - 1$
apart, we can conclude that the two states still differ in the top bit after
the addition.&lt;&#x2F;p&gt;
&lt;p&gt;So we want to find two pairs of 32-bit ints, such that after applying the first
three instructions the first pair differs in bit &lt;code&gt;1 &amp;lt;&amp;lt; (31 - 13) == 0x00040000&lt;&#x2F;code&gt;
and the second pair in bit &lt;code&gt;1 &amp;lt;&amp;lt; 31 == 0x80000000&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
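&lt;p&gt;The argument can be checked numerically before hunting for nice-looking strings. The sketch below is my own illustration: the two &lt;code&gt;k&lt;&#x2F;code&gt; values are arbitrary placeholders, and the resulting words are raw 32-bit integers instead of valid UTF-8. It builds two 8-byte inputs whose &lt;code&gt;k&lt;&#x2F;code&gt; values differ by exactly the two masks above and confirms that the internal state is identical afterwards for many random seeds:&lt;&#x2F;p&gt;

```python
import random

MOD = 2**32
C1, C2 = 0xcc9e2d51, 0x1b873593


def rotl32(x, r):
    return ((x * 2**r) | (x >> (32 - r))) % MOD


def word_for_k(k):
    """Invert the first three instructions: find the input word that yields k."""
    k = (k * pow(C2, -1, MOD)) % MOD
    k = rotl32(k, 32 - 15)
    return (k * pow(C1, -1, MOD)) % MOD


def absorb(h, words):
    """Run the core loop over a list of 32-bit input words."""
    for word in words:
        k = (rotl32((word * C1) % MOD, 15) * C2) % MOD
        h ^= k
        h = rotl32(h, 13)
        h = (h * 5 + 0xe6546b64) % MOD
    return h


# Placeholder k values; any choice works as long as the XOR differences match.
k1, k2 = 0x01234567, 0x89abcdef
a = [word_for_k(k1), word_for_k(k2)]
b = [word_for_k(k1 ^ 0x00040000), word_for_k(k2 ^ 0x80000000)]

rng = random.Random(2024)
for _ in range(1000):
    seed = rng.randrange(MOD)
    assert absorb(seed, a) == absorb(seed, b)
    assert absorb(seed, a + b) == absorb(seed, b + a) == absorb(seed, b + b)
```

&lt;p&gt;Since the state after the loop is the same and the lengths are equal, the finalizer receives identical inputs, so the final hashes collide as well.&lt;&#x2F;p&gt;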
&lt;p&gt;After some brute-force searching I found some cool pairs (again forced to use
UTF-8), which when combined give the following collision:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;nim&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-nim &quot;&gt;&lt;code class=&quot;language-nim&quot; data-lang=&quot;nim&quot;&gt;&lt;span&gt;a = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;!&amp;amp;orlpՓ&amp;quot;
&lt;&#x2F;span&gt;&lt;span&gt;b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;yǏglp$X&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As before, any concatenation of &lt;code&gt;a&lt;&#x2F;code&gt;s and &lt;code&gt;b&lt;&#x2F;code&gt;s of length &lt;code&gt;n&lt;&#x2F;code&gt; collides with all
other combinations of length &lt;code&gt;n&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;breaking-farmhash64&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#breaking-farmhash64&quot; aria-label=&quot;Anchor link for: breaking-farmhash64&quot;&gt;Breaking FarmHash64&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Nim switched to
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;nim-lang&#x2F;Nim&#x2F;blob&#x2F;46bb47a444bd377860d832fc1c62b262343f36a2&#x2F;lib&#x2F;pure&#x2F;hashes.nim#L537&quot;&gt;farmhash&lt;&#x2F;a&gt;
since I started writing this post. To break it we can notice that its structure
is very similar to CityHash64, so we can use those same techniques again. In
fact, the only changes between the two for lengths of 17-32 bytes are that a few
operators were changed from subtraction&#x2F;XOR to addition, a rotation operator had
its constant tweaked, and some &lt;code&gt;k&lt;&#x2F;code&gt; constants are used slightly differently. The
process of breaking it is entirely analogous, so we can
skip straight to &lt;a href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;orlp&#x2F;f0f3307530841183ddb72a0528ce0742&quot;&gt;the result&lt;&#x2F;a&gt;.
These 10 strings all hash to 1337 with FarmHash64:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;orlp-farmhash64-?VrJ@L7ytzwheaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-p3`!SQb}fmxheaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-pdt&amp;#39;cuI\gvxheaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-IbY`xAG&amp;amp;ibkieaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-[_LU!d1hwmkieaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-QiY!clz]bttieaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-&amp;amp;?J3rZ_8gsuieaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-LOBWtm5Szyuieaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-Mptaa^g^ytvieaaa
&lt;&#x2F;span&gt;&lt;span&gt;orlp-farmhash64-B?&amp;amp;l::hxqmfjeaaa
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;trivial-fixed-seed-wyhash-multicollisions&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#trivial-fixed-seed-wyhash-multicollisions&quot; aria-label=&quot;Anchor link for: trivial-fixed-seed-wyhash-multicollisions&quot;&gt;Trivial fixed-seed wyhash multicollisions&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Zig uses &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;wangyi-fudan&#x2F;wyhash&#x2F;blob&#x2F;master&#x2F;wyhash.h&quot;&gt;wyhash&lt;&#x2F;a&gt;
with a fixed seed of zero. While I was unable to do
seed-independent attacks against wyhash, using it with a fixed seed makes
generating collisions trivial. Wyhash is &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;wangyi-fudan&#x2F;wyhash&#x2F;blob&#x2F;46cebe9dc4e51f94d0dca287733bc5a94f76a10d&#x2F;wyhash.h#L46&quot;&gt;built upon&lt;&#x2F;a&gt;
the folded multiply, which takes two 64-bit inputs, multiplies them to a 128-bit product before XORing
together the two halves:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;uint64_t folded_multiply(uint64_t a, uint64_t b) {
&lt;&#x2F;span&gt;&lt;span&gt;    __uint128_t full = __uint128_t(a) * __uint128_t(b);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;uint64_t(full) ^ uint64_t(full &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;aside&gt;
For the record, this is not a knock against wyhash in general. It seems solid
if used with a secret randomized seed. My goal is only to illustrate that if
used with a fixed seed it trivially breaks down.
&lt;&#x2F;aside&gt;
&lt;p&gt;It’s easy to immediately see a critical flaw with this: if one of the two sides
is zero, the output will also always be zero. To protect against this, wyhash
always uses a folded multiply in the following form:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;out = folded_multiply(input_a ^ secret_a, input_b ^ secret_b);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;secret_a&lt;&#x2F;code&gt; and &lt;code&gt;secret_b&lt;&#x2F;code&gt; are determined by the seed, or outputs of
previous iterations which are influenced by the seed. However, when your seed is
constant… With &lt;a href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;orlp&#x2F;a9cc8dae3a74b1faaa0a642135ee81df&quot;&gt;a bit of creativity&lt;&#x2F;a&gt;
we can use the start of our string to prepare a ‘secret’ value which we can
perfectly cancel with another ASCII string later in the input.&lt;&#x2F;p&gt;
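&lt;p&gt;To make the weakness concrete, here is a small Python sketch of the folded multiply and its keyed form. The secrets are made-up placeholders, not wyhash’s real constants, and the sketch does not model how wyhash derives them from earlier input. It only shows that once an attacker can predict one secret, choosing the matching input forces the output to zero no matter what the other input is:&lt;&#x2F;p&gt;

```python
MOD64 = 2**64


def folded_multiply(a, b):
    full = a * b  # Python ints are unbounded, so this is the full 128-bit product.
    return (full % MOD64) ^ (full >> 64)


def keyed_step(input_a, input_b, secret_a, secret_b):
    return folded_multiply(input_a ^ secret_a, input_b ^ secret_b)


# Hypothetical secrets, standing in for values derived from a fixed seed.
SECRET_A, SECRET_B = 0x0123456789abcdef, 0xfedcba9876543210

# With input_a == SECRET_A one factor is zero, so input_b no longer matters.
outputs = {keyed_step(SECRET_A, input_b, SECRET_A, SECRET_B)
           for input_b in range(1000)}
assert outputs == {0}
```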
&lt;p&gt;So, without further ado, every 32-byte string of the form&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;orlp-wyhash-oGf_________tWJbzMJR
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;hashes to the same value with Zig’s default hasher.&lt;&#x2F;p&gt;
&lt;p&gt;Zig uses a different set of parameters than the defaults found in the wyhash
repository, so for good measure, this pattern provides arbitrary multicollisions
for the default parameters found in wyhash when using &lt;code&gt;seed == 0&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;orlp-wyhash-EUv_________NLXyytkp
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;We’ve seen that a lot of the hash functions in common use in hash tables today
are very weak, allowing fairly trivial attacks to produce arbitrary amounts of
collisions if not randomly initialized. Using a randomly seeded hash table is
paramount if you don’t wish to become a victim of a hash flooding attack.&lt;&#x2F;p&gt;
&lt;p&gt;We’ve also seen that some hash functions are vulnerable to attack &lt;em&gt;even if
randomly seeded&lt;&#x2F;em&gt;. These are completely broken and should not be used if attacks
are a concern at all. Luckily I was unable to find such attacks against most
hashes, but the possibility of such an attack existing is quite unnerving.&lt;&#x2F;p&gt;
&lt;p&gt;With &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Universal_hashing&quot;&gt;universal hashing&lt;&#x2F;a&gt; it’s
possible to construct hash functions for which such an attack is provably
impossible. Last year I published a hash function called
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;orlp&#x2F;polymur-hash&quot;&gt;polymur-hash&lt;&#x2F;a&gt; that has this property. Your
HTTPS connection to this website also likely uses a universal hash function
to authenticate the transferred data: both &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Poly1305&quot;&gt;Poly1305&lt;&#x2F;a&gt;
and &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Galois&#x2F;Counter_Mode&quot;&gt;GCM&lt;&#x2F;a&gt; are based on
universal hashing for their security proofs.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Well, such attacks are provably impossible against non-interactive attackers. Everything goes out of the
window again when an attacker is allowed to inspect the output hashes and use
them to try to guess your secret key.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;Of course, if your data is not user-controlled, or there is no reasonable
security model where your application would face attacks, you can get away with
faster and insecure hashes.&lt;&#x2F;p&gt;
&lt;p&gt;More to come on the subject of hashing and hash
tables and how it can go right or wrong, but for now this article is long enough as-is…&lt;&#x2F;p&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Taming Floating-Point Sums</title>
		<author>Orson R. L. Peters</author>
		<published>2024-05-25T00:00:00+00:00</published>
		<updated>2024-05-25T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/taming-float-sums/" type="text/html"/>
		<id>https://orlp.net/blog/taming-float-sums/</id>
		<content type="html">&lt;p&gt;Suppose you have an array of floating-point numbers, and wish to sum them.
You might naively think you can simply add them, e.g. in Rust:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;naive_sum(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut out = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; x in arr {
&lt;&#x2F;span&gt;&lt;span&gt;        out += *x;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    out
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This however can easily result in an arbitrarily large accumulated error. Let’s try it out:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;naive_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) =  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1000000.0
&lt;&#x2F;span&gt;&lt;span&gt;naive_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10000000.0
&lt;&#x2F;span&gt;&lt;span&gt;naive_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;100_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16777216.0
&lt;&#x2F;span&gt;&lt;span&gt;naive_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1_000_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16777216.0
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Uh-oh… What happened? When you compute $a + b$ the result must be rounded to
the nearest representable floating-point number, breaking ties towards the
number with an even mantissa. The problem is that the next 32-bit floating-point
number after &lt;code&gt;16777216&lt;&#x2F;code&gt; is &lt;code&gt;16777218&lt;&#x2F;code&gt;. In this case that means &lt;code&gt;16777216 + 1&lt;&#x2F;code&gt;
rounds back to &lt;code&gt;16777216&lt;&#x2F;code&gt; again. We’re stuck.&lt;&#x2F;p&gt;
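&lt;p&gt;You can reproduce this rounding without waiting for a hundred million additions. The Python sketch below uses a hypothetical helper &lt;code&gt;f32&lt;&#x2F;code&gt; that rounds a Python float to single precision by packing it with the &lt;code&gt;struct&lt;&#x2F;code&gt; module. It assumes the platform’s C &lt;code&gt;float&lt;&#x2F;code&gt; is an IEEE 754 binary32 with the default round-to-nearest-even mode:&lt;&#x2F;p&gt;

```python
import struct


def f32(x):
    """Round a Python float (an f64) to the nearest f32."""
    return struct.unpack("f", struct.pack("f", x))[0]


assert f32(16777216.0 + 1.0) == 16777216.0  # Tie, rounds to the even mantissa.
assert f32(16777216.0 + 2.0) == 16777218.0  # The next representable f32.
assert f32(16777218.0 + 1.0) == 16777220.0  # Tie again, this time rounding up.
```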
&lt;p&gt;Luckily, there are better ways to sum an array.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pairwise-summation&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#pairwise-summation&quot; aria-label=&quot;Anchor link for: pairwise-summation&quot;&gt;Pairwise summation&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;A method that’s a bit more clever is to use &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pairwise_summation&quot;&gt;pairwise summation&lt;&#x2F;a&gt;.
Instead of a completely linear sum with a single accumulator it recursively
sums an array by splitting the array in half, summing the halves, and then adding
the sums.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;pairwise_sum(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; arr.len() == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;; }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; arr.len() == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; arr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;]; }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(first, second) = arr.split_at(arr.len() &#x2F; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    pairwise_sum(first) + pairwise_sum(second)
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is more accurate:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;pairwise_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;;     &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) =    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1000000.0
&lt;&#x2F;span&gt;&lt;span&gt;pairwise_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) =   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10000000.0
&lt;&#x2F;span&gt;&lt;span&gt;pairwise_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;;   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;100_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) =  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;100000000.0
&lt;&#x2F;span&gt;&lt;span&gt;pairwise_sum(&amp;amp;vec![&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1_000_000_000&lt;&#x2F;span&gt;&lt;span&gt;]) = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1000000000.0
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, this is rather slow. To get a summation routine that goes as fast as
possible while still being reasonably accurate we should not recurse down
all the way to length-1 arrays, as this gives too much call overhead. We can
still use our naive sum for small sizes, and only recurse on large sizes.
This does make our worst-case error worse by a constant factor, but in turn
makes the pairwise sum almost as fast as a naive sum.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;By choosing the splitpoint as a multiple of 256 we ensure that the base case in
the recursion always has exactly 256 elements except on the very last block.
This makes sure we use the optimal reduction and always correctly predict
the loop condition. This small detail ended up improving the throughput by 40%
for large arrays!&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;block_pairwise_sum(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; arr.len() &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;256 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; split = (arr.len() &#x2F; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;).next_multiple_of(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;256&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(first, second) = arr.split_at(split);
&lt;&#x2F;span&gt;&lt;span&gt;        block_pairwise_sum(first) + block_pairwise_sum(second)
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        naive_sum(arr)
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;kahan-summation&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#kahan-summation&quot; aria-label=&quot;Anchor link for: kahan-summation&quot;&gt;Kahan summation&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The worst-case round-off error of naive summation scales with $O(n \epsilon)$
when summing $n$ elements, where $\epsilon$ is the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Machine_epsilon&quot;&gt;machine
epsilon&lt;&#x2F;a&gt; of your floating-point
type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Kahan_summation_algorithm&quot;&gt;Kahan
summation&lt;&#x2F;a&gt; improves this
further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely and leaving only
the $\epsilon^2$ term, which is negligible unless you sum a very large number of
values.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;All of these bounds scale with $\sum_i |x_i|$, so the worst-case absolute error
bound is still quadratic in terms of $n$ even for Kahan summation.&lt;&#x2F;p&gt;
&lt;p&gt;In practice all summation algorithms do significantly better than their
worst-case bounds, as in most scenarios the errors do not exclusively
round up or down, but cancel each other out on average.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;kahan_sum(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut sum = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut c = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; x in arr {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; y = *x - c;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; t = sum + y;
&lt;&#x2F;span&gt;&lt;span&gt;        c = (t - sum) - y;
&lt;&#x2F;span&gt;&lt;span&gt;        sum = t;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    sum
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Kahan summation works by maintaining the sum in two registers: the actual
bulk sum and a small error-correcting term $c$. If you were using infinitely
precise arithmetic $c$ would always be zero, but with floating-point it might
not be. The downside is that each number now takes four operations to add to the
sum instead of just one.&lt;&#x2F;p&gt;
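&lt;p&gt;A short trace shows the correction term at work. This is a Python sketch that emulates &lt;code&gt;f32&lt;&#x2F;code&gt; arithmetic by rounding after every operation with a hypothetical &lt;code&gt;f32&lt;&#x2F;code&gt; helper, under the same IEEE 754 assumption as the Rust code. Starting from &lt;code&gt;16777216&lt;&#x2F;code&gt;, where the naive sum gets stuck, Kahan summation still absorbs every &lt;code&gt;1.0&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```python
import struct


def f32(x):
    """Round a Python float (an f64) to the nearest f32."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    out = 0.0
    for x in values:
        out = f32(out + x)
    return out


def kahan_sum_f32(values):
    total, c = 0.0, 0.0
    for x in values:
        y = f32(x - c)               # Apply the pending correction to the input.
        t = f32(total + y)           # This addition may round away part of y.
        c = f32(f32(t - total) - y)  # Recover what was rounded away.
        total = t
    return total


values = [16777216.0, 1.0, 1.0, 1.0, 1.0]
assert naive_sum_f32(values) == 16777216.0
assert kahan_sum_f32(values) == 16777220.0
```

&lt;p&gt;After the first lost &lt;code&gt;1.0&lt;&#x2F;code&gt; the correction term becomes &lt;code&gt;-1.0&lt;&#x2F;code&gt;, so the next input is effectively added as &lt;code&gt;2.0&lt;&#x2F;code&gt;, which is representable at this magnitude.&lt;&#x2F;p&gt;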
&lt;p&gt;To mitigate this we can do something similar to what we did with the pairwise
summation. We can first accumulate blocks into sums naively before combining the
block sums with Kahan summation to reduce overhead at the cost of accuracy:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;block_kahan_sum(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut sum = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut c = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; chunk in arr.chunks(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;256&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; x = naive_sum(chunk);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; y = x - c;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; t = sum + y;
&lt;&#x2F;span&gt;&lt;span&gt;        c = (t - sum) - y;
&lt;&#x2F;span&gt;&lt;span&gt;        sum = t;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    sum
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;exact-summation&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#exact-summation&quot; aria-label=&quot;Anchor link for: exact-summation&quot;&gt;Exact summation&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;I know of at least two general methods to produce the &lt;em&gt;correctly-rounded&lt;&#x2F;em&gt; sum of a sequence
of floating-point numbers. That is, they logically compute the sum with
infinite precision before rounding it back to a floating-point value at the end.&lt;&#x2F;p&gt;
&lt;p&gt;The first method is based on the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;2Sum&quot;&gt;2Sum&lt;&#x2F;a&gt;
primitive which is an error-free transform from two numbers $x, y$ to $s, t$
such that $x + y = s + t$, where $t$ is a small error. By applying this
repeatedly until the errors vanish you can get a correctly-rounded sum.
Keeping track of what to add in what order can be tricky, and the worst-case
requires $O(n^2)$ additions to make all the terms vanish. This is what’s
implemented in Python’s &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;python&#x2F;cpython&#x2F;blob&#x2F;de19694cfbcaa1c85c3a4b7184a24ff21b1c0919&#x2F;Modules&#x2F;mathmodule.c#L1321&quot;&gt;&lt;code&gt;math.fsum&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
and in the Rust crate &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;fsum&#x2F;latest&#x2F;fsum&#x2F;&quot;&gt;&lt;code&gt;fsum&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; which use
extra memory to keep the partial sums around. The &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;accurate&#x2F;latest&#x2F;accurate&#x2F;&quot;&gt;&lt;code&gt;accurate&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; crate also
implements this using in-place mutation in &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;accurate&#x2F;latest&#x2F;accurate&#x2F;sum&#x2F;fn.i_fast_sum_in_place.html&quot;&gt;&lt;code&gt;i_fast_sum_in_place&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
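&lt;p&gt;As a rough sketch of the primitive itself, the following uses the branch-free formulation of 2Sum usually attributed to Knuth. The function name &lt;code&gt;two_sum&lt;&#x2F;code&gt; and the variable names are assumptions for illustration; the linked crates may well structure this differently.&lt;&#x2F;p&gt;

```rust
// Branch-free 2Sum sketch: returns (s, t) where s is the rounded sum
// and t is the rounding error, so that x + y == s + t exactly
// (assuming no overflow occurs).
fn two_sum(x: f64, y: f64) -> (f64, f64) {
    let s = x + y;
    let y_virtual = s - x; // the part of y that made it into s
    let x_virtual = s - y_virtual; // the part of x that made it into s
    let t = (x - x_virtual) + (y - y_virtual); // what rounding discarded
    (s, t)
}
```

&lt;p&gt;For example, &lt;code&gt;two_sum(1e16, 1.0)&lt;&#x2F;code&gt; gives &lt;code&gt;(1e16, 1.0)&lt;&#x2F;code&gt;: the 1.0 is too small to survive in the rounded sum, so it comes back whole as the error term.&lt;&#x2F;p&gt;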
&lt;p&gt;Another method is to keep a large buffer of integers around, one per exponent.
Then when adding a floating-point number you decompose it into an exponent
and mantissa, and add the mantissa to the corresponding integer in the buffer.
If the integer &lt;code&gt;buf[i]&lt;&#x2F;code&gt; overflows you increment the integer in &lt;code&gt;buf[i + w]&lt;&#x2F;code&gt;,
where &lt;code&gt;w&lt;&#x2F;code&gt; is the width of your integer.&lt;&#x2F;p&gt;
&lt;p&gt;This can actually compute a completely exact sum, without any rounding at all,
and is effectively just an overly permissive representation of a fixed-point
number optimized for accumulating floats. This latter method is $O(n)$ time, but
uses a large but constant amount of memory ($\approx$ 1 KB for &lt;code&gt;f32&lt;&#x2F;code&gt;, $\approx$
16 KB for &lt;code&gt;f64&lt;&#x2F;code&gt;). An advantage of this method is that it’s also an online
algorithm - both adding a number to the sum and getting the current total are
amortized $O(1)$.&lt;&#x2F;p&gt;
&lt;p&gt;A variant of this method is implemented in the &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;accurate&#x2F;latest&#x2F;accurate&#x2F;&quot;&gt;&lt;code&gt;accurate&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
crate
as &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;accurate&#x2F;latest&#x2F;accurate&#x2F;sum&#x2F;struct.OnlineExactSum.html&quot;&gt;&lt;code&gt;OnlineExactSum&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which uses floats instead of integers for the buffer.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;unleashing-the-compiler&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#unleashing-the-compiler&quot; aria-label=&quot;Anchor link for: unleashing-the-compiler&quot;&gt;Unleashing the compiler&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Besides accuracy, there is another problem with &lt;code&gt;naive_sum&lt;&#x2F;code&gt;. The Rust compiler
is not allowed to reorder floating-point additions, because floating-point
addition is not associative. So it cannot autovectorize the &lt;code&gt;naive_sum&lt;&#x2F;code&gt; to use
SIMD instructions to compute the sum, nor use instruction-level parallelism.&lt;&#x2F;p&gt;
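&lt;p&gt;A tiny sketch of why reordering is observable: the helper &lt;code&gt;both_groupings&lt;&#x2F;code&gt; below is a hypothetical name introduced only for this illustration. It evaluates the same three-term sum with both groupings, which would be identical over the real numbers.&lt;&#x2F;p&gt;

```rust
// Returns ((a + b) + c, a + (b + c)). Over the reals these are equal;
// with floating-point rounding they need not be.
fn both_groupings(a: f32, b: f32, c: f32) -> (f32, f32) {
    ((a + b) + c, a + (b + c))
}
```

&lt;p&gt;Calling &lt;code&gt;both_groupings(16777216.0, 1.0, 1.0)&lt;&#x2F;code&gt; returns &lt;code&gt;(16777216.0, 16777218.0)&lt;&#x2F;code&gt;: added one at a time, each 1.0 is rounded away, but added together first they survive. A compiler that silently regrouped the additions would therefore change the result.&lt;&#x2F;p&gt;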
&lt;p&gt;To solve this there are compiler intrinsics in Rust that do float sums while
allowing associativity, such as
&lt;a href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;nightly&#x2F;std&#x2F;intrinsics&#x2F;fn.fadd_fast.html&quot;&gt;&lt;code&gt;std::intrinsics::fadd_fast&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.
However, these intrinsics are &lt;em&gt;incredibly dangerous&lt;&#x2F;em&gt;, as they assume that both
the input and output are finite numbers (no infinities, no NaNs); anything else
is undefined behavior. This functionally makes them unusable, as only in
the most restricted scenarios when computing a sum do you know that all inputs
are finite numbers, and that their sum cannot overflow.&lt;&#x2F;p&gt;
&lt;p&gt;I recently uttered my annoyance
with these operators to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;saethlin&quot;&gt;Ben Kimock&lt;&#x2F;a&gt;, and together
we proposed (and he implemented) a new set of operators:
&lt;a href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;nightly&#x2F;std&#x2F;intrinsics&#x2F;fn.fadd_algebraic.html&quot;&gt;&lt;code&gt;std::intrinsics::fadd_algebraic&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
and &lt;a href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;nightly&#x2F;std&#x2F;intrinsics&#x2F;fn.fadd_algebraic.html?search=_algebraic&quot;&gt;friends&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I proposed we call the operators &lt;em&gt;algebraic&lt;&#x2F;em&gt;, as they allow (in theory) any
transformation that is justified by real algebra. For example, substituting
${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3.}$
In general these operators are treated as-if they are done using real numbers,
and can map to any set of floating-point instructions that would be equivalent
to the original expression, assuming the floating-point instructions would be
exact.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Note that the real numbers do not contain NaNs or infinities, so these operators
assume those do not exist for the validity of transformations. However, it is not
undefined behavior when you do encounter those values.&lt;&#x2F;p&gt;
&lt;p&gt;They also allow &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Multiply%E2%80%93accumulate_operation&quot;&gt;fused multiply-add&lt;&#x2F;a&gt;
instructions to be generated, as under real arithmetic $\operatorname{fma}(a, b, c) = ab + c.$&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;Using those new intrinsics it is trivial to generate an autovectorized sum:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;#![allow(internal_features)]
&lt;&#x2F;span&gt;&lt;span&gt;#![feature(core_intrinsics)]
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;use &lt;&#x2F;span&gt;&lt;span&gt;std::intrinsics::fadd_algebraic;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;naive_sum_autovec(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut out = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; x in arr {
&lt;&#x2F;span&gt;&lt;span&gt;        out = fadd_algebraic(out, *x);
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    out
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we compile with &lt;code&gt;-C target-cpu=broadwell&lt;&#x2F;code&gt; we see that the compiler
automatically generated the following tight loop for us, using 4 accumulators
and AVX2 instructions:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;.LBB0_5:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;vaddps  &lt;&#x2F;span&gt;&lt;span&gt;ymm0, ymm0, ymmword ptr [rdi + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;*r8]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;vaddps  &lt;&#x2F;span&gt;&lt;span&gt;ymm1, ymm1, ymmword ptr [rdi + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;*r8 + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;vaddps  &lt;&#x2F;span&gt;&lt;span&gt;ymm2, ymm2, ymmword ptr [rdi + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;*r8 + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;vaddps  &lt;&#x2F;span&gt;&lt;span&gt;ymm3, ymm3, ymmword ptr [rdi + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;*r8 + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;96&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;add     &lt;&#x2F;span&gt;&lt;span&gt;r8, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmp     &lt;&#x2F;span&gt;&lt;span&gt;rdx, r8
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;jne     &lt;&#x2F;span&gt;&lt;span&gt;.LBB0_5
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This will process 128 bytes of floating-point data (so 32 elements) in 7
instructions. Additionally, all the &lt;code&gt;vaddps&lt;&#x2F;code&gt; instructions are independent of
each other as they accumulate to different registers. If we analyze this with
&lt;a href=&quot;https:&#x2F;&#x2F;uica.uops.info&#x2F;&quot;&gt;uiCA&lt;&#x2F;a&gt; we see that it estimates the above loop to take
4 cycles to complete, processing 32 bytes &#x2F; cycle. At 4GHz that’s up to 128GB&#x2F;s!
Note that that’s way above what my machine’s RAM bandwidth is, so you will only
achieve that speed when summing data that is already in cache.&lt;&#x2F;p&gt;
&lt;p&gt;With this in mind we can also easily define &lt;code&gt;block_pairwise_sum_autovec&lt;&#x2F;code&gt; and
&lt;code&gt;block_kahan_sum_autovec&lt;&#x2F;code&gt; by replacing their calls to &lt;code&gt;naive_sum&lt;&#x2F;code&gt; with
&lt;code&gt;naive_sum_autovec&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;accuracy-and-speed&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#accuracy-and-speed&quot; aria-label=&quot;Anchor link for: accuracy-and-speed&quot;&gt;Accuracy and speed&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s take a look at how the different summation methods compare. As a
relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from
-100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on
my AMD Threadripper 2950x.&lt;&#x2F;p&gt;
&lt;p&gt;All the code is available on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;orlp&#x2F;sum-bench&quot;&gt;GitHub&lt;&#x2F;a&gt;.
Compiled with &lt;code&gt;RUSTFLAGS=-C target-cpu=native&lt;&#x2F;code&gt; and &lt;code&gt;--release&lt;&#x2F;code&gt; I get the
following results:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Algorithm&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: right&quot;&gt;Throughput&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: right&quot;&gt;Mean absolute error&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;naive&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;5.5 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;71.796&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;pairwise&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;0.9 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.5528&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;kahan&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.4 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;0.2229&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;block_pairwise&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;5.8 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;3.8597&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;block_kahan&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;5.9 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;4.2184&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;naive_autovec&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;118.6 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;14.538&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;block_pairwise_autovec&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;71.7 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.6132&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;block_kahan_autovec&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;98.0 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.2306&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;crate_accurate_buffer&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.1 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;0.0015&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;crate_accurate_inplace&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.9 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;0.0015&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;crate_fsum&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.2 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;0.0000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;aside&gt;
&lt;p&gt;The &lt;code&gt;accurate&lt;&#x2F;code&gt; crate has a non-zero absolute error because it
currently &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bsteinb&#x2F;accurate&#x2F;issues&#x2F;5&quot;&gt;does not implement rounding to nearest&lt;&#x2F;a&gt;
correctly, so it can be off by one unit in the last place for the final result.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;First I’d like to note that there’s &lt;strong&gt;more than a 100x&lt;&#x2F;strong&gt; performance difference
between the fastest and slowest method. For summing an array! Now this might not
be entirely fair as the slowest methods are computing something significantly
harder, but there’s still a 20x performance difference between a seemingly
reasonable naive implementation and the fastest one.&lt;&#x2F;p&gt;
&lt;p&gt;We find that in general the &lt;code&gt;_autovec&lt;&#x2F;code&gt; methods that use &lt;code&gt;fadd_algebraic&lt;&#x2F;code&gt; are
faster &lt;em&gt;and&lt;&#x2F;em&gt; more accurate than the ones using regular floating-point addition.
The reason they’re more accurate as well is the same reason a pairwise sum is
more accurate: any reordering of the additions is better, as the default
long-chain-of-additions is already the worst case for accuracy in a sum.&lt;&#x2F;p&gt;
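&lt;p&gt;The effect can be seen on a small scale with the sketch below. The functions &lt;code&gt;chain_sum&lt;&#x2F;code&gt; and &lt;code&gt;lane_sum&lt;&#x2F;code&gt; are hypothetical names, and the four scalar accumulators only loosely imitate what a vectorized loop does with its SIMD lanes; it shows a single hand-picked input, not a general guarantee.&lt;&#x2F;p&gt;

```rust
// Plain left-to-right sum: one long dependency chain.
fn chain_sum(arr: [f32; 8]) -> f32 {
    let mut sum = 0.0;
    for x in arr {
        sum += x;
    }
    sum
}

// The same additions regrouped into four independent accumulators,
// which are combined pairwise at the end.
fn lane_sum(arr: [f32; 8]) -> f32 {
    let mut acc = [0.0f32; 4];
    for chunk in arr.chunks_exact(4) {
        for lane in 0..4 {
            acc[lane] += chunk[lane];
        }
    }
    (acc[0] + acc[1]) + (acc[2] + acc[3])
}
```

&lt;p&gt;For the input &lt;code&gt;[16777216.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]&lt;&#x2F;code&gt; the exact sum is 16777223. &lt;code&gt;chain_sum&lt;&#x2F;code&gt; returns 16777216, losing every 1.0 to rounding, whereas &lt;code&gt;lane_sum&lt;&#x2F;code&gt; returns 16777222 because most of the small values are accumulated together before they meet the large one.&lt;&#x2F;p&gt;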
&lt;p&gt;Limiting ourselves to &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pareto_efficiency&quot;&gt;Pareto-optimal&lt;&#x2F;a&gt; choices
we get the following four implementations:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Algorithm&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: right&quot;&gt;Throughput&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: right&quot;&gt;Mean absolute error&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;naive_autovec&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;118.6 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;14.538&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;block_kahan_autovec&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;98.0 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.2306&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;crate_accurate_inplace&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.9 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;0.0015&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;crate_fsum&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.2 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;0.0000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Note that implementation differences can be quite impactful, and there are
likely dozens more methods of compensated summing I did not compare here.&lt;&#x2F;p&gt;
&lt;p&gt;For most cases I think &lt;code&gt;block_kahan_autovec&lt;&#x2F;code&gt; wins here, having good accuracy
(that doesn’t degenerate with larger inputs) at nearly the maximum speed. For
most applications the extra accuracy from the correctly-rounded sums is
unnecessary, and they are 50-100x slower.&lt;&#x2F;p&gt;
&lt;p&gt;By splitting the loop up into an explicit remainder plus a tight loop of
256-element sums we can squeeze out a bit more performance, and avoid a couple
floating-point ops for the last chunk:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;#![allow(internal_features)]
&lt;&#x2F;span&gt;&lt;span&gt;#![feature(core_intrinsics)]
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;use &lt;&#x2F;span&gt;&lt;span&gt;std::intrinsics::fadd_algebraic;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;sum_block(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    arr.iter().fold(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;, |x, y| fadd_algebraic(x, *y))
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;sum_orlp(arr: &amp;amp;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;]) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut chunks = arr.chunks_exact(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;256&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut sum = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut c = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; chunk in &amp;amp;mut chunks {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; y = sum_block(chunk) - c;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; t = sum + y;
&lt;&#x2F;span&gt;&lt;span&gt;        c = (t - sum) - y;
&lt;&#x2F;span&gt;&lt;span&gt;        sum = t;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    sum + (sum_block(chunks.remainder()) - c)
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Algorithm&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: right&quot;&gt;Throughput&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: right&quot;&gt;Mean absolute error&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;sum_orlp&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;112.2 GB&#x2F;s&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: right&quot;&gt;1.2306&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;You can of course tweak the number 256; I found that using 128 was $\approx$ 20%
slower, and that 512 didn’t really improve performance but did cost accuracy.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;I think the &lt;code&gt;fadd_algebraic&lt;&#x2F;code&gt; and similar algebraic intrinsics are very useful
for achieving high-speed floating-point routines, and that other languages
should add them as well. A global &lt;code&gt;-ffast-math&lt;&#x2F;code&gt; is not good enough: as we’ve
seen above, the best implementation was a hybrid between automatically optimized
math for speed, and manually implemented non-associative compensated operations.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, if you are using LLVM, beware of &lt;code&gt;-ffast-math&lt;&#x2F;code&gt;. It is &lt;strong&gt;undefined
behavior&lt;&#x2F;strong&gt; to produce a NaN or infinity while that flag is set in LLVM. I have
no idea why they chose this hardcore stance which makes virtually every program
that uses it unsound. If you are targeting LLVM with your language, avoid the
&lt;code&gt;nnan&lt;&#x2F;code&gt; and &lt;code&gt;ninf&lt;&#x2F;code&gt; &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;LangRef.html#fastmath&quot;&gt;fast-math flags&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Extracting and Depositing Bits</title>
		<author>Orson R. L. Peters</author>
		<published>2024-01-13T00:00:00+00:00</published>
		<updated>2024-01-13T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/extracting-depositing-bits/" type="text/html"/>
		<id>https://orlp.net/blog/extracting-depositing-bits/</id>
		<content type="html">&lt;p&gt;Suppose you have a 64-bit word and wish to extract a couple bits from it.
For example you just performed a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;SWAR&quot;&gt;SWAR&lt;&#x2F;a&gt;
algorithm and wish to extract the least significant bit of each byte in the &lt;code&gt;u64&lt;&#x2F;code&gt;.
This is simple enough; just perform a bitwise AND with a mask of
the bits you wish to keep:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; out = word &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x0101010101010101&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, this still leaves the bits of interest spread throughout the 64-bit
word. What if we also want to compress the 8 bits we wish to extract into a
single byte? Or what if we want the inverse, spreading the 8 bits of a byte
among the least significant bits of each byte in a 64-bit word?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;extracting-depositing-bits&#x2F;bit-extract.png&quot; alt=&quot;Diagram showing bit extraction to a contiguous output.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pext-and-pdep&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#pext-and-pdep&quot; aria-label=&quot;Anchor link for: pext-and-pdep&quot;&gt;&lt;code&gt;PEXT&lt;&#x2F;code&gt; and &lt;code&gt;PDEP&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;If you are using a modern x86-64 CPU, you are in luck. In the much underrated
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;X86_Bit_manipulation_instruction_set&quot;&gt;BMI instruction
set&lt;&#x2F;a&gt; (specifically its BMI2 extension) there
are two very powerful instructions: &lt;code&gt;PDEP&lt;&#x2F;code&gt; and &lt;code&gt;PEXT&lt;&#x2F;code&gt;. They are inverses of each
other: &lt;code&gt;PEXT&lt;&#x2F;code&gt; &lt;em&gt;extracts&lt;&#x2F;em&gt; bits, &lt;code&gt;PDEP&lt;&#x2F;code&gt; &lt;em&gt;deposits&lt;&#x2F;em&gt; bits.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;PEXT&lt;&#x2F;code&gt; takes in a word and a
mask, takes just those bits from the word where the mask has a 1 bit, and
compresses all selected bits to a contiguous output word. Simulated in Rust
this would be:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;pext64(word: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64&lt;&#x2F;span&gt;&lt;span&gt;, mask: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut out = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut out_idx = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; i in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; ith_mask_bit = (mask &amp;gt;&amp;gt; i) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; ith_word_bit = (word &amp;gt;&amp;gt; i) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; ith_mask_bit == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            out |= ith_word_bit &amp;lt;&amp;lt; out_idx;
&lt;&#x2F;span&gt;&lt;span&gt;            out_idx += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        }
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    out
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example if you had the bitstring &lt;code&gt;abcdefgh&lt;&#x2F;code&gt; and mask &lt;code&gt;10110001&lt;&#x2F;code&gt; you would
get output bitstring &lt;code&gt;0000acdh&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;PDEP&lt;&#x2F;code&gt; is exactly its inverse: it takes contiguous data bits as a word, and
a mask, and deposits the data bits one-by-one (starting at the least significant
bits) into those bits where the mask
has a 1 bit, leaving the rest as zeros:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;pdep64(word: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64&lt;&#x2F;span&gt;&lt;span&gt;, mask: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut out = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;mut input_idx = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span&gt; i in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; ith_mask_bit = (mask &amp;gt;&amp;gt; i) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; ith_mask_bit == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; next_word_bit = (word &amp;gt;&amp;gt; input_idx) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;            out |= next_word_bit &amp;lt;&amp;lt; i;
&lt;&#x2F;span&gt;&lt;span&gt;            input_idx += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        }
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    out
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So if you had the bitstring &lt;code&gt;abcdefgh&lt;&#x2F;code&gt; and mask &lt;code&gt;10100110&lt;&#x2F;code&gt; you would get output
&lt;code&gt;e0f00gh0&lt;&#x2F;code&gt; (recall that we traditionally write bitstrings with the least
significant bit on the right).&lt;&#x2F;p&gt;
&lt;p&gt;These instructions are incredibly powerful and flexible, and the amazing thing
is that modern Intel and AMD CPUs can execute one of them every cycle (with a
latency of a few cycles)! However, they are not available in other instruction sets, so whenever you
use them you will also likely need to write a cross-platform alternative.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Unfortunately, both &lt;code&gt;PDEP&lt;&#x2F;code&gt; and &lt;code&gt;PEXT&lt;&#x2F;code&gt; are &lt;a href=&quot;https:&#x2F;&#x2F;uops.info&#x2F;html-instr&#x2F;PDEP_R64_R64_M64.html#ZEN2&quot;&gt;very slow&lt;&#x2F;a&gt;
on AMD Zen and Zen2. They are implemented in microcode, which is really unfortunate.
The platform advertises through &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;CPUID&quot;&gt;CPUID&lt;&#x2F;a&gt; that
the instructions are supported, but they’re almost unusably slow.
Use with caution.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;extracting-bits-with-multiplication&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#extracting-bits-with-multiplication&quot; aria-label=&quot;Anchor link for: extracting-bits-with-multiplication&quot;&gt;Extracting bits with multiplication&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;While the following technique can’t replace all &lt;code&gt;PEXT&lt;&#x2F;code&gt; cases, it can be quite
general. It is applicable when:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The bit pattern you want to extract is static and known in advance.&lt;&#x2F;li&gt;
&lt;li&gt;If you want to extract $k$ bits, there must be a gap of at least $k-1$ bits
between any two bits of interest.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;We compute the bit extraction by adding together many left-shifted copies of
our input word, such that we construct our desired bit pattern in the uppermost
bits. The trick is to then realize that &lt;code&gt;w &amp;lt;&amp;lt; i&lt;&#x2F;code&gt; is equivalent to &lt;code&gt;w * (1 &amp;lt;&amp;lt; i)&lt;&#x2F;code&gt;
and thus the sum of many left-shifted copies is equivalent to a single
multiplication by &lt;code&gt;(1 &amp;lt;&amp;lt; i) + (1 &amp;lt;&amp;lt; j) + ...&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think the technique is best understood by visual example. Let’s use our
example from earlier, extracting the least significant bit of each byte in a
64-bit word. We start off by masking off just those bits. After that we shift
the most significant bit of interest to the topmost bit of the word to get our
first shifted copy. We then repeat this, shifting the second most significant
bit of interest to the second topmost bit, etc. We sum all these shifted copies.
This results in the following (using underscores instead of zeros for clarity):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;mask    = _______1_______1_______1_______1_______1_______1_______1_______1
&lt;&#x2F;span&gt;&lt;span&gt;t       = w &amp;amp; mask
&lt;&#x2F;span&gt;&lt;span&gt;t       = _______a_______b_______c_______d_______e_______f_______g_______h
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 7  = a_______b_______c_______d_______e_______f_______g_______h_______
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 14 = _b_______c_______d_______e_______f_______g_______h______________
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 21 = __c_______d_______e_______f_______g_______h_____________________
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 28 = ___d_______e_______f_______g_______h____________________________
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 35 = ____e_______f_______g_______h___________________________________
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 42 = _____f_______g_______h__________________________________________
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 49 = ______g_______h_________________________________________________
&lt;&#x2F;span&gt;&lt;span&gt;t &amp;lt;&amp;lt; 56 = _______h________________________________________________________
&lt;&#x2F;span&gt;&lt;span&gt;    sum = abcdefghbcdefgh_cdefgh__defgh___efgh____fgh_____gh______h_______
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note how we constructed &lt;code&gt;abcdefgh&lt;&#x2F;code&gt; in the topmost 8 bits, which we can then
extract using a single right-shift by $64 - 8 = 56$ bits. Since
&lt;code&gt;(1 &amp;lt;&amp;lt; 7) + (1 &amp;lt;&amp;lt; 14) + ... + (1 &amp;lt;&amp;lt; 56) = 0x102040810204080&lt;&#x2F;code&gt; we get the
following implementation:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;extract_lsb_bit_per_byte(w: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; mask = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x0101010101010101&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; sum_of_shifts = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x102040810204080&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    ((w &amp;amp; mask).wrapping_mul(sum_of_shifts) &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;56&lt;&#x2F;span&gt;&lt;span&gt;) as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Not as good as &lt;code&gt;PEXT&lt;&#x2F;code&gt;, but three arithmetic instructions is not bad at all.&lt;&#x2F;p&gt;
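&lt;p&gt;As a sanity check, here is a minimal Python sketch (not part of the implementation above) that compares adding the eight shifted copies against a single multiplication. The helper names are made up for illustration, the input word is built already masked, and since Python integers are unbounded the 64-bit wrap-around is emulated with &lt;code&gt;% 2**64&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```python
# Shifts 7, 14, ..., 56 from the diagram above; summing 2**s gives the multiplier.
SHIFTS = range(7, 57, 7)
MULTIPLIER = sum(2**s for s in SHIFTS)


def spread_to_byte_lsbs(byte):
    # Hypothetical helper: build an already-masked word in which
    # bit i of `byte` sits on the least significant bit of byte i.
    return sum(((byte >> i) % 2) * 2**(8 * i) for i in range(8))


def extract_by_shifted_copies(t):
    # Add the eight left-shifted copies (a shift by s is a multiplication
    # by 2**s), wrap to 64 bits, and keep only the topmost byte.
    return (sum(t * 2**s for s in SHIFTS) % 2**64) >> 56


def extract_by_multiplication(t):
    # One multiplication replaces the whole sum of shifted copies.
    return (t * MULTIPLIER % 2**64) >> 56
```

&lt;p&gt;Both functions should hand back the original byte for every input from 0 to 255, and &lt;code&gt;MULTIPLIER&lt;&#x2F;code&gt; should equal the constant used in the Rust code.&lt;&#x2F;p&gt;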
&lt;h2 id=&quot;depositing-bits-with-multiplication&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#depositing-bits-with-multiplication&quot; aria-label=&quot;Anchor link for: depositing-bits-with-multiplication&quot;&gt;Depositing bits with multiplication&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Unfortunately the following technique is significantly less general than the
previous one. While you can take inspiration from it to implement similar
algorithms, as-is it is limited to just spreading the bits of one byte to the
least significant bit of each byte in a 64-bit word.&lt;&#x2F;p&gt;
&lt;p&gt;The trick is similar to the one above. We add 8 shifted copies of
our byte which once again translates to a multiplication. By choosing a shift
that increases in multiples of 9 instead of 8 we ensure that the bit pattern
shifts over by one position in each byte. We then mask out our bits of interest,
and finish off with a shift and byteswap (which compiles to a single instruction &lt;code&gt;bswap&lt;&#x2F;code&gt;
on Intel or &lt;code&gt;rev&lt;&#x2F;code&gt; on ARM) to put our output bits on the least significant bits
and reverse the order.&lt;&#x2F;p&gt;
&lt;p&gt;This technique visualized:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;b       = ________________________________________________________abcdefgh
&lt;&#x2F;span&gt;&lt;span&gt;b &amp;lt;&amp;lt;  9 = _______________________________________________abcdefgh_________
&lt;&#x2F;span&gt;&lt;span&gt;b &amp;lt;&amp;lt; 18 = ______________________________________abcdefgh__________________
&lt;&#x2F;span&gt;&lt;span&gt;b &amp;lt;&amp;lt; 27 = _____________________________abcdefgh___________________________
&lt;&#x2F;span&gt;&lt;span&gt;b &amp;lt;&amp;lt; 36 = ____________________abcdefgh____________________________________
&lt;&#x2F;span&gt;&lt;span&gt;b &amp;lt;&amp;lt; 45 = ___________abcdefgh_____________________________________________
&lt;&#x2F;span&gt;&lt;span&gt;b &amp;lt;&amp;lt; 54 = __abcdefgh______________________________________________________
&lt;&#x2F;span&gt;&lt;span&gt;b &amp;lt;&amp;lt; 63 = h_______________________________________________________________
&lt;&#x2F;span&gt;&lt;span&gt;    sum = h_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh
&lt;&#x2F;span&gt;&lt;span&gt;   mask = 1_______1_______1_______1_______1_______1_______1_______1_______
&lt;&#x2F;span&gt;&lt;span&gt;s &amp;amp; msk = h_______g_______f_______e_______d_______c_______b_______a_______
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We once again note that the sum of shifts can be precomputed as &lt;code&gt;1 + (1 &amp;lt;&amp;lt; 9) + ... + (1 &amp;lt;&amp;lt; 63) = 0x8040201008040201&lt;&#x2F;code&gt;, allowing the following implementation:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;deposit_lsb_bit_per_byte(b: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; sum_of_shifts = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x8040201008040201&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; mask = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x8080808080808080&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; spread = (b as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64&lt;&#x2F;span&gt;&lt;span&gt;).wrapping_mul(sum_of_shifts) &amp;amp; mask;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u64&lt;&#x2F;span&gt;&lt;span&gt;::swap_bytes(spread &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This time it required 4 arithmetic instructions, not quite as good as &lt;code&gt;PDEP&lt;&#x2F;code&gt;,
but again not bad compared to a naive implementation, and this is cross-platform.&lt;&#x2F;p&gt;
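&lt;p&gt;A small Python sketch can cross-check the trick against a bit-by-bit definition. This is only an illustration under a few assumptions: the function names are hypothetical, &lt;code&gt;% 2**64&lt;&#x2F;code&gt; stands in for the wrapping multiplication, the mask-and-shift step is spelled out one bit at a time, and the byte swap is emulated through &lt;code&gt;int.to_bytes&lt;&#x2F;code&gt; and &lt;code&gt;int.from_bytes&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```python
# Shifts 0, 9, 18, ..., 63 from the diagram above.
SHIFTS = range(0, 64, 9)
MULTIPLIER = sum(2**s for s in SHIFTS)


def deposit_reference(b):
    # Bit-by-bit definition: bit i of the byte goes to bit 8*i of the word.
    return sum(((b >> i) % 2) * 2**(8 * i) for i in range(8))


def deposit_by_multiplication(b):
    # Sum of the eight shifted copies, wrapped to 64 bits.
    product = b * MULTIPLIER % 2**64
    # Same effect as masking with 0x8080808080808080 and shifting right by 7:
    # keep the top bit of every byte and move it to the bottom of that byte.
    spread = sum(((product >> (8 * j + 7)) % 2) * 2**(8 * j) for j in range(8))
    # Byte swap: write the word out little-endian, read it back big-endian.
    return int.from_bytes(spread.to_bytes(8, "little"), "big")
```

&lt;p&gt;The two functions should agree for all 256 possible input bytes.&lt;&#x2F;p&gt;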
</content>
	</entry>
	<entry xml:lang="en">
		<title>When Random Isn&#x27;t</title>
		<author>Orson R. L. Peters</author>
		<published>2024-01-10T00:00:00+00:00</published>
		<updated>2024-01-10T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/when-random-isnt/" type="text/html"/>
		<id>https://orlp.net/blog/when-random-isnt/</id>
		<content type="html">&lt;p&gt;This post is an anecdote from over a decade ago, of which I lost the actual
code. So please forgive me if I do not accurately remember all the details. Some
details are also simplified so that anyone that likes computer security can
enjoy this article, not just those who have played World of Warcraft (although
the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Venn_diagram&quot;&gt;Venn diagram&lt;&#x2F;a&gt; of those two
groups likely has a solid overlap).&lt;&#x2F;p&gt;
&lt;p&gt;When I was around 14 years old I discovered &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;World_of_Warcraft&quot;&gt;World of
Warcraft&lt;&#x2F;a&gt;, developed by Blizzard
Entertainment, and was immediately hooked. Not long after, I discovered add-ons, which allow
you to modify how your game’s user interface looks and works. However, not all
add-ons I downloaded did exactly what I wanted to do. I wanted more. So I went to
find out how they were made.&lt;&#x2F;p&gt;
&lt;p&gt;In a weird twist of fate, I blame World of Warcraft for me seriously picking up
programming. It turned out that add-ons were written in the
&lt;a href=&quot;https:&#x2F;&#x2F;www.lua.org&#x2F;&quot;&gt;Lua&lt;&#x2F;a&gt; programming language. Add-ons were nothing more than
a couple &lt;code&gt;.lua&lt;&#x2F;code&gt; source files in a folder directly loaded into the game. The
barrier to entry was incredibly low: just edit a file, press save and reload the
interface. The fact that the game loaded &lt;em&gt;your&lt;&#x2F;em&gt; source code and you could see it
running was magical!&lt;&#x2F;p&gt;
&lt;p&gt;I enjoyed it immensely and in no time I was only writing add-ons and was barely playing
the game itself anymore. I published &lt;a href=&quot;https:&#x2F;&#x2F;www.wowinterface.com&#x2F;downloads&#x2F;author-207710.html&quot;&gt;quite a few
add-ons&lt;&#x2F;a&gt; in the next
two years, which mostly involved copying other people’s code with some
refactoring &#x2F; recombining &#x2F; tweaking to my wishes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;add-on-security&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#add-on-security&quot; aria-label=&quot;Anchor link for: add-on-security&quot;&gt;Add-on security&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;A thought you might have is that it’s a really bad idea to let users have fully
programmable add-ons in your game, lest you get bots. However, the system
Blizzard made to prevent arbitrary programmable actions was quite clever.
Naturally, it did nothing to prevent actual botting, but at least
regular rule-abiding players were fundamentally restricted to the automation
Blizzard allowed.&lt;&#x2F;p&gt;
&lt;p&gt;Most UI elements that you could create were strictly decorative or
informational. These were completely unrestricted, as were most APIs that
strictly gather information. For example you can make a health bar display
using two frames, a background and a foreground, sizing the foreground
frame using an API call to get the health of your character.&lt;&#x2F;p&gt;
&lt;p&gt;Not all API calls were available to you however. Some were protected so they
could only be called from official Blizzard code. These typically involved
the API calls that would move your character, cast spells, use items, etc.
Generally speaking anything that actually makes you perform an in-game action
was protected.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;The API for getting your exact world location and camera orientation also became
protected at some point. This was a reaction by Blizzard to new add-ons that were
actively drawing 3D elements on top of the game world to make boss fights easier.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;However, some UI elements needed to actually interact with the game itself, e.g.
if I want to make a button that casts a certain spell. For this you could
construct a special kind of button that executes code in a secure environment
when clicked. You were only allowed to create&#x2F;destroy&#x2F;move such buttons when not
in combat, so you couldn’t simply conditionally place such buttons underneath
your cursor to automate actions during combat.&lt;&#x2F;p&gt;
&lt;p&gt;The catch was that this &lt;a href=&quot;https:&#x2F;&#x2F;wowwiki-archive.fandom.com&#x2F;wiki&#x2F;RestrictedEnvironment&quot;&gt;secure
environment&lt;&#x2F;a&gt;
&lt;em&gt;did&lt;&#x2F;em&gt; allow you to programmatically set which spell to cast, but didn’t
let you gather the information you would need to do arbitrary automation. All
access to state from outside the secure environment was blocked. There were
some information gathering API calls available to match the more accessible
in-game macro system, but nothing as fancy as getting skill cooldowns or
unit health which would enable automatic optimal spellcasting.&lt;&#x2F;p&gt;
&lt;p&gt;So there were two environments: an insecure one where you can get all
information but can’t act on it, and a secure one where you can act but can’t
get the information needed for automation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-backdoor-channel&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#a-backdoor-channel&quot; aria-label=&quot;Anchor link for: a-backdoor-channel&quot;&gt;A backdoor channel&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Fast forward a couple years and I had mostly stopped playing. My interests had
mainly moved on to more “serious” programming, and I was only occasionally
playing, mostly messing around with add-on ideas. But this secure environment kept
on nagging in my brain; I wanted to break it.&lt;&#x2F;p&gt;
&lt;p&gt;Of course there was third-party
software that completely disables the security restrictions from Blizzard, but
what’s the fun in that? I wanted to do it “legitimately”, using the technically
allowed tools, as a challenge.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Obviously using clever code to bypass security restrictions is no better than
using third-party software, and both would likely get you banned. I never
actually wanted to use the code, just to see if I could make it work.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;So I scanned the list of functions allowed in the secure environment to see if I could smuggle any
information from the outside into the secure environment. It all seemed pretty
hopeless until I saw one tiny, innocent little function: &lt;code&gt;random&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;An evil idea popped into my head: random number generators (RNGs) used in computers are almost
always &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pseudorandom_number_generator&quot;&gt;pseudorandom number generators&lt;&#x2F;a&gt;
with (hidden) internal state. If I can manipulate this state, perhaps I can use
that to pass information into the secure environment.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;random-number-generator-woes&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#random-number-generator-woes&quot; aria-label=&quot;Anchor link for: random-number-generator-woes&quot;&gt;Random number generator woes&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;It turned out that &lt;code&gt;random&lt;&#x2F;code&gt; was just a small shim around C’s
&lt;a href=&quot;https:&#x2F;&#x2F;en.cppreference.com&#x2F;w&#x2F;c&#x2F;numeric&#x2F;random&#x2F;rand&quot;&gt;&lt;code&gt;rand&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;. I was excited!
This meant that there was a single global random state that was shared in the
process. It also helps that &lt;code&gt;rand&lt;&#x2F;code&gt; implementations tended to be on the weak side.
Since World
of Warcraft was compiled with MSVC, the actual implementation of &lt;code&gt;rand&lt;&#x2F;code&gt; was as follows:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span&gt;uint32_t state;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span&gt;rand() {
&lt;&#x2F;span&gt;&lt;span&gt;    state = state * &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;214013 &lt;&#x2F;span&gt;&lt;span&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2531011&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;(state &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x7fff&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This RNG is, for lack of a better word, shite. It is
a naked &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Linear_congruential_generator&quot;&gt;linear congruential generator&lt;&#x2F;a&gt;,
and a weak one at that. Which in my case, was a good thing.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;I can understand MSVC keeps &lt;code&gt;rand&lt;&#x2F;code&gt; the same for backwards compatibility, and at
least all documentation I could find for &lt;code&gt;rand&lt;&#x2F;code&gt; recommends you not to use &lt;code&gt;rand&lt;&#x2F;code&gt;
for cryptographic purposes. But was there ever a time where such a bad PRNG
implementation was fit for &lt;em&gt;any&lt;&#x2F;em&gt; purpose?&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;So let’s get to breaking this thing. Since the state is so laughably small
and you can see 15 bits of the state directly you can keep a full list of
all possible states consistent with a single output of the RNG and use
further calls to the RNG to eliminate possibilities until a
single one remains. But we can be significantly more clever.&lt;&#x2F;p&gt;
&lt;p&gt;First we note that the top bit of &lt;code&gt;state&lt;&#x2F;code&gt; never affects anything in this RNG.
&lt;code&gt;(state &amp;gt;&amp;gt; 16) &amp;amp; 0x7fff&lt;&#x2F;code&gt; masks out 15 bits, after shifting away the bottom 16
bits, and thus effectively works mod $2^{31}$. Since on any update the new state
is a linear function of the previous state, we can propagate this modular form
all the way down to the initial state as $$f(x) \equiv f(x \bmod m) \mod m$$ for
any linear $f$.&lt;&#x2F;p&gt;
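&lt;p&gt;To make this concrete, here is a minimal Python sketch (the helper &lt;code&gt;outputs&lt;&#x2F;code&gt; is a made-up name) that runs the same generator once with a 32-bit state and once with the state reduced modulo $2^{31}$. Under the reasoning above the two output sequences should be identical:&lt;&#x2F;p&gt;

```python
A = 214013
B = 2531011


def outputs(state, count, modulus):
    # Hypothetical helper: run the generator with its state kept modulo
    # `modulus` and collect the 15-bit outputs (bits 16 through 30).
    result = []
    for _ in range(count):
        state = (state * A + B) % modulus
        result.append((state >> 16) % 2**15)
    return result


# Flipping the top bit of a 32-bit seed should not change any output.
seed = 12345
same = outputs(seed, 10, 2**32) == outputs(seed + 2**31, 10, 2**32)
```

&lt;p&gt;This is why the rest of the derivation can work modulo $2^{31}$ throughout.&lt;&#x2F;p&gt;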
&lt;p&gt;Let $a = 214013$ and $b = 2531011$. We observe the 15-bit outputs $r_0, r_1$ of
two RNG calls. We’ll call the 16-bit portion of the RNG state that is hidden by
the shift $h_0, h_1$ respectively, for the states after the first and second
call. This means the state of the RNG after the first call is $2^{16} r_0 + h_0$
and similarly for $2^{16} r_1 + h_1$ after the second call. Then we have the following identity:&lt;&#x2F;p&gt;
&lt;p&gt;$$a\cdot (2^{16}r_0 + h_0) + b \equiv 2^{16}r_1 + h_1 \mod 2^{31},$$&lt;&#x2F;p&gt;
&lt;p&gt;$$ah_0 \equiv h_1 + 2^{16}(r_1 - ar_0) - b \mod 2^{31}.$$&lt;&#x2F;p&gt;
&lt;p&gt;Now let $c \geq 0$ be the known constant $(2^{16}(r_1 - ar_0) - b) \bmod 2^{31}$, then
for some integer $k$ we have&lt;&#x2F;p&gt;
&lt;p&gt;$$ah_0 = h_1 + c + 2^{31} k.$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that the left hand side ranges from $0$ to $a (2^{16} - 1) \approx 2^{33.71}$.
Thus we must have $-1 \leq k \leq 2^{2.71} &amp;lt; 7$. Reordering we get the following
expression for $h_0$:
$$h_0 = \frac{c + 2^{31} k}{a} + h_1&#x2F;a.$$
Since $a &amp;gt; 2^{16}$ while $0 \leq h_1 &amp;lt; 2^{16}$ we note that the term $0 \leq h_1&#x2F;a &amp;lt; 1$.
Thus, assuming a solution exists, we must have
$$h_0 = \left\lceil\frac{c + 2^{31} k}{a}\right\rceil.$$&lt;&#x2F;p&gt;
&lt;p&gt;So for $-1 \leq k &amp;lt; 7$ we compute the above guess for the hidden portion of
the RNG state after the first call. This gives us 8 guesses, after which we can
reject bad guesses using follow-up calls to the RNG until a single unique answer remains.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;While I was able to re-derive the above with little difficulty now, 18-year-old
me wasn’t as experienced in discrete math. So I &lt;a href=&quot;https:&#x2F;&#x2F;crypto.stackexchange.com&#x2F;questions&#x2F;10608&#x2F;how-to-attack-a-fixed-lcg-with-partial-output&quot;&gt;asked on crypto.SE&lt;&#x2F;a&gt;,
with the excuse that I wanted to ‘show my colleagues how weak this RNG is’.
It worked, which sparks all kinds of interesting ethics questions.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;An example implementation of this process in Python:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;import &lt;&#x2F;span&gt;&lt;span&gt;random
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;A = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;214013
&lt;&#x2F;span&gt;&lt;span&gt;B = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2531011
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span&gt;MsvcRng:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;__init__(self, state):
&lt;&#x2F;span&gt;&lt;span&gt;        self.state = state
&lt;&#x2F;span&gt;&lt;span&gt;        
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;__call__(self):
&lt;&#x2F;span&gt;&lt;span&gt;        self.state = (self.state * A + B) % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;(self.state &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x7fff
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Create a random RNG state we&amp;#39;ll reverse engineer.
&lt;&#x2F;span&gt;&lt;span&gt;hidden_rng = MsvcRng(random.randrange(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;))
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Compute guesses for hidden state from 2 observations.
&lt;&#x2F;span&gt;&lt;span&gt;r0 = hidden_rng()
&lt;&#x2F;span&gt;&lt;span&gt;r1 = hidden_rng()
&lt;&#x2F;span&gt;&lt;span&gt;c = (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16 &lt;&#x2F;span&gt;&lt;span&gt;* (r1 - A * r0) - B) % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;31
&lt;&#x2F;span&gt;&lt;span&gt;ceil_div = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;a, b: (a + b - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &#x2F;&#x2F; b
&lt;&#x2F;span&gt;&lt;span&gt;h_guesses = [ceil_div(c + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;31 &lt;&#x2F;span&gt;&lt;span&gt;* k, A) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;k &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;)]
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Validate guesses until a single guess remains.
&lt;&#x2F;span&gt;&lt;span&gt;guess_rngs = [MsvcRng(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16 &lt;&#x2F;span&gt;&lt;span&gt;* r0 + h0) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;h0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;h_guesses]
&lt;&#x2F;span&gt;&lt;span&gt;guess_rngs = [g &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;g &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;guess_rngs &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;g() == r1]
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;len(guess_rngs) &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;    r = hidden_rng()
&lt;&#x2F;span&gt;&lt;span&gt;    guess_rngs = [g &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;g &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;guess_rngs &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;g() == r]
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# The top bit can not be recovered as it never affects the output,
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# but we should have recovered the effective hidden state.
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;assert &lt;&#x2F;span&gt;&lt;span&gt;guess_rngs[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;].state % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;31 &lt;&#x2F;span&gt;&lt;span&gt;== hidden_rng.state % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;31
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Although I wrote the above process with a &lt;code&gt;while&lt;&#x2F;code&gt; loop, in practice it appears to
never need more than a third output to narrow the candidates down to a single guess.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;putting-it-together&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#putting-it-together&quot; aria-label=&quot;Anchor link for: putting-it-together&quot;&gt;Putting it together&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Once we could reverse-engineer the internal state of the random number
generator we could make arbitrary automated decisions in the supposedly secure
environment. How it worked was as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;An insecure hook was registered that would execute right before the secure
environment code would run.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;In this hook we have full access to information, and make a decision as to
which action should be taken (e.g. casting a particular spell). This action
is looked up in a hardcoded list to get an index.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The current state of the RNG is reverse-engineered using the above process.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We predict the outcome of the next RNG call. If this (modulo the length
of our action list) does not give our desired outcome, we advance the RNG and
try again. This repeats until the next random number would correspond to our
desired action.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The hook returns, and the secure environment starts. It generates a “random”
number, indexes our hardcoded list of actions, and performs the “random” action.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;That’s all! By being able to simulate the RNG and looking one step ahead we could
use it as our information channel by choosing exactly the right moment to call
&lt;code&gt;random&lt;&#x2F;code&gt; in the secure environment. Now if you wanted to support a list of $n$
actions it would on average take $n$ steps of the RNG before the correct
number came up to pass along, but that wasn’t a problem in practice.&lt;&#x2F;p&gt;
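&lt;p&gt;The “advance until the next number matches” step can be sketched as follows. This is
only an illustration: &lt;code&gt;ToyRng&lt;&#x2F;code&gt; is a hypothetical stand-in generator (a textbook LCG,
not the game’s actual algorithm), and &lt;code&gt;steps_until&lt;&#x2F;code&gt; is a made-up helper name. In the
real exploit the clone would be the state recovered by the guessing process above.&lt;&#x2F;p&gt;

```python
class ToyRng:
    """Hypothetical stand-in LCG; NOT the game's actual generator."""

    def __init__(self, state):
        self.state = state

    def __call__(self):
        # Textbook LCG constants with a 31-bit state; only the high bits are output.
        self.state = (self.state * 1103515245 + 12345) % 2**31
        return self.state // 2**16

    def clone(self):
        return ToyRng(self.state)


def steps_until(rng, desired_index, num_actions):
    """Advance rng until its *next* output selects desired_index.

    Returns how many outputs had to be burned to get there.
    """
    steps = 0
    while True:
        peek = rng.clone()  # simulate one step ahead without touching rng
        if peek() % num_actions == desired_index:
            return steps
        rng()  # wrong number coming up: burn it and look again
        steps += 1


rng = ToyRng(42)
burned = steps_until(rng, desired_index=3, num_actions=8)
chosen = rng() % 8  # what the "secure" code would now pick: index 3
```

&lt;p&gt;With $n$ equally likely actions each peek succeeds with probability roughly $1&#x2F;n$, which is
where the average of about $n$ steps comes from.&lt;&#x2F;p&gt;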
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;I don’t know when Blizzard fixed the issue where the RNG state is so weak and
shared, or whether they were aware of it being an issue at all. A few years
after I had written the code I tried it again out of curiosity, and it had
stopped working. Maybe they switched to a different algorithm,
or had a properly separated RNG state for the secure environment.&lt;&#x2F;p&gt;
&lt;p&gt;All-in-all it was a lot of effort for a niche exploit in a video game that I
didn’t even want to use. But there certainly was a magic to manipulating
something supposedly random into doing exactly what you want, like a magician
pulling four aces from a shuffled deck.&lt;&#x2F;p&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Branchless Lomuto Partitioning</title>
		<author>Orson R. L. Peters</author>
		<published>2023-12-04T00:00:00+00:00</published>
		<updated>2023-12-04T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/branchless-lomuto-partitioning/" type="text/html"/>
		<id>https://orlp.net/blog/branchless-lomuto-partitioning/</id>
		<content type="html">&lt;p&gt;A partition function accepts as input an array of elements, and a function
returning a bool (a &lt;em&gt;predicate&lt;&#x2F;em&gt;) which indicates if an element should be in the
first, or second partition. Then it returns two arrays, the two &lt;em&gt;partitions&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;partition(v, pred):
&lt;&#x2F;span&gt;&lt;span&gt;    first = [x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;v &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;pred(x)]
&lt;&#x2F;span&gt;&lt;span&gt;    second = [x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;v &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;not pred(x)]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;first, second
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can actually be done without needing any extra memory beyond the original
array. An &lt;em&gt;in-place&lt;&#x2F;em&gt; partition algorithm reorders the array such that all the
elements for which &lt;code&gt;pred(x)&lt;&#x2F;code&gt; is true come before the elements for which
&lt;code&gt;pred(x)&lt;&#x2F;code&gt; is false (or vice versa - it doesn’t really matter). The algorithm
then returns how many elements satisfied &lt;code&gt;pred(x)&lt;&#x2F;code&gt;, so the caller can
logically split the array into two slices. This scheme is used in C++’s
&lt;a href=&quot;https:&#x2F;&#x2F;en.cppreference.com&#x2F;w&#x2F;cpp&#x2F;algorithm&#x2F;partition&quot;&gt;&lt;code&gt;std::partition&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
for example.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Usually partitioning is discussed in the context of sorting, most notably
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Quicksort&quot;&gt;quicksort&lt;&#x2F;a&gt;. There it is typically used
with a &lt;em&gt;pivot&lt;&#x2F;em&gt; $p$, and you partition the data into ${x &amp;lt; p}$ and ${x \geq p}$
before recursing on both partitions. However, it is more generally applicable than that.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;in-place-partition-algorithms&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#in-place-partition-algorithms&quot; aria-label=&quot;Anchor link for: in-place-partition-algorithms&quot;&gt;In-place partition algorithms&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;There are many possible variations &#x2F; implementations of in-place partition
algorithms, but they usually follow one of two schools: Hoare or Lomuto, named
after their original inventors &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Tony_Hoare&quot;&gt;Tony
Hoare&lt;&#x2F;a&gt; (the inventor of quicksort),
and Nico Lomuto.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;hoare&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#hoare&quot; aria-label=&quot;Anchor link for: hoare&quot;&gt;Hoare&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;In Hoare-style partition algorithms you have two iterators scanning the array,
one left-to-right ($i$) and one right-to-left ($j$). The former tries to find
elements that belong on the right, and the latter tries to find elements that
belong on the left. When both iterators have found an element, you swap them,
and continue. When the iterators cross each other you are done.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;hoare_partition(v, pred):
&lt;&#x2F;span&gt;&lt;span&gt;    i = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0       &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(pred(x) for x in v[:i])
&lt;&#x2F;span&gt;&lt;span&gt;    j = len(v)  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(not pred(x) for x in v[j:])
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;i &amp;lt; j and pred(v[i]): i += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;        j -= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;i &amp;lt; j and not pred(v[j]): j -= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;i &amp;gt;= j: &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;i
&lt;&#x2F;span&gt;&lt;span&gt;        v[i], v[j] = v[j], v[i]
&lt;&#x2F;span&gt;&lt;span&gt;        i += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The loop invariant can be visualized as follows (using the symbols $&amp;lt;$ and
$\geq$ for the predicate outcomes, as is usual in sorting):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;branchless-lomuto-partitioning&#x2F;hoare-invariant.png&quot; alt=&quot;A visual representation of the Hoare loop invariant.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Doing this efficiently is perhaps an article for another day; if you are curious you can check out
the paper &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1604.06697&quot;&gt;BlockQuicksort: How Branch Mispredictions don’t affect
Quicksort&lt;&#x2F;a&gt; by Stefan Edelkamp and Armin Weiß,
which is the technique I implemented in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;orlp&#x2F;pdqsort&quot;&gt;pdqsort&lt;&#x2F;a&gt;. Another take on the same
idea is found in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;minjaehwang&#x2F;bitsetsort&quot;&gt;bitsetsort&lt;&#x2F;a&gt;. Key here is that it can be
done &lt;em&gt;branchlessly&lt;&#x2F;em&gt;. A &lt;em&gt;branch&lt;&#x2F;em&gt; is a point at which the CPU has to make a choice of which code to
run. The most recognizable form of a branch is the &lt;code&gt;if&lt;&#x2F;code&gt; statement, but there are others (e.g.
&lt;code&gt;while&lt;&#x2F;code&gt; conditions, calling a function from a lookup table, short-circuiting logical operators).&lt;&#x2F;p&gt;
&lt;p&gt;To cut a long story short, a CPU tries to predict which piece of code it should run
next and already starts executing it before it knows whether the code it is sending
down the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Instruction_pipelining&quot;&gt;pipeline&lt;&#x2F;a&gt; is the
right choice. This is great in most cases as most branches are easy to predict,
but it does incur a penalty when the prediction was wrong, as the CPU needs to
stop, go back and restart from the right point once it realizes it was wrong
(which can take a while).&lt;&#x2F;p&gt;
&lt;p&gt;Especially in sorting, where the outcomes of comparisons are ideally unpredictable
(the more unpredictable the outcome of a comparison, the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Binary_entropy_function&quot;&gt;more informative&lt;&#x2F;a&gt;
getting the answer is), it can thus be advisable to avoid branching on comparisons
altogether.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Another branchless partition algorithm that is similar to Hoare but which makes
a temporary gap in the array so it can use moves rather than swaps is the &lt;em&gt;fulcrum
partition&lt;&#x2F;em&gt; found in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;scandum&#x2F;crumsort&quot;&gt;crumsort&lt;&#x2F;a&gt; by Igor
van den Hoven.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h3 id=&quot;lomuto&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#lomuto&quot; aria-label=&quot;Anchor link for: lomuto&quot;&gt;Lomuto&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;In Lomuto-style partition algorithms the following invariant is followed:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;branchless-lomuto-partitioning&#x2F;lomuto-invariant.png&quot; alt=&quot;A visual representation of the Lomuto loop invariant.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;That is, there is a single iterator scanning the array from left-to-right ($j$). If
the element is found to belong in the left partition, it is swapped with the
first element of the right partition (tracked by $i$), otherwise it is left
where it is.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;lomuto_partition(v, pred):
&lt;&#x2F;span&gt;&lt;span&gt;    i = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(pred(x) for x in v[:i])
&lt;&#x2F;span&gt;&lt;span&gt;    j = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(not pred(x) for x in v[i:j])
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;j &amp;lt; len(v):
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;pred(v[j]):
&lt;&#x2F;span&gt;&lt;span&gt;            v[i], v[j] = v[j], v[i]
&lt;&#x2F;span&gt;&lt;span&gt;            i += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;        j += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;i
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This article is focused on optimizing this style of partition.&lt;&#x2F;p&gt;
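&lt;p&gt;As a small usage sketch, here is how a caller might use the returned index to split the
array into its two partitions. The partition is the same as above, only written with a
&lt;code&gt;for&lt;&#x2F;code&gt; loop; the input data and the even&#x2F;odd predicate are arbitrary choices for
illustration.&lt;&#x2F;p&gt;

```python
def lomuto_partition(v, pred):
    i = 0  # everything in v[:i] satisfies pred
    for j in range(len(v)):  # everything in v[i:j] does not satisfy pred
        if pred(v[j]):
            v[i], v[j] = v[j], v[i]
            i += 1
    return i


v = [5, 2, 8, 1, 9, 3]
k = lomuto_partition(v, lambda x: x % 2 == 0)

# The array was reordered in place; k marks the boundary between the partitions.
evens, odds = v[:k], v[k:]
# k == 2, evens == [2, 8], odds == [5, 1, 9, 3]
```

&lt;p&gt;Note that the second partition is not guaranteed to keep its original relative order
in general: this style of partition is not stable.&lt;&#x2F;p&gt;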
&lt;h2 id=&quot;branchless-lomuto&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#branchless-lomuto&quot; aria-label=&quot;Anchor link for: branchless-lomuto&quot;&gt;Branchless Lomuto&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;A few years ago I read &lt;a href=&quot;https:&#x2F;&#x2F;dlang.org&#x2F;blog&#x2F;2020&#x2F;05&#x2F;14&#x2F;lomutos-comeback&#x2F;&quot;&gt;a post&lt;&#x2F;a&gt;
by Andrei Alexandrescu which discusses a branchless variant of the Lomuto
partition. Its inner loop (in C++) looks like this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(; read &amp;lt; last; ++read) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;auto&lt;&#x2F;span&gt;&lt;span&gt; x = *read;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;auto&lt;&#x2F;span&gt;&lt;span&gt; smaller = -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt;(x &amp;lt; pivot);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;auto&lt;&#x2F;span&gt;&lt;span&gt; delta = smaller &amp;amp; (read - first);
&lt;&#x2F;span&gt;&lt;span&gt;    first[delta] = *first;
&lt;&#x2F;span&gt;&lt;span&gt;    read[-delta] = x;
&lt;&#x2F;span&gt;&lt;span&gt;    first -= smaller;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At the time I was not overly impressed, as it does a lot of arithmetic to make
the loop branchless, so I disregarded it. A while back my friend &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;voultapher&#x2F;&quot;&gt;Lukas
Bergdoll&lt;&#x2F;a&gt; approached me with a new partition
algorithm which was doing quite well in his benchmarks, which I recognized as
being a variant of Lomuto. I then found a way the algorithm could be
restructured without using conditional move instructions, which made it perform
better still. I will present this algorithm now.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;simple-version&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#simple-version&quot; aria-label=&quot;Anchor link for: simple-version&quot;&gt;Simple version&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;First, a simplified variant which will make the more optimized variant much
more readily understood:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;branchless_lomuto_partition_simplified(v, pred):
&lt;&#x2F;span&gt;&lt;span&gt;    i = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(pred(x) for x in v[:i])
&lt;&#x2F;span&gt;&lt;span&gt;    j = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(not pred(x) for x in v[i:j])
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;j &amp;lt; len(v):
&lt;&#x2F;span&gt;&lt;span&gt;        v[i], v[j] = v[j], v[i]
&lt;&#x2F;span&gt;&lt;span&gt;        i += int(pred(v[i]))
&lt;&#x2F;span&gt;&lt;span&gt;        j += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;i
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is actually quite similar to the original &lt;code&gt;lomuto_partition&lt;&#x2F;code&gt;, except we
now &lt;em&gt;unconditionally&lt;&#x2F;em&gt; swap, and replace the &lt;code&gt;if&lt;&#x2F;code&gt;-guarded increment of $i$
by simply converting the boolean result of the predicate to an integer
and adding it to $i$.&lt;&#x2F;p&gt;
&lt;p&gt;To visualize this, the state of the array looks like this after the unconditional
swap:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;branchless-lomuto-partitioning&#x2F;branchless-lomuto-post-swap.png&quot; alt=&quot;A visual representation of the array post-swap.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;From this it should be pretty clear that incrementing $i$ if the predicate
is true (corresponding to $v[i] &amp;lt; p$ for sorting) and unconditionally
incrementing $j$ restores our Lomuto loop invariant.
The only corner case is when ${i = j},$ but then the swap is a no-op and the
algorithm remains correct.&lt;&#x2F;p&gt;
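&lt;p&gt;If you would rather test this than take the picture on faith, a randomized check is easy to
write. The sketch below restates the simplified variant with a &lt;code&gt;for&lt;&#x2F;code&gt; loop;
&lt;code&gt;is_partitioned&lt;&#x2F;code&gt; and &lt;code&gt;check&lt;&#x2F;code&gt; are hypothetical helper names, and the
even&#x2F;odd predicate is an arbitrary choice.&lt;&#x2F;p&gt;

```python
import random


def branchless_lomuto_partition_simplified(v, pred):
    i = 0
    for j in range(len(v)):
        v[i], v[j] = v[j], v[i]  # unconditional swap (a no-op when i == j)
        i += int(pred(v[i]))     # the boolean becomes 0 or 1, no `if` needed
    return i


def is_partitioned(v, k, pred):
    """True if exactly the first k elements satisfy pred."""
    return all(pred(x) for x in v[:k]) and not any(pred(x) for x in v[k:])


def check(trials=200, seed=0):
    rng = random.Random(seed)
    is_even = lambda x: x % 2 == 0
    for _ in range(trials):
        original = [rng.randrange(100) for _ in range(rng.randrange(20))]
        v = list(original)
        k = branchless_lomuto_partition_simplified(v, is_even)
        # The result must be a permutation of the input, split at index k.
        if sorted(v) != sorted(original) or not is_partitioned(v, k, is_even):
            return False
    return True
```

&lt;p&gt;Calling &lt;code&gt;check()&lt;&#x2F;code&gt; should return &lt;code&gt;True&lt;&#x2F;code&gt;. A passing random test is
of course evidence rather than proof; the invariant argument above is what actually
establishes correctness.&lt;&#x2F;p&gt;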
&lt;aside&gt;
&lt;p&gt;While researching prior art for this article
I came across &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;zeux&#x2F;nanosort&quot;&gt;nanosort&lt;&#x2F;a&gt; by Arseny Kapoulkine
which implements this simplified variant and &lt;em&gt;thanks&lt;&#x2F;em&gt; Andrei Alexandrescu for
his branchless Lomuto partition. But I actually believe it’s fundamentally
different to Andrei’s version.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h3 id=&quot;eliminating-swaps&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#eliminating-swaps&quot; aria-label=&quot;Anchor link for: eliminating-swaps&quot;&gt;Eliminating swaps&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Swaps are equivalent to three moves, but by restructuring the algorithm we
can get away with two moves per iteration. The trick is to introduce a &lt;em&gt;gap&lt;&#x2F;em&gt;
in the array by temporarily moving one of the elements out of the array.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;branchless_lomuto_partition(v, pred):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;len(v) == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    tmp = v[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;    i = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(pred(x) for x in v[:i])
&lt;&#x2F;span&gt;&lt;span&gt;    j = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Loop invariant: all(not pred(x) for x in v[i:j])
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;j &amp;lt; len(v) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;        v[j] = v[i]
&lt;&#x2F;span&gt;&lt;span&gt;        j += &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;        v[i] = v[j]
&lt;&#x2F;span&gt;&lt;span&gt;        i += pred(v[i])
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    v[j] = v[i]
&lt;&#x2F;span&gt;&lt;span&gt;    v[i] = tmp
&lt;&#x2F;span&gt;&lt;span&gt;    i += pred(v[i])
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;i
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is our new branchless Lomuto partition algorithm. Its inner loop is
incredibly tight, involving only two moves, one predicate evaluation
and two additions. A full visualization of one iteration of the algorithm (the striped
red area indicates the gap in the array):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;branchless-lomuto-partitioning&#x2F;branchless-lomuto-iteration.png&quot; alt=&quot;A visualization of one iteration of the algorithm.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
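&lt;p&gt;Because the gap makes the bookkeeping a little more subtle, it can be reassuring to check
the algorithm exhaustively on small inputs. The sketch below restates the gap-based partition
with a &lt;code&gt;for&lt;&#x2F;code&gt; loop (where &lt;code&gt;j - 1&lt;&#x2F;code&gt; is the current gap position) and runs
it on every possible true&#x2F;false pattern up to a small length. The helper name
&lt;code&gt;exhaustive_check&lt;&#x2F;code&gt; and the tagging of elements with their original index are
assumptions made for this illustration.&lt;&#x2F;p&gt;

```python
from itertools import product


def branchless_lomuto_partition(v, pred):
    if len(v) == 0:
        return 0

    tmp = v[0]  # lifting v[0] out creates the initial gap
    i = 0
    for j in range(1, len(v)):
        v[j - 1] = v[i]  # move the first not-pred element into the gap
        v[i] = v[j]      # move the next unseen element into the freed slot
        i += pred(v[i])
    v[len(v) - 1] = v[i]
    v[i] = tmp           # finally place the lifted element back
    i += pred(v[i])
    return i


def exhaustive_check(max_len=8):
    pred = lambda item: item[1]
    for n in range(max_len + 1):
        for flags in product([False, True], repeat=n):
            original = list(enumerate(flags))  # (index, flag) pairs are all distinct
            v = list(original)
            k = branchless_lomuto_partition(v, pred)
            same_elements = sorted(v) == original
            split_ok = all(pred(x) for x in v[:k]) and not any(pred(x) for x in v[k:])
            if not (same_elements and split_ok and k == sum(flags)):
                return False
    return True
```

&lt;p&gt;Tagging each element with its index makes all elements distinct, so the check would also
catch an element being duplicated or lost while it travels through the gap.&lt;&#x2F;p&gt;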
&lt;p&gt;We can now compare &lt;a href=&quot;https:&#x2F;&#x2F;cpp.godbolt.org&#x2F;z&#x2F;zzzTh47PG&quot;&gt;the assembly&lt;&#x2F;a&gt; of the tight inner loops of Andrei Alexandrescu’s
branchless Lomuto partition and ours:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;.alexandrescu:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;edi, DWORD PTR [rdx]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;rsi, rdx
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;ebp, DWORD PTR [rax]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmp     &lt;&#x2F;span&gt;&lt;span&gt;edi, r8d
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;setb    &lt;&#x2F;span&gt;&lt;span&gt;cl
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;sub     &lt;&#x2F;span&gt;&lt;span&gt;rsi, rax
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;sar     &lt;&#x2F;span&gt;&lt;span&gt;rsi, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;test    &lt;&#x2F;span&gt;&lt;span&gt;cl, cl
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmove   &lt;&#x2F;span&gt;&lt;span&gt;rsi, rbx
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;sal     &lt;&#x2F;span&gt;&lt;span&gt;rcx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;63
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;sar     &lt;&#x2F;span&gt;&lt;span&gt;rcx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;61
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;DWORD PTR [rax+rsi*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;], ebp
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;lea     &lt;&#x2F;span&gt;&lt;span&gt;r11, [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;+rsi*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;rsi, rdx
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;add     &lt;&#x2F;span&gt;&lt;span&gt;rdx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;sub     &lt;&#x2F;span&gt;&lt;span&gt;rsi, r11
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;sub     &lt;&#x2F;span&gt;&lt;span&gt;rax, rcx
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;DWORD PTR [rsi], edi
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmp     &lt;&#x2F;span&gt;&lt;span&gt;rdx, r10
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;jb      &lt;&#x2F;span&gt;&lt;span&gt;.alexandrescu
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;.orlp:
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;lea     &lt;&#x2F;span&gt;&lt;span&gt;rdi, [r8+rax*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;ecx, DWORD PTR [rdi]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;DWORD PTR [rdx], ecx
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;ecx, DWORD PTR [rdx+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmp     &lt;&#x2F;span&gt;&lt;span&gt;ecx, r9d
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;DWORD PTR [rdi], ecx
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;adc     &lt;&#x2F;span&gt;&lt;span&gt;rax, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;add     &lt;&#x2F;span&gt;&lt;span&gt;rdx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmp     &lt;&#x2F;span&gt;&lt;span&gt;rdx, r10
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;jne     &lt;&#x2F;span&gt;&lt;span&gt;.orlp
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Half the instruction count! A neat trick the compiler did is translate the
addition of the boolean result of the predicate (which is just a comparison
here) to &lt;code&gt;adc rax, 0&lt;&#x2F;code&gt;. It avoids needing to create a boolean 0&#x2F;1 value in a
register by setting the carry flag using &lt;code&gt;cmp&lt;&#x2F;code&gt; and adding zero with carry.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Is the new branchless Lomuto implementation worth it? For that I’ll hand you
over to my friend Lukas Bergdoll who has done an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Voultapher&#x2F;sort-research-rs&#x2F;blob&#x2F;main&#x2F;writeup&#x2F;lomcyc_partition&#x2F;text.md&quot;&gt;extensive write-up&lt;&#x2F;a&gt;
on the performance of an optimized implementation of this partition with a
variety of real-world benchmarks and metrics.&lt;&#x2F;p&gt;
&lt;p&gt;From an algorithmic standpoint the branchless Lomuto- and Hoare-style algorithms
do have a key difference: the number of writes they must perform.
Branchless Lomuto-style algorithms always do at least two moves for each
element, whereas Hoare-style algorithms can get away with doing a single move
for each element (crumsort), or even better, half a move per element on average
for random data (BlockQuicksort, pdqsort and bitsetsort, although they spend
more time figuring out what to move than crumsort does - one of many
trade-offs). So a key component of choosing an algorithm is the question “How
expensive are my moves?” which can vary from very cheap (small integers in
cache) to very expensive (large structs not in cache).&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the Lomuto-style algorithms tend to be significantly smaller in both
source code and generated machine code, which can be a factor for some. They
are also arguably easier to understand and prove correct; Hoare-style partition
algorithms are especially prone to off-by-one errors.&lt;&#x2F;p&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Subtraction Is Functionally Complete</title>
		<author>Orson R. L. Peters</author>
		<published>2023-09-28T00:00:00+00:00</published>
		<updated>2023-09-28T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/subtraction-is-functionally-complete/" type="text/html"/>
		<id>https://orlp.net/blog/subtraction-is-functionally-complete/</id>
		<content type="html">&lt;p&gt;To be precise, &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;IEEE_754&quot;&gt;IEEE-754&lt;&#x2F;a&gt; floating point
subtraction is &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Functional_completeness&quot;&gt;functionally
complete&lt;&#x2F;a&gt;. That means you
can construct any binary circuit using nothing but floating point subtraction.&lt;&#x2F;p&gt;
&lt;p&gt;To see how, we must start at the bottom. I quote the IEEE 754-2019 standard, section 6.3:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;h3 id=&quot;6-3-the-sign-bit&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#6-3-the-sign-bit&quot; aria-label=&quot;Anchor link for: 6-3-the-sign-bit&quot;&gt;6.3 The sign bit&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;[…] When neither the inputs nor result are NaN, […]; the sign of a sum, or of a difference $x−y$
regarded as a sum $x+(−y)$, differs from at most one of the addends’ signs; […].
These rules shall apply even when operands or results are zero or infinite.&lt;&#x2F;p&gt;
&lt;p&gt;When the sum of two operands with opposite signs (or the difference of two operands with like
signs) is exactly zero, the sign of that sum (or difference) shall be $+0$ under all rounding-direction
attributes except &lt;code&gt;roundTowardNegative&lt;&#x2F;code&gt;; under that attribute, the sign of an exact zero sum (or
difference) shall be $−0$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Let’s dissect that.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;A subtraction $x - y$ is considered a sum $x + (-y)$.&lt;&#x2F;li&gt;
&lt;li&gt;Zero can have a sign: $-0$ and $+0$ are distinct entities (although they compare
equal when testing with &lt;code&gt;==&lt;&#x2F;code&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;If both of the addends have the same sign, the output must have that sign.
Since $x - y$ is treated as $x + (-y)$, this means that if $x$ and $y$ have &lt;em&gt;different&lt;&#x2F;em&gt; signs, the output
must have the sign of $x$.&lt;&#x2F;li&gt;
&lt;li&gt;If $x$ and $y$ have the same sign, and $x - y$ is zero, the output will be
$+0$ for all rounding modes except &lt;code&gt;roundTowardNegative&lt;&#x2F;code&gt;, where it will be $-0$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Now since the default rounding mode in virtually every context is &lt;code&gt;roundTiesToEven&lt;&#x2F;code&gt;,
we shall assume that from now on. However, everything works analogously even for
&lt;code&gt;roundTowardNegative&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-truth-table&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#a-truth-table&quot; aria-label=&quot;Anchor link for: a-truth-table&quot;&gt;A truth table&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;So, what does that give us when subtracting zeroes?&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;- -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;= +&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Same sign, must be +0.
&lt;&#x2F;span&gt;&lt;span&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;- +&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Different signs, sign from first argument.
&lt;&#x2F;span&gt;&lt;span&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;- -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;= +&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Different signs, sign from first argument.
&lt;&#x2F;span&gt;&lt;span&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;- +&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;= +&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Same sign, must be +0.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Interesting… What if we say that $-0$ is false and $+0$ is true?&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;A B | O
&lt;&#x2F;span&gt;&lt;span&gt;----+--
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 0 &lt;&#x2F;span&gt;&lt;span&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 1 &lt;&#x2F;span&gt;&lt;span&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 0 &lt;&#x2F;span&gt;&lt;span&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 1 &lt;&#x2F;span&gt;&lt;span&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our resulting truth table is equivalent to ${A \vee \neg B}$, or ${B \to A}$ (also known as the
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;IMPLY_gate&quot;&gt;IMPLY&lt;&#x2F;a&gt; gate, albeit with the arguments swapped). It turns
out the gate with this truth table is &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Functional_completeness&quot;&gt;functionally
complete&lt;&#x2F;a&gt;, which means we can make arbitrary
circuits using only this gate.
Technically speaking, it is only functionally complete if given access to the
constant false. This is necessary to produce a NOT gate, and NOT + IMPLY is
a functionally complete set. I don’t know a better term for ‘functionally complete
if given access to some constant value’, however.&lt;&#x2F;p&gt;
&lt;aside&gt;
NAND and NOR are truly functionally complete by
themselves, even without access to any particular constant value. This is very
valuable when constructing microchips as you only need to be able to produce a
single kind of component, and do not need to worry about routing a consistent
low signal anywhere to produce a NOT gate.
&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;subtraction-circuits&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#subtraction-circuits&quot; aria-label=&quot;Anchor link for: subtraction-circuits&quot;&gt;Subtraction circuits&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s build a demo in Python. First we’ll define our constants and a helper to print them nicely.
Note that even though they are distinct entities, $+0$ and $-0$ compare equal in IEEE 754 floating
point, so we must extract the sign with &lt;code&gt;math.copysign&lt;&#x2F;code&gt; before comparing to 0 to tell them apart.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;import &lt;&#x2F;span&gt;&lt;span&gt;math
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;f_false = -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0
&lt;&#x2F;span&gt;&lt;span&gt;f_true = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0
&lt;&#x2F;span&gt;&lt;span&gt;f_repr = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;math.copysign(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, x) &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can now make a NOT gate by using the fact that $-0 - x$ flips the sign of
a zero $x$:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;f_not = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: f_false - x
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s test that:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_not(f_false))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_not(f_true))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Great! We can also build an OR gate by noticing that if we flip the sign of
the second argument before subtracting, we always get $+0$ (true) unless
both arguments are $-0$ (false):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;f_or = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;a, b: a - f_not(b)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let’s test it out:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_or(f_false, f_false))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_or(f_true,  f_false))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_or(f_false, f_true))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_or(f_true, f_true))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now that we have OR and NOT, we can make all other gates, e.g.:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;f_and = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;a, b: f_not(f_or(f_not(a), f_not(b)))
&lt;&#x2F;span&gt;&lt;span&gt;f_xor = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;a, b: f_or(f_and(f_not(a), b), f_and(a, f_not(b)))
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_and(f_false, f_false))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_and(f_true,  f_false))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_and(f_false, f_true))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_and(f_true, f_true))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_xor(f_false, f_false))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_xor(f_true,  f_false))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_xor(f_false, f_true))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f_repr(f_xor(f_true, f_true))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
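&lt;p&gt;The spot checks above can also be run exhaustively. The following is a minimal sketch rather than part of the original demo: it restates the gates as plain functions and compares every input combination against the built-in boolean operators. The helper names &lt;code&gt;f_bit&lt;&#x2F;code&gt; and &lt;code&gt;check_gates&lt;&#x2F;code&gt; are hypothetical and introduced only for illustration.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import itertools
import math

f_false = -0.0  # negative zero encodes false
f_true = 0.0    # positive zero encodes true

def f_not(x):
    return f_false - x

def f_or(a, b):
    return a - f_not(b)

def f_and(a, b):
    return f_not(f_or(f_not(a), f_not(b)))

def f_xor(a, b):
    return f_or(f_and(f_not(a), b), f_and(a, f_not(b)))

def f_repr(x):
    # copysign reads the sign bit, which == cannot see on zeroes.
    return math.copysign(1.0, x) == 1.0

def f_bit(p):
    # Hypothetical helper: encode a Python bool as a signed zero.
    return f_true if p else f_false

def check_gates():
    # Hypothetical helper: compare each gate with the boolean operators.
    for p, q in itertools.product([False, True], repeat=2):
        a, b = f_bit(p), f_bit(q)
        assert f_repr(f_or(a, b)) == (p or q)
        assert f_repr(f_and(a, b)) == (p and q)
        assert f_repr(f_xor(a, b)) == (p != q)
        # Raw subtraction is the gate from the truth table: A or not B.
        assert f_repr(a - b) == (p or not q)
    for p in [False, True]:
        assert f_repr(f_not(f_bit(p))) == (not p)
    return True
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Calling &lt;code&gt;check_gates()&lt;&#x2F;code&gt; returns &lt;code&gt;True&lt;&#x2F;code&gt; when every assertion holds, so a wrong gate definition would fail loudly instead of slipping past a missed spot check.&lt;&#x2F;p&gt;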
&lt;h2 id=&quot;software-integers&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#software-integers&quot; aria-label=&quot;Anchor link for: software-integers&quot;&gt;Software integers&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;You may have heard of soft-float, software implementations of floating point
using integers. Let’s turn that on its head: an integer implementation done in
software, using only floating point ops. Let’s do it in Rust so we can look at
the final assembly output to see how &lt;del&gt;horrifically slow&lt;&#x2F;del&gt; awesome it is.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span&gt;Bit = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;ZERO&lt;&#x2F;span&gt;&lt;span&gt;: Bit = -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;ONE&lt;&#x2F;span&gt;&lt;span&gt;: Bit = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;not(x: Bit) -&amp;gt; Bit { &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;ZERO &lt;&#x2F;span&gt;&lt;span&gt;- x }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;or(a: Bit, b: Bit) -&amp;gt; Bit { a - not(b) }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;and(a: Bit, b: Bit) -&amp;gt; Bit { not(or(not(a), not(b))) }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;xor(a: Bit, b: Bit) -&amp;gt; Bit { or(and(not(a), b), and(a, not(b))) }
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;adder(a: Bit, b: Bit, c: Bit) -&amp;gt; (Bit, Bit) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; s = xor(xor(a, b), c);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; c = or(and(xor(a, b), c), and(a, b));
&lt;&#x2F;span&gt;&lt;span&gt;    (s, c)
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span&gt;SoftU8 = [Bit; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;softu8_add(a: SoftU8, b: SoftU8) -&amp;gt; SoftU8 {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s0, c) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;], &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;ZERO&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s1, c) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], c);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s2, c) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;], c);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s3, c) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;], c);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s4, c) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;], c);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s5, c) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;], c);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s6, c) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;], c);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span&gt;(s7, _) = adder(a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;], b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;], c);
&lt;&#x2F;span&gt;&lt;span&gt;    [s0, s1, s2, s3, s4, s5, s6, s7]
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Hmm? u8? What&amp;#39;s that? Shhhh....
&lt;&#x2F;span&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;to_softu8(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; SoftU8 {
&lt;&#x2F;span&gt;&lt;span&gt;    std::array::from_fn(|i| &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(x &amp;gt;&amp;gt; i) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;ONE &lt;&#x2F;span&gt;&lt;span&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;ZERO &lt;&#x2F;span&gt;&lt;span&gt;})
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;from_softu8(x: SoftU8) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;).filter(|i| x[*i].signum() &amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;).map(|i| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&amp;lt; i).sum()
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;main() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; a = to_softu8(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; b = to_softu8(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, from_softu8(softu8_add(a, b)));
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It’s horrible, but it works: it dutifully prints 42. And it &lt;em&gt;only&lt;&#x2F;em&gt; took $\approx 120$
floating point instructions to add two 8-bit integers:&lt;&#x2F;p&gt;
&lt;aside&gt;
On x86-64 there isn&#x27;t actually a floating point negation instruction; instead
the compiler simply emits an XOR with a mask that toggles the top bit, which
is the sign bit of an IEEE-754 floating point number.
&lt;&#x2F;aside&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;example::softu8_add:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;rax, rdi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movups  &lt;&#x2F;span&gt;&lt;span&gt;xmm2, xmmword ptr [rsi]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movups  &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmmword ptr [rdx]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm1, xmmword ptr [rip + .LCPI0_0]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm4, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;85
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm6
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm10, xmm6
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm10, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm10, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm0, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;85
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm2, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;85
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm8, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;unpckhpd        &lt;&#x2F;span&gt;&lt;span&gt;xmm8, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;unpckhpd        &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm9, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;unpckhpd        &lt;&#x2F;span&gt;&lt;span&gt;xmm9, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm9
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm11, xmm11
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm9, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm9, xmm4, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;255
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm10
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm10, xmm7
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm10, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm10, xmm8
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm10
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;unpcklps        &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm7, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;addps   &lt;&#x2F;span&gt;&lt;span&gt;xmm11, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movlhps &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm6
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movss   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm11
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm8
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm7, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;66
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm9
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm0, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;255
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm2, xmm2, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;255
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movups  &lt;&#x2F;span&gt;&lt;span&gt;xmmword ptr [rdi], xmm5
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movups  &lt;&#x2F;span&gt;&lt;span&gt;xmm2, xmmword ptr [rdx + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movups  &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmmword ptr [rsi + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm7
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm2, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;85
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm8, xmm7
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm9, xmm7
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm9, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm2, xmm7
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm7, xmm7, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;85
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm7
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movhlps &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm6
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movhlps &lt;&#x2F;span&gt;&lt;span&gt;xmm8, xmm8
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm8
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm2, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm2, xmm9
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;unpcklps        &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm5
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shufps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm2, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;85
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm5
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;unpckhpd        &lt;&#x2F;span&gt;&lt;span&gt;xmm5, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm5
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subss   &lt;&#x2F;span&gt;&lt;span&gt;xmm6, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;unpcklps        &lt;&#x2F;span&gt;&lt;span&gt;xmm4, xmm6
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movlhps &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm4
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movaps  &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm3, xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm2
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xorps   &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;subps   &lt;&#x2F;span&gt;&lt;span&gt;xmm0, xmm3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movups  &lt;&#x2F;span&gt;&lt;span&gt;xmmword ptr [rdi + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;], xmm0
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Bitwise Binary Search: Elegant and Fast</title>
		<author>Orson R. L. Peters</author>
		<published>2023-05-16T00:00:00+00:00</published>
		<updated>2023-05-16T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/bitwise-binary-search/" type="text/html"/>
		<id>https://orlp.net/blog/bitwise-binary-search/</id>
		<content type="html">&lt;p&gt;I recently read the article &lt;a href=&quot;https:&#x2F;&#x2F;probablydance.com&#x2F;2023&#x2F;04&#x2F;27&#x2F;beautiful-branchless-binary-search&#x2F;&quot;&gt;&lt;em&gt;Beautiful Branchless Binary Search&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
by Malte Skarupke. In it they discuss the merits of the following snippet of
C++ code implementing a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Binary_search_algorithm&quot;&gt;binary search&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; It, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; T, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; Cmp&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;It lower_bound_skarupke(It begin, It end, const T&amp;amp; value, Cmp comp) {
&lt;&#x2F;span&gt;&lt;span&gt;    size_t length = end - begin;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(length == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; end;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t step = bit_floor(length);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(step != length &amp;amp;&amp;amp; comp(begin[step], value)) {
&lt;&#x2F;span&gt;&lt;span&gt;        length -= step + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(length == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; end;
&lt;&#x2F;span&gt;&lt;span&gt;        step = bit_ceil(length);
&lt;&#x2F;span&gt;&lt;span&gt;        begin = end - step;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(step &#x2F;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;; step != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; step &#x2F;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[step], value)) begin += step;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + comp(*begin, value);
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Frankly, while the ideas behind the algorithm are beautiful, I find the
implementation complex and hard to understand or prove correct. This is not
meant as a jab at Malte Skarupke; I find almost all binary search
implementations hard to understand or prove correct.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Binary search is notoriously difficult to get correct. A &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;52965.53012&quot;&gt;1988
study&lt;&#x2F;a&gt; found that out of an informal
sample of twenty computer science &lt;strong&gt;textbooks&lt;&#x2F;strong&gt;, only five contained a correct
binary search algorithm. I haven’t checked, but I hope we are doing better
now in 2023. Unfortunately, off-by-one errors feel rather timeless to me.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;In this article I will provide an alternative implementation based on similar
ideas but with a very different &lt;em&gt;interpretation&lt;&#x2F;em&gt; that is (in my opinion)
incredibly elegant and clear to understand, at least as far as binary searches
go. The resulting implementation also saves a comparison in almost every case
and ends up quite a bit smaller.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-brief-history-lesson&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#a-brief-history-lesson&quot; aria-label=&quot;Anchor link for: a-brief-history-lesson&quot;&gt;A brief history lesson&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Feel free to &lt;a href=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;#lower-bounds&quot;&gt;skip&lt;&#x2F;a&gt; this section if you are not interested in history, but I had
to find out whose shoulders we are standing on. This is not only to give credit
where credit is due, but also to see if any useful details were lost in
translation.&lt;&#x2F;p&gt;
&lt;p&gt;Malte Skarupke says they learned about the above algorithm from
Alex Muscar in &lt;a href=&quot;https:&#x2F;&#x2F;muscar.eu&#x2F;shar-binary-search-meta.html&quot;&gt;their blog post&lt;&#x2F;a&gt;.
Alex says they found the algorithm while reading &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Jon_Bentley_(computer_scientist)&quot;&gt;Jon L. Bentley’s&lt;&#x2F;a&gt; book
&lt;em&gt;Writing Efficient Programs&lt;&#x2F;em&gt; (ISBN 0-13-970244-X). Jon Bentley writes:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;If we need more speed then we should consult Knuth’s [1973] definitive treatise on
searching. Section 6.2.1 discusses binary search, and Exercise 6.2.1-11 describes
an extremely efficient binary search program; […]&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I own a hard copy of the referenced book, Donald Knuth’s &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;The_Art_of_Computer_Programming&quot;&gt;The Art of Computer Programming&lt;&#x2F;a&gt;
(also known as TAOCP), volume 3, Sorting and Searching. Exercise 6.2.1-11 is not
the correct exercise in my edition, but 12 and 13 are, which are exercises
referring to “Shar’s method”.&lt;&#x2F;p&gt;
&lt;p&gt;We have to scan chapter 6.2.1 to find the mentioned method. Finally, we find it
on page 416. First as context, Knuth uses the following notation for binary search:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Algorithm U&lt;&#x2F;strong&gt; (&lt;em&gt;Uniform binary search&lt;&#x2F;em&gt;). Given a table of records
$R_1, R_2, \dots, R_N$ whose keys are in increasing order $K_1 &amp;lt; K_2 &amp;lt; \cdots &amp;lt; K_N$,
this algorithm searches for a given argument $K$. If $N$ is even, the
algorithm will sometimes refer to a dummy key $K_0$ that should be set to $-\infty$.
We assume that $N \geq 1$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Now we can finally see Shar’s method:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another modification of binary search, suggested in 1971 by L. E. Shar,
will be still faster on some computers, because it is uniform after the first
step, and it requires no table.
The first step is to compare $K$ with $K_i$, where $i = 2^k$, $k = \lfloor \lg N\rfloor$.
If $K$ &amp;lt; $K_i$, we use a uniform search with the $\delta$‘s equal to $2^{k-1},
2^{k-2}, \dots, 1, 0$. On the other hand, if $K &amp;gt; K_i$ we reset $i$ to $i’ = N + 1 - 2^l$
where $l = \lceil\lg(N - 2^k + 1)\rceil$, and pretend that the first
comparison was actually $K &amp;gt; K_{i’},$ using a uniform search with the
$\delta$’s equal to $2^{l-1}, 2^{l-2}, \dots, 1, 0$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The $\delta$’s the first paragraph refers to can be understood as the ‘step’
variable in the above C++ code. Overall Skarupke’s C++ code seems a fairly
faithful implementation of Shar’s method as described by Knuth, except that
Knuth uses one-based indexing which Skarupke’s method does not take into account.
Knuth goes on to describe that Shar’s method never makes more than $\lfloor \lg
N \rfloor + 1$ comparisons, which is one more than the minimum possible number
of comparisons.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;I also find it interesting that Knuth mentions a further speed-up of binary search
to be found in exercise 24. Exercise 24 essentially hints at using an implicit
tree similar to a binary heap where the children of node $i$ are found at $2i$ and $2i + 1$.
This is nowadays known as the &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1509.05053&quot;&gt;Eytzinger layout&lt;&#x2F;a&gt;,
which is a much better layout for binary search if you can decide the order of your elements.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;To finish the history lesson, I did look on Google Scholar, but I could not find a 1971 paper by L. E. Shar.
I assume the modification was described in private communication with Knuth.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;lower-bounds&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#lower-bounds&quot; aria-label=&quot;Anchor link for: lower-bounds&quot;&gt;Lower bounds&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Let us assume that we have a zero-indexed array $A$ of length $n$ that is in
ascending order: $A[0] \leq A[1] \leq \cdots \leq A[n-1]$. We want to find the
&lt;em&gt;lower bound&lt;&#x2F;em&gt; of some element $x$ in this array. This is the leftmost position
we could insert $x$ into the array while keeping it sorted. Alternatively
phrased, this is the number of elements strictly less than $x$ in the array.&lt;&#x2F;p&gt;
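&lt;p&gt;As a quick sanity check of the two equivalent definitions above, here is a minimal Python sketch. It is not part of any implementation discussed in this article; the helper name &lt;code&gt;lower_bound_by_counting&lt;&#x2F;code&gt; and the sample array are made up for illustration. It compares a naive count against &lt;code&gt;bisect_left&lt;&#x2F;code&gt; from the Python standard library, which returns the leftmost insertion point:&lt;&#x2F;p&gt;

```python
from bisect import bisect_left


def lower_bound_by_counting(a, x):
    """Count the elements of the sorted list `a` that are strictly less than `x`."""
    return sum(1 for item in a if x > item)


# Hypothetical sample data, including a duplicate value.
a = [1, 3, 3, 5, 8]
for x in [0, 3, 4, 9]:
    # The leftmost insertion point equals the number of strictly smaller elements.
    assert lower_bound_by_counting(a, x) == bisect_left(a, x)
```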
&lt;p&gt;A traditional binary search algorithm finds this number by keeping a range of
possible solutions, repeatedly cutting that range in two pieces and
selecting the only piece which contains the solution. This tends to be tricky
to get right, as you must avoid overflows while computing the midpoint, and
are dealing with multiple boundary conditions, both in your code and
in your correctness invariant.&lt;&#x2F;p&gt;
&lt;p&gt;Before we begin with our solution that avoids this, we have to take a moment and
understand an important aspect of binary search. With each comparison we can
distinguish between two sets of outcomes. Thus with $k$ comparisons, we can
distinguish between $2^k$ total outcomes. However, for $n$ elements, there are
$n+1$ outcomes! For example, for an array of 7 elements there are 8 positions in
which $x$ could be sorted: &lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;full-binary-tree-7.png&quot; alt=&quot;A full binary search tree of size 7.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Thus the natural array size for binary search is $2^k - 1$, and
not $2^k$.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-bitwise-approach&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#a-bitwise-approach&quot; aria-label=&quot;Anchor link for: a-bitwise-approach&quot;&gt;A bitwise approach&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s take a look at the sixteen possible 4-bit integers in binary:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt; 0 = 0000      8 = 1000 
&lt;&#x2F;span&gt;&lt;span&gt; 1 = 0001      9 = 1001 
&lt;&#x2F;span&gt;&lt;span&gt; 2 = 0010     10 = 1010 
&lt;&#x2F;span&gt;&lt;span&gt; 3 = 0011     11 = 1011 
&lt;&#x2F;span&gt;&lt;span&gt; 4 = 0100     12 = 1100 
&lt;&#x2F;span&gt;&lt;span&gt; 5 = 0101     13 = 1101 
&lt;&#x2F;span&gt;&lt;span&gt; 6 = 0110     14 = 1110 
&lt;&#x2F;span&gt;&lt;span&gt; 7 = 0111     15 = 1111 
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice how if the top bit of the integer is set, it remains set for all larger
integers. And within each group with the same top bit, when the second most
significant bit is set, it remains set for larger integers within that group.
And so on for the third bit within each group with the same top two bits, ad
infinitum. &lt;strong&gt;In binary, within each group of integers with identical top $t$ bits, the
value of the $(t+1)$th bit is monotonically non-decreasing.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Since our desired solution is the number of elements strictly less than $x,$
we can rephrase it as finding the largest number $b$ such that ${A[b-1] &amp;lt; x},$
or $b = 0$ if no elements are less than $x$.
We can find the unique $b$ very efficiently by constructing it &lt;strong&gt;directly&lt;&#x2F;strong&gt;, bit-by-bit,
using the above observation.&lt;&#x2F;p&gt;
&lt;p&gt;Let’s assume that $A$ has length $n = 2^k - 1$. Then any possible answer $b$
fits exactly in $k$ bits. Since $A$ is sorted, if we find that $A[i-1] &amp;lt; x$, we
know that ${b \geq i}$. Conversely, if that comparison fails, we know that ${b &amp;lt;
i}.$ Thus, using the above observation we can figure out if the top bit of $b$
is set simply by testing $A[i-1] &amp;lt; x$ with $i = 2^{k-1}$.&lt;&#x2F;p&gt;
&lt;p&gt;Now we know what the top bit should be and set it accordingly, never changing it
again. Using the above observation, this time within the group of
integers with a given top bit, we know that if we set the second bit and find
that $A[i-1] &amp;lt; x$ still holds, the second bit must be set, and if not it
must be zero. We repeat this process bit-by-bit until we have figured out all
bits of $b$, giving our answer!&lt;&#x2F;p&gt;
&lt;p&gt;Perhaps you are like me, and you are a visual thinker. Let us flip our
earlier tree on its side and visually associate a binary $b$
value with each gap between the elements of our array:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;binary-gap-indices.png&quot; alt=&quot;A full binary search tree of size 7 with binary indices for the gaps between them.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The small red arrows indicate which element $A[b-1] &amp;lt; x$ would test
for a given guess of $b$. Note that no element is associated with $b = 0$,
as we can only end up with this value if all other tests failed, and thus
we never have to test this value. A search for $5$ would end up testing
${b = {\color{red}1}00_2}$ (success, set bit), ${b = 1{\color{red}1}0_2}$ (fail, do not set bit) and ${b = 10{\color{red}1}_2}$ (success, set bit).&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Now, you might argue this ‘bitwise’ binary search is still doing the traditional
splitting of ranges, just implicitly. And looking at the above binary tree, of
course you would be right. But to me the interpretation of the algorithm
through a bit-by-bit decoding of $b$ is more elegant, and easier to see as
correct, with much less worry about boundary conditions and off-by-one
errors.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;In C++ we would get the following:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Only works for n = 2^k - 1.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; It, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; T, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; Cmp&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;It lower_bound_2k1(It begin, It end, const T&amp;amp; value, Cmp comp) {
&lt;&#x2F;span&gt;&lt;span&gt;    size_t two_k = (end - begin) + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = two_k &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[(b | bit) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], value)) b |= bit;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + b;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that we always do exactly $k$ comparisons, which is optimal.&lt;&#x2F;p&gt;
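&lt;p&gt;To watch the bits of $b$ being decided one at a time, here is a rough Python rendition of the same loop. This is only a sketch: the &lt;code&gt;probes&lt;&#x2F;code&gt; list and the sample array are additions for illustration and do not appear in the C++ version.&lt;&#x2F;p&gt;

```python
def lower_bound_2k1(a, x):
    """Bitwise lower bound; assumes `a` is sorted and len(a) == 2**k - 1.

    Returns the answer b together with the list of guesses that were tested.
    """
    two_k = len(a) + 1
    if two_k != 2 ** (two_k.bit_length() - 1):
        raise ValueError("len(a) must be of the form 2**k - 1")

    b = 0
    probes = []  # illustrative only: records each candidate value of b
    bit = two_k >> 1
    while bit != 0:
        guess = b | bit
        probes.append(guess)
        # Is a[guess - 1] strictly less than x? If so, the bit stays set.
        if x > a[guess - 1]:
            b = guess
        bit >>= 1
    return b, probes


# Hypothetical example with 7 elements, so k = 3 comparisons.
result = lower_bound_2k1([10, 20, 30, 40, 50, 60, 70], 55)
print(result)  # prints (5, [4, 6, 5])
```

&lt;p&gt;In binary the guesses are $100_2$ (success), $110_2$ (fail) and $101_2$ (success), giving $b = 5$: five elements of this sample array are smaller than 55.&lt;&#x2F;p&gt;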
&lt;h2 id=&quot;generalizing-to-other-sizes&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#generalizing-to-other-sizes&quot; aria-label=&quot;Anchor link for: generalizing-to-other-sizes&quot;&gt;Generalizing to other sizes&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;However, there is a glaring issue: our original array might not have length
$2^k - 1$. The simplest way to solve this is to add elements with value
$\infty$ to the end, to pad the array out to $2^k - 1$ elements.
Instead of physically adding $\infty$ elements to the array, we can simply
check if the index lies in the original array, and if not skip our test
entirely, as it would fail (we’d be testing if $\infty &amp;lt; x$).&lt;&#x2F;p&gt;
&lt;p&gt;To pad our array out we want to find the smallest integer $k$ such that $2^k - 1 \geq n$,
which means $k \geq \log_2(n + 1)$, which after rounding up gives
$$k = \lceil \log_2(n + 1) \rceil = \lfloor \log_2(n) \rfloor + 1.$$&lt;&#x2F;p&gt;
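&lt;p&gt;The identity above is easy to check numerically. The following is a small Python sketch (the function name &lt;code&gt;smallest_k&lt;&#x2F;code&gt; is hypothetical) that finds $k$ by brute force and compares it with both closed forms, as well as with Python’s &lt;code&gt;int.bit_length&lt;&#x2F;code&gt;, which yields the same value using integers only:&lt;&#x2F;p&gt;

```python
import math


def smallest_k(n):
    """Smallest k such that 2**k - 1 >= n, found by brute force."""
    k = 0
    while n > 2 ** k - 1:
        k += 1
    return k


for n in range(1, 1000):
    k = smallest_k(n)
    assert k == math.ceil(math.log2(n + 1))
    assert k == math.floor(math.log2(n)) + 1
    assert k == n.bit_length()  # integer-only way to get the same value
```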
&lt;p&gt;Alternatively, and definitely more enlightening, this can be understood as
initializing &lt;code&gt;bit&lt;&#x2F;code&gt; in our loop to the highest set bit in $n$:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; It, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; T, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; Cmp&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;It lower_bound_pad(It begin, It end, const T&amp;amp; value, Cmp comp) {
&lt;&#x2F;span&gt;&lt;span&gt;    size_t n = end - begin;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = std::bit_floor(n); bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        size_t i = (b | bit) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(i &amp;lt; n &amp;amp;&amp;amp; comp(begin[i], value)) b |= bit;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + b;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In my opinion this is the most elegant binary search implementation there is.&lt;&#x2F;p&gt;
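&lt;p&gt;For readers who want to experiment, the padded search translates almost line for line into Python. The following is only a sketch under some assumptions: it works on list indices rather than iterators, hardcodes the comparison to “less than”, and computes the highest set bit from &lt;code&gt;int.bit_length&lt;&#x2F;code&gt;. It is checked against the standard library’s &lt;code&gt;bisect_left&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```python
from bisect import bisect_left


def lower_bound_pad(v, value):
    """Index of the first element of sorted list v that is >= value."""
    n = len(v)
    b = 0
    # Highest set bit of n (0 when n == 0), mirroring std::bit_floor.
    bit = 2 ** (n.bit_length() - 1) if n > 0 else 0
    while bit != 0:
        i = (b | bit) - 1
        # Indices past the end act as +infinity, so the test is skipped.
        if n > i and value > v[i]:
            b |= bit
        bit >>= 1
    return b


# Compare against the standard library on every size up to 32.
for n in range(33):
    v = [2 * j for j in range(n)]
    for x in range(-1, 2 * n + 1):
        assert lower_bound_pad(v, x) == bisect_left(v, x)
```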
&lt;hr&gt;
&lt;h2 id=&quot;making-it-branchless&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#making-it-branchless&quot; aria-label=&quot;Anchor link for: making-it-branchless&quot;&gt;Making it branchless&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The above works well, but introduces an index check before each array access.
This means the compiler cannot eliminate the
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Branch_(computer_science)&quot;&gt;branch&lt;&#x2F;a&gt; here, lest we
access out-of-bounds memory.&lt;&#x2F;p&gt;
&lt;p&gt;To solve this problem we use a trick similar to L. E. Shar’s: we do an initial
comparison with the middle element, then either look at $2^k - 1$ elements at
the start, or $2^k - 1$ elements at the end of the array. If the array size
itself isn’t of the form $2^k - 1$, these two subslices overlap in the middle.
To completely cover our array (together with the element we initially compare
with) we must have $$(2^k - 1) + (2^k - 1) + 1 = 2^{k+1} - 1 \geq n$$ and thus
we choose $k = \lceil \log_2(n + 1) - 1 \rceil = \lfloor \log_2 (n) \rfloor$ :&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; It, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; T, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; Cmp&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;It lower_bound_overlap(It begin, It end, const T&amp;amp; value, Cmp comp) {
&lt;&#x2F;span&gt;&lt;span&gt;    size_t n = end - begin;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(n == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t two_k = std::bit_floor(n);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[n &#x2F; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;], value)) begin = end - (two_k - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;    size_t b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = two_k &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[(b | bit) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], value)) b |= bit;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + b;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;improving-the-efficiency&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#improving-the-efficiency&quot; aria-label=&quot;Anchor link for: improving-the-efficiency&quot;&gt;Improving the efficiency&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;If our array doesn’t have size $2^{k+1} - 1$, the subarrays overlap in the
middle. This means that part of the subarrays already eliminated by the initial
comparison $A[n &#x2F; 2] &amp;lt; x$ are being unnecessarily searched. Can we improve on
this?&lt;&#x2F;p&gt;
&lt;p&gt;We can, if we choose two different sizes $2^l - 1$ and $2^r - 1$ for when
we are respectively searching at the start (left) or end (right) of the array.
Again, in combination with the initial element we compare with (which is now
$A[2^l - 1] &amp;lt; x$) we find that we must have&lt;&#x2F;p&gt;
&lt;p&gt;$$(2^l - 1) + (2^r - 1) + 1 = 2^l + 2^r - 1 \geq n$$&lt;&#x2F;p&gt;
&lt;p&gt;to be able to handle an arbitrary size $n$. And of course, we must have
$2^l - 1 \leq n$ and $2^r - 1 \leq n$ for our subarrays to fit. Let’s find the
optimal choice for $l, r$—which is not trivial.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;cost-analysis&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#cost-analysis&quot; aria-label=&quot;Anchor link for: cost-analysis&quot;&gt;Cost analysis&lt;&#x2F;a&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;What is the cost of a certain choice of $l, r$, assuming a uniform distribution
over the $n + 1$ possible outcomes of the binary search? We know that after
the initial comparison, for $2^l$ of those outcomes we use $l$ comparisons,
and thus the rest must use $r$ comparisons for an expected cost of&lt;&#x2F;p&gt;
&lt;p&gt;$$C = 1 + \frac{2^l}{n + 1}\cdot l + \frac{n + 1 - 2^l}{n + 1}\cdot r$$&lt;&#x2F;p&gt;
&lt;p&gt;We only really care about minimizing this cost, so we can throw out the additional
constant $+1$ and the factor $1&#x2F;(n+1)$ as it does not change the relative order:&lt;&#x2F;p&gt;
&lt;p&gt;$$C’ = 2^l\cdot l + (n + 1 - 2^l)\cdot r$$&lt;&#x2F;p&gt;
&lt;h4 id=&quot;optimizing-r&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#optimizing-r&quot; aria-label=&quot;Anchor link for: optimizing-r&quot;&gt;Optimizing $r$&lt;&#x2F;a&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Compare cost $C’$ to the expression $2^l \cdot l + 2^{r}\cdot r$. In this case
the expression is entirely symmetrical, so we could freely swap $l$ and $r$. But
we know from our earlier array size inequality that $n + 1 - 2^l \leq 2^{r}$.
Thus we can conclude that $l$ has a greater weight in the cost than $r$ and
therefore we can safely assume that $l \leq r$ in a minimal-cost solution.&lt;&#x2F;p&gt;
&lt;p&gt;From this plus the fact that $2^l + 2^{r} - 1 \geq n$
we can immediately deduce that $2^{r} \geq (n + 1) &#x2F; 2$ by weakening the
inequality with $2^l \to 2^r$, and thus rounding up to the nearest integer gives
\begin{align*}
r &amp;amp;= \lceil\log_2(n+1) - 1\rceil = \lfloor\log_2(n)\rfloor
\end{align*}&lt;&#x2F;p&gt;
&lt;p&gt;Note that we can’t choose $r$ any larger, nor smaller, and thus we’ve
determined the optimal value for $r$.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;What’s not shown here is the exploratory process to get to this proof
of optimality.
I tend to write &lt;a href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;orlp&#x2F;26efd8aa77a48ad32c884d9978e7bf24&quot;&gt;small scripts&lt;&#x2F;a&gt;
to brute-force some optimal data. I often plug these
brute-forced values into the &lt;a href=&quot;https:&#x2F;&#x2F;oeis.org&#x2F;&quot;&gt;OEIS&lt;&#x2F;a&gt; to find references
and patterns.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h4 id=&quot;optimizing-l&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#optimizing-l&quot; aria-label=&quot;Anchor link for: optimizing-l&quot;&gt;Optimizing $l$&lt;&#x2F;a&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Let’s reorder our relative cost $C’$ a bit:&lt;&#x2F;p&gt;
&lt;p&gt;$$C’ = 2^l\cdot (l - r) + (n + 1)\cdot r$$&lt;&#x2F;p&gt;
&lt;p&gt;We can ignore the second term as a constant, as we’re now trying to optimize $l$
given the optimal $r$. The function&lt;&#x2F;p&gt;
&lt;p&gt;$$f(l) = 2^l(l - r)$$
has derivative w.r.t. $l$
$$f’(l) = 2^l(\ln(2)(l - r) + 1)$$
with a single zero corresponding to the global minimum at $r - \frac{1}{\ln(2)} \approx r - 1.4427$.
Let’s plug in the two integers closest to this minimum in $f$:&lt;&#x2F;p&gt;
&lt;p&gt;$$f(r - 1) = 2^{r - 1}(r - 1 - r) = - 2^{r-1}$$
$$f(r - 2) = 2^{r - 2}(r - 2 - r) = - 2^{r-1}$$&lt;&#x2F;p&gt;
&lt;p&gt;Thus we find that both $l = r - 1$ and $l = r - 2$ have optimal cost. For
simplicity we can just limit ourselves to $l = r - 1$, as it has equal cost but
makes $2^l + 2^r - 1 \geq n$ easier to satisfy. Speaking of that inequality,
we can’t always choose $l = r - 1$ as we are sometimes forced to choose $l = r$
by it.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;putting-it-together&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#putting-it-together&quot; aria-label=&quot;Anchor link for: putting-it-together&quot;&gt;Putting it together&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;We found that $r = \lfloor \log_2(n) \rfloor$, and that&lt;&#x2F;p&gt;
&lt;p&gt;$$l = \begin{cases}
r-1&amp;amp;\text{if }2^r + 2^{r-1} - 1 \geq n\\
r&amp;amp;\text{otherwise}
\end{cases}.$$&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Generating the optimal $l$ using the formula we found and plugging the
numbers we get into the OEIS we find &lt;a href=&quot;https:&#x2F;&#x2F;oeis.org&#x2F;A099396&quot;&gt;A099396&lt;&#x2F;a&gt;(n)
$= \left\lfloor \log_2\left(2n&#x2F;3 \right) \right\rfloor.$&lt;&#x2F;p&gt;
&lt;p&gt;This makes sense as the optimal $l$ is incremented every time we have a
size of the form $n = 2^k + 2^{k-1} = 2^k(1 + 1&#x2F;2)$ and thus $2^k = 2n&#x2F;3$.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
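&lt;p&gt;The derivation can also be double-checked by brute force. The following Python sketch is not the script linked earlier, just a minimal illustration with made-up helper names (&lt;code&gt;cost&lt;&#x2F;code&gt;, &lt;code&gt;valid&lt;&#x2F;code&gt;, &lt;code&gt;formula&lt;&#x2F;code&gt;, &lt;code&gt;brute_force_cost&lt;&#x2F;code&gt;). It enumerates every pair $(l, r)$ that satisfies the three inequalities, then checks that the formula for $l, r$ reaches the minimum relative cost and agrees with the closed form $\lfloor \log_2(2n&#x2F;3) \rfloor$ for $n \geq 2$:&lt;&#x2F;p&gt;

```python
def cost(n, l, r):
    """Relative cost: 2^l * l + (n + 1 - 2^l) * r."""
    return 2**l * l + (n + 1 - 2**l) * r


def valid(n, l, r):
    """The two subarrays cover the array and both fit inside it."""
    return 2**l + 2**r - 1 >= n and n >= 2**l - 1 and n >= 2**r - 1


def formula(n):
    """(l, r) from the derivation above, for n >= 1."""
    r = n.bit_length() - 1  # floor(log2(n))
    if r >= 1 and 2**r + 2 ** (r - 1) - 1 >= n:
        return r - 1, r
    return r, r


def brute_force_cost(n):
    """Minimum relative cost over every valid pair (l, r)."""
    top = n.bit_length() + 1
    return min(
        cost(n, l, r)
        for l in range(top)
        for r in range(top)
        if valid(n, l, r)
    )


for n in range(1, 513):
    l, r = formula(n)
    assert valid(n, l, r)
    assert cost(n, l, r) == brute_force_cost(n)
    if n >= 2:
        # Closed form floor(log2(2n / 3)), evaluated in integers.
        assert l == (2 * n // 3).bit_length() - 1
```

&lt;p&gt;Ties do occur (for example $l = r - 2$ when the covering inequality allows it), which is why the check compares costs rather than the pairs themselves.&lt;&#x2F;p&gt;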
&lt;p&gt;The condition $2^r + 2^{r-1} - 1 \geq n$ can be seen to be equivalent to “bit
$r-1$ of $n$ is not set”. And as $2^r - 2^{r-1} = 2^{r-1}$ we can isolate
that bit, negate it, and subtract it from &lt;code&gt;two_r&lt;&#x2F;code&gt; to get our &lt;code&gt;two_l&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; It, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; T, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; Cmp&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;It lower_bound_opt(It begin, It end, const T&amp;amp; value, Cmp comp) {
&lt;&#x2F;span&gt;&lt;span&gt;    size_t n = end - begin;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(n == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t two_r = std::bit_floor(n);
&lt;&#x2F;span&gt;&lt;span&gt;    size_t two_l = two_r - ((two_r &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &amp;amp; ~n);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;bool&lt;&#x2F;span&gt;&lt;span&gt; use_r = comp(begin[two_l - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], value);
&lt;&#x2F;span&gt;&lt;span&gt;    size_t two_k = use_r ? two_r : two_l;
&lt;&#x2F;span&gt;&lt;span&gt;    begin = use_r ? end - (two_r - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) : begin;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = two_k &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[(b | bit) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], value)) b |= bit;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + b;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The somewhat odd use of ternary expressions and &lt;code&gt;use_r&lt;&#x2F;code&gt; is to convince the
compiler to generate branchless code. We certainly lost some of the elegance we
had before, but at least now we do the minimal number of comparisons we can do
with our bitwise binary search while being branchless. And it is in fact better
than L. E. Shar’s original method, whose initial comparison $A[i - 1] &amp;lt; x$
uses $i = 2^{\left\lfloor \log_2 (n) \right\rfloor}$, which is suboptimal as we’ve
seen.&lt;&#x2F;p&gt;
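&lt;p&gt;To convince ourselves that the index arithmetic is right, here is a rough Python port of &lt;code&gt;lower_bound_opt&lt;&#x2F;code&gt;, checked against the standard library’s &lt;code&gt;bisect_left&lt;&#x2F;code&gt;. It is only a sketch: it works on list indices instead of iterators, hardcodes the comparison to “less than”, reads bit $r-1$ of $n$ with a division rather than the C++ mask, and says nothing about whether the generated code is branchless:&lt;&#x2F;p&gt;

```python
from bisect import bisect_left


def lower_bound_opt(v, value):
    """Index of the first element of sorted list v that is >= value."""
    n = len(v)
    if n == 0:
        return 0

    two_r = 2 ** (n.bit_length() - 1)  # bit_floor(n)
    half = two_r >> 1
    # Bit r - 1 of n, read arithmetically; the C++ version uses a mask.
    bit_is_set = half == 0 or (n // half) % 2 == 1
    two_l = two_r if bit_is_set else two_r - half

    use_r = value > v[two_l - 1]
    two_k = two_r if use_r else two_l
    begin = n - (two_r - 1) if use_r else 0

    b = 0
    bit = two_k >> 1
    while bit != 0:
        if value > v[begin + (b | bit) - 1]:
            b |= bit
        bit >>= 1
    return begin + b


# Compare against the standard library on every size up to 64.
for n in range(65):
    v = [2 * j for j in range(n)]
    for x in range(-1, 2 * n + 1):
        assert lower_bound_opt(v, x) == bisect_left(v, x)
```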
&lt;h2 id=&quot;micro-optimizations&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#micro-optimizations&quot; aria-label=&quot;Anchor link for: micro-optimizations&quot;&gt;Micro-optimizations&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;For some reason the standard implementation of
&lt;a href=&quot;https:&#x2F;&#x2F;en.cppreference.com&#x2F;w&#x2F;cpp&#x2F;numeric&#x2F;bit_floor&quot;&gt;&lt;code&gt;std::bit_floor&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;… sucks
a bit. E.g. on &lt;a href=&quot;https:&#x2F;&#x2F;gcc.godbolt.org&#x2F;z&#x2F;4dsK1Tanj&quot;&gt;x86-64 Clang 16.0&lt;&#x2F;a&gt; with
&lt;code&gt;-O2&lt;&#x2F;code&gt; we compile this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;size_t bit_floor(size_t n) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(n == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;std::bit_floor(n);
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;size_t bit_floor_manual(size_t n) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(n == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;size_t(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &amp;lt;&amp;lt; (std::bit_width(n) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;to this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;bit_floor(unsigned long):
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;test    &lt;&#x2F;span&gt;&lt;span&gt;rdi, rdi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;je      &lt;&#x2F;span&gt;&lt;span&gt;.LBB0_1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shr     &lt;&#x2F;span&gt;&lt;span&gt;rdi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;je      &lt;&#x2F;span&gt;&lt;span&gt;.LBB0_3
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;bsr     &lt;&#x2F;span&gt;&lt;span&gt;rcx, rdi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xor     &lt;&#x2F;span&gt;&lt;span&gt;rcx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;63
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;jmp     &lt;&#x2F;span&gt;&lt;span&gt;.LBB0_5
&lt;&#x2F;span&gt;&lt;span&gt;.LBB0_1:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;xor     &lt;&#x2F;span&gt;&lt;span&gt;eax, eax
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;span&gt;.LBB0_3:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;ecx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;64
&lt;&#x2F;span&gt;&lt;span&gt;.LBB0_5:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;neg     &lt;&#x2F;span&gt;&lt;span&gt;cl
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;eax, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shl     &lt;&#x2F;span&gt;&lt;span&gt;rax, cl
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;bit_floor_manual(unsigned long):
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;bsr     &lt;&#x2F;span&gt;&lt;span&gt;rcx, rdi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;eax, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shl     &lt;&#x2F;span&gt;&lt;span&gt;rax, cl
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;test    &lt;&#x2F;span&gt;&lt;span&gt;rdi, rdi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmove   &lt;&#x2F;span&gt;&lt;span&gt;rax, rdi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Yikes. Manual computation it is!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimizing-the-tight-loop&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#optimizing-the-tight-loop&quot; aria-label=&quot;Anchor link for: optimizing-the-tight-loop&quot;&gt;Optimizing the tight loop&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The astute observer might have noticed that in the following loop, we only
ever set each bit in &lt;code&gt;b&lt;&#x2F;code&gt; at most once:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;size_t b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = two_k &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[(b | bit) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], value)) b |= bit;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + b;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This means we could change the binary OR to simple addition, which might
optimize better in pointer calculations.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Various architectures have specialized instructions aimed at pointer arithmetic,
like x86’s LEA instruction. Using bitwise instructions might prevent the compiler
from using these.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;For the bitwise version we see the following tight loop for the above,
in &lt;code&gt;x86-64&lt;&#x2F;code&gt; with GCC:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;.L7:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;rsi, rdx
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;or      &lt;&#x2F;span&gt;&lt;span&gt;rsi, rcx
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmp     &lt;&#x2F;span&gt;&lt;span&gt;DWORD PTR [rax-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;+rsi*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;], r8d
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmovb   &lt;&#x2F;span&gt;&lt;span&gt;rcx, rsi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shr     &lt;&#x2F;span&gt;&lt;span&gt;rdx
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;jne     &lt;&#x2F;span&gt;&lt;span&gt;.L7
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With addition we see the following:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;.L7:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;lea     &lt;&#x2F;span&gt;&lt;span&gt;rsi, [rdx+rcx]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmp     &lt;&#x2F;span&gt;&lt;span&gt;DWORD PTR [rax-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;+rsi*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;], r8d
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;cmovb   &lt;&#x2F;span&gt;&lt;span&gt;rcx, rsi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shr     &lt;&#x2F;span&gt;&lt;span&gt;rdx
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;jne     &lt;&#x2F;span&gt;&lt;span&gt;.L7
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In fact, when using addition we could eliminate variable $b$ entirely,
and directly add to &lt;code&gt;begin&lt;&#x2F;code&gt; (similar to Skarupke’s original version that sparked
this article):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = two_k &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[bit - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], value)) begin += bit;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, I’ve found that some compilers, e.g. GCC on x86-64, will refuse to make
this variant branchless. I hate how fickle compilers can be sometimes, and I
wish compilers exposed not just the
&lt;a href=&quot;https:&#x2F;&#x2F;en.cppreference.com&#x2F;w&#x2F;cpp&#x2F;language&#x2F;attributes&#x2F;likely&quot;&gt;&lt;code&gt;likely&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;unlikely&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
attributes, but also an attribute that allows you to mark something as &lt;code&gt;unpredictable&lt;&#x2F;code&gt;
to nudge the compiler towards using branchless techniques like CMOVs.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of eliminating &lt;code&gt;b&lt;&#x2F;code&gt;, we can optimize the loop to only do a single addition
explicitly, by moving the &lt;code&gt;-1&lt;&#x2F;code&gt; into the value of &lt;code&gt;b&lt;&#x2F;code&gt; itself:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;size_t b = -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = two_k &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[b + bit], value)) b += bit;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + (b + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Yay for two’s complement and integer overflow! This generated the best code on
all platforms I’ve looked at, so I applied this pattern to all my
implementations in the benchmark.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;You might wonder why we don’t simply decrement &lt;code&gt;begin&lt;&#x2F;code&gt; instead of changing &lt;code&gt;b&lt;&#x2F;code&gt;.
Unfortunately, the C++ standard discriminates against one-before-begin pointers.
One-after-end of array pointers are allowed, but creating a pointer before the
beginning of an allocation is &lt;a href=&quot;https:&#x2F;&#x2F;eel.is&#x2F;c++draft&#x2F;expr.add#4&quot;&gt;undefined
behavior&lt;&#x2F;a&gt;, even if you never dereference
that pointer.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;results&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#results&quot; aria-label=&quot;Anchor link for: results&quot;&gt;Results&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Let’s compare all the variants we’ve made, both in comparisons and actual
runtime. The latter I will test on my 2021 Apple M1 MacBook Pro, which is an ARM
machine. Your mileage &lt;strong&gt;will&lt;&#x2F;strong&gt; vary on different machines, especially x86-64
machines, but I want this article to focus more on the algorithmic side of
things rather than become an extensive study on the exact characteristics of
branch mispredictions, cache misses, and how to get the compiler to generate
branchless code for a variety of platforms.&lt;&#x2F;p&gt;
&lt;p&gt;The code for the benchmark below is available on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;orlp&#x2F;bitwise-binary-search&#x2F;&quot;&gt;my GitHub&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;comparisons&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#comparisons&quot; aria-label=&quot;Anchor link for: comparisons&quot;&gt;Comparisons&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;To test the average number of comparisons for size $n$ we can simply query for
each of the $n + 1$ outcomes how many comparisons it takes to get that outcome.
We then average over all these for a given $n$. The result is the following
graph:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;comparisons.svg&quot; alt=&quot;Comparison count graph.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We see that &lt;code&gt;lower_bound_opt&lt;&#x2F;code&gt; does in fact do the fewest comparisons of
all the branchless methods, following the optimal &lt;code&gt;lower_bound_std&lt;&#x2F;code&gt;
more closely than &lt;code&gt;lower_bound_pad&lt;&#x2F;code&gt; or &lt;code&gt;lower_bound_skarupke&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Across all sizes less than 256 we see the following average comparison counts,
minus the optimal comparison count:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Algorithm&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: left&quot;&gt;Comparisons above optimal&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;lower_bound_skarupke&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;1.17835&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;lower_bound_overlap&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;0.37250&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;lower_bound_pad&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;0.17668&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;lower_bound_opt&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;0.17238&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;lower_bound_std&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;0.00000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;All our hard work finding the optimal split into subarrays only saved us ~0.2
comparisons on average on &lt;code&gt;lower_bound_opt&lt;&#x2F;code&gt; versus the much simpler
&lt;code&gt;lower_bound_overlap&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
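&lt;p&gt;The counting procedure described at the start of this section can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the benchmark code: &lt;code&gt;Counted&lt;&#x2F;code&gt; and &lt;code&gt;average_comparisons&lt;&#x2F;code&gt; are hypothetical names, and Python’s &lt;code&gt;bisect_left&lt;&#x2F;code&gt; stands in for a classic lower bound routine.&lt;&#x2F;p&gt;

```python
import operator
from bisect import bisect_left


class Counted:
    """Wraps a value and counts every __lt__ comparison made against it."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return operator.lt(self.value, other)


def average_comparisons(search, n):
    # Sorted array 1, 3, 5, ...: querying the even number 2 * r must give
    # outcome r, so the loop below visits each of the n + 1 outcomes once.
    arr = [Counted(2 * i + 1) for i in range(n)]
    Counted.comparisons = 0
    for r in range(n + 1):
        assert search(arr, 2 * r) == r
    return Counted.comparisons / (n + 1)


print(average_comparisons(bisect_left, 7))  # 3.0
```

&lt;p&gt;Any search function that takes a sorted list and a value and only compares list elements against the value can be passed as &lt;code&gt;search&lt;&#x2F;code&gt;. For $n = 7$ the classic halving search needs exactly three comparisons for every outcome, hence the average of 3.&lt;&#x2F;p&gt;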
&lt;h3 id=&quot;runtime-32-bit-integers&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#runtime-32-bit-integers&quot; aria-label=&quot;Anchor link for: runtime-32-bit-integers&quot;&gt;Runtime (32-bit integers)&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;To benchmark runtime for a certain size $n$ I pre-generated one million random
integers in the range $[0, n + 1]$. Then I record the time it takes to look them
all up using our lower bound routine of interest, and calculate the average.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;One may argue that querying the same $n$-size array a million times in a row
is unrealistic and that in the real world the array size might change. While
in some cases this is true, I believe that most of the time when you really
care about binary search performance, you are looking up values against a
(mostly) static array. If not, performance is dominated by the effort required
to keep the array sorted.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;Using clang 13.0.0 with &lt;code&gt;g++ -O2 -std=c++20&lt;&#x2F;code&gt; we get the following:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;runtime.svg&quot; alt=&quot;Performance graph.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I think this graph gives a fascinating insight into the branch predictor on the
Apple M1. Most striking is the relatively poor performance of &lt;code&gt;lower_bound_opt&lt;&#x2F;code&gt;.
Within each bracket of sizes $[2^k, 2^{k+1})$ it performs much worse
than &lt;code&gt;lower_bound_overlap&lt;&#x2F;code&gt;, with a size-dependent slope, before suddenly dropping to a
consistently good performance.&lt;&#x2F;p&gt;
&lt;p&gt;This puzzled me for a while, and I triple-checked to see that &lt;code&gt;lower_bound_opt&lt;&#x2F;code&gt;
was really being compiled with branchless instructions. Only then did I realize
there was a hidden branch all along: the loop exit condition.
&lt;code&gt;lower_bound_overlap&lt;&#x2F;code&gt; always performs the same number of loop iterations,
allowing the CPU to always correctly predict the loop exit, whereas
&lt;code&gt;lower_bound_opt&lt;&#x2F;code&gt; tries to reduce the number of iterations it does to save
comparisons. It turns out that for integers the cost of an extra iteration is
much lower than risking a mispredict on the loop condition on the Apple M1.&lt;&#x2F;p&gt;
&lt;p&gt;If we also look at larger inputs we see that the above pattern keeps up
for quite a while until we start hitting sizes where cache effects become
a factor:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;runtime-large.svg&quot; alt=&quot;Larger performance graph.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We also note that it truly is important for a binary search benchmark to look at
a variety of sizes, as you might reach rather different conclusions in
performance at $n = 2^{12}$ versus $n = 2^{12} \cdot 1.5$.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;If you are interested in even larger sizes, I would strongly recommend
looking at &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1509.05053&quot;&gt;cache-friendly layouts for binary
search&lt;&#x2F;a&gt; instead of a simple sorted array, as
cache misses are much worse still than the branch mispredictions we’ve been
focusing on so far.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;A commonly heard piece of advice is to not use binary search for small arrays, but to
use a linear search instead. I find that not to be true on the Apple M1 for integers,
at least compared to my branchless binary search, when searching a runtime-sized
but otherwise fixed-size array:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;runtime-small.svg&quot; alt=&quot;Smaller performance graph with linear search.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Note that a linear search must always incur at least one branch misprediction:
on the loop exit condition. For a fixed size array &lt;code&gt;lower_bound_overlap&lt;&#x2F;code&gt; has
zero branch mispredictions, including the loop exit.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;runtime-strings&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#runtime-strings&quot; aria-label=&quot;Anchor link for: runtime-strings&quot;&gt;Runtime (strings)&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;To benchmark the performance on strings I copied the above benchmark, except
that I convert all integers to strings, zero-padded to a length of four.
I also reduced the number of samples to 300,000 per size, as the string
benchmark was significantly slower.&lt;&#x2F;p&gt;
&lt;p&gt;Using clang 13.0.0 with &lt;code&gt;g++ -O2 -std=c++20&lt;&#x2F;code&gt; we get the following:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;runtime-str-large.svg&quot; alt=&quot;Large performance graph for strings.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Strings are a lot less interesting than integers in this case, as most of the
branchless optimizations are moot. We find that initially the branchless
versions are only slightly slower than &lt;code&gt;std::lower_bound&lt;&#x2F;code&gt; due to the extra
comparisons needed. However once we get to the larger-than-cache sizes
&lt;code&gt;std::lower_bound&lt;&#x2F;code&gt; becomes significantly better as it can do speculative loads
to reduce cache misses.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;bitwise-binary-search&#x2F;runtime-str-small.svg&quot; alt=&quot;Smaller performance graph with linear search for strings.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It seems that for strings the advice to use linear searches for small input
arrays doesn’t help that much, but doesn’t hurt either for $n \leq 9$,
on the Apple M1.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;In my opinion the bitwise binary search is an elegant alternative to the
traditional binary search method, at the cost of ~0.17 to ~0.37 extra comparisons
on average. It can be implemented in a branchless manner, which can be
significantly faster when searching elements with a branchless comparison
operator.&lt;&#x2F;p&gt;
&lt;p&gt;In this article we found the following implementation to perform the best
on Apple M1 after all micro-optimizations are applied:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; It, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; T, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; Cmp&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;It lower_bound(It begin, It end, const T&amp;amp; value, Cmp comp) {
&lt;&#x2F;span&gt;&lt;span&gt;    size_t n = end - begin;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(n == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t two_k = size_t(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &amp;lt;&amp;lt; (std::bit_width(n) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;    size_t b = comp(begin[n &#x2F; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;], value) ? n - two_k : -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = two_k &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(comp(begin[b + bit], value)) b += bit;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + (b + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, when it comes to clarity and elegance I still find the following
method the most beautiful:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;template&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; It, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; T, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;typename&lt;&#x2F;span&gt;&lt;span&gt; Cmp&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;It lower_bound(It begin, It end, const T&amp;amp; value, Cmp comp) {
&lt;&#x2F;span&gt;&lt;span&gt;    size_t n = end - begin;
&lt;&#x2F;span&gt;&lt;span&gt;    size_t b = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(size_t bit = std::bit_floor(n); bit != &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; bit &amp;gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) {
&lt;&#x2F;span&gt;&lt;span&gt;        size_t i = (b | bit) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(i &amp;lt; n &amp;amp;&amp;amp; comp(begin[i], value)) b |= bit;
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; begin + b;
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
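&lt;p&gt;If you want to play with this last version without a C++ compiler, the following is a rough Python transliteration. It is a sketch for experimentation only: the name &lt;code&gt;bitwise_lower_bound&lt;&#x2F;code&gt; is my own choice here, &lt;code&gt;operator.lt&lt;&#x2F;code&gt; plays the role of &lt;code&gt;comp&lt;&#x2F;code&gt;, and nothing about its performance carries over from the benchmarks above.&lt;&#x2F;p&gt;

```python
import operator
from bisect import bisect_left


def bitwise_lower_bound(arr, value, comp=operator.lt):
    """Index of the first element for which comp(element, value) is false."""
    n = len(arr)
    b = 0
    # Largest power of two not exceeding n, the same as std::bit_floor(n).
    bit = 2 ** (n.bit_length() - 1) if n else 0
    while bit != 0:
        i = (b | bit) - 1
        # Only probe positions inside the array; set the bit when arr[i]
        # still sorts before value.
        if i in range(n) and comp(arr[i], value):
            b |= bit
        bit //= 2
    return b


data = [1, 3, 5, 7, 9]
print(bitwise_lower_bound(data, 6), bisect_left(data, 6))  # 3 3
```

&lt;p&gt;Comparing against &lt;code&gt;bisect_left&lt;&#x2F;code&gt; for many sizes and query values is a cheap way to convince yourself that the bit-by-bit construction of the index is correct.&lt;&#x2F;p&gt;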
</content>
	</entry>
	<entry xml:lang="en">
		<title>The World&#x27;s Smallest Hash Table</title>
		<author>Orson R. L. Peters</author>
		<published>2023-03-04T00:00:00+00:00</published>
		<updated>2023-03-05T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/worlds-smallest-hash-table/" type="text/html"/>
		<id>https://orlp.net/blog/worlds-smallest-hash-table/</id>
		<content type="html">&lt;p&gt;This December I once again did the &lt;a href=&quot;https:&#x2F;&#x2F;adventofcode.com&#x2F;&quot;&gt;Advent of Code&lt;&#x2F;a&gt;,
in Rust. If you are interested, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;orlp&#x2F;aoc2022&#x2F;&quot;&gt;my solutions&lt;&#x2F;a&gt;
are on GitHub. I wanted to highlight one particular solution to the &lt;a href=&quot;https:&#x2F;&#x2F;adventofcode.com&#x2F;2022&#x2F;day&#x2F;2&quot;&gt;day 2
problem&lt;&#x2F;a&gt; as it is both optimized completely
beyond the point of reason yet contains a useful technique. For simplicity we’re
only going to do part 1 of the day 2 problem here, but the exact same techniques
apply to part 2.&lt;&#x2F;p&gt;
&lt;p&gt;We’re going to start off slow, but stick around because at the end you should
have an idea what on earth this function is doing, how it works, how to make one
and why it’s the world’s smallest hash table:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;phf_shift(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; shift = x.wrapping_mul(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa463293e&lt;&#x2F;span&gt;&lt;span&gt;) &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    ((&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x824a1847&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32 &lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt; shift) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0b11111&lt;&#x2F;span&gt;&lt;span&gt;) as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#the-problem&quot; aria-label=&quot;Anchor link for: the-problem&quot;&gt;The problem&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;We receive a file where each line contains &lt;code&gt;A&lt;&#x2F;code&gt;, &lt;code&gt;B&lt;&#x2F;code&gt;, or &lt;code&gt;C&lt;&#x2F;code&gt;, followed by
a space, followed by &lt;code&gt;X&lt;&#x2F;code&gt;, &lt;code&gt;Y&lt;&#x2F;code&gt;, or &lt;code&gt;Z&lt;&#x2F;code&gt;. These are to be understood as choices in
a game of &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Rock_paper_scissors&quot;&gt;rock-paper-scissors&lt;&#x2F;a&gt; as such:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;A = X = Rock
&lt;&#x2F;span&gt;&lt;span&gt;B = Y = Paper
&lt;&#x2F;span&gt;&lt;span&gt;C = Z = Scissors
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first letter (&lt;code&gt;A&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;B&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;C&lt;&#x2F;code&gt;) indicates the choice of our opponent, the second
letter (&lt;code&gt;X&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;Y&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;Z&lt;&#x2F;code&gt;) indicates our choice. We then compute a score, which has two
components:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If we picked Rock we get 1 point, if we picked Paper we get 2 points, and 3 points if we picked Scissors.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;If we lose we gain 0 points, if we draw we gain 3 points, if we win we get 6 points.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;As an example, if our input file looks as such:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;A Y
&lt;&#x2F;span&gt;&lt;span&gt;B X
&lt;&#x2F;span&gt;&lt;span&gt;C Z
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our total score would be &lt;code&gt;(2 + 6) + (1 + 0) + (3 + 3) = 15&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;an-elegant-solution&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#an-elegant-solution&quot; aria-label=&quot;Anchor link for: an-elegant-solution&quot;&gt;An elegant solution&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;A sane solution would verify that indeed our input lines have the format &lt;code&gt;[ABC] [XYZ]&lt;&#x2F;code&gt;, before extracting those two letters. After converting these letters to
integers &lt;code&gt;0&lt;&#x2F;code&gt;, &lt;code&gt;1&lt;&#x2F;code&gt;, &lt;code&gt;2&lt;&#x2F;code&gt; by subtracting either the ASCII code for &lt;code&gt;&#x27;A&#x27;&lt;&#x2F;code&gt; or &lt;code&gt;&#x27;X&#x27;&lt;&#x2F;code&gt;
respectively we can immediately calculate the first component of our score as &lt;code&gt;1 + ours&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The second component is more involved, but can be elegantly solved using &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Modular_arithmetic&quot;&gt;modular arithmetic&lt;&#x2F;a&gt;.
Note that if Rock = 0, Paper = 1, Scissors = 2 then we always have that choice ${k + 1 \bmod 3}$
beats $k$. Alternatively, $k$ beats ${k - 1}$, modulo 3:&lt;&#x2F;p&gt;
&lt;div class=&quot;small&quot;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;worlds-smallest-hash-table&#x2F;rock-paper-scissors-mod3.png&quot; alt=&quot;Diagram showing modulo 3 arithmetic.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;If we divide the number of points that Advent of Code expects for a loss&#x2F;draw&#x2F;win by
three we find that a loss is $0$, a draw is $1$ and a win is $2$ points. From
these observations we can derive the following modular equivalence&lt;&#x2F;p&gt;
&lt;p&gt;$$1 + \mathrm{ours} - \mathrm{theirs} \equiv \mathrm{points} \pmod 3.$$&lt;&#x2F;p&gt;
&lt;p&gt;To see that it is correct, note that if we drew, &lt;code&gt;ours - theirs&lt;&#x2F;code&gt; is zero and we
correctly get one point. If we add one to &lt;code&gt;ours&lt;&#x2F;code&gt; we change from a draw to a
win, and &lt;code&gt;points&lt;&#x2F;code&gt; becomes congruent with $2$ as desired. Symmetrically, if
we add one to &lt;code&gt;theirs&lt;&#x2F;code&gt; we change from a draw to a loss, and &lt;code&gt;points&lt;&#x2F;code&gt; once
again becomes congruent with $0$ as desired.&lt;&#x2F;p&gt;
&lt;p&gt;Translated into code we find that our total score is&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;+ ours + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3 &lt;&#x2F;span&gt;&lt;span&gt;* ((&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;+ ours + (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3 &lt;&#x2F;span&gt;&lt;span&gt;- theirs)) % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;aside&gt;
&lt;p&gt;Instead of &lt;code&gt;ours - theirs&lt;&#x2F;code&gt; we do &lt;code&gt;ours + (3 - theirs)&lt;&#x2F;code&gt; because Rust’s remainder
operator can unfortunately return negative remainders for positive divisors. One
could use
&lt;a href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;std&#x2F;primitive.i64.html#method.rem_euclid&quot;&gt;&lt;code&gt;rem_euclid&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
instead, but I feel bad for recommending it as that one is unfortunately defined
for negative divisors. I should write a blog post about this…&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
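&lt;p&gt;With only nine cases it is easy to check the closed form against the rules as stated. Below is a small Python sketch of such a check. The function names &lt;code&gt;score&lt;&#x2F;code&gt; and &lt;code&gt;score_by_rules&lt;&#x2F;code&gt; are made up for this illustration, and since Python’s &lt;code&gt;%&lt;&#x2F;code&gt; never returns a negative remainder for a positive divisor, the &lt;code&gt;3 - theirs&lt;&#x2F;code&gt; workaround is not needed here.&lt;&#x2F;p&gt;

```python
def score(theirs, ours):
    # theirs and ours are 0 (Rock), 1 (Paper) or 2 (Scissors).
    outcome = (1 + ours - theirs) % 3  # 0 = loss, 1 = draw, 2 = win
    return 1 + ours + 3 * outcome


def score_by_rules(theirs, ours):
    # The same score, spelled out directly from the two scoring rules.
    shape_points = ours + 1
    beats = {0: 2, 1: 0, 2: 1}  # Rock beats Scissors, Paper beats Rock, ...
    if ours == theirs:
        return shape_points + 3
    if beats[ours] == theirs:
        return shape_points + 6
    return shape_points


pairs = [(t, o) for t in range(3) for o in range(3)]
assert all(score(t, o) == score_by_rules(t, o) for t, o in pairs)

# The three-line example input: A Y, B X, C Z.
print(score(0, 1) + score(1, 0) + score(2, 2))  # 15
```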
&lt;h2 id=&quot;a-general-solution&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#a-general-solution&quot; aria-label=&quot;Anchor link for: a-general-solution&quot;&gt;A general solution&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;We found a neat closed form, but if we were even slightly less fortunate it might
not have existed. A more general method for solving similar problems would be nice.
In this particular instance that is possible. There are only $3 \times 3 = 9$
input pairs, so we can simply hardcode the answer for each situation:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; answers = HashMap::from([
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;A X&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;A Y&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;A Z&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;B X&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;B Y&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;B Z&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;C X&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;C Y&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;    (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;C Z&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt;]);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now we can simply get our answer using &lt;code&gt;answers[input]&lt;&#x2F;code&gt;. This might feel like a
bit of a non-answer, but it is a legitimate technique. We have a mapping
of inputs to outputs, and sometimes the simplest or fastest (in either
programmer time or execution time) solution is to write it out explicitly and
completely rather than compute the answer at runtime with an algorithm.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;perfect-hash-functions&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#perfect-hash-functions&quot; aria-label=&quot;Anchor link for: perfect-hash-functions&quot;&gt;Perfect hash functions&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The above solution works fine, but it pays a cost for its genericity. It uses
a full-fledged string hash algorithm, and lookups involve the full codepath for
hash table lookups (most notably hash collision resolution).&lt;&#x2F;p&gt;
&lt;p&gt;We can drop the genericity for a significant boost in speed if we were to use a
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Perfect_hash_function&quot;&gt;perfect hash function&lt;&#x2F;a&gt;. A
perfect hash function is a specially constructed hash function on some set $S$
of values such that each value in the set maps to a different hash output,
without collisions. It is important to note that we only care about its behavior
for inputs in the set $S$, with a complete disregard for other inputs.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;em&gt;minimal&lt;&#x2F;em&gt; perfect hash function is one that also maps the inputs to a dense
range of integers $[0, 1, \dots, |S|-1]$. This can be very useful because you
can then directly use the hash function output to index a lookup table. This
effectively creates a hash table that maps set $S$ to anything you want.
However, strict minimality is not necessary for this as long as you are okay
with wasting some of the space in your lookup table.&lt;&#x2F;p&gt;
&lt;p&gt;There are fully generic methods for constructing (minimal) perfect hash
functions, such as the &lt;a href=&quot;https:&#x2F;&#x2F;link.springer.com&#x2F;chapter&#x2F;10.1007&#x2F;978-3-642-04128-0_61&quot;&gt;&lt;em&gt;“Hash, displace and compress”&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; algorithm by Belazzougui
et al., which is implemented in the &lt;a href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;phf&quot;&gt;phf crate&lt;&#x2F;a&gt;.
However, they tend to use lookup tables to construct the hash itself. For small
inputs where speed and size are absolutely critical I’ve had good success &lt;strong&gt;just
trying stuff&lt;&#x2F;strong&gt;. This might sound vague—because it is—so let me walk you
through some examples.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;reading-the-input&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#reading-the-input&quot; aria-label=&quot;Anchor link for: reading-the-input&quot;&gt;Reading the input&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;aside&gt;
&lt;p&gt;This is where we leave the realm of reasonable solutions for the sake of
education and fun. For simplicity we’re not going to handle things such as
Windows-style newlines (&lt;code&gt;\r\n&lt;&#x2F;code&gt;) or invalid inputs.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;As a bit of a hack we can note that each line of our input from the Advent of
Code consists of exactly four bytes. One letter for our opponent’s choice, a
space, our choice, and a newline byte. So we can simply read our input as a
&lt;code&gt;u32&lt;&#x2F;code&gt;, which simplifies the hash construction immensely instead of dealing with
strings.&lt;&#x2F;p&gt;
&lt;p&gt;For example, consulting the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;ASCII&quot;&gt;ASCII table&lt;&#x2F;a&gt;
we find that &lt;code&gt;A&lt;&#x2F;code&gt; has ASCII code &lt;code&gt;0x41&lt;&#x2F;code&gt;, space maps to &lt;code&gt;0x20&lt;&#x2F;code&gt;, &lt;code&gt;X&lt;&#x2F;code&gt; has code
&lt;code&gt;0x58&lt;&#x2F;code&gt; and the newline symbol has code &lt;code&gt;0x0a&lt;&#x2F;code&gt; so the input &lt;code&gt;&quot;A X\n&quot;&lt;&#x2F;code&gt; can also
simply be viewed as the integer &lt;code&gt;0x0a582041&lt;&#x2F;code&gt; if you are on a
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Endianness&quot;&gt;little-endian&lt;&#x2F;a&gt; machine. If you are
confused why &lt;code&gt;0x41&lt;&#x2F;code&gt; is in the last position remember that we humans write numbers with the
least significant digit on the right as a convention.&lt;&#x2F;p&gt;
&lt;p&gt;Note that on a big-endian machine the order of bytes in a &lt;code&gt;u32&lt;&#x2F;code&gt; is flipped, so
reading those four bytes into an integer would result in the value &lt;code&gt;0x4120580a&lt;&#x2F;code&gt;.
Calling &lt;code&gt;u32::from_le_bytes&lt;&#x2F;code&gt; converts four bytes assumed to be little-endian to
the native integer representation by swapping the bytes on a big-endian machine
and doing nothing on a little-endian machine. Almost all modern CPUs are
little-endian however, so it’s generally a good idea to write your code such
that the little-endian path is fast and the big-endian path involves a
conversion step, if a conversion step cannot be avoided.&lt;&#x2F;p&gt;
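&lt;p&gt;The byte-order reasoning above is easy to verify in Python with &lt;code&gt;int.from_bytes&lt;&#x2F;code&gt;. This is just a quick sketch; the helper name &lt;code&gt;line_to_u32&lt;&#x2F;code&gt; is invented for the illustration and is not part of the Rust solution.&lt;&#x2F;p&gt;

```python
def line_to_u32(line, byteorder="little"):
    # Interpret one four-byte input line, such as b"A X\n", as an unsigned
    # 32-bit integer in the given byte order.
    if len(line) != 4:
        raise ValueError("expected exactly four bytes")
    return int.from_bytes(line, byteorder)


le = line_to_u32(b"A X\n")
be = line_to_u32(b"A X\n", "big")
print(hex(le), hex(be))  # 0xa582041 0x4120580a

# All nine possible lines as little-endian integers, in the order
# A X, A Y, A Z, B X, ..., C Z.
inputs = [
    line_to_u32(f"{theirs} {ours}\n".encode("ascii"))
    for theirs in "ABC"
    for ours in "XYZ"
]
```

&lt;p&gt;Note that &lt;code&gt;hex&lt;&#x2F;code&gt; drops the leading zero, so &lt;code&gt;0x0a582041&lt;&#x2F;code&gt; is printed as &lt;code&gt;0xa582041&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;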
&lt;p&gt;Doing this for all inputs gives us the following desired integer →
integer mapping:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;Input       LE u32      Answer
&lt;&#x2F;span&gt;&lt;span&gt;-------------------------------
&lt;&#x2F;span&gt;&lt;span&gt; A X       0xa582041       4
&lt;&#x2F;span&gt;&lt;span&gt; A Y       0xa592041       8
&lt;&#x2F;span&gt;&lt;span&gt; A Z       0xa5a2041       3
&lt;&#x2F;span&gt;&lt;span&gt; B X       0xa582042       1
&lt;&#x2F;span&gt;&lt;span&gt; B Y       0xa592042       5
&lt;&#x2F;span&gt;&lt;span&gt; B Z       0xa5a2042       9
&lt;&#x2F;span&gt;&lt;span&gt; C X       0xa582043       7
&lt;&#x2F;span&gt;&lt;span&gt; C Y       0xa592043       2
&lt;&#x2F;span&gt;&lt;span&gt; C Z       0xa5a2043       6
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;example-constructions&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#example-constructions&quot; aria-label=&quot;Anchor link for: example-constructions&quot;&gt;Example constructions&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;When I said I just try stuff, I mean it. Let’s load our mapping into Python
and write a test:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;inputs =  [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa582041&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa592041&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa5a2041&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa582042&lt;&#x2F;span&gt;&lt;span&gt;,
&lt;&#x2F;span&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa592042&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa5a2042&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa582043&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa592043&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa5a2043&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;answers = [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;is_phf(h, inputs):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;len({h(x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs}) == len(inputs)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are nine inputs, so perhaps we get lucky and get a minimal perfect
hash function right away:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Alas, there are collisions. What if we don’t have to be absolutely minimal?&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; next(m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;...      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;is_phf(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: x % m, inputs))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;12
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;12 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That’s not too bad! Only three elements of wasted space. We can make our first
perfect hash table by placing the answers in the correct spots:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;make_lut(h, inputs, answers):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;assert &lt;&#x2F;span&gt;&lt;span&gt;is_phf(h, inputs)
&lt;&#x2F;span&gt;&lt;span&gt;    lut = [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;] * (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;+ max(h(x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs))
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(x, ans) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;zip(inputs, answers):
&lt;&#x2F;span&gt;&lt;span&gt;        lut[h(x)] = ans
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;lut
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; make_lut(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;, inputs, answers)
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Giving the simple mapping:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;LUT&lt;&#x2F;span&gt;&lt;span&gt;: [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;] = [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;answer(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;LUT&lt;&#x2F;span&gt;&lt;span&gt;[(x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;) as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;usize&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h4 id=&quot;compressing-the-table&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#compressing-the-table&quot; aria-label=&quot;Anchor link for: compressing-the-table&quot;&gt;Compressing the table&lt;&#x2F;a&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;We stopped here on the first modulus that works, which is honestly fine in this
case because only three bytes of wasted space is pretty good. But what if we
didn’t get so lucky? We have to keep looking. Even though
modulo $m$ has $[0, m)$ as its
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Codomain&quot;&gt;&lt;em&gt;codomain&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, when applied to our set of
inputs its &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Image_(mathematics)&quot;&gt;&lt;em&gt;image&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; might
span a smaller subset. Let’s inspect some:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [(m, max(x % m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;...  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;30&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;...  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;is_phf(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: x % m, inputs)]
&lt;&#x2F;span&gt;&lt;span&gt;[(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;12&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;18&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;21&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;17&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;22&lt;&#x2F;span&gt;&lt;span&gt;),
&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;23&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;26&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;21&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;25&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;28&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;19&lt;&#x2F;span&gt;&lt;span&gt;), (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;29&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;)]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Unfortunately, but also logically, the maximum index trends upwards as you
increase the modulus. But $13$ also seems promising, since its maximum index is still $11$. Let’s take a look:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; make_lut(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;, inputs, answers)
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Well, well, well, aren’t we lucky? The first three indices are unused, so we can
shift all the others back and get a minimal perfect hash function!&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Ironically this one would almost surely perform worse than the previous one
because Rust has to do a bounds check now whereas the previous version is
infallible, and it has an extra subtraction.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;LUT&lt;&#x2F;span&gt;&lt;span&gt;: [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;] = [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;answer(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;LUT&lt;&#x2F;span&gt;&lt;span&gt;[(x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13 &lt;&#x2F;span&gt;&lt;span&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;) as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;usize&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
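&lt;p&gt;The shifting step can also be automated. The following is a rough sketch, not a definitive tool: &lt;code&gt;make_shifted_lut&lt;&#x2F;code&gt; is a hypothetical helper that subtracts the smallest hash value so that unused leading slots disappear from the table.&lt;&#x2F;p&gt;

```python
inputs = [0xa582041, 0xa592041, 0xa5a2041, 0xa582042,
          0xa592042, 0xa5a2042, 0xa582043, 0xa592043, 0xa5a2043]
answers = [4, 8, 3, 1, 5, 9, 7, 2, 6]

def make_shifted_lut(m, inputs, answers):
    """Build a table for h(x) = x % m - offset, dropping unused leading slots."""
    hashes = [x % m for x in inputs]
    assert len(set(hashes)) == len(inputs), "x % m is not a perfect hash"
    offset = min(hashes)
    lut = [0] * (max(hashes) - offset + 1)
    for h, ans in zip(hashes, answers):
        lut[h - offset] = ans
    return offset, lut

offset, lut = make_shifted_lut(13, inputs, answers)
print(offset, lut)

# Every input must map back to its answer through the shifted table.
assert all(lut[x % 13 - offset] == ans for x, ans in zip(inputs, answers))
```

&lt;p&gt;For modulus 13 this reproduces the offset and the nine-entry table used in the Rust version above.&lt;&#x2F;p&gt;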
&lt;p&gt;In my experience with creating a bunch of similar mappings in the past, &lt;strong&gt;you’d
be surprised to see how often you get lucky&lt;&#x2F;strong&gt;, as long as your mapping isn’t too
large. As you add more ‘things to try’ to your toolbox, you also have more
opportunities of getting lucky.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;fixing-near-misses&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#fixing-near-misses&quot; aria-label=&quot;Anchor link for: fixing-near-misses&quot;&gt;Fixing near-misses&lt;&#x2F;a&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Another thing to try is fixing near-misses. For example, let’s take another look
at our original naive attempt:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Only the last two inputs give a collision. So a rather naive but possible way
to resolve these collisions is to move those to a different index:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9 &lt;&#x2F;span&gt;&lt;span&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;*(x == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa592043&lt;&#x2F;span&gt;&lt;span&gt;) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;*(x == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa5a2043&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Oh look, we got slightly lucky again: both are using the constant 3, which can be
factored out. It can be quite addictive to try out various permutations of
operations and tweaks to find these (minimal) perfect hash functions using as
few operations as possible.&lt;&#x2F;p&gt;
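&lt;p&gt;To make the factoring concrete, here is a small sketch of the patched hash with the constant 3 pulled out, together with the table it produces. The function name &lt;code&gt;h&lt;&#x2F;code&gt; and the variable &lt;code&gt;indices&lt;&#x2F;code&gt; are illustrative choices, and the code relies on Python treating booleans as 0 and 1 in arithmetic.&lt;&#x2F;p&gt;

```python
inputs = [0xa582041, 0xa592041, 0xa5a2041, 0xa582042,
          0xa592042, 0xa5a2042, 0xa582043, 0xa592043, 0xa5a2043]
answers = [4, 8, 3, 1, 5, 9, 7, 2, 6]

def h(x):
    # Plain x % 9, except the two colliding inputs are nudged by +3 and -3.
    # The shared constant 3 is factored out of both corrections.
    return x % 9 + 3 * ((x == 0xa592043) - (x == 0xa5a2043))

indices = [h(x) for x in inputs]
assert sorted(indices) == list(range(9))  # minimal: every slot 0..8 used once

lut = [0] * 9
for i, ans in zip(indices, answers):
    lut[i] = ans
print(lut)
```

&lt;p&gt;Whether two comparisons are cheaper than a slightly larger table depends on the target, so this is mostly useful as a way to explore candidates.&lt;&#x2F;p&gt;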
&lt;h2 id=&quot;an-interlude-integer-division&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#an-interlude-integer-division&quot; aria-label=&quot;Anchor link for: an-interlude-integer-division&quot;&gt;An interlude: integer division&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;So far we’ve just been using the modulo operator to reduce our input domain to a
much smaller one. However, integer division&#x2F;modulo is rather slow on most
processors. If we take a look at &lt;a href=&quot;https:&#x2F;&#x2F;www.agner.org&#x2F;optimize&#x2F;&quot;&gt;Agner Fog’s instruction
tables&lt;&#x2F;a&gt; we see that the 32-bit &lt;code&gt;DIV&lt;&#x2F;code&gt;
instruction has a latency of 9-12 cycles on AMD Zen3 and 12 cycles on Intel Ice
Lake. However, we don’t need a fully generic division instruction, since our
divisor is constant here. Let’s take a quick look at what the compiler does for
mod 13:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;mod13(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    x % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;example::mod13:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;eax, edi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;ecx, edi
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;imul    &lt;&#x2F;span&gt;&lt;span&gt;rcx, rcx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1321528399
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shr     &lt;&#x2F;span&gt;&lt;span&gt;rcx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;34
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;lea     &lt;&#x2F;span&gt;&lt;span&gt;edx, [rcx + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;*rcx]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;lea     &lt;&#x2F;span&gt;&lt;span&gt;ecx, [rcx + &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;*rdx]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;sub     &lt;&#x2F;span&gt;&lt;span&gt;eax, ecx
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It translates the modulo operation into a multiplication with some shifts &#x2F; adds &#x2F; subtractions instead.
To see how that works let’s first consider the most magical part: the multiplication by $1321528399$ followed
by the right shift of $34$. That magical constant is actually $\lceil 2^{34} &#x2F; 13 \rceil$
which means it’s computing&lt;&#x2F;p&gt;
&lt;p&gt;$$q = \left\lfloor \frac{x \cdot \lceil 2^{34} &#x2F; 13 \rceil}{2^{34}}\right\rfloor = \lfloor x &#x2F; 13 \rfloor.$$&lt;&#x2F;p&gt;
&lt;p&gt;To prove that this is in fact correct, we note that $2^{34} + 3$ is divisible by $13$, allowing us
to split the division into the correct result plus an error term:&lt;&#x2F;p&gt;
&lt;p&gt;$$\frac{x \cdot \lceil 2^{34} &#x2F; 13 \rceil}{2^{34}} = \frac{x \cdot (2^{34} + 3) &#x2F; 13}{2^{34}} = x &#x2F; 13 + \frac{3x}{13\cdot 2^{34}}.$$&lt;&#x2F;p&gt;
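&lt;p&gt;Then we inspect the error term and substitute $x = 2^{32}$ as an upper bound
to see it never affects the result after flooring:&lt;&#x2F;p&gt;
&lt;p&gt;$$\frac{3x}{13\cdot 2^{34}} \leq \frac{3 \cdot 2^{32}}{13\cdot 2^{34}} = \frac{3}{13 \cdot 4} &amp;lt; 1&#x2F;13.$$&lt;&#x2F;p&gt;
&lt;p&gt;Since the fractional part of $x &#x2F; 13$ is at most $12&#x2F;13$, adding an error smaller than $1&#x2F;13$ can never push the sum past the next integer.&lt;&#x2F;p&gt;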
&lt;p&gt;For more context and references I would suggest &lt;a href=&quot;https:&#x2F;&#x2F;www.ncbi.nlm.nih.gov&#x2F;pmc&#x2F;articles&#x2F;PMC8258644&#x2F;&quot;&gt;&lt;em&gt;“Integer division by constants:
optimal bounds”&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; by
Lemire et al.&lt;&#x2F;p&gt;
&lt;p&gt;After computing $q = \lfloor x&#x2F;13\rfloor$ it then computes the actual modulo we
want as $x - 13q$ using
the identity $$x \bmod m = x - \lfloor x &#x2F; m \rfloor \cdot m$$ with $m = 13$.
It avoids the use of another (relatively) expensive integer multiplication by using
the &lt;code&gt;lea&lt;&#x2F;code&gt; instruction which can compute &lt;code&gt;a + k*b&lt;&#x2F;code&gt;, where &lt;code&gt;k&lt;&#x2F;code&gt; can be a constant
1, 2, 4, or 8. This is how it computes $13q$:&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;The &lt;code&gt;LEA&lt;&#x2F;code&gt; instruction was originally intended for array index computations,
because &lt;code&gt;arr[i]&lt;&#x2F;code&gt; is found at address &lt;code&gt;arr_start + sizeof(T)*i&lt;&#x2F;code&gt;, and &lt;code&gt;sizeof(T)&lt;&#x2F;code&gt;
is very often a small power of two.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;Instruction                  Translation     Effect
&lt;&#x2F;span&gt;&lt;span&gt;lea     edx, [rcx + 2*rcx]   t := q + 2*q    t = 3q
&lt;&#x2F;span&gt;&lt;span&gt;lea     ecx, [rcx + 4*rdx]   o := q + 4*t    o = (q + 4*3q) = 13q
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
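&lt;p&gt;The two-step decomposition can be mimicked in Python. This is only a sketch: &lt;code&gt;lea&lt;&#x2F;code&gt;, &lt;code&gt;times13&lt;&#x2F;code&gt; and &lt;code&gt;mod13&lt;&#x2F;code&gt; are hypothetical helper names, the division is done with ordinary floor division, and 32-bit register wraparound is ignored:&lt;&#x2F;p&gt;

```python
def lea(a, k, b):
    # Models the computation a + k*b; only these scale factors are allowed.
    assert k in (1, 2, 4, 8)
    return a + k * b

def times13(q):
    t = lea(q, 2, q)     # t = q + 2*q = 3q
    return lea(q, 4, t)  # q + 4*3q = 13q

def mod13(x):
    q = x // 13          # stands in for the multiply-and-shift division
    return x - times13(q)

assert all(times13(q) == 13 * q for q in range(1000))
assert all(mod13(x) == x % 13 for x in range(1000))
```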
&lt;h2 id=&quot;bit-mixing&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#bit-mixing&quot; aria-label=&quot;Anchor link for: bit-mixing&quot;&gt;Bit mixing&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;We have seen that choosing different moduli works, and that compilers implement
fixed-divisor modulo using multiplication. It is time to cut out the
middleman and go straight to the good stuff: integer multiplication.
We can get a better understanding of what integer multiplication actually does
by multiplying two integers in binary using the schoolbook method:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;4242 = 0b1000010010010
&lt;&#x2F;span&gt;&lt;span&gt;4871 = 0b1001100000111 = 2^0 + 2^1 + 2^2 + 2^8 + 2^9 + 2^12
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;          Binary                    Decimal
&lt;&#x2F;span&gt;&lt;span&gt;               1000010010010   |   4242 * 2^0
&lt;&#x2F;span&gt;&lt;span&gt;              1000010010010    |   4242 * 2^1
&lt;&#x2F;span&gt;&lt;span&gt;             1000010010010     |   4242 * 2^2
&lt;&#x2F;span&gt;&lt;span&gt;       1000010010010           |   4242 * 2^8
&lt;&#x2F;span&gt;&lt;span&gt;      1000010010010            |   4242 * 2^9
&lt;&#x2F;span&gt;&lt;span&gt;   1000010010010               |   4242 * 2^12
&lt;&#x2F;span&gt;&lt;span&gt;-------------------------------|---------------- +
&lt;&#x2F;span&gt;&lt;span&gt;   1001110110100100111111110   |   20662782
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
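&lt;p&gt;The schoolbook method amounts to adding one shifted copy of $x$ for every set bit of the multiplier. A minimal sketch, where &lt;code&gt;schoolbook_multiply&lt;&#x2F;code&gt; is a name I made up for illustration:&lt;&#x2F;p&gt;

```python
def schoolbook_multiply(x, c):
    # Add x * 2**k for every bit position k that is set in c.
    total = 0
    k = 0
    while c != 0:
        if c % 2 == 1:
            total += x * 2**k
        c //= 2
        k += 1
    return total

assert schoolbook_multiply(4242, 4871) == 4242 * 4871
```

&lt;p&gt;Each shifted copy lands at a different offset, so a high bit of the product can receive contributions from many different bits of $x$.&lt;&#x2F;p&gt;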
&lt;p&gt;There is a beautiful property here we can take advantage of: all of the upper
bits of the product $x \cdot c$ for some constant $c$ depend on most of the bits
of $x$. That is, for good choices of the constants $c$ and $s$, &lt;code&gt;c*x &amp;gt;&amp;gt; s&lt;&#x2F;code&gt; will
give you a result that is wildly different even for small differences in $x$. It
is a strong &lt;em&gt;bit mixer&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Hash functions benefit from bit mixing, because they want to be unpredictable.
A good measure of unpredictability is found in the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Avalanche_effect&quot;&gt;avalanche
effect&lt;&#x2F;a&gt;. For a true random oracle
changing one bit in the input should flip each bit in the output with 50% probability.
Thus having all your output bits depend on the input is a good property for
a hash function, as a random oracle is the ideal hash function.&lt;&#x2F;p&gt;
&lt;p&gt;So, let’s just &lt;strong&gt;try something&lt;&#x2F;strong&gt;. We’ll stick with using modulo $2^k$
for maximum speed (as those can be computed with a binary AND instead of
needing multiplication), and try to find constants $c$
and $s$ that work. We want our codomain to have size $2^4 = 16$ since that’s the
smallest power of two bigger than $9$. We’ll use a $32 \times 32 \to 32$ bit multiply
since we only need 4 bits of output, and the top 4 bits of the multiplication
will depend sufficiently on most of the bits of the input. By doing a right-shift
of $28$ on a &lt;code&gt;u32&lt;&#x2F;code&gt; result we also get our mod $2^4$ for free.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;If we needed more than four bits of
output, or we couldn’t find a constant that works, I would try a
$32 \times 32 \to 64$ bit multiply as this gives you more output bits to work with.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;import &lt;&#x2F;span&gt;&lt;span&gt;random
&lt;&#x2F;span&gt;&lt;span&gt;random.seed(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;h(x, c):
&lt;&#x2F;span&gt;&lt;span&gt;    m = (x * c) % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;m &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;28
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;    c = random.randrange(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;is_phf(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: h(x, c), inputs):
&lt;&#x2F;span&gt;&lt;span&gt;        print(hex(c))
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;break
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It’s always a bit exciting to hit enter when doing a random search for a magic
constant, not knowing if you’ll get an answer or not. In this case it instantly
printed &lt;code&gt;0x46685257&lt;&#x2F;code&gt;. Since it was so fast there are likely many solutions, so
we can definitely be a bit greedier and see if we can get closer to a minimal
perfect hash function:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;best = float(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;#39;inf&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;best &amp;gt;= len(inputs):
&lt;&#x2F;span&gt;&lt;span&gt;    c = random.randrange(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    max_idx = max(h(x, c) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;max_idx &amp;lt; best and is_phf(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: h(x, c), inputs):
&lt;&#x2F;span&gt;&lt;span&gt;        print(max_idx, hex(c))
&lt;&#x2F;span&gt;&lt;span&gt;        best = max_idx
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This quickly iterated through a couple of solutions before finding a constant that gives a minimal perfect hash function,
&lt;code&gt;0xedc72f12&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [h(x, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xedc72f12&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;inputs]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; make_lut(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: h(x, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xedc72f12&lt;&#x2F;span&gt;&lt;span&gt;), inputs, answers)
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Ironically, if we want optimal performance in safe Rust, we still need to
zero-pad the array to 16 elements so we can never go out-of-bounds. But if you
are &lt;strong&gt;absolutely certain&lt;&#x2F;strong&gt; there are no inputs other than the specified inputs,
and you want optimal speed, you could reduce your memory usage to 9 bytes with
unsafe Rust. Sticking with the safe code option we’ll get:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;LUT&lt;&#x2F;span&gt;&lt;span&gt;: [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span&gt;] = [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;,
&lt;&#x2F;span&gt;&lt;span&gt;                       &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;phf_lut(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;LUT&lt;&#x2F;span&gt;&lt;span&gt;[(x.wrapping_mul(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xedc72f12&lt;&#x2F;span&gt;&lt;span&gt;) &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;28&lt;&#x2F;span&gt;&lt;span&gt;) as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;usize&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Inspecting the assembly code using the &lt;a href=&quot;https:&#x2F;&#x2F;rust.godbolt.org&#x2F;z&#x2F;KvP4v187P&quot;&gt;Compiler
Explorer&lt;&#x2F;a&gt;, we see that it is incredibly tight now:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;example::phf_lut:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;imul    &lt;&#x2F;span&gt;&lt;span&gt;eax, dword ptr [rdi], -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;305713390
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shr     &lt;&#x2F;span&gt;&lt;span&gt;rax, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;28
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;lea     &lt;&#x2F;span&gt;&lt;span&gt;rcx, [rip + .L__unnamed_1]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;movzx   &lt;&#x2F;span&gt;&lt;span&gt;eax, byte ptr [rax + rcx]
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;.L__unnamed_1:
&lt;&#x2F;span&gt;&lt;span&gt;        .asciz  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;\007\001\004\002\005\b\006\t\003\000\000\000\000\000\000&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;the-world-s-smallest-hash-table&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#the-world-s-smallest-hash-table&quot; aria-label=&quot;Anchor link for: the-world-s-smallest-hash-table&quot;&gt;The World’s Smallest Hash Table&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;You thought 9 bytes was the world’s smallest hash table? We’re only just getting
started! You see, it is actually possible to have a small lookup table without
accessing memory, by storing it in the code.&lt;&#x2F;p&gt;
&lt;aside&gt;Code ultimately has to be stored in memory as well, but it saves an
indirection.&lt;&#x2F;aside&gt;
&lt;p&gt;A particularly effective method for storing a small lookup table with small
elements is to store it as a constant, indexed using shifts. For example, the lookup
table &lt;code&gt;[1, 42, 17, 26][i]&lt;&#x2F;code&gt; could also be written as such:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0b11010010001101010000001 &lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span&gt;*i) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0b111111
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each individual value fits in 6 bits, and we can easily fit $4\times 6 = 24$
bits in a &lt;code&gt;u32&lt;&#x2F;code&gt;. In isolation this might not make sense over a normal lookup
table, but it can be combined with perfect hashing, and can be vectorized as
well.&lt;&#x2F;p&gt;
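&lt;p&gt;To see where such a constant comes from, here is a small Python sketch that packs the example table into an integer and reads it back. The helpers &lt;code&gt;pack&lt;&#x2F;code&gt; and &lt;code&gt;lookup&lt;&#x2F;code&gt; are hypothetical names, and division and modulo by powers of two stand in for the shift and the mask:&lt;&#x2F;p&gt;

```python
def pack(values, width):
    # Place value i at bit offset width*i; every value must fit in `width` bits.
    packed = 0
    for i, v in enumerate(values):
        packed += v * 2**(width * i)
    return packed

def lookup(packed, width, i):
    # Shift right by width*i bits, then keep the low `width` bits.
    return packed // 2**(width * i) % 2**width

table = [1, 42, 17, 26]
packed = pack(table, 6)
assert packed == 0b11010010001101010000001
assert [lookup(packed, 6, i) for i in range(4)] == table
```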
&lt;p&gt;Unfortunately we have 9 values that each require 5 bits, which doesn’t fit in
a &lt;code&gt;u32&lt;&#x2F;code&gt;… or does it? You see, by co-designing the lookup table with the
perfect hash function we could theoretically &lt;em&gt;overlap&lt;&#x2F;em&gt; the end of the bitstring
of one value with the start of another if we directly use the hash
function output as the shift amount.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Update on 2023-03-05&lt;&#x2F;strong&gt;: As tinix0 rightfully &lt;a href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;programming&#x2F;comments&#x2F;11i3hfy&#x2F;the_worlds_smallest_hash_table&#x2F;jazrwok&#x2F;&quot;&gt;points out on
reddit&lt;&#x2F;a&gt;,
our values only require 4 bits, not 5. I’ve made things unnecessarily harder for
myself by effectively prepending a zero bit to each value. That said, you would
still need overlapping for fitting $4 \times 9 = 36$ bits in a &lt;code&gt;u32&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;We could also just use a &lt;code&gt;u64&lt;&#x2F;code&gt; to store the data,
but that’s boring and we’re trying to create the smallest
possible hash table here.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;We are thus looking for two 32-bit constants &lt;code&gt;c&lt;&#x2F;code&gt; and &lt;code&gt;d&lt;&#x2F;code&gt; such that&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;(d &amp;gt;&amp;gt; (x.wrapping_mul(c) &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span&gt;)) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0b11111 &lt;&#x2F;span&gt;&lt;span&gt;== answer(x)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that the magic shift is now 32 - 5 = 27 because we want 5 bits of output to feed into the
second shift, as $2^5 = 32$.&lt;&#x2F;p&gt;
&lt;p&gt;Luckily we don’t have to actually increase the search space, as we can construct
&lt;code&gt;d&lt;&#x2F;code&gt; from &lt;code&gt;c&lt;&#x2F;code&gt; by just placing the answer bits in the indicated shift positions.
Doing this we can also find out whether &lt;code&gt;c&lt;&#x2F;code&gt; is valid or not by detecting conflicts
in whether a bit should be $0$ or $1$ for different inputs. Will we be lucky?&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;build_bit_lut(h, w, inputs, answers):
&lt;&#x2F;span&gt;&lt;span&gt;    zeros = ones = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;x, a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;zip(inputs, answers):
&lt;&#x2F;span&gt;&lt;span&gt;        shift = h(x)
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;zeros &amp;amp; (a &amp;lt;&amp;lt; shift) or ones &amp;amp; (~a % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**w &amp;lt;&amp;lt; shift):
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;None  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;# Conflicting bits.
&lt;&#x2F;span&gt;&lt;span&gt;        zeros |= ~a % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**w &amp;lt;&amp;lt; shift
&lt;&#x2F;span&gt;&lt;span&gt;        ones |= a &amp;lt;&amp;lt; shift
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;ones
&lt;&#x2F;span&gt;&lt;span&gt;    
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span&gt;h(x, c):
&lt;&#x2F;span&gt;&lt;span&gt;    m = (x * c) % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;m &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;27
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;random.seed(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;    c = random.randrange(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;    lut = build_bit_lut(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;x: h(x, c), &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, inputs, answers)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;lut is not &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;None &lt;&#x2F;span&gt;&lt;span&gt;and lut &amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;        print(hex(c), hex(lut))
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;break
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It takes a second or two, but we found a solution!&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;pub &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;phf_shift(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8 &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; shift = x.wrapping_mul(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0xa463293e&lt;&#x2F;span&gt;&lt;span&gt;) &amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;27&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    ((&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x824a1847&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32 &lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt; shift) &amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0b11111&lt;&#x2F;span&gt;&lt;span&gt;) as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u8
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span&gt;example::phf_shift:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;imul    &lt;&#x2F;span&gt;&lt;span&gt;ecx, dword ptr [rdi], -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1537005250
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shr     &lt;&#x2F;span&gt;&lt;span&gt;ecx, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;27
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;mov     &lt;&#x2F;span&gt;&lt;span&gt;eax, -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2109073337
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;shr     &lt;&#x2F;span&gt;&lt;span&gt;eax, cl
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;and     &lt;&#x2F;span&gt;&lt;span&gt;al, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;31
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We have managed to replace a fully-fledged hash table
with one that is so small that it consists of 6
(vectorizable) assembly instructions without any
further data.&lt;&#x2F;p&gt;
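&lt;p&gt;The overlap can be made visible by sliding a 5-bit window over the table constant. The following is a sketch under the assumption that the answers are the values 1 through 9, as in the lookup table earlier. &lt;code&gt;window&lt;&#x2F;code&gt; and &lt;code&gt;first_shift&lt;&#x2F;code&gt; are names I introduced, and the actual shift used for each input depends on the inputs defined earlier in the article:&lt;&#x2F;p&gt;

```python
D = 0x824a1847  # packed table constant taken from phf_shift above

def window(d, shift, width=5):
    # Shift d right by `shift` bits and keep the low `width` bits,
    # written with division and modulo.
    return d // 2**shift % 2**width

# Record the first shift amount at which each 5-bit window value appears.
first_shift = {}
for shift in range(32):
    first_shift.setdefault(window(D, shift), shift)

for answer in range(1, 10):
    print(answer, first_shift[answer])
```

&lt;p&gt;Nine separate 5-bit fields would need 45 bits, so the windows holding the answers necessarily share bits inside the 32-bit constant.&lt;&#x2F;p&gt;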
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Wew, that was a wild ride. Was it worth it? Let’s
compare four hash-based versions on how long they take to process
ten million lines of random input and sum all answers.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;hashmap_str&lt;&#x2F;code&gt; processes the lines properly as newline
delimited strings, as in the &lt;a href=&quot;&#x2F;blog&#x2F;worlds-smallest-hash-table&#x2F;#a-general-solution&quot;&gt;general solution&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;hashmap_u32&lt;&#x2F;code&gt; still uses a hashmap, but reads the lines and does lookups
using &lt;code&gt;u32&lt;&#x2F;code&gt;s like the perfect hash functions do.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;phf_lut&lt;&#x2F;code&gt; is the earlier defined function that feeds a perfect
hash function into a lookup table.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;phf_shift&lt;&#x2F;code&gt; is our world’s smallest hash function.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The complete test code can be found &lt;a href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;orlp&#x2F;fdc27b86e658c3b6df709c68ab477a14&quot;&gt;here&lt;&#x2F;a&gt;. On my 2021 Apple M1 MacBook Pro
I get the following results with &lt;code&gt;cargo run --release&lt;&#x2F;code&gt; on Rust 1.67.1:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Algorithm&lt;&#x2F;th&gt;&lt;th style=&quot;text-align: left&quot;&gt;Time&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;hashmap_str&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;262.83 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;hashmap_u32&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;81.33 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;phf_lut&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;2.97 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;phf_shift&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td style=&quot;text-align: left&quot;&gt;1.41 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;So not only is it the smallest, it’s also the fastest, beating the original
string-based &lt;code&gt;HashMap&lt;&#x2F;code&gt; solution by over 180 times. On this machine &lt;code&gt;phf_shift&lt;&#x2F;code&gt; is
roughly two times faster than &lt;code&gt;phf_lut&lt;&#x2F;code&gt; because it can be fully
vectorized by the compiler whereas &lt;code&gt;phf_lut&lt;&#x2F;code&gt; needs to do a lookup in memory
which is either impossible or relatively slow to do in vectorized code,
depending on which SIMD instructions you have available.&lt;&#x2F;p&gt;
&lt;p&gt;Your results may vary, and you might need
&lt;code&gt;RUSTFLAGS=&#x27;-C target-cpu=native&#x27;&lt;&#x2F;code&gt; for &lt;code&gt;phf_shift&lt;&#x2F;code&gt; to autovectorize.&lt;&#x2F;p&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Magical Fibonacci Formulae</title>
		<author>Orson R. L. Peters</author>
		<published>2023-02-06T00:00:00+00:00</published>
		<updated>2023-02-06T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/magical-fibonacci-formulae/" type="text/html"/>
		<id>https://orlp.net/blog/magical-fibonacci-formulae/</id>
		<content type="html">&lt;p&gt;The following Python function computes the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fibonacci_number&quot;&gt;Fibonacci
sequence&lt;&#x2F;a&gt;, without loops,
recursion or floating point arithmetic:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;f=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;n:(b:=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&amp;lt;n)**n*b&#x2F;&#x2F;(b*b-b-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)%b
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It really does:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [f(n) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;)]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;21&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;How does it work? As a teaser, look at the decimal expansions of $100 &#x2F; 9899$
and $1000 &#x2F; 998999$ and see if you notice anything:&lt;&#x2F;p&gt;
&lt;p&gt;$$\frac{100}{9899} = 0.0101020305081321345590463…$$
$$\frac{1000}{998999} = 0.001001002003005008013021034055089144233377610988…$$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;generating-functions&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#generating-functions&quot; aria-label=&quot;Anchor link for: generating-functions&quot;&gt;Generating functions&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;We define the Fibonacci sequence as
\begin{align*}
f(0) &amp;amp;= 0,\\
f(1) &amp;amp;= 1,\\
f(n) &amp;amp;= f(n - 1) + f(n - 2).
\end{align*}
However, we will also define a rather strange object known as a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Generating_function&quot;&gt;generating
function&lt;&#x2F;a&gt;. It is an
‘infinite polynomial’ (also known as a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Power_series&quot;&gt;power series&lt;&#x2F;a&gt;)
in variable $x$ whose &lt;em&gt;coefficients&lt;&#x2F;em&gt; correspond to our series of interest. This
gives us
$$F(x) = 0 + 1x + 1x^2 + 2x^3 + 3x^4 + 5x^5 + 8x^6 + \cdots,$$
$$F(x) = f(0)x^0 + f(1)x^1 + f(2)x^2 + \cdots = \sum_{n=0}^\infty f(n)x^n.$$&lt;&#x2F;p&gt;
&lt;p&gt;Does this function converge? Is this really allowed? All interesting questions
we are going to ignore here. Instead, let’s just start doing interesting things
with our new object, like taking out the first two terms from the sum:&lt;&#x2F;p&gt;
&lt;aside&gt; 
&lt;p&gt;There is an amazing free book called &lt;a href=&quot;https:&#x2F;&#x2F;www2.math.upenn.edu&#x2F;~wilf&#x2F;DownldGF.html&quot;&gt;generatingfunctionology&lt;&#x2F;a&gt;.
You can read a lot more about generating functions there. It also dives into
the question about convergence.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;$$F(x) = f(0)x^0 + f(1)x^1 + \sum_{n=2}^\infty f(n)x^n = x + \sum_{n=2}^\infty f(n)x^n.$$&lt;&#x2F;p&gt;
&lt;p&gt;We can also substitute $f(n) = f(n - 1) + f(n - 2)$ now, because our iteration
variable starts at $2$:&lt;&#x2F;p&gt;
&lt;p&gt;\begin{align*}
F(x) &amp;amp;= x + \sum_{n=2}^\infty \Big(f(n-1) + f(n-2) \Big)x^n\\
&amp;amp;= x + \sum_{n=2}^\infty f(n-1)x^n + \sum_{n=2}^\infty f(n-2)x^n.
\end{align*}
Now we can substitute the loop variables, re-insert the $f(0)$ term (which is just
zero) into the first sum, and factor out the extra $x$ terms:
\begin{align*}
F(x) &amp;amp;= x + \sum_{n=1}^\infty f(n)x^{n+1} + \sum_{n=0}^\infty f(n)x^{n+2}\\
&amp;amp;= x - f(0)x^{1} + \sum_{n=0}^\infty f(n)x^{n+1} + \sum_{n=0}^\infty f(n)x^{n+2}\\
&amp;amp;= x + x\sum_{n=0}^\infty f(n)x^{n} + x^2\sum_{n=0}^\infty f(n)x^{n}.
\end{align*}&lt;&#x2F;p&gt;
&lt;p&gt;And now using the crucial observation that $F(x) = \sum_{n=0}^\infty f(n)x^{n}$:
\begin{align*}
F(x) &amp;amp;= x + xF(x) + x^2F(x),\\
(1 - x - x^2)F(x) &amp;amp;= x,\\
F(x) &amp;amp;= \frac{x}{1 - x - x^2}.
\end{align*}
Wow. Somehow the simple expression $x &#x2F; (1 - x - x^2)$ ‘contains’ the entire
Fibonacci sequence. If you substitute $x = 10^{-3}$ in $F(x)$ you will retrieve
our earlier value
$$\frac{1000}{998999} = 0.001001002003005008013021034055089144233377610988…$$&lt;&#x2F;p&gt;
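&lt;p&gt;As a quick sanity check, the digit groups of these fractions can be read off with integer arithmetic only. The sketch below is illustrative; the helper names &lt;code&gt;decimal_groups&lt;&#x2F;code&gt; and &lt;code&gt;fibonacci_list&lt;&#x2F;code&gt; are made up for this example:&lt;&#x2F;p&gt;

```python
def decimal_groups(numerator, denominator, group_size, group_count):
    """Return the first group_count digit groups after the decimal point."""
    total_digits = group_size * group_count
    # Scaling by 10**total_digits and flooring keeps exactly that many decimals.
    scaled = numerator * 10**total_digits // denominator
    digits = str(scaled).zfill(total_digits)
    return [int(digits[i:i + group_size]) for i in range(0, total_digits, group_size)]


def fibonacci_list(count):
    """Reference values f(1), f(2), ..., f(count) by plain iteration."""
    values, a, b = [], 0, 1
    for _ in range(count):
        a, b = b, a + b
        values.append(a)
    return values


# The first 15 three-digit groups of 1000/998999 are f(1) through f(15).
print(decimal_groups(1000, 998999, 3, 15))
print(decimal_groups(1000, 998999, 3, 15) == fibonacci_list(15))
```

&lt;p&gt;Asking for more groups shows the pattern breaking down once a Fibonacci number no longer fits in three digits.&lt;&#x2F;p&gt;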
&lt;h2 id=&quot;an-interlude-by-binet&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#an-interlude-by-binet&quot; aria-label=&quot;Anchor link for: an-interlude-by-binet&quot;&gt;An interlude by Binet&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Solving $1 - x - x^2 = 0$ with the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Quadratic_formula&quot;&gt;quadratic formula&lt;&#x2F;a&gt; gives us
roots $-(1 \pm \sqrt{5})&#x2F;2$, which one might recognize as
the negative of the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Golden_ratio&quot;&gt;golden ratio&lt;&#x2F;a&gt;, $-\phi$, and the inverse of the golden ratio,
$\phi^{-1}$:
$$\phi = \frac{\sqrt{5} + 1}{2}, \quad \phi^{-1} = \frac{2}{\sqrt{5} + 1} = \frac{2\left(\sqrt{5} - 1\right)}{\left(\sqrt{5} + 1\right)\left(\sqrt{5} - 1\right)} = \frac{\sqrt{5} - 1}{2}.$$&lt;&#x2F;p&gt;
&lt;p&gt;This allows us to factor and
use  &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_fraction_decomposition&quot;&gt;partial fraction decomposition&lt;&#x2F;a&gt;
on $F(x)$:
$$\frac{x}{1 - x - x^2} = \frac{x}{(1 - \phi x)(1 + \phi^{-1} x)} = \frac{1}{\phi + \phi^{-1}}\left(\frac{1}{1 - \phi x} - \frac{1}{1 + \phi^{-1} x}\right).$$
Deriving this is a rather arduous (but strictly elementary) algebraic process, so it is much easier
to verify the identity by expanding the right-hand side than to follow the derivation.
To verify it, use the fact that $\phi \cdot \phi^{-1} = \phi - \phi^{-1} = 1$.&lt;&#x2F;p&gt;
&lt;p&gt;If we recall the formula
for the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Geometric_series&quot;&gt;geometric series&lt;&#x2F;a&gt;,&lt;&#x2F;p&gt;
&lt;p&gt;$$\frac{1}{1 - r} = \sum_{n=0}^\infty r^n,$$&lt;&#x2F;p&gt;
&lt;p&gt;and apply it to our above expression (once
again completely ignoring convergence) while
noticing that $\phi + \phi^{-1} = \sqrt{5}$ we find&lt;&#x2F;p&gt;
&lt;p&gt;$$F(x) = \frac{1}{\sqrt{5}} \left( \sum_{n=0}^\infty \phi^n x^n - \sum_{n=0}^\infty {(-\phi})^{-n} x^n \right),$$
$$F(x) = \sum_{n=0}^\infty \frac{1}{\sqrt{5}} \left(  \phi^n - {(-\phi})^{-n} \right) x^n.$$&lt;&#x2F;p&gt;
&lt;p&gt;And now for the true magic, recall that we defined $F(x) = \sum_{n=0}^\infty f(n)x^n$, and thus we conclude
$$f(n) = \frac{1}{\sqrt{5}}\left(\phi^n - {(-\phi})^{-n} \right).$$&lt;&#x2F;p&gt;
&lt;aside&gt;
Note that $\phi^{-n}$ very quickly approaches 0, making $\phi^n &#x2F; \sqrt{5}$
rounded to the nearest integer also correct. In turn this explains why the
ratio of consecutive Fibonacci numbers approaches the golden ratio.
&lt;&#x2F;aside&gt;
&lt;p&gt;We have recovered &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fibonacci_number#Binet&amp;#x27;s_formula&quot;&gt;Binet’s formula&lt;&#x2F;a&gt;
for the Fibonacci numbers, a closed form. Unfortunately evaluating it in
Python eventually gives wrong answers because floating-point numbers have limited precision,
which is why this is only an interlude. But it is nevertheless cool:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; phi = (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.5&lt;&#x2F;span&gt;&lt;span&gt;) &#x2F; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [(phi**n - (-phi)**-n) &#x2F; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.5 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;)]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0.0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2.0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3.0000000000000004&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5.000000000000001&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8.000000000000002&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13.000000000000002&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;21.000000000000004&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;34.00000000000001&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
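&lt;p&gt;To see where the floating-point version stops being trustworthy, one can compare it against exact integer iteration. This is a minimal sketch; &lt;code&gt;fib_exact&lt;&#x2F;code&gt;, &lt;code&gt;fib_binet&lt;&#x2F;code&gt; and &lt;code&gt;first_binet_mismatch&lt;&#x2F;code&gt; are names invented for this example, and the exact index it reports may depend on the platform’s floating-point behaviour:&lt;&#x2F;p&gt;

```python
def fib_exact(n):
    """Exact Fibonacci number using arbitrary-precision integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_binet(n):
    """Binet's formula in double precision, rounded to the nearest integer."""
    sqrt5 = 5 ** 0.5
    phi = (1 + sqrt5) / 2
    return round((phi**n - (-phi)**-n) / sqrt5)


def first_binet_mismatch(limit):
    """Return the smallest n below limit where rounding Binet's formula is wrong."""
    for n in range(limit):
        if fib_binet(n) != fib_exact(n):
            return n
    return None


print(first_binet_mismatch(200))
```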
&lt;h2 id=&quot;evaluating-generating-functions&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#evaluating-generating-functions&quot; aria-label=&quot;Anchor link for: evaluating-generating-functions&quot;&gt;Evaluating generating functions&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Take another look at $F(10^{-3})$:&lt;&#x2F;p&gt;
&lt;p&gt;$$\frac{1000}{998999} = 0.001001002003005008013021034055089144233377610988…$$&lt;&#x2F;p&gt;
&lt;p&gt;Each successive integer in the series starts three decimal places after the previous
one. This makes sense, because $F(10^{-3})$ is the sum of $f(n)10^{-3n}$
for all $n$.&lt;&#x2F;p&gt;
&lt;p&gt;This also means that the method eventually fails, once the numbers outgrow
three digits and overflow into the preceding ones. If we ignore that for now,
we can study $10^{3n} F(10^{-3})$:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;n = 0                   0.001001002003005008013021034055...
&lt;&#x2F;span&gt;&lt;span&gt;n = 1                   1.001002003005008013021034055089...
&lt;&#x2F;span&gt;&lt;span&gt;n = 2                1001.002003005008013021034055089144...
&lt;&#x2F;span&gt;&lt;span&gt;n = 3             1001002.003005008013021034055089144233...
&lt;&#x2F;span&gt;&lt;span&gt;n = 4          1001002003.005008013021034055089144233377...
&lt;&#x2F;span&gt;&lt;span&gt;n = 5       1001002003005.008013021034055089144233377610...
&lt;&#x2F;span&gt;&lt;span&gt;n = 6    1001002003005008.013021034055089144233377610988...
&lt;&#x2F;span&gt;&lt;span&gt;                      ^^^
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If we could extract just the highlighted column we would have our Fibonacci numbers.
Luckily, we can. Flooring removes the entire fractional part, and taking the result
modulo $10^3$ keeps only the last three digits, discarding everything in front of them.&lt;&#x2F;p&gt;
&lt;p&gt;We still have the overflow issue, however. But this is easily fixed by
choosing a number $k$ instead of $3$, large enough that the following Fibonacci
numbers don’t overflow into our number of interest. For example the choice $k
= n$ works, as the $n$th Fibonacci number certainly won’t overflow $n$ decimal
digits, giving the formula&lt;&#x2F;p&gt;
&lt;p&gt;$$f(n) = \lfloor 10^{n^2} \cdot F(10^{-n}) \rfloor \bmod 10^{n}.$$&lt;&#x2F;p&gt;
&lt;p&gt;We can generalize much more however. Our choice of $10^3$ and $10^n$ as bases
was rather arbitrary. You can use any base $b$, as long as $b$ is large enough. This gives:&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;We abuse notation a bit for brevity here, in reality $b$ is a function of $n$.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;$$f(n) = \left\lfloor b^n \cdot F(b^{-1}) \right\rfloor \bmod b,$$
$$f(n) = \left\lfloor b^n \cdot \frac{b^{-1}}{1 - b^{-1} - b^{-2}} \right\rfloor \bmod b,$$
$$f(n) = \left\lfloor \frac{b^{n+1}}{b^2 - b - 1} \right\rfloor \bmod b.$$&lt;&#x2F;p&gt;
&lt;p&gt;This is actually a closed form we can evaluate without the use of floating
point arithmetic, as it simply consists of the division of two integers.
I have experimentally found that choosing $b = 3^n$ suffices to not overflow
for computing $f(n)$, giving our magical formula:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;n: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;**(n*(n+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)) &#x2F;&#x2F; (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;**(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;*n) - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;**n - &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) % &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;**n
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [f(n) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;)]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;21&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
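&lt;p&gt;The closed form can also be written with the base as an explicit parameter, which makes it easy to test different choices of $b$ against a plain iterative reference. The following is only a sketch: &lt;code&gt;fib_closed&lt;&#x2F;code&gt;, &lt;code&gt;fib_exact&lt;&#x2F;code&gt; and the &lt;code&gt;bases&lt;&#x2F;code&gt; table are hypothetical names, and checking the first 60 values is an experiment, not a proof:&lt;&#x2F;p&gt;

```python
def fib_exact(n):
    """Exact Fibonacci number by plain iteration, used as the reference."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_closed(n, b):
    """Evaluate floor(b**(n+1) / (b*b - b - 1)) mod b using integers only."""
    return b ** (n + 1) // (b * b - b - 1) % b


# Candidate bases as functions of n.
bases = {
    "10**n": lambda n: 10 ** n,
    "3**n": lambda n: 3 ** n,
    "2**(n+1)": lambda n: 2 ** (n + 1),
}

for name, base in bases.items():
    matches = all(fib_closed(n, base(n)) == fib_exact(n) for n in range(60))
    print(name, matches)
```

&lt;p&gt;Python’s floor division and modulo also give the right answer in the degenerate case $n = 0$ with $b = 1$, where the denominator is negative.&lt;&#x2F;p&gt;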
&lt;aside&gt;Needless to say, none of this is actually a good idea. We&#x27;re just having fun here.&lt;&#x2F;aside&gt;
&lt;p&gt;We can &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Code_golf&quot;&gt;golf&lt;&#x2F;a&gt; this quite a bit by using Python’s &lt;a href=&quot;https:&#x2F;&#x2F;docs.python.org&#x2F;3&#x2F;whatsnew&#x2F;3.8.html#assignment-expressions&quot;&gt;“walrus operator”&lt;&#x2F;a&gt;
(also called &lt;em&gt;assignment expression&lt;&#x2F;em&gt; by boring people) introduced in Python 3.8. If you write &lt;code&gt;(foo := bar)&lt;&#x2F;code&gt;
in an expression the parentheses take on value &lt;code&gt;bar&lt;&#x2F;code&gt; as well as storing &lt;code&gt;bar&lt;&#x2F;code&gt;
in a new variable &lt;code&gt;foo&lt;&#x2F;code&gt; available in the rest of the expression. Finally as a
bit of flair and efficiency, ${b = 2^{n+1}}$ also works, which can be computed as
&lt;code&gt;2 &amp;lt;&amp;lt; n&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; f=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;lambda &lt;&#x2F;span&gt;&lt;span&gt;n:(b:=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&amp;lt;n)**n*b&#x2F;&#x2F;(b*b-b-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)%b
&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; [f(n) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span&gt;range(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span&gt;)]
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;13&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;21&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;34&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
</content>
	</entry>
	<entry xml:lang="en">
		<title>Ordering Numbers, How Hard Can It Be?</title>
		<author>Orson R. L. Peters</author>
		<published>2023-01-27T00:00:00+00:00</published>
		<updated>2023-02-07T00:00:00+00:00</updated>
		<link rel="alternate" href="https://orlp.net/blog/ordering-numbers/" type="text/html"/>
		<id>https://orlp.net/blog/ordering-numbers/</id>
		<content type="html">&lt;aside&gt;
&lt;p&gt;This article is &lt;strong&gt;not&lt;&#x2F;strong&gt; about deciding whether two floating
point numbers are ‘close enough’. There are plenty of resources on
this (often subjective) problem. We simply want to know if ${x \leq y.}$&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;Suppose that you are a programmer, and that you have two numbers. You want to
know which number, if any, is larger. Now, if both numbers have the same type,
the solution is trivial in almost any programming language. Usually there is
even a dedicated operator, &lt;code&gt;&amp;lt;=&lt;&#x2F;code&gt;, for this operation. For example, in Python:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;120&amp;quot; &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;1132&amp;quot;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;False
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;aside&gt;Comparing two numbers in Brainfuck is left as an exercise to the reader.&lt;&#x2F;aside&gt;
&lt;p&gt;Oh. Well technically those are strings, not numbers, which typically are
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lexicographic_order&quot;&gt;&lt;em&gt;lexicographically&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; sorted.
Then again, they are numbers, just represented using strings. This may seem silly
but it’s actually a common problem in user interfaces, e.g. a list of files.
This is why you want to zero-pad your numeric filenames (&lt;code&gt;frame-00001.png&lt;&#x2F;code&gt;), or use lexicographically
order-preserving representations, such as &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;ISO_8601&quot;&gt;ISO 8601&lt;&#x2F;a&gt;
for dates.&lt;&#x2F;p&gt;
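&lt;p&gt;A small illustration of the file name problem, with made-up frame numbers and dates chosen only for this example:&lt;&#x2F;p&gt;

```python
frame_numbers = (1, 2, 10)

# Unpadded names sort lexicographically, so frame 10 lands before frame 2.
unpadded = sorted(f"frame-{i}.png" for i in frame_numbers)

# Zero-padding to a fixed width makes lexicographic order match numeric order.
padded = sorted(f"frame-{i:05d}.png" for i in frame_numbers)

# ISO 8601 dates are fixed-width with the most significant field first,
# so sorting them as strings also sorts them chronologically.
dates = sorted(["2023-02-06", "2025-12-31", "2023-01-27"])

print(unpadded)
print(padded)
print(dates)
```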
&lt;p&gt;But I digress, let’s assume our numbers really are represented using numeric
types. Then indeed it is easy, and &lt;code&gt;&amp;lt;=&lt;&#x2F;code&gt; just works:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;120 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1132
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;True
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or does it?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;mixed-type-integer-comparisons&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#mixed-type-integer-comparisons&quot; aria-label=&quot;Anchor link for: mixed-type-integer-comparisons&quot;&gt;Mixed-type integer comparisons&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;What if the two numbers you are comparing do not have the same type? Your
first approach might be to just use &lt;code&gt;&amp;lt;=&lt;&#x2F;code&gt; anyway, for example in C++:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;cpp&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-cpp &quot;&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span&gt;std::cout &amp;lt;&amp;lt; ((-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u&lt;&#x2F;span&gt;&lt;span&gt;) &amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Outputs 0.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Oh. C++ implicitly converted &lt;code&gt;-1&lt;&#x2F;code&gt; to an &lt;code&gt;unsigned int&lt;&#x2F;code&gt;, which caused
it to wrap around to the maximum value (which is obviously bigger than &lt;code&gt;1&lt;&#x2F;code&gt;).
Well at least a modern compiler will warn you by default, right?&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;$ g++ -std=c++20 main.cpp &amp;amp;&amp;amp; .&#x2F;a.out
&lt;&#x2F;span&gt;&lt;span&gt;0
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;aside&gt;I&#x27;m interested to see a study on how many bugs have occurred due to
mixed-type integer comparisons. I would not be surprised to see a significant
number of bugs, especially in C.&lt;&#x2F;aside&gt;
&lt;p&gt;Great. Yet another reason you should not forget to turn on warnings (&lt;code&gt;-Wall -pedantic&lt;&#x2F;code&gt;).
Let’s try Rust:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i32 &lt;&#x2F;span&gt;&lt;span&gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32 &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{:?}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, x &amp;lt;= y);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;error[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;E0308&lt;&#x2F;span&gt;&lt;span&gt;]: mismatched types
&lt;&#x2F;span&gt;&lt;span&gt; --&amp;gt; src&#x2F;main.rs:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;22
&lt;&#x2F;span&gt;&lt;span&gt;  |
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span&gt;| println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{:?}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, x &amp;lt;= y);
&lt;&#x2F;span&gt;&lt;span&gt;  |                      ^ expected `i32`, found `u32`
&lt;&#x2F;span&gt;&lt;span&gt;  |
&lt;&#x2F;span&gt;&lt;span&gt;help: you can convert a `&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span&gt;` to an `&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i32&lt;&#x2F;span&gt;&lt;span&gt;` and panic &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; the
&lt;&#x2F;span&gt;&lt;span&gt;converted value doesn&amp;#39;t fit
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Well at least it doesn’t silently miscompile… The suggested solution is
horrible though. There’s no reason to panic at all. The most efficient solution
here is to &lt;em&gt;promote&lt;&#x2F;em&gt; both values to a type that is a &lt;em&gt;superset&lt;&#x2F;em&gt; of both. For
example:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{:?}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, (x as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;) &amp;lt;= (y as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;)); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; Outputs true.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But what if there’s no type that’s a superset of both? At least for integer
types this is not a problem. For example there is no type in Rust that is
a superset of &lt;code&gt;i128&lt;&#x2F;code&gt; and &lt;code&gt;u128&lt;&#x2F;code&gt;. But we do know that an &lt;code&gt;i128&lt;&#x2F;code&gt; fits in a
&lt;code&gt;u128&lt;&#x2F;code&gt; if it is non-negative, and if it is negative, it is always smaller:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;less_eq(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i128&lt;&#x2F;span&gt;&lt;span&gt;, y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u128&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; x &amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;true &lt;&#x2F;span&gt;&lt;span&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{ x as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;u128 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;= y }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;All of this is quite error-prone, and frankly annoying. Cross-type integer
comparisons are always cheap, so I don’t see a good reason why the compiler
doesn’t automatically generate the above code. For example on Apple ARM the
above for &lt;code&gt;i64 &amp;lt;= u64&lt;&#x2F;code&gt; compiles to three instructions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;color:#202020;&quot;&gt;&lt;code&gt;&lt;span&gt;example::less_eq:
&lt;&#x2F;span&gt;&lt;span&gt;        cmp     x0, #1
&lt;&#x2F;span&gt;&lt;span&gt;        ccmp    x0, x1, #0, ge
&lt;&#x2F;span&gt;&lt;span&gt;        cset    w0, ls
&lt;&#x2F;span&gt;&lt;span&gt;        ret
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We &lt;em&gt;really&lt;&#x2F;em&gt; should automatically be doing the right thing here, instead
of pushing people to hand-written conversions that may or may not be correct, or worse,
silently generating wrong code. C++20 at least introduced new
&lt;a href=&quot;https:&#x2F;&#x2F;en.cppreference.com&#x2F;w&#x2F;cpp&#x2F;utility&#x2F;intcmp&quot;&gt;integer comparison functions&lt;&#x2F;a&gt;
for cross-type integer comparisons, but the regular comparison operators are
still just as dangerous.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;floating-point-numbers-are-exact&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#floating-point-numbers-are-exact&quot; aria-label=&quot;Anchor link for: floating-point-numbers-are-exact&quot;&gt;Floating point numbers are exact&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Before we dive into mixed-type floating point comparisons, we have to do a quick refresher on
what floating point &lt;em&gt;is&lt;&#x2F;em&gt;. When I say floating-point, I mean the binary floating
point numbers defined in the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;IEEE_754&quot;&gt;IEEE 754&lt;&#x2F;a&gt;
standard, in particular &lt;code&gt;binary32&lt;&#x2F;code&gt; (also known as &lt;code&gt;f32&lt;&#x2F;code&gt; or &lt;code&gt;float&lt;&#x2F;code&gt;), and &lt;code&gt;binary64&lt;&#x2F;code&gt;
(also known as &lt;code&gt;f64&lt;&#x2F;code&gt; or &lt;code&gt;double&lt;&#x2F;code&gt;). The latest version of the standard has
DOI &lt;a href=&quot;https:&#x2F;&#x2F;doi.org&#x2F;10.1109&#x2F;IEEESTD.2019.8766229&quot;&gt;10.1109&#x2F;IEEESTD.2019.8766229&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;aside&gt;
&lt;p&gt;Did you know that &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sci-Hub&quot;&gt;Sci-Hub&lt;&#x2F;a&gt; exists? It is
an important project removing the barriers and paywalls to human knowledge.
Usually if you have a DOI reference the document is only a couple clicks away!
(Whether this is legal depends on your jurisdiction.)&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;h3 id=&quot;ieee-754-refresher&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#ieee-754-refresher&quot; aria-label=&quot;Anchor link for: ieee-754-refresher&quot;&gt;IEEE 754 refresher&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;This floating point format has various warts, but the reality is that the machine
you’re reading this on almost surely uses it as its native floating point
implementation. In this format a number consists of one sign bit, $w$ exponent
bits and $t$ &lt;em&gt;trailing&lt;&#x2F;em&gt; mantissa bits. For &lt;code&gt;f32&lt;&#x2F;code&gt; we have $w = 8$ and $t = 23$,
for &lt;code&gt;f64&lt;&#x2F;code&gt; we have $w = 11$ and $t = 52$. There is also an exponent bias $b$
($127$ for &lt;code&gt;f32&lt;&#x2F;code&gt; and $1023$ for &lt;code&gt;f64&lt;&#x2F;code&gt;), which is used to get negative exponents.&lt;&#x2F;p&gt;
&lt;p&gt;To decode a floating point number (ignoring &lt;code&gt;NaN&lt;&#x2F;code&gt;s and infinities), we first look at our
$1 + w + t$ bits and decode three unsigned binary integers $s$, $e$ and $m$:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;ordering-numbers&#x2F;ieee-format.png&quot; alt=&quot;A visual representation of the IEEE 754 float format.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Then, if $e = 0$ the value of our number is
\begin{equation}f = {(-1)}^s \times 2^{e - b + 1} \times (0 + m &#x2F; 2^t),\end{equation}
otherwise it is
\begin{equation}f = {(-1)}^s \times 2^{e - b} \times (1 + m &#x2F; 2^t).\end{equation}
This is why they’re called &lt;em&gt;trailing&lt;&#x2F;em&gt; mantissa bits: the first
digit of the mantissa is determined by the exponent. When the exponent field is
zero (before applying the bias), the mantissa starts with a $0$, otherwise it
starts with a $1$. When the mantissa starts with a $0$ we call it a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Subnormal_number&quot;&gt;&lt;em&gt;subnormal&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
number. They are important because they close the otherwise relatively large
gap between zero and the first floating point number.
A nice way to get a feeling for all this is by playing around with the &lt;a href=&quot;https:&#x2F;&#x2F;evanw.github.io&#x2F;float-toy&#x2F;&quot;&gt;Float Toy&lt;&#x2F;a&gt;
app.&lt;&#x2F;p&gt;
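&lt;p&gt;To make the two decoding formulas concrete, here is a minimal sketch that decodes an &lt;code&gt;f32&lt;&#x2F;code&gt; by hand
and checks the result against the hardware value. The helper name &lt;code&gt;decode_f32&lt;&#x2F;code&gt; is hypothetical, and the
sketch assumes the input is not a &lt;code&gt;NaN&lt;&#x2F;code&gt; or an infinity. The fields are extracted with shifts and remainders,
which for these powers of two is equivalent to masking.&lt;&#x2F;p&gt;

```rust
// Hypothetical helper: decode an f32 bit pattern with the two formulas above.
// Assumes the input is neither a NaN nor an infinity.
fn decode_f32(bits: u32) -> f64 {
    let s = bits >> 31;         // 1 sign bit
    let e = (bits >> 23) % 256; // w = 8 exponent bits
    let m = bits % 0x80_0000;   // t = 23 trailing mantissa bits
    let b = 127;                // exponent bias for f32

    let sign = if s == 0 { 1.0 } else { -1.0 };
    let frac = m as f64 / 8388608.0; // m / 2^t, exact in f64

    if e == 0 {
        // Subnormal: the mantissa starts with a 0.
        sign * 2f64.powi(1 - b) * (0.0 + frac)
    } else {
        // Normal: the mantissa starts with a 1.
        sign * 2f64.powi(e as i32 - b) * (1.0 + frac)
    }
}

fn main() {
    // 1.0, the next f32 after 1.0, -3.0, and the smallest subnormal.
    for bits in [0x3f80_0000u32, 0x3f80_0001, 0xc040_0000, 0x0000_0001] {
        assert_eq!(decode_f32(bits), f32::from_bits(bits) as f64);
    }
    println!("all hand-decoded values match");
}
```

&lt;p&gt;The result is returned as an &lt;code&gt;f64&lt;&#x2F;code&gt; because every &lt;code&gt;f32&lt;&#x2F;code&gt; value is exactly representable there,
so no rounding sneaks into the check.&lt;&#x2F;p&gt;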
&lt;h3 id=&quot;imprecision&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#imprecision&quot; aria-label=&quot;Anchor link for: imprecision&quot;&gt;Imprecision&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we know all of the above, I want to explicitly state the following:
IEEE 754 floating point types define a set of exact numbers. There is no
ambiguity (except two representations for zero, $+0$ and $-0$), nor is there a
loss of precision, fuzziness, interval representations, etc. For example,
$1.0$ is represented &lt;strong&gt;exactly&lt;&#x2F;strong&gt; by the &lt;code&gt;f32&lt;&#x2F;code&gt; with value &lt;code&gt;0x3f800000&lt;&#x2F;code&gt;, and the
next bigger &lt;code&gt;f32&lt;&#x2F;code&gt; is &lt;code&gt;0x3f800001&lt;&#x2F;code&gt; with value
$${(-1)}^0 \times 2^0 \times (1 + 1 &#x2F; 2^{23}) = 1.00000011920928955078125.$$&lt;&#x2F;p&gt;
&lt;p&gt;For example in Rust:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;::from_bits(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x3f800001&lt;&#x2F;span&gt;&lt;span&gt;));
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.0000001
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Oh. Did I lie? No, it is Rust who lies:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; full = format!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;{:.1000}&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;::from_bits(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x3f800001&lt;&#x2F;span&gt;&lt;span&gt;));
&lt;&#x2F;span&gt;&lt;span&gt;println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, full.trim_end_matches(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;#39;0&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;));
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.00000011920928955078125
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is not a bug or an accident. Rust, like most programming languages,
prints the fewest digits that still guarantee the &lt;em&gt;round-trip&lt;&#x2F;em&gt;
through the same type is correct. And indeed:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;0x&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{:x}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;1.0000001&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;.parse::&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;().unwrap().to_bits());
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;0x3f800001
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, this has some nasty implications if you then parse the number as a
more precise type:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;1.0000001&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;.parse::&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;().unwrap() == &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.00000011920928955078125
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Mind you that $1.00000011920928955078125$ is exactly representable as both
a &lt;code&gt;f32&lt;&#x2F;code&gt; and &lt;code&gt;f64&lt;&#x2F;code&gt; (because &lt;code&gt;f32&lt;&#x2F;code&gt; is a strict subset of &lt;code&gt;f64&lt;&#x2F;code&gt;), yet you &lt;em&gt;lose&lt;&#x2F;em&gt;
precision by printing as an &lt;code&gt;f32&lt;&#x2F;code&gt; and parsing as an &lt;code&gt;f64&lt;&#x2F;code&gt;. The reason is that
while &lt;code&gt;1.0000001&lt;&#x2F;code&gt; is the shortest decimal number that rounds to
$1.00000011920928955078125$ in the &lt;code&gt;f32&lt;&#x2F;code&gt; floating point format, it rounds
to $$1.0000001000000000583867176828789524734020233154296875$$ instead in the &lt;code&gt;f64&lt;&#x2F;code&gt; format.&lt;&#x2F;p&gt;
&lt;p&gt;Ironically, in this
case it is more accurate to parse as an &lt;code&gt;f32&lt;&#x2F;code&gt; and then convert to &lt;code&gt;f64&lt;&#x2F;code&gt;, because
Rust guarantees the round-trip correctness:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;1.0000001&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;.parse::&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f32&lt;&#x2F;span&gt;&lt;span&gt;&amp;gt;().unwrap() as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64 &lt;&#x2F;span&gt;&lt;span&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1.00000011920928955078125
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;true
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;So &lt;code&gt;f32 -&amp;gt; f64&lt;&#x2F;code&gt; is lossless, as is &lt;code&gt;f32 -&amp;gt; String -&amp;gt; f32 -&amp;gt; f64&lt;&#x2F;code&gt;. But
&lt;code&gt;f32 -&amp;gt; String -&amp;gt; f64&lt;&#x2F;code&gt; loses precision.&lt;&#x2F;p&gt;
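&lt;p&gt;The three conversion routes can be put side by side in a small sketch. The helper names
&lt;code&gt;via_string_as_f32&lt;&#x2F;code&gt; and &lt;code&gt;via_string_as_f64&lt;&#x2F;code&gt; are hypothetical, and the sketch assumes the default
&lt;code&gt;to_string&lt;&#x2F;code&gt; formatting, which prints the shortest round-trip decimal as shown earlier.&lt;&#x2F;p&gt;

```rust
// Hypothetical helper: f32 -> String -> f32 -> f64.
fn via_string_as_f32(x: f32) -> f64 {
    let parsed: f32 = x.to_string().parse().unwrap();
    parsed as f64
}

// Hypothetical helper: f32 -> String -> f64.
fn via_string_as_f64(x: f32) -> f64 {
    x.to_string().parse().unwrap()
}

fn main() {
    let x = f32::from_bits(0x3f80_0001); // the next f32 after 1.0
    let direct = x as f64;               // f32 -> f64, lossless

    // Parsing back as f32 first recovers the exact original value.
    assert_eq!(via_string_as_f32(x), direct);
    // Parsing the short decimal directly as f64 lands on a different number.
    assert_ne!(via_string_as_f64(x), direct);
    println!("direct = {}, via f64 parse = {}", direct, via_string_as_f64(x));
}
```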
&lt;p&gt;&lt;strong&gt;It is vital to understand the above, to be able to investigate and debug floating point
problems.&lt;&#x2F;strong&gt;
Programming languages will silently round a floating point number to the nearest
representable number when you write a number in your source code, silently
round it when parsing, and silently round it when printing. The way these
languages round differs, and it can even differ depending on the type in question.
At every step of the way you are potentially being lied to.&lt;&#x2F;p&gt;
&lt;p&gt;Given how much silent rounding occurs, I do not blame you if you got the impression
that floating point is ‘fuzzy’. It provides the illusion of having a ‘real
number’ type. But in reality the underlying numbers are an &lt;strong&gt;exact&lt;&#x2F;strong&gt;, finite set of
numbers.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;mixed-type-floating-point-comparisons&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#mixed-type-floating-point-comparisons&quot; aria-label=&quot;Anchor link for: mixed-type-floating-point-comparisons&quot;&gt;Mixed-type floating point comparisons&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Why do I place so much emphasis on the exactness of the IEEE 754 floating point
numbers? Because it means that (aside from &lt;code&gt;NaN&lt;&#x2F;code&gt;s), the comparison of integers
and floats is also unambiguously well-defined. They are both, after all, exact
numbers placed on the real number line.&lt;&#x2F;p&gt;
&lt;p&gt;Before reading on, I challenge you: try to write a &lt;em&gt;correct&lt;&#x2F;em&gt; implementation of the
following function:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F;&#x2F; x &amp;lt;= y
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;is_less_eq(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;, y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    todo!()
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you want to
try in Rust, I wrote a (non-exhaustive) set of tests on the &lt;a href=&quot;https:&#x2F;&#x2F;play.rust-lang.org&#x2F;?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=a0a54f421b8160b461c1a6f3e5a830ee&quot;&gt;Rust playground&lt;&#x2F;a&gt;
you can plug your implementation into, which might show you an input that fails.
If you want to try it in a different language, remember that the programming language might lie to you by default! For example:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64 &lt;&#x2F;span&gt;&lt;span&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;58&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64 &lt;&#x2F;span&gt;&lt;span&gt;= x as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; 2^58, exactly representable.
&lt;&#x2F;span&gt;&lt;span&gt;println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{x}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt; &amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{y}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;{}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#648424;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, x as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;= y);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;288230376151711744 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;288230376151711740&lt;&#x2F;span&gt;&lt;span&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;true
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This may look like a bad comparison or conversion from &lt;code&gt;i64&lt;&#x2F;code&gt; to &lt;code&gt;f64&lt;&#x2F;code&gt;, but it isn’t. The problem
lies entirely in the rounding during formatting.&lt;&#x2F;p&gt;
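&lt;p&gt;One way to sidestep the formatting issue while experimenting is to request an explicit precision, in the same
spirit as the &lt;code&gt;{:.1000}&lt;&#x2F;code&gt; example earlier. The sketch below uses a hypothetical helper
&lt;code&gt;exact_integer_digits&lt;&#x2F;code&gt; and assumes the float holds an integral value, so that zero fractional digits lose nothing.&lt;&#x2F;p&gt;

```rust
// Hypothetical helper: print an f64 with an explicit precision of zero
// fractional digits instead of the shortest round-trip form.
// Only meaningful as an exact printout when y is integral.
fn exact_integer_digits(y: f64) -> String {
    format!("{:.0}", y)
}

fn main() {
    let x: i64 = 2i64.pow(58);
    let y = x as f64; // 2^58, exactly representable

    println!("shortest round-trip: {}", y);
    println!("explicit precision:  {}", exact_integer_digits(y));

    // With explicit precision the float prints the same digits as the integer.
    assert_eq!(exact_integer_digits(y), x.to_string());
}
```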
&lt;p&gt;The main difficulty lies in the fact that for many type combinations (e.g. &lt;code&gt;i64&lt;&#x2F;code&gt; and &lt;code&gt;f64&lt;&#x2F;code&gt;) there
does not exist a native type in the programming language that is a superset
of both. For example, $2^{1000}$ is exactly representable as an &lt;code&gt;f64&lt;&#x2F;code&gt; but not &lt;code&gt;i64&lt;&#x2F;code&gt;. And
$2^{53} + 1$ is exactly representable in &lt;code&gt;i64&lt;&#x2F;code&gt; but not &lt;code&gt;f64&lt;&#x2F;code&gt;. So we can’t simply
convert one to the other and be done with it, yet &lt;strong&gt;that is what many people do&lt;&#x2F;strong&gt;.
In fact, it’s so common ChatGPT has learned to do so:&lt;&#x2F;p&gt;
&lt;div class=&quot;big&quot;&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;ordering-numbers&#x2F;chat-gpt.png&quot; alt=&quot;A ChatGPT prompt to implement is_less_eq.&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;aside&gt;
&lt;p&gt;Asking ChatGPT to fix the bug with an explicit counterexample is
fruitless: it will blabber some nonsense about &lt;code&gt;f64::EPSILON&lt;&#x2F;code&gt; and comparing
a difference to that.&lt;&#x2F;p&gt;
&lt;&#x2F;aside&gt;
&lt;p&gt;Our above test framework shows that &lt;code&gt;x as f64 &amp;lt;= y&lt;&#x2F;code&gt; fails because we find that&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9007199254740993 &lt;&#x2F;span&gt;&lt;span&gt;as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64 &lt;&#x2F;span&gt;&lt;span&gt;&amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9007199254740992.0
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;evaluates to &lt;code&gt;true&lt;&#x2F;code&gt;, which is obviously wrong. The problem is that &lt;code&gt;9007199254740993&lt;&#x2F;code&gt; (which is
$2^{53}+1$) is not representable as &lt;code&gt;f64&lt;&#x2F;code&gt;, and gets rounded to
$2^{53}$, after which the comparison succeeds.&lt;&#x2F;p&gt;
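&lt;p&gt;The failure can be reproduced step by step. In this sketch &lt;code&gt;naive_is_less_eq&lt;&#x2F;code&gt; is a hypothetical name
for the convert-then-compare approach. The comparison is written with the operands flipped
(&lt;code&gt;y &amp;gt;= x as f64&lt;&#x2F;code&gt;), which is the same test.&lt;&#x2F;p&gt;

```rust
// Hypothetical name for the naive approach: convert x to f64, then compare.
// `y >= x as f64` is the same test as asking whether x-as-f64 is at most y.
fn naive_is_less_eq(x: i64, y: f64) -> bool {
    y >= x as f64
}

fn main() {
    let x: i64 = 9007199254740993;   // 2^53 + 1
    let y: f64 = 9007199254740992.0; // 2^53

    // The conversion silently rounds x down to 2^53.
    assert_eq!(x as f64, y);
    // So the naive comparison claims x is at most y.
    assert!(naive_is_less_eq(x, y));
    // In the integer domain the truth is visible: x is strictly larger.
    assert!(x > y as i64);
}
```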
&lt;h3 id=&quot;the-correct-implementation-for-i64-f64&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#the-correct-implementation-for-i64-f64&quot; aria-label=&quot;Anchor link for: the-correct-implementation-for-i64-f64&quot;&gt;The correct implementation for &lt;code&gt;i64&lt;&#x2F;code&gt; &amp;lt;= &lt;code&gt;f64&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The trick for implementing $i \leq f$ correctly is to perform the operation in the
integer domain after rounding the floating point number down to the nearest
integer, as for integer $i$ we have
$$  i \leq f \iff i \leq \lfloor f \rfloor.$$
We need not worry that rounding a float up or down to the nearest integer goes wrong and
skips an integer, because for IEEE 754 the &lt;code&gt;floor&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;ceil&lt;&#x2F;code&gt; functions are exact. This
is because in the part of the number line where IEEE 754 floats can still have a
fractional part, every integer is also exactly representable.&lt;&#x2F;p&gt;
&lt;p&gt;We still have to worry about our floating point value not fitting in our integer
type. Luckily when that happens our comparison is trivial. Unluckily, our integer
types have a different range in the negative and positive domains, so we still
have to be a bit careful, especially because we can not compare with $2^{63} - 1$
(the maximum &lt;code&gt;i64&lt;&#x2F;code&gt; value) in the float domain.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;is_less_eq(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;, y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y.is_nan() { &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span&gt;; }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; 2^63
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;true &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is always bigger.
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; -2^63
&lt;&#x2F;span&gt;&lt;span&gt;        x &amp;lt;= y.floor() as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is in [-2^63, 2^63)
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is always smaller.
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You might think you can get away without the &lt;code&gt;floor&lt;&#x2F;code&gt; as we convert to integer
immediately after. Alas, &lt;code&gt;as i64&lt;&#x2F;code&gt; rounds towards zero, but we need to round
towards negative infinity or else we will end up claiming &lt;code&gt;-1 &amp;lt;= -1.5&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
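&lt;p&gt;The difference between the two roundings is easy to check in isolation. The helper names
&lt;code&gt;truncated&lt;&#x2F;code&gt; and &lt;code&gt;floored&lt;&#x2F;code&gt; below are hypothetical, and the sketch only looks at values well inside the
&lt;code&gt;i64&lt;&#x2F;code&gt; range.&lt;&#x2F;p&gt;

```rust
// Hypothetical helper: plain cast, which rounds towards zero.
fn truncated(y: f64) -> i64 {
    y as i64
}

// Hypothetical helper: floor first, which rounds towards negative infinity.
fn floored(y: f64) -> i64 {
    y.floor() as i64
}

fn main() {
    // For negative fractional values the two disagree.
    assert_eq!(truncated(-1.5), -1);
    assert_eq!(floored(-1.5), -2);

    // For positive values they agree, which is why the bug is easy to miss.
    assert_eq!(truncated(1.5), 1);
    assert_eq!(floored(1.5), 1);
}
```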
&lt;h3 id=&quot;generalizing&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#generalizing&quot; aria-label=&quot;Anchor link for: generalizing&quot;&gt;Generalizing&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Ok, we can compare $i \leq f$. What about $i \geq f$? We can’t re-use the same
implementation by swapping the order of the arguments because their types are
different. We can however make a new implementation from scratch applying the same
principle, but we must use &lt;code&gt;ceil&lt;&#x2F;code&gt; instead of &lt;code&gt;floor&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;$$  i \geq f \iff i \geq \lceil f \rceil.$$&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;is_greater_eq(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;, y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y.is_nan() { &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span&gt;; }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; 2^63
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is always bigger.
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; -2^63
&lt;&#x2F;span&gt;&lt;span&gt;        x &amp;gt;= y.ceil() as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is in [-2^63, 2^63)
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;true &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is always smaller.
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;What if we want strict inequality? Now our &lt;code&gt;floor&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;ceil&lt;&#x2F;code&gt; trick introduces
problems surrounding equality. One way to solve this is with a separate check
for equality in the integer domain followed by inequality in the float domain:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;is_less(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;, y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y.is_nan() { &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span&gt;; }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; 2^63
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;true
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; -2^63
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; yf = y.floor(); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is in [-2^63, 2^63)
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; yfi = yf as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;        x &amp;lt; yfi || x == yfi &amp;amp;&amp;amp; yf &amp;lt; y
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You get the point. There might be a more clever and&#x2F;or efficient way to do this,
but at least this works.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Update on 2023-02-07&lt;&#x2F;strong&gt;: Pavel Mayorov contacted me with a suggestion for a more efficient
version of strict inequality. It works on the observation that for integer $i$ we have&lt;&#x2F;p&gt;
&lt;p&gt;$$  i &amp;lt; f \iff i &amp;lt; \lceil f \rceil.$$&lt;&#x2F;p&gt;
&lt;p&gt;That is, instead of flooring for $\leq$ we use ceiling for $&amp;lt;$.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#ffffff;color:#202020;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span&gt;is_less(x: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64&lt;&#x2F;span&gt;&lt;span&gt;, y: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;f64&lt;&#x2F;span&gt;&lt;span&gt;) -&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y.is_nan() { &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span&gt;; }
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; 2^63
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;true
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else if&lt;&#x2F;span&gt;&lt;span&gt; y &amp;gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;9223372036854775808.0 &lt;&#x2F;span&gt;&lt;span&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; -2^63
&lt;&#x2F;span&gt;&lt;span&gt;        x &amp;lt; y.ceil() as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#215da8;&quot;&gt;i64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#76647b;&quot;&gt;&#x2F;&#x2F; y is in [-2^63, 2^63)
&lt;&#x2F;span&gt;&lt;span&gt;    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#202020;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ae6018;&quot;&gt;false
&lt;&#x2F;span&gt;&lt;span&gt;    }
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
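&lt;p&gt;The same range-check structure plausibly extends to equality, which is not covered above. The following is
only a sketch under that assumption: &lt;code&gt;is_eq&lt;&#x2F;code&gt; is a hypothetical name, and it relies on the fact that an integer
can only equal a float that has no fractional part and lies in $[-2^{63}, 2^{63})$.&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch of x == y for i64 and f64, following the same
// range checks as the functions above.
fn is_eq(x: i64, y: f64) -> bool {
    if y.is_nan() || y.fract() != 0.0 {
        // NaNs, infinities and fractional values never equal an integer.
        return false;
    }
    if y >= 9223372036854775808.0 { // 2^63
        false // y is bigger than every i64.
    } else if y >= -9223372036854775808.0 { // -2^63
        x == y as i64 // y is integral and in [-2^63, 2^63), so the cast is exact.
    } else {
        false // y is smaller than every i64.
    }
}

fn main() {
    assert!(is_eq(3, 3.0));
    assert!(!is_eq(3, 3.5));
    assert!(is_eq(i64::MIN, -9223372036854775808.0));
    // Converting i64::MAX to f64 would round up to 2^63 and wrongly compare equal.
    assert!(!is_eq(i64::MAX, 9223372036854775808.0));
    assert!(!is_eq(0, f64::NAN));
}
```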
&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a class=&quot;anchor&quot; href=&quot;#conclusion&quot; aria-label=&quot;Anchor link for: conclusion&quot;&gt;Conclusion&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;So, &lt;em&gt;ordering numbers, how hard can it be&lt;&#x2F;em&gt;? Pretty damn hard I would say, if
your language does not support it natively. Of the variety of people I challenged
to write a correct implementation of &lt;code&gt;is_less_eq&lt;&#x2F;code&gt;, no one got it right on their
first try. And that’s after already explicitly being told that the challenge is
to do it correctly for all inputs. I quote the CPython source code:
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;python&#x2F;cpython&#x2F;blob&#x2F;135ec7cefbaffd516b77362ad2b2ad1025af462e&#x2F;Objects&#x2F;floatobject.c#L397&quot;&gt;“comparison is pretty much a nightmare.”&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Of all the languages I looked at that have distinct integer and floating point
types, Python, Julia, Ruby and Go get this right. Good job! Some warn you or
disallow cross-type comparisons by default, but Kotlin for example will straight
up tell you that &lt;code&gt;9007199254740993 &amp;lt;= 9007199254740992.0&lt;&#x2F;code&gt; is &lt;code&gt;true&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For Rust I’ve made the &lt;a href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;num-ord&quot;&gt;&lt;code&gt;num-ord&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; crate, which
for now lets you compare any two built-in numeric types correctly. But I
would love to see Rust (and other languages) adopt an approach where this is done right
natively. Because if it isn’t, people have to do it correctly themselves, which
they won’t.&lt;&#x2F;p&gt;
</content>
	</entry>
</feed>