Taming Floating-Point Sums

Orson R. L. Peters — 2024-05-25T00:00:00+00:00
Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust:

    fn naive_sum(arr: &[f32]) -> f32 {
        let mut out = 0.0;
        for x in arr {
            out += *x;
        }
        out
    }

This however can easily result in an arbitrarily large accumulated error. Let’s try it out:

    naive_sum(&vec![1.0;     1_000_000]) =  1000000.0
    naive_sum(&vec![1.0;    10_000_000]) = 10000000.0
    naive_sum(&vec![1.0;   100_000_000]) = 16777216.0
    naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0

Uh-oh… What happened? When you compute $a + b$ the result must be rounded to
the nearest representable floating-point number, breaking ties towards the
number with an even mantissa. The problem is that the next 32-bit floating-point
number after `16777216` is `16777218`. In this case that means `16777216 + 1`
rounds back to `16777216` again. We’re stuck.
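The rounding behavior described above can be checked directly. The following is a minimal sketch; the helper names `next_f32_up` and `absorbs_one` are made up for this illustration and are not part of any library.

```rust
/// Returns the next representable f32 above `x` by incrementing its bit
/// pattern. Assumes `x` is positive, finite and not f32::MAX.
fn next_f32_up(x: f32) -> f32 {
    f32::from_bits(x.to_bits() + 1)
}

/// True if adding 1.0 to `x` rounds straight back to `x`.
fn absorbs_one(x: f32) -> bool {
    x + 1.0 == x
}
```

With these helpers, `next_f32_up(16_777_216.0)` is `16_777_218.0`, and `absorbs_one(16_777_216.0)` holds while `absorbs_one(16_777_215.0)` does not, which is exactly the point where the naive sum of ones gets stuck.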
Luckily, there are better ways to sum an array.

Pairwise summation

A method that’s a bit more clever is to use pairwise summation.
Instead of a completely linear sum with a single accumulator it recursively
sums an array by splitting the array in half, summing the halves, and then adding
the sums.

    fn pairwise_sum(arr: &[f32]) -> f32 {
        if arr.len() == 0 { return 0.0; }
        if arr.len() == 1 { return arr[0]; }
        let (first, second) = arr.split_at(arr.len() / 2);
        pairwise_sum(first) + pairwise_sum(second)
    }

This is more accurate:

    pairwise_sum(&vec![1.0;     1_000_000]) =    1000000.0
    pairwise_sum(&vec![1.0;    10_000_000]) =   10000000.0
    pairwise_sum(&vec![1.0;   100_000_000]) =  100000000.0
    pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0

However, this is rather slow. To get a summation routine that goes as fast as
possible while still being reasonably accurate we should not recurse down
all the way to length-1 arrays, as this gives too much call overhead. We can
still use our naive sum for small sizes, and only recurse on large sizes.
This does make our worst-case error worse by a constant factor, but in turn
makes the pairwise sum almost as fast as a naive sum.

By choosing the splitpoint as a multiple of 256 we ensure that the base case in
the recursion always has exactly 256 elements except on the very last block.
This ensures we use the optimal reduction and always correctly predict
the loop condition. This small detail ended up improving the throughput by 40%
for large arrays!

    fn block_pairwise_sum(arr: &[f32]) -> f32 {
        if arr.len() > 256 {
            let split = (arr.len() / 2).next_multiple_of(256);
            let (first, second) = arr.split_at(split);
            block_pairwise_sum(first) + block_pairwise_sum(second)
        } else {
            naive_sum(arr)
        }
    }
Kahan summation

The worst-case round-off error of naive summation scales with $O(n \epsilon)$
when summing $n$ elements, where $\epsilon$ is the machine epsilon of your
floating-point type (here $2^{-24}$). Pairwise summation improves this to
$O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this
further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only
the $\epsilon^2$ term which is negligible unless you sum a very large number of
values.

All of these bounds scale with $\sum_i |x_i|$, so the worst-case absolute error
bound is still quadratic in terms of $n$ even for Kahan summation.

In practice all summation algorithms do significantly better than their
worst-case bounds, as in most scenarios the errors do not exclusively
round up or down, but cancel each other out on average.

    pub fn kahan_sum(arr: &[f32]) -> f32 {
        let mut sum = 0.0;
        let mut c = 0.0;
        for x in arr {
            let y = *x - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum
    }

The Kahan summation works by maintaining the sum in two registers, the actual
bulk sum and a small error correcting term $c$. If you were using infinitely
precise arithmetic $c$ would always be zero, but with floating-point it might
not be. The downside is that each number now takes four operations to add to the
sum instead of just one.
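To see the compensation term at work, it may help to rerun the all-ones experiment past the point where the naive sum gets stuck. The sketch below repeats both routines so that it is self-contained; the helper `sum_of_ones` is hypothetical and exists only for this comparison.

```rust
fn naive_sum(arr: &[f32]) -> f32 {
    let mut out = 0.0_f32;
    for x in arr {
        out += *x;
    }
    out
}

fn kahan_sum(arr: &[f32]) -> f32 {
    let mut sum = 0.0_f32;
    let mut c = 0.0_f32; // running compensation for lost low-order bits
    for x in arr {
        let y = *x - c; // apply the correction from the previous step
        let t = sum + y; // low-order bits of y may be lost here
        c = (t - sum) - y; // recover what was lost (with opposite sign)
        sum = t;
    }
    sum
}

/// Sums `n` copies of 1.0 with both methods: (naive result, Kahan result).
fn sum_of_ones(n: usize) -> (f32, f32) {
    let ones = vec![1.0_f32; n];
    (naive_sum(&ones), kahan_sum(&ones))
}
```

For `sum_of_ones(20_000_000)` the naive result stalls at `16777216.0`, while the Kahan result is `20000000.0`: every time an addition of 1 is rounded away, `c` remembers it and the next iteration adds it back.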
To mitigate this we can do something similar to what we did with the pairwise
summation. We can first accumulate blocks into sums naively before combining the
block sums with Kahan summation to reduce overhead at the cost of accuracy:

    pub fn block_kahan_sum(arr: &[f32]) -> f32 {
        let mut sum = 0.0;
        let mut c = 0.0;
        for chunk in arr.chunks(256) {
            let x = naive_sum(chunk);
            let y = x - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum
    }
Exact summation

I know of at least two general methods to produce the correctly-rounded sum of a sequence
of floating-point numbers. That is, they logically compute the sum with
infinite precision before rounding it back to a floating-point value at the end.

The first method is based on the 2Sum
primitive which is an error-free transform from two numbers $x, y$ to $s, t$
such that $x + y = s + t$, where $t$ is a small error. By applying this
repeatedly until the errors vanish you can get a correctly-rounded sum.
Keeping track of what to add in what order can be tricky, and the worst case
requires $O(n^2)$ additions to make all the terms vanish. This is what’s
implemented in Python’s `math.fsum`
and in the Rust crate `fsum`, which use
extra memory to keep the partial sums around. The `accurate` crate also
implements this using in-place mutation in `i_fast_sum_in_place`.
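For reference, the 2Sum primitive itself is only a handful of operations. The following is a sketch of the usual branch-free formulation (often attributed to Knuth); it is not taken from any of the crates mentioned above, and it assumes no overflow occurs.

```rust
/// Error-free transform: returns (s, t) with s = fl(x + y) and
/// x + y = s + t exactly, assuming no overflow.
fn two_sum(x: f32, y: f32) -> (f32, f32) {
    let s = x + y;
    let x_virtual = s - y; // the part of s that came from x
    let y_virtual = s - x_virtual; // the part of s that came from y
    let x_error = x - x_virtual;
    let y_error = y - y_virtual;
    (s, x_error + y_error)
}
```

For example, `two_sum(16_777_216.0, 1.0)` returns `(16_777_216.0, 1.0)`: the rounded sum drops the `1`, and the error term hands it back.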
Another method is to keep a large buffer of integers around, one per exponent.
Then when adding a floating-point number you decompose it into an exponent
and mantissa, and add the mantissa to the corresponding integer in the buffer.
If the integer `buf[i]` overflows you increment the integer in `buf[i + w]`,
where `w` is the width of your integer.

This can actually compute a completely exact sum, without any rounding at all,
and is effectively just an overly permissive representation of a fixed-point
number optimized for accumulating floats. This latter method is $O(n)$ time, but
uses a large but constant amount of memory ($\approx$ 1 KB for `f32`, $\approx$
16 KB for `f64`). An advantage of this method is that it’s also an online
algorithm - both adding a number to the sum and getting the current total are
amortized $O(1)$.
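A heavily simplified sketch of the per-exponent buffer idea is shown below. It is an illustration under several assumptions rather than a faithful implementation: `decompose`, `BucketSum` and `total_f64` are hypothetical names, inputs must be finite, overflow carries between buckets are left out (each `i64` bucket has room for roughly $2^{39}$ additions), and the readout goes through `f64` for convenience instead of producing a correctly-rounded `f32`.

```rust
/// Splits a finite f32 into (signed integer mantissa, exponent) such that
/// x == mantissa * 2^exponent exactly.
fn decompose(x: f32) -> (i64, i32) {
    assert!(x.is_finite(), "this sketch only handles finite inputs");
    let bits = x.to_bits();
    let exp_field = ((bits >> 23) & 0xff) as i32;
    let fraction = (bits & 0x007f_ffff) as i64;
    // Normal numbers carry an implicit leading 1 bit; subnormals do not.
    let (mantissa, exponent) = if exp_field == 0 {
        (fraction, -149)
    } else {
        (fraction | 0x0080_0000, exp_field - 150)
    };
    if bits >> 31 == 1 { (-mantissa, exponent) } else { (mantissa, exponent) }
}

/// One integer accumulator per exponent. Carry propagation is omitted.
struct BucketSum {
    buckets: [i64; 256],
}

impl BucketSum {
    fn new() -> Self {
        BucketSum { buckets: [0; 256] }
    }

    fn add(&mut self, x: f32) {
        let (mantissa, exponent) = decompose(x);
        // Exponents range from -149 to 104, so shift them to start at 0.
        self.buckets[(exponent + 149) as usize] += mantissa;
    }

    /// Convenience readout in f64. This step rounds, so it is NOT the
    /// exact or correctly-rounded result described in the text.
    fn total_f64(&self) -> f64 {
        self.buckets
            .iter()
            .enumerate()
            .map(|(i, &m)| {
                let exponent = i as i32 - 149;
                // Build 2^exponent directly from its bit pattern.
                let scale = f64::from_bits(((exponent + 1023) as u64) << 52);
                m as f64 * scale
            })
            .sum()
    }
}
```

The important property is that `add` never rounds: it is a plain integer addition, which is why the order of the inputs does not matter for this representation.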
A variant of this method is implemented in the `accurate` crate as
`OnlineExactSum`, which uses floats instead of integers for the buffer.
Unleashing the compiler

Besides accuracy, there is another problem with `naive_sum`. The Rust compiler
is not allowed to reorder floating-point additions, because floating-point
addition is not associative. So it cannot autovectorize the `naive_sum` to use
SIMD instructions to compute the sum, nor use instruction-level parallelism.

To solve this there are compiler intrinsics in Rust that do float sums while
allowing associativity, such as `std::intrinsics::fadd_fast`.
However, these instructions are incredibly dangerous, as they assume that both
the input and output are finite numbers (no infinities, no NaNs), or otherwise
they are undefined behavior. This functionally makes them unusable, as only in
the most restricted scenarios when computing a sum do you know that all inputs
are finite numbers, and that their sum cannot overflow.

I recently uttered my annoyance with these operators to Ben Kimock, and together
we proposed (and he implemented) a new set of operators:
`std::intrinsics::fadd_algebraic` and friends.

I proposed we call the operators algebraic, as they allow (in theory) any
transformation that is justified by real algebra. For example, substituting
${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3.}$
In general these operators are treated as-if they are done using real numbers,
and can map to any set of floating-point instructions that would be equivalent
to the original expression, assuming the floating-point instructions would be
exact.

Note that the real numbers do not contain NaNs or infinities, so these operators
assume those do not exist for the validity of transformations, however it is not
undefined behavior when you do encounter those values.

They also allow fused multiply-add
instructions to be generated, as under real arithmetic $\operatorname{fma}(a, b, c) = ab + c.$
Using those new instructions it is trivial to generate an autovectorized sum:

    #![allow(internal_features)]
    #![feature(core_intrinsics)]
    use std::intrinsics::fadd_algebraic;

    fn naive_sum_autovec(arr: &[f32]) -> f32 {
        let mut out = 0.0;
        for x in arr {
            out = fadd_algebraic(out, *x);
        }
        out
    }

If we compile with `-C target-cpu=broadwell` we see that the compiler
automatically generated the following tight loop for us, using 4 accumulators
and AVX2 instructions:

    .LBB0_5:
        vaddps  ymm0, ymm0, ymmword ptr [rdi + 4*r8]
        vaddps  ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32]
        vaddps  ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64]
        vaddps  ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96]
        add     r8, 32
        cmp     rdx, r8
        jne     .LBB0_5

This will process 128 bytes of floating-point data (so 32 elements) in 7
instructions. Additionally, all the `vaddps` instructions are independent of
each other as they accumulate to different registers. If we analyze this with
uiCA we see that it estimates the above loop to take
4 cycles to complete, processing 32 bytes / cycle. At 4 GHz that’s up to 128 GB/s!
Note that that’s way above what my machine’s RAM bandwidth is, so you will only
achieve that speed when summing data that is already in cache.
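Since `fadd_algebraic` requires a nightly compiler, it can be instructive to write out by hand the kind of reordering the compiler performs. The sketch below runs on stable Rust and uses eight independent scalar accumulators; the function name and the lane count are arbitrary choices for illustration, and this is not the code the compiler generated above (which uses four 8-wide vector accumulators).

```rust
/// Sums `arr` with several independent accumulators, so that consecutive
/// additions do not depend on each other. This changes the order of the
/// additions compared to a plain left-to-right sum.
fn multi_accumulator_sum(arr: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0_f32; LANES];
    let mut chunks = arr.chunks_exact(LANES);
    for chunk in &mut chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }
    // Combine the lanes, then add the leftover elements one by one.
    let mut out: f32 = acc.iter().sum();
    for x in chunks.remainder() {
        out += *x;
    }
    out
}
```

Because each lane only sees every eighth element, each partial sum stays smaller for longer, which is the same effect that makes the reordered sums more accurate than the single long chain of additions.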
With this in mind we can also easily define `block_pairwise_sum_autovec` and
`block_kahan_sum_autovec` by replacing their calls to `naive_sum` with
`naive_sum_autovec`.

Accuracy and speed

Let’s take a look at how the different summation methods compare. As a
relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from
-100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on
my AMD Threadripper 2950X.

All the code is available on GitHub.
Compiled with `RUSTFLAGS=-C target-cpu=native` and `--release` I get the
following results:

| Algorithm | Throughput | Mean absolute error |
| --- | --- | --- |
| `naive` | 5.5 GB/s | 71.796 |
| `pairwise` | 0.9 GB/s | 1.5528 |
| `kahan` | 1.4 GB/s | 0.2229 |
| `block_pairwise` | 5.8 GB/s | 3.8597 |
| `block_kahan` | 5.9 GB/s | 4.2184 |
| `naive_autovec` | 118.6 GB/s | 14.538 |
| `block_pairwise_autovec` | 71.7 GB/s | 1.6132 |
| `block_kahan_autovec` | 98.0 GB/s | 1.2306 |
| `crate_accurate_buffer` | 1.1 GB/s | 0.0015 |
| `crate_accurate_inplace` | 1.9 GB/s | 0.0015 |
| `crate_fsum` | 1.2 GB/s | 0.0000 |

The reason the `accurate` crate has a non-zero absolute error is because it
currently does not implement rounding to nearest
correctly, so it can be off by one unit in the last place for the final result.
First I’d like to note that there’s more than a 100x performance difference
between the fastest and slowest method. For summing an array! Now this might not
be entirely fair as the slowest methods are computing something significantly
harder, but there’s still a 20x performance difference between a seemingly
reasonable naive implementation and the fastest one.

We find that in general the `_autovec` methods that use `fadd_algebraic` are
faster and more accurate than the ones using regular floating-point addition.
The reason they’re more accurate as well is the same reason a pairwise sum is
more accurate: any reordering of the additions is an improvement, as the default
long chain of additions is already the worst case for accuracy in a sum.

Limiting ourselves to Pareto-optimal choices
we get the following four implementations:

| Algorithm | Throughput | Mean absolute error |
| --- | --- | --- |
| `naive_autovec` | 118.6 GB/s | 14.538 |
| `block_kahan_autovec` | 98.0 GB/s | 1.2306 |
| `crate_accurate_inplace` | 1.9 GB/s | 0.0015 |
| `crate_fsum` | 1.2 GB/s | 0.0000 |

Note that implementation differences can be quite impactful, and there are
likely dozens more methods of compensated summing I did not compare here.

For most cases I think `block_kahan_autovec` wins here, having good accuracy
(that doesn’t degenerate with larger inputs) at nearly the maximum speed. For
most applications the extra accuracy from the correctly-rounded sums is
unnecessary, and they are 50-100x slower.
By splitting the loop up into an explicit remainder plus a tight loop of
256-element sums we can squeeze out a bit more performance, and avoid a couple
floating-point ops for the last chunk:

    #![allow(internal_features)]
    #![feature(core_intrinsics)]
    use std::intrinsics::fadd_algebraic;

    fn sum_block(arr: &[f32]) -> f32 {
        arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y))
    }

    pub fn sum_orlp(arr: &[f32]) -> f32 {
        let mut chunks = arr.chunks_exact(256);
        let mut sum = 0.0;
        let mut c = 0.0;
        for chunk in &mut chunks {
            let y = sum_block(chunk) - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum + (sum_block(chunks.remainder()) - c)
    }

| Algorithm | Throughput | Mean absolute error |
| --- | --- | --- |
| `sum_orlp` | 112.2 GB/s | 1.2306 |

You can of course tweak the number 256, I found that using 128 was $\approx$ 20%
slower, and that 512 didn’t really improve performance but did cost accuracy.
Conclusion

I think the `fadd_algebraic` and similar algebraic intrinsics are very useful
for achieving high-speed floating-point routines, and that other languages
should add them as well. A global `-ffast-math` is not good enough: as we’ve
seen above, the best implementation was a hybrid between automatically optimized
math for speed, and manually implemented non-associative compensated operations.

Finally, if you are using LLVM, beware of `-ffast-math`. It is undefined
behavior to produce a NaN or infinity while that flag is set in LLVM. I have
no idea why they chose this hardcore stance which makes virtually every program
that uses it unsound. If you are targeting LLVM with your language, avoid the
`nnan` and `ninf` fast-math flags.


Extracting and Depositing Bits
Orson R. L. Peters — 2024-01-13T00:00:00+00:00
Suppose you have a 64-bit word and wish to extract a couple bits from it.
For example you just performed a SWAR
algorithm and wish to extract the least significant bit of each byte in the `u64`.
This is simple enough, you simply perform a binary AND with a mask of
the bits you wish to keep:

    let out = word & 0x0101010101010101;

However, this still leaves the bits of interest spread throughout the 64-bit
word. What if we also want to compress the 8 bits we wish to extract into a
single byte? Or what if we want the inverse, spreading the 8 bits of a byte
among the least significant bits of each byte in a 64-bit word?
`PEXT` and `PDEP`

If you are using a modern x86-64 CPU, you are in luck. In the much underrated
BMI instruction set there
are two very powerful instructions: `PDEP` and `PEXT`. They are inverses of each
other, `PEXT` extracts bits, `PDEP` deposits bits.

`PEXT` takes in a word and a
mask, takes just those bits from the word where the mask has a 1 bit, and
compresses all selected bits to a contiguous output word. Simulated in Rust
this would be:

    fn pext64(word: u64, mask: u64) -> u64 {
        let mut out = 0;
        let mut out_idx = 0;

        for i in 0..64 {
            let ith_mask_bit = (mask >> i) & 1;
            let ith_word_bit = (word >> i) & 1;
            if ith_mask_bit == 1 {
                out |= ith_word_bit << out_idx;
                out_idx += 1;
            }
        }

        out
    }

For example if you had the bitstring `abcdefgh` and mask `10110001` you would
get output bitstring `0000acdh`.
`PDEP` is exactly its inverse, it takes contiguous data bits as a word, and
a mask, and deposits the data bits one-by-one (starting at the least significant
bits) into those bits where the mask
has a 1 bit, leaving the rest as zeros:

    fn pdep64(word: u64, mask: u64) -> u64 {
        let mut out = 0;
        let mut input_idx = 0;

        for i in 0..64 {
            let ith_mask_bit = (mask >> i) & 1;
            if ith_mask_bit == 1 {
                let next_word_bit = (word >> input_idx) & 1;
                out |= next_word_bit << i;
                input_idx += 1;
            }
        }

        out
    }

So if you had the bitstring `abcdefgh` and mask `10100110` you would get output
`e0f00gh0` (recall that we traditionally write bitstrings with the least
significant bit on the right).
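To make the two bitstring examples concrete, the sketch below repeats the two simulation loops (with explicit types so that it stands alone) and adds a small hypothetical helper, `roundtrip_keeps_masked_bits`, that checks in which sense the two operations are inverses: extracting with a mask and depositing with the same mask gives back exactly the masked bits of the input.

```rust
fn pext64(word: u64, mask: u64) -> u64 {
    let mut out: u64 = 0;
    let mut out_idx: u32 = 0;
    for i in 0..64 {
        if (mask >> i) & 1 == 1 {
            out |= ((word >> i) & 1) << out_idx;
            out_idx += 1;
        }
    }
    out
}

fn pdep64(word: u64, mask: u64) -> u64 {
    let mut out: u64 = 0;
    let mut input_idx: u32 = 0;
    for i in 0..64 {
        if (mask >> i) & 1 == 1 {
            out |= ((word >> input_idx) & 1) << i;
            input_idx += 1;
        }
    }
    out
}

/// Depositing what was just extracted restores the selected bits.
fn roundtrip_keeps_masked_bits(word: u64, mask: u64) -> bool {
    pdep64(pext64(word, mask), mask) == word & mask
}
```

With concrete bits, `pext64(0b1011_0110, 0b1011_0001)` selects `a`, `c`, `d`, `h` = `1`, `1`, `1`, `0` and yields `0b1110`, and `pdep64(0b1011, 0b1010_0110)` places `e`, `f`, `g`, `h` = `1`, `0`, `1`, `1` to yield `0b1000_0110`.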
These instructions are incredibly powerful and flexible, and the amazing thing
is that these instructions only take a single cycle on modern Intel and AMD
CPUs! However, they are not available in other instruction sets, so whenever you
use them you will also likely need to write a cross-platform alternative.

Unfortunately, both `PDEP` and `PEXT` are very slow
on AMD Zen and Zen2. They are implemented in microcode, which is really unfortunate.
The platform advertises through CPUID that
the instructions are supported, but they’re almost unusably slow.
Use with caution.
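One possible shape for such a cross-platform alternative is sketched below: use the hardware instruction through `core::arch::x86_64::_pext_u64` when the CPU reports BMI2 support at runtime, and fall back to a portable loop otherwise. The function names are hypothetical. Note that runtime detection only answers whether the instruction exists, so by itself it does not avoid the slow microcoded path on Zen and Zen2 mentioned above.

```rust
/// Portable fallback that visits only the set bits of the mask.
fn pext64_portable(word: u64, mut mask: u64) -> u64 {
    let mut out: u64 = 0;
    let mut out_idx: u32 = 0;
    while mask != 0 {
        let i = mask.trailing_zeros();
        out |= ((word >> i) & 1) << out_idx;
        out_idx += 1;
        mask &= mask - 1; // clear the lowest set bit
    }
    out
}

/// Hardware version; only callable when BMI2 is known to be available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn pext64_bmi2(word: u64, mask: u64) -> u64 {
    core::arch::x86_64::_pext_u64(word, mask)
}

/// Picks the hardware instruction if present, else the portable loop.
pub fn pext64_dispatch(word: u64, mask: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("bmi2") {
            // SAFETY: the runtime check above confirmed BMI2 support.
            return unsafe { pext64_bmi2(word, mask) };
        }
    }
    pext64_portable(word, mask)
}
```

In a hot loop you would likely hoist the feature check out of the inner call, for example by selecting a function pointer once at startup.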
Extracting bits with multiplication

While the following technique can’t replace all `PEXT` cases, it can be quite
general. It is applicable when:

1. The bit pattern you want to extract is static and known in advance.
2. If you want to extract $k$ bits, there must be a gap of at least $k-1$ bits
   between any two bits of interest.

We compute the bit extraction by adding together many left-shifted copies of
our input word, such that we construct our desired bit pattern in the uppermost
bits. The trick is to then realize that `w << i` is equivalent to `w * (1 << i)`
and thus the sum of many left-shifted copies is equivalent to a single
multiplication by `(1 << i) + (1 << j) + ...`
I think the technique is best understood by visual example. Let’s use our
example from earlier, extracting the least significant bit of each byte in a
64-bit word. We start off by masking off just those bits. After that we shift
the most significant bit of interest to the topmost bit of the word to get our
first shifted copy. We then repeat this, shifting the second most significant
bit of interest to the second topmost bit, etc. We sum all these shifted copies.
This results in the following (using underscores instead of zeros for clarity):

    mask    = _______1_______1_______1_______1_______1_______1_______1_______1
    t       = w & mask
    t       = _______a_______b_______c_______d_______e_______f_______g_______h

    t << 7  = a_______b_______c_______d_______e_______f_______g_______h_______
    t << 14 = _b_______c_______d_______e_______f_______g_______h______________
    t << 21 = __c_______d_______e_______f_______g_______h_____________________
    t << 28 = ___d_______e_______f_______g_______h____________________________
    t << 35 = ____e_______f_______g_______h___________________________________
    t << 42 = _____f_______g_______h__________________________________________
    t << 49 = ______g_______h_________________________________________________
    t << 56 = _______h________________________________________________________
        sum = abcdefghbcdefgh_cdefgh__defgh___efgh____fgh_____gh______h_______

Note how we constructed `abcdefgh` in the topmost 8 bits, which we can then
extract using a single right-shift by $64 - 8 = 56$ bits. Since
`(1 << 7) + (1 << 14) + ... + (1 << 56) = 0x102040810204080` we get the
following implementation:

    fn extract_lsb_bit_per_byte(w: u64) -> u8 {
        let mask = 0x0101010101010101;
        let sum_of_shifts = 0x102040810204080;
        ((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8
    }

Not as good as `PEXT`, but three arithmetic instructions is not bad at all.
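As a sanity check of the diagram, the sketch below computes the same result three ways: with the single multiplication, by literally summing the eight shifted copies, and with a naive per-byte loop. The two extra function names are hypothetical and only serve this comparison.

```rust
fn extract_lsb_bit_per_byte(w: u64) -> u8 {
    let mask: u64 = 0x0101010101010101;
    let sum_of_shifts: u64 = 0x0102040810204080;
    ((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8
}

/// The same computation written as an explicit sum of shifted copies:
/// t << 7, t << 14, ..., t << 56.
fn extract_by_shifted_copies(w: u64) -> u8 {
    let t = w & 0x0101010101010101;
    let mut sum: u64 = 0;
    for k in 1..=8 {
        sum = sum.wrapping_add(t << (7 * k));
    }
    (sum >> 56) as u8
}

/// Reference version: bit k of the output is the lowest bit of byte k.
fn extract_naive(w: u64) -> u8 {
    let mut out: u8 = 0;
    for byte in 0..8 {
        let lsb = ((w >> (8 * byte)) & 1) as u8;
        out |= lsb << byte;
    }
    out
}
```

All three agree because, within the top byte, every column of the sum receives at most one bit, so no carries can disturb the result.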
Depositing bits with multiplication

Unfortunately the following technique is significantly less general than the
previous one. While you can take inspiration from it to implement similar
algorithms, as-is it is limited to just spreading the bits of one byte to the
least significant bit of each byte in a 64-bit word.

The trick is similar to the one above. We add 8 shifted copies of
our byte which once again translates to a multiplication. By choosing a shift
that increases in multiples of 9 instead of 8 we ensure that the bit pattern
shifts over by one position in each byte. We then mask out our bits of interest,
and finish off with a shift and byteswap (which compiles to a single instruction `bswap`
on Intel or `rev` on ARM) to put our output bits on the least significant bits
and reverse the order.

This technique visualized:

    b       = ________________________________________________________abcdefgh
    b <<  9 = _______________________________________________abcdefgh_________
    b << 18 = ______________________________________abcdefgh__________________
    b << 27 = _____________________________abcdefgh___________________________
    b << 36 = ____________________abcdefgh____________________________________
    b << 45 = ___________abcdefgh_____________________________________________
    b << 54 = __abcdefgh______________________________________________________
    b << 63 = h_______________________________________________________________
        sum = h_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh
       mask = 1_______1_______1_______1_______1_______1_______1_______1_______
    s & msk = h_______g_______f_______e_______d_______c_______b_______a_______

We once again note that the sum of shifts can be precomputed as `1 + (1 << 9) + ... + (1 << 63) = 0x8040201008040201`, allowing the following implementation:

    fn deposit_lsb_bit_per_byte(b: u8) -> u64 {
        let sum_of_shifts = 0x8040201008040201;
        let mask = 0x8080808080808080;
        let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask;
        u64::swap_bytes(spread >> 7)
    }

This time it required 4 arithmetic instructions, not quite as good as `PDEP`,
but again not bad compared to a naive implementation, and this is cross-platform.
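Because a byte has only 256 possible values, the deposit routine can be checked exhaustively. The sketch below compares it against a naive loop and confirms that the multiplication-based extraction undoes it; `deposit_naive` is a hypothetical reference helper.

```rust
fn deposit_lsb_bit_per_byte(b: u8) -> u64 {
    let sum_of_shifts: u64 = 0x8040201008040201;
    let mask: u64 = 0x8080808080808080;
    let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask;
    u64::swap_bytes(spread >> 7)
}

fn extract_lsb_bit_per_byte(w: u64) -> u8 {
    let mask: u64 = 0x0101010101010101;
    let sum_of_shifts: u64 = 0x0102040810204080;
    ((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8
}

/// Reference version: bit k of `b` becomes the lowest bit of byte k.
fn deposit_naive(b: u8) -> u64 {
    let mut out: u64 = 0;
    for k in 0..8 {
        out |= (((b >> k) & 1) as u64) << (8 * k);
    }
    out
}
```

For instance, `deposit_lsb_bit_per_byte(0b0000_0001)` is `0x0000_0000_0000_0001` and `deposit_lsb_bit_per_byte(0b1000_0000)` is `0x0100_0000_0000_0000`.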


When Random Isn't
Orson R. L. Peters — 2024-01-10T00:00:00+00:00
This post is an anecdote from over a decade ago, of which I lost the actual
code. So please forgive me if I do not accurately remember all the details. Some
details are also simplified so that anyone that likes computer security can
enjoy this article, not just those who have played World of Warcraft (although
the Venn diagram of those two
groups likely has a solid overlap).

When I was around 14 years old I discovered World of
Warcraft, developed by Blizzard
Entertainment, and was immediately hooked. Not long after I discovered add-ons which allow
you to modify how your game’s user interface looks and works. However, not all
add-ons I downloaded did exactly what I wanted them to do. I wanted more. So I went to
find out how they were made.

In a weird twist of fate, I blame World of Warcraft for me seriously picking up
programming. It turned out that add-ons were made in the
Lua programming language. Add-ons were nothing more than
a couple `.lua` source files in a folder directly loaded into the game. The
barrier of entry was incredibly low: just edit a file, press save and reload the
interface. The fact that the game loaded your source code and you could see it
running was magical!

I enjoyed it immensely and in no time I was only writing add-ons and was barely playing
the game itself anymore. I published quite a few
add-ons in the next
two years, which mostly involved copying other people’s code with some
refactoring / recombining / tweaking to my wishes.
Add-on security

A thought you might have is that it’s a really bad idea to let users have fully
programmable add-ons in your game, lest you get bots. However, the system
Blizzard made to prevent arbitrary programmable actions was quite clever.
Naturally, it did nothing to prevent actual botting, but at least
regular rule-abiding players were fundamentally restricted to the automation
Blizzard allowed.

Most UI elements that you could create were strictly decorative or
informational. These were completely unrestricted, as were most APIs that
strictly gather information. For example you can make a health bar display
using two frames, a background and a foreground, sizing the foreground
frame using an API call to get the health of your character.

Not all API calls were available to you however. Some were protected so they
could only be called from official Blizzard code. These typically involved
the API calls that would move your character, cast spells, use items, etc.
Generally speaking anything that actually makes you perform an in-game action
was protected.

The API for getting your exact world location and camera orientation also became
protected at some point. This was a reaction by Blizzard to new add-ons that were
actively drawing 3D elements on top of the game world to make boss fights easier.

However, some UI elements needed to actually interact with the game itself, e.g.
if I want to make a button that casts a certain spell. For this you could
construct a special kind of button that executes code in a secure environment
when clicked. You were only allowed to create/destroy/move such buttons when not
in combat, so you couldn’t simply conditionally place such buttons underneath
your cursor to automate actions during combat.

The catch was that this secure environment
did allow you to programmatically set which spell to cast, but didn’t
let you gather the information you would need to do arbitrary automation. All
access to state from outside the secure environment was blocked. There were
some information gathering API calls available to match the more accessible
in-game macro system, but nothing as fancy as getting skill cooldowns or
unit health which would enable automatic optimal spellcasting.

So there were two environments: an insecure one where you can get all
information but can’t act on it, and a secure one where you can act but can’t
get the information needed for automation.
A backdoor channel

Fast forward a couple years and I had mostly stopped playing. My interests had
mainly moved on to more “serious” programming, and I was only occasionally
playing, mostly messing around with add-on ideas. But this secure environment kept
on nagging in my brain; I wanted to break it.

Of course there was third-party
software that completely disables the security restrictions from Blizzard, but
what’s the fun in that? I wanted to do it “legitimately”, using the technically
allowed tools, as a challenge.

Obviously using clever code to bypass security restrictions is no better than
using third-party software, and both would likely get you banned. I never
actually wanted to use the code, just to see if I could make it work.

So I scanned the list of functions allowed in the secure environment to see if I could smuggle any
information from the outside into the secure environment. It all seemed pretty
hopeless until I saw one tiny, innocent little function: `random`.

An evil idea came into my head: random number generators (RNGs) used in computers are almost
always pseudorandom number generators
with (hidden) internal state. If I can manipulate this state, perhaps I can use
that to pass information into the secure environment.
Random number generator woes</a></h2>
It turned out that random</code> was just a small shim around C’s
rand</code></a>. I was excited! 
This meant that there was a single global random state that was shared in the
process. It also helps that rand</code> implementations tended to be on the weak side.
Since World
of Warcraft was compiled with MSVC, the actual implementation of rand</code> was as follows:</p>
uint32_t state;
</span>
</span>int </span>rand() {
</span>    state = state * </span>214013 </span>+ </span>2531011</span>;
</span>    </span>return </span>(state >> </span>16</span>) & </span>0x7fff</span>;
</span>}
</span></code></pre>
This RNG is, for lack of a better word, shite. It is
a naked linear congruential generator</a>,
and a weak one at that. Which in my case, was a good thing.</p>

I can understand that MSVC keeps rand</code> the same for backwards compatibility, and at
least all the documentation I could find for rand</code> recommends against using rand</code>
for cryptographic purposes. But was there ever a time where such a bad PRNG
implementation was fit for any</em> purpose?</p>
</aside>
So let’s get to breaking this thing. Since the state is so laughably small
and you can see 15 bits of the state directly you can keep a full list of
all possible states consistent with a single output of the RNG and use
further calls to the RNG to eliminate possibilities until a
single one remains. But we can be significantly more clever.</p>
First we note that the top bit of state</code> never affects anything in this RNG.
(state >> 16) & 0x7fff</code> masks out 15 bits, after shifting away the bottom 16
bits, and thus effectively works mod $2^{31}$. Since on any update the new state
is a linear function of the previous state, we can propagate this modular form
all the way down to the initial state as $$f(x) \equiv f(x \bmod m) \mod m$$ for
any linear $f$.</p>
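As a quick sanity check of the claim that the top bit never matters, here is a minimal Python sketch. The helper names `msvc_step` and `outputs` and the example seed are illustrative assumptions; only the constants and the update rule come from the `rand` listing above. Because the multiplier is odd, toggling bit 31 of the state only ever toggles bit 31 of every later state, and the output reads bits 16 through 30.

```python
A = 214013
B = 2531011

def msvc_step(state):
    """One update of the LCG shown above; returns (new_state, 15-bit output)."""
    state = (state * A + B) % 2**32
    return state, (state >> 16) & 0x7fff

def outputs(state, n):
    """The first n outputs produced when starting from the given state."""
    out = []
    for _ in range(n):
        state, r = msvc_step(state)
        out.append(r)
    return out

seed = 0x12345678           # arbitrary example state
flipped = seed ^ (1 << 31)  # the same state with only the top bit toggled

# Both states generate exactly the same output sequence.
assert outputs(seed, 1000) == outputs(flipped, 1000)
```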
Let $a = 214013$ and $b = 2531011$. We observe the 15-bit outputs $r_0, r_1$ of
two RNG calls. We’ll call the 16-bit portion of the RNG state that is hidden by
the shift $h_0, h_1$ respectively, for the states after the first and second
call. This means the state of the RNG after the first call is $2^{16} r_0 + h_0$
and similarly for $2^{16} r_1 + h_1$ after the second call. Then we have the following identity:</p>
$$a\cdot (2^{16}r_0 + h_0) + b \equiv 2^{16}r_1 + h_1 \mod 2^{31},$$</p>
$$ah_0 \equiv h_1 + 2^{16}(r_1 - ar_0) - b \mod 2^{31}.$$</p>
Now let $c \geq 0$ be the known constant $(2^{16}(r_1 - ar_0) - b) \bmod 2^{31}$, then
for some integer $k$ we have</p>
$$ah_0 = h_1 + c + 2^{31} k.$$</p>
Note that the left hand side ranges from $0$ to $a (2^{16} - 1) \approx 2^{33.71}$.
Thus we must have $-1 \leq k \leq 2^{2.71} < 7$. Reordering we get the following
expression for $h_0$:
$$h_0 = \frac{c + 2^{31} k}{a} + h_1/a.$$
Since $a > 2^{16}$ while $0 \leq h_1 < 2^{16}$, the term $h_1/a$ satisfies $0 \leq h_1/a < 1$.
Because $h_0$ is an integer, assuming a solution exists, we must have
$$h_0 = \left\lceil\frac{c + 2^{31} k}{a}\right\rceil.$$</p>
So for $-1 \leq k < 7$ we compute the above guess for the hidden portion of
the RNG state after the first call. This gives us 8 guesses, after which we can
reject bad guesses using follow-up calls to the RNG until a single unique answer remains.</p>

While I was able to re-derive the above with little difficulty now, 18 year old
me wasn’t as experienced in discrete math. So I asked on crypto.SE</a>,
with the excuse that I wanted to ‘show my colleagues how weak this RNG is’.
It worked, which sparks all kinds of interesting ethics questions.</p>
</aside>
An example implementation of this process in Python:</p>
import </span>random
</span>
</span>A = </span>214013
</span>B = </span>2531011
</span>
</span>class </span>MsvcRng:
</span>    </span>def </span>__init__(self, state):
</span>        self.state = state
</span>        
</span>    </span>def </span>__call__(self):
</span>        self.state = (self.state * A + B) % </span>2</span>**</span>32
</span>        </span>return </span>(self.state >> </span>16</span>) & </span>0x7fff
</span>
</span># Create a random RNG state we'll reverse engineer.
</span>hidden_rng = MsvcRng(random.randint(</span>0</span>, </span>2</span>**</span>32 </span>- </span>1</span>))
</span>
</span># Compute guesses for hidden state from 2 observations.
</span>r0 = hidden_rng()
</span>r1 = hidden_rng()
</span>c = (</span>2</span>**</span>16 </span>* (r1 - A * r0) - B) % </span>2</span>**</span>31
</span>ceil_div = </span>lambda </span>a, b: (a + b - </span>1</span>) // b
</span>h_guesses = [ceil_div(c + </span>2</span>**</span>31 </span>* k, A) </span>for </span>k </span>in </span>range(-</span>1</span>, </span>7</span>)]
</span>
</span># Validate guesses until a single guess remains.
</span>guess_rngs = [MsvcRng(</span>2</span>**</span>16 </span>* r0 + h0) </span>for </span>h0 </span>in </span>h_guesses]
</span>guess_rngs = [g </span>for </span>g </span>in </span>guess_rngs </span>if </span>g() == r1]
</span>while </span>len(guess_rngs) > </span>1</span>:
</span>    r = hidden_rng()
</span>    guess_rngs = [g </span>for </span>g </span>in </span>guess_rngs </span>if </span>g() == r]
</span>    
</span># The top bit can not be recovered as it never affects the output,
</span># but we should have recovered the effective hidden state.
</span>assert </span>guess_rngs[</span>0</span>].state % </span>2</span>**</span>31 </span>== hidden_rng.state % </span>2</span>**</span>31
</span></code></pre>
While I did write the above process with a while</code> loop, it appears to only ever
need a third output at most to narrow it down to a single guess.</p>
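The observation about the third output can be probed empirically. The sketch below repeats the recovery for many random seeds and tallies how many outputs each run consumed. It is an experiment, not a proof: the function name `outputs_needed`, the fixed test seed and the trial count are assumptions made for illustration, and guesses outside the valid 16-bit range are discarded up front, which the listing above does not do.

```python
import random
from collections import Counter

A = 214013
B = 2531011

class MsvcRng:
    def __init__(self, state):
        self.state = state

    def __call__(self):
        self.state = (self.state * A + B) % 2**32
        return (self.state >> 16) & 0x7fff

def outputs_needed(seed):
    """Recover the state for one seed; return (outputs consumed, success)."""
    hidden = MsvcRng(seed)
    r0, r1 = hidden(), hidden()
    c = (2**16 * (r1 - A * r0) - B) % 2**31
    # -(x // -A) is ceiling division for a positive divisor A.
    h_guesses = [-((c + 2**31 * k) // -A) for k in range(-1, 7)]
    # Assumption of this sketch: drop guesses that cannot be a 16-bit value.
    cands = [MsvcRng(2**16 * r0 + h) for h in h_guesses if 0 <= h < 2**16]
    cands = [g for g in cands if g() == r1]
    used = 2
    while len(cands) > 1:
        r = hidden()
        used += 1
        cands = [g for g in cands if g() == r]
    ok = len(cands) == 1 and cands[0].state % 2**31 == hidden.state % 2**31
    return used, ok

random.seed(2024)  # fixed seed so the experiment is repeatable
tally = Counter()
for _ in range(2000):
    used, ok = outputs_needed(random.getrandbits(32))
    assert ok
    tally[used] += 1
print(sorted(tally.items()))  # (outputs consumed, number of runs)
```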
Putting it together</a></h2>
Once we could reverse-engineer the internal state of the random number
generator we could make arbitrary automated decisions in the supposedly secure
environment. How it worked was as follows:</p>


An insecure hook was registered that would execute right before the secure
environment code would run.</p>
</li>

In this hook we have full access to information, and make a decision as to
which action should be taken (e.g. casting a particular spell). This action
is looked up in a hardcoded list to get an index.</p>
</li>

The current state of the RNG is reverse-engineered using the above process.</p>
</li>

We predict the outcome of the next RNG call. If this (modulo the length
of our action list) does not give our desired outcome, we advance the RNG and
try again. This repeats until the next random number would correspond to our
desired action.</p>
</li>

The hook returns, and the secure environment starts. It generates a “random”
number, indexes our hardcoded list of actions, and performs the “random” action.</p>
</li>
</ol>
That’s all! By being able to simulate the RNG and looking one step ahead we could
use it as our information channel by choosing exactly the right moment to call
random</code> in the secure environment. Now if you wanted to support a list of $n$
actions it would on average take $n$ steps of the RNG before the correct
number came up to pass along, but that wasn’t a problem in practice.</p>
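The steps above can be sketched as a small simulation. Everything game-specific here is hypothetical: the action names, the function names `insecure_hook` and `secure_environment`, and the shortcut of passing the true state as `recovered_state` where the real add-on would have reverse-engineered it from observed outputs.

```python
A = 214013
B = 2531011

class MsvcRng:
    def __init__(self, state):
        self.state = state

    def __call__(self):
        self.state = (self.state * A + B) % 2**32
        return (self.state >> 16) & 0x7fff

# Hypothetical hardcoded action list, known to both sides of the channel.
ACTIONS = ["fireball", "frostbolt", "heal", "shield"]

def insecure_hook(shared_rng, recovered_state, desired_action):
    """Burn RNG calls until the next output selects desired_action."""
    target = ACTIONS.index(desired_action)
    model = MsvcRng(recovered_state)      # private simulation of the RNG
    burned = 0
    while True:
        lookahead = MsvcRng(model.state)  # peek one step ahead on a copy
        if lookahead() % len(ACTIONS) == target:
            return burned
        shared_rng()                      # advance the real, shared RNG
        model()                           # keep the simulation in sync
        burned += 1

def secure_environment(shared_rng):
    """Can act, but only sees a 'random' number."""
    return ACTIONS[shared_rng() % len(ACTIONS)]

shared = MsvcRng(0xC0FFEE)  # arbitrary example state
for wanted in ["heal", "fireball", "shield", "frostbolt", "heal"]:
    insecure_hook(shared, shared.state, wanted)
    assert secure_environment(shared) == wanted
```

The secure side never receives any data directly; the only thing the insecure side controls is how many times the shared generator has been stepped before the secure code runs.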
Conclusion</a></h2>
I don’t know when Blizzard fixed the issue where the RNG state is so weak and
shared, or whether they were aware of it being an issue at all. A few years
after I had written the code I tried it again out of curiosity, and it had
stopped working. Maybe they switched to a different algorithm,
or had a properly separated RNG state for the secure environment.</p>
All-in-all it was a lot of effort for a niche exploit in a video game that I
didn’t even want to use. But there certainly was a magic to manipulating
something supposedly random into doing exactly what you want, like a magician
pulling four aces from a shuffled deck.</p>


Branchless Lomuto Partitioning
Orson R. L. Peters — 2023-12-04T00:00:00+00:00
A partition function accepts as input an array of elements, and a function
returning a bool (a predicate</em>) which indicates if an element should be in the
first, or second partition. Then it returns two arrays, the two partitions</em>:</p>
def </span>partition(v, pred):
</span>    first = [x </span>for </span>x </span>in </span>v </span>if </span>pred(x)]
</span>    second = [x </span>for </span>x </span>in </span>v </span>if </span>not pred(x)]
</span>    </span>return </span>first, second
</span></code></pre>
This can actually be done without needing any extra memory beyond the original
array. An in-place</em> partition algorithm reorders the array such that all the
elements for which pred(x)</code> is true come before the elements for which
pred(x)</code> is false (or vice versa - it doesn’t really matter). Finally by
returning how many elements satisfied pred(x)</code> to the caller they can then
logically split the array in two slices. This scheme is used in C++’s
std::partition</code></a>
for example.</p>

Usually partitioning is discussed in the context of sorting, most notably
quicksort</a>. There it is typically used
with a pivot</em> $p$, and you partition the data into ${x < p}$ and ${x \geq p}$
before recursing on both partitions. However, it is more generally applicable than that.</p>
</aside>
In-place partition algorithms</a></h2>
There are many possible variations / implementations of in-place partition
algorithms, but they usually follow one of two schools: Hoare or Lomuto, named
after their original inventors Tony
Hoare</a> (the inventor of quicksort),
and Nico Lomuto.</p>
Hoare</a></h3>
In Hoare-style partition algorithms you have two iterators scanning the array,
one left-to-right ($i$) and one right-to-left ($j$). The former tries to find
elements that belong on the right, and the latter tries to find elements that
belong on the left. When both iterators have found an element, you swap them,
and continue. When the iterators cross each other you are done.</p>
def </span>hoare_partition(v, pred):
</span>    i = </span>0       </span># Loop invariant: all(pred(x) for x in v[:i])
</span>    j = len(v)  </span># Loop invariant: all(not pred(x) for x in v[j:])
</span>    </span>while </span>True</span>:
</span>        </span>while </span>i < j and pred(v[i]): i += </span>1
</span>        j -= </span>1
</span>        </span>while </span>i < j and not pred(v[j]): j -= </span>1
</span>        </span>if </span>i >= j: </span>return </span>i
</span>        v[i], v[j] = v[j], v[i]
</span>        i += </span>1
</span></code></pre>
The loop invariant can visually be seen as such (using the symbols $<$ and
$\geq$ for the predicate outcomes, as is usual in sorting):</p>
</p>
Doing this efficiently is perhaps an article for another day; if you are curious you can check out
the paper BlockQuicksort: How Branch Mispredictions don’t affect
Quicksort</a> by Stefan Edelkamp and Armin Weiß,
which is the technique I implemented in pdqsort</a>. Another take on the same
idea is found in bitsetsort</a>. Key here is that it can be
done branchlessly</em>. A branch</em> is a point at which the CPU has to make a choice of which code to
run. The most recognizable form of a branch is the if</code> statement, but there are others (e.g.
while</code> conditions, calling a function from a lookup table, short-circuiting logical operators).</p>
To cut a long story short, a CPU tries to predict which piece of code it should run
next and already starts executing it even before it knows if the code it is sending
down the pipeline</a> is the
right choice. This is great in most cases as most branches are easy to predict,
but it does incur a penalty when the prediction was wrong, as the CPU needs to
stop, go back and restart from the right point once it realizes it was wrong
(which can take a while).</p>
Especially in sorting when the outcomes of comparisons are ideally unpredictable
(the more unpredictable the outcome of a comparison, the more informative</a>
getting the answer is), it can thus be advisable to avoid branching on comparisons
altogether.</p>

Another branchless partition algorithm that is similar to Hoare but which makes
a temporary gap in the array so it can use moves rather than swaps is the fulcrum
partition</em> found in crumsort</a> by Igor
van den Hoven.</p>
</aside>
Lomuto</a></h3>
In Lomuto-style partition algorithms the following invariant is followed:</p>
</p>
That is, there is a single iterator scanning the array from left-to-right ($j$). If
the element is found to belong in the left partition, it is swapped with the
first element of the right partition (tracked by $i$), otherwise it is left
where it is.</p>
def </span>lomuto_partition(v, pred):
</span>    i = </span>0  </span># Loop invariant: all(pred(x) for x in v[:i])
</span>    j = </span>0  </span># Loop invariant: all(not pred(x) for x in v[i:j])
</span>    </span>while </span>j < len(v):
</span>        </span>if </span>pred(v[j]):
</span>            v[i], v[j] = v[j], v[i]
</span>            i += </span>1
</span>        j += </span>1
</span>
</span>    </span>return </span>i
</span></code></pre>
This article is focused on optimizing this style of partition.</p>
Branchless Lomuto</a></h2>
A few years ago I read a post</a>
by Andrei Alexandrescu which discusses a branchless variant of the Lomuto
partition. Its inner loop (in C++) looks like this:</p>
for </span>(; read < last; ++read) {
</span>    </span>auto</span> x = *read;
</span>    </span>auto</span> smaller = -</span>int</span>(x < pivot);
</span>    </span>auto</span> delta = smaller & (read - first);
</span>    first[delta] = *first;
</span>    read[-delta] = x;
</span>    first -= smaller;
</span>}
</span></code></pre>
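To make the data flow of that loop easier to follow, here is a rough Python transliteration using indices instead of pointers. It is a sketch under assumptions: the function name is made up, the loop starts from `first = 0` rather than reproducing whatever setup precedes the inner loop in the original post, and Python itself is of course not branchless, so this only shows the arithmetic, not the performance.

```python
def alexandrescu_style_partition(v, pivot):
    """Index-based transliteration of the C++ inner loop shown above."""
    first = 0  # v[:first] holds elements < pivot
    for read in range(len(v)):
        x = v[read]
        smaller = -int(x < pivot)         # 0, or -1 (all bits set)
        delta = smaller & (read - first)  # read - first if x < pivot, else 0
        v[first + delta] = v[first]       # swap when x < pivot, no-op otherwise
        v[read - delta] = x
        first -= smaller                  # advances by one when x < pivot
    return first

v = [5, 1, 4, 2, 3]
k = alexandrescu_style_partition(v, 3)
assert all(x < 3 for x in v[:k]) and all(x >= 3 for x in v[k:])
```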
At the time I was not overly impressed, as it does a lot of arithmetic to make
the loop branchless, so I disregarded it. A while back my friend Lukas
Bergdoll</a> approached me with a new partition
algorithm which was doing quite well in his benchmarks, which I recognized as
being a variant of Lomuto. I then found a way the algorithm could be
restructured without using conditional move instructions, which made it perform
better still. I will present this algorithm now.</p>
Simple version</a></h3>
First, a simplified variant which will make the more optimized variant much
more readily understood:</p>
def </span>branchless_lomuto_partition_simplified(v, pred):
</span>    i = </span>0  </span># Loop invariant: all(pred(x) for x in v[:i])
</span>    j = </span>0  </span># Loop invariant: all(not pred(x) for x in v[i:j])
</span>    </span>while </span>j < len(v):
</span>        v[i], v[j] = v[j], v[i]
</span>        i += int(pred(v[i]))
</span>        j += </span>1
</span>
</span>    </span>return </span>i
</span></code></pre>
This is actually quite similar to the original lomuto_partition</code>, except we
now unconditionally</em> swap, and the increment of $i$ that used to be guarded
by an if</code> statement is replaced by simply converting the boolean condition to an integer
and adding it to $i$.</p>
To visualize this, the state of the array looks like this after the unconditional
swap:</p>
</p>
From this it should be pretty clear that incrementing $i$ if the predicate
is true (corresponding to $v[i] < p$ for sorting) and unconditionally
incrementing $j$ restores our Lomuto loop invariant.
The only corner case is when ${i = j},$ but then the swap is a no-op and the
algorithm remains correct.</p>
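A randomized check is a cheap way to gain confidence in this argument, including the ${i = j}$ corner case. The harness below is a sketch: `check_partition`, its trial count and the predicate `x < 5` are arbitrary choices made for illustration.

```python
import random

def branchless_lomuto_partition_simplified(v, pred):
    i = 0
    j = 0
    while j < len(v):
        v[i], v[j] = v[j], v[i]
        i += int(pred(v[i]))
        j += 1
    return i

def check_partition(partition, trials=1000, seed=0):
    """Verify the partition postconditions on many small random arrays."""
    rng = random.Random(seed)
    pred = lambda x: x < 5
    for _ in range(trials):
        v = [rng.randrange(10) for _ in range(rng.randrange(20))]
        original = list(v)
        k = partition(v, pred)
        assert sorted(v) == sorted(original)    # nothing lost or duplicated
        assert all(pred(x) for x in v[:k])      # left partition
        assert not any(pred(x) for x in v[k:])  # right partition
    return True

assert check_partition(branchless_lomuto_partition_simplified)

# When every element satisfies the predicate, i == j on every iteration
# and each swap is a no-op.
v = [1, 2, 3]
assert branchless_lomuto_partition_simplified(v, lambda x: True) == 3
assert v == [1, 2, 3]
```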

While researching prior art for this article
I came across nanosort</a> by Arseny Kapoulkine
which implements this simplified variant and thanks</em> Andrei Alexandrescu for
his branchless Lomuto partition. But I actually believe it’s fundamentally
different to Andrei’s version.</p>
</aside>
Eliminating swaps</a></h3>
Swaps are equivalent to three moves, but by restructuring the algorithm we
can get away with two moves per iteration. The trick is to introduce a gap</em>
in the array by temporarily moving one of the elements out of the array.</p>
def </span>branchless_lomuto_partition(v, pred):
</span>    </span>if </span>len(v) == </span>0</span>: </span>return </span>0
</span>
</span>    tmp = v[</span>0</span>]
</span>    i = </span>0  </span># Loop invariant: all(pred(x) for x in v[:i])
</span>    j = </span>0  </span># Loop invariant: all(not pred(x) for x in v[i:j])
</span>    </span>while </span>j < len(v) - </span>1</span>:
</span>        v[j] = v[i]
</span>        j += </span>1
</span>        v[i] = v[j]
</span>        i += pred(v[i])
</span>
</span>    v[j] = v[i]
</span>    v[i] = tmp
</span>    i += pred(v[i])
</span>    </span>return </span>i
</span></code></pre>
This is our new branchless Lomuto partition algorithm. Its inner loop is
incredibly tight, involving only two moves, one predicate evaluation
and two additions. A full visualization of one iteration of the algorithm (the striped
red area indicates the gap in the array):</p>
</p>
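In text form, the movement of the gap can be followed with a small tracing variant of the function above. This is a sketch for illustration only: `trace_branchless_lomuto` is a made-up name, it works on a copy of the input, and it marks the gap position with `None` in each recorded snapshot.

```python
def trace_branchless_lomuto(v, pred):
    """Run the gap-based partition on a copy, recording each iteration."""
    v = list(v)
    if len(v) == 0:
        return 0, []
    steps = []
    tmp = v[0]  # moving v[0] out creates the gap at index 0
    i = j = 0
    while j < len(v) - 1:
        v[j] = v[i]      # fill the gap at j; the gap is now at i
        j += 1
        v[i] = v[j]      # fill the gap at i; the gap is now at j
        i += pred(v[i])
        snapshot = list(v)
        snapshot[j] = None  # mark the gap
        steps.append((snapshot, i, j))
    v[j] = v[i]
    v[i] = tmp           # the saved element closes the gap
    i += pred(v[i])
    steps.append((list(v), i, j))
    return i, steps

k, steps = trace_branchless_lomuto([3, 1, 2], lambda x: x < 2)
for snapshot, i, j in steps:
    print(snapshot, "i =", i, "j =", j)
```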
We can now compare the assembly</a> of the tight inner loops of Andrei Alexandrescu’s
branchless Lomuto partition and ours:</p>
.alexandrescu:
</span>    </span>mov     </span>edi, DWORD PTR [rdx]
</span>    </span>mov     </span>rsi, rdx
</span>    </span>mov     </span>ebp, DWORD PTR [rax]
</span>    </span>cmp     </span>edi, r8d
</span>    </span>setb    </span>cl
</span>    </span>sub     </span>rsi, rax
</span>    </span>sar     </span>rsi, </span>2
</span>    </span>test    </span>cl, cl
</span>    </span>cmove   </span>rsi, rbx
</span>    </span>sal     </span>rcx, </span>63
</span>    </span>sar     </span>rcx, </span>61
</span>    </span>mov     </span>DWORD PTR [rax+rsi*</span>4</span>], ebp
</span>    </span>lea     </span>r11, [</span>0</span>+rsi*</span>4</span>]
</span>    </span>mov     </span>rsi, rdx
</span>    </span>add     </span>rdx, </span>4
</span>    </span>sub     </span>rsi, r11
</span>    </span>sub     </span>rax, rcx
</span>    </span>mov     </span>DWORD PTR [rsi], edi
</span>    </span>cmp     </span>rdx, r10
</span>    </span>jb      </span>.alexandrescu
</span>
</span>.orlp:
</span>    </span>lea     </span>rdi, [r8+rax*</span>4</span>]
</span>    </span>mov     </span>ecx, DWORD PTR [rdi]
</span>    </span>mov     </span>DWORD PTR [rdx], ecx
</span>    </span>mov     </span>ecx, DWORD PTR [rdx+</span>4</span>]
</span>    </span>cmp     </span>ecx, r9d
</span>    </span>mov     </span>DWORD PTR [rdi], ecx
</span>    </span>adc     </span>rax, </span>0
</span>    </span>add     </span>rdx, </span>4
</span>    </span>cmp     </span>rdx, r10
</span>    </span>jne     </span>.orlp
</span></code></pre>
Half the instruction count! A neat trick the compiler did is translate the
addition of the boolean result of the predicate (which is just a comparison
here) to adc rax, 0</code>. It avoids needing to create a boolean 0/1 value in a
register by setting the carry flag using cmp</code> and adding zero with carry.</p>
Conclusion</a></h2>
Is the new branchless Lomuto implementation worth it? For that I’ll hand you
over to my friend Lukas Bergdoll who has done an extensive write-up</a>
on the performance of an optimized implementation of this partition with a
variety of real-world benchmarks and metrics.</p>
From an algorithmic standpoint the branchless Lomuto- and Hoare-style algorithms
do have a key difference: the number of writes they must perform.
Branchless Lomuto-style algorithms always do at least two moves for each
element, whereas Hoare-style algorithms can get away with doing a single move
for each element (crumsort), or even better, half a move per element on average
for random data (BlockQuicksort, pdqsort and bitsetsort, although they spend
more time figuring out what to move than crumsort does - one of many
trade-offs). So a key component of choosing an algorithm is the question “How
expensive are my moves?” which can vary from very cheap (small integers in
cache) to very expensive (large structs not in cache).</p>
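The difference in write counts can be made visible with a list that counts assignments. This is only a rough sketch: `CountingList` is a hypothetical helper, it counts array writes and nothing else (a swap registers as two writes, and the temporary is ignored), and the Hoare variant measured is the simple swap-based one from earlier in this article, not a block-based or gap-based implementation.

```python
import random

class CountingList(list):
    """A list that counts element writes made through v[k] = x."""
    def __init__(self, items):
        super().__init__(items)
        self.writes = 0

    def __setitem__(self, index, value):
        self.writes += 1
        super().__setitem__(index, value)

def hoare_partition(v, pred):
    i, j = 0, len(v)
    while True:
        while i < j and pred(v[i]):
            i += 1
        j -= 1
        while i < j and not pred(v[j]):
            j -= 1
        if i >= j:
            return i
        v[i], v[j] = v[j], v[i]
        i += 1

def branchless_lomuto_partition(v, pred):
    if len(v) == 0:
        return 0
    tmp = v[0]
    i = j = 0
    while j < len(v) - 1:
        v[j] = v[i]
        j += 1
        v[i] = v[j]
        i += pred(v[i])
    v[j] = v[i]
    v[i] = tmp
    i += pred(v[i])
    return i

rng = random.Random(0)
data = [rng.randrange(1000) for _ in range(10_000)]
pred = lambda x: x < 500

hoare, lomuto = CountingList(data), CountingList(data)
assert hoare_partition(hoare, pred) == branchless_lomuto_partition(lomuto, pred)
print("Hoare writes per element: ", hoare.writes / len(data))
print("Lomuto writes per element:", lomuto.writes / len(data))
```

The gap-based Lomuto loop performs exactly two writes per element regardless of the data, while the swap-based Hoare loop only writes when it finds a misplaced pair.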
Finally, the Lomuto-style algorithms tend to be significantly smaller in both
source code and generated machine code, which can be a factor for some. They
are also arguably easier to understand and prove correct, Hoare-style partition
algorithms are especially prone to off-by-one errors.</p>


Subtraction Is Functionally Complete
Orson R. L. Peters — 2023-09-28T00:00:00+00:00
To be precise, IEEE-754</a> floating point
subtraction is functionally
complete</a>. That means you
can construct any binary circuit using nothing but floating point subtraction.</p>
To see how, we must start at the bottom. I quote the IEEE 754-2019 standard, section 6.3:</p>

6.3 The sign bit</a></h3>
[…] When neither the inputs nor result are NaN, […]; the sign of a sum, or of a difference $x−y$
regarded as a sum $x+(−y)$, differs from at most one of the addends’ signs; […].
These rules shall apply even when operands or results are zero or infinite.</p>
When the sum of two operands with opposite signs (or the difference of two operands with like
signs) is exactly zero, the sign of that sum (or difference) shall be $+0$ under all rounding-direction
attributes except roundTowardNegative</code>; under that attribute, the sign of an exact zero sum (or
difference) shall be $−0$.</p>
</blockquote>
Let’s dissect that.</p>

A subtraction $x - y$ is considered a sum $x + (-y)$.</li>
Zero can have a sign, $-0$ and $0$ are distinct entities (although they compare
equal when testing with ==</code>).</li>
If both of the addends have the same sign, the output must have that sign.
However, for $x - y$ that means if $x$ and $y$ have different</em> signs the output
must have the sign of $x$.</li>
If $x$ and $y$ have the same sign, and $x - y$ is zero, the output will be
$+0$ for all rounding modes except roundTowardNegative</code>, then it will be $-0$.</li>
</ol>
Now since the default rounding mode in virtually every context is roundTiesToEven</code>,
we shall assume that from now on. However, everything works analogously even for
roundTowardNegative</code>.</p>
A truth table</a></h2>
So, what does that give us when subtracting zeroes?</p>
-</span>0 </span>- -</span>0 </span>= +</span>0  </span># Same sign, must be +0.
</span>-</span>0 </span>- +</span>0 </span>= -</span>0  </span># Different signs, sign from first argument.
</span>+</span>0 </span>- -</span>0 </span>= +</span>0  </span># Different signs, sign from first argument.
</span>+</span>0 </span>- +</span>0 </span>= +</span>0  </span># Same sign, must be +0.
</span></code></pre>
Interesting… What if we say that $-0$ is false and $+0$ is true?</p>
A B | O
</span>----+--
</span>0 0 </span>| </span>1
</span>0 1 </span>| </span>0
</span>1 0 </span>| </span>1
</span>1 1 </span>| </span>1
</span></code></pre>
Our resulting truth table is equivalent to ${A \vee \neg B}$, or ${B \to A}$ (also known as the
IMPLY</a> gate, albeit with the arguments swapped). It turns
out this truth table is functionally
complete</a>, which means we can make arbitrary
circuits using only this gate.
Technically speaking, it is only functionally complete if given access to the
constant false. This is necessary to produce a NOT gate, and NOT + IMPLY is
a functionally complete set. I don’t know a better term for ‘functionally complete
if given access to some constant value’, however.</p>
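The truth table can be reproduced directly with ordinary floats. In the minimal sketch below, the names `F`, `T`, `as_bool` and `subtraction_truth_table` are assumptions for illustration; the sign is read with `math.copysign` because $-0$ and $+0$ compare equal.

```python
import math

F, T = -0.0, 0.0  # -0 is false, +0 is true

def as_bool(x):
    """Read the truth value of a signed zero from its sign bit."""
    return math.copysign(1.0, x) > 0

def subtraction_truth_table():
    """Rows (A, B, O) with O being the truth value of A - B on signed zeros."""
    return [(as_bool(a), as_bool(b), as_bool(a - b))
            for a in (F, T) for b in (F, T)]

for a, b, o in subtraction_truth_table():
    assert o == (a or not b)  # A OR NOT B, i.e. B implies A
    print(int(a), int(b), "|", int(o))
```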

NAND and NOR are truly functionally complete by
themselves, even without access to any particular constant value. This is very
valuable when constructing microchips as you only need to be able to produce a
single kind of component, and do not need to worry about routing a consistent
low signal anywhere to produce a NOT gate.
</aside>
Subtraction circuits</a></h2>
Let’s build a demo in Python. First we’ll define our constants and a helper to print them nicely.
Note that even though they are distinct entities, $+0$ and $-0$ compare equal in IEEE 754 floating
point, so we must first extract the sign before comparing to 0 to distinguish.</p>
import </span>math
</span>
</span>f_false = -</span>0.0
</span>f_true = </span>0.0
</span>f_repr = </span>lambda </span>x: </span>True </span>if </span>math.copysign(</span>1</span>, x) > </span>0 </span>else </span>False
</span></code></pre>
We can now make a NOT gate by using the fact that $-0 - x$ flips the sign of
a zero $x$:</p>
f_not = </span>lambda </span>x: f_false - x
</span></code></pre>
Let’s test that:</p>
>>> f_repr(f_not(f_false))
</span>True
</span>>>> f_repr(f_not(f_true))
</span>False
</span></code></pre>
Great! We can also build an OR gate by noticing that if we flip the sign of
the second argument before subtracting, we always get $+0$ (true) unless
both arguments are $-0$ (false):</p>
f_or = </span>lambda </span>a, b: a - f_not(b)
</span></code></pre>
Let’s test it out:</p>
>>> f_repr(f_or(f_false, f_false))
</span>False
</span>>>> f_repr(f_or(f_true,  f_false))
</span>True
</span>>>> f_repr(f_or(f_false, f_true))
</span>True
</span>>>> f_repr(f_or(f_true, f_true))
</span>True
</span></code></pre>
Now that we have OR and NOT, we can make all other gates, e.g.:</p>
f_and = </span>lambda </span>a, b: f_not(f_or(f_not(a), f_not(b)))
</span>f_xor = </span>lambda </span>a, b: f_or(f_and(f_not(a), b), f_and(a, f_not(b)))
</span></code></pre>
>>> f_repr(f_and(f_false, f_false))
</span>False
</span>>>> f_repr(f_and(f_true,  f_false))
</span>False
</span>>>> f_repr(f_and(f_false, f_true))
</span>False
</span>>>> f_repr(f_and(f_true, f_true))
</span>True
</span>
</span>>>> f_repr(f_xor(f_false, f_false))
</span>False
</span>>>> f_repr(f_xor(f_true,  f_false))
</span>True
</span>>>> f_repr(f_xor(f_false, f_true))
</span>True
</span>>>> f_repr(f_xor(f_true, f_true))
</span>False
</span></code></pre>
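Rather than eyeballing each REPL session, the gates can be checked exhaustively against Python's own boolean operators. In this sketch `f_bit`, `f_nand` and `f_nor` are additions made for illustration and do not appear in the code above; the remaining gates are the same definitions written with `def`.

```python
import itertools
import math

f_false, f_true = -0.0, 0.0

def f_repr(x): return math.copysign(1, x) > 0
def f_bit(b): return f_true if b else f_false  # hypothetical helper

# Every gate below boils down to floating point subtraction.
def f_not(x): return f_false - x
def f_or(a, b): return a - f_not(b)
def f_and(a, b): return f_not(f_or(f_not(a), f_not(b)))
def f_xor(a, b): return f_or(f_and(f_not(a), b), f_and(a, f_not(b)))
def f_nand(a, b): return f_not(f_and(a, b))    # added for illustration
def f_nor(a, b): return f_not(f_or(a, b))      # added for illustration

for a, b in itertools.product([False, True], repeat=2):
    fa, fb = f_bit(a), f_bit(b)
    assert f_repr(f_not(fa)) == (not a)
    assert f_repr(f_or(fa, fb)) == (a or b)
    assert f_repr(f_and(fa, fb)) == (a and b)
    assert f_repr(f_xor(fa, fb)) == (a != b)
    assert f_repr(f_nand(fa, fb)) == (not (a and b))
    assert f_repr(f_nor(fa, fb)) == (not (a or b))
```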
Software integers</a></h2>
You may have heard of soft-float, software implementations of floating point
using integers. Let’s turn that on its head: an integer implementation done in
software, using only floating point ops. Let’s do it in Rust so we can look at
the final assembly output to see how horrifically slow</del> awesome it is.</p>
type </span>Bit = </span>f32</span>;
</span>const </span>ZERO</span>: Bit = -</span>0.0</span>;
</span>const </span>ONE</span>: Bit = </span>0.0</span>;
</span>
</span>fn </span>not(x: Bit) -> Bit { </span>ZERO </span>- x }
</span>fn </span>or(a: Bit, b: Bit) -> Bit { a - not(b) }
</span>fn </span>and(a: Bit, b: Bit) -> Bit { not(or(not(a), not(b))) }
</span>fn </span>xor(a: Bit, b: Bit) -> Bit { or(and(not(a), b), and(a, not(b))) }
</span>fn </span>adder(a: Bit, b: Bit, c: Bit) -> (Bit, Bit) {
</span>    </span>let</span> s = xor(xor(a, b), c);
</span>    </span>let</span> c = or(and(xor(a, b), c), and(a, b));
</span>    (s, c)
</span>}
</span>
</span>type </span>SoftU8 = [Bit; </span>8</span>];
</span>
</span>pub </span>fn </span>softu8_add(a: SoftU8, b: SoftU8) -> SoftU8 {
</span>    </span>let </span>(s0, c) = adder(a[</span>0</span>], b[</span>0</span>], </span>ZERO</span>);
</span>    </span>let </span>(s1, c) = adder(a[</span>1</span>], b[</span>1</span>], c);
</span>    </span>let </span>(s2, c) = adder(a[</span>2</span>], b[</span>2</span>], c);
</span>    </span>let </span>(s3, c) = adder(a[</span>3</span>], b[</span>3</span>], c);
</span>    </span>let </span>(s4, c) = adder(a[</span>4</span>], b[</span>4</span>], c);
</span>    </span>let </span>(s5, c) = adder(a[</span>5</span>], b[</span>5</span>], c);
</span>    </span>let </span>(s6, c) = adder(a[</span>6</span>], b[</span>6</span>], c);
</span>    </span>let </span>(s7, _) = adder(a[</span>7</span>], b[</span>7</span>], c);
</span>    [s0, s1, s2, s3, s4, s5, s6, s7]
</span>}
</span>
</span>// Hmm? u8? What's that? Shhhh....
</span>pub </span>fn </span>to_softu8(x: </span>u8</span>) -> SoftU8 {
</span>    std::array::from_fn(|i| </span>if </span>(x >> i) & </span>1 </span>== </span>1 </span>{ </span>ONE </span>} </span>else </span>{ </span>ZERO </span>})
</span>}
</span>
</span>pub </span>fn </span>from_softu8(x: SoftU8) -> </span>u8 </span>{
</span>    (</span>0</span>..</span>8</span>).filter(|i| x[*i].signum() > </span>0.0</span>).map(|i| </span>1 </span><< i).sum()
</span>}
</span>
</span>fn </span>main() {
</span>    </span>let</span> a = to_softu8(</span>23</span>);
</span>    </span>let</span> b = to_softu8(</span>19</span>);
</span>    println!(</span>"</span>{}</span>"</span>, from_softu8(softu8_add(a, b)));
</span>}
</span></code></pre>
It’s horrible, but it works: it dutifully prints 42. And it only</em> took $\approx 120$
floating point instructions to add two 8-bit integers:</p>

On x86-64 there isn't actually a floating point negation instruction; instead
the compiler simply emits a XOR with a mask that toggles the top bit, which
is the sign bit of an IEEE-754 floating point number.
</aside>
example::softu8_add:
</span>        </span>mov     </span>rax, rdi
</span>        </span>movups  </span>xmm2, xmmword ptr [rsi]
</span>        </span>movups  </span>xmm0, xmmword ptr [rdx]
</span>        </span>movaps  </span>xmm3, xmm2
</span>        </span>subps   </span>xmm3, xmm0
</span>        </span>movaps  </span>xmm4, xmm0
</span>        </span>subps   </span>xmm4, xmm2
</span>        </span>movaps  </span>xmm1, xmmword ptr [rip + .LCPI0_0]
</span>        </span>xorps   </span>xmm4, xmm1
</span>        </span>subps   </span>xmm4, xmm3
</span>        </span>xorps   </span>xmm3, xmm3
</span>        </span>subss   </span>xmm3, xmm4
</span>        </span>movaps  </span>xmm6, xmm0
</span>        </span>xorps   </span>xmm6, xmm1
</span>        </span>subss   </span>xmm6, xmm2
</span>        </span>xorps   </span>xmm6, xmm1
</span>        </span>subss   </span>xmm6, xmm3
</span>        </span>movaps  </span>xmm3, xmm4
</span>        </span>shufps  </span>xmm3, xmm4, </span>85
</span>        </span>movaps  </span>xmm5, xmm6
</span>        </span>subss   </span>xmm5, xmm3
</span>        </span>xorps   </span>xmm5, xmm1
</span>        </span>movaps  </span>xmm10, xmm6
</span>        </span>xorps   </span>xmm10, xmm1
</span>        </span>subss   </span>xmm10, xmm3
</span>        </span>movaps  </span>xmm7, xmm0
</span>        </span>shufps  </span>xmm7, xmm0, </span>85
</span>        </span>xorps   </span>xmm7, xmm1
</span>        </span>movaps  </span>xmm3, xmm2
</span>        </span>shufps  </span>xmm3, xmm2, </span>85
</span>        </span>subss   </span>xmm7, xmm3
</span>        </span>xorps   </span>xmm7, xmm1
</span>        </span>movaps  </span>xmm8, xmm4
</span>        </span>unpckhpd        </span>xmm8, xmm4
</span>        </span>movaps  </span>xmm3, xmm0
</span>        </span>unpckhpd        </span>xmm3, xmm0
</span>        </span>xorps   </span>xmm3, xmm1
</span>        </span>movaps  </span>xmm9, xmm2
</span>        </span>unpckhpd        </span>xmm9, xmm2
</span>        </span>subss   </span>xmm3, xmm9
</span>        </span>xorps   </span>xmm3, xmm1
</span>        </span>xorps   </span>xmm11, xmm11
</span>        </span>movaps  </span>xmm9, xmm4
</span>        </span>shufps  </span>xmm9, xmm4, </span>255
</span>        </span>subss   </span>xmm7, xmm10
</span>        </span>movaps  </span>xmm10, xmm7
</span>        </span>xorps   </span>xmm10, xmm1
</span>        </span>subss   </span>xmm10, xmm8
</span>        </span>subss   </span>xmm3, xmm10
</span>        </span>unpcklps        </span>xmm7, xmm3
</span>        </span>shufps  </span>xmm6, xmm7, </span>64
</span>        </span>addps   </span>xmm11, xmm4
</span>        </span>movlhps </span>xmm5, xmm4
</span>        </span>subps   </span>xmm4, xmm6
</span>        </span>movss   </span>xmm4, xmm11
</span>        </span>subps   </span>xmm7, xmm8
</span>        </span>xorps   </span>xmm7, xmm1
</span>        </span>shufps  </span>xmm5, xmm7, </span>66
</span>        </span>subps   </span>xmm5, xmm4
</span>        </span>xorps   </span>xmm3, xmm1
</span>        </span>subss   </span>xmm3, xmm9
</span>        </span>shufps  </span>xmm0, xmm0, </span>255
</span>        </span>xorps   </span>xmm0, xmm1
</span>        </span>shufps  </span>xmm2, xmm2, </span>255
</span>        </span>subss   </span>xmm0, xmm2
</span>        </span>xorps   </span>xmm0, xmm1
</span>        </span>movups  </span>xmmword ptr [rdi], xmm5
</span>        </span>movups  </span>xmm2, xmmword ptr [rdx + </span>16</span>]
</span>        </span>movaps  </span>xmm5, xmm2
</span>        </span>xorps   </span>xmm5, xmm1
</span>        </span>movups  </span>xmm7, xmmword ptr [rsi + </span>16</span>]
</span>        </span>subss   </span>xmm5, xmm7
</span>        </span>xorps   </span>xmm5, xmm1
</span>        </span>movaps  </span>xmm4, xmm2
</span>        </span>shufps  </span>xmm4, xmm2, </span>85
</span>        </span>xorps   </span>xmm4, xmm1
</span>        </span>movaps  </span>xmm6, xmm2
</span>        </span>movaps  </span>xmm8, xmm7
</span>        </span>movaps  </span>xmm9, xmm7
</span>        </span>subps   </span>xmm9, xmm2
</span>        </span>subps   </span>xmm2, xmm7
</span>        </span>shufps  </span>xmm7, xmm7, </span>85
</span>        </span>subss   </span>xmm4, xmm7
</span>        </span>xorps   </span>xmm4, xmm1
</span>        </span>movhlps </span>xmm6, xmm6
</span>        </span>xorps   </span>xmm6, xmm1
</span>        </span>movhlps </span>xmm8, xmm8
</span>        </span>subss   </span>xmm6, xmm8
</span>        </span>xorps   </span>xmm6, xmm1
</span>        </span>xorps   </span>xmm2, xmm1
</span>        </span>subps   </span>xmm2, xmm9
</span>        </span>subss   </span>xmm0, xmm3
</span>        </span>movaps  </span>xmm3, xmm0
</span>        </span>xorps   </span>xmm3, xmm1
</span>        </span>subss   </span>xmm3, xmm2
</span>        </span>subss   </span>xmm5, xmm3
</span>        </span>unpcklps        </span>xmm0, xmm5
</span>        </span>xorps   </span>xmm5, xmm1
</span>        </span>movaps  </span>xmm3, xmm2
</span>        </span>shufps  </span>xmm3, xmm2, </span>85
</span>        </span>subss   </span>xmm5, xmm3
</span>        </span>subss   </span>xmm4, xmm5
</span>        </span>movaps  </span>xmm3, xmm4
</span>        </span>xorps   </span>xmm3, xmm1
</span>        </span>movaps  </span>xmm5, xmm2
</span>        </span>unpckhpd        </span>xmm5, xmm2
</span>        </span>subss   </span>xmm3, xmm5
</span>        </span>subss   </span>xmm6, xmm3
</span>        </span>unpcklps        </span>xmm4, xmm6
</span>        </span>movlhps </span>xmm0, xmm4
</span>        </span>movaps  </span>xmm3, xmm2
</span>        </span>subps   </span>xmm3, xmm0
</span>        </span>subps   </span>xmm0, xmm2
</span>        </span>xorps   </span>xmm0, xmm1
</span>        </span>subps   </span>xmm0, xmm3
</span>        </span>movups  </span>xmmword ptr [rdi + </span>16</span>], xmm0
</span>        </span>ret
</span></code></pre>


Bitwise Binary Search: Elegant and Fast
Orson R. L. Peters — 2023-05-16T00:00:00+00:00
I recently read the article Beautiful Branchless Binary Search</em></a>
by Malte Skarupke. In it they discuss the merits of the following snippet of
C++ code implementing a binary search</a>:</p>
template</span><</span>typename</span> It, </span>typename</span> T, </span>typename</span> Cmp>
</span>It lower_bound_skarupke(It begin, It end, const T& value, Cmp comp) {
</span>    size_t length = end - begin;
</span>    </span>if </span>(length == </span>0</span>) </span>return</span> end;
</span>
</span>    size_t step = bit_floor(length);
</span>    </span>if </span>(step != length && comp(begin[step], value)) {
</span>        length -= step + </span>1</span>;
</span>        </span>if </span>(length == </span>0</span>) </span>return</span> end;
</span>        step = bit_ceil(length);
</span>        begin = end - step;
</span>    }
</span>
</span>    </span>for </span>(step /= </span>2</span>; step != </span>0</span>; step /= </span>2</span>) {
</span>        </span>if </span>(comp(begin[step], value)) begin += step;
</span>    }
</span>
</span>    </span>return</span> begin + comp(*begin, value);
</span>}
</span></code></pre>
Frankly, while the ideas behind the algorithm are beautiful, I find the
implementation complex and hard to understand or prove correct. This is not
meant as a jab at Malte Skarupke, I find almost all binary search
implementations hard to understand or prove correct.</p>

Binary search is notoriously difficult to get correct. A 1988
study</a> found that out of an informal
sample of twenty computer science textbooks</strong>, only five contained a correct
binary search algorithm. I haven’t checked, but I hope we are doing better
now in 2023. Unfortunately, off-by-one errors feel rather timeless to me.</p>
</aside>
In this article I will provide an alternative implementation based on similar
ideas but with a very different interpretation</em> that is (in my opinion)
incredibly elegant and clear to understand, at least as far as binary searches
go. The resulting implementation also saves a comparison in almost every case
and ends up quite a bit smaller.</p>
A brief history lesson</a></h3>
Feel free to skip</a> this section if you are not interested in history, but I had
to find out whose shoulders we are standing on. This is not only to give credit
where credit is due, but also to see if any useful details were lost in
translation.</p>
Malte Skarupke says they learned about the above algorithm from
Alex Muscar in their blog post</a>.
Alex says they found the algorithm while reading Jon L. Bentley’s</a> book
Writing Efficient Programs</em> (ISBN 0-13-970244-X). Jon Bentley writes:</p>

If we need more speed then we should consult Knuth’s [1973] definitive treatise on
searching. Section 6.2.1 discusses binary search, and Exercise 6.2.1-11 describes
an extremely efficient binary search program; […]</p>
</blockquote>
I own a hard copy of the referenced book, Donald Knuth’s The Art of Computer Programming</a>
(also known as TAOCP), volume 3, Sorting and Searching. In my edition Exercise 6.2.1-11 is not
the right exercise, but exercises 12 and 13 are: both
refer to “Shar’s method”.</p>
We have to scan chapter 6.2.1 to find the mentioned method. Finally, we find it
on page 416. First as context, Knuth uses the following notation for binary search:</p>

Algorithm U</strong> (Uniform binary search</em>). Given a table of records
$R_1, R_2, \dots, R_N$ whose keys are in increasing order $K_1 < K_2 < \cdots < K_N$,
this algorithm searches for a given argument $K$. If $N$ is even, the
algorithm will sometimes refer to a dummy key $K_0$ that should be set to $-\infty$.
We assume that $N \geq 1$.</p>
</blockquote>
Now we can finally see Shar’s method:</p>

Another modification of binary search, suggested in 1971 by L. E. Shar,
will be still faster on some computers, because it is uniform after the first
step, and it requires no table.
The first step is to compare $K$ with $K_i$, where $i = 2^k$, $k = \lfloor \lg N\rfloor$.
If $K < K_i$, we use a uniform search with the $\delta$’s equal to $2^{k-1},
2^{k-2}, \dots, 1, 0$. On the other hand, if $K > K_i$ we reset $i$ to $i' = N + 1 - 2^l$
where $l = \lceil\lg(N - 2^k + 1)\rceil$, and pretend that the first
comparison was actually $K > K_{i'},$ using a uniform search with the
$\delta$’s equal to $2^{l-1}, 2^{l-2}, \dots, 1, 0$.</p>
</blockquote>
The $\delta$’s the first paragraph refers to can be understood as the ‘step’
variable in the above C++ code. Overall Skarupke’s C++ code seems a fairly
faithful implementation of Shar’s method as described by Knuth, except that
Knuth uses one-based indexing which Skarupke’s method does not take into account.
Knuth goes on to describe that Shar’s method never makes more than $\lfloor \lg
N \rfloor + 1$ comparisons, which is one more than the minimum possible number
of comparisons.</p>

I also find it interesting that Knuth mentions a further speed-up of binary search
to be found in exercise 24. Exercise 24 essentially hints at using an implicit
tree similar to a binary heap where the children of node $i$ are found at $2i$ and $2i + 1$.
This is nowadays known as the Eytzinger layout</a>,
which is a much better layout for binary search if you can decide the order of your elements.</p>
</aside>
To finish the history lesson, I did look on Google Scholar, but I could not find a 1971 paper by L. E. Shar.
I assume the modification was described in private communication with Knuth.</p>
Lower bounds</a></h2>
Let us assume that we have a zero-indexed array $A$ of length $n$ that is in
ascending order: $A[0] \leq A[1] \leq \cdots \leq A[n-1]$. We want to find the
lower bound</em> of some element $x$ in this array. This is the leftmost position
we could insert $x$ into the array while keeping it sorted. Alternatively
phrased, this is the number of elements strictly less than $x$ in the array.</p>
A traditional binary search algorithm finds this number by keeping a range of
possible solutions, repeatedly cutting that range in two pieces and
selecting the only piece which contains the solution. This tends to be tricky
to get right, as you must avoid overflows while computing the midpoint, and
are dealing with multiple boundary conditions, both in your code as well as
in your correctness invariant.</p>
Before we begin with our solution that avoids this, we have to take a moment and
understand an important aspect of binary search. With each comparison we can
distinguish between two sets of outcomes. Thus with $k$ comparisons, we can
distinguish between $2^k$ total outcomes. However, for $n$ elements, there are
$n+1$ outcomes! For example, for an array of 7 elements there are 8 positions in
which $x$ could be sorted: </p>
Thus the natural array size for binary search is $2^k - 1$, and
not $2^k$.</p>
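The counting argument above can be sketched in a few lines. This is only an illustration: `min_comparisons` is a hypothetical helper name, not something from the benchmark code, and it assumes `n + 1` does not overflow.

```cpp
#include <cassert>
#include <cstddef>

// Smallest k such that k two-way comparisons can tell apart the n + 1
// possible lower-bound outcomes of an n-element array, i.e. 2^k >= n + 1.
// Hypothetical helper for illustration; assumes n + 1 does not overflow.
std::size_t min_comparisons(std::size_t n) {
    std::size_t k = 0;
    while ((std::size_t(1) << k) < n + 1) ++k;
    return k;
}

// A few spot checks of the "natural size" claim.
void check_min_comparisons() {
    assert(min_comparisons(7) == 3);  // 8 outcomes fit exactly in 3 comparisons
    assert(min_comparisons(8) == 4);  // 9 outcomes need a 4th comparison
}
```

Seven elements use all eight outcomes of three comparisons, while an eighth element forces a fourth comparison that is mostly wasted.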
A bitwise approach</a></h2>
Let’s take a look at the sixteen possible 4-bit integers in binary:</p>
 0 = 0000      8 = 1000 
</span> 1 = 0001      9 = 1001 
</span> 2 = 0010     10 = 1010 
</span> 3 = 0011     11 = 1011 
</span> 4 = 0100     12 = 1100 
</span> 5 = 0101     13 = 1101 
</span> 6 = 0110     14 = 1110 
</span> 7 = 0111     15 = 1111 
</span></code></pre>
Notice how if the top bit of the integer is set, it remains set for all larger
integers. And within each group with the same top bit, when the second most
significant bit is set, it remains set for larger integers within that group.
And so on for the third bit within each group with the same top two bits, ad
infinitum. In binary, within each group of integers sharing the same top $t$ bits, the
value of the $(t+1)$th bit is monotonically non-decreasing.</strong></p>
Since our desired solution is the number of elements strictly less than $x,$
we can rephrase it as finding the largest number $b$ such that ${A[b-1] < x},$
or $b = 0$ if no elements are less than $x$.
We can find the unique $b$ very efficiently by constructing it directly</strong>, bit-by-bit,
using the above observation.</p>
Let’s assume that $A$ has length $n = 2^k - 1$. Then any possible answer $b$
fits exactly in $k$ bits. Since $A$ is sorted, if we find that $A[i-1] < x$, we
know that ${b \geq i}$. Conversely, if that comparison fails, we know that ${b <
i}.$ Thus, using the above observation we can figure out if the top bit of $b$
is set simply by testing $A[i-1] < x$ with $i = 2^{k-1}$.</p>
Now we know what the top bit should be and set it accordingly, never changing it
again. Using the above observation, this time within the group of
integers with a given top bit, we know that if we set the second bit and find
that $A[i-1] < x$ still holds, that the second bit must be set, and if not it
must be zero. We repeat this process bit-by-bit until we have figured out all
bits of $b$, giving our answer!</p>
Perhaps you are like me, and you are a visual thinker. Let us flip our
earlier tree on its side and visually associate a binary $b$
value with each gap between the elements of our array:</p>
</p>
The small red arrows indicate which element $A[b-1] < x$ would test
for a given guess of $b$. Note that no element is associated with $b = 0$,
as we can only end up with this value if all other tests failed, and thus
we never have to test this value. A search for $5$ would end up testing
${b = {\color{red}1}00_2}$ (success, set bit), ${b = 1{\color{red}1}0_2}$ (fail, do not set bit) and ${b = 10{\color{red}1}_2}$ (success, set bit).</p>

Now, you might argue this ‘bitwise’ binary search is still doing the traditional
splitting of ranges, just implicitly. And looking at the above binary tree, of
course you would be right. But to me the interpretation of the algorithm
through a bit-by-bit decoding of $b$ is more elegant, and easier to see as
correct, with much less worry about boundary conditions and off-by-one
errors.</p>
</aside>
In C++ we would get the following:</p>
// Only works for n = 2^k - 1.
</span>template</span><</span>typename</span> It, </span>typename</span> T, </span>typename</span> Cmp>
</span>It lower_bound_2k1(It begin, It end, const T& value, Cmp comp) {
</span>    size_t two_k = (end - begin) + </span>1</span>;
</span>    size_t b = </span>0</span>;
</span>    </span>for </span>(size_t bit = two_k >> </span>1</span>; bit != </span>0</span>; bit >>= </span>1</span>) {
</span>        </span>if </span>(comp(begin[(b | bit) - </span>1</span>], value)) b |= bit;
</span>    }
</span>    </span>return</span> begin + b;
</span>}
</span></code></pre>
Note that we always do exactly $k$ comparisons, which is optimal.</p>
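To make the bit-by-bit construction concrete, here is a small traced variant. It is a sketch under assumptions: it works on indices rather than iterators, `lower_bound_2k1_traced` and `trace_search_for_5` are hypothetical names, and the array `{0, 1, ..., 6}` is example data chosen so that a search for 5 follows the success, fail, success pattern described earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index-based variant of the loop above that also records which indices it
// probes. Only valid when a.size() is of the form 2^k - 1.
std::size_t lower_bound_2k1_traced(const std::vector<int>& a, int x,
                                   std::vector<std::size_t>& probes) {
    std::size_t two_k = a.size() + 1;
    std::size_t b = 0;
    for (std::size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        std::size_t i = (b | bit) - 1;  // element tested for the guess b | bit
        probes.push_back(i);
        if (a[i] < x) b |= bit;         // keep the bit only if the test succeeds
    }
    return b;
}

// Example data is an assumption chosen for illustration.
std::vector<std::size_t> trace_search_for_5() {
    const std::vector<int> a = {0, 1, 2, 3, 4, 5, 6};
    std::vector<std::size_t> probes;
    std::size_t b = lower_bound_2k1_traced(a, 5, probes);
    assert(b == 5);  // five elements are strictly less than 5
    return probes;
}
```

For this data the probes are the indices 3, 5 and 4, which correspond to the guesses $100_2$, $110_2$ and $101_2$: exactly three comparisons for a seven-element array.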
Generalizing to other sizes</a></h2>
However, there is a glaring issue: our original array might not have length
$2^k - 1$. The simplest way to solve this is to add elements with value
$\infty$ to the end, to pad the array out to $2^k - 1$ elements. 
Instead of physically adding $\infty$ elements to the array, we can simply
check if the index lies in the original array, and if not skip our test
entirely, as it would fail (we’d be testing if $\infty < x$).</p>
To pad our array out we want to find the smallest integer $k$ such that $2^k - 1 \geq n$,
which means $k \geq \log_2(n + 1)$, which after rounding up gives
$$k = \lceil \log_2(n + 1) \rceil = \lfloor \log_2(n) \rfloor + 1.$$</p>
Alternatively, and definitely more enlightening, this can be understood as
initializing bit</code> in our loop to the highest set bit in $n$:</p>
template</span><</span>typename</span> It, </span>typename</span> T, </span>typename</span> Cmp>
</span>It lower_bound_pad(It begin, It end, const T& value, Cmp comp) {
</span>    size_t n = end - begin;
</span>    size_t b = </span>0</span>;
</span>    </span>for </span>(size_t bit = std::bit_floor(n); bit != </span>0</span>; bit >>= </span>1</span>) {
</span>        size_t i = (b | bit) - </span>1</span>;
</span>        </span>if </span>(i < n && comp(begin[i], value)) b |= bit;
</span>    }
</span>    </span>return</span> begin + b;
</span>}
</span></code></pre>
In my opinion this is the most elegant binary search implementation there is.</p>
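A quick way to gain confidence in the padded version is to compare it against `std::lower_bound` for every small size and every query. The sketch below is an assumption-laden test harness, not the article's benchmark: it uses indices instead of iterators, and `floor_pow2` is a hypothetical stand-in for `std::bit_floor` so that it also builds without C++20.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for std::bit_floor: largest power of two <= n, or 0.
std::size_t floor_pow2(std::size_t n) {
    std::size_t p = 0;
    for (std::size_t q = 1; q != 0 && q <= n; q <<= 1) p = q;
    return p;
}

// Index-based version of lower_bound_pad for a sorted vector of ints.
std::size_t lower_bound_pad_index(const std::vector<int>& a, int x) {
    std::size_t n = a.size();
    std::size_t b = 0;
    for (std::size_t bit = floor_pow2(n); bit != 0; bit >>= 1) {
        std::size_t i = (b | bit) - 1;
        if (i < n && a[i] < x) b |= bit;  // out-of-range acts like an infinite element
    }
    return b;
}

// Compares against std::lower_bound for all sizes up to max_n and all
// queries that fall on or between the elements 1, 3, 5, ...
bool pad_matches_std(std::size_t max_n) {
    for (std::size_t n = 0; n <= max_n; ++n) {
        std::vector<int> a(n);
        for (std::size_t i = 0; i < n; ++i) a[i] = 2 * int(i) + 1;
        for (int x = 0; x <= 2 * int(n) + 1; ++x) {
            std::size_t expected =
                std::size_t(std::lower_bound(a.begin(), a.end(), x) - a.begin());
            if (lower_bound_pad_index(a, x) != expected) return false;
        }
    }
    return true;
}
```

Note that the empty array needs no special case here: `floor_pow2(0)` is zero, so the loop body never runs and the result is index 0.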

Making it branchless</a></h2>
The above works well, but introduces an index check before each array access.
This means the compiler cannot eliminate the
branch</a> here, lest we
access out-of-bounds memory.</p>
To solve this problem we use a similar trick as L. E. Shar: we do an initial
comparison with the middle element, then either look at $2^k - 1$ elements at
the start, or $2^k - 1$ elements at the end of the array. If the array size
itself isn’t of the form $2^k - 1$, these two subslices overlap in the middle.
To completely cover our array (together with the element we initially compare
with) we must have $$(2^k - 1) + (2^k - 1) + 1 = 2^{k+1} - 1 \geq n$$ and thus
we choose $k = \lceil \log_2(n + 1) - 1 \rceil = \lfloor \log_2 (n) \rfloor$ :</p>
template</span><</span>typename</span> It, </span>typename</span> T, </span>typename</span> Cmp>
</span>It lower_bound_overlap(It begin, It end, const T& value, Cmp comp) {
</span>    size_t n = end - begin;
</span>    </span>if </span>(n == </span>0</span>) </span>return</span> begin;
</span>
</span>    size_t two_k = std::bit_floor(n);
</span>    </span>if </span>(comp(begin[n / </span>2</span>], value)) begin = end - (two_k - </span>1</span>);
</span>    
</span>    size_t b = </span>0</span>;
</span>    </span>for </span>(size_t bit = two_k >> </span>1</span>; bit != </span>0</span>; bit >>= </span>1</span>) {
</span>        </span>if </span>(comp(begin[(b | bit) - </span>1</span>], value)) b |= bit;
</span>    }
</span>    </span>return</span> begin + b;
</span>}
</span></code></pre>
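To see how the two windows sit inside the array, the following sketch computes them for a given size. The struct and function names are hypothetical, and `floor_pow2` again stands in for `std::bit_floor`; it mirrors the index arithmetic of the code above rather than replacing it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for std::bit_floor: largest power of two <= n, or 0.
std::size_t floor_pow2(std::size_t n) {
    std::size_t p = 0;
    for (std::size_t q = 1; q != 0 && q <= n; q <<= 1) p = q;
    return p;
}

struct OverlapWindows {
    std::size_t pivot;        // index of the initial comparison, n / 2
    std::size_t left_begin;   // window searched when A[pivot] < x is false
    std::size_t right_begin;  // window searched when A[pivot] < x is true
    std::size_t window_len;   // both windows hold 2^k - 1 elements
};

// Mirrors the index arithmetic of lower_bound_overlap; requires n >= 1.
OverlapWindows overlap_windows(std::size_t n) {
    std::size_t two_k = floor_pow2(n);
    OverlapWindows w;
    w.pivot = n / 2;
    w.left_begin = 0;
    w.right_begin = n - (two_k - 1);
    w.window_len = two_k - 1;
    return w;
}
```

For `n = 5` this gives the windows `[0, 3)` and `[2, 5)`, which share index 2. For `n = 7` the windows are `[0, 3)` and `[4, 7)` with the pivot at index 3 between them, so nothing overlaps.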
Improving the efficiency</a></h3>
If our array doesn’t have size $2^{k+1} - 1$, the subarrays overlap in the
middle. This means that part of the subarrays already eliminated by the initial
comparison $A[n / 2] < x$ are being unnecessarily searched. Can we improve on
this?</p>
We can, if we choose two different sizes $2^l - 1$ and $2^r - 1$ for when
we are respectively searching at the start (left) or end (right) of the array.
Again, in combination with the initial element we compare with (which is now
$A[2^l - 1] < x$) we find that we must have</p>
$$(2^l - 1) + (2^r - 1) + 1 = 2^l + 2^r - 1 \geq n$$</p>
to be able to handle an arbitrary size $n$. And of course, we must have
$2^l - 1 \leq n$ and $2^r - 1 \leq n$ for our subarrays to fit. Let’s find the
optimal choice for $l, r$—which is not trivial.</p>
Cost analysis</a></h4>
What is the cost of a certain choice of $l, r$, assuming a uniform distribution
over the $n + 1$ possible outcomes of the binary search? We know that after
the initial comparison, for $2^l$ of those outcomes we use $l$ comparisons,
and thus the rest must use $r$ comparisons for an expected cost of</p>
$$C = 1 + \frac{2^l}{n + 1}\cdot l + \frac{n + 1 - 2^l}{n + 1}\cdot r$$</p>
We only really care about minimizing this cost, so we can throw out the additional
constant $+1$ and the factor $1/(n+1)$ as it does not change the relative order:</p>
$$C' = 2^l\cdot l + (n + 1 - 2^l)\cdot r$$</p>
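As a quick sanity check of the cost formula, take $n = 5$, so there are $n + 1 = 6$ outcomes, and compare two choices that both satisfy $2^l + 2^r - 1 \geq n$. The numbers are only an illustration and are not taken from the benchmark:

$$C_{l=1,\,r=2} = 1 + \frac{2}{6}\cdot 1 + \frac{4}{6}\cdot 2 = \frac{8}{3} \approx 2.67$$
$$C_{l=2,\,r=2} = 1 + \frac{4}{6}\cdot 2 + \frac{2}{6}\cdot 2 = 3$$

The corresponding relative costs are $2\cdot 1 + 4\cdot 2 = 10$ and $4\cdot 2 + 2\cdot 2 = 12$, which order the two choices the same way.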
Optimizing $r$</a></h4>
Compare cost $C'$ to the expression $2^l \cdot l + 2^{r}\cdot r$. That
expression is entirely symmetrical, so in it we could freely swap $l$ and $r$. But
we know from our earlier array size inequality that $n + 1 - 2^l \leq 2^{r}$, so
in $C'$ the weight on $r$ is at most $2^r$. Thus $l$ has a greater weight in the cost than $r$, and
we can safely assume that the minimal-cost choice has $l \leq r$.</p>
From this plus the fact that $2^l + 2^{r} - 1 \geq n$
we can immediately deduce that $2^{r} \geq (n + 1) / 2$ by weakening the
inequality with $2^l \to 2^r$, and thus rounding up to the nearest integer gives
\begin{align*}
r &= \lceil\log_2(n+1) - 1\rceil = \lfloor\log_2(n)\rfloor
\end{align*}</p>
Note that we can’t choose $r$ any larger, nor smaller, and thus we’ve
determined the optimal value for $r$.</p>

What’s not shown here is the exploratory process to get to this proof
of optimality.
I tend to write small scripts</a>
to brute-force some optimal data. I often plug these
brute-forced values into the OEIS</a> to find references
and patterns.</p>
</aside>
Optimizing $l$</a></h4>
Let’s reorder our relative cost $C'$ a bit:</p>
$$C' = 2^l\cdot (l - r) + (n + 1)\cdot r$$</p>
We can ignore the second term as a constant, as we’re now trying to optimize $l$
given the optimal $r$. The function</p>
$$f(l) = 2^l(l - r)$$
has derivative w.r.t. $l$
$$f'(l) = 2^l(\ln(2)(l - r) + 1)$$
with a single zero corresponding to the global minimum at $r - \frac{1}{\ln(2)} \approx r - 1.4427$.
Let’s plug in the two integers closest to this minimum in $f$:</p>
$$f(r - 1) = 2^{r - 1}(r - 1 - r) = - 2^{r-1}$$
$$f(r - 2) = 2^{r - 2}(r - 2 - r) = - 2^{r-1}$$</p>
Thus we find that both $l = r - 1$ and $l = r - 2$ have optimal cost. For
simplicity we can limit ourselves to $l = r - 1$, as it has equal cost but makes
$2^l + 2^r - 1 \geq n$ easier to satisfy. Speaking of that inequality,
we can’t always choose $l = r - 1$: when it would be violated we are forced to
choose $l = r$ instead.</p>
Putting it together</a></h3>
We found that $r = \lfloor \log_2(n) \rfloor$, and that</p>
$$l = \begin{cases}
r-1&\text{if }2^r + 2^{r-1} - 1 \geq n\\
r&\text{otherwise}
\end{cases}.$$</p>

Generating the optimal $l$ using the formula we found and plugging the
numbers we get into the OEIS we find A099396</a>(n)
$= \left\lfloor \log_2\left(2n/3 \right) \right\rfloor.$</p>
This makes sense as the optimal $l$ is incremented every time we have a
size of the form $n = 2^k + 2^{k-1} = 2^k(1 + 1/2)$ and thus $2^k = 2n/3$.</p>
</aside>
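In the spirit of brute-forcing optimal data, the closed form for $l$ and $r$ can be checked against an exhaustive search over all feasible pairs. The sketch below is not the script used for the article: every function name is hypothetical, and it compares costs rather than the pairs themselves because ties (such as $l = r - 1$ versus $l = r - 2$) are expected.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// floor(log2(n)) for n >= 1. Hypothetical helper.
unsigned floor_log2(std::uint64_t n) {
    unsigned e = 0;
    while (n >>= 1) ++e;
    return e;
}

// Relative cost C' = 2^l * l + (n + 1 - 2^l) * r from the cost analysis.
std::uint64_t relative_cost(std::uint64_t n, unsigned l, unsigned r) {
    std::uint64_t two_l = std::uint64_t(1) << l;
    return two_l * l + (n + 1 - two_l) * r;
}

// Minimizes C' over every (l, r) with 2^l - 1 <= n, 2^r - 1 <= n and
// 2^l + 2^r - 1 >= n. Requires n >= 1.
std::uint64_t brute_force_cost(std::uint64_t n) {
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    unsigned max_exp = floor_log2(n + 1);  // largest e with 2^e - 1 <= n
    for (unsigned l = 0; l <= max_exp; ++l) {
        for (unsigned r = 0; r <= max_exp; ++r) {
            std::uint64_t two_l = std::uint64_t(1) << l;
            std::uint64_t two_r = std::uint64_t(1) << r;
            if (two_l + two_r - 1 < n) continue;  // windows would not cover the array
            best = std::min(best, relative_cost(n, l, r));
        }
    }
    return best;
}

// Cost of the closed-form choice: r = floor(log2 n), l = r - 1 if allowed.
std::uint64_t closed_form_cost(std::uint64_t n) {
    unsigned r = floor_log2(n);
    std::uint64_t two_r = std::uint64_t(1) << r;
    bool can_shrink = r >= 1 && two_r + (two_r >> 1) - 1 >= n;
    unsigned l = can_shrink ? r - 1 : r;
    return relative_cost(n, l, r);
}

bool closed_form_matches_brute_force(std::uint64_t max_n) {
    for (std::uint64_t n = 1; n <= max_n; ++n)
        if (closed_form_cost(n) != brute_force_cost(n)) return false;
    return true;
}
```

Calling `closed_form_matches_brute_force` with a few thousand sizes is cheap, since each size only has on the order of $\log^2 n$ candidate pairs.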
The condition $2^r + 2^{r-1} - 1 \geq n$ can be seen to be equivalent to “the
$r-1$th bit of $n$ is not set”. And as $2^r - 2^{r-1} = 2^{r-1}$ we can isolate
that bit, negate it, and subtract it from two_r</code> to get our two_l</code>:</p>
template</span><</span>typename</span> It, </span>typename</span> T, </span>typename</span> Cmp>
</span>It lower_bound_opt(It begin, It end, const T& value, Cmp comp) {
</span>    size_t n = end - begin;
</span>    </span>if </span>(n == </span>0</span>) </span>return</span> begin;
</span>
</span>    size_t two_r = std::bit_floor(n);
</span>    size_t two_l = two_r - ((two_r >> </span>1</span>) & ~n);
</span>    </span>bool</span> use_r = comp(begin[two_l - </span>1</span>], value);
</span>    size_t two_k = use_r ? two_r : two_l;
</span>    begin = use_r ? end - (two_r - </span>1</span>) : begin;
</span>
</span>    size_t b = </span>0</span>;
</span>    </span>for </span>(size_t bit = two_k >> </span>1</span>; bit != </span>0</span>; bit >>= </span>1</span>) {
</span>        </span>if </span>(comp(begin[(b | bit) - </span>1</span>], value)) b |= bit;
</span>    }
</span>    </span>return</span> begin + b;
</span>}
</span></code></pre>
The somewhat odd use of ternary statements and use_r</code> is to convince the
compiler to generate branchless code. We certainly lost some of the elegance we
had before, but at least now we do the minimal number of comparisons we can do
with our bitwise binary search while being branchless. And it is in fact better
than L. E. Shar’s original method, whose initial comparison $A[i - 1] < x$
uses $i = 2^{\left\lfloor \log_2 (n) \right\rfloor}$, which is suboptimal as we’ve
seen.</p>
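The bit trick for `two_l` is easy to get subtly wrong, so here is a small sketch that compares it with the case distinction it is meant to implement. The function names are hypothetical and `floor_pow2` stands in for `std::bit_floor`; both functions assume `n >= 1`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for std::bit_floor: largest power of two <= n, or 0.
std::size_t floor_pow2(std::size_t n) {
    std::size_t p = 0;
    for (std::size_t q = 1; q != 0 && q <= n; q <<= 1) p = q;
    return p;
}

// 2^l from the case distinction: l = r - 1 if 2^r + 2^(r-1) - 1 >= n, else l = r.
std::size_t two_l_by_cases(std::size_t n) {
    std::size_t two_r = floor_pow2(n);
    std::size_t half = two_r >> 1;  // 2^(r-1), or 0 when r = 0
    if (half != 0 && two_r + half - 1 >= n) return half;
    return two_r;
}

// 2^l from the bit trick: subtract 2^(r-1) exactly when that bit of n is clear.
std::size_t two_l_by_bits(std::size_t n) {
    std::size_t two_r = floor_pow2(n);
    return two_r - ((two_r >> 1) & ~n);
}

bool two_l_formulas_agree(std::size_t max_n) {
    for (std::size_t n = 1; n <= max_n; ++n)
        if (two_l_by_cases(n) != two_l_by_bits(n)) return false;
    return true;
}
```

For example, `n = 5` (binary `101`) has the second-highest bit clear, so `two_l` is 2, while `n = 6` (binary `110`) has it set, so `two_l` stays at 4.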
Micro-optimizations</a></h2>
For some reason the standard implementation of
std::bit_floor</code></a>… sucks
a bit. E.g. on x86-64 Clang 16.0</a> with
-O2</code> we compile this:</p>
size_t bit_floor(size_t n) {
</span>    </span>if </span>(n == </span>0</span>) </span>return </span>0</span>;
</span>    </span>return </span>std::bit_floor(n);
</span>}
</span>
</span>size_t bit_floor_manual(size_t n) {
</span>    </span>if </span>(n == </span>0</span>) </span>return </span>0</span>;
</span>    </span>return </span>size_t(</span>1</span>) << (std::bit_width(n) - </span>1</span>);
</span>}
</span></code></pre>
to this:</p>
bit_floor(unsigned long):
</span>        </span>test    </span>rdi, rdi
</span>        </span>je      </span>.LBB0_1
</span>        </span>shr     </span>rdi
</span>        </span>je      </span>.LBB0_3
</span>        </span>bsr     </span>rcx, rdi
</span>        </span>xor     </span>rcx, </span>63
</span>        </span>jmp     </span>.LBB0_5
</span>.LBB0_1:
</span>        </span>xor     </span>eax, eax
</span>        </span>ret
</span>.LBB0_3:
</span>        </span>mov     </span>ecx, </span>64
</span>.LBB0_5:
</span>        </span>neg     </span>cl
</span>        </span>mov     </span>eax, </span>1
</span>        </span>shl     </span>rax, cl
</span>        </span>ret
</span>
</span>bit_floor_manual(unsigned long):
</span>        </span>bsr     </span>rcx, rdi
</span>        </span>mov     </span>eax, </span>1
</span>        </span>shl     </span>rax, cl
</span>        </span>test    </span>rdi, rdi
</span>        </span>cmove   </span>rax, rdi
</span>        </span>ret
</span></code></pre>
Yikes. Manual computation it is!</p>
Optimizing the tight loop</a></h3>
The astute observer might have noticed that in the following loop, we only
ever set each bit in b</code> at most once:</p>
size_t b = </span>0</span>;
</span>for </span>(size_t bit = two_k >> </span>1</span>; bit != </span>0</span>; bit >>= </span>1</span>) {
</span>    </span>if </span>(comp(begin[(b | bit) - </span>1</span>], value)) b |= bit;
</span>}
</span>return</span> begin + b;
</span></code></pre>
This means we could change the bitwise OR to simple addition, which might
optimize better in pointer calculations.</p>

Various architectures have specialized instructions aimed at pointer arithmetic,
like x86’s LEA instruction. Using bitwise instructions might prevent the compiler
from using these.</p>
</aside>
For the bitwise version we see the following tight loop for the above,
in x86-64</code> with GCC:</p>
.L7:
</span>        </span>mov     </span>rsi, rdx
</span>        </span>or      </span>rsi, rcx
</span>        </span>cmp     </span>DWORD PTR [rax-</span>4</span>+rsi*</span>4</span>], r8d
</span>        </span>cmovb   </span>rcx, rsi
</span>        </span>shr     </span>rdx
</span>        </span>jne     </span>.L7
</span></code></pre>
With addition we see the following:</p>
.L7:
</span>        </span>lea     </span>rsi, [rdx+rcx]
</span>        </span>cmp     </span>DWORD PTR [rax-</span>4</span>+rsi*</span>4</span>], r8d
</span>        </span>cmovb   </span>rcx, rsi
</span>        </span>shr     </span>rdx
</span>        </span>jne     </span>.L7
</span></code></pre>
In fact, when using addition we could eliminate variable $b$ entirely,
and directly add to begin</code> (similar to Skarupke’s original version that sparked
this article):</p>
for </span>(size_t bit = two_k >> </span>1</span>; bit != </span>0</span>; bit >>= </span>1</span>) {
</span>    </span>if </span>(comp(begin[bit - </span>1</span>], value)) begin += bit;
</span>}
</span>return</span> begin;
</span></code></pre>
However, I’ve found that some compilers, e.g. GCC on x86-64 will refuse to make
this variant branchless. I hate how fickle compilers can be sometimes, and I
wish compilers exposed not just the
likely</code>/unlikely</code></a>
attributes, but also an attribute that allows you to mark something as unpredictable</code>
to nudge the compiler towards using branchless techniques like CMOV’s.</p>
Instead of eliminating b</code>, we can optimize the loop to only do a single addition
explicitly, by moving the -1</code> into the value of b</code> itself:</p>
size_t b = -</span>1</span>;
</span>for </span>(size_t bit = two_k >> </span>1</span>; bit != </span>0</span>; bit >>= </span>1</span>) {
</span>    </span>if </span>(comp(begin[b + bit], value)) b += bit;
</span>}
</span>return</span> begin + (b + </span>1</span>);
</span></code></pre>
Yay for two’s complement and integer overflow! This generated the best code on
all platforms I’ve looked at, so I applied this pattern to all my
implementations in the benchmark.</p>
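The claim that the offset form behaves identically to the OR form can be checked mechanically. The sketch below uses hypothetical names and index-based loops over a `std::vector<int>`; it relies only on the fact that arithmetic on the unsigned type `std::size_t` wraps around, which is well-defined in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original form: b starts at 0 and every probe subtracts 1.
std::size_t search_or(const std::vector<int>& a, std::size_t two_k, int x) {
    std::size_t b = 0;
    for (std::size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (a[(b | bit) - 1] < x) b |= bit;
    }
    return b;
}

// Offset form: the -1 is folded into b, so b starts at the wrapped value
// size_t(-1) and b + bit wraps back into range before indexing.
std::size_t search_offset(const std::vector<int>& a, std::size_t two_k, int x) {
    std::size_t b = std::size_t(-1);
    for (std::size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (a[b + bit] < x) b += bit;
    }
    return b + 1;
}

// Compares both loops on arrays of size 2^k - 1 for k = 1..max_k.
bool offset_loop_matches_or_loop(unsigned max_k) {
    for (unsigned k = 1; k <= max_k; ++k) {
        std::size_t two_k = std::size_t(1) << k;
        std::vector<int> a(two_k - 1);
        for (std::size_t i = 0; i < a.size(); ++i) a[i] = 2 * int(i) + 1;
        for (int x = 0; x <= 2 * int(two_k); ++x) {
            if (search_or(a, two_k, x) != search_offset(a, two_k, x)) return false;
        }
    }
    return true;
}
```

The wrapped index is never used directly: the smallest value ever passed to `operator[]` is `size_t(-1) + bit`, which is `bit - 1` after wraparound.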

You might wonder why we don’t simply decrement begin</code> instead of changing b</code>.
Unfortunately, the C++ standard discriminates against one-before-begin pointers.
One-after-end of array pointers are allowed, but creating a pointer before the
beginning of an allocation is undefined
behavior</a>, even if you never dereference
that pointer.</p>
</aside>
Results</a></h2>
Let’s compare all the variants we’ve made, both in comparisons and actual
runtime. The latter I will test on my Apple M1 2021 Macbook Pro which is an ARM
machine. Your mileage will</strong> vary on different machines, especially x86-64
machines, but I want this article to focus more on the algorithmic side of
things rather than become an extensive study on the exact characteristics of
branch mispredictions, cache misses, and how to get the compiler to generate
branchless code for a variety of platforms.</p>
The code for the below benchmark is available on my Github</a>.</p>
Comparisons</a></h3>
To test the average number of comparisons for size $n$ we can simply query for
each of the $n + 1$ outcomes how many comparisons it takes to get that outcome.
We then average over all these for a given $n$. The result is the following
graph:</p>
</p>
We see that lower_bound_opt</code> does in fact do the fewest comparisons of
all the branchless methods, following the optimal lower_bound_std</code>
more closely than lower_bound_pad</code> or lower_bound_skarupke</code>.</p>
Across all sizes less than 256 we see the following average comparison counts,
minus the optimal comparison count:</p>
Algorithm</th> Comparisons above optimal</th></tr></thead>

lower_bound_skarupke</code></td> 1.17835</td></tr>
lower_bound_overlap</code></td> 0.37250</td></tr>
lower_bound_pad</code></td> 0.17668</td></tr>
lower_bound_opt</code></td> 0.17238</td></tr>
lower_bound_std</code></td> 0.00000</td></tr>
</tbody></table>
All our hard work finding the optimal split into subarrays only saved us ~0.2
comparisons on average on lower_bound_opt</code> versus the much simpler
lower_bound_overlap</code>.</p>
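The counting procedure described at the start of this section can be sketched as follows for the padded variant. This is an illustration under assumptions, not the code from the benchmark repository: the names are hypothetical, the array `1, 3, 5, ...` is chosen so that querying `2 * j` produces outcome `j`, and only calls to the element comparison are counted, not the index check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for std::bit_floor: largest power of two <= n, or 0.
std::size_t floor_pow2(std::size_t n) {
    std::size_t p = 0;
    for (std::size_t q = 1; q != 0 && q <= n; q <<= 1) p = q;
    return p;
}

// Average number of element comparisons the padded search performs over
// all n + 1 possible outcomes for an array of size n.
double average_comparisons_pad(std::size_t n) {
    std::vector<int> a(n);
    for (std::size_t i = 0; i < n; ++i) a[i] = 2 * int(i) + 1;  // 1, 3, 5, ...

    std::size_t total = 0;
    for (std::size_t outcome = 0; outcome <= n; ++outcome) {
        int x = 2 * int(outcome);  // exactly `outcome` elements are < x
        std::size_t b = 0;
        for (std::size_t bit = floor_pow2(n); bit != 0; bit >>= 1) {
            std::size_t i = (b | bit) - 1;
            if (i < n) {
                ++total;                 // one element comparison
                if (a[i] < x) b |= bit;
            }
        }
        assert(b == outcome);            // the search found the intended outcome
    }
    return double(total) / double(n + 1);
}
```

For `n = 7` every outcome costs exactly three comparisons, so the average is 3. For sizes that are not of the form $2^k - 1$ some probes fall outside the array and are skipped, which is why the average dips below $\lfloor \log_2(n) \rfloor + 1$.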
Runtime (32-bit integers)</a></h3>
To benchmark runtime for a certain size $n$ I pre-generated one million random
integers in the range $[0, n + 1]$. Then I record the time it takes to look them
all up using our lower bound routine of interest, and calculate the average.</p>

One may argue that querying the same $n$-size array a million times in a row
is unrealistic and that in the real world the array size might change. While
in some cases this is true, I believe that most of the time when you really
care about binary search performance, you are looking up values against a
(mostly) static array. If not, performance is dominated by the effort required
to keep the array sorted.</p>
</aside>
Using clang 13.0.0 with g++ -O2 -std=c++20</code> we get the following:</p>
</p>
I think this graph gives a fascinating insight into the branch predictor on the
Apple M1. Most striking is the relatively poor performance of lower_bound_opt</code>.
Within each bracket of sizes $[2^k, 2^{k+1})$ it performs much worse
than lower_bound_overlap</code>, with a size-dependent slope, before suddenly dropping to a
consistently good performance.</p>
This puzzled me for a while, and I triple-checked to see that lower_bound_opt</code>
was really being compiled with branchless instructions. Only then did I realize
there was a hidden branch all along: the loop exit condition.
lower_bound_overlap</code> always performs the same number of loop iterations,
allowing the CPU to always correctly predict the loop exit, whereas
lower_bound_opt</code> tries to reduce the number of iterations it does to save
comparisons. It turns out that for integers the cost of an extra iteration is
much lower than risking a mispredict on the loop condition on the Apple M1.</p>
If we also look at larger inputs we see that the above pattern keeps up
for quite a while until we start hitting sizes where cache effects become
a factor:</p>
</p>
We also note that it truly is important for a binary search benchmark to look at
a variety of sizes, as you might reach rather different conclusions in
performance at $n = 2^{12}$ versus $n = 2^{12} \cdot 1.5$.</p>

If you are interested in even larger sizes, I would strongly recommend
looking at cache-friendly layouts for binary
search</a> instead of a simple sorted array, as
cache misses are much worse still than the branch mispredictions we’ve been
focusing on so far.</p>
</aside>
A commonly heard piece of advice is to not use binary search for small arrays, but to
use a linear search instead. I find that not to be true on the Apple M1 for integers,
at least compared to my branchless binary search, when searching a runtime-sized
but otherwise fixed size array:</p>
</p>
Note that a linear search must always incur at least one branch misprediction:
on the loop exit condition. For a fixed size array lower_bound_overlap</code> has
zero branch mispredictions, including the loop exit.</p>
Runtime (strings)</a></h3>
To benchmark the performance on strings I copied the above benchmark, except
that I convert all integers to strings, zero-padded to a length of four.
I also reduced the number of samples to 300,000 per size, as the string
benchmark was significantly slower.</p>
Using clang 13.0.0 with g++ -O2 -std=c++20</code> we get the following:</p>
</p>
Strings are a lot less interesting than integers in this case, as most of the
branchless optimizations are moot. We find that initially the branchless
versions are only slightly slower than std::lower_bound</code> due to the extra
comparisons needed. However once we get to the larger-than-cache sizes
std::lower_bound</code> becomes significantly better as it can do speculative loads
to reduce cache misses.</p>
</p>
It seems that for strings the advice to use linear searches for small input
arrays doesn’t help that much, but doesn’t hurt either for $n \leq 9$,
on the Apple M1.</p>
Conclusion</a></h2>
In my opinion the bitwise binary search is an elegant alternative to the
traditional binary search method, at the cost of ~0.17 to ~0.37 extra comparisons
on average. It can be implemented in a branchless manner, which can be
significantly faster when searching elements with a branchless comparison
operator.</p>
In this article we found the following implementation to perform the best
on Apple M1 after all micro-optimizations are applied:</p>
template</span><</span>typename</span> It, </span>typename</span> T, </span>typename</span> Cmp>
</span>It lower_bound(It begin, It end, const T& value, Cmp comp) {
</span>    size_t n = end - begin;
</span>    </span>if </span>(n == </span>0</span>) </span>return</span> begin;
</span>
</span>    size_t two_k = size_t(</span>1</span>) << (std::bit_width(n) - </span>1</span>);
</span>    size_t b = comp(begin[n / </span>2</span>], value) ? n - two_k : -</span>1</span>;
</span>    </span>for </span>(size_t bit = two_k >> </span>1</span>; bit != </span>0</span>; bit >>= </span>1</span>) {
</span>        </span>if </span>(comp(begin[b + bit], value)) b += bit;
</span>    }
</span>    </span>return</span> begin + (b + </span>1</span>);
</span>}
</span></code></pre>
However, when it comes to clarity and elegance I still find the following
method the most beautiful:</p>
template</span><</span>typename</span> It, </span>typename</span> T, </span>typename</span> Cmp>
</span>It lower_bound(It begin, It end, const T& value, Cmp comp) {
</span>    size_t n = end - begin;
</span>    size_t b = </span>0</span>;
</span>    </span>for </span>(size_t bit = std::bit_floor(n); bit != </span>0</span>; bit >>= </span>1</span>) {
</span>        size_t i = (b | bit) - </span>1</span>;
</span>        </span>if </span>(i < n && comp(begin[i], value)) b |= bit;
</span>    }
</span>    </span>return</span> begin + b;
</span>}
</span></code></pre>
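If you want to experiment with this loop without a C++ toolchain, it ports almost line for line to Python. The following is only a sketch for checking behaviour, not a performance claim: the name `bitwise_lower_bound` is made up for this example, and `std::bit_floor` is emulated with `int.bit_length`.

```python
from bisect import bisect_left

def bitwise_lower_bound(arr, value):
    """Index of the first element in sorted `arr` that is not less than `value`."""
    n = len(arr)
    # Emulate std::bit_floor(n): the largest power of two <= n, or 0 if n == 0.
    bit = 1 << (n.bit_length() - 1) if n else 0
    b = 0
    while bit:
        i = (b | bit) - 1
        if i < n and arr[i] < value:
            b |= bit
        bit >>= 1
    return b

data = [1, 3, 3, 7, 9]
for v in range(11):
    assert bitwise_lower_bound(data, v) == bisect_left(data, v)
print("matches bisect_left")
```

Comparing against `bisect_left` on small arrays of every length is a cheap way to catch off-by-one mistakes when adapting the loop.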


The World's Smallest Hash Table
Orson R. L. Peters — 2023-03-04T00:00:00+00:00
This December I once again did the Advent of Code</a>,
in Rust. If you are interested, my solutions</a>
are on Github. I wanted to highlight one particular solution to the day 2
problem</a> as it is both optimized completely
beyond the point of reason yet contains a useful technique. For simplicity we’re
only going to do part 1 of the day 2 problem here, but the exact same techniques
apply to part 2.</p>
We’re going to start off slow, but stick around because at the end you should
have an idea what on earth this function is doing, how it works, how to make one
and why it’s the world’s smallest hash table:</p>
pub </span>fn </span>phf_shift(x: </span>u32</span>) -> </span>u8 </span>{
</span>    </span>let</span> shift = x.wrapping_mul(</span>0xa463293e</span>) >> </span>27</span>;
</span>    ((</span>0x824a1847</span>u32 </span>>> shift) & </span>0b11111</span>) as </span>u8
</span>}
</span></code></pre>
The problem</a></h2>
We receive a file where each line contains A</code>, B</code>, or C</code>, followed by
a space, followed by X</code>, Y</code>, or Z</code>. These are to be understood as choices in
a game of rock-paper-scissors</a> as such:</p>
A = X = Rock
</span>B = Y = Paper
</span>C = Z = Scissors
</span></code></pre>
The first letter (A</code>/B</code>/C</code>) indicates the choice of our opponent, the second
letter (X</code>/Y</code>/Z</code>) indicates our choice. We then compute a score, which has two
components:</p>


If we picked Rock we get 1 point, if we picked Paper we get 2 points, and 3 points if we picked Scissors.</p>
</li>

If we lose we gain 0 points, if we draw we gain 3 points, if we win we get 6 points.</p>
</li>
</ol>
As an example, if our input file looks as such:</p>
A Y
</span>B X
</span>C Z
</span></code></pre>
Our total score would be (2 + 6) + (1 + 0) + (3 + 3) = 15</code>.</p>
An elegant solution</a></h2>
A sane solution would verify that indeed our input lines have the format [ABC] [XYZ]</code>, before extracting those two letters. After converting these letters to
integers 0</code>, 1</code>, 2</code> by subtracting either the ASCII code for 'A'</code> or 'X'</code>
respectively we can immediately calculate the first component of our score as 1 + ours</code>.</p>
The second component is more involved, but can be elegantly solved using modular arithmetic</a>.
Note that if Rock = 0, Paper = 1, Scissors = 2 then we always have that choice ${(k + 1) \bmod 3}$
beats $k$. Alternatively, $k$ beats ${k - 1}$, modulo 3:</p>

</p>
</div>
If we divide the number of points that Advent of Code expects for a loss/draw/win by
three we find that a loss is $0$, a draw is $1$ and a win is $2$ points. From
these observations we can derive the following modular equivalence</p>
$$1 + \mathrm{ours} - \mathrm{theirs} \equiv \mathrm{points} \pmod 3.$$</p>
To see that it is correct, note that if we drew, ours - theirs</code> is zero and we
correctly get one point. If we add one to ours</code> we change from a draw to a
win, and points</code> becomes congruent with $2$ as desired. Symmetrically, if
we add one to theirs</code> we change from a draw to a loss, and points</code> once
again becomes congruent with $0$ as desired.</p>
Translated into code we find that our total score is</p>
1 </span>+ ours + </span>3 </span>* ((</span>1 </span>+ ours + (</span>3 </span>- theirs)) % </span>3</span>)
</span></code></pre>

Instead of ours - theirs</code> we do ours + (3 - theirs)</code> because Rust’s remainder
operator can unfortunately return negative remainders for positive divisors. One
could use
rem_euclid</code></a>
instead, but I feel bad for recommending it as that one is unfortunately defined
for negative divisors. I should write a blog post about this…</p>
</aside>
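As a quick sanity check of the closed form, here is a small Python sketch that compares it with the scoring rules written out directly. Python's `%` already returns a non-negative result for a positive divisor, so the `3 - theirs` workaround is not needed in this language. The helper names `score_closed_form` and `score_by_rules` are illustrative only.

```python
def score_closed_form(theirs, ours):
    # theirs and ours are 0 (Rock), 1 (Paper) or 2 (Scissors).
    return 1 + ours + 3 * ((1 + ours - theirs) % 3)

def score_by_rules(theirs, ours):
    shape = ours + 1
    if ours == theirs:
        return shape + 3              # draw
    if ours == (theirs + 1) % 3:
        return shape + 6              # k + 1 beats k, so we win
    return shape                      # loss

for theirs in range(3):
    for ours in range(3):
        assert score_closed_form(theirs, ours) == score_by_rules(theirs, ours)

# The example input from above: "A Y", "B X", "C Z".
example = [(0, 1), (1, 0), (2, 2)]
print(sum(score_closed_form(t, o) for t, o in example))  # 15
```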
A general solution</a></h2>
We found a neat closed form, but if we were even slightly less fortunate it might
not have existed. A more general method for solving similar problems would be nice.
In this particular instance that is possible. There are only $3 \times 3 = 9$
input pairs, so we can simply hardcode the answer for each situation:</p>
let</span> answers = HashMap::from([
</span>    (</span>"A X"</span>, </span>4</span>),
</span>    (</span>"A Y"</span>, </span>8</span>),
</span>    (</span>"A Z"</span>, </span>3</span>),
</span>    (</span>"B X"</span>, </span>1</span>),
</span>    (</span>"B Y"</span>, </span>5</span>),
</span>    (</span>"B Z"</span>, </span>9</span>),
</span>    (</span>"C X"</span>, </span>7</span>),
</span>    (</span>"C Y"</span>, </span>2</span>),
</span>    (</span>"C Z"</span>, </span>6</span>),
</span>]);
</span></code></pre>
Now we can simply get our answer using answers[input]</code>. This might feel like a
bit of a non-answer, but it is a legitimate technique. We have a mapping
of inputs to outputs, and sometimes the simplest or fastest (in either
programmer time or execution time) solution is to write it out explicitly and
completely rather than compute the answer at runtime with an algorithm.</p>
Perfect hash functions</a></h2>
The above solution works fine, but it pays a cost for its genericity. It uses
a full-fledged string hash algorithm, and lookups involve the full codepath for
hash table lookups (most notably hash collision resolution).</p>
We can drop the genericity for a significant boost in speed if we were to use a
perfect hash function</a>. A
perfect hash function is a specially constructed hash function on some set $S$
of values such that each value in the set maps to a different hash output,
without collisions. It is important to note that we only care about its behavior
for inputs in the set $S$, with a complete disregard for other inputs.</p>
A minimal</em> perfect hash function is one that also maps the inputs to a dense
range of integers $[0, 1, \dots, |S|-1]$. This can be very useful because you
can then directly use the hash function output to index a lookup table. This
effectively creates a hash table that maps set $S$ to anything you want.
However, strict minimality is not necessary for this as long as you are okay
with wasting some of the space in your lookup table.</p>
There are fully generic methods for constructing (minimal) perfect hash
functions, such as the “Hash, displace and compress”</em></a> algorithm by Belazzougui
et al., which is implemented in the phf crate</a>.
However, they tend to use lookup tables to construct the hash itself. For small
inputs where speed and size are absolutely critical I’ve had good success just
trying stuff</strong>. This might sound vague—because it is—so let me walk you
through some examples.</p>
Reading the input</a></h3>

This is where we leave the realm of reasonable solutions for the sake of
education and fun. For simplicity we’re not going to handle things such as
Windows-style newlines (\r\n</code>) or invalid inputs.</p>
</aside>
As a bit of a hack we can note that each line of our input from the Advent of
Code consists of exactly four bytes. One letter for our opponent’s choice, a
space, our choice, and a newline byte. So we can simply read our input as a
u32</code>, which simplifies the hash construction immensely compared to dealing
with strings.</p>
For example, consulting the ASCII table</a>
we find that A</code> has ASCII code 0x41</code>, space maps to 0x20</code>, X</code> has code
0x58</code> and the newline symbol has code 0x0a</code> so the input "A X\n"</code> can also
simply be viewed as the integer 0x0a582041</code> if you are on a
little-endian</a> machine. If you are
confused why 0x41</code> is in the last position remember that we humans write numbers with the
least significant digit on the right as a convention.</p>
Note that on a big-endian machine the order of bytes in a u32</code> is flipped, so
reading those four bytes into an integer would result in the value 0x4120580a</code>.
Calling u32::from_le_bytes</code> converts four bytes assumed to be little-endian to
the native integer representation by swapping the bytes on a big-endian machine
and doing nothing on a little-endian machine. Almost all modern CPUs are
little-endian however, so it’s generally a good idea to write your code such
that the little-endian path is fast and the big-endian path involves a
conversion step, if a conversion step can not be avoided.</p>
Doing this for all inputs gives us the following desired integer →
integer mapping:</p>
Input       LE u32      Answer
</span>-------------------------------
</span> A X       0xa582041       4
</span> A Y       0xa592041       8
</span> A Z       0xa5a2041       3
</span> B X       0xa582042       1
</span> B Y       0xa592042       5
</span> B Z       0xa5a2042       9
</span> C X       0xa582043       7
</span> C Y       0xa592043       2
</span> C Z       0xa5a2043       6
</span></code></pre>
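The table can be reproduced with a few lines of Python. This is a sketch in which `int.from_bytes` stands in for Rust's `u32::from_le_bytes`; the variable names are illustrative.

```python
# All nine possible input lines, in the same order as the table.
lines = [f"{theirs} {ours}\n".encode("ascii")
         for theirs in "ABC" for ours in "XYZ"]

# Interpret each 4-byte line as a little-endian 32-bit integer.
inputs = [int.from_bytes(line, "little") for line in lines]

for line, x in zip(lines, inputs):
    print(line[:3].decode(), hex(x))

# Interpreting the same bytes as big-endian reverses the byte order.
print(hex(int.from_bytes(b"A X\n", "big")))  # 0x4120580a
```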
Example constructions</a></h3>
When I said I just try stuff, I mean it. Let’s load our mapping into Python
and write a test:</p>
inputs =  [</span>0xa582041</span>, </span>0xa592041</span>, </span>0xa5a2041</span>, </span>0xa582042</span>,
</span>           </span>0xa592042</span>, </span>0xa5a2042</span>, </span>0xa582043</span>, </span>0xa592043</span>, </span>0xa5a2043</span>]
</span>answers = [</span>4</span>, </span>8</span>, </span>3</span>, </span>1</span>, </span>5</span>, </span>9</span>, </span>7</span>, </span>2</span>, </span>6</span>]
</span>
</span>def </span>is_phf(h, inputs):
</span>    </span>return </span>len({h(x) </span>for </span>x </span>in </span>inputs}) == len(inputs)
</span></code></pre>
There are nine inputs, so perhaps we get lucky and get a minimal perfect
hash function right away:</p>
>>> [x % </span>9 </span>for </span>x </span>in </span>inputs]
</span>[</span>0</span>, </span>7</span>, </span>5</span>, </span>1</span>, </span>8</span>, </span>6</span>, </span>2</span>, </span>0</span>, </span>7</span>]
</span></code></pre>
Alas, there are collisions. What if we don’t have to be absolutely minimal?</p>
>>> next(m </span>for </span>m </span>in </span>range(</span>9</span>, </span>2</span>**</span>32</span>)
</span>...      </span>if </span>is_phf(</span>lambda </span>x: x % m, inputs))
</span>12
</span>>>> [x % </span>12 </span>for </span>x </span>in </span>inputs]
</span>[</span>9</span>, </span>1</span>, </span>5</span>, </span>10</span>, </span>2</span>, </span>6</span>, </span>11</span>, </span>3</span>, </span>7</span>]
</span></code></pre>
That’s not too bad! Only three elements of wasted space. We can make our first
perfect hash table by placing the answers in the correct spots:</p>
def </span>make_lut(h, inputs, answers):
</span>    </span>assert </span>is_phf(h, inputs)
</span>    lut = [</span>0</span>] * (</span>1 </span>+ max(h(x) </span>for </span>x </span>in </span>inputs))
</span>    </span>for </span>(x, ans) </span>in </span>zip(inputs, answers):
</span>        lut[h(x)] = ans
</span>    </span>return </span>lut
</span></code></pre>
>>> make_lut(</span>lambda </span>x: x % </span>12</span>, inputs, answers)
</span>[</span>0</span>, </span>8</span>, </span>5</span>, </span>2</span>, </span>0</span>, </span>3</span>, </span>9</span>, </span>6</span>, </span>0</span>, </span>4</span>, </span>1</span>, </span>7</span>]
</span></code></pre>
Giving the simple mapping:</p>
const </span>LUT</span>: [</span>u8</span>; </span>12</span>] = [</span>0</span>, </span>8</span>, </span>5</span>, </span>2</span>, </span>0</span>, </span>3</span>, </span>9</span>, </span>6</span>, </span>0</span>, </span>4</span>, </span>1</span>, </span>7</span>];
</span>
</span>pub </span>fn </span>answer(x: </span>u32</span>) -> </span>u8 </span>{
</span>    </span>LUT</span>[(x % </span>12</span>) as </span>usize</span>]
</span>}
</span></code></pre>
Compressing the table</a></h4>
We stopped here on the first modulus that works, which is honestly fine in this
case because only three bytes of wasted space is pretty good. But what if we
didn’t get so lucky? We have to keep looking. Even though
modulo $m$ has $[0, m)$ as its
codomain</em></a>, when applied to our set of
inputs its image</em></a> might
span a smaller subset. Let’s inspect some:</p>
>>> [(m, max(x % m </span>for </span>x </span>in </span>inputs))
</span>...  </span>for </span>m </span>in </span>range(</span>1</span>, </span>30</span>)
</span>...  </span>if </span>is_phf(</span>lambda </span>x: x % m, inputs)]
</span>[(</span>12</span>, </span>11</span>), (</span>13</span>, </span>11</span>), (</span>19</span>, </span>18</span>), (</span>20</span>, </span>19</span>), (</span>21</span>, </span>17</span>), (</span>23</span>, </span>22</span>),
</span> (</span>24</span>, </span>19</span>), (</span>25</span>, </span>23</span>), (</span>26</span>, </span>21</span>), (</span>27</span>, </span>25</span>), (</span>28</span>, </span>19</span>), (</span>29</span>, </span>16</span>)]
</span></code></pre>
Unfortunately but also logically, there is an upwards trend of the maximum
index as you increase the modulus. But $13$ also seems promising, let’s take a look:</p>
>>> [x % </span>13 </span>for </span>x </span>in </span>inputs]
</span>[</span>3</span>, </span>6</span>, </span>9</span>, </span>4</span>, </span>7</span>, </span>10</span>, </span>5</span>, </span>8</span>, </span>11</span>]
</span>>>> make_lut(</span>lambda </span>x: x % </span>13</span>, inputs, answers)
</span>[</span>0</span>, </span>0</span>, </span>0</span>, </span>4</span>, </span>1</span>, </span>7</span>, </span>8</span>, </span>5</span>, </span>2</span>, </span>3</span>, </span>9</span>, </span>6</span>]
</span></code></pre>
Well, well, well, aren’t we lucky? The first three indices are unused, so we can
shift all the others back and get a minimal perfect hash function!</p>

Ironically this one would almost surely perform worse than the previous one
because Rust has to do a bounds check now whereas the previous version is
infallible, and it has an extra subtraction.</p>
</aside>
const </span>LUT</span>: [</span>u8</span>; </span>9</span>] = [</span>4</span>, </span>1</span>, </span>7</span>, </span>8</span>, </span>5</span>, </span>2</span>, </span>3</span>, </span>9</span>, </span>6</span>];
</span>
</span>pub </span>fn </span>answer(x: </span>u32</span>) -> </span>u8 </span>{
</span>    </span>LUT</span>[(x % </span>13 </span>- </span>3</span>) as </span>usize</span>]
</span>}
</span></code></pre>
In my experience with creating a bunch of similar mappings in the past, you’d
be surprised to see how often you get lucky</strong>, as long as your mapping isn’t too
large. As you add more ‘things to try’ to your toolbox, you also have more
opportunities of getting lucky.</p>
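The step of shifting the used indices back can be automated. The sketch below uses the same `inputs` and `answers` lists as before and a hypothetical helper named `make_offset_lut`. It subtracts the smallest hash value so that the table starts at index zero.

```python
inputs = [0xa582041, 0xa592041, 0xa5a2041, 0xa582042,
          0xa592042, 0xa5a2042, 0xa582043, 0xa592043, 0xa5a2043]
answers = [4, 8, 3, 1, 5, 9, 7, 2, 6]

def make_offset_lut(h, inputs, answers):
    """Return (offset, lut) such that lut[h(x) - offset] is the answer for x."""
    hashes = [h(x) for x in inputs]
    assert len(set(hashes)) == len(inputs), "h is not a perfect hash"
    offset = min(hashes)
    lut = [0] * (max(hashes) - offset + 1)
    for hx, ans in zip(hashes, answers):
        lut[hx - offset] = ans
    return offset, lut

offset, lut = make_offset_lut(lambda x: x % 13, inputs, answers)
print(offset, lut)
```

If the returned table has exactly `len(inputs)` entries, the hash is minimal once the offset is subtracted.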
Fixing near-misses</a></h4>
Another thing to try is fixing near-misses. For example, let’s take another look
at our original naive attempt:</p>
>>> [x % </span>9 </span>for </span>x </span>in </span>inputs]
</span>[</span>0</span>, </span>7</span>, </span>5</span>, </span>1</span>, </span>8</span>, </span>6</span>, </span>2</span>, </span>0</span>, </span>7</span>]
</span></code></pre>
Only the last two inputs give a collision. So a rather naive but possible way
to resolve these collisions is to move those to a different index:</p>
>>> [x % </span>9 </span>+ </span>3</span>*(x == </span>0xa592043</span>) - </span>3</span>*(x == </span>0xa5a2043</span>) </span>for </span>x </span>in </span>inputs]
</span>[</span>0</span>, </span>7</span>, </span>5</span>, </span>1</span>, </span>8</span>, </span>6</span>, </span>2</span>, </span>3</span>, </span>4</span>]
</span></code></pre>
Oh look, we got slightly lucky again: both are using the constant 3, which can be
factored out. It can be quite addictive to try out various permutations of
operations and tweaks to find these (minimal) perfect hash functions using as
few operations as possible.</p>
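Factoring out the constant gives a slightly shorter expression. Here is a small sketch that checks the factored form is still a minimal perfect hash function. The function name `h_fixed` is made up for this example.

```python
inputs = [0xa582041, 0xa592041, 0xa5a2041, 0xa582042,
          0xa592042, 0xa5a2042, 0xa582043, 0xa592043, 0xa5a2043]

def h_fixed(x):
    # Comparisons yield bools, which Python treats as 0 or 1 in arithmetic.
    return x % 9 + 3 * ((x == 0xa592043) - (x == 0xa5a2043))

hashes = [h_fixed(x) for x in inputs]
print(hashes)

# Minimal perfect hash: every index in 0..8 is hit exactly once.
assert sorted(hashes) == list(range(len(inputs)))
```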
An interlude: integer division</a></h2>
So far we’ve just been using the modulo operator to reduce our input domain to a
much smaller one. However, integer division/modulo is rather slow on most
processors. If we take a look at Agner Fog’s instruction
tables</a> we see that the 32-bit DIV</code>
instruction has a latency of 9-12 cycles on AMD Zen3 and 12 cycles on Intel Ice
Lake. However, we don’t need a fully generic division instruction, since our
divisor is constant here. Let’s take a quick look at what the compiler does for
mod 13:</p>
pub </span>fn </span>mod13(x: </span>u32</span>) -> </span>u32 </span>{
</span>    x % </span>13
</span>}
</span></code></pre>
example::mod13:
</span>        </span>mov     </span>eax, edi
</span>        </span>mov     </span>ecx, edi
</span>        </span>imul    </span>rcx, rcx, </span>1321528399
</span>        </span>shr     </span>rcx, </span>34
</span>        </span>lea     </span>edx, [rcx + </span>2</span>*rcx]
</span>        </span>lea     </span>ecx, [rcx + </span>4</span>*rdx]
</span>        </span>sub     </span>eax, ecx
</span>        </span>ret
</span></code></pre>
It translates the modulo operation into a multiplication with some shifts / adds / subtractions instead.
To see how that works let’s first consider the most magical part: the multiplication by $1321528399$ followed
by the right shift of $34$. That magical constant is actually $\lceil 2^{34} / 13 \rceil$ 
which means it’s computing</p>
$$q = \left\lfloor \frac{x \cdot \lceil 2^{34} / 13 \rceil}{2^{34}}\right\rfloor = \lfloor x / 13 \rfloor.$$</p>
To prove that is in fact correct we note that $2^{34} + 3$ is divisible by $13$ allowing us
to split the division in the correct result plus an error term:</p>
$$\frac{x \cdot \lceil 2^{34} / 13 \rceil}{2^{34}} = \frac{x \cdot (2^{34} + 3) / 13}{2^{34}} = x / 13 + \frac{3x}{13\cdot 2^{34}}.$$</p>
Then we inspect the error term and substitute $x = 2^{32}$ as an upper bound.
Since the fractional part of $x / 13$ is at most $12/13$, an error term below
$1/13$ never affects the result after flooring:</p>
$$\frac{3x}{13\cdot 2^{34}} \leq \frac{3 \cdot 2^{32}}{13\cdot 2^{34}} = \frac{3}{13 \cdot 4} < 1/13.$$</p>
For more context and references I would suggest “Integer division by constants:
optimal bounds”</em></a> by
Lemire et al.</p>
After computing $q = \lfloor x/13\rfloor$ it then computes the actual modulo we
want as $x - 13q$ using
the identity $$x \bmod m = x - \lfloor x / m \rfloor \cdot m.$$
It avoids the use of another (relatively) expensive integer multiplication by using
the lea</code> instruction which can compute a + k*b</code>, where k</code> can be a constant
1, 2, 4, or 8. This is how it computes $13q$:</p>

The LEA</code> instruction was originally intended for array index computations,
because arr[i]</code> is found at address arr_start + sizeof(T)*i</code>, and sizeof(T)</code>
is very often a small power of two.</p>
</aside>
Instruction                  Translation     Effect
</span>lea     edx, [rcx + 2*rcx]   t := q + 2*q    t = 3q
</span>lea     ecx, [rcx + 4*rdx]   o := q + 4*t    o = (q + 4*3q) = 13q
</span></code></pre>
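To convince ourselves that the compiler's trick is sound, here is a Python sketch that mirrors the assembly step by step. Python integers are arbitrary precision, so the 64-bit intermediate product needs no special handling. The function names `div13` and `mod13` are illustrative.

```python
MAGIC = 1321528399
assert MAGIC == -(-2**34 // 13)  # ceil(2**34 / 13)

def div13(x):
    # imul rcx, rcx, 1321528399 ; shr rcx, 34
    return (x * MAGIC) >> 34

def mod13(x):
    q = div13(x)
    t = q + 2 * q        # lea edx, [rcx + 2*rcx]   -> 3q
    o = q + 4 * t        # lea ecx, [rcx + 4*rdx]   -> 13q
    return x - o         # sub eax, ecx

# Edge cases of the u32 range plus values just below multiples of 13.
samples = [0, 1, 12, 13, 14, 2**31, 2**32 - 14, 2**32 - 1]
samples += [13 * k - 1 for k in range(1, 1000)]
for x in samples:
    assert div13(x) == x // 13
    assert mod13(x) == x % 13
print("ok")
```

The values just below a multiple of 13 are the interesting ones, since that is where an error term that is too large would first push the quotient over.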
Bit mixing</a></h2>
We have seen that choosing different moduli works, and that compilers implement
fixed-divisor modulo using multiplication. It is time to cut out the
middleman and go straight to the good stuff: integer multiplication.
We can get a better understanding of what integer multiplication actually does
by multiplying two integers in binary using the schoolbook method:</p>
4242 = 0b1000010010010
</span>4871 = 0b1001100000111 = 2^0 + 2^1 + 2^2 + 2^8 + 2^9 + 2^12
</span>
</span>          Binary                    Decimal
</span>               1000010010010   |   4242 * 2^0
</span>              1000010010010    |   4242 * 2^1
</span>             1000010010010     |   4242 * 2^2
</span>       1000010010010           |   4242 * 2^8
</span>      1000010010010            |   4242 * 2^9
</span>   1000010010010               |   4242 * 2^12
</span>-------------------------------|---------------- +
</span>   1001110110100100111111110   |   20662782
</span></code></pre>
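The schoolbook method is short enough to write out. This sketch, with the illustrative name `schoolbook_mul`, adds one shifted copy of `a` for every set bit of `b`, which is exactly what the rows above do.

```python
def schoolbook_mul(a, b):
    """Multiply non-negative integers by summing shifted copies of `a`."""
    total = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            total += a << shift   # one row of the schoolbook table
        shift += 1
    return total

print(schoolbook_mul(4242, 4871))  # 20662782
assert schoolbook_mul(4242, 4871) == 4242 * 4871
```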
There is a beautiful property here we can take advantage of: all of the upper
bits of the product $x \cdot c$ for some constant $c$ depend on most of the bits
of $x$. That is, for good choices of the constants $c$ and $s$, c*x >> s</code> will
give you a result that is wildly different even for small differences in $x$. It
is a strong bit mixer</em>.</p>
Hash functions like bit mixing functions, because they want to be unpredictable.
A good measure of unpredictability is found in the avalanche
effect</a>. For a true random oracle
changing one bit in the input should flip each bit of the output with 50% probability.
Thus having all your output bits depend on the input is a good property for
a hash function, as a random oracle is the ideal hash function.</p>
So, let’s just try something</strong>. We’ll stick with using modulo $2^k$
for maximum speed (as those can be computed with a binary AND instead of
needing multiplication), and try to find constants $c$
and $s$ that work. We want our codomain to have size $2^4 = 16$ since that’s the
smallest power of two bigger than $9$. We’ll use a $32 \times 32 \to 32$ bit multiply
since we only need 4 bits of output, and the top 4 bits of the multiplication
will depend sufficiently on most of the bits of the input. By doing a right-shift
of $28$ on a u32</code> result we also get our mod $2^4$ for free.</p>

If we needed more than four bits of
output, or we couldn’t find a constant that works, I would try a
$32 \times 32 \to 64$ bit multiply as this gives you more output bits to work with.</p>
</aside>
import </span>random
</span>random.seed(</span>42</span>)
</span>
</span>def </span>h(x, c):
</span>    m = (x * c) % </span>2</span>**</span>32
</span>    </span>return </span>m >> </span>28
</span>
</span>while </span>True</span>:
</span>    c = random.randrange(</span>2</span>**</span>32</span>)
</span>    </span>if </span>is_phf(</span>lambda </span>x: h(x, c), inputs):
</span>        print(hex(c))
</span>        </span>break
</span></code></pre>
It’s always a bit exciting to hit enter when doing a random search for a magic
constant, not knowing if you’ll get an answer or not. In this case it instantly
printed 0x46685257</code>. Since it was so fast there are likely many solutions, so
we can definitely be a bit greedier and see if we can get closer to a minimal
perfect hash function:</p>
best = float(</span>'inf'</span>)
</span>while </span>best >= len(inputs):
</span>    c = random.randrange(</span>2</span>**</span>32</span>)
</span>    max_idx = max(h(x, c) </span>for </span>x </span>in </span>inputs)
</span>    </span>if </span>max_idx < best and is_phf(</span>lambda </span>x: h(x, c), inputs):
</span>        print(max_idx, hex(c))
</span>        best = max_idx
</span></code></pre>
This quickly iterated through a couple of solutions before finding a constant that gives a minimal perfect hash function,
0xedc72f12</code>:</p>
>>> [h(x, </span>0xedc72f12</span>) </span>for </span>x </span>in </span>inputs]
</span>[</span>2</span>, </span>5</span>, </span>8</span>, </span>1</span>, </span>4</span>, </span>7</span>, </span>0</span>, </span>3</span>, </span>6</span>]
</span>>>> make_lut(</span>lambda </span>x: h(x, </span>0xedc72f12</span>), inputs, answers)
</span>[</span>7</span>, </span>1</span>, </span>4</span>, </span>2</span>, </span>5</span>, </span>8</span>, </span>6</span>, </span>9</span>, </span>3</span>]
</span></code></pre>
Ironically, if we want the optimal performance in safe Rust, we still need to
zero-pad the array to 16 elements so we can never go out-of-bounds. But if you
are absolutely certain</strong> there are no inputs other than the specified inputs,
and you wanted optimal speed, you could reduce your memory usage to 9 bytes with
unsafe Rust. Sticking with the safe code option we’ll get:</p>
const </span>LUT</span>: [</span>u8</span>; </span>16</span>] = [</span>7</span>, </span>1</span>, </span>4</span>, </span>2</span>, </span>5</span>, </span>8</span>, </span>6</span>, </span>9</span>, </span>3</span>,
</span>                       </span>0</span>, </span>0</span>, </span>0</span>, </span>0</span>, </span>0</span>, </span>0</span>, </span>0</span>];
</span>
</span>pub </span>fn </span>phf_lut(x: </span>u32</span>) -> </span>u8 </span>{
</span>    </span>LUT</span>[(x.wrapping_mul(</span>0xedc72f12</span>) >> </span>28</span>) as </span>usize</span>]
</span>}
</span></code></pre>
Inspecting the assembly code using the Compiler
Explorer</a> it is incredibly tight now:</p>
example::phf_lut:
</span>        </span>imul    </span>eax, dword ptr [rdi], -</span>305713390
</span>        </span>shr     </span>rax, </span>28
</span>        </span>lea     </span>rcx, [rip + .L__unnamed_1]
</span>        </span>movzx   </span>eax, byte ptr [rax + rcx]
</span>        </span>ret
</span>
</span>.L__unnamed_1:
</span>        .asciz  </span>"\007\001\004\002\005\b\006\t\003\000\000\000\000\000\000"
</span></code></pre>
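Before trusting a magic constant in Rust it is worth replaying the function in Python. In this sketch `wrapping_mul` on a `u32` is emulated by masking the product to 32 bits, and the Python function `phf_lut` is an illustrative stand-in for the Rust function of the same name.

```python
inputs = [0xa582041, 0xa592041, 0xa5a2041, 0xa582042,
          0xa592042, 0xa5a2042, 0xa582043, 0xa592043, 0xa5a2043]
answers = [4, 8, 3, 1, 5, 9, 7, 2, 6]

# Nine real entries, zero-padded to 16 so every 4-bit index is in bounds.
LUT = [7, 1, 4, 2, 5, 8, 6, 9, 3] + [0] * 7

def lut_index(x):
    # u32 wrapping multiplication: keep only the low 32 bits of the product.
    return ((x * 0xedc72f12) & 0xFFFFFFFF) >> 28

def phf_lut(x):
    return LUT[lut_index(x)]

assert [phf_lut(x) for x in inputs] == answers
# The index is the top 4 bits of a 32-bit value, so it is always below 16.
assert all(lut_index(x) < len(LUT) for x in (0, 1, 0x7FFFFFFF, 0xFFFFFFFF))
print("ok")
```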
The World’s Smallest Hash Table</a></h2>
You thought 9 bytes was the world’s smallest hash table? We’re only just getting
started! You see, it is actually possible to have a small lookup table without
accessing memory, by storing it in the code.</p>
Code ultimately has to be stored in memory as well, but it saves an
indirection.</aside>
A particularly effective method for storing a small lookup table with small
elements is to store it as a constant, indexed using shifts. For example, the lookup
table [1, 42, 17, 26][i]</code> could also be written as such:</p>
(</span>0b11010010001101010000001 </span>>> </span>6</span>*i) & </span>0b111111
</span></code></pre>
Each individual value fits in 6 bits, and we can easily fit $4\times 6 = 24$
bits in a u32</code>. In isolation this might not make sense over a normal lookup
table, but it can be combined with perfect hashing, and can be vectorized as
well.</p>
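Such a constant does not need to be assembled by hand. A minimal sketch, using the hypothetical helpers `pack` and `lookup`, that builds the constant for the example table and reads the values back:

```python
def pack(values, width):
    """Pack `values` into one integer, `width` bits each, first value lowest."""
    packed = 0
    for i, v in enumerate(values):
        assert 0 <= v < 2**width, "value does not fit in the given width"
        packed |= v << (width * i)
    return packed

def lookup(packed, width, i):
    return (packed >> (width * i)) & (2**width - 1)

table = [1, 42, 17, 26]
packed = pack(table, 6)
print(bin(packed))  # 0b11010010001101010000001
assert [lookup(packed, 6, i) for i in range(len(table))] == table
```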
Unfortunately we have 9 values that each require 5 bits, which doesn’t fit in
a u32</code>… or does it? You see, by co-designing the lookup table with the
perfect hash function we could theoretically overlap</em> the end of the bitstring
of one value with the start of another if we directly use the hash
function output as the shift amount.</p>
Update on 2023-03-05</strong>: As tinix0 rightfully points out on
reddit</a>,
our values only require 4 bits, not 5. I’ve made things unnecessarily harder for
myself by effectively prepending a zero bit to each value. That said, you would
still need overlapping for fitting $4 \times 9 = 36$ bits in a u32</code>.</p>

We could also just use a u64</code> to store the data,
but that’s boring and we’re trying to create the smallest
possible hash table here.</p>
</aside>
We are thus looking for two 32-bit constants c</code> and d</code> such that</p>
(d >> (x.wrapping_mul(c) >> </span>27</span>)) & </span>0b11111 </span>== answer(x)
</span></code></pre>
Note that the magic shift is now 32 - 5 = 27 because we want 5 bits of output to feed into the
second shift, as $2^5 = 32$.</p>
Luckily we don’t have to actually increase the search space, as we can construct
d</code> from c</code> by just placing the answer bits in the indicated shift positions.
Doing this we can also find out whether c</code> is valid or not by detecting conflicts
in whether a bit should be $0$ or $1$ for different inputs. Will we be lucky?</p>
def </span>build_bit_lut(h, w, inputs, answers):
</span>    zeros = ones = </span>0
</span>
</span>    </span>for </span>x, a </span>in </span>zip(inputs, answers):
</span>        shift = h(x)
</span>        </span>if </span>zeros & (a << shift) or ones & (~a % </span>2</span>**w << shift):
</span>            </span>return </span>None  </span># Conflicting bits.
</span>        zeros |= ~a % </span>2</span>**w << shift
</span>        ones |= a << shift
</span>    
</span>    </span>return </span>ones
</span>    
</span>def </span>h(x, c):
</span>    m = (x * c) % </span>2</span>**</span>32
</span>    </span>return </span>m >> </span>27
</span>
</span>random.seed(</span>42</span>)
</span>while </span>True</span>:
</span>    c = random.randrange(</span>2</span>**</span>32</span>)
</span>    lut = build_bit_lut(</span>lambda </span>x: h(x, c), </span>5</span>, inputs, answers)
</span>    </span>if </span>lut is not </span>None </span>and lut < </span>2</span>**</span>32</span>:
</span>        print(hex(c), hex(lut))
</span>        </span>break
</span></code></pre>
It takes a second or two, but we found a solution! </p>
pub </span>fn </span>phf_shift(x: </span>u32</span>) -> </span>u8 </span>{
</span>    </span>let</span> shift = x.wrapping_mul(</span>0xa463293e</span>) >> </span>27</span>;
</span>    ((</span>0x824a1847</span>u32 </span>>> shift) & </span>0b11111</span>) as </span>u8
</span>}
</span></code></pre>
example::phf_shift:
</span>        </span>imul    </span>ecx, dword ptr [rdi], -</span>1537005250
</span>        </span>shr     </span>ecx, </span>27
</span>        </span>mov     </span>eax, -</span>2109073337
</span>        </span>shr     </span>eax, cl
</span>        </span>and     </span>al, </span>31
</span>        </span>ret
</span></code></pre>
We have managed to replace a fully-fledged hash table
with one that is so small that it consists of 6
(vectorizable) assembly instructions without any
further data.</p>
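The same replay works for the shift-based version. This Python sketch mirrors the Rust function, again emulating `wrapping_mul` with a 32-bit mask, and checks all nine inputs. The helper `shift_of` is illustrative and only exists to make the shift amounts visible.

```python
inputs = [0xa582041, 0xa592041, 0xa5a2041, 0xa582042,
          0xa592042, 0xa5a2042, 0xa582043, 0xa592043, 0xa5a2043]
answers = [4, 8, 3, 1, 5, 9, 7, 2, 6]

def shift_of(x):
    # Top 5 bits of the wrapped 32-bit product.
    return ((x * 0xa463293e) & 0xFFFFFFFF) >> 27

def phf_shift(x):
    return (0x824a1847 >> shift_of(x)) & 0b11111

assert [phf_shift(x) for x in inputs] == answers
print(sorted(shift_of(x) for x in inputs))
```

The printed shift amounts show where each 5-bit window starts. Any two windows that start fewer than five bits apart share bits of the constant, which is the overlap that lets everything fit in a single `u32`.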
Conclusion</a></h2>
Wew, that was a wild ride. Was it worth it? Let’s
compare four hash-based versions on how long they take to process
ten million lines of random input and sum all answers.</p>


hashmap_str</code> processes the lines properly as newline
delimited strings, as in the general solution</a>.</p>
</li>

hashmap_u32</code> still uses a hashmap, but reads the lines and does lookups
using u32</code>s like the perfect hash functions do.</p>
</li>

phf_lut</code> is the earlier defined function that feeds a perfect
hash function into a lookup table.</p>
</li>

phf_shift</code> is our world’s smallest hash function.</p>
</li>
</ol>
The complete test code can be found here</a>. On my 2021 Apple M1 Macbook Pro
I get the following results with cargo run --release</code> on Rust 1.67.1:</p>
Algorithm</th> Time</th></tr></thead>

hashmap_str</code></td> 262.83 ms</td></tr>
hashmap_u32</code></td> 81.33 ms</td></tr>
phf_lut</code></td> 2.97 ms</td></tr>
phf_shift</code></td> 1.41 ms</td></tr>
</tbody></table>
So not only is it the smallest, it’s also the fastest, beating the original
string-based HashMap</code> solution by over 180 times. The reason phf_shift</code> is
two times faster than phf_lut</code> on this machine is because it can be fully
vectorized by the compiler whereas phf_lut</code> needs to do a lookup in memory
which is either impossible or relatively slow to do in vectorized code,
depending on which SIMD instructions you have available.</p>
Your results may vary, and you might need 
RUSTFLAGS='-C target-cpu=native'</code> for phf_shift</code> to autovectorize.</p>


Magical Fibonacci Formulae
Orson R. L. Peters — 2023-02-06T00:00:00+00:00
The following Python function computes the Fibonacci
sequence</a>, without loops,
recursion or floating point arithmetic:</p>
f=</span>lambda </span>n:(b:=</span>2</span><<n)**n*b//(b*b-b-</span>1</span>)%b
</span></code></pre>
It really does:</p>
>>> [f(n) </span>for </span>n </span>in </span>range(</span>10</span>)]
</span>[</span>0</span>, </span>1</span>, </span>1</span>, </span>2</span>, </span>3</span>, </span>5</span>, </span>8</span>, </span>13</span>, </span>21</span>, </span>34</span>]
</span></code></pre>
How does it work? As a teaser, look at the decimal expansions of $100 / 9899$
and $1000 / 998999$ and see if you notice anything:</p>
$$\frac{100}{9899} = 0.0101020305081321345590463…$$
$$\frac{1000}{998999} = 0.001001002003005008013021034055089144233377610988…$$</p>
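If you would rather reproduce these digits than take them on faith, schoolbook long division on integers is enough. The following is only a sketch; `decimal_expansion` is a hypothetical helper, and it assumes `den` is small enough that `10 * den` fits in a `u64`:

```rust
/// Returns `num / den` as a decimal string with `digits` digits after the
/// point, computed by long division (no floating point involved).
fn decimal_expansion(num: u64, den: u64, digits: usize) -> String {
    let mut out = format!("{}.", num / den);
    let mut rem = num % den;
    for _ in 0..digits {
        rem *= 10; // Bring down the next (zero) digit.
        out.push(char::from(b'0' + (rem / den) as u8));
        rem %= den;
    }
    out
}
```

For example, `decimal_expansion(100, 9899, 20)` returns `"0.01010203050813213455"`.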
Generating functions</a></h2>
We define the Fibonacci sequence as 
\begin{align*}
f(0) &= 0,\\
f(1) &= 1,\\
f(n) &= f(n - 1) + f(n - 2).
\end{align*}
However, we will also define a rather strange object known as a generating
function</a>. It is an
‘infinite polynomial’ (also known as a power series</a>)
in variable $x$ whose coefficients</em> correspond to our series of interest. This
gives us
$$F(x) = 0 + 1x + 1x^2 + 2x^3 + 3x^4 + 5x^5 + 8x^6 + \cdots,$$
$$F(x) = f(0)x^0 + f(1)x^1 + f(2)x^2 + \cdots = \sum_{n=0}^\infty f(n)x^n.$$</p>
Does this function converge? Is this really allowed? All interesting questions
we are going to ignore here. Instead, let’s just start doing interesting things
with our new object, like taking out the first two terms from the sum:</p>
 
There is an amazing free book called generatingfunctionology</a>.
You can read a lot more about generating functions there. It also dives into
the question about convergence.</p>
</aside>
$$F(x) = f(0)x^0 + f(1)x^1 + \sum_{n=2}^\infty f(n)x^n = x + \sum_{n=2}^\infty f(n)x^n.$$</p>
We can also substitute $f(n) = f(n - 1) + f(n - 2)$ now, because our iteration
variable starts at $2$:</p>
\begin{align*}
F(x) &= x + \sum_{n=2}^\infty \Big(f(n-1) + f(n-2) \Big)x^n\\
&= x + \sum_{n=2}^\infty f(n-1)x^n + \sum_{n=2}^\infty f(n-2)x^n.
\end{align*}
Now we can substitute the loop variables, re-insert the $f(0)$ term (which is just
zero) into the first sum, and factor out the extra $x$ terms:
\begin{align*}
F(x) &= x + \sum_{n=1}^\infty f(n)x^{n+1} + \sum_{n=0}^\infty f(n)x^{n+2}\\
&= x - f(0)x^{1} + \sum_{n=0}^\infty f(n)x^{n+1} + \sum_{n=0}^\infty f(n)x^{n+2}\\
&= x + x\sum_{n=0}^\infty f(n)x^{n} + x^2\sum_{n=0}^\infty f(n)x^{n}.
\end{align*}</p>
And now using the crucial observation that $F(x) = \sum_{n=0}^\infty f(n)x^{n}$:
\begin{align*}
F(x) &= x + xF(x) + x^2F(x),\\
(1 - x - x^2)F(x) &= x,\\
F(x) &= \frac{x}{1 - x - x^2}.
\end{align*}
Wow. Somehow the simple expression $x / (1 - x - x^2)$ ‘contains’ the entire
Fibonacci sequence. If you substitute $x = 10^{-3}$ in $F(x)$ you will retrieve
our earlier value
$$\frac{1000}{998999} = 0.001001002003005008013021034055089144233377610988…$$</p>
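One way to convince yourself that $x / (1 - x - x^2)$ really has the Fibonacci numbers as coefficients is to divide the two polynomials as power series. This is a sketch under the assumption that the denominator has constant term $1$; the name `series_quotient` is made up for illustration:

```rust
/// Computes the first `terms` power-series coefficients of `num / den`,
/// where both are given as coefficient lists (lowest degree first).
/// Assumes the constant term of `den` is 1.
fn series_quotient(num: &[i64], den: &[i64], terms: usize) -> Vec<i64> {
    assert_eq!(den.first(), Some(&1), "constant term of den must be 1");
    let mut coeffs: Vec<i64> = Vec::with_capacity(terms);
    for k in 0..terms {
        // From num = den * quotient: c_k = num_k - sum_{j >= 1} den_j * c_{k-j}.
        let mut c = num.get(k).copied().unwrap_or(0);
        for j in 1..=k.min(den.len() - 1) {
            c -= den[j] * coeffs[k - j];
        }
        coeffs.push(c);
    }
    coeffs
}
```

Calling `series_quotient(&[0, 1], &[1, -1, -1], 10)` (numerator $x$, denominator $1 - x - x^2$) yields `[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]`.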
An interlude by Binet</a></h2>
Solving $1 - x - x^2 = 0$ with the quadratic formula</a> gives us
roots $(-1 \pm \sqrt{5})/2$, which one might recognize as
the negative of the golden ratio</a>, $-\phi$, and the inverse
$\phi^{-1}$, where
$$\phi = \frac{\sqrt{5} + 1}{2}, \quad \phi^{-1} = \frac{2}{\sqrt{5} + 1} = \frac{2\left(\sqrt{5} - 1\right)}{\left(\sqrt{5} + 1\right)\left(\sqrt{5} - 1\right)} = \frac{\sqrt{5} - 1}{2}.$$</p>
This allows us to factor and
use  partial fraction decomposition</a>
on $F(x)$:
$$\frac{x}{1 - x - x^2} = \frac{x}{(1 - \phi x)(1 + \phi^{-1} x)} = \frac{1}{\phi + \phi^{-1}}\left(\frac{1}{1 - \phi x} - \frac{1}{1 + \phi^{-1} x}\right).$$
This is a rather arduous (but strictly elementary) algebraic process, so it is much easier
to verify the identity by expanding it than to follow the derivation.
To verify, use the fact that $\phi \cdot \phi^{-1} = \phi - \phi^{-1} = 1$.</p>
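For readers who want the check spelled out, here is one way to do the expansion, using only $\phi \cdot \phi^{-1} = 1$ and $\phi - \phi^{-1} = 1$:
\begin{align*}
(1 - \phi x)(1 + \phi^{-1} x) &= 1 - (\phi - \phi^{-1})x - \phi\phi^{-1}x^2 = 1 - x - x^2,\\
\frac{1}{1 - \phi x} - \frac{1}{1 + \phi^{-1} x} &= \frac{(1 + \phi^{-1} x) - (1 - \phi x)}{(1 - \phi x)(1 + \phi^{-1} x)} = \frac{(\phi + \phi^{-1})x}{1 - x - x^2}.
\end{align*}
Dividing the second line by $\phi + \phi^{-1}$ gives exactly the claimed decomposition of $x / (1 - x - x^2)$.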
If we recall the formula
for the geometric series</a>,</p>
$$\frac{1}{1 - r} = \sum_{n=0}^\infty r^n,$$</p>
and apply it to our above expression (once
again completely ignoring convergence) while
noticing that $\phi + \phi^{-1} = \sqrt{5}$ we find</p>
$$F(x) = \frac{1}{\sqrt{5}} \left( \sum_{n=0}^\infty \phi^n x^n - \sum_{n=0}^\infty {(-\phi})^{-n} x^n \right),$$
$$F(x) = \sum_{n=0}^\infty \frac{1}{\sqrt{5}} \left(  \phi^n - {(-\phi})^{-n} \right) x^n.$$</p>
And now for the true magic, recall that we defined $F(x) = \sum_{n=0}^\infty f(n)x^n$, and thus we conclude
$$f(n) = \frac{1}{\sqrt{5}}\left(\phi^n - {(-\phi})^{-n} \right).$$</p>

Note that $\phi^{-n}$ very quickly approaches 0, making $\phi^n / \sqrt{5}$
rounded to the nearest integer also correct. In turn this explains why the
ratio of consecutive Fibonacci numbers approaches the golden ratio.
</aside>
We have recovered Binet’s formula</a>
for the Fibonacci numbers, a closed form. Unfortunately evaluating it in
Python would eventually fail due to the use of floating-point numbers,
which is why this is only an interlude. But it is nevertheless cool:</p>
>>> phi = (</span>1 </span>+ </span>5</span>**</span>0.5</span>) / </span>2
</span>>>> [(phi**n - (-phi)**-n) / </span>5</span>**</span>0.5 </span>for </span>n </span>in </span>range(</span>10</span>)]
</span>[</span>0.0</span>, </span>1.0</span>, </span>1.0</span>, </span>2.0</span>, </span>3.0000000000000004</span>, </span>5.000000000000001</span>, </span>8.000000000000002</span>, </span>13.000000000000002</span>, </span>21.000000000000004</span>, </span>34.00000000000001</span>]
</span></code></pre>
Evaluating generating functions</a></h2>
Take another look at $F(10^{-3})$:</p>
$$\frac{1000}{998999} = 0.001001002003005008013021034055089144233377610988…$$</p>
Each next integer in the series starts three places shifted back from the previous
one. This makes sense, because $F(10^{-3})$ is the sum of $f(n)10^{-3n}$
for all $n$.</p>
This also means that eventually the method fails, when numbers outgrow
three digits and overflow into the previous one. If we ignore that for now,
we can study $10^{3n} F(10^{-3})$:</p>
n = 0                   0.001001002003005008013021034055...
</span>n = 1                   1.001002003005008013021034055089...
</span>n = 2                1001.002003005008013021034055089144...
</span>n = 3             1001002.003005008013021034055089144233...
</span>n = 4          1001002003.005008013021034055089144233377...
</span>n = 5       1001002003005.008013021034055089144233377610...
</span>n = 6    1001002003005008.013021034055089144233377610988...
</span>                      ^^^
</span></code></pre>
If we could extract just the highlighted column we would have our Fibonacci numbers.
Luckily, we can. By flooring we can remove the entire fractional part, and with
modulo $10^3$ we keep only the last three digits of what remains.</p>
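As a sanity check, the floor-then-modulo extraction can be done with plain integer division, since $\lfloor 10^{3n} \cdot 1000 / 998999 \rfloor$ is just an integer quotient. The sketch below uses a hypothetical function name and a `u128`, so it only reaches $n = 11$ before the scaled numerator overflows:

```rust
/// Extracts the three-digit column for index `n` from 1000/998999, i.e.
/// floor(10^(3n) * 1000 / 998999) mod 10^3, using integers only.
/// Returns `None` once the scaled numerator no longer fits in a u128.
fn fib_from_decimal_column(n: u32) -> Option<u128> {
    let exponent = 3u32.checked_mul(n)?;
    let scaled = 10u128.checked_pow(exponent)?.checked_mul(1000)?;
    // Integer division floors; `% 1000` keeps only the last three digits.
    Some(scaled / 998_999 % 1000)
}
```

For $n = 0, \dots, 11$ this returns $0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89$, and `None` from $n = 12$ onwards.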
We still have the overflow issue, however. But this is easily fixed by
choosing a number $k$ instead of $3$, large enough that the next Fibonacci
number doesn’t overflow into our number of interest. For example the choice
$k = n$ works, as the $n$th Fibonacci number certainly won’t overflow $n$ decimal
digits, giving the formula</p>
$$f(n) = \lfloor 10^{n^2} \cdot F(10^{-n}) \rfloor \bmod 10^{n}.$$</p>
We can generalize much more however. Our choice of $10^3$ and $10^n$ as bases
was rather arbitrary. You can use any base $b$, as long as $b$ is large enough. This gives:</p>

We abuse notation a bit for brevity here, in reality $b$ is a function of $n$.</p>
</aside>
$$f(n) = \left\lfloor b^n \cdot F(b^{-1}) \right\rfloor \bmod b,$$
$$f(n) = \left\lfloor b^n \cdot \frac{b^{-1}}{1 - b^{-1} - b^{-2}} \right\rfloor \bmod b,$$
$$f(n) = \left\lfloor \frac{b^{n+1}}{b^2 - b - 1} \right\rfloor \bmod b.$$</p>
This is actually a closed form we can evaluate without the use of floating
point arithmetic, as it simply consists of the division of two integers.
I have experimentally found that choosing $b = 3^n$ suffices to not overflow
for computing $f(n)$, giving our magical formula:</p>
>>> f = </span>lambda </span>n: </span>3</span>**(n*(n+</span>1</span>)) // (</span>3</span>**(</span>2</span>*n) - </span>3</span>**n - </span>1</span>) % </span>3</span>**n
</span>>>> [f(n) </span>for </span>n </span>in </span>range(</span>10</span>)]
</span>[</span>0</span>, </span>1</span>, </span>1</span>, </span>2</span>, </span>3</span>, </span>5</span>, </span>8</span>, </span>13</span>, </span>21</span>, </span>34</span>]
</span></code></pre>
Needless to say, none of this is actually a good idea. We're just having fun here.</aside>
We can golf</a> this quite a bit by using Python’s “walrus operator”</a>
(also called assignment expression</em> by boring people) introduced in Python 3.8. If you write (foo := bar)</code>
in an expression the parentheses take on value bar</code> as well as storing bar</code>
in a new variable foo</code> available in the rest of the expression. Finally as a
bit of flair and efficiency, ${b = 2^{n+1}}$ also works, which can be computed as
2 << n</code>:</p>
>>> f=</span>lambda </span>n:(b:=</span>2</span><<n)**n*b//(b*b-b-</span>1</span>)%b
</span>>>> [f(n) </span>for </span>n </span>in </span>range(</span>10</span>)]
</span>[</span>0</span>, </span>1</span>, </span>1</span>, </span>2</span>, </span>3</span>, </span>5</span>, </span>8</span>, </span>13</span>, </span>21</span>, </span>34</span>]
</span></code></pre>
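The same formula can be tried in a language without built-in big integers, as long as you accept a small range. The following Rust sketch (the name `fib_closed_form` is made up here) uses $b = 2^{n+1}$ and a `u128`, so it assumes $2^{(n+1)^2}$ fits in 128 bits, which limits it to $n \leq 10$:

```rust
/// Evaluates floor(b^(n+1) / (b^2 - b - 1)) mod b with b = 2^(n+1).
/// Returns `None` for n > 10, where b^(n+1) = 2^((n+1)^2) would no longer
/// fit in a u128.
fn fib_closed_form(n: u32) -> Option<u128> {
    if n > 10 {
        return None;
    }
    let b: u128 = 2u128 << n; // b = 2^(n+1)
    let numerator = b.checked_pow(n + 1)?;
    let denominator = b * b - b - 1;
    Some(numerator / denominator % b)
}
```

With these limits `fib_closed_form(10)` is `Some(55)`; for anything larger you would need an arbitrary-precision integer type, which is exactly what Python provides out of the box.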


Ordering Numbers, How Hard Can It Be?
Orson R. L. Peters — 2023-01-27T00:00:00+00:00

This article is not</strong> about deciding whether two floating
point numbers are ‘close enough’. There are plenty of resources on
this (often subjective) problem. We simply want to know if ${x \leq y.}$</p>
</aside>
Suppose that you are a programmer, and that you have two numbers. You want to
know which number, if any, is larger. Now, if both numbers have the same type,
the solution is trivial in almost any programming language. Usually there is
even a dedicated operator, <=</code>, for this operation. For example, in Python:</p>
>>> </span>"120" </span><= </span>"1132"
</span>False
</span></code></pre>
Comparing two numbers in Brainfuck is left as an exercise to the reader.</aside>
Oh. Well technically those are strings, not numbers, which typically are
lexicographically</em></a> sorted.
Then again, they are numbers, just represented using strings. This may seem silly
but it’s actually a common problem in user interfaces, e.g. a list of files.
This is why you want to zero-pad your numeric filenames (frame-00001.png</code>), or use lexicographically
order-preserving representations, such as ISO 8601</a>
for dates.</p>
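The zero-padding trick is easy to try out. A small sketch in Rust, where `plain_name` and `padded_name` are hypothetical helpers and the width of five digits is an arbitrary choice:

```rust
/// File name without padding: sorts incorrectly as a string.
fn plain_name(frame: u32) -> String {
    format!("frame-{}.png", frame)
}

/// File name zero-padded to five digits: string order matches numeric
/// order for all frame numbers below 100000.
fn padded_name(frame: u32) -> String {
    format!("frame-{:05}.png", frame)
}
```

Comparing as strings, `plain_name(120) > plain_name(1132)`, but `padded_name(120) < padded_name(1132)` as intended.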
But I digress, let’s assume our numbers really are represented using numeric
types. Then indeed it is easy, and <=</code> just works:</p>
>>> </span>120 </span><= </span>1132
</span>True
</span></code></pre>
Or does it?</p>
Mixed-type integer comparisons</a></h2>
What if the two numbers you are comparing do not have the same type? Your
first approach might be to just use <=</code> anyway, for example in C++:</p>
std::cout << ((-</span>1</span>) <= </span>1</span>u</span>) << </span>"</span>\n</span>"</span>;  </span>// Outputs 0.
</span></code></pre>
Oh. C++ automatically promoted -1</code> to an unsigned int</code> which caused
it to wrap around to the maximum value (which is obviously bigger than 1</code>).
Well at least a modern compiler will warn you by default, right?</p>
$ g++ -std=c++20 main.cpp && ./a.out
</span>0
</span></code></pre>
I'm interested to see a study on how many bugs have occurred due to
mixed-type integer comparisons. I would not be surprised to see a significant
amount of bugs, especially in C.</aside>
Great. Yet another reason you should not forget to turn on warnings (-Wall -pedantic</code>).
Let’s try Rust:</p>
let</span> x: </span>i32 </span>= -</span>1</span>; </span>let</span> y: </span>u32 </span>= </span>1</span>;
</span>println!(</span>"</span>{:?}</span>"</span>, x <= y);
</span></code></pre>
error[</span>E0308</span>]: mismatched types
</span> --> src/main.rs:</span>4</span>:</span>22
</span>  |
</span>4 </span>| println!(</span>"</span>{:?}</span>"</span>, x <= y);
</span>  |                      ^ expected `i32`, found `u32`
</span>  |
</span>help: you can convert a `</span>u32</span>` to an `</span>i32</span>` and panic </span>if</span> the
</span>converted value doesn't fit
</span></code></pre>
Well at least it doesn’t silently miscompile… The suggested solution is
horrible though. There’s no reason to panic at all. The most efficient solution
here is to promote</em> both values to a type that is a superset</em> of both. For
example:</p>
println!(</span>"</span>{:?}</span>"</span>, (x as </span>i64</span>) <= (y as </span>i64</span>)); </span>// Outputs true.
</span></code></pre>
But what if there’s no type that’s a superset of both? At least for integer
types this is not a problem. For example there is no type in Rust that is
a superset of i128</code> and u128</code>. But we do know that an i128</code> fits in a
u128</code> if it is non-negative, and if it is negative, it is always smaller:</p>
fn </span>less_eq(x: </span>i128</span>, y: </span>u128</span>) -> </span>bool </span>{
</span>    </span>if</span> x <= </span>0 </span>{ </span>true </span>} </span>else </span>{ x as </span>u128 </span><= y }
</span>}
</span></code></pre>
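The mirror image, with the unsigned value on the left, follows the same reasoning. This is a sketch with a made-up name (`less_eq_rev`), not something taken from a library:

```rust
/// x <= y for an unsigned x and a signed y.
fn less_eq_rev(x: u128, y: i128) -> bool {
    // A negative y is smaller than every u128; a non-negative y always
    // fits in a u128, so the cast is lossless.
    if y < 0 { false } else { x <= y as u128 }
}
```

For instance `less_eq_rev(0, -1)` is `false`, while `less_eq_rev(i128::MAX as u128, i128::MAX)` is `true`.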
All of this is quite error prone, and frankly annoying. Cross-integer type
comparisons are always cheap, so I don’t see a good reason why the compiler
doesn’t automatically generate the above code. For example on Apple ARM the
above for i64 <= u64</code> compiles to three instructions plus the return:</p>
example::less_eq:
</span>        cmp     x0, #1
</span>        ccmp    x0, x1, #0, ge
</span>        cset    w0, ls
</span>        ret
</span></code></pre>
We really</em> should automatically be doing the right thing here, instead
of pushing people to hand-written conversions that may or may not be correct, or worse,
silently generating wrong code. C++20 at least introduced new
integer comparison functions</a>
for cross-type integer comparisons, but the regular comparison operators are
still just as dangerous.</p>
Floating point numbers are exact</a></h2>
Before we dive into mixed-type floating point comparisons, we have to do a quick refresher on
what floating point is</em>. When I say floating-point, I mean the binary floating
point numbers defined in the IEEE 754</a>
standard, in particular binary32</code> (also known as f32</code> or float</code>), and binary64</code>
(also known as f64</code> or double</code>). The latest version of the standard has
DOI 10.1109/IEEESTD.2019.8766229</a>.</p>

Did you know that Sci-Hub</a> exists? It is
an important project removing the barriers and paywalls to human knowledge.
Usually if you have a DOI reference the document is only a couple clicks away!
(Whether this is legal depends on your jurisdiction.)</p>
</aside>
IEEE 754 refresher</a></h3>
This floating point format has various warts, but the reality is that the machine
you’re reading this on almost surely uses it as its native floating point 
implementation. In this format a number consists of one sign bit, $w$ exponent
bits and $t$ trailing</em> mantissa bits. For f32</code> we have $w = 8$ and $t = 23$,
for f64</code> we have $w = 11, t = 52$. There is also an exponent bias $b$, which is
$127$ for f32</code> and $1023$ for f64</code>, which is used to get negative exponents.</p>
To decode a floating point number (ignoring NaN</code>s and infinities), we first look at our
$1 + w + t$ bits and decode three unsigned binary integers $s$, $e$ and $m$:</p>

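[Figure: the bit layout, from most to least significant bit: one sign bit $s$, then $w$ exponent bits $e$, then $t$ trailing mantissa bits $m$.]</p>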
</div>
Then, if $e = 0$ the value of our number is
\begin{equation}f = {(-1)}^s \times 2^{e - b + 1} \times (0 + m / 2^t),\end{equation}
otherwise it is
\begin{equation}f = {(-1)}^s \times 2^{e - b} \times (1 + m / 2^t).\end{equation}
This is why they’re called trailing</em> mantissa bits: the first
digit of the mantissa is determined by the exponent. When the exponent field is
zero (before applying the bias), the mantissa starts with a $0$, otherwise it
starts with a $1$. When the mantissa starts with a $0$ we call it a subnormal</em></a>
number. They are important because they close the otherwise relatively large
gap between zero and the first floating point number.
A nice way to get a feeling for all this is by playing around with the Float Toy</a>
app.</p>
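If you prefer code over an app, the two decoding formulas translate almost literally. The sketch below is for `f32` only, ignores `NaN`s and infinities just like the text does, and returns an `f64` because every `f32` value is exactly representable there; `decode_f32` is a name invented for this example:

```rust
/// Decodes the bits of an f32 by hand, following the two formulas above.
/// NaNs and infinities (exponent field all ones) are not handled.
fn decode_f32(bits: u32) -> f64 {
    const W: u32 = 8; // exponent bits
    const T: u32 = 23; // trailing mantissa bits
    const BIAS: i32 = 127;

    let s = bits >> (W + T);
    let e = ((bits >> T) & ((1 << W) - 1)) as i32;
    let m = bits & ((1 << T) - 1);

    let sign = if s == 1 { -1.0 } else { 1.0 };
    let fraction = f64::from(m) / f64::from(1u32 << T);
    if e == 0 {
        // Subnormal: the mantissa starts with a 0.
        sign * 2f64.powi(e - BIAS + 1) * (0.0 + fraction)
    } else {
        // Normal: the mantissa starts with a 1.
        sign * 2f64.powi(e - BIAS) * (1.0 + fraction)
    }
}
```

You can compare the result against `f32::from_bits(bits) as f64` for any finite bit pattern, e.g. `decode_f32(0x3f800000)` is `1.0`.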
Imprecision</a></h3>
Now that we know all of the above, I want to explicitly state the following:
IEEE 754 floating point types define a set of exact numbers. There is no
ambiguity (except two representations for zero, $+0$ and $-0$), nor is there a
loss of precision, fuzziness, interval representations, etc. For example,
$1.0$ is represented exactly</strong> by the f32</code> with value 0x3f800000</code>, and the
next bigger f32</code> is 0x3f800001</code> with value
$${(-1)}^0 \times 2^0 \times (1 + 1 / 2^{23}) = 1.00000011920928955078125.$$</p>
For example in Rust:</p>
println!(</span>"</span>{}</span>"</span>, </span>f32</span>::from_bits(</span>0x3f800001</span>));
</span>1.0000001
</span></code></pre>
Oh. Did I lie? No, it is Rust who lies:</p>
let</span> full = format!(</span>"{:.1000}"</span>, </span>f32</span>::from_bits(</span>0x3f800001</span>));
</span>println!(</span>"</span>{}</span>"</span>, full.trim_end_matches(</span>'0'</span>));
</span>1.00000011920928955078125
</span></code></pre>
This is not a bug or an accident. Rust, like most programming languages,
prints only as many digits as are needed to guarantee that the round-trip</em>
is correct. And indeed:</p>
println!(</span>"0x</span>{:x}</span>"</span>, </span>"1.0000001"</span>.parse::<</span>f32</span>>().unwrap().to_bits());
</span>0x3f800001
</span></code></pre>
However, this has some nasty implications, if you then parse the number as a
more precise type:</p>
"1.0000001"</span>.parse::<</span>f64</span>>().unwrap() == </span>1.00000011920928955078125
</span>false
</span></code></pre>
Mind you that $1.00000011920928955078125$ is exactly representable as both
an f32</code> and f64</code> (because f32</code> is a strict subset of f64</code>), yet you lose</em>
precision by printing as an f32</code> and parsing as an f64</code>. The reason is that
while 1.0000001</code> is the shortest decimal number that rounds to 
$1.00000011920928955078125$ in the f32</code> floating point format, it rounds
to $$1.0000001000000000583867176828789524734020233154296875$$ instead in the f64</code> format.</p>
Ironically, in this
case it is more accurate to parse as an f32</code> and then convert to f64</code>, because
Rust guarantees the round-trip correctness:</p>
"1.0000001"</span>.parse::<</span>f32</span>>().unwrap() as </span>f64 </span>== </span>1.00000011920928955078125
</span>true
</span></code></pre>
So f32 -> f64</code> is lossless, as is f32 -> String -> f32 -> f64</code>. But
f32 -> String -> f64</code> loses precision.</p>
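The three paths can be put side by side in one small function. This is merely a sketch; `roundtrip_via_string` is a hypothetical name and it relies on nothing but `to_string` and `parse` from the standard library:

```rust
/// Prints an f32 to its shortest round-trip string and parses it back in
/// two ways: directly as f64, and as f32 followed by a lossless widening.
fn roundtrip_via_string(x: f32) -> (f64, f64) {
    let text = x.to_string();
    let direct: f64 = text.parse().unwrap();
    let via_f32: f64 = f64::from(text.parse::<f32>().unwrap());
    (direct, via_f32)
}
```

For `f32::from_bits(0x3f800001)` the second component equals `f64::from(x)` exactly, while the first does not.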
It is vital to understand the above, to be able to investigate and debug floating point
problems.</strong>
Programming languages will silently round a floating point number to the nearest
representable number when you write a number in your source code, silently
round it when parsing, and silently round it when printing. The way these
languages round differs, and it can even differ depending on the type in question.
At every step of the way you are potentially being lied to.</p>
Given how much silent rounding occurs, I do not blame you if you got the impression
that floating point is ‘fuzzy’. It provides the illusion of having a ‘real
number’ type. But in reality the underlying numbers are an exact</strong>, finite set of
numbers.</p>
Mixed-type floating point comparisons</a></h2>
Why do I place so much emphasis on the exactness of the IEEE 754 floating point
numbers? Because it means that (aside from NaN</code>s), the comparison of integers
and floats is also unambiguously well-defined. They are both, after all, exact
numbers placed on the real number line.</p>
Before reading on, I challenge you: try to write a correct</em> implementation of the
following function:</p>
/// x <= y
</span>fn </span>is_less_eq(x: </span>i64</span>, y: </span>f64</span>) -> </span>bool </span>{
</span>    todo!()
</span>}
</span></code></pre>
If you want to
try in Rust, I wrote a (non-exhaustive) set of tests on the Rust playground</a>
you can plug your implementation into, which might show you an input that fails.
If you want to try it in a different language, remember that the programming language might lie to you by default! For example:</p>
let</span> x: </span>i64 </span>= </span>1 </span><< </span>58</span>;
</span>let</span> y: </span>f64 </span>= x as </span>f64</span>; </span>// 2^58, exactly representable.
</span>println!(</span>"</span>{x}</span> <= </span>{y}</span>: </span>{}</span>"</span>, x as </span>f64 </span><= y);
</span>288230376151711744 </span><= </span>288230376151711740</span>: </span>true
</span></code></pre>
This may look like a bad comparison or conversion from i64</code> to f64</code>, but it isn’t. The problem
lies entirely in the rounding during formatting.</p>
The main difficulty lies in the fact that for many type combinations (e.g. i64</code> and f64</code>) there
does not exist a native type in the programming language that is a superset
of both. For example, $2^{1000}$ is exactly representable as an f64</code> but not i64</code>. And
$2^{53} + 1$ is exactly representable in i64</code> but not f64</code>. So we can’t simply
convert one to the other and be done with it, yet that is what many people do</strong>.
In fact, it’s so common ChatGPT has learned to do so:</p>

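[Figure: screenshot of a ChatGPT answer that implements the comparison by converting one type to the other.]</p>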
</div>

Asking ChatGPT to fix the bug with an explicit counterexample is
fruitless, it will blabber some nonsense about f64::EPSILON</code> and comparing
a difference to that.</p>
</aside>
Our above test framework shows that x as f64 <= y</code> fails because we find that</p>
9007199254740993 </span>as </span>f64 </span><= </span>9007199254740992.0
</span></code></pre>
which is obviously wrong. The problem is that 9007199254740993</code> (which is
$2^{53}+1$) is not representable as f64</code>, and gets rounded to 
$2^{53}$, after which the comparison succeeds.</p>
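To see the failure in isolation, here is the naive approach as a function of its own (the name `naive_less_eq` is only for illustration):

```rust
/// The tempting but WRONG implementation: converting the integer to f64
/// silently rounds it when it has more than 53 significant bits.
fn naive_less_eq(x: i64, y: f64) -> bool {
    x as f64 <= y
}
```

`naive_less_eq(9007199254740993, 9007199254740992.0)` returns `true`, even though the integer is one larger than the float, because `9007199254740993_i64 as f64` is exactly `9007199254740992.0`.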
The correct implementation for i64</code> <= f64</code></a></h3>
The trick for implementing $i \leq f$ correctly is to perform the operation in the
integer domain after rounding the floating point number down to the nearest
integer, as for integer $i$ we have
$$  i \leq f \iff i \leq \lfloor f \rfloor.$$
We need not worry that rounding a float up or down to the nearest integer goes wrong and
skips an integer, because for IEEE 754 the floor</code> / ceil</code> functions are exact. This
is because wherever on the number line IEEE 754 floats can still have a fractional part
(below $2^{52}$ in magnitude for binary64), every integer is also exactly representable.</p>
We still have to worry about our floating point value not fitting in our integer
type. Luckily when that happens our comparison is trivial. Unluckily, our integer
types have a different range in the negative and positive domains, so we still
have to be a bit careful, especially because we can not compare with $2^{63} - 1$
(the maximum i64</code> value) in the float domain.</p>
fn </span>is_less_eq(x: </span>i64</span>, y: </span>f64</span>) -> </span>bool </span>{
</span>    </span>if</span> y.is_nan() { </span>return </span>false</span>; }
</span>    </span>if</span> y >= </span>9223372036854775808.0 </span>{ </span>// 2^63
</span>        </span>true </span>// y is always bigger.
</span>    } </span>else if</span> y >= -</span>9223372036854775808.0 </span>{ </span>// -2^63
</span>        x <= y.floor() as </span>i64  </span>// y is in [-2^63, 2^63)
</span>    } </span>else </span>{
</span>        </span>false </span>// y is always smaller.
</span>    }
</span>}
</span></code></pre>
You might think you can get away without the floor</code> as we convert to integer
immediately after. Alas, as i64</code> rounds towards zero, but we need to round
towards negative infinity or else we will end up claiming -1 <= -1.5</code>.</p>
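The difference between the two roundings is easy to observe directly. A tiny sketch with two hypothetical helpers, valid for inputs that fit in an `i64`:

```rust
/// Conversion as done by a bare `as i64`: rounds towards zero.
fn truncated(y: f64) -> i64 {
    y as i64
}

/// Conversion with an explicit floor: rounds towards negative infinity.
fn floored(y: f64) -> i64 {
    y.floor() as i64
}
```

For positive inputs the two agree, but `truncated(-1.5)` is `-1` while `floored(-1.5)` is `-2`, and only the latter makes `-1 <= floored(-1.5)` come out as `false`.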
Generalizing</a></h3>
Ok, we can compare $i \leq f$. What about $i \geq f$? We can’t re-use the same
implementation by swapping the order of the arguments because their types are
different. We can however make a new implementation from scratch applying the same
principle, but we must use ceil</code> instead of floor</code>:</p>
$$  i \geq f \iff i \geq \lceil f \rceil.$$</p>
fn </span>is_greater_eq(x: </span>i64</span>, y: </span>f64</span>) -> </span>bool </span>{
</span>    </span>if</span> y.is_nan() { </span>return </span>false</span>; }
</span>    </span>if</span> y >= </span>9223372036854775808.0 </span>{ </span>// 2^63
</span>        </span>false </span>// y is always bigger.
</span>    } </span>else if</span> y >= -</span>9223372036854775808.0 </span>{ </span>// -2^63
</span>        x >= y.ceil() as </span>i64  </span>// y is in [-2^63, 2^63)
</span>    } </span>else </span>{
</span>        </span>true </span>// y is always smaller.
</span>    }
</span>}
</span></code></pre>
What if we want strict inequality? Now our floor</code>/ceil</code> trick introduces
problems surrounding equality. One way to solve this is with a separate check
for equality in the integer domain followed by inequality in the float domain:</p>
fn </span>is_less(x: </span>i64</span>, y: </span>f64</span>) -> </span>bool </span>{
</span>    </span>if</span> y.is_nan() { </span>return </span>false</span>; }
</span>    </span>if</span> y >= </span>9223372036854775808.0 </span>{ </span>// 2^63
</span>        </span>true
</span>    } </span>else if</span> y >= -</span>9223372036854775808.0 </span>{ </span>// -2^63
</span>        </span>let</span> yf = y.floor(); </span>// y is in [-2^63, 2^63)
</span>        </span>let</span> yfi = yf as </span>i64</span>;
</span>        x < yfi || x == yfi && yf < y
</span>    } </span>else </span>{
</span>        </span>false
</span>    }
</span>}
</span></code></pre>
You get the point. There might be a more clever and/or efficient way to do this,
but at least this works.</p>
Update on 2023-02-07</strong>: Pavel Mayorov contacted me with a suggestion for a more efficient
version of inequality. It works on the observation that for integer $i$ we have</p>
$$  i < f \iff i < \lceil f \rceil.$$</p>
That is, instead of flooring for $\leq$ we use ceiling for $<$.</p>
fn </span>is_less(x: </span>i64</span>, y: </span>f64</span>) -> </span>bool </span>{
</span>    </span>if</span> y.is_nan() { </span>return </span>false</span>; }
</span>    </span>if</span> y >= </span>9223372036854775808.0 </span>{ </span>// 2^63
</span>        </span>true
</span>    } </span>else if</span> y >= -</span>9223372036854775808.0 </span>{ </span>// -2^63
</span>        x < y.ceil() as </span>i64 </span>// y is in [-2^63, 2^63)
</span>    } </span>else </span>{
</span>        </span>false
</span>    }
</span>}
</span></code></pre>
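Equality can be handled in the same spirit. The following is my own sketch rather than something from the text above, with the hypothetical name `is_eq`: an integer can only equal a float that lies in the `i64` range and has no fractional part.

```rust
/// x == y, comparing an i64 with an f64 exactly.
fn is_eq(x: i64, y: f64) -> bool {
    // NaN fails both range checks, so it is never equal to anything.
    if y >= -9223372036854775808.0 && y < 9223372036854775808.0 {
        // y is in [-2^63, 2^63); if it is integral the cast is exact.
        y.trunc() == y && x == y as i64
    } else {
        false
    }
}
```

With this, `is_eq(9007199254740993, 9007199254740992.0)` is `false` whereas `is_eq(9007199254740992, 9007199254740992.0)` is `true`.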
Conclusion</a></h2>
So, ordering numbers, how hard can it be</em>? Pretty damn hard I would say, if
your language does not support it natively. From challenging a variety of people
to write a correct implementation of is_less_eq</code>, no one gets it right on their
first try. And that’s after already explicitly being told that the challenge is
to do it correctly for all inputs. I quote the Python standard library:
“comparison is pretty much a nightmare.”</a></p>
Of all the languages I looked at that have distinct integer and floating point
types, Python, Julia, Ruby and Go get this right. Good job! Some warn you or
disallow cross-type comparisons by default, but Kotlin for example will straight
up tell you that 9007199254740993 <= 9007199254740992.0</code> is true</code>.</p>
For Rust I’ve made the num-ord</code></a> crate for
now that allows you to compare any two built-in numeric types correctly. But I
would love to see it (and others) adopt an approach where this is done right
natively. Because if it isn’t people have to do it correctly themselves, which
they won’t.</p>