Taming Floating-Point Sums
Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust:
    fn naive_sum(arr: &[f32]) -> f32 {
        let mut out = 0.0;
        for x in arr {
            out += *x;
        }
        out
    }
This however can easily result in an arbitrarily large accumulated error. Let’s try it out:
naive_sum(&vec![1.0; 1_000_000]) = 1000000.0
naive_sum(&vec![1.0; 10_000_000]) = 10000000.0
naive_sum(&vec![1.0; 100_000_000]) = 16777216.0
naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0
Uh-oh… What happened? When you compute $a + b$ the result must be rounded to the nearest representable floating-point number, breaking ties towards the number with an even mantissa. The problem is that the next 32-bit floating-point number after 16777216 is 16777218. In this case that means 16777216 + 1 rounds back to 16777216 again. We’re stuck.
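A quick standalone check of this behavior (the exact values follow from the ties-to-even rounding described above):

    fn main() {
        let x: f32 = 16777216.0; // 2^24: above this, f32 can only represent even integers
        assert_eq!(x + 1.0, x); // 16777217 is exactly halfway, ties to the even 16777216
        assert_eq!(x + 2.0, 16777218.0); // the next representable f32 after 2^24
    }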
Luckily, there are better ways to sum an array.
Pairwise summation
A method that’s a bit more clever is to use pairwise summation. Instead of a completely linear sum with a single accumulator it recursively sums an array by splitting the array in half, summing the halves, and then adding the sums.
    fn pairwise_sum(arr: &[f32]) -> f32 {
        if arr.len() == 0 { return 0.0; }
        if arr.len() == 1 { return arr[0]; }
        let (first, second) = arr.split_at(arr.len() / 2);
        pairwise_sum(first) + pairwise_sum(second)
    }
This is more accurate:
pairwise_sum(&vec![1.0; 1_000_000]) = 1000000.0
pairwise_sum(&vec![1.0; 10_000_000]) = 10000000.0
pairwise_sum(&vec![1.0; 100_000_000]) = 100000000.0
pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0
However, this is rather slow. To get a summation routine that goes as fast as possible while still being reasonably accurate we should not recurse down all the way to length-1 arrays, as this gives too much call overhead. We can still use our naive sum for small sizes, and only recurse on large sizes. This does make our worst-case error worse by a constant factor, but in turn makes the pairwise sum almost as fast as a naive sum.
    fn block_pairwise_sum(arr: &[f32]) -> f32 {
        if arr.len() > 256 {
            let split = (arr.len() / 2).next_multiple_of(256);
            let (first, second) = arr.split_at(split);
            block_pairwise_sum(first) + block_pairwise_sum(second)
        } else {
            naive_sum(arr)
        }
    }
Kahan summation
The worst-case round-off error of naive summation scales with $O(n \epsilon)$ when summing $n$ elements, where $\epsilon$ is the machine epsilon of your floating-point type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only the $\epsilon^2$ term which is negligible unless you sum a very large amount of numbers.
    pub fn kahan_sum(arr: &[f32]) -> f32 {
        let mut sum = 0.0;
        let mut c = 0.0; // running compensation for lost low-order bits
        for x in arr {
            let y = *x - c; // apply the correction to the incoming term
            let t = sum + y; // add; low-order bits of y may be lost here
            c = (t - sum) - y; // (t - sum) is what was actually added; c captures the loss (negated)
            sum = t;
        }
        sum
    }
The Kahan summation works by maintaining the sum in two registers, the actual bulk sum and a small error-correcting term $c$. If you were using infinitely precise arithmetic $c$ would always be zero, but with floating-point it might not be. The downside is that each number now takes four operations to add to the sum instead of just one.
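To see the compensation at work, here is one hand-worked step of the loop above on the earlier failure case; every value involved is exactly representable, so the asserts hold:

    fn main() {
        // One Kahan step: adding x = 1.0 when sum = 2^24 and c = 0.0.
        let (sum, c, x) = (16777216.0f32, 0.0f32, 1.0f32);
        let y = x - c; // 1.0
        let t = sum + y; // rounds back down to 16777216.0, as before
        let new_c = (t - sum) - y; // 0.0 - 1.0 = -1.0: the lost 1.0 is captured
        assert_eq!((t, new_c), (16777216.0, -1.0));
        // On the next step y = 1.0 - (-1.0) = 2.0, and 16777216.0 + 2.0
        // = 16777218.0 is exact, so the sum keeps advancing.
    }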
To mitigate this overhead we can do something similar to what we did with the pairwise summation. We can first accumulate blocks into sums naively before combining the block sums with Kahan summation to reduce overhead at the cost of accuracy:
    pub fn block_kahan_sum(arr: &[f32]) -> f32 {
        let mut sum = 0.0;
        let mut c = 0.0;
        for chunk in arr.chunks(256) {
            let x = naive_sum(chunk);
            let y = x - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum
    }
Exact summation
I know of at least two general methods to produce the correctly-rounded sum of a sequence of floating-point numbers. That is, they logically compute the sum with infinite precision before rounding it back to a floating-point value at the end.
The first method is based on the 2Sum primitive, which is an error-free transform from two numbers $x, y$ to $s, t$ such that $x + y = s + t$, where $t$ is a small error. By applying this repeatedly until the errors vanish you can get a correctly-rounded sum. Keeping track of what to add in what order can be tricky, and the worst case requires $O(n^2)$ additions to make all the terms vanish. This is what’s implemented in Python’s math.fsum and in the Rust crate fsum, which use extra memory to keep the partial sums around. The accurate crate also implements this using in-place mutation in i_fast_sum_in_place.
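For reference, a minimal sketch of the 2Sum primitive itself, in its classic branch-free form (this is the textbook algorithm, not code from any of the crates mentioned):

    // 2Sum: returns (s, t) such that s = fl(x + y) and x + y = s + t exactly.
    fn two_sum(x: f32, y: f32) -> (f32, f32) {
        let s = x + y;
        let x_approx = s - y; // the part of x that survived the rounded addition
        let y_approx = s - x_approx; // likewise for y
        let x_err = x - x_approx; // rounding error attributable to x
        let y_err = y - y_approx; // rounding error attributable to y
        (s, x_err + y_err)
    }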
Another method is to keep a large buffer of integers around, one per exponent. Then when adding a floating-point number you decompose it into an exponent and mantissa, and add the mantissa to the corresponding integer in the buffer. If the integer buf[i] overflows you increment the integer in buf[i + w], where w is the width of your integer.

This can actually compute a completely exact sum, without any rounding at all, and is effectively just an overly permissive representation of a fixed-point number optimized for accumulating floats. This latter method is $O(n)$ time, but uses a large but constant amount of memory ($\approx$ 1 KB for f32, $\approx$ 16 KB for f64). An advantage of this method is that it’s also an online algorithm: both adding a number to the sum and getting the current total are amortized $O(1)$.
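A minimal sketch of this bucket idea, assuming i64 buckets (2 KB of state, so a bit more than the 1 KB figure above, which implies narrower buckets) and omitting both the carry into buf[i + w] and any special handling of infinities and NaNs:

    struct ExactAccumulator {
        buf: [i64; 256], // one bucket per possible f32 exponent
    }

    impl ExactAccumulator {
        fn new() -> Self {
            Self { buf: [0; 256] }
        }

        fn add(&mut self, x: f32) {
            let bits = x.to_bits();
            let exp = ((bits >> 23) & 0xff) as usize;
            let mut mant = (bits & 0x7f_ffff) as i64;
            if exp != 0 {
                mant |= 1 << 23; // restore the implicit leading bit of normal floats
            }
            if bits >> 31 != 0 {
                mant = -mant;
            }
            // Mantissas are below 2^24, so an i64 bucket absorbs roughly 2^39
            // additions before a real implementation would need to carry
            // into a higher bucket.
            self.buf[exp] += mant;
        }
    }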
A variant of this method is implemented in the accurate crate as OnlineExactSum, which uses floats instead of integers for the buffer.
Unleashing the compiler
Besides accuracy, there is another problem with naive_sum. The Rust compiler is not allowed to reorder floating-point additions, because floating-point addition is not associative. So it cannot autovectorize naive_sum to use SIMD instructions to compute the sum, nor use instruction-level parallelism.
To solve this there are compiler intrinsics in Rust that do float sums while allowing associativity, such as std::intrinsics::fadd_fast. However, these intrinsics are incredibly dangerous, as they assume that both the input and output are finite numbers (no infinities, no NaNs), and are otherwise undefined behavior. This functionally makes them unusable, as only in the most restricted scenarios when computing a sum do you know that all inputs are finite numbers, and that their sum cannot overflow.
I recently uttered my annoyance with these operators to Ben Kimock, and together we proposed (and he implemented) a new set of operators: std::intrinsics::fadd_algebraic and friends.
I proposed we call the operators algebraic, as they allow (in theory) any transformation that is justified by real algebra. For example, substituting $x - x \to 0$, $cx + cy \to c(x + y)$, or $x^6 \to (x^2)^3$. In general these operators are treated as-if they are done using real numbers, and can map to any set of floating-point instructions that would be equivalent to the original expression, assuming the floating-point instructions would be exact.
Using those new instructions it is trivial to generate an autovectorized sum:
    #![allow(internal_features)]
    #![feature(core_intrinsics)]
    use std::intrinsics::fadd_algebraic;

    fn naive_sum_autovec(arr: &[f32]) -> f32 {
        let mut out = 0.0;
        for x in arr {
            out = fadd_algebraic(out, *x);
        }
        out
    }
If we compile with -C target-cpu=broadwell we see that the compiler automatically generated the following tight loop for us, using 4 accumulators and AVX2 instructions:
.LBB0_5:
vaddps ymm0, ymm0, ymmword ptr [rdi + 4*r8]
vaddps ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32]
vaddps ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64]
vaddps ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96]
add r8, 32
cmp rdx, r8
jne .LBB0_5
This will process 128 bytes of floating-point data (so 32 elements) in 7 instructions. Additionally, all the vaddps instructions are independent of each other as they accumulate to different registers. If we analyze this with uiCA we see that it estimates the above loop to take 4 cycles to complete, processing 32 bytes / cycle. At 4 GHz that’s up to 128 GB/s! Note that that’s way above my machine’s RAM bandwidth, so you will only achieve that speed when summing data that is already in cache.
With this in mind we can also easily define block_pairwise_sum_autovec and block_kahan_sum_autovec by replacing their calls to naive_sum with naive_sum_autovec.
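For example, the Kahan variant is a one-line change to block_kahan_sum from earlier:

    fn block_kahan_sum_autovec(arr: &[f32]) -> f32 {
        let mut sum = 0.0;
        let mut c = 0.0;
        for chunk in arr.chunks(256) {
            let x = naive_sum_autovec(chunk); // the only change from block_kahan_sum
            let y = x - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum
    }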
Accuracy and speed
Let’s take a look at how the different summation methods compare. As a relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from -100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on my AMD Threadripper 2950x.
All the code is available on GitHub. Compiled with RUSTFLAGS=-C target-cpu=native and --release I get the following results:
Algorithm | Throughput | Mean absolute error
---|---|---
naive | 5.5 GB/s | 71.796
pairwise | 0.9 GB/s | 1.5528
kahan | 1.4 GB/s | 0.2229
block_pairwise | 5.8 GB/s | 3.8597
block_kahan | 5.9 GB/s | 4.2184
naive_autovec | 118.6 GB/s | 14.538
block_pairwise_autovec | 71.7 GB/s | 1.6132
block_kahan_autovec | 98.0 GB/s | 1.2306
crate_accurate_buffer | 1.1 GB/s | 0.0015
crate_accurate_inplace | 1.9 GB/s | 0.0015
crate_fsum | 1.2 GB/s | 0.0000
First I’d like to note that there’s more than a 100x performance difference between the fastest and slowest method. For summing an array! Now this might not be entirely fair as the slowest methods are computing something significantly harder, but there’s still a 20x performance difference between a seemingly reasonable naive implementation and the fastest one.
We find that in general the _autovec methods that use fadd_algebraic are faster and more accurate than the ones using regular floating-point addition. The reason they’re also more accurate is the same reason a pairwise sum is more accurate: any reordering of the additions is better, as the default long chain of additions is already the worst case for accuracy in a sum.
Limiting ourselves to Pareto-optimal choices we get the following four implementations:
Algorithm | Throughput | Mean absolute error
---|---|---
naive_autovec | 118.6 GB/s | 14.538
block_kahan_autovec | 98.0 GB/s | 1.2306
crate_accurate_inplace | 1.9 GB/s | 0.0015
crate_fsum | 1.2 GB/s | 0.0000
Note that implementation differences can be quite impactful, and there are likely dozens more methods of compensated summing I did not compare here.
For most cases I think block_kahan_autovec wins here, having good accuracy (that doesn’t degenerate with larger inputs) at nearly the maximum speed. For most applications the extra accuracy from the correctly-rounded sums is unnecessary, and they are 50-100x slower.
By splitting the loop up into an explicit remainder plus a tight loop of 256-element sums we can squeeze out a bit more performance, and avoid a couple floating-point ops for the last chunk:
    #![allow(internal_features)]
    #![feature(core_intrinsics)]
    use std::intrinsics::fadd_algebraic;

    fn sum_block(arr: &[f32]) -> f32 {
        arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y))
    }

    pub fn sum_orlp(arr: &[f32]) -> f32 {
        let mut chunks = arr.chunks_exact(256);
        let mut sum = 0.0;
        let mut c = 0.0;
        for chunk in &mut chunks {
            let y = sum_block(chunk) - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        // One final compensated add for the remainder (fewer than 256 elements).
        sum + (sum_block(chunks.remainder()) - c)
    }
Algorithm | Throughput | Mean absolute error
---|---|---
sum_orlp | 112.2 GB/s | 1.2306
You can of course tweak the number 256, I found that using 128 was $\approx$ 20% slower, and that 512 didn’t really improve performance but did cost accuracy.
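If you want to experiment with the block size yourself, it can be lifted into a const generic; this is a hypothetical variant (sum_orlp is then just the N = 256 instantiation, reusing sum_block from above):

    pub fn sum_orlp_generic<const N: usize>(arr: &[f32]) -> f32 {
        let mut chunks = arr.chunks_exact(N);
        let mut sum = 0.0;
        let mut c = 0.0;
        for chunk in &mut chunks {
            let y = sum_block(chunk) - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum + (sum_block(chunks.remainder()) - c)
    }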
Conclusion
I think the fadd_algebraic and similar algebraic intrinsics are very useful for achieving high-speed floating-point routines, and that other languages should add them as well. A global -ffast-math is not good enough, as we’ve seen above: the best implementation was a hybrid between automatically optimized math for speed, and manually implemented non-associative compensated operations.
Finally, if you are using LLVM, beware of -ffast-math. It is undefined behavior to produce a NaN or infinity while that flag is set in LLVM. I have no idea why they chose this hardcore stance which makes virtually every program that uses it unsound. If you are targeting LLVM with your language, avoid the nnan and ninf fast-math flags.