Taming Floating-Point Sums

Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust:

fn naive_sum(arr: &[f32]) -> f32 {
    let mut out = 0.0;
    for x in arr {
        out += *x;
    }
    out
}

This however can easily result in an arbitrarily large accumulated error. Let’s try it out:

naive_sum(&vec![1.0;     1_000_000]) =  1000000.0
naive_sum(&vec![1.0;    10_000_000]) = 10000000.0
naive_sum(&vec![1.0;   100_000_000]) = 16777216.0
naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0

Uh-oh… What happened? When you compute $a + b$ the result must be rounded to the nearest representable floating-point number, breaking ties towards the number with an even mantissa. The problem is that the next 32-bit floating-point number after 16777216 is 16777218. In this case that means 16777216 + 1 rounds back to 16777216 again. We’re stuck.

Luckily, there are better ways to sum an array.

Pairwise summation

A method that’s a bit more clever is to use pairwise summation. Instead of a completely linear sum with a single accumulator it recursively sums an array by splitting the array in half, summing the halves, and then adding the sums.

fn pairwise_sum(arr: &[f32]) -> f32 {
    if arr.len() == 0 { return 0.0; }
    if arr.len() == 1 { return arr[0]; }
    let (first, second) = arr.split_at(arr.len() / 2);
    pairwise_sum(first) + pairwise_sum(second)
}

This is more accurate:

pairwise_sum(&vec![1.0;     1_000_000]) =    1000000.0
pairwise_sum(&vec![1.0;    10_000_000]) =   10000000.0
pairwise_sum(&vec![1.0;   100_000_000]) =  100000000.0
pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0

However, this is rather slow. To get a summation routine that goes as fast as possible while still being reasonably accurate we should not recurse down all the way to length-1 arrays, as this gives too much call overhead. We can still use our naive sum for small sizes, and only recurse on large sizes. This does make our worst-case error worse by a constant factor, but in turn makes the pairwise sum almost as fast as a naive sum.

fn block_pairwise_sum(arr: &[f32]) -> f32 {
    if arr.len() > 256 {
        let split = (arr.len() / 2).next_multiple_of(256);
        let (first, second) = arr.split_at(split);
        block_pairwise_sum(first) + block_pairwise_sum(second)
    } else {
        naive_sum(arr)
    }
}

Kahan summation

The worst-case round-off error of naive summation scales with $O(n \epsilon)$ when summing $n$ elements, where $\epsilon$ is the machine epsilon of your floating-point type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only the $\epsilon^2$ term which is negligible unless you sum a very large amount of numbers.

pub fn kahan_sum(arr: &[f32]) -> f32 {
    let mut sum = 0.0;
    let mut c = 0.0;
    for x in arr {
        let y = *x - c;
        let t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    sum
}

The Kahan summation works by maintaining the sum in two registers, the actual bulk sum and a small error correcting term $c$. If you were using infinitely precise arithmetic $c$ would always be zero, but with floating-point it might not be. The downside is that each number now takes four operations to add to the sum instead of just one.

To mitigate this we can do something similar to what we did with the pairwise summation. We can first accumulate blocks into sums naively before combining the block sums with Kaham summation to reduce overhead at the cost of accuracy:

pub fn block_kahan_sum(arr: &[f32]) -> f32 {
    let mut sum = 0.0;
    let mut c = 0.0;
    for chunk in arr.chunks(256) {
        let x = naive_sum(chunk);
        let y = x - c;
        let t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    sum
}

Exact summation

I know of at least two general methods to produce the correctly-rounded sum of a sequence of floating-point numbers. That is, it logically computes the sum with infinite precision before rounding it back to a floating-point value at the end.

The first method is based on the 2Sum primitive which is an error-free transform from two numbers $x, y$ to $s, t$ such that $x + y = s + t$, where $t$ is a small error. By applying this repeatedly until the errors vanish you can get a correctly-rounded sum. Keeping track of what to add in what order can be tricky, and the worst-case requires $O(n^2)$ additions to make all the terms vanish. This is what’s implemented in Python’s math.fsum and in the Rust crate fsum which use extra memory to keep the partial sums around. The accurate crate also implements this using in-place mutation in i_fast_sum_in_place.

Another method is to keep a large buffer of integers around, one per exponent. Then when adding a floating-point number you decompose it into a an exponent and mantissa, and add the mantissa to the corresponding integer in the buffer. If the integer buf[i] overflows you increment the integer in buf[i + w], where w is the width of your integer.

This can actually compute a completely exact sum, without any rounding at all, and is effectively just an overly permissive representation of a fixed-point number optimized for accumulating floats. This latter method is $O(n)$ time, but uses a large but constant amount of memory ($\approx$ 1 KB for f32, $\approx$ 16 KB for f64). An advantage of this method is that it’s also an online algorithm - both adding a number to the sum and getting the current total are amortized $O(1)$.

A variant of this method is implemented in the accurate crate as OnlineExactSum crate which uses floats instead of integers for the buffer.

Unleashing the compiler

Besides accuracy, there is another problem with naive_sum. The Rust compiler is not allowed to reorder floating-point additions, because floating-point addition is not associative. So it cannot autovectorize the naive_sum to use SIMD instructions to compute the sum, nor use instruction-level parallelism.

To solve this there are compiler intrinsics in Rust that do float sums while allowing associativity, such as std::intrinsics::fadd_fast. However, these instructions are incredibly dangerous, as they assume that both the input and output are finite numbers (no infinities, no NaNs), or otherwise they are undefined behavior. This functionally makes them unusable, as only in the most restricted scenarios when computing a sum do you know that all inputs are finite numbers, and that their sum cannot overflow.

I recently uttered my annoyance with these operators to Ben Kimock, and together we proposed (and he implemented) a new set of operators: std::intrinsics::fadd_algebraic and friends.

I proposed we call the operators algebraic, as they allow (in theory) any transformation that is justified by real algebra. For example, substituting ${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3.}$ In general these operators are treated as-if they are done using real numbers, and can map to any set of floating-point instructions that would be equivalent to the original expression, assuming the floating-point instructions would be exact.

Using those new instructions it is trivial to generate an autovectorized sum:

#![allow(internal_features)]
#![feature(core_intrinsics)]
use std::intrinsics::fadd_algebraic;

fn naive_sum_autovec(arr: &[f32]) -> f32 {
    let mut out = 0.0;
    for x in arr {
        out = fadd_algebraic(out, *x);
    }
    out
}

If we compile with -C target-cpu=broadwell we see that the compiler automatically generated the following tight loop for us, using 4 accumulators and AVX2 instructions:

.LBB0_5:
    vaddps  ymm0, ymm0, ymmword ptr [rdi + 4*r8]
    vaddps  ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32]
    vaddps  ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64]
    vaddps  ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96]
    add     r8, 32
    cmp     rdx, r8
    jne     .LBB0_5

This will process 128 bytes of floating-point data (so 32 elements) in 7 instructions. Additionally, all the vaddps instructions are independent of each other as they accumulate to different registers. If we analyze this with uiCA we see that it estimates the above loop to take 4 cycles to complete, processing 32 bytes / cycle. At 4GHz that’s up to 128GB/s! Note that that’s way above what my machine’s RAM bandwidth is, so you will only achieve that speed when summing data that is already in cache.

With this in mind we can also easily define block_pairwise_sum_autovec and block_kahan_sum_autovec by replacing their calls to naive_sum with naive_sum_autovec.

Accuracy and speed

Let’s take a look at how the different summation methods compare. As a relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from -100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on my AMD Threadripper 2950x.

All the code is available on Github. Compiled with RUSTFLAGS=-C target-cpu=native and --release I get the following results:

AlgorithmThroughputMean absolute error
naive5.5 GB/s71.796
pairwise0.9 GB/s1.5528
kahan1.4 GB/s0.2229
block_pairwise5.8 GB/s3.8597
block_kahan5.9 GB/s4.2184
naive_autovec118.6 GB/s14.538
block_pairwise_autovec71.7 GB/s1.6132
block_kahan_autovec98.0 GB/s1.2306
crate_accurate_buffer1.1 GB/s0.0015
crate_accurate_inplace1.9 GB/s0.0015
crate_fsum1.2 GB/s0.0000

First I’d like to note that there’s more than a 100x performance difference between the fastest and slowest method. For summing an array! Now this might not be entirely fair as the slowest methods are computing something significantly harder, but there’s still a 20x performance difference between a seemingly reasonable naive implementation and the fastest one.

We find that in general the _autovec methods that use fadd_algebraic are faster and more accurate than the ones using regular floating-point addition. The reason they’re more accurate as well is the same reason a pairwise sum is more accurate: any reordering of the additions is better as the default long-chain-of-additions is already the worst case for accuracy in a sum.

Limiting ourselves to Pareto-optimal choices we get the following four implementations:

AlgorithmThroughputMean absolute error
naive_autovec118.6 GB/s14.538
block_kahan_autovec98.0 GB/s1.2306
crate_accurate_inplace1.9 GB/s0.0015
crate_fsum1.2 GB/s0.0000

Note that implementation differences can be quite impactful, and there are likely dozens more methods of compensated summing I did not compare here.

For most cases I think block_kahan_autovec wins here, having good accuracy (that doesn’t degenerate with larger inputs) at nearly the maximum speed. For most applications the extra accuracy from the correctly-rounded sums is unnecessary, and they are 50-100x slower.

By splitting the loop up into an explicit remainder plus a tight loop of 256-element sums we can squeeze out a bit more performance, and avoid a couple floating-point ops for the last chunk:

#![allow(internal_features)]
#![feature(core_intrinsics)]
use std::intrinsics::fadd_algebraic;

fn sum_block(arr: &[f32]) -> f32 {
    arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y))
}

pub fn sum_orlp(arr: &[f32]) -> f32 {
    let mut chunks = arr.chunks_exact(256);
    let mut sum = 0.0;
    let mut c = 0.0;
    for chunk in &mut chunks {
        let y = sum_block(chunk) - c;
        let t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    sum + (sum_block(chunks.remainder()) - c)
}
AlgorithmThroughputMean absolute error
sum_orlp112.2 GB/s1.2306

You can of course tweak the number 256, I found that using 128 was $\approx$ 20% slower, and that 512 didn’t really improve performance but did cost accuracy.

Conclusion

I think the fadd_algebraic and similar algebraic intrinsics are very useful for achieving high-speed floating-point routines, and that other languages should add them as well. A global -ffast-math is not good enough, as we’ve seen above the best implementation was a hybrid between automatically optimized math for speed, and manually implemented non-associative compensated operations.

Finally, if you are using LLVM, beware of -ffast-math. It is undefined behavior to produce a NaN or infinity while that flag is set in LLVM. I have no idea why they chose this hardcore stance which makes virtually every program that uses it unsound. If you are targetting LLVM with your language, avoid the nnan and ninf fast-math flags.