Extracting and Depositing Bits
Suppose you have a 64-bit word and wish to extract a couple bits from it.
For example you just performed a SWAR
algorithm and wish to extract the least significant bit of each byte in the u64
.
This is simple enough, you simply perform a binary AND with a mask of
the bits you wish to keep:
let out = word & 0x0101010101010101;
However, this still leaves the bits of interest spread throughout the 64-bit word. What if we also want to compress the 8 bits we wish to extract into a single byte? Or what if we want the inverse, spreading the 8 bits of a byte among the least significant bits of each byte in a 64-bit word?
PEXT
and PDEP
If you are using a modern x86-64 CPU, you are in luck. In the much underrated
BMI instruction
set there
are two very powerful instructions: PDEP
and PEXT
. They are inverses of each
other, PEXT
extracts bits, PDEP
deposits bits.
PEXT
takes in a word and a
mask, takes just those bits from the word where the mask has a 1 bit, and
compresses all selected bits to a contiguous output word. Simulated in Rust
this would be:
fn pext64(word: u64, mask: u64) -> u64 {
let mut out = 0;
let mut out_idx = 0;
for i in 0..64 {
let ith_mask_bit = (mask >> i) & 1;
let ith_word_bit = (word >> i) & 1;
if ith_mask_bit == 1 {
out |= ith_word_bit << out_idx;
out_idx += 1;
}
}
out
}
For example if you had the bitstring abcdefgh
and mask 10110001
you would
get output bitstring 0000acdh
.
PDEP
is exactly its inverse, it takes contiguous data bits as a word, and
a mask, and deposits the data bits one-by-one (starting at the least significant
bits) into those bits where the mask
has a 1 bit, leaving the rest as zeros:
fn pdep64(word: u64, mask: u64) -> u64 {
let mut out = 0;
let mut input_idx = 0;
for i in 0..64 {
let ith_mask_bit = (mask >> i) & 1;
if ith_mask_bit == 1 {
let next_word_bit = (word >> input_idx) & 1;
out |= next_word_bit << i;
input_idx += 1;
}
}
out
}
So if you had the bitstring abcdefgh
and mask 10100110
you would get output
e0f00gh0
(recall that we traditionally write bitstrings with the least
significant bit on the right).
These instructions are incredibly powerful and flexible, and the amazing thing is that these instructions only take a single cycle on modern Intel and AMD CPUs! However, they are not available in other instruction sets, so whenever you use them you will also likely need to write a cross-platform alternative.
Extracting bits with multiplication
While the following technique can’t replace all PEXT
cases, it can be quite
general. It is applicable when:
- The bit pattern you want to extract is static and known in advance.
- If you want to extract $k$ bits, there must at least be a $k-1$ gap between two bits of interest.
We compute the bit extraction by adding together many left-shifted copies of
our input word, such that we construct our desired bit pattern in the uppermost
bits. The trick is to then realize that w << i
is equivalent to w * (1 << i)
and thus the sum of many left-shifted copies is equivalent to a single
multiplication by (1 << i) + (1 << j) + ...
I think the technique is best understood by visual example. Let’s use our example from earlier, extracting the least significant bit of each byte in a 64-bit word. We start off by masking off just those bits. After that we shift the most significant bit of interest to the topmost bit of the word to get our first shifted copy. We then repeat this, shifting the second most significant bit of interest to the second topmost bit, etc. We sum all these shifted copies. This results in the following (using underscores instead of zeros for clarity):
mask = _______1_______1_______1_______1_______1_______1_______1_______1
t = w & mask
t = _______a_______b_______c_______d_______e_______f_______g_______h
t << 7 = a_______b_______c_______d_______e_______f_______g_______h_______
t << 14 = _b_______c_______d_______e_______f_______g_______h______________
t << 21 = __c_______d_______e_______f_______g_______h_____________________
t << 28 = ___d_______e_______f_______g_______h____________________________
t << 35 = ____e_______f_______g_______h___________________________________
t << 42 = _____f_______g_______h__________________________________________
t << 49 = ______g_______h_________________________________________________
t << 56 = _______h________________________________________________________
sum = abcdefghbcdefgh_cdefh___defgh___efgh____fgh_____gh______h_______
Note how we constructed abcdefgh
in the topmost 8 bits, which we can then
extract using a single right-shift by $64 - 8 = 56$ bits. Since
(1 << 7) + (1 << 14) + ... + (1 << 56) = 0x102040810204080
we get the
following implementation:
fn extract_lsb_bit_per_byte(w: u64) -> u8 {
let mask = 0x0101010101010101;
let sum_of_shifts = 0x102040810204080;
((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8
}
Not as good as PEXT
, but three arithmetic instructions is not bad at all.
Depositing bits with multiplication
Unfortunately the following technique is significantly less general than the previous one. While you can take inspiration from it to implement similar algorithms, as-is it is limited to just spreading the bits of one byte to the least significant bit of each byte in a 64-bit word.
The trick is similar to the one above. We add 8 shifted copies of
our byte which once again translates to a multiplication. By choosing a shift
that increases in multiples if 9 instead of 8 we ensure that the bit pattern
shifts over by one position in each byte. We then mask out our bits of interest,
and finish off with a shift and byteswap (which compiles to a single instruction bswap
on Intel or rev
on ARM) to put our output bits on the least significant bits
and reverse the order.
This technique visualized:
b = ________________________________________________________abcdefgh
b << 9 = _______________________________________________abcdefgh_________
b << 18 = ______________________________________abcdefgh__________________
b << 27 = _____________________________abcdefgh___________________________
b << 36 = ____________________abcdefgh____________________________________
b << 45 = ___________abcdefgh_____________________________________________
b << 54 = __abcdefgh______________________________________________________
b << 63 = h_______________________________________________________________
sum = h_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh
mask = 1_______1_______1_______1_______1_______1_______1_______1_______
s & msk = h_______g_______f_______e_______d_______c_______b_______a_______
We once again note that the sum of shifts can be precomputed as 1 + (1 << 9) + ... + (1 << 63) = 0x8040201008040201
, allowing the following implementation:
fn deposit_lsb_bit_per_byte(b: u8) -> u64 {
let sum_of_shifts = 0x8040201008040201;
let mask = 0x8080808080808080;
let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask;
u64::swap_bytes(spread >> 7)
}
This time it required 4 arithmetic instructions, not quite as good as PDEP
,
but again not bad compared to a naive implementation, and this is cross-platform.