Dividing by multiplying
05 OCT 2024Today I was staring at some LLVM IR and AArch64 assembly that went a little like this:
for (int i = 0; i < foo.field.size(); ++i) { ... }
cleanup: // %preds: for.body, ...
%sdiv.i.i = sdiv exact i64 %ptr.sub.i.i, i64 120, ...
.BB34: ; %cleanup
ldp x8, x9, [x20, #72]
sub x9, x9, x8
asr x9, x9, 3
mul x9, x9, x23 ; !!!
That 120 comes from foo.field
being a std::vector<T>
where sizeof(T) == 120
. The assembly loads the start and end of the vector’s active capacity and does some pointer arithmetic to calculate size()
, where it has to divide by 120. It divides by 8 with a shift, and that mul
somehow performs division by 15.
A little manual dataflow analysis showed that x23
was initialized as such in the loop preheader and never def’d again:
mov x23, #-1229782938247303442 ; 0xeeeeeeeeeeeeeeee
; ... about ten instructions, for some reason
movk x23, #61167 ; 0xeeef
ARM’s movk
instruction means to move in a 16-bit immediate while keeping the rest of the bits unchanged. One can optionally shift the immediate before moving, but that didn’t happen here. In fact, because ARM instructions must fit in 32-bit words, it typically takes several movk
instructions to load a 64-bit immediate, like so:
mov x22, #14673
movk x22, #34235, lsl #16
movk x22, #36191, lsl #32
movk x22, #62601, lsl #48
I suppose some compression magic happens to let it load 0xeeeeeeeeeeeeeeee
in one go. Anyways, the final value of x23
is 0xeeeeeeeeeeeeeeef
. Turns out this is the multiplicative inverse of 15 in \(\mathbb{Z}_{2^{64}}\). Just for fun, here’s that multiplication carried out by hand against 480, yielding 32 as expected (in 16-bit, for brevity):
mtplr | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
mcand | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | |||||||
___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | |
carry | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | |
1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | ||||||
1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | |||||||
0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | ||||||||
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | |||||||||
___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___ | |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
This suggests that division by a compile-time constant actually never requires division. Say \(p\) is some prime radix (typically 2) in a number system with \(j\) bits, and \(n = p^k \ell\) with \(k \leq j\) and \(\gcd(p, \ell) = 1 \implies \gcd(p^j, \ell) = 1\). By counting the number of zero bits of \(n\) from the right, you can divide out all the powers of \(p\) with bitshifts and then multiply by \(\ell^{-1} \in \mathbb{Z}_{p^j}\), which is guaranteed to exist. LLVM does exactly this! They find multiplicative inverses with Newton’s method.
So yeah. If you ever want to know the inverse of \(n\) (broken up into 16-bit chunks of hex) in \(\mathbb{Z}_{2^{k}}\) for \(k \in \{8,16,32,64\}\) and more, just divide by \(n\) in some program and pass it to clang with -S -O3
.