These rely on the hilarious BSR and BSF instructions on x64. ARMv8 has
an instruction that does something similar: count leading zeros.
Unfortunately, the x64 and ARM semantics differ in a few important ways:
- BSR and BSF, when given a zero input, set the zero flag and leave the
destination undefined. CLZ on a 64-bit register outputs 64.
- BSR and BSF return the index (starting from 0 at the LSB) of the most- and
least-significant 1 bit, respectively. CLZ does what it says: count leading
zeros, so for a nonzero 64-bit input BSR's result is 63 minus CLZ's.
Bottom line, there has to be some bit trickery somewhere to fit the two
instructions into the same interface. I kept the x64 implementations the
same and put all the nastiness in the ARM implementations to make them
match, since obviously the perf of the ARM implementations doesn't
matter yet.
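
For illustration only, here is a rough sketch of the kind of adjustment
involved, written in portable C rather than as the actual ARM code. The names
clz64, bit_scan_reverse and bit_scan_forward are hypothetical and not the ones
used in this change; clz64 is a software stand-in for the CLZ instruction. The
forward scan isolates the lowest set bit before counting, which is one possible
trick; the real implementation may do something else, such as reversing the
bits first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software stand-in for a 64-bit CLZ: counts leading zeros and
 * returns 64 for a zero input, matching the behavior described above. */
static unsigned clz64(uint64_t x) {
    unsigned n = 0;
    if (x == 0) {
        return 64;
    }
    while ((x & (UINT64_C(1) << 63)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* BSR-like interface (hypothetical name): writes the index of the
 * most-significant 1 bit to *index and returns true, or returns false
 * and leaves *index untouched when x is zero. */
static bool bit_scan_reverse(uint64_t x, unsigned *index) {
    if (x == 0) {
        return false;
    }
    *index = 63 - clz64(x);
    return true;
}

/* BSF-like interface (hypothetical name): x & (~x + 1) keeps only the
 * least-significant 1 bit, so the same 63 - CLZ conversion then yields
 * its index. */
static bool bit_scan_forward(uint64_t x, unsigned *index) {
    if (x == 0) {
        return false;
    }
    *index = 63 - clz64(x & (~x + 1));
    return true;
}
```

With an input of 0x18 (bits 3 and 4 set), the reverse scan in this sketch
reports index 4 and the forward scan reports index 3, and both report failure
for a zero input instead of relying on CLZ's 64.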