Add new log2 implementation
The algorithm is similar to the one used in log, but there are more
operations (and more error) due to the 1/ln2 multiplier.
There is a separate code path for when the fma instruction is not
available: it computes x/c - 1 precisely, for which the table size is
doubled, and computes (x/c - 1)/ln2 precisely.
The worst-case error is 0.547 ULP (0.55 without fma), and the
read-only global data size is 1168 bytes (2192 without fma). In
non-nearest rounding modes the error is less than 1 ULP.
Improvements on Cortex-A72 compared to current glibc master:
log latency: 2.04x
log throughput: 1.87x
7 files changed