Add new log2 implementation

The algorithm is similar to the one used in log, but there are more
operations (and more error) due to the 1/ln2 multiplier.

When the fma instruction is not available, a separate code path is
used to compute x/c - 1 precisely, for which the table size is doubled,
and to compute (x/c - 1)/ln2 precisely.

The worst-case error is 0.547 ULP (0.55 without fma), and the read-only
global data size is 1168 bytes (2192 without fma).  The error in
non-nearest rounding modes is less than 1 ULP.

Improvements on Cortex-A72 compared to current glibc master:
log latency: 2.04x
log throughput: 1.87x
7 files changed
tree: 6196f61c3386e50ad8257d6a1f21c90ef39dddb8