Add new pow implementation
The algorithm is exp(y * log(x)), where log(x) is computed with about
1.8*2^-66 relative error, returning the result in two doubles, and the
exp part uses the same algorithm (and lookup tables) as exp, but takes
the input as two doubles and a sign (to handle negative bases with odd
integer exponent).
There is separate code path when fma is not available but the worst
case error is about 0.67 ULP in both cases. The lookup table and
consts for log are 4224 bytes, the code is 1196 bytes. The non-nearest
rounding error is less than 1 ULP.
Improvements on Cortex-A72 compared to current glibc master:
latency: 1.8x
thruput: 2.5x
6 files changed