Improve pow implementation
The log part of pow has been rewritten to use a slightly different algorithm.
This improves precision and throughput while keeping the same table size.
Inputs near 1 are no longer special-cased, so there is a slight performance
regression in that case. When the fma instruction is not available,
this algorithm is expected to perform slightly worse.
Worst-case error improved from 0.67 ULP to 0.57 ULP.
On Cortex-A72 I see:
throughput near 1: 7% worse
latency near 1: 2% worse
throughput general: 8% better
latency general: 2% better
3 files changed