Improve pow implementation

The log part of pow was rewritten to use a slightly different algorithm.
This improves precision and throughput while keeping the same table size.

Inputs near 1 are no longer special-cased, so there is a slight
performance regression for them.  When the fma instruction is not
available, this algorithm is expected to have slightly worse performance.

Worst-case error improved from 0.67 ULP to 0.57 ULP.
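
For readers unfamiliar with the unit, an error in ULP can be measured
roughly as sketched below.  This is an illustration only, not the harness
used to obtain the figures above: ulp_error is a hypothetical helper, it
assumes IEEE binary64 doubles and finite, normal values, and it needs a
reference value that is more precise than double (for example from powl
on targets where long double is wider than double, or from a
multi-precision library).

```c
#include <assert.h>
#include <math.h>

/* Error of `got` relative to the reference `want`, measured in units of
   the spacing between adjacent doubles in the binade containing `want`.
   Hypothetical helper: assumes binary64 doubles and a finite, normal,
   nonzero reference.  */
static double ulp_error(double got, long double want)
{
    assert(isfinite(got) && want != 0.0L);

    int e;
    frexp((double)want, &e);                 /* |want| is in [2^(e-1), 2^e) */
    long double ulp = ldexpl(1.0L, e - 53);  /* double spacing in that binade */

    return (double)(fabsl((long double)got - want) / ulp);
}
```

A correctly rounded result has an error of at most 0.5 ULP by this
measure, so 0.57 ULP means the result is never much further from the
exact value than the nearest double is.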

On Cortex-A72 I see:
throughput near 1:  7% worse
latency near 1:     2% worse
throughput general: 8% better
latency general:    2% better
3 files changed
tree: a39a631bc71a5786cda7746fceb7d04d5007ca24