[AArch64] Optimized memcmp

Patch written by Wilco Dijkstra submitted for review to newlib:
https://sourceware.org/ml/newlib/2017/msg00524.html

This is an optimized memcmp for AArch64.  This is a complete rewrite
using a different algorithm.  The previous version split into cases
where both inputs were aligned, the inputs were mutually aligned and
unaligned using a byte loop.  The new version combines all these cases,
while small inputs of less than 8 bytes are handled separately.

This allows the main code to be sped up using unaligned loads since
there are now at least 8 bytes to be compared.  After the first 8 bytes,
align the first input.  This ensures each iteration does at most one
unaligned access and mutually aligned inputs behave as aligned.
After the main loop, process the last 8 bytes using unaligned accesses.

This improves performance of (mutually) aligned cases by 25% and
unaligned by >500% (yes >6 times faster) on large inputs.

2017-06-28  Wilco Dijkstra  <wdijkstr@arm.com>

        * bionic/libc/arch-arm64/generic/bionic/memcmp.S (memcmp):
                Rewrite of optimized memcmp.

GLIBC benchtests/bench-memcmp.c performance comparison for Cortex-A53:

Length    1, alignment  1/ 1:        153%
Length    1, alignment  1/ 1:        119%
Length    1, alignment  1/ 1:        154%
Length    2, alignment  2/ 2:        121%
Length    2, alignment  2/ 2:        140%
Length    2, alignment  2/ 2:        121%
Length    3, alignment  3/ 3:        105%
Length    3, alignment  3/ 3:        105%
Length    3, alignment  3/ 3:        105%
Length    4, alignment  4/ 4:        155%
Length    4, alignment  4/ 4:        154%
Length    4, alignment  4/ 4:        161%
Length    5, alignment  5/ 5:        173%
Length    5, alignment  5/ 5:        173%
Length    5, alignment  5/ 5:        173%
Length    6, alignment  6/ 6:        145%
Length    6, alignment  6/ 6:        145%
Length    6, alignment  6/ 6:        145%
Length    7, alignment  7/ 7:        125%
Length    7, alignment  7/ 7:        125%
Length    7, alignment  7/ 7:        125%
Length    8, alignment  8/ 8:        111%
Length    8, alignment  8/ 8:        130%
Length    8, alignment  8/ 8:        124%
Length    9, alignment  9/ 9:        160%
Length    9, alignment  9/ 9:        160%
Length    9, alignment  9/ 9:        150%
Length   10, alignment 10/10:        170%
Length   10, alignment 10/10:        137%
Length   10, alignment 10/10:        150%
Length   11, alignment 11/11:        160%
Length   11, alignment 11/11:        160%
Length   11, alignment 11/11:        160%
Length   12, alignment 12/12:        146%
Length   12, alignment 12/12:        168%
Length   12, alignment 12/12:        156%
Length   13, alignment 13/13:        167%
Length   13, alignment 13/13:        167%
Length   13, alignment 13/13:        173%
Length   14, alignment 14/14:        167%
Length   14, alignment 14/14:        168%
Length   14, alignment 14/14:        168%
Length   15, alignment 15/15:        168%
Length   15, alignment 15/15:        173%
Length   15, alignment 15/15:        173%
Length    1, alignment  0/ 0:        134%
Length    1, alignment  0/ 0:        127%
Length    1, alignment  0/ 0:        119%
Length    2, alignment  0/ 0:        94%
Length    2, alignment  0/ 0:        94%
Length    2, alignment  0/ 0:        106%
Length    3, alignment  0/ 0:        82%
Length    3, alignment  0/ 0:        87%
Length    3, alignment  0/ 0:        82%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        122%
Length    5, alignment  0/ 0:        127%
Length    5, alignment  0/ 0:        119%
Length    5, alignment  0/ 0:        127%
Length    6, alignment  0/ 0:        103%
Length    6, alignment  0/ 0:        100%
Length    6, alignment  0/ 0:        100%
Length    7, alignment  0/ 0:        82%
Length    7, alignment  0/ 0:        91%
Length    7, alignment  0/ 0:        87%
Length    8, alignment  0/ 0:        111%
Length    8, alignment  0/ 0:        124%
Length    8, alignment  0/ 0:        124%
Length    9, alignment  0/ 0:        136%
Length    9, alignment  0/ 0:        136%
Length    9, alignment  0/ 0:        136%
Length   10, alignment  0/ 0:        136%
Length   10, alignment  0/ 0:        135%
Length   10, alignment  0/ 0:        136%
Length   11, alignment  0/ 0:        136%
Length   11, alignment  0/ 0:        136%
Length   11, alignment  0/ 0:        135%
Length   12, alignment  0/ 0:        136%
Length   12, alignment  0/ 0:        136%
Length   12, alignment  0/ 0:        136%
Length   13, alignment  0/ 0:        135%
Length   13, alignment  0/ 0:        136%
Length   13, alignment  0/ 0:        136%
Length   14, alignment  0/ 0:        136%
Length   14, alignment  0/ 0:        136%
Length   14, alignment  0/ 0:        136%
Length   15, alignment  0/ 0:        136%
Length   15, alignment  0/ 0:        136%
Length   15, alignment  0/ 0:        136%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        115%
Length   32, alignment  0/ 0:        127%
Length   32, alignment  7/ 2:        395%
Length   32, alignment  0/ 0:        127%
Length   32, alignment  0/ 0:        127%
Length    8, alignment  0/ 0:        111%
Length    8, alignment  0/ 0:        124%
Length    8, alignment  0/ 0:        124%
Length   64, alignment  0/ 0:        128%
Length   64, alignment  6/ 4:        475%
Length   64, alignment  0/ 0:        131%
Length   64, alignment  0/ 0:        134%
Length   16, alignment  0/ 0:        128%
Length   16, alignment  0/ 0:        119%
Length   16, alignment  0/ 0:        128%
Length  128, alignment  0/ 0:        129%
Length  128, alignment  5/ 6:        475%
Length  128, alignment  0/ 0:        130%
Length  128, alignment  0/ 0:        129%
Length   32, alignment  0/ 0:        126%
Length   32, alignment  0/ 0:        126%
Length   32, alignment  0/ 0:        126%
Length  256, alignment  0/ 0:        127%
Length  256, alignment  4/ 8:        545%
Length  256, alignment  0/ 0:        126%
Length  256, alignment  0/ 0:        128%
Length   64, alignment  0/ 0:        171%
Length   64, alignment  0/ 0:        171%
Length   64, alignment  0/ 0:        174%
Length  512, alignment  0/ 0:        126%
Length  512, alignment  3/10:        585%
Length  512, alignment  0/ 0:        126%
Length  512, alignment  0/ 0:        127%
Length  128, alignment  0/ 0:        129%
Length  128, alignment  0/ 0:        128%
Length  128, alignment  0/ 0:        129%
Length 1024, alignment  0/ 0:        125%
Length 1024, alignment  2/12:        611%
Length 1024, alignment  0/ 0:        126%
Length 1024, alignment  0/ 0:        126%
Length  256, alignment  0/ 0:        128%
Length  256, alignment  0/ 0:        127%
Length  256, alignment  0/ 0:        128%
Length 2048, alignment  0/ 0:        125%
Length 2048, alignment  1/14:        625%
Length 2048, alignment  0/ 0:        125%
Length 2048, alignment  0/ 0:        125%
Length  512, alignment  0/ 0:        126%
Length  512, alignment  0/ 0:        127%
Length  512, alignment  0/ 0:        127%
Length 4096, alignment  0/ 0:        125%
Length 4096, alignment  0/16:        125%
Length 4096, alignment  0/ 0:        125%
Length 4096, alignment  0/ 0:        125%
Length 1024, alignment  0/ 0:        126%
Length 1024, alignment  0/ 0:        126%
Length 1024, alignment  0/ 0:        126%
Length 8192, alignment  0/ 0:        125%
Length 8192, alignment 63/18:        636%
Length 8192, alignment  0/ 0:        125%
Length 8192, alignment  0/ 0:        125%
Length   16, alignment  1/ 2:        317%
Length   16, alignment  1/ 2:        317%
Length   16, alignment  1/ 2:        317%
Length   32, alignment  2/ 4:        395%
Length   32, alignment  2/ 4:        395%
Length   32, alignment  2/ 4:        398%
Length   64, alignment  3/ 6:        475%
Length   64, alignment  3/ 6:        475%
Length   64, alignment  3/ 6:        477%
Length  128, alignment  4/ 8:        479%
Length  128, alignment  4/ 8:        479%
Length  128, alignment  4/ 8:        479%
Length  256, alignment  5/10:        543%
Length  256, alignment  5/10:        539%
Length  256, alignment  5/10:        543%
Length  512, alignment  6/12:        585%
Length  512, alignment  6/12:        585%
Length  512, alignment  6/12:        585%
Length 1024, alignment  7/14:        611%
Length 1024, alignment  7/14:        611%
Length 1024, alignment  7/14:        611%

The performance measured on the bionic-benchmarks on a hikey
board with a new benchmark for unaligned memcmp submitted for
review at https://android-review.googlesource.com/414860

The base is with the libc from /system/lib64. The bionic libc
with this patch is in /data.

hikey:/data # export LD_LIBRARY_PATH=/system/lib64
hikey:/data # ./bionic-benchmarks --benchmark_filter=BM_string_memcmp
Run on (8 X 2.4 MHz CPU s)
Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      30 ns         30 ns   22955680    251.07MB/s
BM_string_memcmp/64                     57 ns         57 ns   12349184   1076.99MB/s
BM_string_memcmp/512                   305 ns        305 ns    2297163   1.56496GB/s
BM_string_memcmp/1024                  571 ns        571 ns    1225211   1.66912GB/s
BM_string_memcmp/8k                   4307 ns       4306 ns     162562   1.77177GB/s
BM_string_memcmp/16k                  8676 ns       8675 ns      80676   1.75887GB/s
BM_string_memcmp/32k                 19233 ns      19230 ns      36394   1.58695GB/s
BM_string_memcmp/64k                 36986 ns      36984 ns      18952   1.65029GB/s
BM_string_memcmp_aligned/8             199 ns        199 ns    3519166   38.3336MB/s
BM_string_memcmp_aligned/64            386 ns        386 ns    1810734   158.073MB/s
BM_string_memcmp_aligned/512          1735 ns       1734 ns     403981   281.525MB/s
BM_string_memcmp_aligned/1024         3200 ns       3200 ns     218838   305.151MB/s
BM_string_memcmp_aligned/8k          25084 ns      25080 ns      28180   311.507MB/s
BM_string_memcmp_aligned/16k         51730 ns      51729 ns      13521   302.057MB/s
BM_string_memcmp_aligned/32k        103228 ns     103228 ns       6782   302.727MB/s
BM_string_memcmp_aligned/64k        207117 ns     207087 ns       3450   301.806MB/s
BM_string_memcmp_unaligned/8           339 ns        339 ns    2070998   22.5302MB/s
BM_string_memcmp_unaligned/64         1392 ns       1392 ns     502796   43.8454MB/s
BM_string_memcmp_unaligned/512        9194 ns       9194 ns      76133   53.1104MB/s
BM_string_memcmp_unaligned/1024      18325 ns      18323 ns      38206   53.2963MB/s
BM_string_memcmp_unaligned/8k       148579 ns     148574 ns       4713   52.5831MB/s
BM_string_memcmp_unaligned/16k      298169 ns     298120 ns       2344   52.4118MB/s
BM_string_memcmp_unaligned/32k      598813 ns     598797 ns       1085    52.188MB/s
BM_string_memcmp_unaligned/64k     1196079 ns    1196083 ns        540   52.2539MB/s

hikey:/data # export LD_LIBRARY_PATH=/data
hikey:/data # ./bionic-benchmarks --benchmark_filter=BM_string_memcmp

Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      27 ns         27 ns   26198166   286.069MB/s
BM_string_memcmp/64                     45 ns         45 ns   15553753   1.32443GB/s
BM_string_memcmp/512                   242 ns        242 ns    2892423   1.97049GB/s
BM_string_memcmp/1024                  455 ns        455 ns    1537290   2.09436GB/s
BM_string_memcmp/8k                   3446 ns       3446 ns     203295   2.21392GB/s
BM_string_memcmp/16k                  7567 ns       7567 ns      92582   2.01657GB/s
BM_string_memcmp/32k                 16081 ns      16081 ns      43524    1.8977GB/s
BM_string_memcmp/64k                 31029 ns      31028 ns      22565   1.96712GB/s
BM_string_memcmp_aligned/8             184 ns        184 ns    3800912   41.3654MB/s
BM_string_memcmp_aligned/64            287 ns        287 ns    2438835    212.65MB/s
BM_string_memcmp_aligned/512          1370 ns       1370 ns     511014   356.498MB/s
BM_string_memcmp_aligned/1024         2543 ns       2543 ns     275253   384.006MB/s
BM_string_memcmp_aligned/8k          20413 ns      20411 ns      34306   382.764MB/s
BM_string_memcmp_aligned/16k         42908 ns      42907 ns      16132   364.158MB/s
BM_string_memcmp_aligned/32k         88902 ns      88886 ns       8087   351.574MB/s
BM_string_memcmp_aligned/64k        173016 ns     173007 ns       4122   361.258MB/s
BM_string_memcmp_unaligned/8           212 ns        212 ns    3304163   36.0243MB/s
BM_string_memcmp_unaligned/64          361 ns        361 ns    1941597   169.279MB/s
BM_string_memcmp_unaligned/512        1754 ns       1753 ns     399210   278.492MB/s
BM_string_memcmp_unaligned/1024       3308 ns       3308 ns     211622   295.243MB/s
BM_string_memcmp_unaligned/8k        27227 ns      27225 ns      25637   286.964MB/s
BM_string_memcmp_unaligned/16k       55877 ns      55874 ns      12455   279.645MB/s
BM_string_memcmp_unaligned/32k      112397 ns     112366 ns       6200    278.11MB/s
BM_string_memcmp_unaligned/64k      223493 ns     223482 ns       3127   279.665MB/s

Test: bionic-benchmarks --benchmark_filter='BM_string_memcmp*'

Change-Id: Ia16a8cf69c68b8c0533f025f03b925c9883bb708
2 files changed
tree: ba2a6ab697e79c1cff84c748a5642eab3f001ea1
  1. benchmarks/
  2. build/
  3. docs/
  4. libc/
  5. libdl/
  6. libm/
  7. libstdc++/
  8. linker/
  9. tests/
  10. tools/
  11. .clang-format
  12. .gitignore
  13. android-changes-for-ndk-developers.md
  14. Android.bp
  15. Android.mk
  16. CleanSpec.mk
  17. CPPLINT.cfg
  18. PREUPLOAD.cfg
  19. README.md
README.md

Using bionic

See the additional documentation.

Working on bionic

What are the big pieces of bionic?

libc/ --- libc.so, libc.a

The C library. Stuff like fopen(3) and kill(2).

libm/ --- libm.so, libm.a

The math library. Traditionally Unix systems kept stuff like sin(3) and cos(3) in a separate library to save space in the days before shared libraries.

libdl/ --- libdl.so

The dynamic linker interface library. This is actually just a bunch of stubs that the dynamic linker replaces with pointers to its own implementation at runtime. This is where stuff like dlopen(3) lives.

libstdc++/ --- libstdc++.so

The C++ ABI support functions. The C++ compiler doesn't know how to implement thread-safe static initialization and the like, so it just calls functions that are supplied by the system. Stuff like __cxa_guard_acquire and __cxa_pure_virtual live here.

linker/ --- /system/bin/linker and /system/bin/linker64

The dynamic linker. When you run a dynamically-linked executable, its ELF file has a DT_INTERP entry that says "use the following program to start me". On Android, that's either linker or linker64 (depending on whether it's a 32-bit or 64-bit executable). It's responsible for loading the ELF executable into memory and resolving references to symbols (so that when your code tries to jump to fopen(3), say, it lands in the right place).

tests/ --- unit tests

The tests/ directory contains unit tests. Roughly arranged as one file per publicly-exported header file.

benchmarks/ --- benchmarks

The benchmarks/ directory contains benchmarks, with its own documentation.

What's in libc/?

Adding system calls

Adding a system call usually involves:

  1. Add entries to SYSCALLS.TXT. See SYSCALLS.TXT itself for documentation on the format.
  2. Run the gensyscalls.py script.
  3. Add constants (and perhaps types) to the appropriate header file. Note that you should check to see whether the constants are already in kernel uapi header files, in which case you just need to make sure that the appropriate POSIX header file in libc/include/ includes the relevant file or files.
  4. Add function declarations to the appropriate header file.
  5. Add the function name to the correct section in libc/libc.map.txt and run ./libc/tools/genversion-scripts.py.
  6. Add at least basic tests. Even a test that deliberately supplies an invalid argument helps check that we're generating the right symbol and have the right declaration in the header file, and that you correctly updated the maps in step 5. (You can use strace(1) to confirm that the correct system call is being made.)

Updating kernel header files

As mentioned above, this is currently a two-step process:

  1. Use generate_uapi_headers.sh to go from a Linux source tree to appropriate contents for external/kernel-headers/.
  2. Run update_all.py to scrub those headers and import them into bionic.

Note that if you're actually just trying to expose device-specific headers to build your device drivers, you shouldn't modify bionic. Instead use TARGET_DEVICE_KERNEL_HEADERS and friends described in config.mk.

Updating tzdata

This is fully automated (and these days handled by the libcore team, because they own icu, and that needs to be updated in sync with bionic):

  1. Run update-tzdata.py in external/icu/tools/.

Verifying changes

If you make a change that is likely to have a wide effect on the tree (such as a libc header change), you should run make checkbuild. A regular make will not build the entire tree; just the minimum number of projects that are required for the device. Tests, additional developer tools, and various other modules will not be built. Note that make checkbuild will not be complete either, as make tests covers a few additional modules, but generally speaking make checkbuild is enough.

Running the tests

The tests are all built from the tests/ directory.

Device tests

$ mma # In $ANDROID_ROOT/bionic.
$ adb root && adb remount && adb sync
$ adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests32
$ adb shell \
    /data/nativetest/bionic-unit-tests-static/bionic-unit-tests-static32
# Only for 64-bit targets
$ adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests64
$ adb shell \
    /data/nativetest64/bionic-unit-tests-static/bionic-unit-tests-static64

Note that we use our own custom gtest runner that offers a superset of the options documented at https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#running-test-programs-advanced-options, in particular for test isolation and parallelism (both on by default).

Device tests via CTS

Most of the unit tests are executed by CTS. By default, CTS runs as a non-root user, so the unit tests must also pass when not run as root. Some tests cannot do any useful work unless run as root. In this case, the test should check getuid() == 0 and do nothing otherwise (typically we log in this case to prevent accidents!). Obviously, if the test can be rewritten to not require root, that's an even better solution.

Currently, the list of bionic CTS tests is generated at build time by running a host version of the test executable and dumping the list of all tests. In order for this to continue to work, all architectures must have the same number of tests, and the host version of the executable must also have the same number of tests.

Running the gtests directly is orders of magnitude faster than using CTS, but in cases where you really have to run CTS:

$ make cts # In $ANDROID_ROOT.
$ adb unroot # Because real CTS doesn't run as root.
# This will sync any *test* changes, but not *code* changes:
$ cts-tradefed \
    run singleCommand cts --skip-preconditions -m CtsBionicTestCases

Host tests

The host tests require that you have lunched either an x86 or x86_64 target. Note that due to ABI limitations (specifically, the size of pthread_mutex_t), 32-bit bionic requires PIDs less than 65536. To enforce this, set /proc/sys/kernel/pid_max to 65536.

$ ./tests/run-on-host.sh 32
$ ./tests/run-on-host.sh 64   # For x86_64-bit *targets* only.

You can supply gtest flags as extra arguments to this script.

Against glibc

As a way to check that our tests do in fact test the correct behavior (and not just the behavior we think is correct), it is possible to run the tests against the host's glibc.

$ ./tests/run-on-host.sh glibc

Gathering test coverage

For either host or target coverage, you must first:

  • $ export NATIVE_COVERAGE=true
    • Note that the build system is ignorant to this flag being toggled, i.e. if you change this flag, you will have to manually rebuild bionic.
  • Set bionic_coverage=true in libc/Android.mk and libm/Android.mk.

Coverage from device tests

$ mma
$ adb sync
$ adb shell \
    GCOV_PREFIX=/data/local/tmp/gcov \
    GCOV_PREFIX_STRIP=`echo $ANDROID_BUILD_TOP | grep -o / | wc -l` \
    /data/nativetest/bionic-unit-tests/bionic-unit-tests32
$ acov

acov will pull all coverage information from the device, push it to the right directories, run lcov, and open the coverage report in your browser.

Coverage from host tests

First, build and run the host tests as usual (see above).

$ croot
$ lcov -c -d $ANDROID_PRODUCT_OUT -o coverage.info
$ genhtml -o covreport coverage.info # or lcov --list coverage.info

The coverage report is now available at covreport/index.html.

Attaching GDB to the tests

Bionic's test runner will run each test in its own process by default to prevent tests failures from impacting other tests. This also has the added benefit of running them in parallel, so they are much faster.

However, this also makes it difficult to run the tests under GDB. To prevent each test from being forked, run the tests with the flag --no-isolate.

32-bit ABI bugs

See 32-bit ABI bugs.