commit | ed9bfc46162414baa7e2db037f4466ff010e2743 | [log] [tgz] |
---|---|---|
author | Sebastian Pop <s.pop@samsung.com> | Mon Jun 19 12:39:02 2017 -0500 |
committer | Jake Weinstein <jake@aospa.co> | Fri Nov 03 13:21:07 2017 -0400 |
tree | ba2a6ab697e79c1cff84c748a5642eab3f001ea1 | |
parent | 232541aa02e22ab8fafcdd503c74aae22fa09699 [diff] |
[AArch64] Optimized memcmp Patch written by Wilco Dijkstra submitted for review to newlib: https://sourceware.org/ml/newlib/2017/msg00524.html This is an optimized memcmp for AArch64. This is a complete rewrite using a different algorithm. The previous version split into cases where both inputs were aligned, the inputs were mutually aligned and unaligned using a byte loop. The new version combines all these cases, while small inputs of less than 8 bytes are handled separately. This allows the main code to be sped up using unaligned loads since there are now at least 8 bytes to be compared. After the first 8 bytes, align the first input. This ensures each iteration does at most one unaligned access and mutually aligned inputs behave as aligned. After the main loop, process the last 8 bytes using unaligned accesses. This improves performance of (mutually) aligned cases by 25% and unaligned by >500% (yes >6 times faster) on large inputs. 2017-06-28 Wilco Dijkstra <wdijkstr@arm.com> * bionic/libc/arch-arm64/generic/bionic/memcmp.S (memcmp): Rewrite of optimized memcmp. GLIBC benchtests/bench-memcmp.c performance comparison for Cortex-A53: Length 1, alignment 1/ 1: 153% Length 1, alignment 1/ 1: 119% Length 1, alignment 1/ 1: 154% Length 2, alignment 2/ 2: 121% Length 2, alignment 2/ 2: 140% Length 2, alignment 2/ 2: 121% Length 3, alignment 3/ 3: 105% Length 3, alignment 3/ 3: 105% Length 3, alignment 3/ 3: 105% Length 4, alignment 4/ 4: 155% Length 4, alignment 4/ 4: 154% Length 4, alignment 4/ 4: 161% Length 5, alignment 5/ 5: 173% Length 5, alignment 5/ 5: 173% Length 5, alignment 5/ 5: 173% Length 6, alignment 6/ 6: 145% Length 6, alignment 6/ 6: 145% Length 6, alignment 6/ 6: 145% Length 7, alignment 7/ 7: 125% Length 7, alignment 7/ 7: 125% Length 7, alignment 7/ 7: 125% Length 8, alignment 8/ 8: 111% Length 8, alignment 8/ 8: 130% Length 8, alignment 8/ 8: 124% Length 9, alignment 9/ 9: 160% Length 9, alignment 9/ 9: 160% Length 9, alignment 9/ 9: 150% Length 10, alignment 10/10: 170% Length 10, alignment 10/10: 137% Length 10, alignment 10/10: 150% Length 11, alignment 11/11: 160% Length 11, alignment 11/11: 160% Length 11, alignment 11/11: 160% Length 12, alignment 12/12: 146% Length 12, alignment 12/12: 168% Length 12, alignment 12/12: 156% Length 13, alignment 13/13: 167% Length 13, alignment 13/13: 167% Length 13, alignment 13/13: 173% Length 14, alignment 14/14: 167% Length 14, alignment 14/14: 168% Length 14, alignment 14/14: 168% Length 15, alignment 15/15: 168% Length 15, alignment 15/15: 173% Length 15, alignment 15/15: 173% Length 1, alignment 0/ 0: 134% Length 1, alignment 0/ 0: 127% Length 1, alignment 0/ 0: 119% Length 2, alignment 0/ 0: 94% Length 2, alignment 0/ 0: 94% Length 2, alignment 0/ 0: 106% Length 3, alignment 0/ 0: 82% Length 3, alignment 0/ 0: 87% Length 3, alignment 0/ 0: 82% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 122% Length 5, alignment 0/ 0: 127% Length 5, alignment 0/ 0: 119% Length 5, alignment 0/ 0: 127% Length 6, alignment 0/ 0: 103% Length 6, alignment 0/ 0: 100% Length 6, alignment 0/ 0: 100% Length 7, alignment 0/ 0: 82% Length 7, alignment 0/ 0: 91% Length 7, alignment 0/ 0: 87% Length 8, alignment 0/ 0: 111% Length 8, alignment 0/ 0: 124% Length 8, alignment 0/ 0: 124% Length 9, alignment 0/ 0: 136% Length 9, alignment 0/ 0: 136% Length 9, alignment 0/ 0: 136% Length 10, alignment 0/ 0: 136% Length 10, alignment 0/ 0: 135% Length 10, alignment 0/ 0: 136% Length 11, alignment 0/ 0: 136% Length 11, alignment 0/ 0: 136% Length 11, alignment 0/ 0: 135% Length 12, alignment 0/ 0: 136% Length 12, alignment 0/ 0: 136% Length 12, alignment 0/ 0: 136% Length 13, alignment 0/ 0: 135% Length 13, alignment 0/ 0: 136% Length 13, alignment 0/ 0: 136% Length 14, alignment 0/ 0: 136% Length 14, alignment 0/ 0: 136% Length 14, alignment 0/ 0: 136% Length 15, alignment 0/ 0: 136% Length 15, alignment 0/ 0: 136% Length 15, alignment 0/ 0: 136% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 115% Length 4, alignment 0/ 0: 115% Length 32, alignment 0/ 0: 127% Length 32, alignment 7/ 2: 395% Length 32, alignment 0/ 0: 127% Length 32, alignment 0/ 0: 127% Length 8, alignment 0/ 0: 111% Length 8, alignment 0/ 0: 124% Length 8, alignment 0/ 0: 124% Length 64, alignment 0/ 0: 128% Length 64, alignment 6/ 4: 475% Length 64, alignment 0/ 0: 131% Length 64, alignment 0/ 0: 134% Length 16, alignment 0/ 0: 128% Length 16, alignment 0/ 0: 119% Length 16, alignment 0/ 0: 128% Length 128, alignment 0/ 0: 129% Length 128, alignment 5/ 6: 475% Length 128, alignment 0/ 0: 130% Length 128, alignment 0/ 0: 129% Length 32, alignment 0/ 0: 126% Length 32, alignment 0/ 0: 126% Length 32, alignment 0/ 0: 126% Length 256, alignment 0/ 0: 127% Length 256, alignment 4/ 8: 545% Length 256, alignment 0/ 0: 126% Length 256, alignment 0/ 0: 128% Length 64, alignment 0/ 0: 171% Length 64, alignment 0/ 0: 171% Length 64, alignment 0/ 0: 174% Length 512, alignment 0/ 0: 126% Length 512, alignment 3/10: 585% Length 512, alignment 0/ 0: 126% Length 512, alignment 0/ 0: 127% Length 128, alignment 0/ 0: 129% Length 128, alignment 0/ 0: 128% Length 128, alignment 0/ 0: 129% Length 1024, alignment 0/ 0: 125% Length 1024, alignment 2/12: 611% Length 1024, alignment 0/ 0: 126% Length 1024, alignment 0/ 0: 126% Length 256, alignment 0/ 0: 128% Length 256, alignment 0/ 0: 127% Length 256, alignment 0/ 0: 128% Length 2048, alignment 0/ 0: 125% Length 2048, alignment 1/14: 625% Length 2048, alignment 0/ 0: 125% Length 2048, alignment 0/ 0: 125% Length 512, alignment 0/ 0: 126% Length 512, alignment 0/ 0: 127% Length 512, alignment 0/ 0: 127% Length 4096, alignment 0/ 0: 125% Length 4096, alignment 0/16: 125% Length 4096, alignment 0/ 0: 125% Length 4096, alignment 0/ 0: 125% Length 1024, alignment 0/ 0: 126% Length 1024, alignment 0/ 0: 126% Length 1024, alignment 0/ 0: 126% Length 8192, alignment 0/ 0: 125% Length 8192, alignment 63/18: 636% Length 8192, alignment 0/ 0: 125% Length 8192, alignment 0/ 0: 125% Length 16, alignment 1/ 2: 317% Length 16, alignment 1/ 2: 317% Length 16, alignment 1/ 2: 317% Length 32, alignment 2/ 4: 395% Length 32, alignment 2/ 4: 395% Length 32, alignment 2/ 4: 398% Length 64, alignment 3/ 6: 475% Length 64, alignment 3/ 6: 475% Length 64, alignment 3/ 6: 477% Length 128, alignment 4/ 8: 479% Length 128, alignment 4/ 8: 479% Length 128, alignment 4/ 8: 479% Length 256, alignment 5/10: 543% Length 256, alignment 5/10: 539% Length 256, alignment 5/10: 543% Length 512, alignment 6/12: 585% Length 512, alignment 6/12: 585% Length 512, alignment 6/12: 585% Length 1024, alignment 7/14: 611% Length 1024, alignment 7/14: 611% Length 1024, alignment 7/14: 611% The performance measured on the bionic-benchmarks on a hikey board with a new benchmark for unaligned memcmp submitted for review at https://android-review.googlesource.com/414860 The base is with the libc from /system/lib64. The bionic libc with this patch is in /data. hikey:/data # export LD_LIBRARY_PATH=/system/lib64 hikey:/data # ./bionic-benchmarks --benchmark_filter=BM_string_memcmp Run on (8 X 2.4 MHz CPU s) Benchmark Time CPU Iterations ---------------------------------------------------------------------- BM_string_memcmp/8 30 ns 30 ns 22955680 251.07MB/s BM_string_memcmp/64 57 ns 57 ns 12349184 1076.99MB/s BM_string_memcmp/512 305 ns 305 ns 2297163 1.56496GB/s BM_string_memcmp/1024 571 ns 571 ns 1225211 1.66912GB/s BM_string_memcmp/8k 4307 ns 4306 ns 162562 1.77177GB/s BM_string_memcmp/16k 8676 ns 8675 ns 80676 1.75887GB/s BM_string_memcmp/32k 19233 ns 19230 ns 36394 1.58695GB/s BM_string_memcmp/64k 36986 ns 36984 ns 18952 1.65029GB/s BM_string_memcmp_aligned/8 199 ns 199 ns 3519166 38.3336MB/s BM_string_memcmp_aligned/64 386 ns 386 ns 1810734 158.073MB/s BM_string_memcmp_aligned/512 1735 ns 1734 ns 403981 281.525MB/s BM_string_memcmp_aligned/1024 3200 ns 3200 ns 218838 305.151MB/s BM_string_memcmp_aligned/8k 25084 ns 25080 ns 28180 311.507MB/s BM_string_memcmp_aligned/16k 51730 ns 51729 ns 13521 302.057MB/s BM_string_memcmp_aligned/32k 103228 ns 103228 ns 6782 302.727MB/s BM_string_memcmp_aligned/64k 207117 ns 207087 ns 3450 301.806MB/s BM_string_memcmp_unaligned/8 339 ns 339 ns 2070998 22.5302MB/s BM_string_memcmp_unaligned/64 1392 ns 1392 ns 502796 43.8454MB/s BM_string_memcmp_unaligned/512 9194 ns 9194 ns 76133 53.1104MB/s BM_string_memcmp_unaligned/1024 18325 ns 18323 ns 38206 53.2963MB/s BM_string_memcmp_unaligned/8k 148579 ns 148574 ns 4713 52.5831MB/s BM_string_memcmp_unaligned/16k 298169 ns 298120 ns 2344 52.4118MB/s BM_string_memcmp_unaligned/32k 598813 ns 598797 ns 1085 52.188MB/s BM_string_memcmp_unaligned/64k 1196079 ns 1196083 ns 540 52.2539MB/s hikey:/data # export LD_LIBRARY_PATH=/data hikey:/data # ./bionic-benchmarks --benchmark_filter=BM_string_memcmp Benchmark Time CPU Iterations ---------------------------------------------------------------------- BM_string_memcmp/8 27 ns 27 ns 26198166 286.069MB/s BM_string_memcmp/64 45 ns 45 ns 15553753 1.32443GB/s BM_string_memcmp/512 242 ns 242 ns 2892423 1.97049GB/s BM_string_memcmp/1024 455 ns 455 ns 1537290 2.09436GB/s BM_string_memcmp/8k 3446 ns 3446 ns 203295 2.21392GB/s BM_string_memcmp/16k 7567 ns 7567 ns 92582 2.01657GB/s BM_string_memcmp/32k 16081 ns 16081 ns 43524 1.8977GB/s BM_string_memcmp/64k 31029 ns 31028 ns 22565 1.96712GB/s BM_string_memcmp_aligned/8 184 ns 184 ns 3800912 41.3654MB/s BM_string_memcmp_aligned/64 287 ns 287 ns 2438835 212.65MB/s BM_string_memcmp_aligned/512 1370 ns 1370 ns 511014 356.498MB/s BM_string_memcmp_aligned/1024 2543 ns 2543 ns 275253 384.006MB/s BM_string_memcmp_aligned/8k 20413 ns 20411 ns 34306 382.764MB/s BM_string_memcmp_aligned/16k 42908 ns 42907 ns 16132 364.158MB/s BM_string_memcmp_aligned/32k 88902 ns 88886 ns 8087 351.574MB/s BM_string_memcmp_aligned/64k 173016 ns 173007 ns 4122 361.258MB/s BM_string_memcmp_unaligned/8 212 ns 212 ns 3304163 36.0243MB/s BM_string_memcmp_unaligned/64 361 ns 361 ns 1941597 169.279MB/s BM_string_memcmp_unaligned/512 1754 ns 1753 ns 399210 278.492MB/s BM_string_memcmp_unaligned/1024 3308 ns 3308 ns 211622 295.243MB/s BM_string_memcmp_unaligned/8k 27227 ns 27225 ns 25637 286.964MB/s BM_string_memcmp_unaligned/16k 55877 ns 55874 ns 12455 279.645MB/s BM_string_memcmp_unaligned/32k 112397 ns 112366 ns 6200 278.11MB/s BM_string_memcmp_unaligned/64k 223493 ns 223482 ns 3127 279.665MB/s Test: bionic-benchmarks --benchmark_filter='BM_string_memcmp*' Change-Id: Ia16a8cf69c68b8c0533f025f03b925c9883bb708
See the additional documentation.
The C library. Stuff like fopen(3)
and kill(2)
.
The math library. Traditionally Unix systems kept stuff like sin(3)
and cos(3)
in a separate library to save space in the days before shared libraries.
The dynamic linker interface library. This is actually just a bunch of stubs that the dynamic linker replaces with pointers to its own implementation at runtime. This is where stuff like dlopen(3)
lives.
The C++ ABI support functions. The C++ compiler doesn't know how to implement thread-safe static initialization and the like, so it just calls functions that are supplied by the system. Stuff like __cxa_guard_acquire
and __cxa_pure_virtual
live here.
The dynamic linker. When you run a dynamically-linked executable, its ELF file has a DT_INTERP
entry that says "use the following program to start me". On Android, that's either linker
or linker64
(depending on whether it's a 32-bit or 64-bit executable). It's responsible for loading the ELF executable into memory and resolving references to symbols (so that when your code tries to jump to fopen(3)
, say, it lands in the right place).
The tests/
directory contains unit tests. Roughly arranged as one file per publicly-exported header file.
The benchmarks/
directory contains benchmarks, with its own documentation.
Adding a system call usually involves:
./libc/tools/genversion-scripts.py
.As mentioned above, this is currently a two-step process:
Note that if you're actually just trying to expose device-specific headers to build your device drivers, you shouldn't modify bionic. Instead use TARGET_DEVICE_KERNEL_HEADERS
and friends described in config.mk.
This is fully automated (and these days handled by the libcore team, because they own icu, and that needs to be updated in sync with bionic):
If you make a change that is likely to have a wide effect on the tree (such as a libc header change), you should run make checkbuild
. A regular make
will not build the entire tree; just the minimum number of projects that are required for the device. Tests, additional developer tools, and various other modules will not be built. Note that make checkbuild
will not be complete either, as make tests
covers a few additional modules, but generally speaking make checkbuild
is enough.
The tests are all built from the tests/ directory.
$ mma # In $ANDROID_ROOT/bionic. $ adb root && adb remount && adb sync $ adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests32 $ adb shell \ /data/nativetest/bionic-unit-tests-static/bionic-unit-tests-static32 # Only for 64-bit targets $ adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests64 $ adb shell \ /data/nativetest64/bionic-unit-tests-static/bionic-unit-tests-static64
Note that we use our own custom gtest runner that offers a superset of the options documented at https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#running-test-programs-advanced-options, in particular for test isolation and parallelism (both on by default).
Most of the unit tests are executed by CTS. By default, CTS runs as a non-root user, so the unit tests must also pass when not run as root. Some tests cannot do any useful work unless run as root. In this case, the test should check getuid() == 0
and do nothing otherwise (typically we log in this case to prevent accidents!). Obviously, if the test can be rewritten to not require root, that's an even better solution.
Currently, the list of bionic CTS tests is generated at build time by running a host version of the test executable and dumping the list of all tests. In order for this to continue to work, all architectures must have the same number of tests, and the host version of the executable must also have the same number of tests.
Running the gtests directly is orders of magnitude faster than using CTS, but in cases where you really have to run CTS:
$ make cts # In $ANDROID_ROOT. $ adb unroot # Because real CTS doesn't run as root. # This will sync any *test* changes, but not *code* changes: $ cts-tradefed \ run singleCommand cts --skip-preconditions -m CtsBionicTestCases
The host tests require that you have lunch
ed either an x86 or x86_64 target. Note that due to ABI limitations (specifically, the size of pthread_mutex_t), 32-bit bionic requires PIDs less than 65536. To enforce this, set /proc/sys/kernel/pid_max to 65536.
$ ./tests/run-on-host.sh 32 $ ./tests/run-on-host.sh 64 # For x86_64-bit *targets* only.
You can supply gtest flags as extra arguments to this script.
As a way to check that our tests do in fact test the correct behavior (and not just the behavior we think is correct), it is possible to run the tests against the host's glibc.
$ ./tests/run-on-host.sh glibc
For either host or target coverage, you must first:
$ export NATIVE_COVERAGE=true
bionic_coverage=true
in libc/Android.mk
and libm/Android.mk
.$ mma $ adb sync $ adb shell \ GCOV_PREFIX=/data/local/tmp/gcov \ GCOV_PREFIX_STRIP=`echo $ANDROID_BUILD_TOP | grep -o / | wc -l` \ /data/nativetest/bionic-unit-tests/bionic-unit-tests32 $ acov
acov
will pull all coverage information from the device, push it to the right directories, run lcov
, and open the coverage report in your browser.
First, build and run the host tests as usual (see above).
$ croot $ lcov -c -d $ANDROID_PRODUCT_OUT -o coverage.info $ genhtml -o covreport coverage.info # or lcov --list coverage.info
The coverage report is now available at covreport/index.html
.
Bionic's test runner will run each test in its own process by default to prevent tests failures from impacting other tests. This also has the added benefit of running them in parallel, so they are much faster.
However, this also makes it difficult to run the tests under GDB. To prevent each test from being forked, run the tests with the flag --no-isolate
.
See 32-bit ABI bugs.