Switch to a working UTF-8 mb/wc implementation.

Although glibc gets by with an 8-byte mbstate_t, OpenBSD uses 12 bytes (of
the 128 bytes it reserves!).

We can actually implement UTF-8 encoding/decoding with a 0-byte mbstate_t,
which means we can make things work on LP32 too, as long as we accept the
limitation that the caller needs to present us with a complete sequence
before we'll process it.
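
A minimal sketch of what decoding without any mbstate_t looks like, assuming
mbrtowc(3)-style return codes. The function and constant names are
hypothetical and this is not the code in this patch; it only illustrates why
a truncated sequence can be reported but not resumed when no state is kept.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical return codes, modeled on mbrtowc(3)'s (size_t)-1 and (size_t)-2.
constexpr size_t kIllegalSequence = static_cast<size_t>(-1);
constexpr size_t kIncompleteSequence = static_cast<size_t>(-2);

// Decodes one code point from the first n bytes of s, keeping no state
// between calls. Returns the number of bytes consumed on success.
size_t decode_utf8_stateless(const char* s, size_t n, char32_t* out) {
  if (n == 0) return kIncompleteSequence;
  const uint8_t lead = static_cast<uint8_t>(s[0]);

  size_t length;
  char32_t code_point;
  char32_t minimum;  // Smallest value that legitimately needs `length` bytes.
  if (lead < 0x80) {
    length = 1; code_point = lead; minimum = 0;
  } else if ((lead & 0xe0) == 0xc0) {
    length = 2; code_point = lead & 0x1f; minimum = 0x80;
  } else if ((lead & 0xf0) == 0xe0) {
    length = 3; code_point = lead & 0x0f; minimum = 0x800;
  } else if ((lead & 0xf8) == 0xf0) {
    length = 4; code_point = lead & 0x07; minimum = 0x10000;
  } else {
    return kIllegalSequence;
  }

  // With no state to stash the prefix in, a truncated sequence can only be
  // reported; the caller has to come back with the whole thing.
  if (n < length) return kIncompleteSequence;

  for (size_t i = 1; i < length; ++i) {
    const uint8_t b = static_cast<uint8_t>(s[i]);
    if ((b & 0xc0) != 0x80) return kIllegalSequence;
    code_point = (code_point << 6) | (b & 0x3f);
  }
  if (code_point < minimum) return kIllegalSequence;   // Overlong encoding.
  if (code_point > 0x10ffff) return kIllegalSequence;  // Beyond Unicode.
  if (code_point >= 0xd800 && code_point <= 0xdfff) return kIllegalSequence;

  *out = code_point;
  return length;
}
```

Passing the first two bytes of a three-byte sequence yields
kIncompleteSequence and consumes nothing, which is the limitation described
above.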

Our behavior is fine when going from characters to bytes; we just
update the source wchar_t** to say how far through the input we got.
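
The encoding direction can be sketched as follows. This is an illustration
under assumptions, not the code in this patch: the names are hypothetical,
and char32_t stands in for wchar_t so the example does not depend on the
platform's wchar_t width. The point is that the only "state" needed is the
updated source pointer.

```cpp
#include <cstddef>
#include <cstring>

// Writes the UTF-8 encoding of c to out (room for 4 bytes required).
// Returns the number of bytes written, or 0 if c is not encodable.
size_t encode_utf8(char32_t c, char* out) {
  if (c < 0x80) {
    out[0] = static_cast<char>(c);
    return 1;
  }
  if (c < 0x800) {
    out[0] = static_cast<char>(0xc0 | (c >> 6));
    out[1] = static_cast<char>(0x80 | (c & 0x3f));
    return 2;
  }
  if (c < 0x10000) {
    if (c >= 0xd800 && c <= 0xdfff) return 0;  // Surrogates are not characters.
    out[0] = static_cast<char>(0xe0 | (c >> 12));
    out[1] = static_cast<char>(0x80 | ((c >> 6) & 0x3f));
    out[2] = static_cast<char>(0x80 | (c & 0x3f));
    return 3;
  }
  if (c <= 0x10ffff) {
    out[0] = static_cast<char>(0xf0 | (c >> 18));
    out[1] = static_cast<char>(0x80 | ((c >> 12) & 0x3f));
    out[2] = static_cast<char>(0x80 | ((c >> 6) & 0x3f));
    out[3] = static_cast<char>(0x80 | (c & 0x3f));
    return 4;
  }
  return 0;
}

// Encodes characters from *src until the input runs out, the next character
// does not fit in dst, or a character cannot be encoded. On return, *src
// points at the first character that was not converted.
size_t encode_run(const char32_t** src, size_t src_len, char* dst, size_t dst_len) {
  const char32_t* p = *src;
  const char32_t* end = p + src_len;
  size_t written = 0;
  while (p != end) {
    char buf[4];
    const size_t n = encode_utf8(*p, buf);
    if (n == 0 || n > dst_len - written) break;
    memcpy(dst + written, buf, n);
    written += n;
    ++p;
  }
  *src = p;  // Tell the caller how far through the input we got.
  return written;
}
```

A caller whose output buffer fills up simply calls encode_run again with the
updated pointer and a fresh buffer; no character is ever half-written.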

I'll come back and use the 4 bytes of mbstate_t we do have on LP32 to cope
with byte sequences split across multiple input buffers. Because we don't
support UTF-8 sequences longer than 4 bytes, and because the first byte of
a UTF-8 sequence encodes its length, we shouldn't need the other fields
OpenBSD used (at the cost of some recomputation in cases where a sequence
is split across buffers).
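
To illustrate why the lead byte is enough, here is a small sketch with
hypothetical names. It assumes the saved prefix of a split sequence is kept
as raw bytes; how the follow-up change will actually lay out those 4 bytes
is not stated here. A partial sequence is at most 3 bytes long, and the
expected total can be recomputed from its first byte whenever needed.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the total length of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence of at most 4 bytes.
size_t utf8_sequence_length(uint8_t lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx
  if ((lead & 0xe0) == 0xc0) return 2;  // 110xxxxx
  if ((lead & 0xf0) == 0xe0) return 3;  // 1110xxxx
  if ((lead & 0xf8) == 0xf0) return 4;  // 11110xxx
  return 0;  // Continuation byte, or a 5/6-byte lead we don't support.
}

// Given the bytes saved from a previous buffer (saved_count >= 1), returns
// how many more bytes are needed to complete the sequence. Returns 0 if the
// sequence is already complete or the lead byte is invalid.
size_t utf8_bytes_still_needed(const uint8_t* saved, size_t saved_count) {
  const size_t total = utf8_sequence_length(saved[0]);
  if (total == 0 || saved_count >= total) return 0;
  return total - saved_count;
}
```

Recomputing the length from saved[0] on each call is the "recomputation"
trade-off: it replaces a stored expected-length field with a few bit tests.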

This patch also makes the minimal changes necessary to setlocale(3) to
make us behave like glibc when an app requests UTF-8. (The difference
being that our "C" locale is the same as our "C.UTF-8" locale.)
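
From an app's point of view, the setlocale(3) change can be exercised with
something like the sketch below. The helper name is hypothetical. It uses
only the standard setlocale call, and whether "C.UTF-8" is accepted depends
on the C library it runs against.

```cpp
#include <clocale>
#include <string>

// Requests `name` for LC_CTYPE, then queries the current locale.
// Returns the queried name, or "" if the request was rejected.
std::string request_ctype_locale(const char* name) {
  if (std::setlocale(LC_CTYPE, name) == nullptr) return "";
  const char* current = std::setlocale(LC_CTYPE, nullptr);
  return current != nullptr ? current : "";
}
```

With this patch applied, requesting "C.UTF-8" should make a later query
report "C.UTF-8", while requesting "C" or "POSIX" makes it report "C".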

Change-Id: Ied327a8c4643744b3611bf6bb005a9b389ba4c2f

diff --git a/libc/bionic/locale.cpp b/libc/bionic/locale.cpp
index 5ab834d..3752fa4 100644
--- a/libc/bionic/locale.cpp
+++ b/libc/bionic/locale.cpp
@@ -75,8 +75,12 @@
   gLocale.int_n_sign_posn = CHAR_MAX;
 }
 
+static bool __bionic_current_locale_is_utf8 = false;
+
 static bool __is_supported_locale(const char* locale) {
-  return (strcmp(locale, "") == 0 || strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0);
+  return (strcmp(locale, "") == 0 ||
+          strcmp(locale, "C") == 0 || strcmp(locale, "C.UTF-8") == 0 ||
+          strcmp(locale, "POSIX") == 0);
 }
 
 static locale_t __new_locale() {
@@ -115,26 +119,24 @@
   return __new_locale();
 }
 
-char* setlocale(int category, char const* locale_name) {
+char* setlocale(int category, const char* locale_name) {
   // Is 'category' valid?
   if (category < LC_CTYPE || category > LC_IDENTIFICATION) {
     errno = EINVAL;
     return NULL;
   }
 
-  // Caller just wants to query the current locale?
-  if (locale_name == NULL) {
-    return const_cast<char*>("C");
+  // Caller wants to set the locale rather than just query?
+  if (locale_name != NULL) {
+    if (!__is_supported_locale(locale_name)) {
+      // We don't support this locale.
+      errno = ENOENT;
+      return NULL;
+    }
+    __bionic_current_locale_is_utf8 = (strstr(locale_name, "UTF-8") != NULL);
   }
 
-  // Caller wants one of the mandatory POSIX locales?
-  if (__is_supported_locale(locale_name)) {
-    return const_cast<char*>("C");
-  }
-
-  // We don't support any other locales.
-  errno = ENOENT;
-  return NULL;
+  return const_cast<char*>(__bionic_current_locale_is_utf8 ? "C.UTF-8" : "C");
 }
 
 locale_t uselocale(locale_t new_locale) {