Faster TLAB allocator.

New TLAB allocator doesn't increment bytes allocated until we allocate
a new TLAB. This increases allocation performance by avoiding a CAS.

MemAllocTest:
Before GSS TLAB: 3400ms.
After GSS TLAB: 2750ms.

Bug: 9986565

Change-Id: I1673c27555330ee90d353b98498fa0e67bd57fad
diff --git a/runtime/thread.h b/runtime/thread.h
index 4312741..1b335c8 100644
--- a/runtime/thread.h
+++ b/runtime/thread.h
@@ -781,7 +781,7 @@
   void RevokeThreadLocalAllocationStack();
 
   size_t GetThreadLocalBytesAllocated() const {
-    return tlsPtr_.thread_local_pos - tlsPtr_.thread_local_start;
+    return tlsPtr_.thread_local_end - tlsPtr_.thread_local_start;
   }
 
   size_t GetThreadLocalObjectsAllocated() const {