Performance improvements by removing a DMB and inlining.

Correct the CAS used by Mutex::Lock to have acquire rather than release semantics.
Don't do a memory barrier in thread transitions when there is already a
barrier associated with the mutator lock.
Force inlining of the hot thread and shared lock code, heavily used by down
calls and JNI.
Force inlining of hot mirror routines that are used by runtime support.
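
For illustration only, the acquire-on-lock ordering and the forced inlining
described above can be sketched with std::atomic. This is not the code in
mutex.h: SpinLock, TryLock and state_ are hypothetical names, and the sketch
assumes a GCC/Clang toolchain for the always_inline attribute.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical spin lock (not ART's Mutex); shows memory ordering only.
class SpinLock {
 public:
  // Taking a lock needs acquire ordering: accesses inside the critical
  // section must not be reordered before the lock is observed as taken.
  __attribute__((always_inline)) inline void Lock() {
    int expected = 0;
    while (!state_.compare_exchange_weak(expected, 1,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed)) {
      expected = 0;  // compare_exchange_weak overwrites expected on failure.
    }
  }

  // Single attempt; returns true if the lock was taken.
  __attribute__((always_inline)) inline bool TryLock() {
    int expected = 0;
    return state_.compare_exchange_strong(expected, 1,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed);
  }

  // Releasing a lock needs release ordering: accesses inside the critical
  // section must be visible before the lock is observed as free.
  __attribute__((always_inline)) inline void Unlock() {
    state_.store(0, std::memory_order_release);
  }

 private:
  std::atomic<int> state_{0};  // 0 = free, 1 = held.
};
```

Pairing an acquire CAS on lock with a release store on unlock is the
conventional arrangement; a release-only CAS on the lock path would not stop
the critical section's accesses from moving ahead of the lock acquisition.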

Performance was measured and improved using perf and maps.

Change-Id: I012580e337143236d8b6d06c1e270183ae51083c
diff --git a/src/base/mutex.h b/src/base/mutex.h
index 8576c03..b530b75 100644
--- a/src/base/mutex.h
+++ b/src/base/mutex.h
@@ -223,14 +223,14 @@
 #endif
 
   // Block until ReaderWriterMutex is shared or free then acquire a share on the access.
-  void SharedLock(Thread* self) SHARED_LOCK_FUNCTION();
+  void SharedLock(Thread* self) SHARED_LOCK_FUNCTION()  __attribute__ ((always_inline));
   void ReaderLock(Thread* self) SHARED_LOCK_FUNCTION() { SharedLock(self); }
 
   // Try to acquire share of ReaderWriterMutex.
   bool SharedTryLock(Thread* self) EXCLUSIVE_TRYLOCK_FUNCTION(true);
 
   // Release a share of the access.
-  void SharedUnlock(Thread* self) UNLOCK_FUNCTION();
+  void SharedUnlock(Thread* self) UNLOCK_FUNCTION() __attribute__ ((always_inline));
   void ReaderUnlock(Thread* self) UNLOCK_FUNCTION() { SharedUnlock(self); }
 
   // Is the current thread the exclusive holder of the ReaderWriterMutex.