Reduce the number of fences needed for monitors

Add the necessary CasWeakAcquire primitives for LockWords.

Have MonitorEnter initially read the lockword using a
memory_order_relaxed operation. In the unlikely case that we need
stronger ordering, compensate with an explicit fence.

In the uncontended case, install the thin lock with Acquire rather
than SequentiallyConsistent semantics.

Have MonitorExit use a Release CAS instead of a SequentiallyConsistent
CAS in the ReadBarrier case. Add TODO for the non-ReadBarrier case.

Together, these should usually eliminate 3 fences (or acquire/release
operations) per critical section.
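
For illustration only, the ordering changes above can be sketched with
std::atomic. ThinLock, TryLock, Unlock and IsLocked are hypothetical
names, and the single-word layout (0 = unlocked, otherwise owner id) is
an assumption, not ART's LockWord encoding. The sketch uses a strong
CAS to stay loop-free, whereas the change itself adds CasWeakAcquire;
the acquire fence on the contended path is likewise an assumption about
which fence would be appropriate.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical thin lock: 0 means unlocked, any other value is the owner's
// thread id. A stand-in for illustration, not ART's LockWord.
class ThinLock {
 public:
  // Returns true if the lock was taken on the uncontended path.
  bool TryLock(uint32_t thread_id) {
    assert(thread_id != 0);
    // Relaxed read: inspecting the word needs no ordering by itself.
    uint32_t observed = word_.load(std::memory_order_relaxed);
    if (observed != 0) {
      // Unlikely path: compensate for the relaxed load with an explicit
      // fence before acting on state published by the owner (assumed here
      // to be an acquire fence).
      std::atomic_thread_fence(std::memory_order_acquire);
      return false;
    }
    // Acquire, not seq_cst: enough to keep the critical section from being
    // reordered above the lock acquisition.
    return word_.compare_exchange_strong(observed, thread_id,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
  }

  // Releases a lock held by thread_id.
  void Unlock(uint32_t thread_id) {
    uint32_t expected = thread_id;
    // Release, not seq_cst: enough to keep the critical section from being
    // reordered below the unlock.
    bool ok = word_.compare_exchange_strong(expected, 0u,
                                            std::memory_order_release,
                                            std::memory_order_relaxed);
    assert(ok);  // Caller must own the lock.
    (void)ok;
  }

  bool IsLocked() const {
    return word_.load(std::memory_order_relaxed) != 0;
  }

 private:
  std::atomic<uint32_t> word_{0};
};
```

A caller that gets false from TryLock would fall back to a slow path
(spinning or inflation), which this sketch does not model.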

Have Install() only use Release ordering.

Add TODO for inflation spinning, which looks to me like it could be
improved appreciably.

Drive-by fix: correct the spelling of
GetMaxSpinsBeforeThinLockInflation (was
GetMaxSpinsBeforeThinkLockInflation).

Test: Build for several targets, boot, m art-test-host art-test-target

Change-Id: I2cab09723252065f6365e4234ee3249c69ece888
diff --git a/runtime/runtime.h b/runtime/runtime.h
index d40c631..8fc211c 100644
--- a/runtime/runtime.h
+++ b/runtime/runtime.h
@@ -268,7 +268,7 @@
     return java_vm_.get();
   }
 
-  size_t GetMaxSpinsBeforeThinkLockInflation() const {
+  size_t GetMaxSpinsBeforeThinLockInflation() const {
     return max_spins_before_thin_lock_inflation_;
   }