Parallel mark stack processing

Enabled parallel mark stack processing by using a thread pool.
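
As a rough illustration of the idea (not the code in this change), the sketch below shows one way several workers could drain a shared mark stack: each worker claims a disjoint chunk by advancing a shared cursor with compare-and-swap. Everything here is an assumption made for illustration: `FakeObject`, `ProcessMarkStackInParallel`, the chunking scheme, and the use of `std::thread` and `std::atomic` in place of the runtime's thread pool and `android_atomic_cas`.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a heap object on the mark stack.
struct FakeObject {
  int visits = 0;  // Counts how many times a worker scanned this entry.
};

// Scans every entry of `stack` using `num_threads` workers and returns the
// number of entries processed. Workers claim [begin, end) chunks by moving a
// shared cursor with compare-and-swap, so no entry is scanned twice.
size_t ProcessMarkStackInParallel(std::vector<FakeObject>& stack,
                                  size_t num_threads, size_t chunk_size) {
  if (chunk_size == 0) chunk_size = 1;  // Avoid claiming empty chunks forever.
  std::atomic<size_t> cursor{0};
  std::atomic<size_t> processed{0};

  auto worker = [&]() {
    for (;;) {
      size_t begin = cursor.load();
      size_t end;
      do {
        if (begin >= stack.size()) return;  // Nothing left to claim.
        end = std::min(begin + chunk_size, stack.size());
        // On failure, compare_exchange_weak reloads `begin` and we retry.
      } while (!cursor.compare_exchange_weak(begin, end));

      // The chunk is exclusively ours, so plain writes are race-free.
      for (size_t i = begin; i < end; ++i) {
        ++stack[i].visits;
      }
      processed.fetch_add(end - begin);
    }
  };

  std::vector<std::thread> pool;
  for (size_t t = 0; t < num_threads; ++t) pool.emplace_back(worker);
  for (auto& t : pool) t.join();
  return processed.load();
}
```

Note that `std::atomic::compare_exchange_weak` returns true on success, whereas `android_atomic_cas` returns zero on success, so callers of the new `CompareAndSwap` helper should test for `== 0`.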

Optimized object scanning by removing dependent loads for IsClass.

Performance:
Prime: ~10% speedup of partial GC.
Nakasi: ~50% speedup of partial GC.

Change-Id: I43256a068efc47cb52d93108458ea18d4e02fccc
diff --git a/src/atomic_integer.h b/src/atomic_integer.h
index adf3e77..22cc7b4 100644
--- a/src/atomic_integer.h
+++ b/src/atomic_integer.h
@@ -71,6 +71,10 @@
   int32_t operator -- () {
     return android_atomic_dec(&value_) - 1;
   }
+
+  int CompareAndSwap(int expected_value, int new_value) {
+    return android_atomic_cas(expected_value, new_value, &value_);
+  }
  private:
   int32_t value_;
 };