==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

LLVM supports instructions which are well-defined in the presence of threads and
asynchronous signals.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The C++11 ``<atomic>`` header. (`C++11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.

Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
which ensures that every volatile load and store happens and is performed in the
stated order. A couple examples: if a SequentiallyConsistent store is
immediately followed by another SequentiallyConsistent store to the same
address, the first store can be erased. This transformation is not allowed for a
pair of volatile stores. On the other hand, a non-volatile non-atomic load can
be moved across a volatile load freely, but not an Acquire load.
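
As a concrete illustration, here is the same contrast written in C (following
the convention of the examples below, this is C for readability; run it through
``clang -O2 -S -emit-llvm`` to see the corresponding IR). The names are purely
illustrative:

.. code-block:: c

  #include <stdatomic.h>

  _Atomic int a;
  volatile int v;

  void g(void) {
    atomic_store(&a, 1); /* SequentiallyConsistent store: may be erased,  */
    atomic_store(&a, 2); /* because the next store overwrites it.         */
    v = 1;               /* volatile stores: both must be performed, in   */
    v = 2;               /* the stated order.                             */
  }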

This document is intended to provide a guide for anyone writing a frontend for
LLVM or working on optimization passes for LLVM on how to deal with
instructions with special semantics in the presence of concurrency. This is not
intended to be a precise guide to the semantics; the details can get extremely
complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section specifically goes into the one optimizer restriction which applies in
concurrent environments; it gets a somewhat extended description because any
optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise. Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.

Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store.
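
In C11 terms, ``cmpxchg`` roughly corresponds to ``atomic_compare_exchange_*``
and ``atomicrmw`` to the ``atomic_fetch_*`` operations. A minimal sketch (the
function names are illustrative only):

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>

  _Atomic int counter;

  /* Becomes an 'atomicrmw add' at the IR level. */
  int increment(void) {
    return atomic_fetch_add(&counter, 1);
  }

  /* Becomes a 'cmpxchg': no other memory operation on any thread can happen
     between its load and its conditional store. */
  bool try_update(int expected, int desired) {
    return atomic_compare_exchange_strong(&counter, &expected, desired);
  }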

A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load, and a Monotonic store following a Release fence is roughly
equivalent to a Release store. SequentiallyConsistent fences behave as both
an Acquire and a Release fence, and offer some additional complicated
guarantees; see the C++11 standard for details.
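
For example, the following two C11 functions are roughly equivalent ways of
performing an Acquire load (the names are illustrative only):

.. code-block:: c

  #include <stdatomic.h>

  _Atomic int flag;

  int acquire_load(void) {
    return atomic_load_explicit(&flag, memory_order_acquire);
  }

  int relaxed_load_then_fence(void) {
    /* A Monotonic (relaxed) load followed by an Acquire fence. */
    int v = atomic_load_explicit(&flag, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return v;
  }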

Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can be
impossible to generate.
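
One way a frontend (or user code) can query this is through builtins such as
``__atomic_always_lock_free``, which clang and GCC provide; the wrapper name
below is illustrative:

.. code-block:: c

  #include <stdbool.h>

  /* True if a naturally-aligned 16-byte atomic access is guaranteed to be
     lock-free on this target; if not, an inline lock-free atomic instruction
     of that width cannot be generated and some other lowering is needed. */
  bool has_lock_free_16(void) {
    return __atomic_always_lock_free(16, 0);
  }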

.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release, which are not ordered with respect to each other. (See also
`LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible. (The
  precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable. Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals, but
  atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not. See
  `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store. This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends are
  also expected to generate an i8 store as an i8 store, and not an instruction
  which writes to surrounding bytes. (If you are writing a backend for an
  architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvm-dev.)

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior. It also
guarantees that the operation is lock-free, so it does not depend on the data
being part of a special atomic structure or depend on a separate per-process
global lock. Note that code generation will fail for unsupported atomic
operations; if you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for loads of a native width, but can be expensive or unavailable for
  wider accesses, like a 64-bit store on ARM. (A frontend for Java or other
  "safe" languages would normally split a 64-bit store on ARM into two 32-bit
  unordered stores.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms a
  single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise. Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored. A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).

Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.
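
A typical use is a statistics counter whose increments do not publish any other
memory (the names here are illustrative):

.. code-block:: c

  #include <stdatomic.h>

  /* Every increment is atomic and all threads agree on a single order of the
     updates to 'hits', but the counter does not order any other memory. */
  _Atomic long hits;

  void record_hit(void) {
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
  }

  long read_hits(void) {
    return atomic_load_explicit(&hits, memory_order_relaxed);
  }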

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution. The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct. Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the relevant
  memory location (and alias analysis will take advantage of that). In addition,
  it is legal to reorder non-atomic and Unordered loads around Monotonic
  loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
  operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as for Unordered loads and
  stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.

Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.
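
The classic use is publishing data with a Release store and consuming it with a
paired Acquire load, as in the following sketch (see also the Release section
below; the names are illustrative only):

.. code-block:: c

  #include <stdatomic.h>

  int payload;        /* plain data, published via 'ready' */
  _Atomic int ready;

  void producer(void) {
    payload = 42;
    /* Release store: pairs with the Acquire load in consumer(). */
    atomic_store_explicit(&ready, 1, memory_order_release);
  }

  int consumer(void) {
    /* Acquire load: if it observes 1, the write to 'payload' is visible. */
    if (atomic_load_explicit(&ready, memory_order_acquire))
      return payload;
    return -1;
  }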

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acquire``. It should also be
  used for C++11/C11 ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant today
  except x86 and SPARC) require some sort of fence to maintain the Acquire
  semantics. The precise fences required vary widely by architecture, but for
  a simple implementation, most architectures provide a barrier which is strong
  enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
  such a fence after the equivalent Monotonic operation is sufficient to
  maintain Acquire semantics for a memory operation.

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.
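
The total order matters for store-load patterns such as this Dekker-style
fragment; with SequentiallyConsistent operations at least one of the two loads
must observe the other thread's store, a guarantee that Acquire/Release alone
does not provide (the names are illustrative):

.. code-block:: c

  #include <stdatomic.h>

  _Atomic int x, y;

  /* Each function runs on its own thread. With seq_cst (the default for
     atomic_store/atomic_load), both functions returning 0 is impossible;
     with only Release stores and Acquire loads it would be allowed. */
  int thread1(void) { atomic_store(&x, 1); return atomic_load(&y); }
  int thread2(void) { atomic_store(&y, 1); return atomic_load(&x); }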

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_seq_cst``, Java volatile, and
  the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release
  barriers. Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
  is usually done by emitting either a full fence before the loads or a full
  fence after the stores; which is preferred varies by architecture.

Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic. This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* ``isStrongerThan`` / ``isAtLeastOrStrongerThan``: These are predicates on
  orderings. They can be useful for passes that are aware of atomics, for
  example to do DSE across a single atomic access, but not across a
  release-acquire pair (see MemoryDependencyAnalysis for an example of this).

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done. If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores. It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop. It just treats
  monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important. It is possible in some cases for DSE to operate across a stronger
  atomic operation, but it is fairly tricky. DSE delegates this reasoning to
  MemoryDependencyAnalysis (which is also used by other passes like GVN).

* Folding a load: Any atomic load from a constant global can be constant-folded,
  because it cannot be observed. Similar reasoning allows SROA with
  atomic loads and stores.

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if
``shouldInsertFencesForAtomic()`` was used.
Bill Wendling | 3bf24bd | 2012-06-29 09:00:01 +0000 | [diff] [blame] | 411 | |
| 412 | The MachineMemOperand for all atomic operations is currently marked as volatile; |
| 413 | this is not correct in the IR sense of volatile, but CodeGen handles anything |
| 414 | marked volatile very conservatively. This should get fixed at some point. |
| 415 | |
James Y Knight | 238d819 | 2016-04-12 20:18:48 +0000 | [diff] [blame] | 416 | One very important property of the atomic operations is that if your backend |
| 417 | supports any inline lock-free atomic operations of a given size, you should |
| 418 | support *ALL* operations of that size in a lock-free manner. |
| 419 | |
| 420 | When the target implements atomic ``cmpxchg`` or LL/SC instructions (as most do) |
| 421 | this is trivial: all the other operations can be implemented on top of those |
| 422 | primitives. However, on many older CPUs (e.g. ARMv5, SparcV8, Intel 80386) there |
| 423 | are atomic load and store instructions, but no ``cmpxchg`` or LL/SC. As it is |
| 424 | invalid to implement ``atomic load`` using the native instruction, but |
| 425 | ``cmpxchg`` using a library call to a function that uses a mutex, ``atomic |
| 426 | load`` must *also* expand to a library call on such architectures, so that it |
| 427 | can remain atomic with regards to a simultaneous ``cmpxchg``, by using the same |
| 428 | mutex. |
| 429 | |
| 430 | AtomicExpandPass can help with that: it will expand all atomic operations to the |
| 431 | proper ``__atomic_*`` libcalls for any size above the maximum set by |
| 432 | ``setMaxAtomicSizeInBitsSupported`` (which defaults to 0). |

On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated. ``cmpxchg`` uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.

On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release,
and SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.).

It is often easiest for backends to use AtomicExpandPass to lower some of the
atomic constructs. Here are some lowerings it can do:

* cmpxchg -> loop with load-linked/store-conditional
  by overriding ``shouldExpandAtomicCmpXchgInIR()``, ``emitLoadLinked()``,
  ``emitStoreConditional()``
* large loads/stores -> ll-sc/cmpxchg
  by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
* strong atomic accesses -> monotonic accesses + fences by overriding
  ``shouldInsertFencesForAtomic()``, ``emitLeadingFence()``, and
  ``emitTrailingFence()``
* atomic rmw -> loop with cmpxchg or load-linked/store-conditional
  by overriding ``expandAtomicRMWInIR()``
* expansion to __atomic_* libcalls for unsupported sizes.
* part-word atomicrmw/cmpxchg -> target-specific intrinsic by overriding
  ``shouldExpandAtomicRMWInIR``, ``emitMaskedAtomicRMWIntrinsic``,
  ``shouldExpandAtomicCmpXchgInIR``, and ``emitMaskedAtomicCmpXchgIntrinsic``.

For an example of these, look at the ARM backend (first five lowerings) or the
RISC-V backend (last lowering).

AtomicExpandPass supports two strategies for lowering atomicrmw/cmpxchg to
load-linked/store-conditional (LL/SC) loops. The first expands the LL/SC loop
in IR, calling target lowering hooks to emit intrinsics for the LL and SC
operations. However, many architectures have strict requirements for LL/SC
loops to ensure forward progress, such as restrictions on the number and type
of instructions in the loop. It isn't possible to enforce these restrictions
when the loop is expanded in LLVM IR, and so affected targets may prefer to
expand to LL/SC loops at a very late stage (i.e. after register allocation).
AtomicExpandPass can help support lowering of part-word atomicrmw or cmpxchg
using this strategy by producing IR for any shifting and masking that can be
performed outside of the LL/SC loop.

Libcalls: __atomic_*
====================

There are two kinds of atomic library calls that are generated by LLVM. Please
note that both sets of library functions somewhat confusingly share the names of
builtin functions defined by clang. Despite this, the library functions are
not directly related to the builtins: it is *not* the case that ``__atomic_*``
builtins lower to ``__atomic_*`` library calls and ``__sync_*`` builtins lower
to ``__sync_*`` library calls.

The first set of library functions are named ``__atomic_*``. This set has been
"standardized" by GCC, and is described below. (See also `GCC's documentation
<https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary>`_)

LLVM's AtomicExpandPass will translate atomic operations on data sizes above
``MaxAtomicSizeInBitsSupported`` into calls to these functions.

There are four generic functions, which can be called with data of any size or
alignment::

  void __atomic_load(size_t size, void *ptr, void *ret, int ordering)
  void __atomic_store(size_t size, void *ptr, void *val, int ordering)
  void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, int ordering)
  bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, void *desired, int success_order, int failure_order)

There are also size-specialized versions of the above functions, which can only
be used with *naturally-aligned* pointers of the appropriate size. In the
signatures below, "N" is one of 1, 2, 4, 8, and 16, and "iN" is the appropriate
integer type of that size; if no such integer type exists, the specialization
cannot be used::

  iN __atomic_load_N(iN *ptr, int ordering)
  void __atomic_store_N(iN *ptr, iN val, int ordering)
  iN __atomic_exchange_N(iN *ptr, iN val, int ordering)
  bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, int success_order, int failure_order)

Finally there are some read-modify-write functions, which are only available in
the size-specific variants (any other sizes use a ``__atomic_compare_exchange``
loop)::

  iN __atomic_fetch_add_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_sub_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_and_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_or_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_xor_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_nand_N(iN *ptr, iN val, int ordering)

This set of library functions has some interesting implementation requirements
to take note of:

- They support all sizes and alignments -- including those which cannot be
  implemented natively on any existing hardware. Therefore, they will certainly
  use mutexes for some sizes/alignments.

- As a consequence, they cannot be shipped in a statically linked
  compiler-support library, as they have state which must be shared amongst all
  DSOs loaded in the program. They must be provided in a shared library used by
  all objects.

- The set of atomic sizes supported lock-free must be a superset of the sizes
  any compiler can emit. That is: if a new compiler introduces support for
  inline-lock-free atomics of size N, the ``__atomic_*`` functions must also have
  a lock-free implementation for size N. This is a requirement so that code
  produced by an old compiler (which will have called the ``__atomic_*``
  function) interoperates with code produced by the new compiler (which will use
  the native atomic instruction).

Note that it's possible to write an entirely target-independent implementation
of these library functions by using the compiler atomic builtins themselves to
implement the operations on naturally-aligned pointers of supported sizes, and a
generic mutex implementation otherwise.
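
A sketch of that approach is shown below. It assumes a single global lock for
the generic fallback, where a production-quality library would more likely hash
the address into a table of locks; the entry points follow the naming
convention above, and everything else is illustrative:

.. code-block:: c

  #include <pthread.h>
  #include <stdint.h>
  #include <string.h>

  /* Size-specialized entry point: for sizes the target supports inline, the
     compiler builtin expands to the native lock-free instruction. */
  uint32_t __atomic_load_4(uint32_t *ptr, int ordering) {
    return __atomic_load_n(ptr, ordering);
  }

  /* Generic entry point: must handle any size and alignment, so it falls
     back to a lock. */
  static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

  void __atomic_load(size_t size, void *ptr, void *ret, int ordering) {
    (void)ordering; /* the mutex already provides sequential consistency */
    pthread_mutex_lock(&atomic_lock);
    memcpy(ret, ptr, size);
    pthread_mutex_unlock(&atomic_lock);
  }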

Libcalls: __sync_*
==================

Some targets or OS/target combinations can support lock-free atomics, but for
various reasons, it is not practical to emit the instructions inline.

There are two typical examples of this.

Some CPUs support multiple instruction sets which can be switched back and forth
on function-call boundaries. For example, MIPS supports the MIPS16 ISA, which
has a smaller instruction encoding than the usual MIPS32 ISA. ARM, similarly,
has the Thumb ISA. In MIPS16 and earlier versions of Thumb, the atomic
instructions are not encodable. However, those instructions are available via a
function call to a function with the longer encoding.

Additionally, a few OS/target pairs provide kernel-supported lock-free
atomics. ARM/Linux is an example of this: the kernel `provides
<https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt>`_ a
function which on older CPUs contains a "magically-restartable" atomic sequence
(which looks atomic so long as there's only one CPU), and contains actual atomic
instructions on newer multicore models. This sort of functionality can typically
be provided on any architecture, if all CPUs which are missing atomic
compare-and-swap support are uniprocessor (no SMP). This is almost always the
case. The only common architecture without that property is SPARC -- SPARCV8 SMP
systems were common, yet it doesn't support any sort of compare-and-swap
operation.

In either of these cases, the Target in LLVM can claim support for atomics of an
appropriate size, and then implement some subset of the operations via libcalls
to a ``__sync_*`` function. Such functions *must* not use locks in their
implementation, because unlike the ``__atomic_*`` routines used by
AtomicExpandPass, these may be mixed-and-matched with native instructions by the
target lowering.

Further, these routines do not need to be shared, as they are stateless. So,
there is no issue with having multiple copies included in one binary. Thus,
typically these routines are implemented by the statically-linked compiler
runtime support library.

LLVM will emit a call to an appropriate ``__sync_*`` routine if the target
ISelLowering code has set the corresponding ``ATOMIC_CMPXCHG``, ``ATOMIC_SWAP``,
or ``ATOMIC_LOAD_*`` operation to "Expand", and if it has opted into the
availability of those library functions via a call to ``initSyncLibcalls()``.

The full set of functions that may be called by LLVM is (for ``N`` being 1, 2,
4, 8, or 16)::

  iN __sync_val_compare_and_swap_N(iN *ptr, iN expected, iN desired)
  iN __sync_lock_test_and_set_N(iN *ptr, iN val)
  iN __sync_fetch_and_add_N(iN *ptr, iN val)
  iN __sync_fetch_and_sub_N(iN *ptr, iN val)
  iN __sync_fetch_and_and_N(iN *ptr, iN val)
  iN __sync_fetch_and_or_N(iN *ptr, iN val)
  iN __sync_fetch_and_xor_N(iN *ptr, iN val)
  iN __sync_fetch_and_nand_N(iN *ptr, iN val)
  iN __sync_fetch_and_max_N(iN *ptr, iN val)
  iN __sync_fetch_and_umax_N(iN *ptr, iN val)
  iN __sync_fetch_and_min_N(iN *ptr, iN val)
  iN __sync_fetch_and_umin_N(iN *ptr, iN val)

This list doesn't include any function for atomic load or store; all known
architectures support atomic loads and stores directly (possibly by emitting a
fence on either side of a normal load or store.)

There's also, somewhat separately, the possibility to lower ``ATOMIC_FENCE`` to
``__sync_synchronize()``. This may happen or not happen independent of all the
above, controlled purely by ``setOperationAction(ISD::ATOMIC_FENCE, ...)``.