==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

LLVM supports instructions which are well-defined in the presence of threads and
asynchronous signals.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The C++11 ``<atomic>`` header. (`C++11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.

Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
which ensures that every volatile load and store happens and is performed in the
stated order. A couple examples: if a SequentiallyConsistent store is
immediately followed by another SequentiallyConsistent store to the same
address, the first store can be erased. This transformation is not allowed for a
pair of volatile stores. On the other hand, a non-volatile non-atomic load can
be moved across a volatile load freely, but not an Acquire load.
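
As a concrete illustration, here is the same contrast written in C (following
the convention of the examples below, this is C for readability; run it through
``clang -O2 -S -emit-llvm`` to see the corresponding IR). The names are purely
illustrative:

.. code-block:: c

  #include <stdatomic.h>

  _Atomic int a;
  volatile int v;

  void g(void) {
    atomic_store(&a, 1); /* SequentiallyConsistent store: may be erased,  */
    atomic_store(&a, 2); /* because the next store overwrites it.         */
    v = 1;               /* volatile stores: both must be performed, in   */
    v = 2;               /* the stated order.                             */
  }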

This document is intended to provide a guide for anyone writing a frontend for
LLVM or working on optimization passes for LLVM on how to deal with
instructions with special semantics in the presence of concurrency. This is not
intended to be a precise guide to the semantics; the details can get extremely
complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section specifically goes into the one optimizer restriction which applies in
concurrent environments; it gets a somewhat extended description because any
optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise. Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.

Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store.
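
In C11 terms, ``cmpxchg`` roughly corresponds to ``atomic_compare_exchange_*``
and ``atomicrmw`` to the ``atomic_fetch_*`` operations. A minimal sketch (the
function names are illustrative only):

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>

  _Atomic int counter;

  /* Becomes an 'atomicrmw add' at the IR level. */
  int increment(void) {
    return atomic_fetch_add(&counter, 1);
  }

  /* Becomes a 'cmpxchg': no other memory operation on any thread can happen
     between its load and its conditional store. */
  bool try_update(int expected, int desired) {
    return atomic_compare_exchange_strong(&counter, &expected, desired);
  }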

A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load, and a Monotonic store following a Release fence is roughly
equivalent to a Release store. SequentiallyConsistent fences behave as both
an Acquire and a Release fence, and offer some additional complicated
guarantees; see the C++11 standard for details.
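
For example, the following two C11 functions are roughly equivalent ways of
performing an Acquire load (the names are illustrative only):

.. code-block:: c

  #include <stdatomic.h>

  _Atomic int flag;

  int acquire_load(void) {
    return atomic_load_explicit(&flag, memory_order_acquire);
  }

  int relaxed_load_then_fence(void) {
    /* A Monotonic (relaxed) load followed by an Acquire fence. */
    int v = atomic_load_explicit(&flag, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return v;
  }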

Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can be
impossible to generate.
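
One way a frontend (or user code) can query this is through builtins such as
``__atomic_always_lock_free``, which clang and GCC provide; the wrapper name
below is illustrative:

.. code-block:: c

  #include <stdbool.h>

  /* True if a naturally-aligned 16-byte atomic access is guaranteed to be
     lock-free on this target; if not, an inline lock-free atomic instruction
     of that width cannot be generated and some other lowering is needed. */
  bool has_lock_free_16(void) {
    return __atomic_always_lock_free(16, 0);
  }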

.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release, which are not ordered with respect to each other. (See also
`LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible. (The
  precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable. Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals, but
  atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not. See
  `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store. This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends are
  also expected to generate an i8 store as an i8 store, and not an instruction
  which writes to surrounding bytes. (If you are writing a backend for an
  architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvm-dev.)

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior. It also
guarantees that the operation is lock-free, so it does not depend on the data
being part of a special atomic structure or depend on a separate per-process
global lock. Note that code generation will fail for unsupported atomic
operations; if you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for loads of a native width, but can be expensive or unavailable for
  wider accesses, like a 64-bit store on ARM. (A frontend for Java or other
  "safe" languages would normally split a 64-bit store on ARM into two 32-bit
  unordered stores.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms a
  single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise. Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored. A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).

Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.
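
A typical use is a statistics counter whose increments do not publish any other
memory (the names here are illustrative):

.. code-block:: c

  #include <stdatomic.h>

  /* Every increment is atomic and all threads agree on a single order of the
     updates to 'hits', but the counter does not order any other memory. */
  _Atomic long hits;

  void record_hit(void) {
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
  }

  long read_hits(void) {
    return atomic_load_explicit(&hits, memory_order_relaxed);
  }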

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution. The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct. Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the relevant
  memory location (and alias analysis will take advantage of that). In addition,
  it is legal to reorder non-atomic and Unordered loads around Monotonic
  loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
  operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as for Unordered loads and
  stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.

Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.
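
The classic use is publishing data with a Release store and consuming it with a
paired Acquire load, as in the following sketch (see also the Release section
below; the names are illustrative only):

.. code-block:: c

  #include <stdatomic.h>

  int payload;        /* plain data, published via 'ready' */
  _Atomic int ready;

  void producer(void) {
    payload = 42;
    /* Release store: pairs with the Acquire load in consumer(). */
    atomic_store_explicit(&ready, 1, memory_order_release);
  }

  int consumer(void) {
    /* Acquire load: if it observes 1, the write to 'payload' is visible. */
    if (atomic_load_explicit(&ready, memory_order_acquire))
      return payload;
    return -1;
  }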

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acquire``. It should also be
  used for C++11/C11 ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant today
  except x86 and SPARC) require some sort of fence to maintain the Acquire
  semantics. The precise fences required vary widely by architecture, but for
  a simple implementation, most architectures provide a barrier which is strong
  enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
  such a fence after the equivalent Monotonic operation is sufficient to
  maintain Acquire semantics for a memory operation.

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.
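
The total order matters for store-load patterns such as this Dekker-style
fragment; with SequentiallyConsistent operations at least one of the two loads
must observe the other thread's store, a guarantee that Acquire/Release alone
does not provide (the names are illustrative):

.. code-block:: c

  #include <stdatomic.h>

  _Atomic int x, y;

  /* Each function runs on its own thread. With seq_cst (the default for
     atomic_store/atomic_load), both functions returning 0 is impossible;
     with only Release stores and Acquire loads it would be allowed. */
  int thread1(void) { atomic_store(&x, 1); return atomic_load(&y); }
  int thread2(void) { atomic_store(&y, 1); return atomic_load(&x); }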

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_seq_cst``, Java volatile, and
  the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release
  barriers. Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
  is usually done by emitting either a full fence before the loads or a full
  fence after the stores; which is preferred varies by architecture.

Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic. This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* ``isStrongerThan`` / ``isAtLeastOrStrongerThan``: These are predicates on
  orderings. They can be useful for passes that are aware of atomics, for
  example to do DSE across a single atomic access, but not across a
  release-acquire pair (see MemoryDependencyAnalysis for an example of this).

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done. If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores. It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop. It just treats
  monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important. It is possible in some cases for DSE to operate across a stronger
  atomic operation, but it is fairly tricky. DSE delegates this reasoning to
  MemoryDependencyAnalysis (which is also used by other passes like GVN).

* Folding a load: Any atomic load from a constant global can be constant-folded,
  because it cannot be observed. Similar reasoning allows SROA with
  atomic loads and stores.

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if
``shouldInsertFencesForAtomic()`` was used.
Bill Wendling | 3bf24bd | 2012-06-29 09:00:01 +0000 | [diff] [blame] | 411 | |
| 412 | The MachineMemOperand for all atomic operations is currently marked as volatile; |
| 413 | this is not correct in the IR sense of volatile, but CodeGen handles anything |
| 414 | marked volatile very conservatively. This should get fixed at some point. |
| 415 | |
James Y Knight | 238d819 | 2016-04-12 20:18:48 +0000 | [diff] [blame] | 416 | One very important property of the atomic operations is that if your backend |
| 417 | supports any inline lock-free atomic operations of a given size, you should |
| 418 | support *ALL* operations of that size in a lock-free manner. |
| 419 | |
| 420 | When the target implements atomic ``cmpxchg`` or LL/SC instructions (as most do) |
| 421 | this is trivial: all the other operations can be implemented on top of those |
| 422 | primitives. However, on many older CPUs (e.g. ARMv5, SparcV8, Intel 80386) there |
| 423 | are atomic load and store instructions, but no ``cmpxchg`` or LL/SC. As it is |
| 424 | invalid to implement ``atomic load`` using the native instruction, but |
| 425 | ``cmpxchg`` using a library call to a function that uses a mutex, ``atomic |
| 426 | load`` must *also* expand to a library call on such architectures, so that it |
| 427 | can remain atomic with regards to a simultaneous ``cmpxchg``, by using the same |
| 428 | mutex. |
| 429 | |
| 430 | AtomicExpandPass can help with that: it will expand all atomic operations to the |
| 431 | proper ``__atomic_*`` libcalls for any size above the maximum set by |
| 432 | ``setMaxAtomicSizeInBitsSupported`` (which defaults to 0). |

On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated. ``cmpxchg`` uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.

On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release,
and SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.).

It is often easiest for backends to use AtomicExpandPass to lower some of the
atomic constructs. Here are some lowerings it can do:

* cmpxchg -> loop with load-linked/store-conditional
  by overriding ``shouldExpandAtomicCmpXchgInIR()``, ``emitLoadLinked()``,
  ``emitStoreConditional()``
* large loads/stores -> ll-sc/cmpxchg
  by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
* strong atomic accesses -> monotonic accesses + fences by overriding
  ``shouldInsertFencesForAtomic()``, ``emitLeadingFence()``, and
  ``emitTrailingFence()``
* atomic rmw -> loop with cmpxchg or load-linked/store-conditional
  by overriding ``expandAtomicRMWInIR()``
* expansion to __atomic_* libcalls for unsupported sizes.
* part-word atomicrmw/cmpxchg -> target-specific intrinsic by overriding
  ``shouldExpandAtomicRMWInIR``, ``emitMaskedAtomicRMWIntrinsic``,
  ``shouldExpandAtomicCmpXchgInIR``, and ``emitMaskedAtomicCmpXchgIntrinsic``.

For an example of these, look at the ARM backend (first five lowerings) or the
RISC-V backend (last lowering).

AtomicExpandPass supports two strategies for lowering atomicrmw/cmpxchg to
load-linked/store-conditional (LL/SC) loops. The first expands the LL/SC loop
in IR, calling target lowering hooks to emit intrinsics for the LL and SC
operations. However, many architectures have strict requirements for LL/SC
loops to ensure forward progress, such as restrictions on the number and type
of instructions in the loop. It isn't possible to enforce these restrictions
when the loop is expanded in LLVM IR, and so affected targets may prefer to
expand to LL/SC loops at a very late stage (i.e. after register allocation).
AtomicExpandPass can help support lowering of part-word atomicrmw or cmpxchg
using this strategy by producing IR for any shifting and masking that can be
performed outside of the LL/SC loop.

Libcalls: __atomic_*
====================

There are two kinds of atomic library calls that are generated by LLVM. Please
note that both sets of library functions somewhat confusingly share the names of
builtin functions defined by clang. Despite this, the library functions are
not directly related to the builtins: it is *not* the case that ``__atomic_*``
builtins lower to ``__atomic_*`` library calls and ``__sync_*`` builtins lower
to ``__sync_*`` library calls.

The first set of library functions are named ``__atomic_*``. This set has been
"standardized" by GCC, and is described below. (See also `GCC's documentation
<https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary>`_)

LLVM's AtomicExpandPass will translate atomic operations on data sizes above
``MaxAtomicSizeInBitsSupported`` into calls to these functions.

There are four generic functions, which can be called with data of any size or
alignment::

  void __atomic_load(size_t size, void *ptr, void *ret, int ordering)
  void __atomic_store(size_t size, void *ptr, void *val, int ordering)
  void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, int ordering)
  bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, void *desired, int success_order, int failure_order)

There are also size-specialized versions of the above functions, which can only
be used with *naturally-aligned* pointers of the appropriate size. In the
signatures below, "N" is one of 1, 2, 4, 8, and 16, and "iN" is the appropriate
integer type of that size; if no such integer type exists, the specialization
cannot be used::

  iN __atomic_load_N(iN *ptr, int ordering)
  void __atomic_store_N(iN *ptr, iN val, int ordering)
  iN __atomic_exchange_N(iN *ptr, iN val, int ordering)
  bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, int success_order, int failure_order)

Finally there are some read-modify-write functions, which are only available in
the size-specific variants (any other sizes use a ``__atomic_compare_exchange``
loop)::

  iN __atomic_fetch_add_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_sub_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_and_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_or_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_xor_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_nand_N(iN *ptr, iN val, int ordering)

This set of library functions has some interesting implementation requirements
to take note of:

- They support all sizes and alignments -- including those which cannot be
  implemented natively on any existing hardware. Therefore, they will certainly
  use mutexes for some sizes/alignments.

- As a consequence, they cannot be shipped in a statically linked
  compiler-support library, as they have state which must be shared amongst all
  DSOs loaded in the program. They must be provided in a shared library used by
  all objects.

- The set of atomic sizes supported lock-free must be a superset of the sizes
  any compiler can emit. That is: if a new compiler introduces support for
  inline-lock-free atomics of size N, the ``__atomic_*`` functions must also have
  a lock-free implementation for size N. This is a requirement so that code
  produced by an old compiler (which will have called the ``__atomic_*``
  function) interoperates with code produced by the new compiler (which will use
  the native atomic instruction).

Note that it's possible to write an entirely target-independent implementation
of these library functions by using the compiler atomic builtins themselves to
implement the operations on naturally-aligned pointers of supported sizes, and a
generic mutex implementation otherwise.
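
A sketch of that approach is shown below. It assumes a single global lock for
the generic fallback, where a production-quality library would more likely hash
the address into a table of locks; the entry points follow the naming
convention above, and everything else is illustrative:

.. code-block:: c

  #include <pthread.h>
  #include <stdint.h>
  #include <string.h>

  /* Size-specialized entry point: for sizes the target supports inline, the
     compiler builtin expands to the native lock-free instruction. */
  uint32_t __atomic_load_4(uint32_t *ptr, int ordering) {
    return __atomic_load_n(ptr, ordering);
  }

  /* Generic entry point: must handle any size and alignment, so it falls
     back to a lock. */
  static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

  void __atomic_load(size_t size, void *ptr, void *ret, int ordering) {
    (void)ordering; /* the mutex already provides sequential consistency */
    pthread_mutex_lock(&atomic_lock);
    memcpy(ret, ptr, size);
    pthread_mutex_unlock(&atomic_lock);
  }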

Libcalls: __sync_*
==================

Some targets or OS/target combinations can support lock-free atomics, but for
various reasons, it is not practical to emit the instructions inline.

There are two typical examples of this.

Some CPUs support multiple instruction sets which can be switched back and forth
on function-call boundaries. For example, MIPS supports the MIPS16 ISA, which
has a smaller instruction encoding than the usual MIPS32 ISA. ARM, similarly,
has the Thumb ISA. In MIPS16 and earlier versions of Thumb, the atomic
instructions are not encodable. However, those instructions are available via a
function call to a function with the longer encoding.

Additionally, a few OS/target pairs provide kernel-supported lock-free
atomics. ARM/Linux is an example of this: the kernel `provides
<https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt>`_ a
function which on older CPUs contains a "magically-restartable" atomic sequence
(which looks atomic so long as there's only one CPU), and contains actual atomic
instructions on newer multicore models. This sort of functionality can typically
be provided on any architecture, if all CPUs which are missing atomic
compare-and-swap support are uniprocessor (no SMP). This is almost always the
case. The only common architecture without that property is SPARC -- SPARCV8 SMP
systems were common, yet it doesn't support any sort of compare-and-swap
operation.

In either of these cases, the Target in LLVM can claim support for atomics of an
appropriate size, and then implement some subset of the operations via libcalls
to a ``__sync_*`` function. Such functions *must* not use locks in their
implementation, because unlike the ``__atomic_*`` routines used by
AtomicExpandPass, these may be mixed-and-matched with native instructions by the
target lowering.

Further, these routines do not need to be shared, as they are stateless. So,
there is no issue with having multiple copies included in one binary. Thus,
typically these routines are implemented by the statically-linked compiler
runtime support library.

LLVM will emit a call to an appropriate ``__sync_*`` routine if the target
ISelLowering code has set the corresponding ``ATOMIC_CMPXCHG``, ``ATOMIC_SWAP``,
or ``ATOMIC_LOAD_*`` operation to "Expand", and if it has opted into the
availability of those library functions via a call to ``initSyncLibcalls()``.

The full set of functions that may be called by LLVM is (for ``N`` being 1, 2,
4, 8, or 16)::

  iN __sync_val_compare_and_swap_N(iN *ptr, iN expected, iN desired)
  iN __sync_lock_test_and_set_N(iN *ptr, iN val)
  iN __sync_fetch_and_add_N(iN *ptr, iN val)
  iN __sync_fetch_and_sub_N(iN *ptr, iN val)
  iN __sync_fetch_and_and_N(iN *ptr, iN val)
  iN __sync_fetch_and_or_N(iN *ptr, iN val)
  iN __sync_fetch_and_xor_N(iN *ptr, iN val)
  iN __sync_fetch_and_nand_N(iN *ptr, iN val)
  iN __sync_fetch_and_max_N(iN *ptr, iN val)
  iN __sync_fetch_and_umax_N(iN *ptr, iN val)
  iN __sync_fetch_and_min_N(iN *ptr, iN val)
  iN __sync_fetch_and_umin_N(iN *ptr, iN val)

This list doesn't include any function for atomic load or store; all known
architectures support atomic loads and stores directly (possibly by emitting a
fence on either side of a normal load or store.)

There's also, somewhat separately, the possibility to lower ``ATOMIC_FENCE`` to
``__sync_synchronize()``. This may happen or not happen independent of all the
above, controlled purely by ``setOperationAction(ISD::ATOMIC_FENCE, ...)``.