Blame - docs/AMDGPUUsage.rst - platform_external_llvm - Gitiles

blob: c60d60af5a82cf1204fb9854816f2762ff8c2ecc [file] [log] [blame]

Eugene Zelenko	e9a0122	2018-03-21 17:09:35 +0000	[diff] [blame]	1	=============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
Tony Tye	2b4b7fe	2018-03-23 18:45:18 +0000	[diff] [blame]	67	<empty> Default.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	68	============ ==============================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	69
				70	.. _amdgpu-processors:
				71
				72	Processors
				73	----------
				74
				75	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				76	names from both the Processor and Alternative Processor can be used.
				77
				78	.. table:: AMDGPU Processors
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	79	:name: amdgpu-processor-table
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	80
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	81	=========== =============== ============ ===== ========== ======= ======================
				82	Processor Alternative Target dGPU/ Target ROCm Example
				83	Processor Triple APU Features Support Products
Tony Tye	c865d8a	2017-12-11 15:35:27 +0000	[diff] [blame]	84	Architecture Supported
				85	[Default]
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	86	=========== =============== ============ ===== ========== ======= ======================
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	87	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	88	----------------------------------------------------------------------------------------
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	89	``r600`` ``r600`` dGPU
				90	``r630`` ``r600`` dGPU
				91	``rs880`` ``r600`` dGPU
				92	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	94	----------------------------------------------------------------------------------------
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``rv710`` ``r600`` dGPU
				96	``rv730`` ``r600`` dGPU
				97	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	99	----------------------------------------------------------------------------------------
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	100	``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov	1629089	2018-02-16 22:33:59 +0000	[diff] [blame]	101	``cypress`` ``r600`` dGPU
				102	``juniper`` ``r600`` dGPU
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	103	``redwood`` ``r600`` dGPU
				104	``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	105	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	106	----------------------------------------------------------------------------------------
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	107	``barts`` ``r600`` dGPU
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	108	``caicos`` ``r600`` dGPU
				109	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	1629089	2018-02-16 22:33:59 +0000	[diff] [blame]	110	``turks`` ``r600`` dGPU
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	111	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	112	----------------------------------------------------------------------------------------
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov	1629089	2018-02-16 22:33:59 +0000	[diff] [blame]	114	``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	115	- ``oland``
Konstantin Zhuravlyov	1629089	2018-02-16 22:33:59 +0000	[diff] [blame]	116	- ``pitcairn``
				117	- ``verde``
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	118	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	119	----------------------------------------------------------------------------------------
				120	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				121	- A6 Pro-7050B
				122	- A8-7100
				123	- A8 Pro-7150B
				124	- A10-7300
				125	- A10 Pro-7350B
				126	- FX-7500
				127	- A8-7200P
				128	- A10-7400P
				129	- FX-7600P
				130	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				131	- FirePro W9100
				132	- FirePro S9150
				133	- FirePro S9170
				134	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				135	- Radeon R9 290x
				136	- Radeon R390
				137	- Radeon R390x
				138	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				139	- ``mullins`` - E1-2200
				140	- E1-2500
				141	- E2-3000
				142	- E2-3800
				143	- A4-5000
				144	- A4-5100
				145	- A6-5200
				146	- A4 Pro-3340B
				147	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				148	- Radeon HD 8770
				149	- R7 260
				150	- R7 260X
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	151	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	152	----------------------------------------------------------------------------------------
				153	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				154	[on] - Pro A6-8500B
				155	- A8-8600P
				156	- Pro A8-8600B
				157	- FX-8800P
				158	- Pro A12-8800B
				159	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				160	[on] - Pro A10-8700B
				161	- A10-8780P
				162	\ ``amdgcn`` APU - xnack - A10-9600P
				163	[on] - A10-9630P
				164	- A12-9700P
				165	- A12-9730P
				166	- FX-9800P
				167	- FX-9830P
				168	\ ``amdgcn`` APU - xnack - E2-9010
				169	[on] - A6-9210
				170	- A9-9410
				171	``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
				172	- ``tonga`` [off] - FirePro S7100
				173	- FirePro W7100
				174	- Radeon R285
				175	- Radeon R9 380
				176	- Radeon R9 385
				177	- Mobile FirePro
				178	M7170
				179	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				180	[off] - Radeon R9 Fury
				181	- Radeon R9 FuryX
				182	- Radeon Pro Duo
				183	- FirePro S9300x2
				184	- Radeon Instinct MI8
				185	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				186	[off] - Radeon RX 480
				187	- Radeon Instinct MI6
				188	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
Tony Tye	c865d8a	2017-12-11 15:35:27 +0000	[diff] [blame]	189	[off]
				190	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				191	[on]
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	192	GCN GFX9 [AMD-GCN-GFX9]_
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	193	----------------------------------------------------------------------------------------
				194	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				195	[off] Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
				201	``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
				202	[on] - Ryzen 5 2400G
				203	``gfx904`` ``amdgcn`` dGPU - xnack TBA
Matt Arsenault	ac9b3ef	2018-04-30 19:08:16 +0000	[diff] [blame]	204	[off]
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	205	.. TODO
				206	Add product
				207	names.
Konstantin Zhuravlyov	ba9a1e1	2018-11-07 20:54:16 +0000	[diff] [blame]	208	``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50
				209	[off] - Radeon Instinct MI60
Konstantin Zhuravlyov	7829a6d	2018-11-05 22:44:19 +0000	[diff] [blame]	210	sram-ecc
				211	[on]
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	212	``gfx909`` ``amdgcn`` APU - xnack TBA (Raven Ridge 2)
Tim Renouf	7d5e5c2	2018-10-24 08:14:07 +0000	[diff] [blame]	213	[on]
Konstantin Zhuravlyov	ffb90c4	2018-11-06 20:23:53 +0000	[diff] [blame]	214	.. TODO
				215	Add product
				216	names.
				217	=========== =============== ============ ===== ========== ======= ======================
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	218
				219	.. _amdgpu-target-features:
				220
				221	Target Features
				222	---------------
				223
				224	Target features control how code is generated to support certain
Tony Tye	c865d8a	2017-12-11 15:35:27 +0000	[diff] [blame]	225	processor specific features. Not all target features are supported by
				226	all processors. The runtime must ensure that the features supported by
				227	the device used to execute the code match the features enabled when
				228	generating the code. A mismatch of features may result in incorrect
				229	execution, or a reduction in performance.
				230
				231	The target features supported by each processor, and the default value
				232	used if not specified explicitly, is listed in
				233	:ref:`amdgpu-processor-table`.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	234
				235	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				236	target features.
				237
				238	For example:
				239
				240	``-mxnack``
Tony Tye	c865d8a	2017-12-11 15:35:27 +0000	[diff] [blame]	241	Enable the ``xnack`` feature.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	242	``-mno-xnack``
Tony Tye	c865d8a	2017-12-11 15:35:27 +0000	[diff] [blame]	243	Disable the ``xnack`` feature.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	244
				245	.. table:: AMDGPU Target Features
				246	:name: amdgpu-target-feature-table
				247
Konstantin Zhuravlyov	7829a6d	2018-11-05 22:44:19 +0000	[diff] [blame]	248	=============== ==================================================
				249	Target Feature Description
				250	=============== ==================================================
				251	-m[no-]xnack Enable/disable generating code that has
				252	memory clauses that are compatible with
				253	having XNACK replay enabled.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	254
Konstantin Zhuravlyov	7829a6d	2018-11-05 22:44:19 +0000	[diff] [blame]	255	This is used for demand paging and page
				256	migration. If XNACK replay is enabled in
				257	the device, then if a page fault occurs
				258	the code may execute incorrectly if the
				259	``xnack`` feature is not enabled. Executing
				260	code that has the feature enabled on a
				261	device that does not have XNACK replay
				262	enabled will execute correctly, but may
				263	be less performant than code with the
				264	feature disabled.
				265	-m[no-]sram-ecc Enable/disable generating code that assumes SRAM
				266	ECC is enabled/disabled.
				267	=============== ==================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	268
				269	.. _amdgpu-address-spaces:
Tom Stellard	43f537f	2016-04-06 01:29:19 +0000	[diff] [blame]	270
				271	Address Spaces
				272	--------------
				273
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	274	The AMDGPU backend uses the following address space mappings.
Tom Stellard	43f537f	2016-04-06 01:29:19 +0000	[diff] [blame]	275
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	276	The memory space names used in the table, aside from the region memory space, is
				277	from the OpenCL standard.
Tom Stellard	43f537f	2016-04-06 01:29:19 +0000	[diff] [blame]	278
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	279	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	43f537f	2016-04-06 01:29:19 +0000	[diff] [blame]	280
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	281	.. table:: Address Space Mapping
				282	:name: amdgpu-address-space-mapping-table
				283
Yaxun Liu	2930e5c	2018-02-13 18:00:25 +0000	[diff] [blame]	284	================== =================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	285	LLVM Address Space Memory Space
Yaxun Liu	2930e5c	2018-02-13 18:00:25 +0000	[diff] [blame]	286	================== =================
				287	0 Generic (Flat)
				288	1 Global
				289	2 Region (GDS)
				290	3 Local (group/LDS)
				291	4 Constant
				292	5 Private (Scratch)
				293	6 Constant 32-bit
				294	================== =================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	295
				296	.. _amdgpu-memory-scopes:
				297
				298	Memory Scopes
				299	-------------
				300
				301	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				302	backend memory model when the target triple OS is ``amdhsa`` (see
				303	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				304
				305	The memory model supported is based on the HSA memory model [HSA]_ which is
				306	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				307	relation is transitive over the synchonizes-with relation independent of scope,
				308	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	309	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	310
				311	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				312	inclusion and requires the memory scopes to exactly match. However, this
				313	is conservatively correct for OpenCL.
				314
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	315	.. table:: AMDHSA LLVM Sync Scopes
				316	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	317
				318	================ ==========================================================
				319	LLVM Sync Scope Description
				320	================ ==========================================================
				321	none The default: ``system``.
				322
				323	Synchronizes with, and participates in modification and
				324	seq_cst total orderings with, other operations (except
				325	image operations) for all address spaces (except private,
				326	or generic that accesses private) provided the other
				327	operation's sync scope is:
				328
				329	- ``system``.
				330	- ``agent`` and executed by a thread on the same agent.
				331	- ``workgroup`` and executed by a thread in the same
				332	workgroup.
				333	- ``wavefront`` and executed by a thread in the same
				334	wavefront.
				335
				336	``agent`` Synchronizes with, and participates in modification and
				337	seq_cst total orderings with, other operations (except
				338	image operations) for all address spaces (except private,
				339	or generic that accesses private) provided the other
				340	operation's sync scope is:
				341
				342	- ``system`` or ``agent`` and executed by a thread on the
				343	same agent.
				344	- ``workgroup`` and executed by a thread in the same
				345	workgroup.
				346	- ``wavefront`` and executed by a thread in the same
				347	wavefront.
				348
				349	``workgroup`` Synchronizes with, and participates in modification and
				350	seq_cst total orderings with, other operations (except
				351	image operations) for all address spaces (except private,
				352	or generic that accesses private) provided the other
				353	operation's sync scope is:
				354
				355	- ``system``, ``agent`` or ``workgroup`` and executed by a
				356	thread in the same workgroup.
				357	- ``wavefront`` and executed by a thread in the same
				358	wavefront.
				359
				360	``wavefront`` Synchronizes with, and participates in modification and
				361	seq_cst total orderings with, other operations (except
				362	image operations) for all address spaces (except private,
				363	or generic that accesses private) provided the other
				364	operation's sync scope is:
				365
				366	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				367	and executed by a thread in the same wavefront.
				368
				369	``singlethread`` Only synchronizes with, and participates in modification
				370	and seq_cst total orderings with, other operations (except
				371	image operations) running in the same thread for all
				372	address spaces (for example, in signal handlers).
				373	================ ==========================================================
				374
				375	AMDGPU Intrinsics
				376	-----------------
				377
Tony Tye	978dec7	2018-06-14 16:40:10 +0000	[diff] [blame]	378	The AMDGPU backend implements the following LLVM IR intrinsics.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	379
				380	This section is WIP.
				381
				382	.. TODO
				383	List AMDGPU intrinsics
				384
Tony Tye	978dec7	2018-06-14 16:40:10 +0000	[diff] [blame]	385	AMDGPU Attributes
				386	-----------------
				387
				388	The AMDGPU backend supports the following LLVM IR attributes.
				389
				390	.. table:: AMDGPU LLVM IR Attributes
				391	:name: amdgpu-llvm-ir-attributes-table
				392
				393	======================================= ==========================================================
				394	LLVM Attribute Description
				395	======================================= ==========================================================
				396	"amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
				397	will be specified when the kernel is dispatched. Generated
				398	by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
				399	"amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
				400	argument block size for the implicit arguments. This
				401	varies by OS and language (for OpenCL see
				402	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
				403	"amdgpu-max-work-group-size"="n" Specify the maximum work-group size that will be specifed
				404	when the kernel is dispatched.
				405	"amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
				406	the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
				407	"amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
				408	``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
				409	"amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
				410	execution unit. Generated by the ``amdgpu_waves_per_eu``
				411	CLANG attribute [CLANG-ATTR]_.
				412	======================================= ==========================================================
				413
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	414	Code Object
				415	===========
				416
				417	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				418	can be linked by ``lld`` to produce a standard ELF shared code object which can
				419	be loaded and executed on an AMDGPU target.
				420
				421	Header
				422	------
				423
				424	The AMDGPU backend uses the following ELF header:
				425
				426	.. table:: AMDGPU ELF Header
				427	:name: amdgpu-elf-header-table
				428
Konstantin Zhuravlyov	7624630	2017-10-03 20:54:07 +0000	[diff] [blame]	429	========================== ===============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	430	Field Value
Konstantin Zhuravlyov	7624630	2017-10-03 20:54:07 +0000	[diff] [blame]	431	========================== ===============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	432	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				433	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	434	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				435	- ``ELFOSABI_AMDGPU_HSA``
				436	- ``ELFOSABI_AMDGPU_PAL``
				437	- ``ELFOSABI_AMDGPU_MESA3D``
				438	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				439	- ``ELFABIVERSION_AMDGPU_PAL``
				440	- ``ELFABIVERSION_AMDGPU_MESA3D``
				441	``e_type`` - ``ET_REL``
				442	- ``ET_DYN``
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	443	``e_machine`` ``EM_AMDGPU``
				444	``e_entry`` 0
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	445	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	7624630	2017-10-03 20:54:07 +0000	[diff] [blame]	446	========================== ===============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	447
				448	..
				449
				450	.. table:: AMDGPU ELF Header Enumeration Values
				451	:name: amdgpu-elf-header-enumeration-values-table
				452
Konstantin Zhuravlyov	4b145dd	2017-10-03 21:14:14 +0000	[diff] [blame]	453	=============================== =====
				454	Name Value
				455	=============================== =====
				456	``EM_AMDGPU`` 224
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	457	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	4b145dd	2017-10-03 21:14:14 +0000	[diff] [blame]	458	``ELFOSABI_AMDGPU_HSA`` 64
				459	``ELFOSABI_AMDGPU_PAL`` 65
				460	``ELFOSABI_AMDGPU_MESA3D`` 66
				461	``ELFABIVERSION_AMDGPU_HSA`` 1
				462	``ELFABIVERSION_AMDGPU_PAL`` 0
				463	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				464	=============================== =====
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	465
				466	``e_ident[EI_CLASS]``
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	467	The ELF class is:
				468
				469	* ``ELFCLASS32`` for ``r600`` architecture.
				470
				471	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				472	bit applications.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	473
				474	``e_ident[EI_DATA]``
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	475	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	476
				477	``e_ident[EI_OSABI]``
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	478	One of the following AMD GPU architecture specific OS ABIs
				479	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	7624630	2017-10-03 20:54:07 +0000	[diff] [blame]	480
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	481	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	7624630	2017-10-03 20:54:07 +0000	[diff] [blame]	482
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	483	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	484
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	485	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				486
				487	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	4b145dd	2017-10-03 21:14:14 +0000	[diff] [blame]	488
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	489	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	7624630	2017-10-03 20:54:07 +0000	[diff] [blame]	490	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				491	object conforms:
				492
				493	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				494	runtime ABI.
				495
				496	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				497	runtime ABI.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	498
Konstantin Zhuravlyov	4b145dd	2017-10-03 21:14:14 +0000	[diff] [blame]	499	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	500	3D runtime ABI.
Konstantin Zhuravlyov	4b145dd	2017-10-03 21:14:14 +0000	[diff] [blame]	501
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	502	``e_type``
				503	Can be one of the following values:
				504
				505
				506	``ET_REL``
				507	The type produced by the AMD GPU backend compiler as it is relocatable code
				508	object.
				509
				510	``ET_DYN``
				511	The type produced by the linker as it is a shared code object.
				512
				513	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				514
				515	``e_machine``
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	516	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				517	by the ``r600`` and ``amdgcn`` architectures (see
				518	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				519	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				520	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	521
				522	``e_entry``
				523	The entry point is 0 as the entry points for individual kernels must be
				524	selected in order to invoke them through AQL packets.
				525
				526	``e_flags``
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	527	The AMDGPU backend uses the following ELF header flags:
				528
				529	.. table:: AMDGPU ELF Header ``e_flags``
				530	:name: amdgpu-elf-header-e_flags-table
				531
				532	================================= ========== =============================
				533	Name Value Description
				534	================================= ========== =============================
				535	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				536	-------------------------------------------- -----------------------------
				537	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				538	mask for
				539	``EF_AMDGPU_MACH_xxx`` values
				540	defined in
				541	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	c865d8a	2017-12-11 15:35:27 +0000	[diff] [blame]	542	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				543	target feature is
				544	enabled for all code
				545	contained in the code object.
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	546	If the processor
				547	does not support the
				548	``xnack`` target
				549	feature then must
				550	be 0.
Tony Tye	c865d8a	2017-12-11 15:35:27 +0000	[diff] [blame]	551	See
				552	:ref:`amdgpu-target-features`.
Konstantin Zhuravlyov	7829a6d	2018-11-05 22:44:19 +0000	[diff] [blame]	553	``EF_AMDGPU_SRAM_ECC`` 0x00000200 Indicates if the ``sram-ecc``
				554	target feature is
				555	enabled for all code
				556	contained in the code object.
				557	If the processor
				558	does not support the
				559	``sram-ecc`` target
				560	feature then must
				561	be 0.
				562	See
				563	:ref:`amdgpu-target-features`.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	564	================================= ========== =============================
				565
				566	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				567	:name: amdgpu-ef-amdgpu-mach-table
				568
				569	================================= ========== =============================
				570	Name Value Description (see
				571	:ref:`amdgpu-processor-table`)
				572	================================= ========== =============================
Konstantin Zhuravlyov	1629089	2018-02-16 22:33:59 +0000	[diff] [blame]	573	``EF_AMDGPU_MACH_NONE`` 0x000 not specified
				574	``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
				575	``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
				576	``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
				577	``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
				578	``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
				579	``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
				580	``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
				581	``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
				582	``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
				583	``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
				584	``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
				585	``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
				586	``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
				587	``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
				588	``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
				589	``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
				590	reserved 0x011 - Reserved for ``r600``
				591	0x01f architecture processors.
				592	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
				593	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
				594	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
				595	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
				596	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
				597	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
				598	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
				599	reserved 0x027 Reserved.
				600	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
				601	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
				602	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
				603	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
				604	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
				605	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenault	ac9b3ef	2018-04-30 19:08:16 +0000	[diff] [blame]	606	``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
				607	``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov	1629089	2018-02-16 22:33:59 +0000	[diff] [blame]	608	reserved 0x030 Reserved.
Tim Renouf	7d5e5c2	2018-10-24 08:14:07 +0000	[diff] [blame]	609	``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	610	================================= ========== =============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	611
				612	Sections
				613	--------
				614
				615	An AMDGPU target ELF code object has the standard ELF sections which include:
				616
				617	.. table:: AMDGPU ELF Sections
				618	:name: amdgpu-elf-sections-table
				619
				620	================== ================ =================================
				621	Name Type Attributes
				622	================== ================ =================================
				623	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				624	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				625	``.debug_``\ \* ``SHT_PROGBITS`` none
				626	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				627	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				628	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				629	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				630	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				631	``.note`` ``SHT_NOTE`` none
				632	``.rela``\ name ``SHT_RELA`` none
				633	``.rela.dyn`` ``SHT_RELA`` none
				634	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				635	``.shstrtab`` ``SHT_STRTAB`` none
				636	``.strtab`` ``SHT_STRTAB`` none
				637	``.symtab`` ``SHT_SYMTAB`` none
				638	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				639	================== ================ =================================
				640
				641	These sections have their standard meanings (see [ELF]_) and are only generated
				642	if needed.
				643
				644	``.debug``\ \*
				645	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				646	DWARF produced by the AMDGPU backend.
				647
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	648	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	649	The standard sections used by a dynamic loader.
				650
				651	``.note``
				652	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				653	backend.
				654
				655	``.rela``\ name, ``.rela.dyn``
				656	For relocatable code objects, name is the name of the section that the
				657	relocation records apply. For example, ``.rela.text`` is the section name for
				658	relocation records associated with the ``.text`` section.
				659
				660	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				661	records from each of the relocatable code object's ``.rela``\ name sections.
				662
				663	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				664	the AMDGPU backend.
				665
				666	``.text``
				667	The executable machine code for the kernels and functions they call. Generated
				668	as position independent code. See :ref:`amdgpu-code-conventions` for
				669	information on conventions used in the isa generation.
				670
				671	.. _amdgpu-note-records:
				672
				673	Note Records
				674	------------
				675
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	676	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				677	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				678	aligned. In addition, minimal zero byte padding must be generated to ensure the
				679	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				680	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	681
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	682	.. _amdgpu-note-records-v2:
				683
				684	Code Object V2 Note Records (-mattr=-code-object-v3)
				685	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				686
				687	The AMDGPU backend code object uses the following ELF note record in the
				688	``.note`` section.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	689
				690	Additional note records can be present.
				691
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	692	.. table:: AMDGPU Code Object V2 ELF Note Records
				693	:name: amdgpu-elf-note-records-table-v2
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	694
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	695	===== ============================== ======================================
				696	Name Type Description
				697	===== ============================== ======================================
				698	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	699	===== ============================== ======================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	700
				701	..
				702
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	703	.. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
				704	:name: amdgpu-elf-note-record-enumeration-values-table-v2
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	705
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	706	============================== =====
				707	Name Value
				708	============================== =====
				709	reserved 0-9
				710	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	711	reserved 11
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	712	============================== =====
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	713
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	714	``NT_AMD_AMDGPU_HSA_METADATA``
				715	Specifies extensible metadata associated with the code objects executed on HSA
				716	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				717	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	718	:ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	719	object metadata string.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	720
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	721	.. _amdgpu-note-records-v3:
				722
				723	Code Object V3 Note Records (-mattr=+code-object-v3)
				724	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				725
				726	The AMDGPU backend code object uses the following ELF note record in the
				727	``.note`` section.
				728
				729	Additional note records can be present.
				730
				731	.. table:: AMDGPU Code Object V3 ELF Note Records
				732	:name: amdgpu-elf-note-records-table-v3
				733
				734	======== ============================== ======================================
				735	Name Type Description
				736	======== ============================== ======================================
				737	"AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
				738	binary format.
				739	======== ============================== ======================================
				740
				741	..
				742
				743	.. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
				744	:name: amdgpu-elf-note-record-enumeration-values-table-v3
				745
				746	============================== =====
				747	Name Value
				748	============================== =====
				749	reserved 0-31
				750	``NT_AMDGPU_METADATA`` 32
				751	============================== =====
				752
				753	``NT_AMDGPU_METADATA``
				754	Specifies extensible metadata associated with an AMDGPU code
				755	object. It is encoded as a map in the Message Pack [MsgPack]_ binary
				756	data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
				757	map keys defined for the ``amdhsa`` OS.
				758
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	759	.. _amdgpu-symbols:
				760
				761	Symbols
				762	-------
				763
				764	Symbols include the following:
				765
				766	.. table:: AMDGPU ELF Symbols
				767	:name: amdgpu-elf-symbols-table
				768
				769	===================== ============== ============= ==================
				770	Name Type Section Description
				771	===================== ============== ============= ==================
				772	link-name ``STT_OBJECT`` - ``.data`` Global variable
				773	- ``.rodata``
				774	- ``.bss``
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	775	link-name\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	776	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				777	===================== ============== ============= ==================
				778
				779	Global variable
				780	Global variables both used and defined by the compilation unit.
				781
				782	If the symbol is defined in the compilation unit then it is allocated in the
				783	appropriate section according to if it has initialized data or is readonly.
				784
				785	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				786	will resolve relocations using the definition provided by another code object
				787	or explicitly defined by the runtime.
				788
				789	All global symbols, whether defined in the compilation unit or external, are
				790	accessed by the machine code indirectly through a GOT table entry. This
				791	allows them to be preemptable. The GOT table is only supported when the target
				792	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	793
				794	.. TODO
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	795	Add description of linked shared object symbols. Seems undefined symbols
				796	are marked as STT_NOTYPE.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	797
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	798	Kernel descriptor
				799	Every HSA kernel has an associated kernel descriptor. It is the address of the
				800	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				801	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				802	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				803
				804	Kernel entry point
				805	Every HSA kernel also has a symbol for its machine code entry point.
				806
				807	.. _amdgpu-relocation-records:
				808
				809	Relocation Records
				810	------------------
				811
				812	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				813	relocatable fields are:
				814
				815	``word32``
				816	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				817	alignment. These values use the same byte order as other word values in the
				818	AMD GPU architecture.
				819
				820	``word64``
				821	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				822	alignment. These values use the same byte order as other word values in the
				823	AMD GPU architecture.
				824
				825	Following notations are used for specifying relocation calculations:
				826
				827	A
				828	Represents the addend used to compute the value of the relocatable field.
				829
				830	G
				831	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	cb5868c	2017-10-19 17:12:55 +0000	[diff] [blame]	832	entry's symbol will reside during execution.
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	833
				834	GOT
				835	Represents the address of the global offset table.
				836
				837	P
				838	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				839	of the storage unit being relocated (computed using ``r_offset``).
				840
				841	S
				842	Represents the value of the symbol whose index resides in the relocation
Tony Tye	300ec0a	2017-10-16 20:44:29 +0000	[diff] [blame]	843	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				844
				845	B
				846	Represents the base address of a loaded executable or shared object which is
				847	the difference between the ELF address and the actual load address. Relocations
				848	using this are only valid in executable or shared objects.
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	849
				850	The following relocation types are supported:
				851
				852	.. table:: AMDGPU ELF Relocation Records
				853	:name: amdgpu-elf-relocation-records-table
				854
Tony Tye	90018a5	2018-01-30 23:59:43 +0000	[diff] [blame]	855	========================== ======= ===== ========== ==============================
				856	Relocation Type Kind Value Field Calculation
				857	========================== ======= ===== ========== ==============================
				858	``R_AMDGPU_NONE`` 0 none none
Tony Tye	f0a27cc	2018-04-13 01:01:27 +0000	[diff] [blame]	859	``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
				860	Dynamic
				861	``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
				862	Dynamic
				863	``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenault	ac9b3ef	2018-04-30 19:08:16 +0000	[diff] [blame]	864	Dynamic
Tony Tye	90018a5	2018-01-30 23:59:43 +0000	[diff] [blame]	865	``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
				866	``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tye	f0a27cc	2018-04-13 01:01:27 +0000	[diff] [blame]	867	``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
				868	Dynamic
Tony Tye	90018a5	2018-01-30 23:59:43 +0000	[diff] [blame]	869	``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
				870	``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				871	``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
				872	``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				873	``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
				874	reserved 12
				875	``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
				876	========================== ======= ===== ========== ==============================
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	877
Tony Tye	f0a27cc	2018-04-13 01:01:27 +0000	[diff] [blame]	878	``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
				879	the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
				880
				881	There is no current OS loader support for 32 bit programs and so
				882	``R_AMDGPU_ABS32`` is not used.
Matt Arsenault	ac9b3ef	2018-04-30 19:08:16 +0000	[diff] [blame]	883
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	884	.. _amdgpu-dwarf:
				885
				886	DWARF
				887	-----
				888
Scott Linder	5e4b515	2018-02-23 23:01:06 +0000	[diff] [blame]	889	Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	890	information that maps the code object executable code and data to the source
				891	language constructs. It can be used by tools such as debuggers and profilers.
				892
				893	Address Space Mapping
				894	~~~~~~~~~~~~~~~~~~~~~
				895
				896	The following address space mapping is used:
				897
				898	.. table:: AMDGPU DWARF Address Space Mapping
				899	:name: amdgpu-dwarf-address-space-mapping-table
				900
				901	=================== =================
				902	DWARF Address Space Memory Space
				903	=================== =================
				904	1 Private (Scratch)
				905	2 Local (group/LDS)
				906	omitted Global
				907	omitted Constant
				908	omitted Generic (Flat)
				909	not supported Region (GDS)
				910	=================== =================
				911
				912	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				913	used in the table.
				914
				915	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				916	DWARF address space of the value of the pointer when it is in the private or
				917	local address space. Otherwise the attribute is omitted.
				918
				919	An ``XDEREF`` operation is generated in location list expressions for variables
				920	that are allocated in the private and local address space. Otherwise no
				921	``XDREF`` is omitted.
				922
				923	Register Mapping
				924	~~~~~~~~~~~~~~~~
				925
				926	This section is WIP.
				927
				928	.. TODO
				929	Define DWARF register enumeration.
				930
				931	If want to present a wavefront state then should expose vector registers as
				932	64 wide (rather than per work-item view that LLVM uses). Either as separate
				933	registers, or a 64x4 byte single register. In either case use a new LANE op
				934	(akin to XDREF) to select the current lane usage in a location
				935	expression. This would also allow scalar register spilling to vector register
				936	lanes to be expressed (currently no debug information is being generated for
				937	spilling). If choose a wide single register approach then use LANE in
				938	conjunction with PIECE operation to select the dword part of the register for
				939	the current lane. If the separate register approach then use LANE to select
				940	the register.
				941
				942	Source Text
				943	~~~~~~~~~~~
				944
Scott Linder	5e4b515	2018-02-23 23:01:06 +0000	[diff] [blame]	945	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				946	runtime) may be embedded into the DWARF v5 line table using the ``clang
				947	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	948
Scott Linder	5e4b515	2018-02-23 23:01:06 +0000	[diff] [blame]	949	For example:
				950
				951	``-gembed-source``
				952	Enable the embedded source DWARF v5 extension.
				953	``-gno-embed-source``
				954	Disable the embedded source DWARF v5 extension.
				955
				956	.. table:: AMDGPU Debug Options
				957	:name: amdgpu-debug-options
				958
				959	==================== ==================================================
				960	Debug Flag Description
				961	==================== ==================================================
				962	-g[no-]embed-source Enable/disable embedding source text in DWARF
				963	debug sections. Useful for environments where
				964	source cannot be written to disk, such as
				965	when performing online compilation.
				966	==================== ==================================================
				967
				968	This option enables one extended content types in the DWARF v5 Line Number
				969	Program Header, which is used to encode embedded source.
				970
				971	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
				972	:name: amdgpu-dwarf-extended-content-types
				973
				974	============================ ======================
				975	Content Type Form
				976	============================ ======================
				977	``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
				978	============================ ======================
				979
				980	The source field will contain the UTF-8 encoded, null-terminated source text
				981	with ``'\n'`` line endings. When the source field is present, consumers can use
				982	the embedded source instead of attempting to discover the source on disk. When
				983	the source field is absent, consumers can access the file to get the source
				984	text.
				985
				986	The above content type appears in the ``file_name_entry_format`` field of the
				987	line table prologue, and its corresponding value appear in the ``file_names``
				988	field. The current encoding of the content type is documented in table
				989	:ref:`amdgpu-dwarf-extended-content-types-encoding`
				990
				991	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
				992	:name: amdgpu-dwarf-extended-content-types-encoding
				993
				994	============================ ====================
				995	Content Type Value
				996	============================ ====================
				997	``DW_LNCT_LLVM_source`` 0x2001
				998	============================ ====================
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	999
				1000	.. _amdgpu-code-conventions:
				1001
				1002	Code Conventions
				1003	================
				1004
				1005	This section provides code conventions used for each supported target triple OS
				1006	(see :ref:`amdgpu-target-triples`).
				1007
				1008	AMDHSA
				1009	------
				1010
				1011	This section provides code conventions used when the target triple OS is
				1012	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				1013
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	1014	.. _amdgpu-amdhsa-code-object-target-identification:
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1015
Tony Tye	d31305a	2018-03-27 21:20:46 +0000	[diff] [blame]	1016	Code Object Target Identification
				1017	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1018
				1019	The AMDHSA OS uses the following syntax to specify the code object
				1020	target as a single string:
				1021
				1022	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				1023
				1024	Where:
				1025
				1026	- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
				1027	are the same as the Target Triple (see
				1028	:ref:`amdgpu-target-triples`).
				1029
				1030	- ``<Processor>`` is the same as the Processor (see
				1031	:ref:`amdgpu-processors`).
				1032
				1033	- ``<Target Features>`` is a list of the enabled Target Features
				1034	(see :ref:`amdgpu-target-features`), each prefixed by a plus, that
				1035	apply to Processor. The list must be in the same order as listed
				1036	in the table :ref:`amdgpu-target-feature-table`. Note that *Target
				1037	Features* must be included in the list if they are enabled even if
				1038	that is the default for Processor.
				1039
				1040	For example:
				1041
				1042	``"amdgcn-amd-amdhsa--gfx902+xnack"``
				1043
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	1044	.. _amdgpu-amdhsa-code-object-metadata:
				1045
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1046	Code Object Metadata
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	1047	~~~~~~~~~~~~~~~~~~~~
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1048
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	1049	The code object metadata specifies extensible metadata associated with the code
				1050	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1051	[AMD-ROCm]_. It is specified in a note record (see :ref:`amdgpu-note-records`)
				1052	and is required when the target triple OS is ``amdhsa`` (see
				1053	:ref:`amdgpu-target-triples`). It must contain the minimum information
				1054	necessary to support the ROCM kernel queries. For example, the segment sizes
				1055	needed in a dispatch packet. In addition, a high level language runtime may
				1056	require other information to be included. For example, the AMD OpenCL runtime
				1057	records kernel argument information.
				1058
				1059	.. _amdgpu-amdhsa-code-object-metadata-v2:
				1060
				1061	Code Object V2 Metadata (-mattr=-code-object-v3)
				1062	++++++++++++++++++++++++++++++++++++++++++++++++
				1063
				1064	Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
				1065	record (see :ref:`amdgpu-note-records-v2`).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1066
Sylvestre Ledru	e9d67e4	2017-06-26 02:45:39 +0000	[diff] [blame]	1067	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1068	:doc:`YamlIO`).
				1069
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	1070	.. TODO
				1071	Is the string null terminated? It probably should not if YAML allows it to
				1072	contain null characters, otherwise it should be.
				1073
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1074	The metadata is represented as a single YAML document comprised of the mapping
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1075	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1076	referenced tables.
				1077
				1078	For boolean values, the string values of ``false`` and ``true`` are used for
				1079	false and true respectively.
				1080
				1081	Additional information can be added to the mappings. To avoid conflicts, any
				1082	non-AMD key names should be prefixed by "vendor-name.".
				1083
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1084	.. table:: AMDHSA Code Object V2 Metadata Map
				1085	:name: amdgpu-amdhsa-code-object-metadata-map-table-v2
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1086
				1087	========== ============== ========= =======================================
				1088	String Key Value Type Required? Description
				1089	========== ============== ========= =======================================
				1090	"Version" sequence of Required - The first integer is the major
				1091	2 integers version. Currently 1.
				1092	- The second integer is the minor
				1093	version. Currently 0.
				1094	"Printf" sequence of Each string is encoded information
				1095	strings about a printf function call. The
				1096	encoded information is organized as
				1097	fields separated by colon (':'):
				1098
				1099	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				1100
				1101	where:
				1102
				1103	``ID``
				1104	A 32 bit integer as a unique id for
				1105	each printf function call
				1106
				1107	``N``
				1108	A 32 bit integer equal to the number
				1109	of arguments of printf function call
				1110	minus 1
				1111
				1112	``S[i]`` (where i = 0, 1, ... , N-1)
				1113	32 bit integers for the size in bytes
				1114	of the i-th FormatString argument of
				1115	the printf function call
				1116
				1117	FormatString
				1118	The format string passed to the
				1119	printf function call.
				1120	"Kernels" sequence of Required Sequence of the mappings for each
				1121	mapping kernel in the code object. See
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1122	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1123	for the definition of the mapping.
				1124	========== ============== ========= =======================================
				1125
				1126	..
				1127
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1128	.. table:: AMDHSA Code Object V2 Kernel Metadata Map
				1129	:name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1130
				1131	================= ============== ========= ================================
				1132	String Key Value Type Required? Description
				1133	================= ============== ========= ================================
				1134	"Name" string Required Source name of the kernel.
				1135	"SymbolName" string Required Name of the kernel
				1136	descriptor ELF symbol.
				1137	"Language" string Source language of the kernel.
				1138	Values include:
				1139
				1140	- "OpenCL C"
				1141	- "OpenCL C++"
				1142	- "HCC"
				1143	- "OpenMP"
				1144
				1145	"LanguageVersion" sequence of - The first integer is the major
				1146	2 integers version.
				1147	- The second integer is the
				1148	minor version.
				1149	"Attrs" mapping Mapping of kernel attributes.
				1150	See
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1151	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1152	for the mapping definition.
Konstantin Zhuravlyov	5556d84	2017-10-14 19:03:51 +0000	[diff] [blame]	1153	"Args" sequence of Sequence of mappings of the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1154	mapping kernel arguments. See
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1155	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1156	for the definition of the mapping.
				1157	"CodeProps" mapping Mapping of properties related to
				1158	the kernel code. See
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1159	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1160	for the mapping definition.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1161	================= ============== ========= ================================
				1162
				1163	..
				1164
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1165	.. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
				1166	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1167
				1168	=================== ============== ========= ==============================
				1169	String Key Value Type Required? Description
				1170	=================== ============== ========= ==============================
Tony Tye	ae5d34e	2018-01-30 23:07:10 +0000	[diff] [blame]	1171	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1172	3 integers must be >=1 and the dispatch
				1173	work-group size X, Y, Z must
				1174	correspond to the specified
				1175	values. Defaults to 0, 0, 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1176
				1177	Corresponds to the OpenCL
				1178	``reqd_work_group_size``
				1179	attribute.
				1180	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1181	3 integers X, Y, Z is likely to be the
				1182	specified values.
				1183
				1184	Corresponds to the OpenCL
				1185	``work_group_size_hint``
				1186	attribute.
				1187	"VecTypeHint" string The name of a scalar or vector
				1188	type.
				1189
				1190	Corresponds to the OpenCL
				1191	``vec_type_hint`` attribute.
Yaxun Liu	091c043	2017-10-10 19:39:48 +0000	[diff] [blame]	1192
				1193	"RuntimeHandle" string The external symbol name
				1194	associated with a kernel.
				1195	OpenCL runtime allocates a
				1196	global buffer for the symbol
				1197	and saves the kernel's address
				1198	to it, which is used for
				1199	device side enqueueing. Only
				1200	available for device side
				1201	enqueued kernels.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1202	=================== ============== ========= ==============================
				1203
				1204	..
				1205
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1206	.. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
				1207	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1208
				1209	================= ============== ========= ================================
				1210	String Key Value Type Required? Description
				1211	================= ============== ========= ================================
				1212	"Name" string Kernel argument name.
				1213	"TypeName" string Kernel argument type name.
				1214	"Size" integer Required Kernel argument size in bytes.
				1215	"Align" integer Required Kernel argument alignment in
				1216	bytes. Must be a power of two.
				1217	"ValueKind" string Required Kernel argument kind that
				1218	specifies how to set up the
				1219	corresponding argument.
				1220	Values include:
				1221
				1222	"ByValue"
				1223	The argument is copied
				1224	directly into the kernarg.
				1225
				1226	"GlobalBuffer"
				1227	A global address space pointer
				1228	to the buffer data is passed
				1229	in the kernarg.
				1230
				1231	"DynamicSharedPointer"
				1232	A group address space pointer
				1233	to dynamically allocated LDS
				1234	is passed in the kernarg.
				1235
				1236	"Sampler"
				1237	A global address space
				1238	pointer to a S# is passed in
				1239	the kernarg.
				1240
				1241	"Image"
				1242	A global address space
				1243	pointer to a T# is passed in
				1244	the kernarg.
				1245
				1246	"Pipe"
				1247	A global address space pointer
				1248	to an OpenCL pipe is passed in
				1249	the kernarg.
				1250
				1251	"Queue"
				1252	A global address space pointer
				1253	to an OpenCL device enqueue
				1254	queue is passed in the
				1255	kernarg.
				1256
				1257	"HiddenGlobalOffsetX"
				1258	The OpenCL grid dispatch
				1259	global offset for the X
				1260	dimension is passed in the
				1261	kernarg.
				1262
				1263	"HiddenGlobalOffsetY"
				1264	The OpenCL grid dispatch
				1265	global offset for the Y
				1266	dimension is passed in the
				1267	kernarg.
				1268
				1269	"HiddenGlobalOffsetZ"
				1270	The OpenCL grid dispatch
				1271	global offset for the Z
				1272	dimension is passed in the
				1273	kernarg.
				1274
				1275	"HiddenNone"
				1276	An argument that is not used
				1277	by the kernel. Space needs to
				1278	be left for it, but it does
				1279	not need to be set up.
				1280
				1281	"HiddenPrintfBuffer"
				1282	A global address space pointer
				1283	to the runtime printf buffer
				1284	is passed in kernarg.
				1285
				1286	"HiddenDefaultQueue"
				1287	A global address space pointer
				1288	to the OpenCL device enqueue
				1289	queue that should be used by
				1290	the kernel by default is
				1291	passed in the kernarg.
				1292
				1293	"HiddenCompletionAction"
Yaxun Liu	a52756b	2017-10-30 14:30:28 +0000	[diff] [blame]	1294	A global address space pointer
				1295	to help link enqueued kernels into
				1296	the ancestor tree for determining
				1297	when the parent kernel has finished.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1298
				1299	"ValueType" string Required Kernel argument value type. Only
				1300	present if "ValueKind" is
				1301	"ByValue". For vector data
				1302	types, the value is for the
				1303	element type. Values include:
				1304
				1305	- "Struct"
				1306	- "I8"
				1307	- "U8"
				1308	- "I16"
				1309	- "U16"
				1310	- "F16"
				1311	- "I32"
				1312	- "U32"
				1313	- "F32"
				1314	- "I64"
				1315	- "U64"
				1316	- "F64"
				1317
				1318	.. TODO
				1319	How can it be determined if a
				1320	vector type, and what size
				1321	vector?
				1322	"PointeeAlign" integer Alignment in bytes of pointee
				1323	type for pointer type kernel
				1324	argument. Must be a power
				1325	of 2. Only present if
				1326	"ValueKind" is
				1327	"DynamicSharedPointer".
				1328	"AddrSpaceQual" string Kernel argument address space
				1329	qualifier. Only present if
				1330	"ValueKind" is "GlobalBuffer" or
				1331	"DynamicSharedPointer". Values
				1332	are:
				1333
				1334	- "Private"
				1335	- "Global"
				1336	- "Constant"
				1337	- "Local"
				1338	- "Generic"
				1339	- "Region"
				1340
				1341	.. TODO
				1342	Is GlobalBuffer only Global
				1343	or Constant? Is
				1344	DynamicSharedPointer always
				1345	Local? Can HCC allow Generic?
				1346	How can Private or Region
				1347	ever happen?
				1348	"AccQual" string Kernel argument access
				1349	qualifier. Only present if
				1350	"ValueKind" is "Image" or
				1351	"Pipe". Values
				1352	are:
				1353
				1354	- "ReadOnly"
				1355	- "WriteOnly"
				1356	- "ReadWrite"
				1357
				1358	.. TODO
				1359	Does this apply to
				1360	GlobalBuffer?
Konstantin Zhuravlyov	5556d84	2017-10-14 19:03:51 +0000	[diff] [blame]	1361	"ActualAccQual" string The actual memory accesses
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1362	performed by the kernel on the
				1363	kernel argument. Only present if
				1364	"ValueKind" is "GlobalBuffer",
				1365	"Image", or "Pipe". This may be
				1366	more restrictive than indicated
				1367	by "AccQual" to reflect what the
				1368	kernel actual does. If not
				1369	present then the runtime must
				1370	assume what is implied by
				1371	"AccQual" and "IsConst". Values
				1372	are:
				1373
				1374	- "ReadOnly"
				1375	- "WriteOnly"
				1376	- "ReadWrite"
				1377
				1378	"IsConst" boolean Indicates if the kernel argument
				1379	is const qualified. Only present
				1380	if "ValueKind" is
				1381	"GlobalBuffer".
				1382
				1383	"IsRestrict" boolean Indicates if the kernel argument
				1384	is restrict qualified. Only
				1385	present if "ValueKind" is
				1386	"GlobalBuffer".
				1387
				1388	"IsVolatile" boolean Indicates if the kernel argument
				1389	is volatile qualified. Only
				1390	present if "ValueKind" is
				1391	"GlobalBuffer".
				1392
				1393	"IsPipe" boolean Indicates if the kernel argument
				1394	is pipe qualified. Only present
				1395	if "ValueKind" is "Pipe".
				1396
				1397	.. TODO
				1398	Can GlobalBuffer be pipe
				1399	qualified?
				1400	================= ============== ========= ================================
				1401
				1402	..
				1403
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1404	.. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
				1405	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1406
				1407	============================ ============== ========= =====================
				1408	String Key Value Type Required? Description
				1409	============================ ============== ========= =====================
				1410	"KernargSegmentSize" integer Required The size in bytes of
				1411	the kernarg segment
				1412	that holds the values
				1413	of the arguments to
				1414	the kernel.
				1415	"GroupSegmentFixedSize" integer Required The amount of group
				1416	segment memory
				1417	required by a
				1418	work-group in
				1419	bytes. This does not
				1420	include any
				1421	dynamically allocated
				1422	group segment memory
				1423	that may be added
				1424	when the kernel is
				1425	dispatched.
				1426	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1427	private address space
				1428	memory required for a
				1429	work-item in
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	1430	bytes. If the kernel
				1431	uses a dynamic call
				1432	stack then additional
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1433	space must be added
				1434	to this value for the
				1435	call stack.
				1436	"KernargSegmentAlign" integer Required The maximum byte
				1437	alignment of
				1438	arguments in the
				1439	kernarg segment. Must
				1440	be a power of 2.
				1441	"WavefrontSize" integer Required Wavefront size. Must
				1442	be a power of 2.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	1443	"NumSGPRs" integer Required Number of scalar
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1444	registers used by a
				1445	wavefront for
				1446	GFX6-GFX9. This
				1447	includes the special
				1448	SGPRs for VCC, Flat
				1449	Scratch (GFX7-GFX9)
				1450	and XNACK (for
				1451	GFX8-GFX9). It does
				1452	not include the 16
				1453	SGPR added if a trap
				1454	handler is
				1455	enabled. It is not
				1456	rounded up to the
				1457	allocation
				1458	granularity.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	1459	"NumVGPRs" integer Required Number of vector
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1460	registers used by
				1461	each work-item for
				1462	GFX6-GFX9
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	1463	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1464	work-group size
				1465	supported by the
				1466	kernel in work-items.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	1467	Must be >=1 and
Tony Tye	ae5d34e	2018-01-30 23:07:10 +0000	[diff] [blame]	1468	consistent with
				1469	ReqdWorkGroupSize if
				1470	not 0, 0, 0.
Konstantin Zhuravlyov	a9edc75	2017-11-28 17:51:08 +0000	[diff] [blame]	1471	"NumSpilledSGPRs" integer Number of stores from
				1472	a scalar register to
				1473	a register allocator
				1474	created spill
				1475	location.
				1476	"NumSpilledVGPRs" integer Number of stores from
				1477	a vector register to
				1478	a register allocator
				1479	created spill
				1480	location.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1481	============================ ============== ========= =====================
				1482
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	1483	.. _amdgpu-amdhsa-code-object-metadata-v3:
				1484
				1485	Code Object V3 Metadata (-mattr=+code-object-v3)
				1486	++++++++++++++++++++++++++++++++++++++++++++++++
				1487
				1488	Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
				1489	(see :ref:`amdgpu-note-records-v3`).
				1490
				1491	The metadata is represented as Message Pack formatted binary data (see
				1492	[MsgPack]_). The top level is a Message Pack map that includes the
				1493	keys defined in table
				1494	:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
				1495	tables.
				1496
				1497	Additional information can be added to the maps. To avoid conflicts,
				1498	any key names should be prefixed by "vendor-name." where
				1499	``vendor-name`` can be the the name of the vendor and specific vendor
				1500	tool that generates the information. The prefix is abbreviated to
				1501	simply "." when it appears within a map that has been added by the
				1502	same vendor-name.
				1503
				1504	.. table:: AMDHSA Code Object V3 Metadata Map
				1505	:name: amdgpu-amdhsa-code-object-metadata-map-table-v3
				1506
				1507	================= ============== ========= =======================================
				1508	String Key Value Type Required? Description
				1509	================= ============== ========= =======================================
				1510	"amdhsa.version" sequence of Required - The first integer is the major
				1511	2 integers version. Currently 1.
				1512	- The second integer is the minor
				1513	version. Currently 0.
				1514	"amdhsa.printf" sequence of Each string is encoded information
				1515	strings about a printf function call. The
				1516	encoded information is organized as
				1517	fields separated by colon (':'):
				1518
				1519	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				1520
				1521	where:
				1522
				1523	``ID``
				1524	A 32 bit integer as a unique id for
				1525	each printf function call
				1526
				1527	``N``
				1528	A 32 bit integer equal to the number
				1529	of arguments of printf function call
				1530	minus 1
				1531
				1532	``S[i]`` (where i = 0, 1, ... , N-1)
				1533	32 bit integers for the size in bytes
				1534	of the i-th FormatString argument of
				1535	the printf function call
				1536
				1537	FormatString
				1538	The format string passed to the
				1539	printf function call.
				1540	"amdhsa.kernels" sequence of Required Sequence of the maps for each
				1541	map kernel in the code object. See
				1542	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
				1543	for the definition of the keys included
				1544	in that map.
				1545	================= ============== ========= =======================================
				1546
				1547	..
				1548
				1549	.. table:: AMDHSA Code Object V3 Kernel Metadata Map
				1550	:name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
				1551
				1552	=================================== ============== ========= ================================
				1553	String Key Value Type Required? Description
				1554	=================================== ============== ========= ================================
				1555	".name" string Required Source name of the kernel.
				1556	".symbol" string Required Name of the kernel
				1557	descriptor ELF symbol.
				1558	".language" string Source language of the kernel.
				1559	Values include:
				1560
				1561	- "OpenCL C"
				1562	- "OpenCL C++"
				1563	- "HCC"
				1564	- "HIP"
				1565	- "OpenMP"
				1566	- "Assembler"
				1567
				1568	".language_version" sequence of - The first integer is the major
				1569	2 integers version.
				1570	- The second integer is the
				1571	minor version.
				1572	".args" sequence of Sequence of maps of the
				1573	map kernel arguments. See
				1574	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
				1575	for the definition of the keys
				1576	included in that map.
				1577	".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
				1578	3 integers must be >=1 and the dispatch
				1579	work-group size X, Y, Z must
				1580	correspond to the specified
				1581	values. Defaults to 0, 0, 0.
				1582
				1583	Corresponds to the OpenCL
				1584	``reqd_work_group_size``
				1585	attribute.
				1586	".workgroup_size_hint" sequence of The dispatch work-group size
				1587	3 integers X, Y, Z is likely to be the
				1588	specified values.
				1589
				1590	Corresponds to the OpenCL
				1591	``work_group_size_hint``
				1592	attribute.
				1593	".vec_type_hint" string The name of a scalar or vector
				1594	type.
				1595
				1596	Corresponds to the OpenCL
				1597	``vec_type_hint`` attribute.
				1598
				1599	".device_enqueue_symbol" string The external symbol name
				1600	associated with a kernel.
				1601	OpenCL runtime allocates a
				1602	global buffer for the symbol
				1603	and saves the kernel's address
				1604	to it, which is used for
				1605	device side enqueueing. Only
				1606	available for device side
				1607	enqueued kernels.
				1608	".kernarg_segment_size" integer Required The size in bytes of
				1609	the kernarg segment
				1610	that holds the values
				1611	of the arguments to
				1612	the kernel.
				1613	".group_segment_fixed_size" integer Required The amount of group
				1614	segment memory
				1615	required by a
				1616	work-group in
				1617	bytes. This does not
				1618	include any
				1619	dynamically allocated
				1620	group segment memory
				1621	that may be added
				1622	when the kernel is
				1623	dispatched.
				1624	".private_segment_fixed_size" integer Required The amount of fixed
				1625	private address space
				1626	memory required for a
				1627	work-item in
				1628	bytes. If the kernel
				1629	uses a dynamic call
				1630	stack then additional
				1631	space must be added
				1632	to this value for the
				1633	call stack.
				1634	".kernarg_segment_align" integer Required The maximum byte
				1635	alignment of
				1636	arguments in the
				1637	kernarg segment. Must
				1638	be a power of 2.
				1639	".wavefront_size" integer Required Wavefront size. Must
				1640	be a power of 2.
				1641	".sgpr_count" integer Required Number of scalar
				1642	registers required by a
				1643	wavefront for
				1644	GFX6-GFX9. A register
				1645	is required if it is
				1646	used explicitly, or
				1647	if a higher numbered
				1648	register is used
				1649	explicitly. This
				1650	includes the special
				1651	SGPRs for VCC, Flat
				1652	Scratch (GFX7-GFX9)
				1653	and XNACK (for
				1654	GFX8-GFX9). It does
				1655	not include the 16
				1656	SGPR added if a trap
				1657	handler is
				1658	enabled. It is not
				1659	rounded up to the
				1660	allocation
				1661	granularity.
				1662	".vgpr_count" integer Required Number of vector
				1663	registers required by
				1664	each work-item for
				1665	GFX6-GFX9. A register
				1666	is required if it is
				1667	used explicitly, or
				1668	if a higher numbered
				1669	register is used
				1670	explicitly.
				1671	".max_flat_workgroup_size" integer Required Maximum flat
				1672	work-group size
				1673	supported by the
				1674	kernel in work-items.
				1675	Must be >=1 and
				1676	consistent with
				1677	ReqdWorkGroupSize if
				1678	not 0, 0, 0.
				1679	".sgpr_spill_count" integer Number of stores from
				1680	a scalar register to
				1681	a register allocator
				1682	created spill
				1683	location.
				1684	".vgpr_spill_count" integer Number of stores from
				1685	a vector register to
				1686	a register allocator
				1687	created spill
				1688	location.
				1689	=================================== ============== ========= ================================
				1690
				1691	..
				1692
				1693	.. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
				1694	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
				1695
				1696	====================== ============== ========= ================================
				1697	String Key Value Type Required? Description
				1698	====================== ============== ========= ================================
				1699	".name" string Kernel argument name.
				1700	".type_name" string Kernel argument type name.
				1701	".size" integer Required Kernel argument size in bytes.
				1702	".offset" integer Required Kernel argument offset in
				1703	bytes. The offset must be a
				1704	multiple of the alignment
				1705	required by the argument.
				1706	".value_kind" string Required Kernel argument kind that
				1707	specifies how to set up the
				1708	corresponding argument.
				1709	Values include:
				1710
				1711	"by_value"
				1712	The argument is copied
				1713	directly into the kernarg.
				1714
				1715	"global_buffer"
				1716	A global address space pointer
				1717	to the buffer data is passed
				1718	in the kernarg.
				1719
				1720	"dynamic_shared_pointer"
				1721	A group address space pointer
				1722	to dynamically allocated LDS
				1723	is passed in the kernarg.
				1724
				1725	"sampler"
				1726	A global address space
				1727	pointer to a S# is passed in
				1728	the kernarg.
				1729
				1730	"image"
				1731	A global address space
				1732	pointer to a T# is passed in
				1733	the kernarg.
				1734
				1735	"pipe"
				1736	A global address space pointer
				1737	to an OpenCL pipe is passed in
				1738	the kernarg.
				1739
				1740	"queue"
				1741	A global address space pointer
				1742	to an OpenCL device enqueue
				1743	queue is passed in the
				1744	kernarg.
				1745
				1746	"hidden_global_offset_x"
				1747	The OpenCL grid dispatch
				1748	global offset for the X
				1749	dimension is passed in the
				1750	kernarg.
				1751
				1752	"hidden_global_offset_y"
				1753	The OpenCL grid dispatch
				1754	global offset for the Y
				1755	dimension is passed in the
				1756	kernarg.
				1757
				1758	"hidden_global_offset_z"
				1759	The OpenCL grid dispatch
				1760	global offset for the Z
				1761	dimension is passed in the
				1762	kernarg.
				1763
				1764	"hidden_none"
				1765	An argument that is not used
				1766	by the kernel. Space needs to
				1767	be left for it, but it does
				1768	not need to be set up.
				1769
				1770	"hidden_printf_buffer"
				1771	A global address space pointer
				1772	to the runtime printf buffer
				1773	is passed in kernarg.
				1774
				1775	"hidden_default_queue"
				1776	A global address space pointer
				1777	to the OpenCL device enqueue
				1778	queue that should be used by
				1779	the kernel by default is
				1780	passed in the kernarg.
				1781
				1782	"hidden_completion_action"
				1783	A global address space pointer
				1784	to help link enqueued kernels into
				1785	the ancestor tree for determining
				1786	when the parent kernel has finished.
				1787
				1788	".value_type" string Required Kernel argument value type. Only
				1789	present if ".value_kind" is
				1790	"by_value". For vector data
				1791	types, the value is for the
				1792	element type. Values include:
				1793
				1794	- "struct"
				1795	- "i8"
				1796	- "u8"
				1797	- "i16"
				1798	- "u16"
				1799	- "f16"
				1800	- "i32"
				1801	- "u32"
				1802	- "f32"
				1803	- "i64"
				1804	- "u64"
				1805	- "f64"
				1806
				1807	.. TODO
				1808	How can it be determined if a
				1809	vector type, and what size
				1810	vector?
				1811	".pointee_align" integer Alignment in bytes of pointee
				1812	type for pointer type kernel
				1813	argument. Must be a power
				1814	of 2. Only present if
				1815	".value_kind" is
				1816	"dynamic_shared_pointer".
				1817	".address_space" string Kernel argument address space
				1818	qualifier. Only present if
				1819	".value_kind" is "global_buffer" or
				1820	"dynamic_shared_pointer". Values
				1821	are:
				1822
				1823	- "private"
				1824	- "global"
				1825	- "constant"
				1826	- "local"
				1827	- "generic"
				1828	- "region"
				1829
				1830	.. TODO
				1831	Is "global_buffer" only "global"
				1832	or "constant"? Is
				1833	"dynamic_shared_pointer" always
				1834	"local"? Can HCC allow "generic"?
				1835	How can "private" or "region"
				1836	ever happen?
				1837	".access" string Kernel argument access
				1838	qualifier. Only present if
				1839	".value_kind" is "image" or
				1840	"pipe". Values
				1841	are:
				1842
				1843	- "read_only"
				1844	- "write_only"
				1845	- "read_write"
				1846
				1847	.. TODO
				1848	Does this apply to
				1849	"global_buffer"?
				1850	".actual_access" string The actual memory accesses
				1851	performed by the kernel on the
				1852	kernel argument. Only present if
				1853	".value_kind" is "global_buffer",
				1854	"image", or "pipe". This may be
				1855	more restrictive than indicated
				1856	by ".access" to reflect what the
				1857	kernel actual does. If not
				1858	present then the runtime must
				1859	assume what is implied by
				1860	".access" and ".is_const" . Values
				1861	are:
				1862
				1863	- "read_only"
				1864	- "write_only"
				1865	- "read_write"
				1866
				1867	".is_const" boolean Indicates if the kernel argument
				1868	is const qualified. Only present
				1869	if ".value_kind" is
				1870	"global_buffer".
				1871
				1872	".is_restrict" boolean Indicates if the kernel argument
				1873	is restrict qualified. Only
				1874	present if ".value_kind" is
				1875	"global_buffer".
				1876
				1877	".is_volatile" boolean Indicates if the kernel argument
				1878	is volatile qualified. Only
				1879	present if ".value_kind" is
				1880	"global_buffer".
				1881
				1882	".is_pipe" boolean Indicates if the kernel argument
				1883	is pipe qualified. Only present
				1884	if ".value_kind" is "pipe".
				1885
				1886	.. TODO
				1887	Can "global_buffer" be pipe
				1888	qualified?
				1889	====================== ============== ========= ================================
				1890
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1891	..
				1892
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1893	Kernel Dispatch
				1894	~~~~~~~~~~~~~~~
				1895
				1896	The HSA architected queuing language (AQL) defines a user space memory interface
				1897	that can be used to control the dispatch of kernels, in an agent independent
				1898	way. An agent can have zero or more AQL queues created for it using the ROCm
				1899	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1900	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1901	mechanics and packet layouts.
				1902
				1903	The packet processor of a kernel agent is responsible for detecting and
				1904	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1905	packet processor is implemented by the hardware command processor (CP),
				1906	asynchronous dispatch controller (ADC) and shader processor input controller
				1907	(SPI).
				1908
				1909	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1910	mode driver to initialize and register the AQL queue with CP.
				1911
				1912	To dispatch a kernel the following actions are performed. This can occur in the
				1913	CPU host program, or from an HSA kernel executing on a GPU.
				1914
				1915	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1916	executed is obtained.
				1917	2. A pointer to the kernel descriptor (see
				1918	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1919	obtained. It must be for a kernel that is contained in a code object that that
				1920	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1921	associated.
				1922	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1923	for a memory region with the kernarg property for the kernel agent that will
				1924	execute the kernel. It must be at least 16 byte aligned.
				1925	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	cb5868c	2017-10-19 17:12:55 +0000	[diff] [blame]	1926	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1927	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1928	memory in the same way constant memory is accessed. (Note that the HSA
				1929	specification allows an implementation to copy the kernel argument contents to
				1930	another location that is accessed by the kernel.)
				1931	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1932	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1933	packet. The packet must be set up, and the final write must use an atomic
				1934	store release to set the packet kind to ensure the packet contents are
				1935	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1936	notify the kernel agent that the AQL queue has been updated. These rules, and
				1937	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1938	System Architecture Specification* [HSA]_.
				1939	6. A kernel dispatch packet includes information about the actual dispatch,
				1940	such as grid and work-group size, together with information from the code
				1941	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1942	the kernel symbol can be used to obtain the code object values which are
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	1943	recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	1944	7. CP executes micro-code and is responsible for detecting and setting up the
				1945	GPU to execute the wavefronts of a kernel dispatch.
				1946	8. CP ensures that when the a wavefront starts executing the kernel machine
				1947	code, the scalar general purpose registers (SGPR) and vector general purpose
				1948	registers (VGPR) are set up as required by the machine code. The required
				1949	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1950	register state is defined in
				1951	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1952	9. The prolog of the kernel machine code (see
				1953	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1954	before continuing executing the machine code that corresponds to the kernel.
				1955	10. When the kernel dispatch has completed execution, CP signals the completion
				1956	signal specified in the kernel dispatch packet if not 0.
				1957
				1958	.. _amdgpu-amdhsa-memory-spaces:
				1959
				1960	Memory Spaces
				1961	~~~~~~~~~~~~~
				1962
				1963	The memory space properties are:
				1964
				1965	.. table:: AMDHSA Memory Spaces
				1966	:name: amdgpu-amdhsa-memory-spaces-table
				1967
				1968	================= =========== ======== ======= ==================
				1969	Memory Space Name HSA Segment Hardware Address NULL Value
				1970	Name Name Size
				1971	================= =========== ======== ======= ==================
				1972	Private private scratch 32 0x00000000
				1973	Local group LDS 32 0xFFFFFFFF
				1974	Global global global 64 0x0000000000000000
				1975	Constant constant *same as 64 0x0000000000000000
				1976	global*
				1977	Generic flat flat 64 0x0000000000000000
				1978	Region N/A GDS 32 *not implemented
				1979	for AMDHSA*
				1980	================= =========== ======== ======= ==================
				1981
				1982	The global and constant memory spaces both use global virtual addresses, which
				1983	are the same virtual address space used by the CPU. However, some virtual
				1984	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1985	and some by both.
				1986
				1987	Using the constant memory space indicates that the data will not change during
				1988	the execution of the kernel. This allows scalar read instructions to be
				1989	used. The vector and scalar L1 caches are invalidated of volatile data before
				1990	each kernel dispatch execution to allow constant memory to change values between
				1991	kernel dispatches.
				1992
				1993	The local memory space uses the hardware Local Data Store (LDS) which is
				1994	automatically allocated when the hardware creates work-groups of wavefronts, and
				1995	freed when all the wavefronts of a work-group have terminated. The data store
				1996	(DS) instructions can be used to access it.
				1997
				1998	The private memory space uses the hardware scratch memory support. If the kernel
				1999	uses scratch, then the hardware allocates memory that is accessed using
				2000	wavefront lane dword (4 byte) interleaving. The mapping used from private
				2001	address to physical address is:
				2002
				2003	``wavefront-scratch-base +
				2004	(private-address * wavefront-size * 4) +
				2005	(wavefront-lane-id * 4)``
				2006
				2007	There are different ways that the wavefront scratch base address is determined
				2008	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				2009	memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2010	the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2011	instructions, or by flat instructions. If each lane of a wavefront accesses the
				2012	same private address, the interleaving results in adjacent dwords being accessed
				2013	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				2014	supported except by flat and scratch instructions in GFX9.
				2015
				2016	The generic address space uses the hardware flat address support available in
				2017	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				2018	local appertures), that are outside the range of addressible global memory, to
				2019	map from a flat address to a private or local address.
				2020
				2021	FLAT instructions can take a flat address and access global, private (scratch)
				2022	and group (LDS) memory depending in if the address is within one of the
				2023	apperture ranges. Flat access to scratch requires hardware aperture setup and
				2024	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				2025	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				2026	(see :ref:`amdgpu-amdhsa-m0`).
				2027
				2028	To convert between a segment address and a flat address the base address of the
				2029	appertures address can be used. For GFX7-GFX8 these are available in the
				2030	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				2031	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				2032	GFX9 the appature base addresses are directly available as inline constant
				2033	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				2034	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				2035	which makes it easier to convert from flat to segment or segment to flat.
				2036
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2037	Image and Samplers
				2038	~~~~~~~~~~~~~~~~~~
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2039
				2040	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				2041	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				2042	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				2043	enumeration values for the queries that are not trivially deducible from the S#
				2044	representation.
				2045
				2046	HSA Signals
				2047	~~~~~~~~~~~
				2048
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2049	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				2050	structure allocated in memory accessible from both the CPU and GPU. The
				2051	structure is defined by the ROCm runtime and subject to change between releases
				2052	(see [AMD-ROCm-github]_).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2053
				2054	.. _amdgpu-amdhsa-hsa-aql-queue:
				2055
				2056	HSA AQL Queue
				2057	~~~~~~~~~~~~~
				2058
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2059	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2060	between releases (see [AMD-ROCm-github]_). For some processors it contains
				2061	fields needed to implement certain language features such as the flat address
				2062	aperture bases. It also contains fields used by CP such as managing the
				2063	allocation of scratch memory.
				2064
				2065	.. _amdgpu-amdhsa-kernel-descriptor:
				2066
				2067	Kernel Descriptor
				2068	~~~~~~~~~~~~~~~~~
				2069
				2070	A kernel descriptor consists of the information needed by CP to initiate the
				2071	execution of a kernel, including the entry point address of the machine code
				2072	that implements the kernel.
				2073
				2074	Kernel Descriptor for GFX6-GFX9
				2075	+++++++++++++++++++++++++++++++
				2076
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2077	CP microcode requires the Kernel descriptor to be allocated on 64 byte
				2078	alignment.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2079
				2080	.. table:: Kernel Descriptor for GFX6-GFX9
				2081	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				2082
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2083	======= ======= =============================== ============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2084	Bits Size Field Name Description
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2085	======= ======= =============================== ============================
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2086	31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2087	address space memory
				2088	required for a work-group
				2089	in bytes. This does not
				2090	include any dynamically
				2091	allocated local address
				2092	space memory that may be
				2093	added when the kernel is
				2094	dispatched.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2095	63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2096	private address space
				2097	memory required for a
				2098	work-item in bytes. If
				2099	is_dynamic_callstack is 1
				2100	then additional space must
				2101	be added to this value for
				2102	the call stack.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	2103	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2104	191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2105	negative) from base
				2106	address of kernel
				2107	descriptor to kernel's
				2108	entry point instruction
				2109	which must be 256 byte
				2110	aligned.
Tony Tye	ae5d34e	2018-01-30 23:07:10 +0000	[diff] [blame]	2111	383:192 24 Reserved, must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2112	bytes
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2113	415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2114	program settings used by
				2115	CP to set up
				2116	``COMPUTE_PGM_RSRC1``
				2117	configuration
				2118	register. See
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2119	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2120	447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2121	program settings used by
				2122	CP to set up
				2123	``COMPUTE_PGM_RSRC2``
				2124	configuration
				2125	register. See
				2126	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2127	448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				2128	_BUFFER SGPR user data registers
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2129	(see
				2130	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				2131
				2132	The total number of SGPR
				2133	user data registers
				2134	requested must not exceed
				2135	16 and match value in
				2136	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				2137	Any requests beyond 16
				2138	will be ignored.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2139	449 1 bit ENABLE_SGPR_DISPATCH_PTR see above
				2140	450 1 bit ENABLE_SGPR_QUEUE_PTR see above
				2141	451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR see above
				2142	452 1 bit ENABLE_SGPR_DISPATCH_ID see above
				2143	453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT see above
				2144	454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT see above
				2145	_SIZE
Konstantin Zhuravlyov	3316883	2018-06-21 18:36:04 +0000	[diff] [blame]	2146	455 1 bit Reserved, must be 0.
				2147	511:456 8 bytes Reserved, must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2148	512 Total size 64 bytes.
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2149	======= ====================================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2150
				2151	..
				2152
				2153	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2154	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2155
Tony Tye	04631fc	2017-06-07 00:46:08 +0000	[diff] [blame]	2156	======= ======= =============================== ===========================================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2157	Bits Size Field Name Description
Tony Tye	04631fc	2017-06-07 00:46:08 +0000	[diff] [blame]	2158	======= ======= =============================== ===========================================================================
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2159	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
				2160	blocks used by each work-item;
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2161	granularity is device
				2162	specific:
				2163
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	2164	GFX6-GFX9
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2165	- vgprs_used 0..256
				2166	- max(0, ceil(vgprs_used / 4) - 1)
				2167
				2168	Where vgprs_used is defined
				2169	as the highest VGPR number
				2170	explicitly referenced plus
				2171	one.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2172
				2173	Used by CP to set up
				2174	``COMPUTE_PGM_RSRC1.VGPRS``.
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2175
				2176	The
				2177	:ref:`amdgpu-assembler`
				2178	calculates this
				2179	automatically for the
				2180	selected processor from
				2181	values provided to the
				2182	`.amdhsa_kernel` directive
				2183	by the
				2184	`.amdhsa_next_free_vgpr`
				2185	nested directive (see
				2186	:ref:`amdhsa-kernel-directives-table`).
				2187	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
				2188	blocks used by a wavefront;
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2189	granularity is device
				2190	specific:
				2191
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	2192	GFX6-GFX8
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2193	- sgprs_used 0..112
				2194	- max(0, ceil(sgprs_used / 8) - 1)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2195	GFX9
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2196	- sgprs_used 0..112
				2197	- 2 * max(0, ceil(sgprs_used / 16) - 1)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2198
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2199	Where sgprs_used is
				2200	defined as the highest
				2201	SGPR number explicitly
				2202	referenced plus one, plus
				2203	a target-specific number
				2204	of additional special
				2205	SGPRs for VCC,
				2206	FLAT_SCRATCH (GFX7+) and
				2207	XNACK_MASK (GFX8+), and
				2208	any additional
				2209	target-specific
				2210	limitations. It does not
				2211	include the 16 SGPRs added
				2212	if a trap handler is
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2213	enabled.
				2214
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2215	The target-specific
				2216	limitations and special
				2217	SGPR layout are defined in
				2218	the hardware
				2219	documentation, which can
				2220	be found in the
				2221	:ref:`amdgpu-processors`
				2222	table.
				2223
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2224	Used by CP to set up
				2225	``COMPUTE_PGM_RSRC1.SGPRS``.
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	2226
				2227	The
				2228	:ref:`amdgpu-assembler`
				2229	calculates this
				2230	automatically for the
				2231	selected processor from
				2232	values provided to the
				2233	`.amdhsa_kernel` directive
				2234	by the
				2235	`.amdhsa_next_free_sgpr`
				2236	and `.amdhsa_reserve_*`
				2237	nested directives (see
				2238	:ref:`amdhsa-kernel-directives-table`).
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2239	11:10 2 bits PRIORITY Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2240
				2241	Start executing wavefront
				2242	at the specified priority.
				2243
				2244	CP is responsible for
				2245	filling in
				2246	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2247	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2248	with specified rounding
				2249	mode for single (32
				2250	bit) floating point
				2251	precision floating point
				2252	operations.
				2253
				2254	Floating point rounding
				2255	mode values are defined in
				2256	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				2257
				2258	Used by CP to set up
				2259	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2260	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2261	with specified rounding
				2262	denorm mode for half/double (16
				2263	and 64 bit) floating point
				2264	precision floating point
				2265	operations.
				2266
				2267	Floating point rounding
				2268	mode values are defined in
				2269	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				2270
				2271	Used by CP to set up
				2272	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2273	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2274	with specified denorm mode
				2275	for single (32
				2276	bit) floating point
				2277	precision floating point
				2278	operations.
				2279
				2280	Floating point denorm mode
				2281	values are defined in
				2282	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				2283
				2284	Used by CP to set up
				2285	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2286	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2287	with specified denorm mode
				2288	for half/double (16
				2289	and 64 bit) floating point
				2290	precision floating point
				2291	operations.
				2292
				2293	Floating point denorm mode
				2294	values are defined in
				2295	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				2296
				2297	Used by CP to set up
				2298	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2299	20 1 bit PRIV Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2300
				2301	Start executing wavefront
				2302	in privilege trap handler
				2303	mode.
				2304
				2305	CP is responsible for
				2306	filling in
				2307	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2308	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2309	with DX10 clamp mode
				2310	enabled. Used by the vector
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2311	ALU to force DX10 style
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2312	treatment of NaN's (when
				2313	set, clamp NaN to zero,
				2314	otherwise pass NaN
				2315	through).
				2316
				2317	Used by CP to set up
				2318	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2319	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2320
				2321	Start executing wavefront
				2322	in single step mode.
				2323
				2324	CP is responsible for
				2325	filling in
				2326	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2327	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2328	with IEEE mode
				2329	enabled. Floating point
				2330	opcodes that support
				2331	exception flag gathering
				2332	will quiet and propagate
				2333	signaling-NaN inputs per
				2334	IEEE 754-2008. Min_dx10 and
				2335	max_dx10 become IEEE
				2336	754-2008 compliant due to
				2337	signaling-NaN propagation
				2338	and quieting.
				2339
				2340	Used by CP to set up
				2341	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2342	24 1 bit BULKY Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2343
				2344	Only one work-group allowed
				2345	to execute on a compute
				2346	unit.
				2347
				2348	CP is responsible for
				2349	filling in
				2350	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2351	25 1 bit CDBG_USER Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2352
				2353	Flag that can be used to
				2354	control debugging code.
				2355
				2356	CP is responsible for
				2357	filling in
				2358	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	2359	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2360	Reserved, must be 0.
				2361	GFX9
				2362	Wavefront starts execution
				2363	with specified fp16 overflow
				2364	mode.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2365
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2366	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2367	+/-INF values.
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2368	- If 1, fp16 overflow that is the
				2369	result of an +/-INF input value
				2370	or divide by 0 produces a +/-INF,
				2371	otherwise clamps computed
				2372	overflow to +/-MAX_FP16 as
				2373	appropriate.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2374
				2375	Used by CP to set up
				2376	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2377	31:27 5 bits Reserved, must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2378	32 Total size 4 bytes
Tony Tye	04631fc	2017-06-07 00:46:08 +0000	[diff] [blame]	2379	======= ===================================================================================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2380
				2381	..
				2382
				2383	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				2384	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				2385
Tony Tye	04631fc	2017-06-07 00:46:08 +0000	[diff] [blame]	2386	======= ======= =============================== ===========================================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2387	Bits Size Field Name Description
Tony Tye	04631fc	2017-06-07 00:46:08 +0000	[diff] [blame]	2388	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2389	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2390	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2391	system register (see
				2392	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				2393
				2394	Used by CP to set up
				2395	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2396	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2397	user data registers
				2398	requested. This number must
				2399	match the number of user
				2400	data registers enabled.
				2401
				2402	Used by CP to set up
				2403	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	840f423	2018-05-29 19:09:13 +0000	[diff] [blame]	2404	6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2405
Konstantin Zhuravlyov	840f423	2018-05-29 19:09:13 +0000	[diff] [blame]	2406	This bit represents
				2407	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
				2408	which is set by the CP if
				2409	the runtime has installed a
				2410	trap handler.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2411	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2412	system SGPR register for
				2413	the work-group id in the X
				2414	dimension (see
				2415	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				2416
				2417	Used by CP to set up
				2418	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2419	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2420	system SGPR register for
				2421	the work-group id in the Y
				2422	dimension (see
				2423	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				2424
				2425	Used by CP to set up
				2426	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2427	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2428	system SGPR register for
				2429	the work-group id in the Z
				2430	dimension (see
				2431	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				2432
				2433	Used by CP to set up
				2434	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2435	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2436	system SGPR register for
				2437	work-group information (see
				2438	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				2439
				2440	Used by CP to set up
				2441	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2442	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2443	VGPR system registers used
				2444	for the work-item ID.
				2445	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				2446	defines the values.
				2447
				2448	Used by CP to set up
				2449	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2450	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2451
				2452	Wavefront starts execution
				2453	with address watch
				2454	exceptions enabled which
				2455	are generated when L1 has
				2456	witnessed a thread access
				2457	an *address of
				2458	interest*.
				2459
				2460	CP is responsible for
				2461	filling in the address
				2462	watch bit in
				2463	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				2464	according to what the
				2465	runtime requests.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2466	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2467
				2468	Wavefront starts execution
				2469	with memory violation
				2470	exceptions exceptions
				2471	enabled which are generated
				2472	when a memory violation has
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2473	occurred for this wavefront from
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2474	L1 or LDS
				2475	(write-to-read-only-memory,
				2476	mis-aligned atomic, LDS
				2477	address out of range,
				2478	illegal address, etc.).
				2479
				2480	CP sets the memory
				2481	violation bit in
				2482	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				2483	according to what the
				2484	runtime requests.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2485	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2486
				2487	CP uses the rounded value
				2488	from the dispatch packet,
				2489	not this value, as the
				2490	dispatch may contain
				2491	dynamically allocated group
				2492	segment memory. CP writes
				2493	directly to
				2494	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				2495
				2496	Amount of group segment
				2497	(LDS) to allocate for each
				2498	work-group. Granularity is
				2499	device specific:
				2500
				2501	GFX6:
				2502	roundup(lds-size / (64 * 4))
				2503	GFX7-GFX9:
				2504	roundup(lds-size / (128 * 4))
				2505
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2506	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				2507	_INVALID_OPERATION with specified exceptions
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2508	enabled.
				2509
				2510	Used by CP to set up
				2511	``COMPUTE_PGM_RSRC2.EXCP_EN``
				2512	(set from bits 0..6).
				2513
				2514	IEEE 754 FP Invalid
				2515	Operation
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2516	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				2517	_SOURCE input operands is a
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2518	denormal number
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2519	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				2520	_DIVISION_BY_ZERO Zero
				2521	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				2522	_OVERFLOW
				2523	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				2524	_UNDERFLOW
				2525	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				2526	_INEXACT
				2527	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				2528	_ZERO (rcp_iflag_f32 instruction
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2529	only)
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2530	31 1 bit Reserved, must be 0.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2531	32 Total size 4 bytes.
Tony Tye	04631fc	2017-06-07 00:46:08 +0000	[diff] [blame]	2532	======= ===================================================================================================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2533
				2534	..
				2535
				2536	.. table:: Floating Point Rounding Mode Enumeration Values
				2537	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				2538
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2539	====================================== ===== ==============================
				2540	Enumeration Name Value Description
				2541	====================================== ===== ==============================
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2542	FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				2543	FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				2544	FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				2545	FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2546	====================================== ===== ==============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2547
				2548	..
				2549
				2550	.. table:: Floating Point Denorm Mode Enumeration Values
				2551	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				2552
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2553	====================================== ===== ==============================
				2554	Enumeration Name Value Description
				2555	====================================== ===== ==============================
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2556	FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2557	Denorms
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2558	FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				2559	FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				2560	FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2561	====================================== ===== ==============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2562
				2563	..
				2564
				2565	.. table:: System VGPR Work-Item ID Enumeration Values
				2566	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2567
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2568	======================================== ===== ============================
				2569	Enumeration Name Value Description
				2570	======================================== ===== ============================
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2571	SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2572	ID.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2573	SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2574	dimensions ID.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2575	SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2576	dimensions ID.
Konstantin Zhuravlyov	299cf5f	2018-06-12 18:02:46 +0000	[diff] [blame]	2577	SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
Konstantin Zhuravlyov	755155f	2017-10-14 19:17:08 +0000	[diff] [blame]	2578	======================================== ===== ============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2579
				2580	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				2581
				2582	Initial Kernel Execution State
				2583	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2584
				2585	This section defines the register state that will be set up by the packet
				2586	processor prior to the start of execution of every wavefront. This is limited by
				2587	the constraints of the hardware controllers of CP/ADC/SPI.
				2588
				2589	The order of the SGPR registers is defined, but the compiler can specify which
				2590	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2591	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2592	for enabled registers are dense starting at SGPR0: the first enabled register is
				2593	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2594	an SGPR number.
				2595
				2596	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2597	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2598	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2599	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2600	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2601	dispatch.
				2602
				2603	SGPR register initial state is defined in
				2604	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2605
				2606	.. table:: SGPR Register Set Up Order
				2607	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2608
				2609	========== ========================== ====== ==============================
				2610	SGPR Order Name Number Description
				2611	(kernel descriptor enable of
				2612	field) SGPRs
				2613	========== ========================== ====== ==============================
				2614	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2615	(enable_sgpr_private with Scratch Wavefront Offset
				2616	_segment_buffer) as an offset, to access the
				2617	private memory space using a
				2618	segment address.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2619
				2620	CP uses the value provided by
				2621	the runtime.
				2622	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2623	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2624	actually executing.
				2625	then Queue Ptr 2 64 bit address of amd_queue_t
				2626	(enable_sgpr_queue_ptr) object for AQL queue on which
				2627	the dispatch packet was
				2628	queued.
				2629	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2630	(enable_sgpr_kernarg segment. This is directly
				2631	_segment_ptr) copied from the
				2632	kernarg_address in the kernel
				2633	dispatch packet.
				2634
				2635	Having CP load it once avoids
				2636	loading it at the beginning of
				2637	every wavefront.
				2638	then Dispatch Id 2 64 bit Dispatch ID of the
				2639	(enable_sgpr_dispatch_id) dispatch packet being
				2640	executed.
				2641	then Flat Scratch Init 2 This is 2 SGPRs:
				2642	(enable_sgpr_flat_scratch
				2643	_init) GFX6
				2644	Not supported.
				2645	GFX7-GFX8
				2646	The first SGPR is a 32 bit
				2647	byte offset from
				2648	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2649	to per SPI base of memory
				2650	for scratch for the queue
				2651	executing the kernel
				2652	dispatch. CP obtains this
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2653	from the runtime. (The
				2654	Scratch Segment Buffer base
				2655	address is
				2656	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2657	plus this offset.) The value
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2658	of Scratch Wavefront Offset must
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2659	be added to this offset by
				2660	the kernel machine code,
				2661	right shifted by 8, and
				2662	moved to the FLAT_SCRATCH_HI
				2663	SGPR register.
				2664	FLAT_SCRATCH_HI corresponds
				2665	to SGPRn-4 on GFX7, and
				2666	SGPRn-6 on GFX8 (where SGPRn
				2667	is the highest numbered SGPR
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2668	allocated to the wavefront).
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2669	FLAT_SCRATCH_HI is
				2670	multiplied by 256 (as it is
				2671	in units of 256 bytes) and
				2672	added to
				2673	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2674	to calculate the per wavefront
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2675	FLAT SCRATCH BASE in flat
				2676	memory instructions that
				2677	access the scratch
				2678	apperture.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2679
				2680	The second SGPR is 32 bit
				2681	byte size of a single
Konstantin Zhuravlyov	cb5868c	2017-10-19 17:12:55 +0000	[diff] [blame]	2682	work-item's scratch memory
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2683	usage. CP obtains this from
				2684	the runtime, and it is
				2685	always a multiple of DWORD.
				2686	CP checks that the value in
				2687	the kernel dispatch packet
				2688	Private Segment Byte Size is
				2689	not larger, and requests the
				2690	runtime to increase the
				2691	queue's scratch size if
				2692	necessary. The kernel code
				2693	must move it to
				2694	FLAT_SCRATCH_LO which is
				2695	SGPRn-3 on GFX7 and SGPRn-5
				2696	on GFX8. FLAT_SCRATCH_LO is
				2697	used as the FLAT SCRATCH
				2698	SIZE in flat memory
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2699	instructions. Having CP load
				2700	it once avoids loading it at
				2701	the beginning of every
Tony Tye	2f2cb6b	2017-11-10 20:51:43 +0000	[diff] [blame]	2702	wavefront.
				2703	GFX9
				2704	This is the
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2705	64 bit base address of the
				2706	per SPI scratch backing
				2707	memory managed by SPI for
				2708	the queue executing the
				2709	kernel dispatch. CP obtains
				2710	this from the runtime (and
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2711	divides it if there are
				2712	multiple Shader Arrays each
				2713	with its own SPI). The value
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2714	of Scratch Wavefront Offset must
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2715	be added by the kernel
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	2716	machine code and the result
				2717	moved to the FLAT_SCRATCH
				2718	SGPR which is SGPRn-6 and
				2719	SGPRn-5. It is used as the
				2720	FLAT SCRATCH BASE in flat
Tony Tye	2f2cb6b	2017-11-10 20:51:43 +0000	[diff] [blame]	2721	memory instructions.
				2722	then Private Segment Size 1 The 32 bit byte size of a
				2723	(enable_sgpr_private single
				2724	work-item's
				2725	scratch_segment_size) memory
				2726	allocation. This is the
				2727	value from the kernel
				2728	dispatch packet Private
				2729	Segment Byte Size rounded up
				2730	by CP to a multiple of
				2731	DWORD.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2732
				2733	Having CP load it once avoids
				2734	loading it at the beginning of
				2735	every wavefront.
				2736
				2737	This is not used for
				2738	GFX7-GFX8 since it is the same
				2739	value as the second SGPR of
				2740	Flat Scratch Init. However, it
				2741	may be needed for GFX9 which
				2742	changes the meaning of the
				2743	Flat Scratch Init value.
				2744	then Grid Work-Group Count X 1 32 bit count of the number of
				2745	(enable_sgpr_grid work-groups in the X dimension
				2746	_workgroup_count_X) for the grid being
				2747	executed. Computed from the
				2748	fields in the kernel dispatch
				2749	packet as ((grid_size.x +
				2750	workgroup_size.x - 1) /
				2751	workgroup_size.x).
				2752	then Grid Work-Group Count Y 1 32 bit count of the number of
				2753	(enable_sgpr_grid work-groups in the Y dimension
				2754	_workgroup_count_Y && for the grid being
				2755	less than 16 previous executed. Computed from the
				2756	SGPRs) fields in the kernel dispatch
				2757	packet as ((grid_size.y +
				2758	workgroup_size.y - 1) /
				2759	workgroupSize.y).
				2760
				2761	Only initialized if <16
				2762	previous SGPRs initialized.
				2763	then Grid Work-Group Count Z 1 32 bit count of the number of
				2764	(enable_sgpr_grid work-groups in the Z dimension
				2765	_workgroup_count_Z && for the grid being
				2766	less than 16 previous executed. Computed from the
				2767	SGPRs) fields in the kernel dispatch
				2768	packet as ((grid_size.z +
				2769	workgroup_size.z - 1) /
				2770	workgroupSize.z).
				2771
				2772	Only initialized if <16
				2773	previous SGPRs initialized.
				2774	then Work-Group Id X 1 32 bit work-group id in X
				2775	(enable_sgpr_workgroup_id dimension of grid for
				2776	_X) wavefront.
				2777	then Work-Group Id Y 1 32 bit work-group id in Y
				2778	(enable_sgpr_workgroup_id dimension of grid for
				2779	_Y) wavefront.
				2780	then Work-Group Id Z 1 32 bit work-group id in Z
				2781	(enable_sgpr_workgroup_id dimension of grid for
				2782	_Z) wavefront.
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2783	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2784	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2785	_info) threadgroup_size_in_wavefronts[5:0]}
				2786	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2787	(enable_sgpr_private of scratch base of queue
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2788	_segment_wavefront_offset) executing the kernel
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2789	dispatch. Must be used as an
				2790	offset with Private
				2791	segment address when using
				2792	Scratch Segment Buffer. It
				2793	must be used to set up FLAT
				2794	SCRATCH for flat addressing
				2795	(see
				2796	:ref:`amdgpu-amdhsa-flat-scratch`).
				2797	========== ========================== ====== ==============================
				2798
				2799	The order of the VGPR registers is defined, but the compiler can specify which
				2800	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2801	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2802	for enabled registers are dense starting at VGPR0: the first enabled register is
				2803	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2804	VGPR number.
				2805
				2806	VGPR register initial state is defined in
				2807	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2808
				2809	.. table:: VGPR Register Set Up Order
				2810	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2811
				2812	========== ========================== ====== ==============================
				2813	VGPR Order Name Number Description
				2814	(kernel descriptor enable of
				2815	field) VGPRs
				2816	========== ========================== ====== ==============================
				2817	First Work-Item Id X 1 32 bit work item id in X
				2818	(Always initialized) dimension of work-group for
				2819	wavefront lane.
				2820	then Work-Item Id Y 1 32 bit work item id in Y
				2821	(enable_vgpr_workitem_id dimension of work-group for
				2822	> 0) wavefront lane.
				2823	then Work-Item Id Z 1 32 bit work item id in Z
				2824	(enable_vgpr_workitem_id dimension of work-group for
				2825	> 1) wavefront lane.
				2826	========== ========================== ====== ==============================
				2827
Hiroshi Inoue	ef1bc2d	2018-04-12 05:53:20 +0000	[diff] [blame]	2828	The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2829
				2830	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2831	registers.
				2832	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2833	combination including none.
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2834	3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
				2835	its value cannot included with the flat scratch init value which is per queue.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2836	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2837	or (X, Y, Z).
				2838
				2839	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2840	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2841
				2842	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	2843	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2844	instructions (GFX9).
				2845
				2846	If buffer operations are used then the compiler can generate a V# with the
				2847	following properties:
				2848
				2849	* base address of 0
				2850	* no swizzle
				2851	* ATC: 1 if IOMMU present (such as APU)
				2852	* ptr64: 1
				2853	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2854	APU and NC for dGPU).
				2855
				2856	.. _amdgpu-amdhsa-kernel-prolog:
				2857
				2858	Kernel Prolog
				2859	~~~~~~~~~~~~~
				2860
				2861	.. _amdgpu-amdhsa-m0:
				2862
				2863	M0
				2864	++
				2865
				2866	GFX6-GFX8
				2867	The M0 register must be initialized with a value at least the total LDS size
				2868	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2869	available in dispatch packet. For M0, it is also possible to use maximum
				2870	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2871	GFX7-GFX8).
				2872	GFX9
				2873	The M0 register is not used for range checking LDS accesses and so does not
				2874	need to be initialized in the prolog.
				2875
				2876	.. _amdgpu-amdhsa-flat-scratch:
				2877
				2878	Flat Scratch
				2879	++++++++++++
				2880
				2881	If the kernel may use flat operations to access scratch memory, the prolog code
				2882	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2883	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2884	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2885
				2886	GFX6
				2887	Flat scratch is not supported.
				2888
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	2889	GFX7-GFX8
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2890	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2891	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2892	being managed by SPI for the queue executing the kernel dispatch. This is
				2893	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2894	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2895	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2896	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2897	by 8 before moving into FLAT_SCRATCH_LO.
				2898	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2899	work-items scratch memory usage. This is directly loaded from the kernel
				2900	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2901	DWORD. Having CP load it once avoids loading it at the beginning of every
				2902	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2903	SIZE.
Tony Tye	2f2cb6b	2017-11-10 20:51:43 +0000	[diff] [blame]	2904
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2905	GFX9
				2906	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2907	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2908	prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2909	pair for use as the flat scratch base in flat memory instructions.
				2910
				2911	.. _amdgpu-amdhsa-memory-model:
				2912
				2913	Memory Model
				2914	~~~~~~~~~~~~
				2915
				2916	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2917	(see :ref:`memmodel`). The implementation is WIP.
				2918
				2919	.. TODO
				2920	Update when implementation complete.
				2921
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2922	The AMDGPU backend supports the memory synchronization scopes specified in
				2923	:ref:`amdgpu-memory-scopes`.
				2924
				2925	The code sequences used to implement the memory model are defined in table
				2926	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2927
				2928	The sequences specify the order of instructions that a single thread must
				2929	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2930	to other memory instructions executed by the same thread. This allows them to be
				2931	moved earlier or later which can allow them to be combined with other instances
				2932	of the same instruction, or hoisted/sunk out of loops to improve
				2933	performance. Only the instructions related to the memory model are given;
				2934	additional ``s_waitcnt`` instructions are required to ensure registers are
				2935	defined before being used. These may be able to be combined with the memory
				2936	model ``s_waitcnt`` instructions as described above.
				2937
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2938	The AMDGPU backend supports the following memory models:
				2939
				2940	HSA Memory Model [HSA]_
				2941	The HSA memory model uses a single happens-before relation for all address
				2942	spaces (see :ref:`amdgpu-address-spaces`).
				2943	OpenCL Memory Model [OpenCL]_
				2944	The OpenCL memory model which has separate happens-before relations for the
				2945	global and local address spaces. Only a fence specifying both global and
				2946	local address space, and seq_cst instructions join the relationships. Since
				2947	the LLVM ``memfence`` instruction does not allow an address space to be
				2948	specified the OpenCL fence has to convervatively assume both local and
				2949	global address space was specified. However, optimizations can often be
				2950	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2951	no intervening memory instructions which access the corresponding address
				2952	space. The code sequences in the table indicate what can be omitted for the
				2953	OpenCL memory. The target triple environment is used to determine if the
				2954	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2955
				2956	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2957	operations.
				2958
				2959	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2960	termed vector memory operations.
				2961
				2962	For GFX6-GFX9:
				2963
				2964	* Each agent has multiple compute units (CU).
				2965	* Each CU has multiple SIMDs that execute wavefronts.
				2966	* The wavefronts for a single work-group are executed in the same CU but may be
				2967	executed by different SIMDs.
				2968	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2969	executing on it.
				2970	* All LDS operations of a CU are performed as wavefront wide operations in a
				2971	global order and involve no caching. Completion is reported to a wavefront in
				2972	execution order.
				2973	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2974	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2975	can be reordered relative to each other, which can result in reordering the
				2976	visibility of vector memory operations with respect to LDS operations of other
				2977	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e9d67e4	2017-06-26 02:45:39 +0000	[diff] [blame]	2978	ensure synchronization between LDS operations and vector memory operations
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2979	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2980	same wavefront.
				2981	* The vector memory operations are performed as wavefront wide operations and
				2982	completion is reported to a wavefront in execution order. The exception is
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	2983	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2984	vector memory order if they access LDS memory, and out of LDS operation order
				2985	if they access global memory.
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2986	* The vector memory operations access a single vector L1 cache shared by all
				2987	SIMDs a CU. Therefore, no special action is required for coherence between the
				2988	lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	2989	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	2990	executing in different work-groups as they may be executing on different CUs.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	2991	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2992	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2993	scalar operations are used in a restricted way so do not impact the memory
				2994	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2995	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2996	the same agent.
				2997	* The L2 cache has independent channels to service disjoint ranges of virtual
				2998	addresses.
				2999	* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	3000	scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3001	(which may be executing on different CUs) of an agent can be reordered
				3002	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e9d67e4	2017-06-26 02:45:39 +0000	[diff] [blame]	3003	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3004	previous vector memory operation has completed before executing a subsequent
				3005	vector memory or LDS operation and so can be used to meet the requirements of
				3006	acquire and release.
				3007	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				3008	of virtual addresses can be set up to bypass it to ensure system coherence.
				3009
Tony Tye	9000e8c	2017-11-10 01:00:54 +0000	[diff] [blame]	3010	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3011	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				3012	memory, atomic memory orderings are not meaningful and all accesses are treated
				3013	as non-atomic.
				3014
				3015	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				3016	scalar memory instructions). Since the constant address space contents do not
				3017	change during the execution of a kernel dispatch it is not legal to perform
				3018	stores, and atomic memory orderings are not meaningful and all access are
				3019	treated as non-atomic.
				3020
				3021	A memory synchronization scope wider than work-group is not meaningful for the
				3022	group (LDS) address space and is treated as work-group.
				3023
				3024	The memory model does not support the region address space which is treated as
				3025	non-atomic.
				3026
				3027	Acquire memory ordering is not meaningful on store atomic instructions and is
				3028	treated as non-atomic.
				3029
				3030	Release memory ordering is not meaningful on load atomic instructions and is
				3031	treated a non-atomic.
				3032
				3033	Acquire-release memory ordering is not meaningful on load or store atomic
				3034	instructions and is treated as acquire and release respectively.
				3035
				3036	AMDGPU backend only uses scalar memory operations to access memory that is
				3037	proven to not change during the execution of the kernel dispatch. This includes
				3038	constant address space and global address space for program scope const
				3039	variables. Therefore the kernel machine code does not have to maintain the
				3040	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				3041	and vector L1 caches are invalidated between kernel dispatches by CP since
				3042	constant address space data may change between kernel dispatch executions. See
				3043	:ref:`amdgpu-amdhsa-memory-spaces`.
				3044
Sylvestre Ledru	e9d67e4	2017-06-26 02:45:39 +0000	[diff] [blame]	3045	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3046	case the AMDGPU backend ensures the memory location used to spill is never
				3047	accessed by vector memory operations at the same time. If scalar writes are used
				3048	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				3049	return since the locations may be used for vector memory instructions by a
Tony Tye	636e223	2018-03-08 05:46:01 +0000	[diff] [blame]	3050	future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3051	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				3052	as all scalar writes are write-before-read in the same thread.
				3053
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3054	Scratch backing memory (which is used for the private address space)
				3055	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				3056	address space is only accessed by a single thread, and is always
				3057	write-before-read, there is never a need to invalidate these entries from the L1
				3058	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				3059	volatile cache lines.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3060
				3061	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3062	to invalidate the L2 cache. This also causes it to be treated as
				3063	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				3064	(cache coherent) and so the L2 cache will coherent with the CPU and other
				3065	agents.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3066
				3067	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				3068	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				3069
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3070	============ ============ ============== ========== ===============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3071	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				3072	Ordering Sync Scope Address
				3073	Space
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3074	============ ============ ============== ========== ===============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3075	Non-Atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3076	-----------------------------------------------------------------------------------
				3077	load none none - global - !volatile & !nontemporal
				3078	- generic
				3079	- private 1. buffer/global/flat_load
				3080	- constant
				3081	- volatile & !nontemporal
				3082
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3083	1. buffer/global/flat_load
				3084	glc=1
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3085
				3086	- nontemporal
				3087
				3088	1. buffer/global/flat_load
				3089	glc=1 slc=1
				3090
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3091	load none none - local 1. ds_load
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3092	store none none - global - !nontemporal
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3093	- generic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3094	- private 1. buffer/global/flat_store
				3095	- constant
				3096	- nontemporal
				3097
				3098	1. buffer/global/flat_stote
				3099	glc=1 slc=1
				3100
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3101	store none none - local 1. ds_store
				3102	Unordered Atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3103	-----------------------------------------------------------------------------------
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3104	load atomic unordered any any Same as non-atomic.
				3105	store atomic unordered any any Same as non-atomic.
				3106	atomicrmw unordered any any *Same as monotonic
				3107	atomic*.
				3108	Monotonic Atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3109	-----------------------------------------------------------------------------------
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3110	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				3111	- wavefront - generic
				3112	- workgroup
				3113	load atomic monotonic - singlethread - local 1. ds_load
				3114	- wavefront
				3115	- workgroup
				3116	load atomic monotonic - agent - global 1. buffer/global/flat_load
				3117	- system - generic glc=1
				3118	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				3119	- wavefront - generic
				3120	- workgroup
				3121	- agent
				3122	- system
				3123	store atomic monotonic - singlethread - local 1. ds_store
				3124	- wavefront
				3125	- workgroup
				3126	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				3127	- wavefront - generic
				3128	- workgroup
				3129	- agent
				3130	- system
				3131	atomicrmw monotonic - singlethread - local 1. ds_atomic
				3132	- wavefront
				3133	- workgroup
				3134	Acquire Atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3135	-----------------------------------------------------------------------------------
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3136	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				3137	- wavefront - local
				3138	- generic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3139	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				3140	load atomic acquire - workgroup - local 1. ds_load
				3141	2. s_waitcnt lgkmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3142
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3143	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3144	- Must happen before
				3145	any following
				3146	global/generic
				3147	load/load
				3148	atomic/store/store
				3149	atomic/atomicrmw.
				3150	- Ensures any
				3151	following global
				3152	data read is no
				3153	older than the load
				3154	atomic value being
				3155	acquired.
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3156	load atomic acquire - workgroup - generic 1. flat_load
				3157	2. s_waitcnt lgkmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3158
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3159	- If OpenCL, omit.
				3160	- Must happen before
				3161	any following
				3162	global/generic
				3163	load/load
				3164	atomic/store/store
				3165	atomic/atomicrmw.
				3166	- Ensures any
				3167	following global
				3168	data read is no
				3169	older than the load
				3170	atomic value being
				3171	acquired.
				3172	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3173	- system glc=1
				3174	2. s_waitcnt vmcnt(0)
				3175
				3176	- Must happen before
				3177	following
				3178	buffer_wbinvl1_vol.
				3179	- Ensures the load
				3180	has completed
				3181	before invalidating
				3182	the cache.
				3183
				3184	3. buffer_wbinvl1_vol
				3185
				3186	- Must happen before
				3187	any following
				3188	global/generic
				3189	load/load
				3190	atomic/atomicrmw.
				3191	- Ensures that
				3192	following
				3193	loads will not see
				3194	stale global data.
				3195
				3196	load atomic acquire - agent - generic 1. flat_load glc=1
				3197	- system 2. s_waitcnt vmcnt(0) &
				3198	lgkmcnt(0)
				3199
				3200	- If OpenCL omit
				3201	lgkmcnt(0).
				3202	- Must happen before
				3203	following
				3204	buffer_wbinvl1_vol.
				3205	- Ensures the flat_load
				3206	has completed
				3207	before invalidating
				3208	the cache.
				3209
				3210	3. buffer_wbinvl1_vol
				3211
				3212	- Must happen before
				3213	any following
				3214	global/generic
				3215	load/load
				3216	atomic/atomicrmw.
				3217	- Ensures that
				3218	following loads
				3219	will not see stale
				3220	global data.
				3221
				3222	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				3223	- wavefront - local
				3224	- generic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3225	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				3226	atomicrmw acquire - workgroup - local 1. ds_atomic
				3227	2. waitcnt lgkmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3228
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3229	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3230	- Must happen before
				3231	any following
				3232	global/generic
				3233	load/load
				3234	atomic/store/store
				3235	atomic/atomicrmw.
				3236	- Ensures any
				3237	following global
				3238	data read is no
				3239	older than the
				3240	atomicrmw value
				3241	being acquired.
				3242
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3243	atomicrmw acquire - workgroup - generic 1. flat_atomic
				3244	2. waitcnt lgkmcnt(0)
				3245
				3246	- If OpenCL, omit.
				3247	- Must happen before
				3248	any following
				3249	global/generic
				3250	load/load
				3251	atomic/store/store
				3252	atomic/atomicrmw.
				3253	- Ensures any
				3254	following global
				3255	data read is no
				3256	older than the
				3257	atomicrmw value
				3258	being acquired.
				3259
				3260	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3261	- system 2. s_waitcnt vmcnt(0)
				3262
				3263	- Must happen before
				3264	following
				3265	buffer_wbinvl1_vol.
				3266	- Ensures the
				3267	atomicrmw has
				3268	completed before
				3269	invalidating the
				3270	cache.
				3271
				3272	3. buffer_wbinvl1_vol
				3273
				3274	- Must happen before
				3275	any following
				3276	global/generic
				3277	load/load
				3278	atomic/atomicrmw.
				3279	- Ensures that
				3280	following loads
				3281	will not see stale
				3282	global data.
				3283
				3284	atomicrmw acquire - agent - generic 1. flat_atomic
				3285	- system 2. s_waitcnt vmcnt(0) &
				3286	lgkmcnt(0)
				3287
				3288	- If OpenCL, omit
				3289	lgkmcnt(0).
				3290	- Must happen before
				3291	following
				3292	buffer_wbinvl1_vol.
				3293	- Ensures the
				3294	atomicrmw has
				3295	completed before
				3296	invalidating the
				3297	cache.
				3298
				3299	3. buffer_wbinvl1_vol
				3300
				3301	- Must happen before
				3302	any following
				3303	global/generic
				3304	load/load
				3305	atomic/atomicrmw.
				3306	- Ensures that
				3307	following loads
				3308	will not see stale
				3309	global data.
				3310
				3311	fence acquire - singlethread none none
				3312	- wavefront
				3313	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				3314
				3315	- If OpenCL and
				3316	address space is
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3317	not generic, omit.
				3318	- However, since LLVM
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3319	currently has no
				3320	address space on
				3321	the fence need to
				3322	conservatively
				3323	always generate. If
				3324	fence had an
				3325	address space then
				3326	set to address
				3327	space of OpenCL
				3328	fence flag, or to
				3329	generic if both
				3330	local and global
				3331	flags are
				3332	specified.
				3333	- Must happen after
				3334	any preceding
				3335	local/generic load
				3336	atomic/atomicrmw
				3337	with an equal or
				3338	wider sync scope
				3339	and memory ordering
				3340	stronger than
				3341	unordered (this is
				3342	termed the
				3343	fence-paired-atomic).
				3344	- Must happen before
				3345	any following
				3346	global/generic
				3347	load/load
				3348	atomic/store/store
				3349	atomic/atomicrmw.
				3350	- Ensures any
				3351	following global
				3352	data read is no
				3353	older than the
				3354	value read by the
				3355	fence-paired-atomic.
				3356
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3357	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				3358	- system vmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3359
				3360	- If OpenCL and
				3361	address space is
				3362	not generic, omit
				3363	lgkmcnt(0).
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3364	- However, since LLVM
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3365	currently has no
				3366	address space on
				3367	the fence need to
				3368	conservatively
				3369	always generate
				3370	(see comment for
				3371	previous fence).
Tony Tye	61c7e03	2017-06-07 00:08:35 +0000	[diff] [blame]	3372	- Could be split into
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3373	separate s_waitcnt
				3374	vmcnt(0) and
				3375	s_waitcnt
				3376	lgkmcnt(0) to allow
				3377	them to be
				3378	independently moved
				3379	according to the
				3380	following rules.
				3381	- s_waitcnt vmcnt(0)
				3382	must happen after
				3383	any preceding
				3384	global/generic load
				3385	atomic/atomicrmw
				3386	with an equal or
				3387	wider sync scope
				3388	and memory ordering
				3389	stronger than
				3390	unordered (this is
				3391	termed the
				3392	fence-paired-atomic).
				3393	- s_waitcnt lgkmcnt(0)
				3394	must happen after
				3395	any preceding
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3396	local/generic load
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3397	atomic/atomicrmw
				3398	with an equal or
				3399	wider sync scope
				3400	and memory ordering
				3401	stronger than
				3402	unordered (this is
				3403	termed the
				3404	fence-paired-atomic).
				3405	- Must happen before
				3406	the following
				3407	buffer_wbinvl1_vol.
				3408	- Ensures that the
				3409	fence-paired atomic
				3410	has completed
				3411	before invalidating
				3412	the
				3413	cache. Therefore
				3414	any following
				3415	locations read must
				3416	be no older than
				3417	the value read by
				3418	the
				3419	fence-paired-atomic.
				3420
				3421	2. buffer_wbinvl1_vol
				3422
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3423	- Must happen before any
				3424	following global/generic
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3425	load/load
				3426	atomic/store/store
				3427	atomic/atomicrmw.
				3428	- Ensures that
				3429	following loads
				3430	will not see stale
				3431	global data.
				3432
				3433	Release Atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3434	-----------------------------------------------------------------------------------
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3435	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				3436	- wavefront - local
				3437	- generic
				3438	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3439
				3440	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3441	- Must happen after
				3442	any preceding
				3443	local/generic
				3444	load/store/load
				3445	atomic/store
				3446	atomic/atomicrmw.
				3447	- Must happen before
				3448	the following
				3449	store.
				3450	- Ensures that all
				3451	memory operations
				3452	to local have
				3453	completed before
				3454	performing the
				3455	store that is being
				3456	released.
				3457
				3458	2. buffer/global/flat_store
				3459	store atomic release - workgroup - local 1. ds_store
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3460	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3461
				3462	- If OpenCL, omit.
				3463	- Must happen after
				3464	any preceding
				3465	local/generic
				3466	load/store/load
				3467	atomic/store
				3468	atomic/atomicrmw.
				3469	- Must happen before
				3470	the following
				3471	store.
				3472	- Ensures that all
				3473	memory operations
				3474	to local have
				3475	completed before
				3476	performing the
				3477	store that is being
				3478	released.
				3479
				3480	2. flat_store
				3481	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3482	- system - generic vmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3483
				3484	- If OpenCL, omit
				3485	lgkmcnt(0).
				3486	- Could be split into
				3487	separate s_waitcnt
				3488	vmcnt(0) and
				3489	s_waitcnt
				3490	lgkmcnt(0) to allow
				3491	them to be
				3492	independently moved
				3493	according to the
				3494	following rules.
				3495	- s_waitcnt vmcnt(0)
				3496	must happen after
				3497	any preceding
				3498	global/generic
				3499	load/store/load
				3500	atomic/store
				3501	atomic/atomicrmw.
				3502	- s_waitcnt lgkmcnt(0)
				3503	must happen after
				3504	any preceding
				3505	local/generic
				3506	load/store/load
				3507	atomic/store
				3508	atomic/atomicrmw.
				3509	- Must happen before
				3510	the following
				3511	store.
				3512	- Ensures that all
				3513	memory operations
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3514	to memory have
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3515	completed before
				3516	performing the
				3517	store that is being
				3518	released.
				3519
				3520	2. buffer/global/ds/flat_store
				3521	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				3522	- wavefront - local
				3523	- generic
				3524	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3525
				3526	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3527	- Must happen after
				3528	any preceding
				3529	local/generic
				3530	load/store/load
				3531	atomic/store
				3532	atomic/atomicrmw.
				3533	- Must happen before
				3534	the following
				3535	atomicrmw.
				3536	- Ensures that all
				3537	memory operations
				3538	to local have
				3539	completed before
				3540	performing the
				3541	atomicrmw that is
				3542	being released.
				3543
				3544	2. buffer/global/flat_atomic
				3545	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3546	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3547
				3548	- If OpenCL, omit.
				3549	- Must happen after
				3550	any preceding
				3551	local/generic
				3552	load/store/load
				3553	atomic/store
				3554	atomic/atomicrmw.
				3555	- Must happen before
				3556	the following
				3557	atomicrmw.
				3558	- Ensures that all
				3559	memory operations
				3560	to local have
				3561	completed before
				3562	performing the
				3563	atomicrmw that is
				3564	being released.
				3565
				3566	2. flat_atomic
				3567	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3568	- system - generic vmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3569
				3570	- If OpenCL, omit
				3571	lgkmcnt(0).
				3572	- Could be split into
				3573	separate s_waitcnt
				3574	vmcnt(0) and
				3575	s_waitcnt
				3576	lgkmcnt(0) to allow
				3577	them to be
				3578	independently moved
				3579	according to the
				3580	following rules.
				3581	- s_waitcnt vmcnt(0)
				3582	must happen after
				3583	any preceding
				3584	global/generic
				3585	load/store/load
				3586	atomic/store
				3587	atomic/atomicrmw.
				3588	- s_waitcnt lgkmcnt(0)
				3589	must happen after
				3590	any preceding
				3591	local/generic
				3592	load/store/load
				3593	atomic/store
				3594	atomic/atomicrmw.
				3595	- Must happen before
				3596	the following
				3597	atomicrmw.
				3598	- Ensures that all
				3599	memory operations
				3600	to global and local
				3601	have completed
				3602	before performing
				3603	the atomicrmw that
				3604	is being released.
				3605
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3606	2. buffer/global/ds/flat_atomic
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3607	fence release - singlethread none none
				3608	- wavefront
				3609	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3610
				3611	- If OpenCL and
				3612	address space is
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3613	not generic, omit.
				3614	- However, since LLVM
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3615	currently has no
				3616	address space on
				3617	the fence need to
				3618	conservatively
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3619	always generate. If
				3620	fence had an
				3621	address space then
				3622	set to address
				3623	space of OpenCL
				3624	fence flag, or to
				3625	generic if both
				3626	local and global
				3627	flags are
				3628	specified.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3629	- Must happen after
				3630	any preceding
				3631	local/generic
				3632	load/load
				3633	atomic/store/store
				3634	atomic/atomicrmw.
				3635	- Must happen before
				3636	any following store
				3637	atomic/atomicrmw
				3638	with an equal or
				3639	wider sync scope
				3640	and memory ordering
				3641	stronger than
				3642	unordered (this is
				3643	termed the
				3644	fence-paired-atomic).
				3645	- Ensures that all
				3646	memory operations
				3647	to local have
				3648	completed before
				3649	performing the
				3650	following
				3651	fence-paired-atomic.
				3652
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3653	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3654	- system vmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3655
				3656	- If OpenCL and
				3657	address space is
				3658	not generic, omit
				3659	lgkmcnt(0).
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3660	- If OpenCL and
				3661	address space is
				3662	local, omit
				3663	vmcnt(0).
				3664	- However, since LLVM
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3665	currently has no
				3666	address space on
				3667	the fence need to
				3668	conservatively
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3669	always generate. If
				3670	fence had an
				3671	address space then
				3672	set to address
				3673	space of OpenCL
				3674	fence flag, or to
				3675	generic if both
				3676	local and global
				3677	flags are
				3678	specified.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3679	- Could be split into
				3680	separate s_waitcnt
				3681	vmcnt(0) and
				3682	s_waitcnt
				3683	lgkmcnt(0) to allow
				3684	them to be
				3685	independently moved
				3686	according to the
				3687	following rules.
				3688	- s_waitcnt vmcnt(0)
				3689	must happen after
				3690	any preceding
				3691	global/generic
				3692	load/store/load
				3693	atomic/store
				3694	atomic/atomicrmw.
				3695	- s_waitcnt lgkmcnt(0)
				3696	must happen after
				3697	any preceding
				3698	local/generic
				3699	load/store/load
				3700	atomic/store
				3701	atomic/atomicrmw.
				3702	- Must happen before
				3703	any following store
				3704	atomic/atomicrmw
				3705	with an equal or
				3706	wider sync scope
				3707	and memory ordering
				3708	stronger than
				3709	unordered (this is
				3710	termed the
				3711	fence-paired-atomic).
				3712	- Ensures that all
				3713	memory operations
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3714	have
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3715	completed before
				3716	performing the
				3717	following
				3718	fence-paired-atomic.
				3719
				3720	Acquire-Release Atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3721	-----------------------------------------------------------------------------------
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3722	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3723	- wavefront - local
				3724	- generic
				3725	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3726
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3727	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3728	- Must happen after
				3729	any preceding
				3730	local/generic
				3731	load/store/load
				3732	atomic/store
				3733	atomic/atomicrmw.
				3734	- Must happen before
				3735	the following
				3736	atomicrmw.
				3737	- Ensures that all
				3738	memory operations
				3739	to local have
				3740	completed before
				3741	performing the
				3742	atomicrmw that is
				3743	being released.
				3744
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3745	2. buffer/global/flat_atomic
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3746	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3747	2. s_waitcnt lgkmcnt(0)
				3748
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3749	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3750	- Must happen before
				3751	any following
				3752	global/generic
				3753	load/load
				3754	atomic/store/store
				3755	atomic/atomicrmw.
				3756	- Ensures any
				3757	following global
				3758	data read is no
				3759	older than the load
				3760	atomic value being
				3761	acquired.
				3762
				3763	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3764
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3765	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3766	- Must happen after
				3767	any preceding
				3768	local/generic
				3769	load/store/load
				3770	atomic/store
				3771	atomic/atomicrmw.
				3772	- Must happen before
				3773	the following
				3774	atomicrmw.
				3775	- Ensures that all
				3776	memory operations
				3777	to local have
				3778	completed before
				3779	performing the
				3780	atomicrmw that is
				3781	being released.
				3782
				3783	2. flat_atomic
				3784	3. s_waitcnt lgkmcnt(0)
				3785
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3786	- If OpenCL, omit.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3787	- Must happen before
				3788	any following
				3789	global/generic
				3790	load/load
				3791	atomic/store/store
				3792	atomic/atomicrmw.
				3793	- Ensures any
				3794	following global
				3795	data read is no
				3796	older than the load
				3797	atomic value being
				3798	acquired.
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3799
				3800	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3801	- system vmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3802
				3803	- If OpenCL, omit
				3804	lgkmcnt(0).
				3805	- Could be split into
				3806	separate s_waitcnt
				3807	vmcnt(0) and
				3808	s_waitcnt
				3809	lgkmcnt(0) to allow
				3810	them to be
				3811	independently moved
				3812	according to the
				3813	following rules.
				3814	- s_waitcnt vmcnt(0)
				3815	must happen after
				3816	any preceding
				3817	global/generic
				3818	load/store/load
				3819	atomic/store
				3820	atomic/atomicrmw.
				3821	- s_waitcnt lgkmcnt(0)
				3822	must happen after
				3823	any preceding
				3824	local/generic
				3825	load/store/load
				3826	atomic/store
				3827	atomic/atomicrmw.
				3828	- Must happen before
				3829	the following
				3830	atomicrmw.
				3831	- Ensures that all
				3832	memory operations
				3833	to global have
				3834	completed before
				3835	performing the
				3836	atomicrmw that is
				3837	being released.
				3838
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3839	2. buffer/global/flat_atomic
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3840	3. s_waitcnt vmcnt(0)
				3841
				3842	- Must happen before
				3843	following
				3844	buffer_wbinvl1_vol.
				3845	- Ensures the
				3846	atomicrmw has
				3847	completed before
				3848	invalidating the
				3849	cache.
				3850
				3851	4. buffer_wbinvl1_vol
				3852
				3853	- Must happen before
				3854	any following
				3855	global/generic
				3856	load/load
				3857	atomic/atomicrmw.
				3858	- Ensures that
				3859	following loads
				3860	will not see stale
				3861	global data.
				3862
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3863	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3864	- system vmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3865
				3866	- If OpenCL, omit
				3867	lgkmcnt(0).
				3868	- Could be split into
				3869	separate s_waitcnt
				3870	vmcnt(0) and
				3871	s_waitcnt
				3872	lgkmcnt(0) to allow
				3873	them to be
				3874	independently moved
				3875	according to the
				3876	following rules.
				3877	- s_waitcnt vmcnt(0)
				3878	must happen after
				3879	any preceding
				3880	global/generic
				3881	load/store/load
				3882	atomic/store
				3883	atomic/atomicrmw.
				3884	- s_waitcnt lgkmcnt(0)
				3885	must happen after
				3886	any preceding
				3887	local/generic
				3888	load/store/load
				3889	atomic/store
				3890	atomic/atomicrmw.
				3891	- Must happen before
				3892	the following
				3893	atomicrmw.
				3894	- Ensures that all
				3895	memory operations
				3896	to global have
				3897	completed before
				3898	performing the
				3899	atomicrmw that is
				3900	being released.
				3901
				3902	2. flat_atomic
				3903	3. s_waitcnt vmcnt(0) &
				3904	lgkmcnt(0)
				3905
				3906	- If OpenCL, omit
				3907	lgkmcnt(0).
				3908	- Must happen before
				3909	following
				3910	buffer_wbinvl1_vol.
				3911	- Ensures the
				3912	atomicrmw has
				3913	completed before
				3914	invalidating the
				3915	cache.
				3916
				3917	4. buffer_wbinvl1_vol
				3918
				3919	- Must happen before
				3920	any following
				3921	global/generic
				3922	load/load
				3923	atomic/atomicrmw.
				3924	- Ensures that
				3925	following loads
				3926	will not see stale
				3927	global data.
				3928
				3929	fence acq_rel - singlethread none none
				3930	- wavefront
				3931	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3932
				3933	- If OpenCL and
				3934	address space is
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3935	not generic, omit.
				3936	- However,
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3937	since LLVM
				3938	currently has no
				3939	address space on
				3940	the fence need to
				3941	conservatively
				3942	always generate
				3943	(see comment for
				3944	previous fence).
				3945	- Must happen after
				3946	any preceding
				3947	local/generic
				3948	load/load
				3949	atomic/store/store
				3950	atomic/atomicrmw.
				3951	- Must happen before
				3952	any following
				3953	global/generic
				3954	load/load
				3955	atomic/store/store
				3956	atomic/atomicrmw.
				3957	- Ensures that all
				3958	memory operations
				3959	to local have
				3960	completed before
				3961	performing any
				3962	following global
				3963	memory operations.
				3964	- Ensures that the
				3965	preceding
				3966	local/generic load
				3967	atomic/atomicrmw
				3968	with an equal or
				3969	wider sync scope
				3970	and memory ordering
				3971	stronger than
				3972	unordered (this is
				3973	termed the
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3974	acquire-fence-paired-atomic
				3975	) has completed
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3976	before following
				3977	global memory
				3978	operations. This
				3979	satisfies the
				3980	requirements of
				3981	acquire.
				3982	- Ensures that all
				3983	previous memory
				3984	operations have
				3985	completed before a
				3986	following
				3987	local/generic store
				3988	atomic/atomicrmw
				3989	with an equal or
				3990	wider sync scope
				3991	and memory ordering
				3992	stronger than
				3993	unordered (this is
				3994	termed the
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	3995	release-fence-paired-atomic
				3996	). This satisfies the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	3997	requirements of
				3998	release.
				3999
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4000	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				4001	- system vmcnt(0)
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4002
				4003	- If OpenCL and
				4004	address space is
				4005	not generic, omit
				4006	lgkmcnt(0).
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4007	- However, since LLVM
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4008	currently has no
				4009	address space on
				4010	the fence need to
				4011	conservatively
				4012	always generate
				4013	(see comment for
				4014	previous fence).
				4015	- Could be split into
				4016	separate s_waitcnt
				4017	vmcnt(0) and
				4018	s_waitcnt
				4019	lgkmcnt(0) to allow
				4020	them to be
				4021	independently moved
				4022	according to the
				4023	following rules.
				4024	- s_waitcnt vmcnt(0)
				4025	must happen after
				4026	any preceding
				4027	global/generic
				4028	load/store/load
				4029	atomic/store
				4030	atomic/atomicrmw.
				4031	- s_waitcnt lgkmcnt(0)
				4032	must happen after
				4033	any preceding
				4034	local/generic
				4035	load/store/load
				4036	atomic/store
				4037	atomic/atomicrmw.
				4038	- Must happen before
				4039	the following
				4040	buffer_wbinvl1_vol.
				4041	- Ensures that the
				4042	preceding
				4043	global/local/generic
				4044	load
				4045	atomic/atomicrmw
				4046	with an equal or
				4047	wider sync scope
				4048	and memory ordering
				4049	stronger than
				4050	unordered (this is
				4051	termed the
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4052	acquire-fence-paired-atomic
				4053	) has completed
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4054	before invalidating
				4055	the cache. This
				4056	satisfies the
				4057	requirements of
				4058	acquire.
				4059	- Ensures that all
				4060	previous memory
				4061	operations have
				4062	completed before a
				4063	following
				4064	global/local/generic
				4065	store
				4066	atomic/atomicrmw
				4067	with an equal or
				4068	wider sync scope
				4069	and memory ordering
				4070	stronger than
				4071	unordered (this is
				4072	termed the
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4073	release-fence-paired-atomic
				4074	). This satisfies the
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4075	requirements of
				4076	release.
				4077
				4078	2. buffer_wbinvl1_vol
				4079
				4080	- Must happen before
				4081	any following
				4082	global/generic
				4083	load/load
				4084	atomic/store/store
				4085	atomic/atomicrmw.
				4086	- Ensures that
				4087	following loads
				4088	will not see stale
				4089	global data. This
				4090	satisfies the
				4091	requirements of
				4092	acquire.
				4093
				4094	Sequential Consistent Atomic
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4095	-----------------------------------------------------------------------------------
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4096	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4097	- wavefront - local load atomic acquire,
				4098	- generic except must generated
				4099	all instructions even
				4100	for OpenCL.*
				4101	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				4102	- generic
				4103	- Must
				4104	happen after
				4105	preceding
				4106	global/generic load
				4107	atomic/store
				4108	atomic/atomicrmw
				4109	with memory
				4110	ordering of seq_cst
				4111	and with equal or
				4112	wider sync scope.
				4113	(Note that seq_cst
				4114	fences have their
				4115	own s_waitcnt
				4116	lgkmcnt(0) and so do
				4117	not need to be
				4118	considered.)
				4119	- Ensures any
				4120	preceding
				4121	sequential
				4122	consistent local
				4123	memory instructions
				4124	have completed
				4125	before executing
				4126	this sequentially
				4127	consistent
				4128	instruction. This
				4129	prevents reordering
				4130	a seq_cst store
				4131	followed by a
				4132	seq_cst load. (Note
				4133	that seq_cst is
				4134	stronger than
				4135	acquire/release as
				4136	the reordering of
				4137	load acquire
				4138	followed by a store
				4139	release is
				4140	prevented by the
				4141	waitcnt of
				4142	the release, but
				4143	there is nothing
				4144	preventing a store
				4145	release followed by
				4146	load acquire from
				4147	competing out of
				4148	order.)
				4149
				4150	2. *Following
				4151	instructions same as
				4152	corresponding load
				4153	atomic acquire,
				4154	except must generated
				4155	all instructions even
				4156	for OpenCL.*
				4157	load atomic seq_cst - workgroup - local *Same as corresponding
				4158	load atomic acquire,
				4159	except must generated
				4160	all instructions even
				4161	for OpenCL.*
				4162	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				4163	- system - generic vmcnt(0)
				4164
				4165	- Could be split into
				4166	separate s_waitcnt
				4167	vmcnt(0)
				4168	and s_waitcnt
				4169	lgkmcnt(0) to allow
				4170	them to be
				4171	independently moved
				4172	according to the
				4173	following rules.
				4174	- waitcnt lgkmcnt(0)
				4175	must happen after
				4176	preceding
				4177	global/generic load
				4178	atomic/store
				4179	atomic/atomicrmw
				4180	with memory
				4181	ordering of seq_cst
				4182	and with equal or
				4183	wider sync scope.
				4184	(Note that seq_cst
				4185	fences have their
				4186	own s_waitcnt
				4187	lgkmcnt(0) and so do
				4188	not need to be
				4189	considered.)
				4190	- waitcnt vmcnt(0)
				4191	must happen after
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4192	preceding
				4193	global/generic load
				4194	atomic/store
				4195	atomic/atomicrmw
				4196	with memory
				4197	ordering of seq_cst
				4198	and with equal or
				4199	wider sync scope.
				4200	(Note that seq_cst
				4201	fences have their
				4202	own s_waitcnt
				4203	vmcnt(0) and so do
				4204	not need to be
				4205	considered.)
				4206	- Ensures any
				4207	preceding
				4208	sequential
				4209	consistent global
				4210	memory instructions
				4211	have completed
				4212	before executing
				4213	this sequentially
				4214	consistent
				4215	instruction. This
				4216	prevents reordering
				4217	a seq_cst store
				4218	followed by a
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4219	seq_cst load. (Note
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4220	that seq_cst is
				4221	stronger than
				4222	acquire/release as
				4223	the reordering of
				4224	load acquire
				4225	followed by a store
				4226	release is
				4227	prevented by the
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4228	waitcnt of
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4229	the release, but
				4230	there is nothing
				4231	preventing a store
				4232	release followed by
				4233	load acquire from
				4234	competing out of
				4235	order.)
				4236
				4237	2. *Following
				4238	instructions same as
				4239	corresponding load
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4240	atomic acquire,
				4241	except must generated
				4242	all instructions even
				4243	for OpenCL.*
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4244	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4245	- wavefront - local store atomic release,
				4246	- workgroup - generic except must generated
				4247	all instructions even
				4248	for OpenCL.*
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4249	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4250	- system - generic store atomic release,
				4251	except must generated
				4252	all instructions even
				4253	for OpenCL.*
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4254	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4255	- wavefront - local atomicrmw acq_rel,
				4256	- workgroup - generic except must generated
				4257	all instructions even
				4258	for OpenCL.*
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4259	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4260	- system - generic atomicrmw acq_rel,
				4261	except must generated
				4262	all instructions even
				4263	for OpenCL.*
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4264	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	0a09220	2017-10-18 22:16:55 +0000	[diff] [blame]	4265	- wavefront fence acq_rel,
				4266	- workgroup except must generated
				4267	- agent all instructions even
				4268	- system for OpenCL.*
				4269	============ ============ ============== ========== ===============================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4270
				4271	The memory order also adds the single thread optimization constrains defined in
				4272	table
				4273	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				4274
				4275	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				4276	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				4277
				4278	============ ==============================================================
				4279	LLVM Memory Optimization Constraints
				4280	Ordering
				4281	============ ==============================================================
				4282	unordered none
				4283	monotonic none
				4284	acquire - If a load atomic/atomicrmw then no following load/load
				4285	atomic/store/ store atomic/atomicrmw/fence instruction can
				4286	be moved before the acquire.
				4287	- If a fence then same as load atomic, plus no preceding
				4288	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e9d67e4	2017-06-26 02:45:39 +0000	[diff] [blame]	4289	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4290	atomic/store/ store atomic/atomicrmw/fence instruction can
				4291	be moved after the release.
				4292	- If a fence then same as store atomic, plus no following
				4293	associated fence-paired-atomic can be moved before the
				4294	fence.
				4295	acq_rel Same constraints as both acquire and release.
				4296	seq_cst - If a load atomic then same constraints as acquire, plus no
				4297	preceding sequentially consistent load atomic/store
				4298	atomic/atomicrmw/fence instruction can be moved after the
				4299	seq_cst.
				4300	- If a store atomic then the same constraints as release, plus
				4301	no following sequentially consistent load atomic/store
				4302	atomic/atomicrmw/fence instruction can be moved before the
				4303	seq_cst.
				4304	- If an atomicrmw/fence then same constraints as acq_rel.
				4305	============ ==============================================================
Konstantin Zhuravlyov	2cee5cc	2017-03-08 23:55:44 +0000	[diff] [blame]	4306
Wei Ding	8d22e6f	2017-02-21 18:48:01 +0000	[diff] [blame]	4307	Trap Handler ABI
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4308	~~~~~~~~~~~~~~~~
Wei Ding	8d22e6f	2017-02-21 18:48:01 +0000	[diff] [blame]	4309
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4310	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				4311	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				4312	the ``s_trap`` instruction with the following usage:
Wei Ding	8d22e6f	2017-02-21 18:48:01 +0000	[diff] [blame]	4313
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4314	.. table:: AMDGPU Trap Handler for AMDHSA OS
				4315	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	8d22e6f	2017-02-21 18:48:01 +0000	[diff] [blame]	4316
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4317	=================== =============== =============== =======================
				4318	Usage Code Sequence Trap Handler Description
				4319	Inputs
				4320	=================== =============== =============== =======================
				4321	reserved ``s_trap 0x00`` Reserved by hardware.
				4322	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				4323	``queue_ptr`` ``debugtrap``
				4324	``VGPR0``: intrinsic (not
				4325	``arg`` implemented).
				4326	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				4327	``queue_ptr`` terminated and its
				4328	associated queue put
				4329	into the error state.
Tony Tye	b994738	2018-05-16 16:19:34 +0000	[diff] [blame]	4330	``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
				4331	installed then
				4332	behaves as a
				4333	no-operation. The
				4334	trap handler is
				4335	entered and
				4336	immediately returns
				4337	to continue
				4338	execution of the
				4339	wavefront.
				4340	- If the debugger is
				4341	installed, causes
				4342	the debug trap to be
				4343	reported by the
				4344	debugger and the
				4345	wavefront is put in
				4346	the halt state until
				4347	resumed by the
				4348	debugger.
				4349	reserved ``s_trap 0x04`` Reserved.
				4350	reserved ``s_trap 0x05`` Reserved.
				4351	reserved ``s_trap 0x06`` Reserved.
				4352	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4353	breakpoints.
Tony Tye	b994738	2018-05-16 16:19:34 +0000	[diff] [blame]	4354	reserved ``s_trap 0x08`` Reserved.
				4355	reserved ``s_trap 0xfe`` Reserved.
				4356	reserved ``s_trap 0xff`` Reserved.
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4357	=================== =============== =============== =======================
Wei Ding	8d22e6f	2017-02-21 18:48:01 +0000	[diff] [blame]	4358
Tim Corringham	c5e0b13	2018-04-04 13:02:09 +0000	[diff] [blame]	4359	AMDPAL
				4360	------
				4361
				4362	This section provides code conventions used when the target triple OS is
				4363	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				4364	from the application/runtime to each invocation of a hardware shader. These
				4365	parameters include both generic, application-controlled parameters called
				4366	user data as well as system-generated parameters that are a product of the
				4367	draw or dispatch execution.
				4368
				4369	User Data
				4370	~~~~~~~~~
				4371
				4372	Each hardware stage has a set of 32-bit user data registers which can be
				4373	written from a command buffer and then loaded into SGPRs when waves are launched
				4374	via a subsequent dispatch or draw operation. This is the way most arguments are
				4375	passed from the application/runtime to a hardware shader.
				4376
				4377	Compute User Data
				4378	~~~~~~~~~~~~~~~~~
				4379
				4380	Compute shader user data mappings are simpler than graphics shaders, and have a
				4381	fixed mapping.
				4382
				4383	Note that there are always 10 available user data entries in registers -
				4384	entries beyond that limit must be fetched from memory (via the spill table
				4385	pointer) by the shader.
				4386
				4387	.. table:: PAL Compute Shader User Data Registers
				4388	:name: pal-compute-user-data-registers
				4389
				4390	============= ================================
				4391	User Register Description
				4392	============= ================================
				4393	0 Global Internal Table (32-bit pointer)
				4394	1 Per-Shader Internal Table (32-bit pointer)
				4395	2 - 11 Application-Controlled User Data (10 32-bit values)
				4396	12 Spill Table (32-bit pointer)
				4397	13 - 14 Thread Group Count (64-bit pointer)
				4398	15 GDS Range
				4399	============= ================================
				4400
				4401	Graphics User Data
				4402	~~~~~~~~~~~~~~~~~~
				4403
				4404	Graphics pipelines support a much more flexible user data mapping:
				4405
				4406	.. table:: PAL Graphics Shader User Data Registers
				4407	:name: pal-graphics-user-data-registers
				4408
				4409	============= ================================
				4410	User Register Description
				4411	============= ================================
				4412	0 Global Internal Table (32-bit pointer)
				4413	+ Per-Shader Internal Table (32-bit pointer)
				4414	+ 1-15 Application Controlled User Data
				4415	(1-15 Contiguous 32-bit Values in Registers)
				4416	+ Spill Table (32-bit pointer)
				4417	+ Draw Index (First Stage Only)
				4418	+ Vertex Offset (First Stage Only)
				4419	+ Instance Offset (First Stage Only)
				4420	============= ================================
				4421
				4422	The placement of the global internal table remains fixed in the first *user
				4423	data SGPR register*. Otherwise all parameters are optional, and can be mapped
				4424	to any desired user data SGPR register, with the following regstrictions:
				4425
				4426	* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
				4427	activehardware stage in a graphics pipeline (i.e. where the API vertex
				4428	shader runs).
				4429
				4430	* Application-controlled user data must be mapped into a contiguous range of
				4431	user data registers.
				4432
				4433	* The application-controlled user data range supports compaction remapping, so
				4434	only entries that are actually consumed by the shader must be assigned to
				4435	corresponding registers. Note that in order to support an efficient runtime
				4436	implementation, the remapping must pack registers in the same order as
				4437	entries, with unused entries removed.
				4438
				4439	.. _pal_global_internal_table:
				4440
				4441	Global Internal Table
				4442	~~~~~~~~~~~~~~~~~~~~~
				4443
				4444	The global internal table is a table of shader resource descriptors (SRDs) that
				4445	define how certain engine-wide, runtime-managed resources should be accessed
				4446	from a shader. The majority of these resources have HW-defined formats, and it
				4447	is up to the compiler to write/read data as required by the target hardware.
				4448
				4449	The following table illustrates the required format:
				4450
				4451	.. table:: PAL Global Internal Table
				4452	:name: pal-git-table
				4453
				4454	============= ================================
				4455	Offset Description
				4456	============= ================================
				4457	0-3 Graphics Scratch SRD
				4458	4-7 Compute Scratch SRD
				4459	8-11 ES/GS Ring Output SRD
				4460	12-15 ES/GS Ring Input SRD
				4461	16-19 GS/VS Ring Output #0
				4462	20-23 GS/VS Ring Output #1
				4463	24-27 GS/VS Ring Output #2
				4464	28-31 GS/VS Ring Output #3
				4465	32-35 GS/VS Ring Input SRD
				4466	36-39 Tessellation Factor Buffer SRD
				4467	40-43 Off-Chip LDS Buffer SRD
				4468	44-47 Off-Chip Param Cache Buffer SRD
				4469	48-51 Sample Position Buffer SRD
				4470	52 vaRange::ShadowDescriptorTable High Bits
				4471	============= ================================
				4472
				4473	The pointer to the global internal table passed to the shader as user data
				4474	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				4475	the top 32 bits of the pipeline, so the shader may use the program
				4476	counter's top 32 bits.
				4477
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	4478	Unspecified OS
				4479	--------------
				4480
				4481	This section provides code conventions used when the target triple OS is
				4482	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4483
				4484	Trap Handler ABI
				4485	~~~~~~~~~~~~~~~~
				4486
				4487	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				4488	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				4489	instructions are handled as follows:
				4490
				4491	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				4492	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				4493
				4494	=============== =============== ===========================================
				4495	Usage Code Sequence Description
				4496	=============== =============== ===========================================
				4497	llvm.trap s_endpgm Causes wavefront to be terminated.
				4498	llvm.debugtrap none Compiler warning given that there is no
				4499	trap handler installed.
				4500	=============== =============== ===========================================
				4501
				4502	Source Languages
				4503	================
				4504
				4505	.. _amdgpu-opencl:
				4506
				4507	OpenCL
				4508	------
				4509
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4510	When the language is OpenCL the following differences occur:
				4511
				4512	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	2b4b7fe	2018-03-23 18:45:18 +0000	[diff] [blame]	4513	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				4514	arguments for the AMDHSA OS (see
				4515	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	1a0450f	2017-08-15 20:47:41 +0000	[diff] [blame]	4516	3. Additional metadata is generated
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4517	(see :ref:`amdgpu-amdhsa-code-object-metadata`).
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4518
Tony Tye	2b4b7fe	2018-03-23 18:45:18 +0000	[diff] [blame]	4519	.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
				4520	:name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
				4521
				4522	======== ==== ========= ===========================================
				4523	Position Byte Byte Description
				4524	Size Alignment
				4525	======== ==== ========= ===========================================
Tony Tye	9272c8a	2018-03-23 18:58:47 +0000	[diff] [blame]	4526	1 8 8 OpenCL Global Offset X
				4527	2 8 8 OpenCL Global Offset Y
				4528	3 8 8 OpenCL Global Offset Z
				4529	4 8 8 OpenCL address of printf buffer
				4530	5 8 8 OpenCL address of virtual queue used by
				4531	enqueue_kernel.
				4532	6 8 8 OpenCL address of AqlWrap struct used by
				4533	enqueue_kernel.
Tony Tye	2b4b7fe	2018-03-23 18:45:18 +0000	[diff] [blame]	4534	======== ==== ========= ===========================================
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4535
				4536	.. _amdgpu-hcc:
				4537
				4538	HCC
				4539	---
				4540
Tony Tye	2b4b7fe	2018-03-23 18:45:18 +0000	[diff] [blame]	4541	When the language is HCC the following differences occur:
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4542
				4543	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				4544
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4545	.. _amdgpu-assembler:
				4546
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4547	Assembler
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4548	---------
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4549
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4550	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	2f2cb6b	2017-11-10 20:51:43 +0000	[diff] [blame]	4551	It supports AMDGCN GFX6-GFX9.
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4552
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4553	This section describes general syntax for instructions and operands.
				4554
				4555	Instructions
				4556	~~~~~~~~~~~~
				4557
				4558	.. toctree::
				4559	:hidden:
				4560
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4561	AMDGPU/AMDGPUAsmGFX7
				4562	AMDGPU/AMDGPUAsmGFX8
				4563	AMDGPU/AMDGPUAsmGFX9
				4564	AMDGPUModifierSyntax
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4565	AMDGPUOperandSyntax
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4566	AMDGPUInstructionSyntax
				4567	AMDGPUInstructionNotation
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4568
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4569	An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4570
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4571	``<``\ opcode\ ``> <``\ operand0\ ``>, <``\ operand1\ ``>,... <``\ modifier0\ ``> <``\ modifier1\ ``>...``
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4572
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4573	:doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while
				4574	:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4575
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4576	The order of operands and modifiers is fixed.
				4577	Most modifiers are optional and may be omitted.
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4578
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4579	See detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`,
				4580	:doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`.
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4581
				4582	Note that features under development are not included in this description.
				4583
				4584	For more information about instructions, their semantics and supported combinations of
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4585	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	4586	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4587
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4588	Operands
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4589	~~~~~~~~
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4590
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4591	Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4592
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4593	Modifiers
				4594	~~~~~~~~~
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4595
Dmitry Preobrazhensky	51120d7	2018-12-17 17:38:11 +0000	[diff] [blame]	4596	Detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`.
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4597
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4598	Instruction Examples
				4599	~~~~~~~~~~~~~~~~~~~~
				4600
				4601	DS
Dmitry Preobrazhensky	1eaf2d7	2018-03-12 15:55:08 +0000	[diff] [blame]	4602	++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4603
				4604	.. code-block:: nasm
				4605
				4606	ds_add_u32 v2, v4 offset:16
				4607	ds_write_src2_b64 v2 offset0:4 offset1:8
				4608	ds_cmpst_f32 v2, v4, v6
				4609	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				4610
				4611
				4612	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				4613
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4614	FLAT
				4615	++++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4616
				4617	.. code-block:: nasm
				4618
				4619	flat_load_dword v1, v[3:4]
				4620	flat_store_dwordx3 v[3:4], v[5:7]
				4621	flat_atomic_swap v1, v[3:4], v5 glc
				4622	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				4623	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				4624
				4625	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				4626
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4627	MUBUF
				4628	+++++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4629
				4630	.. code-block:: nasm
				4631
				4632	buffer_load_dword v1, off, s[4:7], s1
				4633	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				4634	buffer_store_format_xy v[1:2], off, s[4:7], s1
				4635	buffer_wbinvl1
				4636	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				4637
				4638	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				4639
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4640	SMRD/SMEM
				4641	+++++++++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4642
				4643	.. code-block:: nasm
				4644
				4645	s_load_dword s1, s[2:3], 0xfc
				4646	s_load_dwordx8 s[8:15], s[2:3], s4
				4647	s_load_dwordx16 s[88:103], s[2:3], s4
				4648	s_dcache_inv_vol
				4649	s_memtime s[4:5]
				4650
				4651	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				4652
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4653	SOP1
				4654	++++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4655
				4656	.. code-block:: nasm
				4657
				4658	s_mov_b32 s1, s2
				4659	s_mov_b64 s[0:1], 0x80000000
				4660	s_cmov_b32 s1, 200
				4661	s_wqm_b64 s[2:3], s[4:5]
				4662	s_bcnt0_i32_b64 s1, s[2:3]
				4663	s_swappc_b64 s[2:3], s[4:5]
				4664	s_cbranch_join s[4:5]
				4665
				4666	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				4667
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4668	SOP2
				4669	++++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4670
				4671	.. code-block:: nasm
				4672
				4673	s_add_u32 s1, s2, s3
				4674	s_and_b64 s[2:3], s[4:5], s[6:7]
				4675	s_cselect_b32 s1, s2, s3
				4676	s_andn2_b32 s2, s4, s6
				4677	s_lshr_b64 s[2:3], s[4:5], s6
				4678	s_ashr_i32 s2, s4, s6
				4679	s_bfm_b64 s[2:3], s4, s6
				4680	s_bfe_i64 s[2:3], s[4:5], s6
				4681	s_cbranch_g_fork s[4:5], s[6:7]
				4682
				4683	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				4684
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4685	SOPC
				4686	++++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4687
				4688	.. code-block:: nasm
				4689
				4690	s_cmp_eq_i32 s1, s2
				4691	s_bitcmp1_b32 s1, s2
				4692	s_bitcmp0_b64 s[2:3], s4
				4693	s_setvskip s3, s5
				4694
				4695	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				4696
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4697	SOPP
				4698	++++
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4699
				4700	.. code-block:: nasm
				4701
				4702	s_barrier
				4703	s_nop 2
				4704	s_endpgm
				4705	s_waitcnt 0 ; Wait for all counters to be 0
				4706	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				4707	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				4708	s_sethalt 9
				4709	s_sleep 10
				4710	s_sendmsg 0x1
				4711	s_sendmsg sendmsg(MSG_INTERRUPT)
				4712	s_trap 1
				4713
				4714	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				4715
				4716	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	1d6becb	2017-01-14 11:37:01 +0000	[diff] [blame]	4717	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4718	range or acceptable values.
				4719
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4720	VALU
				4721	++++
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4722
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4723	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				4724	the assembler will automatically use optimal encoding based on its operands.
				4725	To force specific encoding, one can add a suffix to the opcode of the instruction:
				4726
				4727	* _e32 for 32-bit VOP1/VOP2/VOPC
				4728	* _e64 for 64-bit VOP3
				4729	* _dpp for VOP_DPP
				4730	* _sdwa for VOP_SDWA
				4731
				4732	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4733
				4734	.. code-block:: nasm
				4735
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4736	v_mov_b32 v1, v2
				4737	v_mov_b32_e32 v1, v2
				4738	v_nop
				4739	v_cvt_f64_i32_e32 v[1:2], v2
				4740	v_floor_f32_e32 v1, v2
				4741	v_bfrev_b32_e32 v1, v2
				4742	v_add_f32_e32 v1, v2, v3
				4743	v_mul_i32_i24_e64 v1, v2, 3
				4744	v_mul_i32_i24_e32 v1, -3, v3
				4745	v_mul_i32_i24_e32 v1, -100, v3
				4746	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4747	v_max_f16_e32 v1, v2, v3
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4748
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4749	VOP_DPP examples:
Tom Stellard	953c681	2015-06-13 03:28:10 +0000	[diff] [blame]	4750
				4751	.. code-block:: nasm
				4752
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4753	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4754	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4755	v_mov_b32 v0, v0 wave_shl:1
				4756	v_mov_b32 v0, v0 row_mirror
				4757	v_mov_b32 v0, v0 row_bcast:31
				4758	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4759	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4760	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4761
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4762	VOP_SDWA examples:
				4763
				4764	.. code-block:: nasm
				4765
				4766	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4767	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4768	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4769	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4770	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4771
				4772	For full list of supported instructions, refer to "Vector ALU instructions".
				4773
Konstantin Zhuravlyov	d6794f0	2018-06-22 19:23:18 +0000	[diff] [blame]	4774	.. TODO
				4775	Remove once we switch to code object v3 by default.
				4776
				4777	HSA Code Object Directives
				4778	~~~~~~~~~~~~~~~~~~~~~~~~~~
				4779
				4780	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4781	one can specify them with assembler directives.
				4782
				4783	.hsa_code_object_version major, minor
				4784	+++++++++++++++++++++++++++++++++++++
				4785
				4786	major and minor are integers that specify the version of the HSA code
				4787	object that will be generated by the assembler.
				4788
				4789	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
				4790	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4791
				4792
				4793	major, minor, and stepping are all integers that describe the instruction
				4794	set architecture (ISA) version of the assembly program.
				4795
				4796	vendor and arch are quoted strings. vendor should always be equal to
				4797	"AMD" and arch should always be equal to "AMDGPU".
				4798
				4799	By default, the assembler will derive the ISA version, vendor, and arch
				4800	from the value of the -mcpu option that is passed to the assembler.
				4801
				4802	.amdgpu_hsa_kernel (name)
				4803	+++++++++++++++++++++++++
				4804
				4805	This directives specifies that the symbol with given name is a kernel entry point
				4806	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
				4807
				4808	.amd_kernel_code_t
				4809	++++++++++++++++++
				4810
				4811	This directive marks the beginning of a list of key / value pairs that are used
				4812	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4813	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4814	any amd_kernel_code_t values that are unspecified a default value will be
				4815	used. The default value for all keys is 0, with the following exceptions:
				4816
				4817	- kernel_code_version_major defaults to 1.
				4818	- machine_kind defaults to 1.
				4819	- machine_version_major, machine_version_minor, and
				4820	machine_version_stepping are derived from the value of the -mcpu option
				4821	that is passed to the assembler.
				4822	- kernel_code_entry_byte_offset defaults to 256.
				4823	- wavefront_size defaults to 6.
				4824	- kernarg_segment_alignment, group_segment_alignment, and
				4825	private_segment_alignment default to 4. Note that alignments are specified
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	4826	as a power of 2, so a value of n means an alignment of 2^ n.
Konstantin Zhuravlyov	d6794f0	2018-06-22 19:23:18 +0000	[diff] [blame]	4827
				4828	The .amd_kernel_code_t directive must be placed immediately after the
				4829	function label and before any instructions.
				4830
				4831	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4832	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
				4833
				4834	Here is an example of a minimal amd_kernel_code_t specification:
				4835
				4836	.. code-block:: none
				4837
				4838	.hsa_code_object_version 1,0
				4839	.hsa_code_object_isa
				4840
				4841	.hsatext
				4842	.globl hello_world
				4843	.p2align 8
				4844	.amdgpu_hsa_kernel hello_world
				4845
				4846	hello_world:
				4847
				4848	.amd_kernel_code_t
				4849	enable_sgpr_kernarg_segment_ptr = 1
				4850	is_ptr64 = 1
				4851	compute_pgm_rsrc1_vgprs = 0
				4852	compute_pgm_rsrc1_sgprs = 0
				4853	compute_pgm_rsrc2_user_sgpr = 2
				4854	kernarg_segment_byte_size = 8
				4855	wavefront_sgpr_count = 2
				4856	workitem_vgpr_count = 3
				4857	.end_amd_kernel_code_t
				4858
				4859	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4860	v_mov_b32 v0, 3.14159
				4861	s_waitcnt lgkmcnt(0)
				4862	v_mov_b32 v1, s0
				4863	v_mov_b32 v2, s1
				4864	flat_store_dword v[1:2], v0
				4865	s_endpgm
				4866	.Lfunc_end0:
				4867	.size hello_world, .Lfunc_end0-hello_world
				4868
				4869	Predefined Symbols (-mattr=+code-object-v3)
				4870	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4871
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4872	The AMDGPU assembler defines and updates some symbols automatically. These
				4873	symbols do not affect code generation.
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4874
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4875	.amdgcn.gfx_generation_number
				4876	+++++++++++++++++++++++++++++
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4877
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4878	Set to the GFX generation number of the target being assembled for. For
				4879	example, when assembling for a "GFX9" target this will be set to the integer
				4880	value "9". The possible GFX generation numbers are presented in
				4881	:ref:`amdgpu-processors`.
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4882
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4883	.amdgcn.next_free_vgpr
				4884	++++++++++++++++++++++
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	4885
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4886	Set to zero before assembly begins. At each instruction, if the current value
				4887	of this symbol is less than or equal to the maximum VGPR number explicitly
				4888	referenced within that instruction then the symbol value is updated to equal
				4889	that VGPR number plus one.
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4890
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4891	May be used to set the `.amdhsa_next_free_vpgr` directive in
				4892	:ref:`amdhsa-kernel-directives-table`.
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4893
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4894	May be set at any time, e.g. manually set to zero at the start of each kernel.
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4895
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4896	.amdgcn.next_free_sgpr
				4897	++++++++++++++++++++++
Tom Stellard	ac1a45e	2015-06-26 21:15:07 +0000	[diff] [blame]	4898
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4899	Set to zero before assembly begins. At each instruction, if the current value
				4900	of this symbol is less than or equal the maximum SGPR number explicitly
				4901	referenced within that instruction then the symbol value is updated to equal
				4902	that SGPR number plus one.
Nikolay Haustov	2a3c739	2016-09-20 09:04:51 +0000	[diff] [blame]	4903
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4904	May be used to set the `.amdhsa_next_free_spgr` directive in
				4905	:ref:`amdhsa-kernel-directives-table`.
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4906
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4907	May be set at any time, e.g. manually set to zero at the start of each kernel.
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4908
Konstantin Zhuravlyov	d6794f0	2018-06-22 19:23:18 +0000	[diff] [blame]	4909	Code Object Directives (-mattr=+code-object-v3)
				4910	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4911
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4912	Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
				4913	architecture processors, and are not OS-specific. Directives which begin with
				4914	``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
				4915	``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
				4916	:ref:`amdgpu-processors`.
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4917
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4918	.amdgcn_target <target>
				4919	+++++++++++++++++++++++
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4920
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4921	Optional directive which declares the target supported by the containing
				4922	assembler source file. Valid values are described in
				4923	:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
				4924	to validate command-line options such as ``-triple``, ``-mcpu``, and those
				4925	which specify target features.
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4926
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4927	.amdhsa_kernel <name>
				4928	+++++++++++++++++++++
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4929
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4930	Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
				4931	``<name>.kd``, in the current location of the current section. Only valid when
				4932	the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
				4933	instruction to execute, and does not need to be previously defined.
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4934
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4935	Marks the beginning of a list of directives used to generate the bytes of a
				4936	kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				4937	Directives which may appear in this list are described in
				4938	:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
				4939	be valid for the target being assembled for, and cannot be repeated. Directives
				4940	support the range of values specified by the field they reference in
				4941	:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
				4942	assumed to have its default value, unless it is marked as "Required", in which
				4943	case it is an error to omit the directive. This list of directives is
				4944	terminated by an ``.end_amdhsa_kernel`` directive.
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4945
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4946	.. table:: AMDHSA Kernel Assembler Directives
				4947	:name: amdhsa-kernel-directives-table
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	4948
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	4949	======================================================== ================ ============ ===================
				4950	Directive Default Supported On Description
				4951	======================================================== ================ ============ ===================
				4952	``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX9 Controls GROUP_SEGMENT_FIXED_SIZE in
				4953	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4954	``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX9 Controls PRIVATE_SEGMENT_FIXED_SIZE in
				4955	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4956	``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
				4957	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4958	``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_PTR in
				4959	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4960	``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_QUEUE_PTR in
				4961	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4962	``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
				4963	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4964	``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_ID in
				4965	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4966	``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX9 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
				4967	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4968	``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
				4969	:ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
				4970	``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
				4971	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				4972	``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_X in
				4973	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				4974	``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
				4975	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				4976	``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
				4977	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				4978	``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_INFO in
				4979	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				4980	``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX9 Controls ENABLE_VGPR_WORKITEM_ID in
				4981	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				4982	Possible values are defined in
				4983	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
				4984	``.amdhsa_next_free_vgpr`` Required GFX6-GFX9 Maximum VGPR number explicitly referenced, plus one.
				4985	Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
				4986	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				4987	``.amdhsa_next_free_sgpr`` Required GFX6-GFX9 Maximum SGPR number explicitly referenced, plus one.
				4988	Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
				4989	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				4990	``.amdhsa_reserve_vcc`` 1 GFX6-GFX9 Whether the kernel may use the special VCC SGPR.
				4991	Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
				4992	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				4993	``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX9 Whether the kernel may use flat instructions to access
				4994	scratch memory. Used to calculate
				4995	GRANULATED_WAVEFRONT_SGPR_COUNT in
				4996	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				4997	``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX9 Whether the kernel may trigger XNACK replay.
				4998	Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
				4999	Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5000	(+xnack)
				5001	``.amdhsa_float_round_mode_32`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_32 in
				5002	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5003	Possible values are defined in
				5004	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				5005	``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_16_64 in
				5006	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5007	Possible values are defined in
				5008	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				5009	``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX9 Controls FLOAT_DENORM_MODE_32 in
				5010	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5011	Possible values are defined in
				5012	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				5013	``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX9 Controls FLOAT_DENORM_MODE_16_64 in
				5014	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5015	Possible values are defined in
				5016	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				5017	``.amdhsa_dx10_clamp`` 1 GFX6-GFX9 Controls ENABLE_DX10_CLAMP in
				5018	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5019	``.amdhsa_ieee_mode`` 1 GFX6-GFX9 Controls ENABLE_IEEE_MODE in
				5020	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5021	``.amdhsa_fp16_overflow`` 0 GFX9 Controls FP16_OVFL in
				5022	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
				5023	``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
				5024	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				5025	``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
				5026	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				5027	``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
				5028	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				5029	``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
				5030	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				5031	``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
				5032	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				5033	``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
				5034	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				5035	``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
				5036	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				5037	======================================================== ================ ============ ===================
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	5038
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	5039	.amdgpu_metadata
				5040	++++++++++++++++
				5041
				5042	Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
				5043	note record (see :ref:`amdgpu-elf-note-records-table-v3`).
				5044
				5045	The contents must be in the [YAML]_ markup format, with the same structure and
				5046	semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
				5047
				5048	This directive is terminated by an ``.end_amdgpu_metadata`` directive.
				5049
Konstantin Zhuravlyov	d6794f0	2018-06-22 19:23:18 +0000	[diff] [blame]	5050	Example HSA Source Code (-mattr=+code-object-v3)
				5051	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tom Stellard	4a88808	2015-06-26 21:58:31 +0000	[diff] [blame]	5052
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	5053	Here is an example of a minimal assembly source file, defining one HSA kernel:
				5054
Chandler Carruth	c0cb45a	2018-08-06 01:19:43 +0000	[diff] [blame]	5055	.. code-block:: none
Scott Linder	43cbf8d	2018-06-21 19:38:56 +0000	[diff] [blame]	5056
				5057	.amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
				5058
				5059	.text
				5060	.globl hello_world
				5061	.p2align 8
				5062	.type hello_world,@function
				5063	hello_world:
				5064	s_load_dwordx2 s[0:1], s[0:1] 0x0
				5065	v_mov_b32 v0, 3.14159
				5066	s_waitcnt lgkmcnt(0)
				5067	v_mov_b32 v1, s0
				5068	v_mov_b32 v2, s1
				5069	flat_store_dword v[1:2], v0
				5070	s_endpgm
				5071	.Lfunc_end0:
				5072	.size hello_world, .Lfunc_end0-hello_world
				5073
				5074	.rodata
				5075	.p2align 6
				5076	.amdhsa_kernel hello_world
				5077	.amdhsa_user_sgpr_kernarg_segment_ptr 1
				5078	.amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
				5079	.amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
				5080	.end_amdhsa_kernel
				5081
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	5082	.amdgpu_metadata
				5083	---
				5084	amdhsa.version:
				5085	- 1
				5086	- 0
				5087	amdhsa.kernels:
				5088	- .name: hello_world
				5089	.symbol: hello_world.kd
				5090	.kernarg_segment_size: 48
				5091	.group_segment_fixed_size: 0
				5092	.private_segment_fixed_size: 0
				5093	.kernarg_segment_align: 4
				5094	.wavefront_size: 64
				5095	.sgpr_count: 2
				5096	.vgpr_count: 3
				5097	.max_flat_workgroup_size: 256
				5098	...
				5099	.end_amdgpu_metadata
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	5100
				5101	Additional Documentation
				5102	========================
				5103
Konstantin Zhuravlyov	a05cf7b	2017-10-18 17:59:20 +0000	[diff] [blame]	5104	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				5105	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				5106	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				5107	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				5108	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				5109	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				5110	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				5111	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	5112	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				5113	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				5114	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				5115	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				5116	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	cb5868c	2017-10-19 17:12:55 +0000	[diff] [blame]	5117	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Scott Linder	dff71ea	2018-11-15 20:46:55 +0000	[diff] [blame]	5118	.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
Tony Tye	6986553	2017-06-06 20:31:59 +0000	[diff] [blame]	5119	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				5120	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
Tony Tye	978dec7	2018-06-14 16:40:10 +0000	[diff] [blame]	5121	.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__