blob: c60d60af5a82cf1204fb9854816f2762ff8c2ecc [file] [log] [blame]
Eugene Zelenkoe9a01222018-03-21 17:09:35 +00001=============================
Tony Tye69865532017-06-06 20:31:59 +00002User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard953c6812015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tye69865532017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard953c6812015-06-13 03:28:10 +000014
Tony Tye69865532017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard953c6812015-06-13 03:28:10 +000017
Tony Tye69865532017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye9000e8c2017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tye69865532017-06-06 20:31:59 +000028
Tony Tye9000e8c2017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tye69865532017-06-06 20:31:59 +000035
Tony Tye9000e8c2017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tye69865532017-06-06 20:31:59 +000038
Tony Tye9000e8c2017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tye69865532017-06-06 20:31:59 +000045
Tony Tye9000e8c2017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tye69865532017-06-06 20:31:59 +000048
Tony Tye9000e8c2017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tye69865532017-06-06 20:31:59 +000060
Tony Tye9000e8c2017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tye69865532017-06-06 20:31:59 +000063
Tony Tye9000e8c2017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
Tony Tye2b4b7fe2018-03-23 18:45:18 +000067 *<empty>* Default.
Tony Tye9000e8c2017-11-10 01:00:54 +000068 ============ ==============================================================
Tony Tye69865532017-06-06 20:31:59 +000069
70.. _amdgpu-processors:
71
72Processors
73----------
74
75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76names from both the *Processor* and *Alternative Processor* can be used.
77
78 .. table:: AMDGPU Processors
Tony Tye9000e8c2017-11-10 01:00:54 +000079 :name: amdgpu-processor-table
Tony Tye69865532017-06-06 20:31:59 +000080
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +000081 =========== =============== ============ ===== ========== ======= ======================
82 Processor Alternative Target dGPU/ Target ROCm Example
83 Processor Triple APU Features Support Products
Tony Tyec865d8a2017-12-11 15:35:27 +000084 Architecture Supported
85 [Default]
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +000086 =========== =============== ============ ===== ========== ======= ======================
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +000087 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +000088 ----------------------------------------------------------------------------------------
Tony Tye9000e8c2017-11-10 01:00:54 +000089 ``r600`` ``r600`` dGPU
90 ``r630`` ``r600`` dGPU
91 ``rs880`` ``r600`` dGPU
92 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +000093 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +000094 ----------------------------------------------------------------------------------------
Tony Tye9000e8c2017-11-10 01:00:54 +000095 ``rv710`` ``r600`` dGPU
96 ``rv730`` ``r600`` dGPU
97 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +000098 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +000099 ----------------------------------------------------------------------------------------
Tony Tye9000e8c2017-11-10 01:00:54 +0000100 ``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov16290892018-02-16 22:33:59 +0000101 ``cypress`` ``r600`` dGPU
102 ``juniper`` ``r600`` dGPU
Tony Tye9000e8c2017-11-10 01:00:54 +0000103 ``redwood`` ``r600`` dGPU
104 ``sumo`` ``r600`` dGPU
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +0000105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000106 ----------------------------------------------------------------------------------------
Tony Tye9000e8c2017-11-10 01:00:54 +0000107 ``barts`` ``r600`` dGPU
Tony Tye9000e8c2017-11-10 01:00:54 +0000108 ``caicos`` ``r600`` dGPU
109 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov16290892018-02-16 22:33:59 +0000110 ``turks`` ``r600`` dGPU
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +0000111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000112 ----------------------------------------------------------------------------------------
Tony Tye9000e8c2017-11-10 01:00:54 +0000113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov16290892018-02-16 22:33:59 +0000114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye9000e8c2017-11-10 01:00:54 +0000115 - ``oland``
Konstantin Zhuravlyov16290892018-02-16 22:33:59 +0000116 - ``pitcairn``
117 - ``verde``
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +0000118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000119 ----------------------------------------------------------------------------------------
120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
121 - A6 Pro-7050B
122 - A8-7100
123 - A8 Pro-7150B
124 - A10-7300
125 - A10 Pro-7350B
126 - FX-7500
127 - A8-7200P
128 - A10-7400P
129 - FX-7600P
130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
131 - FirePro W9100
132 - FirePro S9150
133 - FirePro S9170
134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
135 - Radeon R9 290x
136 - Radeon R390
137 - Radeon R390x
138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
139 - ``mullins`` - E1-2200
140 - E1-2500
141 - E2-3000
142 - E2-3800
143 - A4-5000
144 - A4-5100
145 - A6-5200
146 - A4 Pro-3340B
147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
148 - Radeon HD 8770
149 - R7 260
150 - R7 260X
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +0000151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000152 ----------------------------------------------------------------------------------------
153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
154 [on] - Pro A6-8500B
155 - A8-8600P
156 - Pro A8-8600B
157 - FX-8800P
158 - Pro A12-8800B
159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
160 [on] - Pro A10-8700B
161 - A10-8780P
162 \ ``amdgcn`` APU - xnack - A10-9600P
163 [on] - A10-9630P
164 - A12-9700P
165 - A12-9730P
166 - FX-9800P
167 - FX-9830P
168 \ ``amdgcn`` APU - xnack - E2-9010
169 [on] - A6-9210
170 - A9-9410
171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
172 - ``tonga`` [off] - FirePro S7100
173 - FirePro W7100
174 - Radeon R285
175 - Radeon R9 380
176 - Radeon R9 385
177 - Mobile FirePro
178 M7170
179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
180 [off] - Radeon R9 Fury
181 - Radeon R9 FuryX
182 - Radeon Pro Duo
183 - FirePro S9300x2
184 - Radeon Instinct MI8
185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
186 [off] - Radeon RX 480
187 - Radeon Instinct MI6
188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
Tony Tyec865d8a2017-12-11 15:35:27 +0000189 [off]
190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
191 [on]
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +0000192 **GCN GFX9** [AMD-GCN-GFX9]_
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000193 ----------------------------------------------------------------------------------------
194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
195 [off] Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
202 [on] - Ryzen 5 2400G
203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA*
Matt Arsenaultac9b3ef2018-04-30 19:08:16 +0000204 [off]
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000205 .. TODO
206 Add product
207 names.
Konstantin Zhuravlyovba9a1e12018-11-07 20:54:16 +0000208 ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50
209 [off] - Radeon Instinct MI60
Konstantin Zhuravlyov7829a6d2018-11-05 22:44:19 +0000210 sram-ecc
211 [on]
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000212 ``gfx909`` ``amdgcn`` APU - xnack *TBA* (Raven Ridge 2)
Tim Renouf7d5e5c22018-10-24 08:14:07 +0000213 [on]
Konstantin Zhuravlyovffb90c42018-11-06 20:23:53 +0000214 .. TODO
215 Add product
216 names.
217 =========== =============== ============ ===== ========== ======= ======================
Tony Tye9000e8c2017-11-10 01:00:54 +0000218
219.. _amdgpu-target-features:
220
221Target Features
222---------------
223
224Target features control how code is generated to support certain
Tony Tyec865d8a2017-12-11 15:35:27 +0000225processor specific features. Not all target features are supported by
226all processors. The runtime must ensure that the features supported by
227the device used to execute the code match the features enabled when
228generating the code. A mismatch of features may result in incorrect
229execution, or a reduction in performance.
230
231The target features supported by each processor, and the default value
232used if not specified explicitly, is listed in
233:ref:`amdgpu-processor-table`.
Tony Tye9000e8c2017-11-10 01:00:54 +0000234
235Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
236target features.
237
238For example:
239
240``-mxnack``
Tony Tyec865d8a2017-12-11 15:35:27 +0000241 Enable the ``xnack`` feature.
Tony Tye9000e8c2017-11-10 01:00:54 +0000242``-mno-xnack``
Tony Tyec865d8a2017-12-11 15:35:27 +0000243 Disable the ``xnack`` feature.
Tony Tye9000e8c2017-11-10 01:00:54 +0000244
245 .. table:: AMDGPU Target Features
246 :name: amdgpu-target-feature-table
247
Konstantin Zhuravlyov7829a6d2018-11-05 22:44:19 +0000248 =============== ==================================================
249 Target Feature Description
250 =============== ==================================================
251 -m[no-]xnack Enable/disable generating code that has
252 memory clauses that are compatible with
253 having XNACK replay enabled.
Tony Tye9000e8c2017-11-10 01:00:54 +0000254
Konstantin Zhuravlyov7829a6d2018-11-05 22:44:19 +0000255 This is used for demand paging and page
256 migration. If XNACK replay is enabled in
257 the device, then if a page fault occurs
258 the code may execute incorrectly if the
259 ``xnack`` feature is not enabled. Executing
260 code that has the feature enabled on a
261 device that does not have XNACK replay
262 enabled will execute correctly, but may
263 be less performant than code with the
264 feature disabled.
265 -m[no-]sram-ecc Enable/disable generating code that assumes SRAM
266 ECC is enabled/disabled.
267 =============== ==================================================
Tony Tye69865532017-06-06 20:31:59 +0000268
269.. _amdgpu-address-spaces:
Tom Stellard43f537f2016-04-06 01:29:19 +0000270
271Address Spaces
272--------------
273
Tony Tye69865532017-06-06 20:31:59 +0000274The AMDGPU backend uses the following address space mappings.
Tom Stellard43f537f2016-04-06 01:29:19 +0000275
Tony Tye69865532017-06-06 20:31:59 +0000276The memory space names used in the table, aside from the region memory space, is
277from the OpenCL standard.
Tom Stellard43f537f2016-04-06 01:29:19 +0000278
Tony Tye69865532017-06-06 20:31:59 +0000279LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard43f537f2016-04-06 01:29:19 +0000280
Tony Tye69865532017-06-06 20:31:59 +0000281 .. table:: Address Space Mapping
282 :name: amdgpu-address-space-mapping-table
283
Yaxun Liu2930e5c2018-02-13 18:00:25 +0000284 ================== =================
Tony Tye69865532017-06-06 20:31:59 +0000285 LLVM Address Space Memory Space
Yaxun Liu2930e5c2018-02-13 18:00:25 +0000286 ================== =================
287 0 Generic (Flat)
288 1 Global
289 2 Region (GDS)
290 3 Local (group/LDS)
291 4 Constant
292 5 Private (Scratch)
293 6 Constant 32-bit
294 ================== =================
Tony Tye69865532017-06-06 20:31:59 +0000295
296.. _amdgpu-memory-scopes:
297
298Memory Scopes
299-------------
300
301This section provides LLVM memory synchronization scopes supported by the AMDGPU
302backend memory model when the target triple OS is ``amdhsa`` (see
303:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
304
305The memory model supported is based on the HSA memory model [HSA]_ which is
306based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
307relation is transitive over the synchonizes-with relation independent of scope,
308and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye9000e8c2017-11-10 01:00:54 +0000309table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye69865532017-06-06 20:31:59 +0000310
311This is different to the OpenCL [OpenCL]_ memory model which does not have scope
312inclusion and requires the memory scopes to exactly match. However, this
313is conservatively correct for OpenCL.
314
Tony Tye9000e8c2017-11-10 01:00:54 +0000315 .. table:: AMDHSA LLVM Sync Scopes
316 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye69865532017-06-06 20:31:59 +0000317
318 ================ ==========================================================
319 LLVM Sync Scope Description
320 ================ ==========================================================
321 *none* The default: ``system``.
322
323 Synchronizes with, and participates in modification and
324 seq_cst total orderings with, other operations (except
325 image operations) for all address spaces (except private,
326 or generic that accesses private) provided the other
327 operation's sync scope is:
328
329 - ``system``.
330 - ``agent`` and executed by a thread on the same agent.
331 - ``workgroup`` and executed by a thread in the same
332 workgroup.
333 - ``wavefront`` and executed by a thread in the same
334 wavefront.
335
336 ``agent`` Synchronizes with, and participates in modification and
337 seq_cst total orderings with, other operations (except
338 image operations) for all address spaces (except private,
339 or generic that accesses private) provided the other
340 operation's sync scope is:
341
342 - ``system`` or ``agent`` and executed by a thread on the
343 same agent.
344 - ``workgroup`` and executed by a thread in the same
345 workgroup.
346 - ``wavefront`` and executed by a thread in the same
347 wavefront.
348
349 ``workgroup`` Synchronizes with, and participates in modification and
350 seq_cst total orderings with, other operations (except
351 image operations) for all address spaces (except private,
352 or generic that accesses private) provided the other
353 operation's sync scope is:
354
355 - ``system``, ``agent`` or ``workgroup`` and executed by a
356 thread in the same workgroup.
357 - ``wavefront`` and executed by a thread in the same
358 wavefront.
359
360 ``wavefront`` Synchronizes with, and participates in modification and
361 seq_cst total orderings with, other operations (except
362 image operations) for all address spaces (except private,
363 or generic that accesses private) provided the other
364 operation's sync scope is:
365
366 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
367 and executed by a thread in the same wavefront.
368
369 ``singlethread`` Only synchronizes with, and participates in modification
370 and seq_cst total orderings with, other operations (except
371 image operations) running in the same thread for all
372 address spaces (for example, in signal handlers).
373 ================ ==========================================================
374
375AMDGPU Intrinsics
376-----------------
377
Tony Tye978dec72018-06-14 16:40:10 +0000378The AMDGPU backend implements the following LLVM IR intrinsics.
Tony Tye69865532017-06-06 20:31:59 +0000379
380*This section is WIP.*
381
382.. TODO
383 List AMDGPU intrinsics
384
Tony Tye978dec72018-06-14 16:40:10 +0000385AMDGPU Attributes
386-----------------
387
388The AMDGPU backend supports the following LLVM IR attributes.
389
390 .. table:: AMDGPU LLVM IR Attributes
391 :name: amdgpu-llvm-ir-attributes-table
392
393 ======================================= ==========================================================
394 LLVM Attribute Description
395 ======================================= ==========================================================
396 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
397 will be specified when the kernel is dispatched. Generated
398 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
399 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
400 argument block size for the implicit arguments. This
401 varies by OS and language (for OpenCL see
402 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
403 "amdgpu-max-work-group-size"="n" Specify the maximum work-group size that will be specifed
404 when the kernel is dispatched.
405 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
406 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
407 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
408 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
409 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
410 execution unit. Generated by the ``amdgpu_waves_per_eu``
411 CLANG attribute [CLANG-ATTR]_.
412 ======================================= ==========================================================
413
Tony Tye69865532017-06-06 20:31:59 +0000414Code Object
415===========
416
417The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
418can be linked by ``lld`` to produce a standard ELF shared code object which can
419be loaded and executed on an AMDGPU target.
420
421Header
422------
423
424The AMDGPU backend uses the following ELF header:
425
426 .. table:: AMDGPU ELF Header
427 :name: amdgpu-elf-header-table
428
Konstantin Zhuravlyov76246302017-10-03 20:54:07 +0000429 ========================== ===============================
Tony Tye69865532017-06-06 20:31:59 +0000430 Field Value
Konstantin Zhuravlyov76246302017-10-03 20:54:07 +0000431 ========================== ===============================
Tony Tye69865532017-06-06 20:31:59 +0000432 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
433 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye9000e8c2017-11-10 01:00:54 +0000434 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
435 - ``ELFOSABI_AMDGPU_HSA``
436 - ``ELFOSABI_AMDGPU_PAL``
437 - ``ELFOSABI_AMDGPU_MESA3D``
438 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
439 - ``ELFABIVERSION_AMDGPU_PAL``
440 - ``ELFABIVERSION_AMDGPU_MESA3D``
441 ``e_type`` - ``ET_REL``
442 - ``ET_DYN``
Tony Tye69865532017-06-06 20:31:59 +0000443 ``e_machine`` ``EM_AMDGPU``
444 ``e_entry`` 0
Tony Tye9000e8c2017-11-10 01:00:54 +0000445 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov76246302017-10-03 20:54:07 +0000446 ========================== ===============================
Tony Tye69865532017-06-06 20:31:59 +0000447
448..
449
450 .. table:: AMDGPU ELF Header Enumeration Values
451 :name: amdgpu-elf-header-enumeration-values-table
452
Konstantin Zhuravlyov4b145dd2017-10-03 21:14:14 +0000453 =============================== =====
454 Name Value
455 =============================== =====
456 ``EM_AMDGPU`` 224
Tony Tye9000e8c2017-11-10 01:00:54 +0000457 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov4b145dd2017-10-03 21:14:14 +0000458 ``ELFOSABI_AMDGPU_HSA`` 64
459 ``ELFOSABI_AMDGPU_PAL`` 65
460 ``ELFOSABI_AMDGPU_MESA3D`` 66
461 ``ELFABIVERSION_AMDGPU_HSA`` 1
462 ``ELFABIVERSION_AMDGPU_PAL`` 0
463 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
464 =============================== =====
Tony Tye69865532017-06-06 20:31:59 +0000465
466``e_ident[EI_CLASS]``
Tony Tye9000e8c2017-11-10 01:00:54 +0000467 The ELF class is:
468
469 * ``ELFCLASS32`` for ``r600`` architecture.
470
471 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
472 bit applications.
Tony Tye69865532017-06-06 20:31:59 +0000473
474``e_ident[EI_DATA]``
Tony Tye9000e8c2017-11-10 01:00:54 +0000475 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye69865532017-06-06 20:31:59 +0000476
477``e_ident[EI_OSABI]``
Tony Tye9000e8c2017-11-10 01:00:54 +0000478 One of the following AMD GPU architecture specific OS ABIs
479 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov76246302017-10-03 20:54:07 +0000480
Tony Tye9000e8c2017-11-10 01:00:54 +0000481 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyov76246302017-10-03 20:54:07 +0000482
Tony Tye9000e8c2017-11-10 01:00:54 +0000483 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye69865532017-06-06 20:31:59 +0000484
Tony Tye9000e8c2017-11-10 01:00:54 +0000485 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
486
487 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov4b145dd2017-10-03 21:14:14 +0000488
Tony Tye69865532017-06-06 20:31:59 +0000489``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov76246302017-10-03 20:54:07 +0000490 The ABI version of the AMD GPU architecture specific OS ABI to which the code
491 object conforms:
492
493 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
494 runtime ABI.
495
496 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
497 runtime ABI.
Tony Tye69865532017-06-06 20:31:59 +0000498
Konstantin Zhuravlyov4b145dd2017-10-03 21:14:14 +0000499 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye9000e8c2017-11-10 01:00:54 +0000500 3D runtime ABI.
Konstantin Zhuravlyov4b145dd2017-10-03 21:14:14 +0000501
Tony Tye69865532017-06-06 20:31:59 +0000502``e_type``
503 Can be one of the following values:
504
505
506 ``ET_REL``
507 The type produced by the AMD GPU backend compiler as it is relocatable code
508 object.
509
510 ``ET_DYN``
511 The type produced by the linker as it is a shared code object.
512
513 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
514
515``e_machine``
Tony Tye9000e8c2017-11-10 01:00:54 +0000516 The value ``EM_AMDGPU`` is used for the machine for all processors supported
517 by the ``r600`` and ``amdgcn`` architectures (see
518 :ref:`amdgpu-processor-table`). The specific processor is specified in the
519 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
520 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye69865532017-06-06 20:31:59 +0000521
522``e_entry``
523 The entry point is 0 as the entry points for individual kernels must be
524 selected in order to invoke them through AQL packets.
525
526``e_flags``
Tony Tye9000e8c2017-11-10 01:00:54 +0000527 The AMDGPU backend uses the following ELF header flags:
528
529 .. table:: AMDGPU ELF Header ``e_flags``
530 :name: amdgpu-elf-header-e_flags-table
531
532 ================================= ========== =============================
533 Name Value Description
534 ================================= ========== =============================
535 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
536 -------------------------------------------- -----------------------------
537 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
538 mask for
539 ``EF_AMDGPU_MACH_xxx`` values
540 defined in
541 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tyec865d8a2017-12-11 15:35:27 +0000542 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
543 target feature is
544 enabled for all code
545 contained in the code object.
Tony Tye636e2232018-03-08 05:46:01 +0000546 If the processor
547 does not support the
548 ``xnack`` target
549 feature then must
550 be 0.
Tony Tyec865d8a2017-12-11 15:35:27 +0000551 See
552 :ref:`amdgpu-target-features`.
Konstantin Zhuravlyov7829a6d2018-11-05 22:44:19 +0000553 ``EF_AMDGPU_SRAM_ECC`` 0x00000200 Indicates if the ``sram-ecc``
554 target feature is
555 enabled for all code
556 contained in the code object.
557 If the processor
558 does not support the
559 ``sram-ecc`` target
560 feature then must
561 be 0.
562 See
563 :ref:`amdgpu-target-features`.
Tony Tye9000e8c2017-11-10 01:00:54 +0000564 ================================= ========== =============================
565
566 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
567 :name: amdgpu-ef-amdgpu-mach-table
568
569 ================================= ========== =============================
570 Name Value Description (see
571 :ref:`amdgpu-processor-table`)
572 ================================= ========== =============================
Konstantin Zhuravlyov16290892018-02-16 22:33:59 +0000573 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
574 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
575 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
576 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
577 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
578 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
579 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
580 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
581 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
582 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
583 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
584 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
585 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
586 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
587 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
588 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
589 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
590 *reserved* 0x011 - Reserved for ``r600``
591 0x01f architecture processors.
592 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
593 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
594 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
595 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
596 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
597 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
598 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
599 *reserved* 0x027 Reserved.
600 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
601 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
602 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
603 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
604 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
605 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenaultac9b3ef2018-04-30 19:08:16 +0000606 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
607 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov16290892018-02-16 22:33:59 +0000608 *reserved* 0x030 Reserved.
Tim Renouf7d5e5c22018-10-24 08:14:07 +0000609 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
Tony Tye9000e8c2017-11-10 01:00:54 +0000610 ================================= ========== =============================
Tony Tye69865532017-06-06 20:31:59 +0000611
612Sections
613--------
614
615An AMDGPU target ELF code object has the standard ELF sections which include:
616
617 .. table:: AMDGPU ELF Sections
618 :name: amdgpu-elf-sections-table
619
620 ================== ================ =================================
621 Name Type Attributes
622 ================== ================ =================================
623 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
624 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
625 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
626 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
627 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
628 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
629 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
630 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
631 ``.note`` ``SHT_NOTE`` *none*
632 ``.rela``\ *name* ``SHT_RELA`` *none*
633 ``.rela.dyn`` ``SHT_RELA`` *none*
634 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
635 ``.shstrtab`` ``SHT_STRTAB`` *none*
636 ``.strtab`` ``SHT_STRTAB`` *none*
637 ``.symtab`` ``SHT_SYMTAB`` *none*
638 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
639 ================== ================ =================================
640
641These sections have their standard meanings (see [ELF]_) and are only generated
642if needed.
643
644``.debug``\ *\**
645 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
646 DWARF produced by the AMDGPU backend.
647
Tony Tye1a0450f2017-08-15 20:47:41 +0000648``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye69865532017-06-06 20:31:59 +0000649 The standard sections used by a dynamic loader.
650
651``.note``
652 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
653 backend.
654
655``.rela``\ *name*, ``.rela.dyn``
656 For relocatable code objects, *name* is the name of the section that the
657 relocation records apply. For example, ``.rela.text`` is the section name for
658 relocation records associated with the ``.text`` section.
659
660 For linked shared code objects, ``.rela.dyn`` contains all the relocation
661 records from each of the relocatable code object's ``.rela``\ *name* sections.
662
663 See :ref:`amdgpu-relocation-records` for the relocation records supported by
664 the AMDGPU backend.
665
666``.text``
667 The executable machine code for the kernels and functions they call. Generated
668 as position independent code. See :ref:`amdgpu-code-conventions` for
669 information on conventions used in the isa generation.
670
671.. _amdgpu-note-records:
672
673Note Records
674------------
675
Tony Tye9000e8c2017-11-10 01:00:54 +0000676As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
677be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
678aligned. In addition, minimal zero byte padding must be generated to ensure the
679``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
680``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye69865532017-06-06 20:31:59 +0000681
Scott Linderdff71ea2018-11-15 20:46:55 +0000682.. _amdgpu-note-records-v2:
683
684Code Object V2 Note Records (-mattr=-code-object-v3)
685~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
686
687The AMDGPU backend code object uses the following ELF note record in the
688``.note`` section.
Tony Tye69865532017-06-06 20:31:59 +0000689
690Additional note records can be present.
691
Scott Linderdff71ea2018-11-15 20:46:55 +0000692 .. table:: AMDGPU Code Object V2 ELF Note Records
693 :name: amdgpu-elf-note-records-table-v2
Tony Tye69865532017-06-06 20:31:59 +0000694
Tony Tye1a0450f2017-08-15 20:47:41 +0000695 ===== ============================== ======================================
696 Name Type Description
697 ===== ============================== ======================================
698 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye1a0450f2017-08-15 20:47:41 +0000699 ===== ============================== ======================================
Tony Tye69865532017-06-06 20:31:59 +0000700
701..
702
Scott Linderdff71ea2018-11-15 20:46:55 +0000703 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
704 :name: amdgpu-elf-note-record-enumeration-values-table-v2
Tony Tye69865532017-06-06 20:31:59 +0000705
Tony Tye1a0450f2017-08-15 20:47:41 +0000706 ============================== =====
707 Name Value
708 ============================== =====
709 *reserved* 0-9
710 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye9000e8c2017-11-10 01:00:54 +0000711 *reserved* 11
Tony Tye1a0450f2017-08-15 20:47:41 +0000712 ============================== =====
Tony Tye69865532017-06-06 20:31:59 +0000713
Tony Tye1a0450f2017-08-15 20:47:41 +0000714``NT_AMD_AMDGPU_HSA_METADATA``
715 Specifies extensible metadata associated with the code objects executed on HSA
716 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
717 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
Scott Linderdff71ea2018-11-15 20:46:55 +0000718 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
Tony Tye1a0450f2017-08-15 20:47:41 +0000719 object metadata string.
Tony Tye69865532017-06-06 20:31:59 +0000720
Scott Linderdff71ea2018-11-15 20:46:55 +0000721.. _amdgpu-note-records-v3:
722
723Code Object V3 Note Records (-mattr=+code-object-v3)
724~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
725
726The AMDGPU backend code object uses the following ELF note record in the
727``.note`` section.
728
729Additional note records can be present.
730
731 .. table:: AMDGPU Code Object V3 ELF Note Records
732 :name: amdgpu-elf-note-records-table-v3
733
734 ======== ============================== ======================================
735 Name Type Description
736 ======== ============================== ======================================
737 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
738 binary format.
739 ======== ============================== ======================================
740
741..
742
743 .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
744 :name: amdgpu-elf-note-record-enumeration-values-table-v3
745
746 ============================== =====
747 Name Value
748 ============================== =====
749 *reserved* 0-31
750 ``NT_AMDGPU_METADATA`` 32
751 ============================== =====
752
753``NT_AMDGPU_METADATA``
754 Specifies extensible metadata associated with an AMDGPU code
755 object. It is encoded as a map in the Message Pack [MsgPack]_ binary
756 data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
757 map keys defined for the ``amdhsa`` OS.
758
Tony Tye1a0450f2017-08-15 20:47:41 +0000759.. _amdgpu-symbols:
760
761Symbols
762-------
763
764Symbols include the following:
765
766 .. table:: AMDGPU ELF Symbols
767 :name: amdgpu-elf-symbols-table
768
769 ===================== ============== ============= ==================
770 Name Type Section Description
771 ===================== ============== ============= ==================
772 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
773 - ``.rodata``
774 - ``.bss``
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +0000775 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
Tony Tye1a0450f2017-08-15 20:47:41 +0000776 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
777 ===================== ============== ============= ==================
778
779Global variable
780 Global variables both used and defined by the compilation unit.
781
782 If the symbol is defined in the compilation unit then it is allocated in the
783 appropriate section according to if it has initialized data or is readonly.
784
785 If the symbol is external then its section is ``STN_UNDEF`` and the loader
786 will resolve relocations using the definition provided by another code object
787 or explicitly defined by the runtime.
788
789 All global symbols, whether defined in the compilation unit or external, are
790 accessed by the machine code indirectly through a GOT table entry. This
791 allows them to be preemptable. The GOT table is only supported when the target
792 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye69865532017-06-06 20:31:59 +0000793
794 .. TODO
Tony Tye1a0450f2017-08-15 20:47:41 +0000795 Add description of linked shared object symbols. Seems undefined symbols
796 are marked as STT_NOTYPE.
Tony Tye69865532017-06-06 20:31:59 +0000797
Tony Tye1a0450f2017-08-15 20:47:41 +0000798Kernel descriptor
799 Every HSA kernel has an associated kernel descriptor. It is the address of the
800 kernel descriptor that is used in the AQL dispatch packet used to invoke the
801 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
802 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
803
804Kernel entry point
805 Every HSA kernel also has a symbol for its machine code entry point.
806
807.. _amdgpu-relocation-records:
808
809Relocation Records
810------------------
811
812AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
813relocatable fields are:
814
815``word32``
816 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
817 alignment. These values use the same byte order as other word values in the
818 AMD GPU architecture.
819
820``word64``
821 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
822 alignment. These values use the same byte order as other word values in the
823 AMD GPU architecture.
824
825Following notations are used for specifying relocation calculations:
826
827**A**
828 Represents the addend used to compute the value of the relocatable field.
829
830**G**
831 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovcb5868c2017-10-19 17:12:55 +0000832 entry's symbol will reside during execution.
Tony Tye1a0450f2017-08-15 20:47:41 +0000833
834**GOT**
835 Represents the address of the global offset table.
836
837**P**
838 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
839 of the storage unit being relocated (computed using ``r_offset``).
840
841**S**
842 Represents the value of the symbol whose index resides in the relocation
Tony Tye300ec0a2017-10-16 20:44:29 +0000843 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
844
845**B**
846 Represents the base address of a loaded executable or shared object which is
847 the difference between the ELF address and the actual load address. Relocations
848 using this are only valid in executable or shared objects.
Tony Tye1a0450f2017-08-15 20:47:41 +0000849
850The following relocation types are supported:
851
852 .. table:: AMDGPU ELF Relocation Records
853 :name: amdgpu-elf-relocation-records-table
854
Tony Tye90018a52018-01-30 23:59:43 +0000855 ========================== ======= ===== ========== ==============================
856 Relocation Type Kind Value Field Calculation
857 ========================== ======= ===== ========== ==============================
858 ``R_AMDGPU_NONE`` 0 *none* *none*
Tony Tyef0a27cc2018-04-13 01:01:27 +0000859 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
860 Dynamic
861 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
862 Dynamic
863 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenaultac9b3ef2018-04-30 19:08:16 +0000864 Dynamic
Tony Tye90018a52018-01-30 23:59:43 +0000865 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
866 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tyef0a27cc2018-04-13 01:01:27 +0000867 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
868 Dynamic
Tony Tye90018a52018-01-30 23:59:43 +0000869 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
870 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
871 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
872 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
873 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
874 *reserved* 12
875 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
876 ========================== ======= ===== ========== ==============================
Tony Tye1a0450f2017-08-15 20:47:41 +0000877
Tony Tyef0a27cc2018-04-13 01:01:27 +0000878``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
879the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
880
881There is no current OS loader support for 32 bit programs and so
882``R_AMDGPU_ABS32`` is not used.
Matt Arsenaultac9b3ef2018-04-30 19:08:16 +0000883
Tony Tye1a0450f2017-08-15 20:47:41 +0000884.. _amdgpu-dwarf:
885
886DWARF
887-----
888
Scott Linder5e4b5152018-02-23 23:01:06 +0000889Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye1a0450f2017-08-15 20:47:41 +0000890information that maps the code object executable code and data to the source
891language constructs. It can be used by tools such as debuggers and profilers.
892
893Address Space Mapping
894~~~~~~~~~~~~~~~~~~~~~
895
896The following address space mapping is used:
897
898 .. table:: AMDGPU DWARF Address Space Mapping
899 :name: amdgpu-dwarf-address-space-mapping-table
900
901 =================== =================
902 DWARF Address Space Memory Space
903 =================== =================
904 1 Private (Scratch)
905 2 Local (group/LDS)
906 *omitted* Global
907 *omitted* Constant
908 *omitted* Generic (Flat)
909 *not supported* Region (GDS)
910 =================== =================
911
912See :ref:`amdgpu-address-spaces` for information on the memory space terminology
913used in the table.
914
915An ``address_class`` attribute is generated on pointer type DIEs to specify the
916DWARF address space of the value of the pointer when it is in the *private* or
917*local* address space. Otherwise the attribute is omitted.
918
919An ``XDEREF`` operation is generated in location list expressions for variables
920that are allocated in the *private* and *local* address space. Otherwise no
921``XDREF`` is omitted.
922
923Register Mapping
924~~~~~~~~~~~~~~~~
925
926*This section is WIP.*
927
928.. TODO
929 Define DWARF register enumeration.
930
931 If want to present a wavefront state then should expose vector registers as
932 64 wide (rather than per work-item view that LLVM uses). Either as separate
933 registers, or a 64x4 byte single register. In either case use a new LANE op
934 (akin to XDREF) to select the current lane usage in a location
935 expression. This would also allow scalar register spilling to vector register
936 lanes to be expressed (currently no debug information is being generated for
937 spilling). If choose a wide single register approach then use LANE in
938 conjunction with PIECE operation to select the dword part of the register for
939 the current lane. If the separate register approach then use LANE to select
940 the register.
941
942Source Text
943~~~~~~~~~~~
944
Scott Linder5e4b5152018-02-23 23:01:06 +0000945Source text for online-compiled programs (e.g. those compiled by the OpenCL
946runtime) may be embedded into the DWARF v5 line table using the ``clang
947-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye1a0450f2017-08-15 20:47:41 +0000948
Scott Linder5e4b5152018-02-23 23:01:06 +0000949For example:
950
951``-gembed-source``
952 Enable the embedded source DWARF v5 extension.
953``-gno-embed-source``
954 Disable the embedded source DWARF v5 extension.
955
956 .. table:: AMDGPU Debug Options
957 :name: amdgpu-debug-options
958
959 ==================== ==================================================
960 Debug Flag Description
961 ==================== ==================================================
962 -g[no-]embed-source Enable/disable embedding source text in DWARF
963 debug sections. Useful for environments where
964 source cannot be written to disk, such as
965 when performing online compilation.
966 ==================== ==================================================
967
968This option enables one extended content types in the DWARF v5 Line Number
969Program Header, which is used to encode embedded source.
970
971 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
972 :name: amdgpu-dwarf-extended-content-types
973
974 ============================ ======================
975 Content Type Form
976 ============================ ======================
977 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
978 ============================ ======================
979
980The source field will contain the UTF-8 encoded, null-terminated source text
981with ``'\n'`` line endings. When the source field is present, consumers can use
982the embedded source instead of attempting to discover the source on disk. When
983the source field is absent, consumers can access the file to get the source
984text.
985
986The above content type appears in the ``file_name_entry_format`` field of the
987line table prologue, and its corresponding value appear in the ``file_names``
988field. The current encoding of the content type is documented in table
989:ref:`amdgpu-dwarf-extended-content-types-encoding`
990
991 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
992 :name: amdgpu-dwarf-extended-content-types-encoding
993
994 ============================ ====================
995 Content Type Value
996 ============================ ====================
997 ``DW_LNCT_LLVM_source`` 0x2001
998 ============================ ====================
Tony Tye1a0450f2017-08-15 20:47:41 +0000999
1000.. _amdgpu-code-conventions:
1001
1002Code Conventions
1003================
1004
1005This section provides code conventions used for each supported target triple OS
1006(see :ref:`amdgpu-target-triples`).
1007
1008AMDHSA
1009------
1010
1011This section provides code conventions used when the target triple OS is
1012``amdhsa`` (see :ref:`amdgpu-target-triples`).
1013
Scott Linder43cbf8d2018-06-21 19:38:56 +00001014.. _amdgpu-amdhsa-code-object-target-identification:
Tony Tye69865532017-06-06 20:31:59 +00001015
Tony Tyed31305a2018-03-27 21:20:46 +00001016Code Object Target Identification
1017~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1018
1019The AMDHSA OS uses the following syntax to specify the code object
1020target as a single string:
1021
1022 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
1023
1024Where:
1025
1026 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
1027 are the same as the *Target Triple* (see
1028 :ref:`amdgpu-target-triples`).
1029
1030 - ``<Processor>`` is the same as the *Processor* (see
1031 :ref:`amdgpu-processors`).
1032
1033 - ``<Target Features>`` is a list of the enabled *Target Features*
1034 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
1035 apply to *Processor*. The list must be in the same order as listed
1036 in the table :ref:`amdgpu-target-feature-table`. Note that *Target
1037 Features* must be included in the list if they are enabled even if
1038 that is the default for *Processor*.
1039
1040For example:
1041
1042 ``"amdgcn-amd-amdhsa--gfx902+xnack"``
1043
Scott Linder43cbf8d2018-06-21 19:38:56 +00001044.. _amdgpu-amdhsa-code-object-metadata:
1045
Tony Tye69865532017-06-06 20:31:59 +00001046Code Object Metadata
Tony Tye1a0450f2017-08-15 20:47:41 +00001047~~~~~~~~~~~~~~~~~~~~
Tony Tye69865532017-06-06 20:31:59 +00001048
Tony Tye1a0450f2017-08-15 20:47:41 +00001049The code object metadata specifies extensible metadata associated with the code
1050objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
Scott Linderdff71ea2018-11-15 20:46:55 +00001051[AMD-ROCm]_. It is specified in a note record (see :ref:`amdgpu-note-records`)
1052and is required when the target triple OS is ``amdhsa`` (see
1053:ref:`amdgpu-target-triples`). It must contain the minimum information
1054necessary to support the ROCM kernel queries. For example, the segment sizes
1055needed in a dispatch packet. In addition, a high level language runtime may
1056require other information to be included. For example, the AMD OpenCL runtime
1057records kernel argument information.
1058
1059.. _amdgpu-amdhsa-code-object-metadata-v2:
1060
1061Code Object V2 Metadata (-mattr=-code-object-v3)
1062++++++++++++++++++++++++++++++++++++++++++++++++
1063
1064Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
1065record (see :ref:`amdgpu-note-records-v2`).
Tony Tye69865532017-06-06 20:31:59 +00001066
Sylvestre Ledrue9d67e42017-06-26 02:45:39 +00001067The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye69865532017-06-06 20:31:59 +00001068:doc:`YamlIO`).
1069
Tony Tye1a0450f2017-08-15 20:47:41 +00001070.. TODO
1071 Is the string null terminated? It probably should not if YAML allows it to
1072 contain null characters, otherwise it should be.
1073
Tony Tye69865532017-06-06 20:31:59 +00001074The metadata is represented as a single YAML document comprised of the mapping
Scott Linderdff71ea2018-11-15 20:46:55 +00001075defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
Tony Tye69865532017-06-06 20:31:59 +00001076referenced tables.
1077
1078For boolean values, the string values of ``false`` and ``true`` are used for
1079false and true respectively.
1080
1081Additional information can be added to the mappings. To avoid conflicts, any
1082non-AMD key names should be prefixed by "*vendor-name*.".
1083
Scott Linderdff71ea2018-11-15 20:46:55 +00001084 .. table:: AMDHSA Code Object V2 Metadata Map
1085 :name: amdgpu-amdhsa-code-object-metadata-map-table-v2
Tony Tye69865532017-06-06 20:31:59 +00001086
1087 ========== ============== ========= =======================================
1088 String Key Value Type Required? Description
1089 ========== ============== ========= =======================================
1090 "Version" sequence of Required - The first integer is the major
1091 2 integers version. Currently 1.
1092 - The second integer is the minor
1093 version. Currently 0.
1094 "Printf" sequence of Each string is encoded information
1095 strings about a printf function call. The
1096 encoded information is organized as
1097 fields separated by colon (':'):
1098
1099 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1100
1101 where:
1102
1103 ``ID``
1104 A 32 bit integer as a unique id for
1105 each printf function call
1106
1107 ``N``
1108 A 32 bit integer equal to the number
1109 of arguments of printf function call
1110 minus 1
1111
1112 ``S[i]`` (where i = 0, 1, ... , N-1)
1113 32 bit integers for the size in bytes
1114 of the i-th FormatString argument of
1115 the printf function call
1116
1117 FormatString
1118 The format string passed to the
1119 printf function call.
1120 "Kernels" sequence of Required Sequence of the mappings for each
1121 mapping kernel in the code object. See
Scott Linderdff71ea2018-11-15 20:46:55 +00001122 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
Tony Tye69865532017-06-06 20:31:59 +00001123 for the definition of the mapping.
1124 ========== ============== ========= =======================================
1125
1126..
1127
Scott Linderdff71ea2018-11-15 20:46:55 +00001128 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
1129 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2
Tony Tye69865532017-06-06 20:31:59 +00001130
1131 ================= ============== ========= ================================
1132 String Key Value Type Required? Description
1133 ================= ============== ========= ================================
1134 "Name" string Required Source name of the kernel.
1135 "SymbolName" string Required Name of the kernel
1136 descriptor ELF symbol.
1137 "Language" string Source language of the kernel.
1138 Values include:
1139
1140 - "OpenCL C"
1141 - "OpenCL C++"
1142 - "HCC"
1143 - "OpenMP"
1144
1145 "LanguageVersion" sequence of - The first integer is the major
1146 2 integers version.
1147 - The second integer is the
1148 minor version.
1149 "Attrs" mapping Mapping of kernel attributes.
1150 See
Scott Linderdff71ea2018-11-15 20:46:55 +00001151 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
Tony Tye69865532017-06-06 20:31:59 +00001152 for the mapping definition.
Konstantin Zhuravlyov5556d842017-10-14 19:03:51 +00001153 "Args" sequence of Sequence of mappings of the
Tony Tye69865532017-06-06 20:31:59 +00001154 mapping kernel arguments. See
Scott Linderdff71ea2018-11-15 20:46:55 +00001155 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
Tony Tye69865532017-06-06 20:31:59 +00001156 for the definition of the mapping.
1157 "CodeProps" mapping Mapping of properties related to
1158 the kernel code. See
Scott Linderdff71ea2018-11-15 20:46:55 +00001159 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
Tony Tye69865532017-06-06 20:31:59 +00001160 for the mapping definition.
Tony Tye69865532017-06-06 20:31:59 +00001161 ================= ============== ========= ================================
1162
1163..
1164
Scott Linderdff71ea2018-11-15 20:46:55 +00001165 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
1166 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2
Tony Tye69865532017-06-06 20:31:59 +00001167
1168 =================== ============== ========= ==============================
1169 String Key Value Type Required? Description
1170 =================== ============== ========= ==============================
Tony Tyeae5d34e2018-01-30 23:07:10 +00001171 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1172 3 integers must be >=1 and the dispatch
1173 work-group size X, Y, Z must
1174 correspond to the specified
1175 values. Defaults to 0, 0, 0.
Tony Tye69865532017-06-06 20:31:59 +00001176
1177 Corresponds to the OpenCL
1178 ``reqd_work_group_size``
1179 attribute.
1180 "WorkGroupSizeHint" sequence of The dispatch work-group size
1181 3 integers X, Y, Z is likely to be the
1182 specified values.
1183
1184 Corresponds to the OpenCL
1185 ``work_group_size_hint``
1186 attribute.
1187 "VecTypeHint" string The name of a scalar or vector
1188 type.
1189
1190 Corresponds to the OpenCL
1191 ``vec_type_hint`` attribute.
Yaxun Liu091c0432017-10-10 19:39:48 +00001192
1193 "RuntimeHandle" string The external symbol name
1194 associated with a kernel.
1195 OpenCL runtime allocates a
1196 global buffer for the symbol
1197 and saves the kernel's address
1198 to it, which is used for
1199 device side enqueueing. Only
1200 available for device side
1201 enqueued kernels.
Tony Tye69865532017-06-06 20:31:59 +00001202 =================== ============== ========= ==============================
1203
1204..
1205
Scott Linderdff71ea2018-11-15 20:46:55 +00001206 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
1207 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2
Tony Tye69865532017-06-06 20:31:59 +00001208
1209 ================= ============== ========= ================================
1210 String Key Value Type Required? Description
1211 ================= ============== ========= ================================
1212 "Name" string Kernel argument name.
1213 "TypeName" string Kernel argument type name.
1214 "Size" integer Required Kernel argument size in bytes.
1215 "Align" integer Required Kernel argument alignment in
1216 bytes. Must be a power of two.
1217 "ValueKind" string Required Kernel argument kind that
1218 specifies how to set up the
1219 corresponding argument.
1220 Values include:
1221
1222 "ByValue"
1223 The argument is copied
1224 directly into the kernarg.
1225
1226 "GlobalBuffer"
1227 A global address space pointer
1228 to the buffer data is passed
1229 in the kernarg.
1230
1231 "DynamicSharedPointer"
1232 A group address space pointer
1233 to dynamically allocated LDS
1234 is passed in the kernarg.
1235
1236 "Sampler"
1237 A global address space
1238 pointer to a S# is passed in
1239 the kernarg.
1240
1241 "Image"
1242 A global address space
1243 pointer to a T# is passed in
1244 the kernarg.
1245
1246 "Pipe"
1247 A global address space pointer
1248 to an OpenCL pipe is passed in
1249 the kernarg.
1250
1251 "Queue"
1252 A global address space pointer
1253 to an OpenCL device enqueue
1254 queue is passed in the
1255 kernarg.
1256
1257 "HiddenGlobalOffsetX"
1258 The OpenCL grid dispatch
1259 global offset for the X
1260 dimension is passed in the
1261 kernarg.
1262
1263 "HiddenGlobalOffsetY"
1264 The OpenCL grid dispatch
1265 global offset for the Y
1266 dimension is passed in the
1267 kernarg.
1268
1269 "HiddenGlobalOffsetZ"
1270 The OpenCL grid dispatch
1271 global offset for the Z
1272 dimension is passed in the
1273 kernarg.
1274
1275 "HiddenNone"
1276 An argument that is not used
1277 by the kernel. Space needs to
1278 be left for it, but it does
1279 not need to be set up.
1280
1281 "HiddenPrintfBuffer"
1282 A global address space pointer
1283 to the runtime printf buffer
1284 is passed in kernarg.
1285
1286 "HiddenDefaultQueue"
1287 A global address space pointer
1288 to the OpenCL device enqueue
1289 queue that should be used by
1290 the kernel by default is
1291 passed in the kernarg.
1292
1293 "HiddenCompletionAction"
Yaxun Liua52756b2017-10-30 14:30:28 +00001294 A global address space pointer
1295 to help link enqueued kernels into
1296 the ancestor tree for determining
1297 when the parent kernel has finished.
Tony Tye69865532017-06-06 20:31:59 +00001298
1299 "ValueType" string Required Kernel argument value type. Only
1300 present if "ValueKind" is
1301 "ByValue". For vector data
1302 types, the value is for the
1303 element type. Values include:
1304
1305 - "Struct"
1306 - "I8"
1307 - "U8"
1308 - "I16"
1309 - "U16"
1310 - "F16"
1311 - "I32"
1312 - "U32"
1313 - "F32"
1314 - "I64"
1315 - "U64"
1316 - "F64"
1317
1318 .. TODO
1319 How can it be determined if a
1320 vector type, and what size
1321 vector?
1322 "PointeeAlign" integer Alignment in bytes of pointee
1323 type for pointer type kernel
1324 argument. Must be a power
1325 of 2. Only present if
1326 "ValueKind" is
1327 "DynamicSharedPointer".
1328 "AddrSpaceQual" string Kernel argument address space
1329 qualifier. Only present if
1330 "ValueKind" is "GlobalBuffer" or
1331 "DynamicSharedPointer". Values
1332 are:
1333
1334 - "Private"
1335 - "Global"
1336 - "Constant"
1337 - "Local"
1338 - "Generic"
1339 - "Region"
1340
1341 .. TODO
1342 Is GlobalBuffer only Global
1343 or Constant? Is
1344 DynamicSharedPointer always
1345 Local? Can HCC allow Generic?
1346 How can Private or Region
1347 ever happen?
1348 "AccQual" string Kernel argument access
1349 qualifier. Only present if
1350 "ValueKind" is "Image" or
1351 "Pipe". Values
1352 are:
1353
1354 - "ReadOnly"
1355 - "WriteOnly"
1356 - "ReadWrite"
1357
1358 .. TODO
1359 Does this apply to
1360 GlobalBuffer?
Konstantin Zhuravlyov5556d842017-10-14 19:03:51 +00001361 "ActualAccQual" string The actual memory accesses
Tony Tye69865532017-06-06 20:31:59 +00001362 performed by the kernel on the
1363 kernel argument. Only present if
1364 "ValueKind" is "GlobalBuffer",
1365 "Image", or "Pipe". This may be
1366 more restrictive than indicated
1367 by "AccQual" to reflect what the
1368 kernel actual does. If not
1369 present then the runtime must
1370 assume what is implied by
1371 "AccQual" and "IsConst". Values
1372 are:
1373
1374 - "ReadOnly"
1375 - "WriteOnly"
1376 - "ReadWrite"
1377
1378 "IsConst" boolean Indicates if the kernel argument
1379 is const qualified. Only present
1380 if "ValueKind" is
1381 "GlobalBuffer".
1382
1383 "IsRestrict" boolean Indicates if the kernel argument
1384 is restrict qualified. Only
1385 present if "ValueKind" is
1386 "GlobalBuffer".
1387
1388 "IsVolatile" boolean Indicates if the kernel argument
1389 is volatile qualified. Only
1390 present if "ValueKind" is
1391 "GlobalBuffer".
1392
1393 "IsPipe" boolean Indicates if the kernel argument
1394 is pipe qualified. Only present
1395 if "ValueKind" is "Pipe".
1396
1397 .. TODO
1398 Can GlobalBuffer be pipe
1399 qualified?
1400 ================= ============== ========= ================================
1401
1402..
1403
Scott Linderdff71ea2018-11-15 20:46:55 +00001404 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
1405 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2
Tony Tye69865532017-06-06 20:31:59 +00001406
1407 ============================ ============== ========= =====================
1408 String Key Value Type Required? Description
1409 ============================ ============== ========= =====================
1410 "KernargSegmentSize" integer Required The size in bytes of
1411 the kernarg segment
1412 that holds the values
1413 of the arguments to
1414 the kernel.
1415 "GroupSegmentFixedSize" integer Required The amount of group
1416 segment memory
1417 required by a
1418 work-group in
1419 bytes. This does not
1420 include any
1421 dynamically allocated
1422 group segment memory
1423 that may be added
1424 when the kernel is
1425 dispatched.
1426 "PrivateSegmentFixedSize" integer Required The amount of fixed
1427 private address space
1428 memory required for a
1429 work-item in
Tony Tye9000e8c2017-11-10 01:00:54 +00001430 bytes. If the kernel
1431 uses a dynamic call
1432 stack then additional
Tony Tye69865532017-06-06 20:31:59 +00001433 space must be added
1434 to this value for the
1435 call stack.
1436 "KernargSegmentAlign" integer Required The maximum byte
1437 alignment of
1438 arguments in the
1439 kernarg segment. Must
1440 be a power of 2.
1441 "WavefrontSize" integer Required Wavefront size. Must
1442 be a power of 2.
Tony Tye9000e8c2017-11-10 01:00:54 +00001443 "NumSGPRs" integer Required Number of scalar
Tony Tye69865532017-06-06 20:31:59 +00001444 registers used by a
1445 wavefront for
1446 GFX6-GFX9. This
1447 includes the special
1448 SGPRs for VCC, Flat
1449 Scratch (GFX7-GFX9)
1450 and XNACK (for
1451 GFX8-GFX9). It does
1452 not include the 16
1453 SGPR added if a trap
1454 handler is
1455 enabled. It is not
1456 rounded up to the
1457 allocation
1458 granularity.
Tony Tye9000e8c2017-11-10 01:00:54 +00001459 "NumVGPRs" integer Required Number of vector
Tony Tye69865532017-06-06 20:31:59 +00001460 registers used by
1461 each work-item for
1462 GFX6-GFX9
Tony Tye9000e8c2017-11-10 01:00:54 +00001463 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye69865532017-06-06 20:31:59 +00001464 work-group size
1465 supported by the
1466 kernel in work-items.
Tony Tye9000e8c2017-11-10 01:00:54 +00001467 Must be >=1 and
Tony Tyeae5d34e2018-01-30 23:07:10 +00001468 consistent with
1469 ReqdWorkGroupSize if
1470 not 0, 0, 0.
Konstantin Zhuravlyova9edc752017-11-28 17:51:08 +00001471 "NumSpilledSGPRs" integer Number of stores from
1472 a scalar register to
1473 a register allocator
1474 created spill
1475 location.
1476 "NumSpilledVGPRs" integer Number of stores from
1477 a vector register to
1478 a register allocator
1479 created spill
1480 location.
Tony Tye69865532017-06-06 20:31:59 +00001481 ============================ ============== ========= =====================
1482
Scott Linderdff71ea2018-11-15 20:46:55 +00001483.. _amdgpu-amdhsa-code-object-metadata-v3:
1484
1485Code Object V3 Metadata (-mattr=+code-object-v3)
1486++++++++++++++++++++++++++++++++++++++++++++++++
1487
1488Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
1489(see :ref:`amdgpu-note-records-v3`).
1490
1491The metadata is represented as Message Pack formatted binary data (see
1492[MsgPack]_). The top level is a Message Pack map that includes the
1493keys defined in table
1494:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
1495tables.
1496
1497Additional information can be added to the maps. To avoid conflicts,
1498any key names should be prefixed by "*vendor-name*." where
1499``vendor-name`` can be the the name of the vendor and specific vendor
1500tool that generates the information. The prefix is abbreviated to
1501simply "." when it appears within a map that has been added by the
1502same *vendor-name*.
1503
1504 .. table:: AMDHSA Code Object V3 Metadata Map
1505 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
1506
1507 ================= ============== ========= =======================================
1508 String Key Value Type Required? Description
1509 ================= ============== ========= =======================================
1510 "amdhsa.version" sequence of Required - The first integer is the major
1511 2 integers version. Currently 1.
1512 - The second integer is the minor
1513 version. Currently 0.
1514 "amdhsa.printf" sequence of Each string is encoded information
1515 strings about a printf function call. The
1516 encoded information is organized as
1517 fields separated by colon (':'):
1518
1519 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1520
1521 where:
1522
1523 ``ID``
1524 A 32 bit integer as a unique id for
1525 each printf function call
1526
1527 ``N``
1528 A 32 bit integer equal to the number
1529 of arguments of printf function call
1530 minus 1
1531
1532 ``S[i]`` (where i = 0, 1, ... , N-1)
1533 32 bit integers for the size in bytes
1534 of the i-th FormatString argument of
1535 the printf function call
1536
1537 FormatString
1538 The format string passed to the
1539 printf function call.
1540 "amdhsa.kernels" sequence of Required Sequence of the maps for each
1541 map kernel in the code object. See
1542 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
1543 for the definition of the keys included
1544 in that map.
1545 ================= ============== ========= =======================================
1546
1547..
1548
1549 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
1550 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
1551
1552 =================================== ============== ========= ================================
1553 String Key Value Type Required? Description
1554 =================================== ============== ========= ================================
1555 ".name" string Required Source name of the kernel.
1556 ".symbol" string Required Name of the kernel
1557 descriptor ELF symbol.
1558 ".language" string Source language of the kernel.
1559 Values include:
1560
1561 - "OpenCL C"
1562 - "OpenCL C++"
1563 - "HCC"
1564 - "HIP"
1565 - "OpenMP"
1566 - "Assembler"
1567
1568 ".language_version" sequence of - The first integer is the major
1569 2 integers version.
1570 - The second integer is the
1571 minor version.
1572 ".args" sequence of Sequence of maps of the
1573 map kernel arguments. See
1574 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
1575 for the definition of the keys
1576 included in that map.
1577 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
1578 3 integers must be >=1 and the dispatch
1579 work-group size X, Y, Z must
1580 correspond to the specified
1581 values. Defaults to 0, 0, 0.
1582
1583 Corresponds to the OpenCL
1584 ``reqd_work_group_size``
1585 attribute.
1586 ".workgroup_size_hint" sequence of The dispatch work-group size
1587 3 integers X, Y, Z is likely to be the
1588 specified values.
1589
1590 Corresponds to the OpenCL
1591 ``work_group_size_hint``
1592 attribute.
1593 ".vec_type_hint" string The name of a scalar or vector
1594 type.
1595
1596 Corresponds to the OpenCL
1597 ``vec_type_hint`` attribute.
1598
1599 ".device_enqueue_symbol" string The external symbol name
1600 associated with a kernel.
1601 OpenCL runtime allocates a
1602 global buffer for the symbol
1603 and saves the kernel's address
1604 to it, which is used for
1605 device side enqueueing. Only
1606 available for device side
1607 enqueued kernels.
1608 ".kernarg_segment_size" integer Required The size in bytes of
1609 the kernarg segment
1610 that holds the values
1611 of the arguments to
1612 the kernel.
1613 ".group_segment_fixed_size" integer Required The amount of group
1614 segment memory
1615 required by a
1616 work-group in
1617 bytes. This does not
1618 include any
1619 dynamically allocated
1620 group segment memory
1621 that may be added
1622 when the kernel is
1623 dispatched.
1624 ".private_segment_fixed_size" integer Required The amount of fixed
1625 private address space
1626 memory required for a
1627 work-item in
1628 bytes. If the kernel
1629 uses a dynamic call
1630 stack then additional
1631 space must be added
1632 to this value for the
1633 call stack.
1634 ".kernarg_segment_align" integer Required The maximum byte
1635 alignment of
1636 arguments in the
1637 kernarg segment. Must
1638 be a power of 2.
1639 ".wavefront_size" integer Required Wavefront size. Must
1640 be a power of 2.
1641 ".sgpr_count" integer Required Number of scalar
1642 registers required by a
1643 wavefront for
1644 GFX6-GFX9. A register
1645 is required if it is
1646 used explicitly, or
1647 if a higher numbered
1648 register is used
1649 explicitly. This
1650 includes the special
1651 SGPRs for VCC, Flat
1652 Scratch (GFX7-GFX9)
1653 and XNACK (for
1654 GFX8-GFX9). It does
1655 not include the 16
1656 SGPR added if a trap
1657 handler is
1658 enabled. It is not
1659 rounded up to the
1660 allocation
1661 granularity.
1662 ".vgpr_count" integer Required Number of vector
1663 registers required by
1664 each work-item for
1665 GFX6-GFX9. A register
1666 is required if it is
1667 used explicitly, or
1668 if a higher numbered
1669 register is used
1670 explicitly.
1671 ".max_flat_workgroup_size" integer Required Maximum flat
1672 work-group size
1673 supported by the
1674 kernel in work-items.
1675 Must be >=1 and
1676 consistent with
1677 ReqdWorkGroupSize if
1678 not 0, 0, 0.
1679 ".sgpr_spill_count" integer Number of stores from
1680 a scalar register to
1681 a register allocator
1682 created spill
1683 location.
1684 ".vgpr_spill_count" integer Number of stores from
1685 a vector register to
1686 a register allocator
1687 created spill
1688 location.
1689 =================================== ============== ========= ================================
1690
1691..
1692
1693 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
1694 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
1695
1696 ====================== ============== ========= ================================
1697 String Key Value Type Required? Description
1698 ====================== ============== ========= ================================
1699 ".name" string Kernel argument name.
1700 ".type_name" string Kernel argument type name.
1701 ".size" integer Required Kernel argument size in bytes.
1702 ".offset" integer Required Kernel argument offset in
1703 bytes. The offset must be a
1704 multiple of the alignment
1705 required by the argument.
1706 ".value_kind" string Required Kernel argument kind that
1707 specifies how to set up the
1708 corresponding argument.
1709 Values include:
1710
1711 "by_value"
1712 The argument is copied
1713 directly into the kernarg.
1714
1715 "global_buffer"
1716 A global address space pointer
1717 to the buffer data is passed
1718 in the kernarg.
1719
1720 "dynamic_shared_pointer"
1721 A group address space pointer
1722 to dynamically allocated LDS
1723 is passed in the kernarg.
1724
1725 "sampler"
1726 A global address space
1727 pointer to a S# is passed in
1728 the kernarg.
1729
1730 "image"
1731 A global address space
1732 pointer to a T# is passed in
1733 the kernarg.
1734
1735 "pipe"
1736 A global address space pointer
1737 to an OpenCL pipe is passed in
1738 the kernarg.
1739
1740 "queue"
1741 A global address space pointer
1742 to an OpenCL device enqueue
1743 queue is passed in the
1744 kernarg.
1745
1746 "hidden_global_offset_x"
1747 The OpenCL grid dispatch
1748 global offset for the X
1749 dimension is passed in the
1750 kernarg.
1751
1752 "hidden_global_offset_y"
1753 The OpenCL grid dispatch
1754 global offset for the Y
1755 dimension is passed in the
1756 kernarg.
1757
1758 "hidden_global_offset_z"
1759 The OpenCL grid dispatch
1760 global offset for the Z
1761 dimension is passed in the
1762 kernarg.
1763
1764 "hidden_none"
1765 An argument that is not used
1766 by the kernel. Space needs to
1767 be left for it, but it does
1768 not need to be set up.
1769
1770 "hidden_printf_buffer"
1771 A global address space pointer
1772 to the runtime printf buffer
1773 is passed in kernarg.
1774
1775 "hidden_default_queue"
1776 A global address space pointer
1777 to the OpenCL device enqueue
1778 queue that should be used by
1779 the kernel by default is
1780 passed in the kernarg.
1781
1782 "hidden_completion_action"
1783 A global address space pointer
1784 to help link enqueued kernels into
1785 the ancestor tree for determining
1786 when the parent kernel has finished.
1787
1788 ".value_type" string Required Kernel argument value type. Only
1789 present if ".value_kind" is
1790 "by_value". For vector data
1791 types, the value is for the
1792 element type. Values include:
1793
1794 - "struct"
1795 - "i8"
1796 - "u8"
1797 - "i16"
1798 - "u16"
1799 - "f16"
1800 - "i32"
1801 - "u32"
1802 - "f32"
1803 - "i64"
1804 - "u64"
1805 - "f64"
1806
1807 .. TODO
1808 How can it be determined if a
1809 vector type, and what size
1810 vector?
1811 ".pointee_align" integer Alignment in bytes of pointee
1812 type for pointer type kernel
1813 argument. Must be a power
1814 of 2. Only present if
1815 ".value_kind" is
1816 "dynamic_shared_pointer".
1817 ".address_space" string Kernel argument address space
1818 qualifier. Only present if
1819 ".value_kind" is "global_buffer" or
1820 "dynamic_shared_pointer". Values
1821 are:
1822
1823 - "private"
1824 - "global"
1825 - "constant"
1826 - "local"
1827 - "generic"
1828 - "region"
1829
1830 .. TODO
1831 Is "global_buffer" only "global"
1832 or "constant"? Is
1833 "dynamic_shared_pointer" always
1834 "local"? Can HCC allow "generic"?
1835 How can "private" or "region"
1836 ever happen?
1837 ".access" string Kernel argument access
1838 qualifier. Only present if
1839 ".value_kind" is "image" or
1840 "pipe". Values
1841 are:
1842
1843 - "read_only"
1844 - "write_only"
1845 - "read_write"
1846
1847 .. TODO
1848 Does this apply to
1849 "global_buffer"?
1850 ".actual_access" string The actual memory accesses
1851 performed by the kernel on the
1852 kernel argument. Only present if
1853 ".value_kind" is "global_buffer",
1854 "image", or "pipe". This may be
1855 more restrictive than indicated
1856 by ".access" to reflect what the
1857 kernel actual does. If not
1858 present then the runtime must
1859 assume what is implied by
1860 ".access" and ".is_const" . Values
1861 are:
1862
1863 - "read_only"
1864 - "write_only"
1865 - "read_write"
1866
1867 ".is_const" boolean Indicates if the kernel argument
1868 is const qualified. Only present
1869 if ".value_kind" is
1870 "global_buffer".
1871
1872 ".is_restrict" boolean Indicates if the kernel argument
1873 is restrict qualified. Only
1874 present if ".value_kind" is
1875 "global_buffer".
1876
1877 ".is_volatile" boolean Indicates if the kernel argument
1878 is volatile qualified. Only
1879 present if ".value_kind" is
1880 "global_buffer".
1881
1882 ".is_pipe" boolean Indicates if the kernel argument
1883 is pipe qualified. Only present
1884 if ".value_kind" is "pipe".
1885
1886 .. TODO
1887 Can "global_buffer" be pipe
1888 qualified?
1889 ====================== ============== ========= ================================
1890
Tony Tye69865532017-06-06 20:31:59 +00001891..
1892
Tony Tye69865532017-06-06 20:31:59 +00001893Kernel Dispatch
1894~~~~~~~~~~~~~~~
1895
1896The HSA architected queuing language (AQL) defines a user space memory interface
1897that can be used to control the dispatch of kernels, in an agent independent
1898way. An agent can have zero or more AQL queues created for it using the ROCm
1899runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1900*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1901mechanics and packet layouts.
1902
1903The packet processor of a kernel agent is responsible for detecting and
1904dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1905packet processor is implemented by the hardware command processor (CP),
1906asynchronous dispatch controller (ADC) and shader processor input controller
1907(SPI).
1908
1909The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1910mode driver to initialize and register the AQL queue with CP.
1911
1912To dispatch a kernel the following actions are performed. This can occur in the
1913CPU host program, or from an HSA kernel executing on a GPU.
1914
19151. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1916 executed is obtained.
19172. A pointer to the kernel descriptor (see
1918 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1919 obtained. It must be for a kernel that is contained in a code object that that
1920 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1921 associated.
19223. Space is allocated for the kernel arguments using the ROCm runtime allocator
1923 for a memory region with the kernarg property for the kernel agent that will
1924 execute the kernel. It must be at least 16 byte aligned.
19254. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovcb5868c2017-10-19 17:12:55 +00001926 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tye69865532017-06-06 20:31:59 +00001927 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1928 memory in the same way constant memory is accessed. (Note that the HSA
1929 specification allows an implementation to copy the kernel argument contents to
1930 another location that is accessed by the kernel.)
19315. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1932 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1933 packet. The packet must be set up, and the final write must use an atomic
1934 store release to set the packet kind to ensure the packet contents are
1935 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1936 notify the kernel agent that the AQL queue has been updated. These rules, and
1937 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1938 System Architecture Specification* [HSA]_.
19396. A kernel dispatch packet includes information about the actual dispatch,
1940 such as grid and work-group size, together with information from the code
1941 object about the kernel, such as segment sizes. The ROCm runtime queries on
1942 the kernel symbol can be used to obtain the code object values which are
Scott Linder43cbf8d2018-06-21 19:38:56 +00001943 recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
Tony Tye69865532017-06-06 20:31:59 +000019447. CP executes micro-code and is responsible for detecting and setting up the
1945 GPU to execute the wavefronts of a kernel dispatch.
19468. CP ensures that when the a wavefront starts executing the kernel machine
1947 code, the scalar general purpose registers (SGPR) and vector general purpose
1948 registers (VGPR) are set up as required by the machine code. The required
1949 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1950 register state is defined in
1951 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
19529. The prolog of the kernel machine code (see
1953 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1954 before continuing executing the machine code that corresponds to the kernel.
195510. When the kernel dispatch has completed execution, CP signals the completion
1956 signal specified in the kernel dispatch packet if not 0.
1957
1958.. _amdgpu-amdhsa-memory-spaces:
1959
1960Memory Spaces
1961~~~~~~~~~~~~~
1962
1963The memory space properties are:
1964
1965 .. table:: AMDHSA Memory Spaces
1966 :name: amdgpu-amdhsa-memory-spaces-table
1967
1968 ================= =========== ======== ======= ==================
1969 Memory Space Name HSA Segment Hardware Address NULL Value
1970 Name Name Size
1971 ================= =========== ======== ======= ==================
1972 Private private scratch 32 0x00000000
1973 Local group LDS 32 0xFFFFFFFF
1974 Global global global 64 0x0000000000000000
1975 Constant constant *same as 64 0x0000000000000000
1976 global*
1977 Generic flat flat 64 0x0000000000000000
1978 Region N/A GDS 32 *not implemented
1979 for AMDHSA*
1980 ================= =========== ======== ======= ==================
1981
1982The global and constant memory spaces both use global virtual addresses, which
1983are the same virtual address space used by the CPU. However, some virtual
1984addresses may only be accessible to the CPU, some only accessible by the GPU,
1985and some by both.
1986
1987Using the constant memory space indicates that the data will not change during
1988the execution of the kernel. This allows scalar read instructions to be
1989used. The vector and scalar L1 caches are invalidated of volatile data before
1990each kernel dispatch execution to allow constant memory to change values between
1991kernel dispatches.
1992
1993The local memory space uses the hardware Local Data Store (LDS) which is
1994automatically allocated when the hardware creates work-groups of wavefronts, and
1995freed when all the wavefronts of a work-group have terminated. The data store
1996(DS) instructions can be used to access it.
1997
1998The private memory space uses the hardware scratch memory support. If the kernel
1999uses scratch, then the hardware allocates memory that is accessed using
2000wavefront lane dword (4 byte) interleaving. The mapping used from private
2001address to physical address is:
2002
2003 ``wavefront-scratch-base +
2004 (private-address * wavefront-size * 4) +
2005 (wavefront-lane-id * 4)``
2006
2007There are different ways that the wavefront scratch base address is determined
2008by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
2009memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye636e2232018-03-08 05:46:01 +00002010the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tye69865532017-06-06 20:31:59 +00002011instructions, or by flat instructions. If each lane of a wavefront accesses the
2012same private address, the interleaving results in adjacent dwords being accessed
2013and hence requires fewer cache lines to be fetched. Multi-dword access is not
2014supported except by flat and scratch instructions in GFX9.
2015
2016The generic address space uses the hardware flat address support available in
2017GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
2018local appertures), that are outside the range of addressible global memory, to
2019map from a flat address to a private or local address.
2020
2021FLAT instructions can take a flat address and access global, private (scratch)
2022and group (LDS) memory depending in if the address is within one of the
2023apperture ranges. Flat access to scratch requires hardware aperture setup and
2024setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
2025access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
2026(see :ref:`amdgpu-amdhsa-m0`).
2027
2028To convert between a segment address and a flat address the base address of the
2029appertures address can be used. For GFX7-GFX8 these are available in the
2030:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
2031Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
2032GFX9 the appature base addresses are directly available as inline constant
2033registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
2034address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
2035which makes it easier to convert from flat to segment or segment to flat.
2036
Tony Tye1a0450f2017-08-15 20:47:41 +00002037Image and Samplers
2038~~~~~~~~~~~~~~~~~~
Tony Tye69865532017-06-06 20:31:59 +00002039
2040Image and sample handles created by the ROCm runtime are 64 bit addresses of a
2041hardware 32 byte V# and 48 byte S# object respectively. In order to support the
2042HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
2043enumeration values for the queries that are not trivially deducible from the S#
2044representation.
2045
2046HSA Signals
2047~~~~~~~~~~~
2048
Tony Tye1a0450f2017-08-15 20:47:41 +00002049HSA signal handles created by the ROCm runtime are 64 bit addresses of a
2050structure allocated in memory accessible from both the CPU and GPU. The
2051structure is defined by the ROCm runtime and subject to change between releases
2052(see [AMD-ROCm-github]_).
Tony Tye69865532017-06-06 20:31:59 +00002053
2054.. _amdgpu-amdhsa-hsa-aql-queue:
2055
2056HSA AQL Queue
2057~~~~~~~~~~~~~
2058
Tony Tye1a0450f2017-08-15 20:47:41 +00002059The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye69865532017-06-06 20:31:59 +00002060between releases (see [AMD-ROCm-github]_). For some processors it contains
2061fields needed to implement certain language features such as the flat address
2062aperture bases. It also contains fields used by CP such as managing the
2063allocation of scratch memory.
2064
2065.. _amdgpu-amdhsa-kernel-descriptor:
2066
2067Kernel Descriptor
2068~~~~~~~~~~~~~~~~~
2069
2070A kernel descriptor consists of the information needed by CP to initiate the
2071execution of a kernel, including the entry point address of the machine code
2072that implements the kernel.
2073
2074Kernel Descriptor for GFX6-GFX9
2075+++++++++++++++++++++++++++++++
2076
Scott Linder43cbf8d2018-06-21 19:38:56 +00002077CP microcode requires the Kernel descriptor to be allocated on 64 byte
2078alignment.
Tony Tye69865532017-06-06 20:31:59 +00002079
2080 .. table:: Kernel Descriptor for GFX6-GFX9
2081 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
2082
Tony Tye0a092202017-10-18 22:16:55 +00002083 ======= ======= =============================== ============================
Tony Tye69865532017-06-06 20:31:59 +00002084 Bits Size Field Name Description
Tony Tye0a092202017-10-18 22:16:55 +00002085 ======= ======= =============================== ============================
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002086 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
Tony Tye69865532017-06-06 20:31:59 +00002087 address space memory
2088 required for a work-group
2089 in bytes. This does not
2090 include any dynamically
2091 allocated local address
2092 space memory that may be
2093 added when the kernel is
2094 dispatched.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002095 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
Tony Tye69865532017-06-06 20:31:59 +00002096 private address space
2097 memory required for a
2098 work-item in bytes. If
2099 is_dynamic_callstack is 1
2100 then additional space must
2101 be added to this value for
2102 the call stack.
Tony Tye9000e8c2017-11-10 01:00:54 +00002103 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002104 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
Tony Tye69865532017-06-06 20:31:59 +00002105 negative) from base
2106 address of kernel
2107 descriptor to kernel's
2108 entry point instruction
2109 which must be 256 byte
2110 aligned.
Tony Tyeae5d34e2018-01-30 23:07:10 +00002111 383:192 24 Reserved, must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002112 bytes
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002113 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
Tony Tye69865532017-06-06 20:31:59 +00002114 program settings used by
2115 CP to set up
2116 ``COMPUTE_PGM_RSRC1``
2117 configuration
2118 register. See
Tony Tye0a092202017-10-18 22:16:55 +00002119 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002120 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
Tony Tye69865532017-06-06 20:31:59 +00002121 program settings used by
2122 CP to set up
2123 ``COMPUTE_PGM_RSRC2``
2124 configuration
2125 register. See
2126 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002127 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
2128 _BUFFER SGPR user data registers
Tony Tye69865532017-06-06 20:31:59 +00002129 (see
2130 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2131
2132 The total number of SGPR
2133 user data registers
2134 requested must not exceed
2135 16 and match value in
2136 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
2137 Any requests beyond 16
2138 will be ignored.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002139 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above*
2140 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above*
2141 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
2142 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above*
2143 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above*
2144 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above*
2145 _SIZE
Konstantin Zhuravlyov33168832018-06-21 18:36:04 +00002146 455 1 bit Reserved, must be 0.
2147 511:456 8 bytes Reserved, must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002148 512 **Total size 64 bytes.**
Tony Tye0a092202017-10-18 22:16:55 +00002149 ======= ====================================================================
Tony Tye69865532017-06-06 20:31:59 +00002150
2151..
2152
2153 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye0a092202017-10-18 22:16:55 +00002154 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye69865532017-06-06 20:31:59 +00002155
Tony Tye04631fc2017-06-07 00:46:08 +00002156 ======= ======= =============================== ===========================================================================
Tony Tye69865532017-06-06 20:31:59 +00002157 Bits Size Field Name Description
Tony Tye04631fc2017-06-07 00:46:08 +00002158 ======= ======= =============================== ===========================================================================
Scott Linder43cbf8d2018-06-21 19:38:56 +00002159 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
2160 blocks used by each work-item;
Tony Tye69865532017-06-06 20:31:59 +00002161 granularity is device
2162 specific:
2163
Tony Tye9000e8c2017-11-10 01:00:54 +00002164 GFX6-GFX9
Scott Linder43cbf8d2018-06-21 19:38:56 +00002165 - vgprs_used 0..256
2166 - max(0, ceil(vgprs_used / 4) - 1)
2167
2168 Where vgprs_used is defined
2169 as the highest VGPR number
2170 explicitly referenced plus
2171 one.
Tony Tye69865532017-06-06 20:31:59 +00002172
2173 Used by CP to set up
2174 ``COMPUTE_PGM_RSRC1.VGPRS``.
Scott Linder43cbf8d2018-06-21 19:38:56 +00002175
2176 The
2177 :ref:`amdgpu-assembler`
2178 calculates this
2179 automatically for the
2180 selected processor from
2181 values provided to the
2182 `.amdhsa_kernel` directive
2183 by the
2184 `.amdhsa_next_free_vgpr`
2185 nested directive (see
2186 :ref:`amdhsa-kernel-directives-table`).
2187 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
2188 blocks used by a wavefront;
Tony Tye69865532017-06-06 20:31:59 +00002189 granularity is device
2190 specific:
2191
Tony Tye9000e8c2017-11-10 01:00:54 +00002192 GFX6-GFX8
Scott Linder43cbf8d2018-06-21 19:38:56 +00002193 - sgprs_used 0..112
2194 - max(0, ceil(sgprs_used / 8) - 1)
Tony Tye69865532017-06-06 20:31:59 +00002195 GFX9
Scott Linder43cbf8d2018-06-21 19:38:56 +00002196 - sgprs_used 0..112
2197 - 2 * max(0, ceil(sgprs_used / 16) - 1)
Tony Tye69865532017-06-06 20:31:59 +00002198
Scott Linder43cbf8d2018-06-21 19:38:56 +00002199 Where sgprs_used is
2200 defined as the highest
2201 SGPR number explicitly
2202 referenced plus one, plus
2203 a target-specific number
2204 of additional special
2205 SGPRs for VCC,
2206 FLAT_SCRATCH (GFX7+) and
2207 XNACK_MASK (GFX8+), and
2208 any additional
2209 target-specific
2210 limitations. It does not
2211 include the 16 SGPRs added
2212 if a trap handler is
Tony Tye69865532017-06-06 20:31:59 +00002213 enabled.
2214
Scott Linder43cbf8d2018-06-21 19:38:56 +00002215 The target-specific
2216 limitations and special
2217 SGPR layout are defined in
2218 the hardware
2219 documentation, which can
2220 be found in the
2221 :ref:`amdgpu-processors`
2222 table.
2223
Tony Tye69865532017-06-06 20:31:59 +00002224 Used by CP to set up
2225 ``COMPUTE_PGM_RSRC1.SGPRS``.
Scott Linder43cbf8d2018-06-21 19:38:56 +00002226
2227 The
2228 :ref:`amdgpu-assembler`
2229 calculates this
2230 automatically for the
2231 selected processor from
2232 values provided to the
2233 `.amdhsa_kernel` directive
2234 by the
2235 `.amdhsa_next_free_sgpr`
2236 and `.amdhsa_reserve_*`
2237 nested directives (see
2238 :ref:`amdhsa-kernel-directives-table`).
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002239 11:10 2 bits PRIORITY Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002240
2241 Start executing wavefront
2242 at the specified priority.
2243
2244 CP is responsible for
2245 filling in
2246 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002247 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye69865532017-06-06 20:31:59 +00002248 with specified rounding
2249 mode for single (32
2250 bit) floating point
2251 precision floating point
2252 operations.
2253
2254 Floating point rounding
2255 mode values are defined in
2256 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2257
2258 Used by CP to set up
2259 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002260 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye69865532017-06-06 20:31:59 +00002261 with specified rounding
2262 denorm mode for half/double (16
2263 and 64 bit) floating point
2264 precision floating point
2265 operations.
2266
2267 Floating point rounding
2268 mode values are defined in
2269 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2270
2271 Used by CP to set up
2272 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002273 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye69865532017-06-06 20:31:59 +00002274 with specified denorm mode
2275 for single (32
2276 bit) floating point
2277 precision floating point
2278 operations.
2279
2280 Floating point denorm mode
2281 values are defined in
2282 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2283
2284 Used by CP to set up
2285 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002286 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye69865532017-06-06 20:31:59 +00002287 with specified denorm mode
2288 for half/double (16
2289 and 64 bit) floating point
2290 precision floating point
2291 operations.
2292
2293 Floating point denorm mode
2294 values are defined in
2295 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2296
2297 Used by CP to set up
2298 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002299 20 1 bit PRIV Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002300
2301 Start executing wavefront
2302 in privilege trap handler
2303 mode.
2304
2305 CP is responsible for
2306 filling in
2307 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002308 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye69865532017-06-06 20:31:59 +00002309 with DX10 clamp mode
2310 enabled. Used by the vector
Tony Tye0a092202017-10-18 22:16:55 +00002311 ALU to force DX10 style
Tony Tye69865532017-06-06 20:31:59 +00002312 treatment of NaN's (when
2313 set, clamp NaN to zero,
2314 otherwise pass NaN
2315 through).
2316
2317 Used by CP to set up
2318 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002319 22 1 bit DEBUG_MODE Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002320
2321 Start executing wavefront
2322 in single step mode.
2323
2324 CP is responsible for
2325 filling in
2326 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002327 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye69865532017-06-06 20:31:59 +00002328 with IEEE mode
2329 enabled. Floating point
2330 opcodes that support
2331 exception flag gathering
2332 will quiet and propagate
2333 signaling-NaN inputs per
2334 IEEE 754-2008. Min_dx10 and
2335 max_dx10 become IEEE
2336 754-2008 compliant due to
2337 signaling-NaN propagation
2338 and quieting.
2339
2340 Used by CP to set up
2341 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002342 24 1 bit BULKY Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002343
2344 Only one work-group allowed
2345 to execute on a compute
2346 unit.
2347
2348 CP is responsible for
2349 filling in
2350 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002351 25 1 bit CDBG_USER Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002352
2353 Flag that can be used to
2354 control debugging code.
2355
2356 CP is responsible for
2357 filling in
2358 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye9000e8c2017-11-10 01:00:54 +00002359 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye0a092202017-10-18 22:16:55 +00002360 Reserved, must be 0.
2361 GFX9
2362 Wavefront starts execution
2363 with specified fp16 overflow
2364 mode.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002365
Tony Tye0a092202017-10-18 22:16:55 +00002366 - If 0, fp16 overflow generates
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002367 +/-INF values.
Tony Tye0a092202017-10-18 22:16:55 +00002368 - If 1, fp16 overflow that is the
2369 result of an +/-INF input value
2370 or divide by 0 produces a +/-INF,
2371 otherwise clamps computed
2372 overflow to +/-MAX_FP16 as
2373 appropriate.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002374
2375 Used by CP to set up
2376 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye0a092202017-10-18 22:16:55 +00002377 31:27 5 bits Reserved, must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002378 32 **Total size 4 bytes**
Tony Tye04631fc2017-06-07 00:46:08 +00002379 ======= ===================================================================================================================
Tony Tye69865532017-06-06 20:31:59 +00002380
2381..
2382
2383 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
2384 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
2385
Tony Tye04631fc2017-06-07 00:46:08 +00002386 ======= ======= =============================== ===========================================================================
Tony Tye69865532017-06-06 20:31:59 +00002387 Bits Size Field Name Description
Tony Tye04631fc2017-06-07 00:46:08 +00002388 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002389 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye636e2232018-03-08 05:46:01 +00002390 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye69865532017-06-06 20:31:59 +00002391 system register (see
2392 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2393
2394 Used by CP to set up
2395 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002396 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye69865532017-06-06 20:31:59 +00002397 user data registers
2398 requested. This number must
2399 match the number of user
2400 data registers enabled.
2401
2402 Used by CP to set up
2403 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov840f4232018-05-29 19:09:13 +00002404 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002405
Konstantin Zhuravlyov840f4232018-05-29 19:09:13 +00002406 This bit represents
2407 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
2408 which is set by the CP if
2409 the runtime has installed a
2410 trap handler.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002411 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye69865532017-06-06 20:31:59 +00002412 system SGPR register for
2413 the work-group id in the X
2414 dimension (see
2415 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2416
2417 Used by CP to set up
2418 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002419 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye69865532017-06-06 20:31:59 +00002420 system SGPR register for
2421 the work-group id in the Y
2422 dimension (see
2423 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2424
2425 Used by CP to set up
2426 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002427 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye69865532017-06-06 20:31:59 +00002428 system SGPR register for
2429 the work-group id in the Z
2430 dimension (see
2431 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2432
2433 Used by CP to set up
2434 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002435 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye69865532017-06-06 20:31:59 +00002436 system SGPR register for
2437 work-group information (see
2438 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2439
2440 Used by CP to set up
2441 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002442 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye69865532017-06-06 20:31:59 +00002443 VGPR system registers used
2444 for the work-item ID.
2445 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
2446 defines the values.
2447
2448 Used by CP to set up
2449 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002450 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002451
2452 Wavefront starts execution
2453 with address watch
2454 exceptions enabled which
2455 are generated when L1 has
2456 witnessed a thread access
2457 an *address of
2458 interest*.
2459
2460 CP is responsible for
2461 filling in the address
2462 watch bit in
2463 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2464 according to what the
2465 runtime requests.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002466 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002467
2468 Wavefront starts execution
2469 with memory violation
2470 exceptions exceptions
2471 enabled which are generated
2472 when a memory violation has
Tony Tye636e2232018-03-08 05:46:01 +00002473 occurred for this wavefront from
Tony Tye69865532017-06-06 20:31:59 +00002474 L1 or LDS
2475 (write-to-read-only-memory,
2476 mis-aligned atomic, LDS
2477 address out of range,
2478 illegal address, etc.).
2479
2480 CP sets the memory
2481 violation bit in
2482 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2483 according to what the
2484 runtime requests.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002485 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002486
2487 CP uses the rounded value
2488 from the dispatch packet,
2489 not this value, as the
2490 dispatch may contain
2491 dynamically allocated group
2492 segment memory. CP writes
2493 directly to
2494 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
2495
2496 Amount of group segment
2497 (LDS) to allocate for each
2498 work-group. Granularity is
2499 device specific:
2500
2501 GFX6:
2502 roundup(lds-size / (64 * 4))
2503 GFX7-GFX9:
2504 roundup(lds-size / (128 * 4))
2505
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002506 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
2507 _INVALID_OPERATION with specified exceptions
Tony Tye69865532017-06-06 20:31:59 +00002508 enabled.
2509
2510 Used by CP to set up
2511 ``COMPUTE_PGM_RSRC2.EXCP_EN``
2512 (set from bits 0..6).
2513
2514 IEEE 754 FP Invalid
2515 Operation
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002516 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
2517 _SOURCE input operands is a
Tony Tye69865532017-06-06 20:31:59 +00002518 denormal number
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002519 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
2520 _DIVISION_BY_ZERO Zero
2521 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
2522 _OVERFLOW
2523 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
2524 _UNDERFLOW
2525 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
2526 _INEXACT
2527 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
2528 _ZERO (rcp_iflag_f32 instruction
Tony Tye69865532017-06-06 20:31:59 +00002529 only)
Tony Tye0a092202017-10-18 22:16:55 +00002530 31 1 bit Reserved, must be 0.
Tony Tye69865532017-06-06 20:31:59 +00002531 32 **Total size 4 bytes.**
Tony Tye04631fc2017-06-07 00:46:08 +00002532 ======= ===================================================================================================================
Tony Tye69865532017-06-06 20:31:59 +00002533
2534..
2535
2536 .. table:: Floating Point Rounding Mode Enumeration Values
2537 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
2538
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002539 ====================================== ===== ==============================
2540 Enumeration Name Value Description
2541 ====================================== ===== ==============================
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002542 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
2543 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
2544 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
2545 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002546 ====================================== ===== ==============================
Tony Tye69865532017-06-06 20:31:59 +00002547
2548..
2549
2550 .. table:: Floating Point Denorm Mode Enumeration Values
2551 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2552
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002553 ====================================== ===== ==============================
2554 Enumeration Name Value Description
2555 ====================================== ===== ==============================
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002556 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002557 Denorms
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002558 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
2559 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
2560 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002561 ====================================== ===== ==============================
Tony Tye69865532017-06-06 20:31:59 +00002562
2563..
2564
2565 .. table:: System VGPR Work-Item ID Enumeration Values
2566 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2567
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002568 ======================================== ===== ============================
2569 Enumeration Name Value Description
2570 ======================================== ===== ============================
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002571 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002572 ID.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002573 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002574 dimensions ID.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002575 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002576 dimensions ID.
Konstantin Zhuravlyov299cf5f2018-06-12 18:02:46 +00002577 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
Konstantin Zhuravlyov755155f2017-10-14 19:17:08 +00002578 ======================================== ===== ============================
Tony Tye69865532017-06-06 20:31:59 +00002579
2580.. _amdgpu-amdhsa-initial-kernel-execution-state:
2581
2582Initial Kernel Execution State
2583~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2584
2585This section defines the register state that will be set up by the packet
2586processor prior to the start of execution of every wavefront. This is limited by
2587the constraints of the hardware controllers of CP/ADC/SPI.
2588
2589The order of the SGPR registers is defined, but the compiler can specify which
2590ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2591fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2592for enabled registers are dense starting at SGPR0: the first enabled register is
2593SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2594an SGPR number.
2595
2596The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye636e2232018-03-08 05:46:01 +00002597all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye69865532017-06-06 20:31:59 +00002598the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2599initialized. These are then immediately followed by the System SGPRs that are
Tony Tye636e2232018-03-08 05:46:01 +00002600set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye69865532017-06-06 20:31:59 +00002601dispatch.
2602
2603SGPR register initial state is defined in
2604:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2605
2606 .. table:: SGPR Register Set Up Order
2607 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2608
2609 ========== ========================== ====== ==============================
2610 SGPR Order Name Number Description
2611 (kernel descriptor enable of
2612 field) SGPRs
2613 ========== ========================== ====== ==============================
2614 First Private Segment Buffer 4 V# that can be used, together
Tony Tye636e2232018-03-08 05:46:01 +00002615 (enable_sgpr_private with Scratch Wavefront Offset
2616 _segment_buffer) as an offset, to access the
2617 private memory space using a
2618 segment address.
Tony Tye69865532017-06-06 20:31:59 +00002619
2620 CP uses the value provided by
2621 the runtime.
2622 then Dispatch Ptr 2 64 bit address of AQL dispatch
2623 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2624 actually executing.
2625 then Queue Ptr 2 64 bit address of amd_queue_t
2626 (enable_sgpr_queue_ptr) object for AQL queue on which
2627 the dispatch packet was
2628 queued.
2629 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2630 (enable_sgpr_kernarg segment. This is directly
2631 _segment_ptr) copied from the
2632 kernarg_address in the kernel
2633 dispatch packet.
2634
2635 Having CP load it once avoids
2636 loading it at the beginning of
2637 every wavefront.
2638 then Dispatch Id 2 64 bit Dispatch ID of the
2639 (enable_sgpr_dispatch_id) dispatch packet being
2640 executed.
2641 then Flat Scratch Init 2 This is 2 SGPRs:
2642 (enable_sgpr_flat_scratch
2643 _init) GFX6
2644 Not supported.
2645 GFX7-GFX8
2646 The first SGPR is a 32 bit
2647 byte offset from
2648 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2649 to per SPI base of memory
2650 for scratch for the queue
2651 executing the kernel
2652 dispatch. CP obtains this
Tony Tye1a0450f2017-08-15 20:47:41 +00002653 from the runtime. (The
2654 Scratch Segment Buffer base
2655 address is
2656 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2657 plus this offset.) The value
Tony Tye636e2232018-03-08 05:46:01 +00002658 of Scratch Wavefront Offset must
Tony Tye1a0450f2017-08-15 20:47:41 +00002659 be added to this offset by
2660 the kernel machine code,
2661 right shifted by 8, and
2662 moved to the FLAT_SCRATCH_HI
2663 SGPR register.
2664 FLAT_SCRATCH_HI corresponds
2665 to SGPRn-4 on GFX7, and
2666 SGPRn-6 on GFX8 (where SGPRn
2667 is the highest numbered SGPR
Tony Tye636e2232018-03-08 05:46:01 +00002668 allocated to the wavefront).
Tony Tye1a0450f2017-08-15 20:47:41 +00002669 FLAT_SCRATCH_HI is
2670 multiplied by 256 (as it is
2671 in units of 256 bytes) and
2672 added to
2673 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye636e2232018-03-08 05:46:01 +00002674 to calculate the per wavefront
Tony Tye1a0450f2017-08-15 20:47:41 +00002675 FLAT SCRATCH BASE in flat
2676 memory instructions that
2677 access the scratch
2678 apperture.
Tony Tye69865532017-06-06 20:31:59 +00002679
2680 The second SGPR is 32 bit
2681 byte size of a single
Konstantin Zhuravlyovcb5868c2017-10-19 17:12:55 +00002682 work-item's scratch memory
Tony Tye1a0450f2017-08-15 20:47:41 +00002683 usage. CP obtains this from
2684 the runtime, and it is
2685 always a multiple of DWORD.
2686 CP checks that the value in
2687 the kernel dispatch packet
2688 Private Segment Byte Size is
2689 not larger, and requests the
2690 runtime to increase the
2691 queue's scratch size if
2692 necessary. The kernel code
2693 must move it to
2694 FLAT_SCRATCH_LO which is
2695 SGPRn-3 on GFX7 and SGPRn-5
2696 on GFX8. FLAT_SCRATCH_LO is
2697 used as the FLAT SCRATCH
2698 SIZE in flat memory
Tony Tye69865532017-06-06 20:31:59 +00002699 instructions. Having CP load
2700 it once avoids loading it at
2701 the beginning of every
Tony Tye2f2cb6b2017-11-10 20:51:43 +00002702 wavefront.
2703 GFX9
2704 This is the
Tony Tye1a0450f2017-08-15 20:47:41 +00002705 64 bit base address of the
2706 per SPI scratch backing
2707 memory managed by SPI for
2708 the queue executing the
2709 kernel dispatch. CP obtains
2710 this from the runtime (and
Tony Tye69865532017-06-06 20:31:59 +00002711 divides it if there are
2712 multiple Shader Arrays each
2713 with its own SPI). The value
Tony Tye636e2232018-03-08 05:46:01 +00002714 of Scratch Wavefront Offset must
Tony Tye69865532017-06-06 20:31:59 +00002715 be added by the kernel
Tony Tye1a0450f2017-08-15 20:47:41 +00002716 machine code and the result
2717 moved to the FLAT_SCRATCH
2718 SGPR which is SGPRn-6 and
2719 SGPRn-5. It is used as the
2720 FLAT SCRATCH BASE in flat
Tony Tye2f2cb6b2017-11-10 20:51:43 +00002721 memory instructions.
2722 then Private Segment Size 1 The 32 bit byte size of a
2723 (enable_sgpr_private single
2724 work-item's
2725 scratch_segment_size) memory
2726 allocation. This is the
2727 value from the kernel
2728 dispatch packet Private
2729 Segment Byte Size rounded up
2730 by CP to a multiple of
2731 DWORD.
Tony Tye69865532017-06-06 20:31:59 +00002732
2733 Having CP load it once avoids
2734 loading it at the beginning of
2735 every wavefront.
2736
2737 This is not used for
2738 GFX7-GFX8 since it is the same
2739 value as the second SGPR of
2740 Flat Scratch Init. However, it
2741 may be needed for GFX9 which
2742 changes the meaning of the
2743 Flat Scratch Init value.
2744 then Grid Work-Group Count X 1 32 bit count of the number of
2745 (enable_sgpr_grid work-groups in the X dimension
2746 _workgroup_count_X) for the grid being
2747 executed. Computed from the
2748 fields in the kernel dispatch
2749 packet as ((grid_size.x +
2750 workgroup_size.x - 1) /
2751 workgroup_size.x).
2752 then Grid Work-Group Count Y 1 32 bit count of the number of
2753 (enable_sgpr_grid work-groups in the Y dimension
2754 _workgroup_count_Y && for the grid being
2755 less than 16 previous executed. Computed from the
2756 SGPRs) fields in the kernel dispatch
2757 packet as ((grid_size.y +
2758 workgroup_size.y - 1) /
2759 workgroupSize.y).
2760
2761 Only initialized if <16
2762 previous SGPRs initialized.
2763 then Grid Work-Group Count Z 1 32 bit count of the number of
2764 (enable_sgpr_grid work-groups in the Z dimension
2765 _workgroup_count_Z && for the grid being
2766 less than 16 previous executed. Computed from the
2767 SGPRs) fields in the kernel dispatch
2768 packet as ((grid_size.z +
2769 workgroup_size.z - 1) /
2770 workgroupSize.z).
2771
2772 Only initialized if <16
2773 previous SGPRs initialized.
2774 then Work-Group Id X 1 32 bit work-group id in X
2775 (enable_sgpr_workgroup_id dimension of grid for
2776 _X) wavefront.
2777 then Work-Group Id Y 1 32 bit work-group id in Y
2778 (enable_sgpr_workgroup_id dimension of grid for
2779 _Y) wavefront.
2780 then Work-Group Id Z 1 32 bit work-group id in Z
2781 (enable_sgpr_workgroup_id dimension of grid for
2782 _Z) wavefront.
Tony Tye636e2232018-03-08 05:46:01 +00002783 then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye69865532017-06-06 20:31:59 +00002784 (enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye636e2232018-03-08 05:46:01 +00002785 _info) threadgroup_size_in_wavefronts[5:0]}
2786 then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye69865532017-06-06 20:31:59 +00002787 (enable_sgpr_private of scratch base of queue
Tony Tye636e2232018-03-08 05:46:01 +00002788 _segment_wavefront_offset) executing the kernel
Tony Tye69865532017-06-06 20:31:59 +00002789 dispatch. Must be used as an
2790 offset with Private
2791 segment address when using
2792 Scratch Segment Buffer. It
2793 must be used to set up FLAT
2794 SCRATCH for flat addressing
2795 (see
2796 :ref:`amdgpu-amdhsa-flat-scratch`).
2797 ========== ========================== ====== ==============================
2798
2799The order of the VGPR registers is defined, but the compiler can specify which
2800ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2801fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2802for enabled registers are dense starting at VGPR0: the first enabled register is
2803VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2804VGPR number.
2805
2806VGPR register initial state is defined in
2807:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2808
2809 .. table:: VGPR Register Set Up Order
2810 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2811
2812 ========== ========================== ====== ==============================
2813 VGPR Order Name Number Description
2814 (kernel descriptor enable of
2815 field) VGPRs
2816 ========== ========================== ====== ==============================
2817 First Work-Item Id X 1 32 bit work item id in X
2818 (Always initialized) dimension of work-group for
2819 wavefront lane.
2820 then Work-Item Id Y 1 32 bit work item id in Y
2821 (enable_vgpr_workitem_id dimension of work-group for
2822 > 0) wavefront lane.
2823 then Work-Item Id Z 1 32 bit work item id in Z
2824 (enable_vgpr_workitem_id dimension of work-group for
2825 > 1) wavefront lane.
2826 ========== ========================== ====== ==============================
2827
Hiroshi Inoueef1bc2d2018-04-12 05:53:20 +00002828The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye69865532017-06-06 20:31:59 +00002829
28301. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2831 registers.
28322. Work-group Id registers X, Y, Z are set by ADC which supports any
2833 combination including none.
Tony Tye636e2232018-03-08 05:46:01 +000028343. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2835 its value cannot included with the flat scratch init value which is per queue.
Tony Tye69865532017-06-06 20:31:59 +000028364. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2837 or (X, Y, Z).
2838
2839Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2840value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2841
2842The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye9000e8c2017-11-10 01:00:54 +00002843has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye69865532017-06-06 20:31:59 +00002844instructions (GFX9).
2845
2846If buffer operations are used then the compiler can generate a V# with the
2847following properties:
2848
2849* base address of 0
2850* no swizzle
2851* ATC: 1 if IOMMU present (such as APU)
2852* ptr64: 1
2853* MTYPE set to support memory coherence that matches the runtime (such as CC for
2854 APU and NC for dGPU).
2855
2856.. _amdgpu-amdhsa-kernel-prolog:
2857
2858Kernel Prolog
2859~~~~~~~~~~~~~
2860
2861.. _amdgpu-amdhsa-m0:
2862
2863M0
2864++
2865
2866GFX6-GFX8
2867 The M0 register must be initialized with a value at least the total LDS size
2868 if the kernel may access LDS via DS or flat operations. Total LDS size is
2869 available in dispatch packet. For M0, it is also possible to use maximum
2870 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2871 GFX7-GFX8).
2872GFX9
2873 The M0 register is not used for range checking LDS accesses and so does not
2874 need to be initialized in the prolog.
2875
2876.. _amdgpu-amdhsa-flat-scratch:
2877
2878Flat Scratch
2879++++++++++++
2880
2881If the kernel may use flat operations to access scratch memory, the prolog code
2882must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye636e2232018-03-08 05:46:01 +00002883are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye69865532017-06-06 20:31:59 +00002884Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2885
2886GFX6
2887 Flat scratch is not supported.
2888
Tony Tye9000e8c2017-11-10 01:00:54 +00002889GFX7-GFX8
Tony Tye69865532017-06-06 20:31:59 +00002890 1. The low word of Flat Scratch Init is 32 bit byte offset from
2891 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2892 being managed by SPI for the queue executing the kernel dispatch. This is
2893 the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye636e2232018-03-08 05:46:01 +00002894 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye69865532017-06-06 20:31:59 +00002895 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2896 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2897 by 8 before moving into FLAT_SCRATCH_LO.
2898 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2899 work-items scratch memory usage. This is directly loaded from the kernel
2900 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2901 DWORD. Having CP load it once avoids loading it at the beginning of every
2902 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2903 SIZE.
Tony Tye2f2cb6b2017-11-10 20:51:43 +00002904
Tony Tye69865532017-06-06 20:31:59 +00002905GFX9
2906 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2907 memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye636e2232018-03-08 05:46:01 +00002908 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tye69865532017-06-06 20:31:59 +00002909 pair for use as the flat scratch base in flat memory instructions.
2910
2911.. _amdgpu-amdhsa-memory-model:
2912
2913Memory Model
2914~~~~~~~~~~~~
2915
2916This section describes the mapping of LLVM memory model onto AMDGPU machine code
2917(see :ref:`memmodel`). *The implementation is WIP.*
2918
2919.. TODO
2920 Update when implementation complete.
2921
Tony Tye69865532017-06-06 20:31:59 +00002922The AMDGPU backend supports the memory synchronization scopes specified in
2923:ref:`amdgpu-memory-scopes`.
2924
2925The code sequences used to implement the memory model are defined in table
2926:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2927
2928The sequences specify the order of instructions that a single thread must
2929execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2930to other memory instructions executed by the same thread. This allows them to be
2931moved earlier or later which can allow them to be combined with other instances
2932of the same instruction, or hoisted/sunk out of loops to improve
2933performance. Only the instructions related to the memory model are given;
2934additional ``s_waitcnt`` instructions are required to ensure registers are
2935defined before being used. These may be able to be combined with the memory
2936model ``s_waitcnt`` instructions as described above.
2937
Tony Tye0a092202017-10-18 22:16:55 +00002938The AMDGPU backend supports the following memory models:
2939
2940 HSA Memory Model [HSA]_
2941 The HSA memory model uses a single happens-before relation for all address
2942 spaces (see :ref:`amdgpu-address-spaces`).
2943 OpenCL Memory Model [OpenCL]_
2944 The OpenCL memory model which has separate happens-before relations for the
2945 global and local address spaces. Only a fence specifying both global and
2946 local address space, and seq_cst instructions join the relationships. Since
2947 the LLVM ``memfence`` instruction does not allow an address space to be
2948 specified the OpenCL fence has to convervatively assume both local and
2949 global address space was specified. However, optimizations can often be
2950 done to eliminate the additional ``s_waitcnt`` instructions when there are
2951 no intervening memory instructions which access the corresponding address
2952 space. The code sequences in the table indicate what can be omitted for the
2953 OpenCL memory. The target triple environment is used to determine if the
2954 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye69865532017-06-06 20:31:59 +00002955
2956``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2957operations.
2958
2959``buffer/global/flat_load/store/atomic`` instructions to global memory are
2960termed vector memory operations.
2961
2962For GFX6-GFX9:
2963
2964* Each agent has multiple compute units (CU).
2965* Each CU has multiple SIMDs that execute wavefronts.
2966* The wavefronts for a single work-group are executed in the same CU but may be
2967 executed by different SIMDs.
2968* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2969 executing on it.
2970* All LDS operations of a CU are performed as wavefront wide operations in a
2971 global order and involve no caching. Completion is reported to a wavefront in
2972 execution order.
2973* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye636e2232018-03-08 05:46:01 +00002974 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye69865532017-06-06 20:31:59 +00002975 can be reordered relative to each other, which can result in reordering the
2976 visibility of vector memory operations with respect to LDS operations of other
2977 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue9d67e42017-06-26 02:45:39 +00002978 ensure synchronization between LDS operations and vector memory operations
Tony Tye636e2232018-03-08 05:46:01 +00002979 between wavefronts of a work-group, but not between operations performed by the
Tony Tye69865532017-06-06 20:31:59 +00002980 same wavefront.
2981* The vector memory operations are performed as wavefront wide operations and
2982 completion is reported to a wavefront in execution order. The exception is
Tony Tye9000e8c2017-11-10 01:00:54 +00002983 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye69865532017-06-06 20:31:59 +00002984 vector memory order if they access LDS memory, and out of LDS operation order
2985 if they access global memory.
Tony Tye0a092202017-10-18 22:16:55 +00002986* The vector memory operations access a single vector L1 cache shared by all
2987 SIMDs a CU. Therefore, no special action is required for coherence between the
2988 lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye636e2232018-03-08 05:46:01 +00002989 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye0a092202017-10-18 22:16:55 +00002990 executing in different work-groups as they may be executing on different CUs.
Tony Tye69865532017-06-06 20:31:59 +00002991* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2992 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2993 scalar operations are used in a restricted way so do not impact the memory
2994 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2995* The vector and scalar memory operations use an L2 cache shared by all CUs on
2996 the same agent.
2997* The L2 cache has independent channels to service disjoint ranges of virtual
2998 addresses.
2999* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye636e2232018-03-08 05:46:01 +00003000 scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye69865532017-06-06 20:31:59 +00003001 (which may be executing on different CUs) of an agent can be reordered
3002 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue9d67e42017-06-26 02:45:39 +00003003 synchronization between vector memory operations of different CUs. It ensures a
Tony Tye69865532017-06-06 20:31:59 +00003004 previous vector memory operation has completed before executing a subsequent
3005 vector memory or LDS operation and so can be used to meet the requirements of
3006 acquire and release.
3007* The L2 cache can be kept coherent with other agents on some targets, or ranges
3008 of virtual addresses can be set up to bypass it to ensure system coherence.
3009
Tony Tye9000e8c2017-11-10 01:00:54 +00003010Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye69865532017-06-06 20:31:59 +00003011or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
3012memory, atomic memory orderings are not meaningful and all accesses are treated
3013as non-atomic.
3014
3015Constant address space uses ``buffer/global_load`` instructions (or equivalent
3016scalar memory instructions). Since the constant address space contents do not
3017change during the execution of a kernel dispatch it is not legal to perform
3018stores, and atomic memory orderings are not meaningful and all access are
3019treated as non-atomic.
3020
3021A memory synchronization scope wider than work-group is not meaningful for the
3022group (LDS) address space and is treated as work-group.
3023
3024The memory model does not support the region address space which is treated as
3025non-atomic.
3026
3027Acquire memory ordering is not meaningful on store atomic instructions and is
3028treated as non-atomic.
3029
3030Release memory ordering is not meaningful on load atomic instructions and is
3031treated a non-atomic.
3032
3033Acquire-release memory ordering is not meaningful on load or store atomic
3034instructions and is treated as acquire and release respectively.
3035
3036AMDGPU backend only uses scalar memory operations to access memory that is
3037proven to not change during the execution of the kernel dispatch. This includes
3038constant address space and global address space for program scope const
3039variables. Therefore the kernel machine code does not have to maintain the
3040scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
3041and vector L1 caches are invalidated between kernel dispatches by CP since
3042constant address space data may change between kernel dispatch executions. See
3043:ref:`amdgpu-amdhsa-memory-spaces`.
3044
Sylvestre Ledrue9d67e42017-06-26 02:45:39 +00003045The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye69865532017-06-06 20:31:59 +00003046case the AMDGPU backend ensures the memory location used to spill is never
3047accessed by vector memory operations at the same time. If scalar writes are used
3048then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
3049return since the locations may be used for vector memory instructions by a
Tony Tye636e2232018-03-08 05:46:01 +00003050future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye69865532017-06-06 20:31:59 +00003051frame at the same address, respectively. There is no need for a ``s_dcache_inv``
3052as all scalar writes are write-before-read in the same thread.
3053
Tony Tye0a092202017-10-18 22:16:55 +00003054Scratch backing memory (which is used for the private address space)
3055is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
3056address space is only accessed by a single thread, and is always
3057write-before-read, there is never a need to invalidate these entries from the L1
3058cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
3059volatile cache lines.
Tony Tye69865532017-06-06 20:31:59 +00003060
3061On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye0a092202017-10-18 22:16:55 +00003062to invalidate the L2 cache. This also causes it to be treated as
3063non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
3064(cache coherent) and so the L2 cache will coherent with the CPU and other
3065agents.
Tony Tye69865532017-06-06 20:31:59 +00003066
3067 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
3068 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
3069
Tony Tye0a092202017-10-18 22:16:55 +00003070 ============ ============ ============== ========== ===============================
Tony Tye69865532017-06-06 20:31:59 +00003071 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
3072 Ordering Sync Scope Address
3073 Space
Tony Tye0a092202017-10-18 22:16:55 +00003074 ============ ============ ============== ========== ===============================
Tony Tye69865532017-06-06 20:31:59 +00003075 **Non-Atomic**
Tony Tye0a092202017-10-18 22:16:55 +00003076 -----------------------------------------------------------------------------------
3077 load *none* *none* - global - !volatile & !nontemporal
3078 - generic
3079 - private 1. buffer/global/flat_load
3080 - constant
3081 - volatile & !nontemporal
3082
Tony Tye69865532017-06-06 20:31:59 +00003083 1. buffer/global/flat_load
3084 glc=1
Tony Tye0a092202017-10-18 22:16:55 +00003085
3086 - nontemporal
3087
3088 1. buffer/global/flat_load
3089 glc=1 slc=1
3090
Tony Tye69865532017-06-06 20:31:59 +00003091 load *none* *none* - local 1. ds_load
Tony Tye0a092202017-10-18 22:16:55 +00003092 store *none* *none* - global - !nontemporal
Tony Tye69865532017-06-06 20:31:59 +00003093 - generic
Tony Tye0a092202017-10-18 22:16:55 +00003094 - private 1. buffer/global/flat_store
3095 - constant
3096 - nontemporal
3097
3098 1. buffer/global/flat_stote
3099 glc=1 slc=1
3100
Tony Tye69865532017-06-06 20:31:59 +00003101 store *none* *none* - local 1. ds_store
3102 **Unordered Atomic**
Tony Tye0a092202017-10-18 22:16:55 +00003103 -----------------------------------------------------------------------------------
Tony Tye69865532017-06-06 20:31:59 +00003104 load atomic unordered *any* *any* *Same as non-atomic*.
3105 store atomic unordered *any* *any* *Same as non-atomic*.
3106 atomicrmw unordered *any* *any* *Same as monotonic
3107 atomic*.
3108 **Monotonic Atomic**
Tony Tye0a092202017-10-18 22:16:55 +00003109 -----------------------------------------------------------------------------------
Tony Tye69865532017-06-06 20:31:59 +00003110 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
3111 - wavefront - generic
3112 - workgroup
3113 load atomic monotonic - singlethread - local 1. ds_load
3114 - wavefront
3115 - workgroup
3116 load atomic monotonic - agent - global 1. buffer/global/flat_load
3117 - system - generic glc=1
3118 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
3119 - wavefront - generic
3120 - workgroup
3121 - agent
3122 - system
3123 store atomic monotonic - singlethread - local 1. ds_store
3124 - wavefront
3125 - workgroup
3126 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
3127 - wavefront - generic
3128 - workgroup
3129 - agent
3130 - system
3131 atomicrmw monotonic - singlethread - local 1. ds_atomic
3132 - wavefront
3133 - workgroup
3134 **Acquire Atomic**
Tony Tye0a092202017-10-18 22:16:55 +00003135 -----------------------------------------------------------------------------------
Tony Tye69865532017-06-06 20:31:59 +00003136 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
3137 - wavefront - local
3138 - generic
Tony Tye0a092202017-10-18 22:16:55 +00003139 load atomic acquire - workgroup - global 1. buffer/global/flat_load
3140 load atomic acquire - workgroup - local 1. ds_load
3141 2. s_waitcnt lgkmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003142
Tony Tye0a092202017-10-18 22:16:55 +00003143 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003144 - Must happen before
3145 any following
3146 global/generic
3147 load/load
3148 atomic/store/store
3149 atomic/atomicrmw.
3150 - Ensures any
3151 following global
3152 data read is no
3153 older than the load
3154 atomic value being
3155 acquired.
Tony Tye0a092202017-10-18 22:16:55 +00003156 load atomic acquire - workgroup - generic 1. flat_load
3157 2. s_waitcnt lgkmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003158
Tony Tye0a092202017-10-18 22:16:55 +00003159 - If OpenCL, omit.
3160 - Must happen before
3161 any following
3162 global/generic
3163 load/load
3164 atomic/store/store
3165 atomic/atomicrmw.
3166 - Ensures any
3167 following global
3168 data read is no
3169 older than the load
3170 atomic value being
3171 acquired.
3172 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye69865532017-06-06 20:31:59 +00003173 - system glc=1
3174 2. s_waitcnt vmcnt(0)
3175
3176 - Must happen before
3177 following
3178 buffer_wbinvl1_vol.
3179 - Ensures the load
3180 has completed
3181 before invalidating
3182 the cache.
3183
3184 3. buffer_wbinvl1_vol
3185
3186 - Must happen before
3187 any following
3188 global/generic
3189 load/load
3190 atomic/atomicrmw.
3191 - Ensures that
3192 following
3193 loads will not see
3194 stale global data.
3195
3196 load atomic acquire - agent - generic 1. flat_load glc=1
3197 - system 2. s_waitcnt vmcnt(0) &
3198 lgkmcnt(0)
3199
3200 - If OpenCL omit
3201 lgkmcnt(0).
3202 - Must happen before
3203 following
3204 buffer_wbinvl1_vol.
3205 - Ensures the flat_load
3206 has completed
3207 before invalidating
3208 the cache.
3209
3210 3. buffer_wbinvl1_vol
3211
3212 - Must happen before
3213 any following
3214 global/generic
3215 load/load
3216 atomic/atomicrmw.
3217 - Ensures that
3218 following loads
3219 will not see stale
3220 global data.
3221
3222 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
3223 - wavefront - local
3224 - generic
Tony Tye0a092202017-10-18 22:16:55 +00003225 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
3226 atomicrmw acquire - workgroup - local 1. ds_atomic
3227 2. waitcnt lgkmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003228
Tony Tye0a092202017-10-18 22:16:55 +00003229 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003230 - Must happen before
3231 any following
3232 global/generic
3233 load/load
3234 atomic/store/store
3235 atomic/atomicrmw.
3236 - Ensures any
3237 following global
3238 data read is no
3239 older than the
3240 atomicrmw value
3241 being acquired.
3242
Tony Tye0a092202017-10-18 22:16:55 +00003243 atomicrmw acquire - workgroup - generic 1. flat_atomic
3244 2. waitcnt lgkmcnt(0)
3245
3246 - If OpenCL, omit.
3247 - Must happen before
3248 any following
3249 global/generic
3250 load/load
3251 atomic/store/store
3252 atomic/atomicrmw.
3253 - Ensures any
3254 following global
3255 data read is no
3256 older than the
3257 atomicrmw value
3258 being acquired.
3259
3260 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye69865532017-06-06 20:31:59 +00003261 - system 2. s_waitcnt vmcnt(0)
3262
3263 - Must happen before
3264 following
3265 buffer_wbinvl1_vol.
3266 - Ensures the
3267 atomicrmw has
3268 completed before
3269 invalidating the
3270 cache.
3271
3272 3. buffer_wbinvl1_vol
3273
3274 - Must happen before
3275 any following
3276 global/generic
3277 load/load
3278 atomic/atomicrmw.
3279 - Ensures that
3280 following loads
3281 will not see stale
3282 global data.
3283
3284 atomicrmw acquire - agent - generic 1. flat_atomic
3285 - system 2. s_waitcnt vmcnt(0) &
3286 lgkmcnt(0)
3287
3288 - If OpenCL, omit
3289 lgkmcnt(0).
3290 - Must happen before
3291 following
3292 buffer_wbinvl1_vol.
3293 - Ensures the
3294 atomicrmw has
3295 completed before
3296 invalidating the
3297 cache.
3298
3299 3. buffer_wbinvl1_vol
3300
3301 - Must happen before
3302 any following
3303 global/generic
3304 load/load
3305 atomic/atomicrmw.
3306 - Ensures that
3307 following loads
3308 will not see stale
3309 global data.
3310
3311 fence acquire - singlethread *none* *none*
3312 - wavefront
3313 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3314
3315 - If OpenCL and
3316 address space is
Tony Tye0a092202017-10-18 22:16:55 +00003317 not generic, omit.
3318 - However, since LLVM
Tony Tye69865532017-06-06 20:31:59 +00003319 currently has no
3320 address space on
3321 the fence need to
3322 conservatively
3323 always generate. If
3324 fence had an
3325 address space then
3326 set to address
3327 space of OpenCL
3328 fence flag, or to
3329 generic if both
3330 local and global
3331 flags are
3332 specified.
3333 - Must happen after
3334 any preceding
3335 local/generic load
3336 atomic/atomicrmw
3337 with an equal or
3338 wider sync scope
3339 and memory ordering
3340 stronger than
3341 unordered (this is
3342 termed the
3343 fence-paired-atomic).
3344 - Must happen before
3345 any following
3346 global/generic
3347 load/load
3348 atomic/store/store
3349 atomic/atomicrmw.
3350 - Ensures any
3351 following global
3352 data read is no
3353 older than the
3354 value read by the
3355 fence-paired-atomic.
3356
Tony Tye0a092202017-10-18 22:16:55 +00003357 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
3358 - system vmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003359
3360 - If OpenCL and
3361 address space is
3362 not generic, omit
3363 lgkmcnt(0).
Tony Tye0a092202017-10-18 22:16:55 +00003364 - However, since LLVM
Tony Tye69865532017-06-06 20:31:59 +00003365 currently has no
3366 address space on
3367 the fence need to
3368 conservatively
3369 always generate
3370 (see comment for
3371 previous fence).
Tony Tye61c7e032017-06-07 00:08:35 +00003372 - Could be split into
Tony Tye69865532017-06-06 20:31:59 +00003373 separate s_waitcnt
3374 vmcnt(0) and
3375 s_waitcnt
3376 lgkmcnt(0) to allow
3377 them to be
3378 independently moved
3379 according to the
3380 following rules.
3381 - s_waitcnt vmcnt(0)
3382 must happen after
3383 any preceding
3384 global/generic load
3385 atomic/atomicrmw
3386 with an equal or
3387 wider sync scope
3388 and memory ordering
3389 stronger than
3390 unordered (this is
3391 termed the
3392 fence-paired-atomic).
3393 - s_waitcnt lgkmcnt(0)
3394 must happen after
3395 any preceding
Tony Tye0a092202017-10-18 22:16:55 +00003396 local/generic load
Tony Tye69865532017-06-06 20:31:59 +00003397 atomic/atomicrmw
3398 with an equal or
3399 wider sync scope
3400 and memory ordering
3401 stronger than
3402 unordered (this is
3403 termed the
3404 fence-paired-atomic).
3405 - Must happen before
3406 the following
3407 buffer_wbinvl1_vol.
3408 - Ensures that the
3409 fence-paired atomic
3410 has completed
3411 before invalidating
3412 the
3413 cache. Therefore
3414 any following
3415 locations read must
3416 be no older than
3417 the value read by
3418 the
3419 fence-paired-atomic.
3420
3421 2. buffer_wbinvl1_vol
3422
Tony Tye0a092202017-10-18 22:16:55 +00003423 - Must happen before any
3424 following global/generic
Tony Tye69865532017-06-06 20:31:59 +00003425 load/load
3426 atomic/store/store
3427 atomic/atomicrmw.
3428 - Ensures that
3429 following loads
3430 will not see stale
3431 global data.
3432
3433 **Release Atomic**
Tony Tye0a092202017-10-18 22:16:55 +00003434 -----------------------------------------------------------------------------------
Tony Tye69865532017-06-06 20:31:59 +00003435 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
3436 - wavefront - local
3437 - generic
3438 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye0a092202017-10-18 22:16:55 +00003439
3440 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003441 - Must happen after
3442 any preceding
3443 local/generic
3444 load/store/load
3445 atomic/store
3446 atomic/atomicrmw.
3447 - Must happen before
3448 the following
3449 store.
3450 - Ensures that all
3451 memory operations
3452 to local have
3453 completed before
3454 performing the
3455 store that is being
3456 released.
3457
3458 2. buffer/global/flat_store
3459 store atomic release - workgroup - local 1. ds_store
Tony Tye0a092202017-10-18 22:16:55 +00003460 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3461
3462 - If OpenCL, omit.
3463 - Must happen after
3464 any preceding
3465 local/generic
3466 load/store/load
3467 atomic/store
3468 atomic/atomicrmw.
3469 - Must happen before
3470 the following
3471 store.
3472 - Ensures that all
3473 memory operations
3474 to local have
3475 completed before
3476 performing the
3477 store that is being
3478 released.
3479
3480 2. flat_store
3481 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
3482 - system - generic vmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003483
3484 - If OpenCL, omit
3485 lgkmcnt(0).
3486 - Could be split into
3487 separate s_waitcnt
3488 vmcnt(0) and
3489 s_waitcnt
3490 lgkmcnt(0) to allow
3491 them to be
3492 independently moved
3493 according to the
3494 following rules.
3495 - s_waitcnt vmcnt(0)
3496 must happen after
3497 any preceding
3498 global/generic
3499 load/store/load
3500 atomic/store
3501 atomic/atomicrmw.
3502 - s_waitcnt lgkmcnt(0)
3503 must happen after
3504 any preceding
3505 local/generic
3506 load/store/load
3507 atomic/store
3508 atomic/atomicrmw.
3509 - Must happen before
3510 the following
3511 store.
3512 - Ensures that all
3513 memory operations
Tony Tye0a092202017-10-18 22:16:55 +00003514 to memory have
Tony Tye69865532017-06-06 20:31:59 +00003515 completed before
3516 performing the
3517 store that is being
3518 released.
3519
3520 2. buffer/global/ds/flat_store
3521 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
3522 - wavefront - local
3523 - generic
3524 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye0a092202017-10-18 22:16:55 +00003525
3526 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003527 - Must happen after
3528 any preceding
3529 local/generic
3530 load/store/load
3531 atomic/store
3532 atomic/atomicrmw.
3533 - Must happen before
3534 the following
3535 atomicrmw.
3536 - Ensures that all
3537 memory operations
3538 to local have
3539 completed before
3540 performing the
3541 atomicrmw that is
3542 being released.
3543
3544 2. buffer/global/flat_atomic
3545 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye0a092202017-10-18 22:16:55 +00003546 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3547
3548 - If OpenCL, omit.
3549 - Must happen after
3550 any preceding
3551 local/generic
3552 load/store/load
3553 atomic/store
3554 atomic/atomicrmw.
3555 - Must happen before
3556 the following
3557 atomicrmw.
3558 - Ensures that all
3559 memory operations
3560 to local have
3561 completed before
3562 performing the
3563 atomicrmw that is
3564 being released.
3565
3566 2. flat_atomic
3567 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
3568 - system - generic vmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003569
3570 - If OpenCL, omit
3571 lgkmcnt(0).
3572 - Could be split into
3573 separate s_waitcnt
3574 vmcnt(0) and
3575 s_waitcnt
3576 lgkmcnt(0) to allow
3577 them to be
3578 independently moved
3579 according to the
3580 following rules.
3581 - s_waitcnt vmcnt(0)
3582 must happen after
3583 any preceding
3584 global/generic
3585 load/store/load
3586 atomic/store
3587 atomic/atomicrmw.
3588 - s_waitcnt lgkmcnt(0)
3589 must happen after
3590 any preceding
3591 local/generic
3592 load/store/load
3593 atomic/store
3594 atomic/atomicrmw.
3595 - Must happen before
3596 the following
3597 atomicrmw.
3598 - Ensures that all
3599 memory operations
3600 to global and local
3601 have completed
3602 before performing
3603 the atomicrmw that
3604 is being released.
3605
Tony Tye0a092202017-10-18 22:16:55 +00003606 2. buffer/global/ds/flat_atomic
Tony Tye69865532017-06-06 20:31:59 +00003607 fence release - singlethread *none* *none*
3608 - wavefront
3609 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3610
3611 - If OpenCL and
3612 address space is
Tony Tye0a092202017-10-18 22:16:55 +00003613 not generic, omit.
3614 - However, since LLVM
Tony Tye69865532017-06-06 20:31:59 +00003615 currently has no
3616 address space on
3617 the fence need to
3618 conservatively
Tony Tye0a092202017-10-18 22:16:55 +00003619 always generate. If
3620 fence had an
3621 address space then
3622 set to address
3623 space of OpenCL
3624 fence flag, or to
3625 generic if both
3626 local and global
3627 flags are
3628 specified.
Tony Tye69865532017-06-06 20:31:59 +00003629 - Must happen after
3630 any preceding
3631 local/generic
3632 load/load
3633 atomic/store/store
3634 atomic/atomicrmw.
3635 - Must happen before
3636 any following store
3637 atomic/atomicrmw
3638 with an equal or
3639 wider sync scope
3640 and memory ordering
3641 stronger than
3642 unordered (this is
3643 termed the
3644 fence-paired-atomic).
3645 - Ensures that all
3646 memory operations
3647 to local have
3648 completed before
3649 performing the
3650 following
3651 fence-paired-atomic.
3652
Tony Tye0a092202017-10-18 22:16:55 +00003653 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3654 - system vmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003655
3656 - If OpenCL and
3657 address space is
3658 not generic, omit
3659 lgkmcnt(0).
Tony Tye0a092202017-10-18 22:16:55 +00003660 - If OpenCL and
3661 address space is
3662 local, omit
3663 vmcnt(0).
3664 - However, since LLVM
Tony Tye69865532017-06-06 20:31:59 +00003665 currently has no
3666 address space on
3667 the fence need to
3668 conservatively
Tony Tye0a092202017-10-18 22:16:55 +00003669 always generate. If
3670 fence had an
3671 address space then
3672 set to address
3673 space of OpenCL
3674 fence flag, or to
3675 generic if both
3676 local and global
3677 flags are
3678 specified.
Tony Tye69865532017-06-06 20:31:59 +00003679 - Could be split into
3680 separate s_waitcnt
3681 vmcnt(0) and
3682 s_waitcnt
3683 lgkmcnt(0) to allow
3684 them to be
3685 independently moved
3686 according to the
3687 following rules.
3688 - s_waitcnt vmcnt(0)
3689 must happen after
3690 any preceding
3691 global/generic
3692 load/store/load
3693 atomic/store
3694 atomic/atomicrmw.
3695 - s_waitcnt lgkmcnt(0)
3696 must happen after
3697 any preceding
3698 local/generic
3699 load/store/load
3700 atomic/store
3701 atomic/atomicrmw.
3702 - Must happen before
3703 any following store
3704 atomic/atomicrmw
3705 with an equal or
3706 wider sync scope
3707 and memory ordering
3708 stronger than
3709 unordered (this is
3710 termed the
3711 fence-paired-atomic).
3712 - Ensures that all
3713 memory operations
Tony Tye0a092202017-10-18 22:16:55 +00003714 have
Tony Tye69865532017-06-06 20:31:59 +00003715 completed before
3716 performing the
3717 following
3718 fence-paired-atomic.
3719
3720 **Acquire-Release Atomic**
Tony Tye0a092202017-10-18 22:16:55 +00003721 -----------------------------------------------------------------------------------
Tony Tye69865532017-06-06 20:31:59 +00003722 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3723 - wavefront - local
3724 - generic
3725 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3726
Tony Tye0a092202017-10-18 22:16:55 +00003727 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003728 - Must happen after
3729 any preceding
3730 local/generic
3731 load/store/load
3732 atomic/store
3733 atomic/atomicrmw.
3734 - Must happen before
3735 the following
3736 atomicrmw.
3737 - Ensures that all
3738 memory operations
3739 to local have
3740 completed before
3741 performing the
3742 atomicrmw that is
3743 being released.
3744
Tony Tye0a092202017-10-18 22:16:55 +00003745 2. buffer/global/flat_atomic
Tony Tye69865532017-06-06 20:31:59 +00003746 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3747 2. s_waitcnt lgkmcnt(0)
3748
Tony Tye0a092202017-10-18 22:16:55 +00003749 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003750 - Must happen before
3751 any following
3752 global/generic
3753 load/load
3754 atomic/store/store
3755 atomic/atomicrmw.
3756 - Ensures any
3757 following global
3758 data read is no
3759 older than the load
3760 atomic value being
3761 acquired.
3762
3763 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3764
Tony Tye0a092202017-10-18 22:16:55 +00003765 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003766 - Must happen after
3767 any preceding
3768 local/generic
3769 load/store/load
3770 atomic/store
3771 atomic/atomicrmw.
3772 - Must happen before
3773 the following
3774 atomicrmw.
3775 - Ensures that all
3776 memory operations
3777 to local have
3778 completed before
3779 performing the
3780 atomicrmw that is
3781 being released.
3782
3783 2. flat_atomic
3784 3. s_waitcnt lgkmcnt(0)
3785
Tony Tye0a092202017-10-18 22:16:55 +00003786 - If OpenCL, omit.
Tony Tye69865532017-06-06 20:31:59 +00003787 - Must happen before
3788 any following
3789 global/generic
3790 load/load
3791 atomic/store/store
3792 atomic/atomicrmw.
3793 - Ensures any
3794 following global
3795 data read is no
3796 older than the load
3797 atomic value being
3798 acquired.
Tony Tye0a092202017-10-18 22:16:55 +00003799
3800 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3801 - system vmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003802
3803 - If OpenCL, omit
3804 lgkmcnt(0).
3805 - Could be split into
3806 separate s_waitcnt
3807 vmcnt(0) and
3808 s_waitcnt
3809 lgkmcnt(0) to allow
3810 them to be
3811 independently moved
3812 according to the
3813 following rules.
3814 - s_waitcnt vmcnt(0)
3815 must happen after
3816 any preceding
3817 global/generic
3818 load/store/load
3819 atomic/store
3820 atomic/atomicrmw.
3821 - s_waitcnt lgkmcnt(0)
3822 must happen after
3823 any preceding
3824 local/generic
3825 load/store/load
3826 atomic/store
3827 atomic/atomicrmw.
3828 - Must happen before
3829 the following
3830 atomicrmw.
3831 - Ensures that all
3832 memory operations
3833 to global have
3834 completed before
3835 performing the
3836 atomicrmw that is
3837 being released.
3838
Tony Tye0a092202017-10-18 22:16:55 +00003839 2. buffer/global/flat_atomic
Tony Tye69865532017-06-06 20:31:59 +00003840 3. s_waitcnt vmcnt(0)
3841
3842 - Must happen before
3843 following
3844 buffer_wbinvl1_vol.
3845 - Ensures the
3846 atomicrmw has
3847 completed before
3848 invalidating the
3849 cache.
3850
3851 4. buffer_wbinvl1_vol
3852
3853 - Must happen before
3854 any following
3855 global/generic
3856 load/load
3857 atomic/atomicrmw.
3858 - Ensures that
3859 following loads
3860 will not see stale
3861 global data.
3862
Tony Tye0a092202017-10-18 22:16:55 +00003863 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3864 - system vmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00003865
3866 - If OpenCL, omit
3867 lgkmcnt(0).
3868 - Could be split into
3869 separate s_waitcnt
3870 vmcnt(0) and
3871 s_waitcnt
3872 lgkmcnt(0) to allow
3873 them to be
3874 independently moved
3875 according to the
3876 following rules.
3877 - s_waitcnt vmcnt(0)
3878 must happen after
3879 any preceding
3880 global/generic
3881 load/store/load
3882 atomic/store
3883 atomic/atomicrmw.
3884 - s_waitcnt lgkmcnt(0)
3885 must happen after
3886 any preceding
3887 local/generic
3888 load/store/load
3889 atomic/store
3890 atomic/atomicrmw.
3891 - Must happen before
3892 the following
3893 atomicrmw.
3894 - Ensures that all
3895 memory operations
3896 to global have
3897 completed before
3898 performing the
3899 atomicrmw that is
3900 being released.
3901
3902 2. flat_atomic
3903 3. s_waitcnt vmcnt(0) &
3904 lgkmcnt(0)
3905
3906 - If OpenCL, omit
3907 lgkmcnt(0).
3908 - Must happen before
3909 following
3910 buffer_wbinvl1_vol.
3911 - Ensures the
3912 atomicrmw has
3913 completed before
3914 invalidating the
3915 cache.
3916
3917 4. buffer_wbinvl1_vol
3918
3919 - Must happen before
3920 any following
3921 global/generic
3922 load/load
3923 atomic/atomicrmw.
3924 - Ensures that
3925 following loads
3926 will not see stale
3927 global data.
3928
3929 fence acq_rel - singlethread *none* *none*
3930 - wavefront
3931 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3932
3933 - If OpenCL and
3934 address space is
Tony Tye0a092202017-10-18 22:16:55 +00003935 not generic, omit.
3936 - However,
Tony Tye69865532017-06-06 20:31:59 +00003937 since LLVM
3938 currently has no
3939 address space on
3940 the fence need to
3941 conservatively
3942 always generate
3943 (see comment for
3944 previous fence).
3945 - Must happen after
3946 any preceding
3947 local/generic
3948 load/load
3949 atomic/store/store
3950 atomic/atomicrmw.
3951 - Must happen before
3952 any following
3953 global/generic
3954 load/load
3955 atomic/store/store
3956 atomic/atomicrmw.
3957 - Ensures that all
3958 memory operations
3959 to local have
3960 completed before
3961 performing any
3962 following global
3963 memory operations.
3964 - Ensures that the
3965 preceding
3966 local/generic load
3967 atomic/atomicrmw
3968 with an equal or
3969 wider sync scope
3970 and memory ordering
3971 stronger than
3972 unordered (this is
3973 termed the
Tony Tye0a092202017-10-18 22:16:55 +00003974 acquire-fence-paired-atomic
3975 ) has completed
Tony Tye69865532017-06-06 20:31:59 +00003976 before following
3977 global memory
3978 operations. This
3979 satisfies the
3980 requirements of
3981 acquire.
3982 - Ensures that all
3983 previous memory
3984 operations have
3985 completed before a
3986 following
3987 local/generic store
3988 atomic/atomicrmw
3989 with an equal or
3990 wider sync scope
3991 and memory ordering
3992 stronger than
3993 unordered (this is
3994 termed the
Tony Tye0a092202017-10-18 22:16:55 +00003995 release-fence-paired-atomic
3996 ). This satisfies the
Tony Tye69865532017-06-06 20:31:59 +00003997 requirements of
3998 release.
3999
Tony Tye0a092202017-10-18 22:16:55 +00004000 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
4001 - system vmcnt(0)
Tony Tye69865532017-06-06 20:31:59 +00004002
4003 - If OpenCL and
4004 address space is
4005 not generic, omit
4006 lgkmcnt(0).
Tony Tye0a092202017-10-18 22:16:55 +00004007 - However, since LLVM
Tony Tye69865532017-06-06 20:31:59 +00004008 currently has no
4009 address space on
4010 the fence need to
4011 conservatively
4012 always generate
4013 (see comment for
4014 previous fence).
4015 - Could be split into
4016 separate s_waitcnt
4017 vmcnt(0) and
4018 s_waitcnt
4019 lgkmcnt(0) to allow
4020 them to be
4021 independently moved
4022 according to the
4023 following rules.
4024 - s_waitcnt vmcnt(0)
4025 must happen after
4026 any preceding
4027 global/generic
4028 load/store/load
4029 atomic/store
4030 atomic/atomicrmw.
4031 - s_waitcnt lgkmcnt(0)
4032 must happen after
4033 any preceding
4034 local/generic
4035 load/store/load
4036 atomic/store
4037 atomic/atomicrmw.
4038 - Must happen before
4039 the following
4040 buffer_wbinvl1_vol.
4041 - Ensures that the
4042 preceding
4043 global/local/generic
4044 load
4045 atomic/atomicrmw
4046 with an equal or
4047 wider sync scope
4048 and memory ordering
4049 stronger than
4050 unordered (this is
4051 termed the
Tony Tye0a092202017-10-18 22:16:55 +00004052 acquire-fence-paired-atomic
4053 ) has completed
Tony Tye69865532017-06-06 20:31:59 +00004054 before invalidating
4055 the cache. This
4056 satisfies the
4057 requirements of
4058 acquire.
4059 - Ensures that all
4060 previous memory
4061 operations have
4062 completed before a
4063 following
4064 global/local/generic
4065 store
4066 atomic/atomicrmw
4067 with an equal or
4068 wider sync scope
4069 and memory ordering
4070 stronger than
4071 unordered (this is
4072 termed the
Tony Tye0a092202017-10-18 22:16:55 +00004073 release-fence-paired-atomic
4074 ). This satisfies the
Tony Tye69865532017-06-06 20:31:59 +00004075 requirements of
4076 release.
4077
4078 2. buffer_wbinvl1_vol
4079
4080 - Must happen before
4081 any following
4082 global/generic
4083 load/load
4084 atomic/store/store
4085 atomic/atomicrmw.
4086 - Ensures that
4087 following loads
4088 will not see stale
4089 global data. This
4090 satisfies the
4091 requirements of
4092 acquire.
4093
4094 **Sequential Consistent Atomic**
Tony Tye0a092202017-10-18 22:16:55 +00004095 -----------------------------------------------------------------------------------
Tony Tye69865532017-06-06 20:31:59 +00004096 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye0a092202017-10-18 22:16:55 +00004097 - wavefront - local load atomic acquire,
4098 - generic except must generated
4099 all instructions even
4100 for OpenCL.*
4101 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
4102 - generic
4103 - Must
4104 happen after
4105 preceding
4106 global/generic load
4107 atomic/store
4108 atomic/atomicrmw
4109 with memory
4110 ordering of seq_cst
4111 and with equal or
4112 wider sync scope.
4113 (Note that seq_cst
4114 fences have their
4115 own s_waitcnt
4116 lgkmcnt(0) and so do
4117 not need to be
4118 considered.)
4119 - Ensures any
4120 preceding
4121 sequential
4122 consistent local
4123 memory instructions
4124 have completed
4125 before executing
4126 this sequentially
4127 consistent
4128 instruction. This
4129 prevents reordering
4130 a seq_cst store
4131 followed by a
4132 seq_cst load. (Note
4133 that seq_cst is
4134 stronger than
4135 acquire/release as
4136 the reordering of
4137 load acquire
4138 followed by a store
4139 release is
4140 prevented by the
4141 waitcnt of
4142 the release, but
4143 there is nothing
4144 preventing a store
4145 release followed by
4146 load acquire from
4147 competing out of
4148 order.)
4149
4150 2. *Following
4151 instructions same as
4152 corresponding load
4153 atomic acquire,
4154 except must generated
4155 all instructions even
4156 for OpenCL.*
4157 load atomic seq_cst - workgroup - local *Same as corresponding
4158 load atomic acquire,
4159 except must generated
4160 all instructions even
4161 for OpenCL.*
4162 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
4163 - system - generic vmcnt(0)
4164
4165 - Could be split into
4166 separate s_waitcnt
4167 vmcnt(0)
4168 and s_waitcnt
4169 lgkmcnt(0) to allow
4170 them to be
4171 independently moved
4172 according to the
4173 following rules.
4174 - waitcnt lgkmcnt(0)
4175 must happen after
4176 preceding
4177 global/generic load
4178 atomic/store
4179 atomic/atomicrmw
4180 with memory
4181 ordering of seq_cst
4182 and with equal or
4183 wider sync scope.
4184 (Note that seq_cst
4185 fences have their
4186 own s_waitcnt
4187 lgkmcnt(0) and so do
4188 not need to be
4189 considered.)
4190 - waitcnt vmcnt(0)
4191 must happen after
Tony Tye69865532017-06-06 20:31:59 +00004192 preceding
4193 global/generic load
4194 atomic/store
4195 atomic/atomicrmw
4196 with memory
4197 ordering of seq_cst
4198 and with equal or
4199 wider sync scope.
4200 (Note that seq_cst
4201 fences have their
4202 own s_waitcnt
4203 vmcnt(0) and so do
4204 not need to be
4205 considered.)
4206 - Ensures any
4207 preceding
4208 sequential
4209 consistent global
4210 memory instructions
4211 have completed
4212 before executing
4213 this sequentially
4214 consistent
4215 instruction. This
4216 prevents reordering
4217 a seq_cst store
4218 followed by a
Tony Tye0a092202017-10-18 22:16:55 +00004219 seq_cst load. (Note
Tony Tye69865532017-06-06 20:31:59 +00004220 that seq_cst is
4221 stronger than
4222 acquire/release as
4223 the reordering of
4224 load acquire
4225 followed by a store
4226 release is
4227 prevented by the
Tony Tye0a092202017-10-18 22:16:55 +00004228 waitcnt of
Tony Tye69865532017-06-06 20:31:59 +00004229 the release, but
4230 there is nothing
4231 preventing a store
4232 release followed by
4233 load acquire from
4234 competing out of
4235 order.)
4236
4237 2. *Following
4238 instructions same as
4239 corresponding load
Tony Tye0a092202017-10-18 22:16:55 +00004240 atomic acquire,
4241 except must generated
4242 all instructions even
4243 for OpenCL.*
Tony Tye69865532017-06-06 20:31:59 +00004244 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye0a092202017-10-18 22:16:55 +00004245 - wavefront - local store atomic release,
4246 - workgroup - generic except must generated
4247 all instructions even
4248 for OpenCL.*
Tony Tye69865532017-06-06 20:31:59 +00004249 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye0a092202017-10-18 22:16:55 +00004250 - system - generic store atomic release,
4251 except must generated
4252 all instructions even
4253 for OpenCL.*
Tony Tye69865532017-06-06 20:31:59 +00004254 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye0a092202017-10-18 22:16:55 +00004255 - wavefront - local atomicrmw acq_rel,
4256 - workgroup - generic except must generated
4257 all instructions even
4258 for OpenCL.*
Tony Tye69865532017-06-06 20:31:59 +00004259 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye0a092202017-10-18 22:16:55 +00004260 - system - generic atomicrmw acq_rel,
4261 except must generated
4262 all instructions even
4263 for OpenCL.*
Tony Tye69865532017-06-06 20:31:59 +00004264 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye0a092202017-10-18 22:16:55 +00004265 - wavefront fence acq_rel,
4266 - workgroup except must generated
4267 - agent all instructions even
4268 - system for OpenCL.*
4269 ============ ============ ============== ========== ===============================
Tony Tye69865532017-06-06 20:31:59 +00004270
4271The memory order also adds the single thread optimization constrains defined in
4272table
4273:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
4274
4275 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
4276 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
4277
4278 ============ ==============================================================
4279 LLVM Memory Optimization Constraints
4280 Ordering
4281 ============ ==============================================================
4282 unordered *none*
4283 monotonic *none*
4284 acquire - If a load atomic/atomicrmw then no following load/load
4285 atomic/store/ store atomic/atomicrmw/fence instruction can
4286 be moved before the acquire.
4287 - If a fence then same as load atomic, plus no preceding
4288 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue9d67e42017-06-26 02:45:39 +00004289 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye69865532017-06-06 20:31:59 +00004290 atomic/store/ store atomic/atomicrmw/fence instruction can
4291 be moved after the release.
4292 - If a fence then same as store atomic, plus no following
4293 associated fence-paired-atomic can be moved before the
4294 fence.
4295 acq_rel Same constraints as both acquire and release.
4296 seq_cst - If a load atomic then same constraints as acquire, plus no
4297 preceding sequentially consistent load atomic/store
4298 atomic/atomicrmw/fence instruction can be moved after the
4299 seq_cst.
4300 - If a store atomic then the same constraints as release, plus
4301 no following sequentially consistent load atomic/store
4302 atomic/atomicrmw/fence instruction can be moved before the
4303 seq_cst.
4304 - If an atomicrmw/fence then same constraints as acq_rel.
4305 ============ ==============================================================
Konstantin Zhuravlyov2cee5cc2017-03-08 23:55:44 +00004306
Wei Ding8d22e6f2017-02-21 18:48:01 +00004307Trap Handler ABI
Tony Tye69865532017-06-06 20:31:59 +00004308~~~~~~~~~~~~~~~~
Wei Ding8d22e6f2017-02-21 18:48:01 +00004309
Tony Tye69865532017-06-06 20:31:59 +00004310For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
4311(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
4312the ``s_trap`` instruction with the following usage:
Wei Ding8d22e6f2017-02-21 18:48:01 +00004313
Tony Tye69865532017-06-06 20:31:59 +00004314 .. table:: AMDGPU Trap Handler for AMDHSA OS
4315 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding8d22e6f2017-02-21 18:48:01 +00004316
Tony Tye69865532017-06-06 20:31:59 +00004317 =================== =============== =============== =======================
4318 Usage Code Sequence Trap Handler Description
4319 Inputs
4320 =================== =============== =============== =======================
4321 reserved ``s_trap 0x00`` Reserved by hardware.
4322 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
4323 ``queue_ptr`` ``debugtrap``
4324 ``VGPR0``: intrinsic (not
4325 ``arg`` implemented).
4326 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
4327 ``queue_ptr`` terminated and its
4328 associated queue put
4329 into the error state.
Tony Tyeb9947382018-05-16 16:19:34 +00004330 ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
4331 installed then
4332 behaves as a
4333 no-operation. The
4334 trap handler is
4335 entered and
4336 immediately returns
4337 to continue
4338 execution of the
4339 wavefront.
4340 - If the debugger is
4341 installed, causes
4342 the debug trap to be
4343 reported by the
4344 debugger and the
4345 wavefront is put in
4346 the halt state until
4347 resumed by the
4348 debugger.
4349 reserved ``s_trap 0x04`` Reserved.
4350 reserved ``s_trap 0x05`` Reserved.
4351 reserved ``s_trap 0x06`` Reserved.
4352 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
Tony Tye69865532017-06-06 20:31:59 +00004353 breakpoints.
Tony Tyeb9947382018-05-16 16:19:34 +00004354 reserved ``s_trap 0x08`` Reserved.
4355 reserved ``s_trap 0xfe`` Reserved.
4356 reserved ``s_trap 0xff`` Reserved.
Tony Tye69865532017-06-06 20:31:59 +00004357 =================== =============== =============== =======================
Wei Ding8d22e6f2017-02-21 18:48:01 +00004358
Tim Corringhamc5e0b132018-04-04 13:02:09 +00004359AMDPAL
4360------
4361
4362This section provides code conventions used when the target triple OS is
4363``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
4364from the application/runtime to each invocation of a hardware shader. These
4365parameters include both generic, application-controlled parameters called
4366*user data* as well as system-generated parameters that are a product of the
4367draw or dispatch execution.
4368
4369User Data
4370~~~~~~~~~
4371
4372Each hardware stage has a set of 32-bit *user data registers* which can be
4373written from a command buffer and then loaded into SGPRs when waves are launched
4374via a subsequent dispatch or draw operation. This is the way most arguments are
4375passed from the application/runtime to a hardware shader.
4376
4377Compute User Data
4378~~~~~~~~~~~~~~~~~
4379
4380Compute shader user data mappings are simpler than graphics shaders, and have a
4381fixed mapping.
4382
4383Note that there are always 10 available *user data entries* in registers -
4384entries beyond that limit must be fetched from memory (via the spill table
4385pointer) by the shader.
4386
4387 .. table:: PAL Compute Shader User Data Registers
4388 :name: pal-compute-user-data-registers
4389
4390 ============= ================================
4391 User Register Description
4392 ============= ================================
4393 0 Global Internal Table (32-bit pointer)
4394 1 Per-Shader Internal Table (32-bit pointer)
4395 2 - 11 Application-Controlled User Data (10 32-bit values)
4396 12 Spill Table (32-bit pointer)
4397 13 - 14 Thread Group Count (64-bit pointer)
4398 15 GDS Range
4399 ============= ================================
4400
4401Graphics User Data
4402~~~~~~~~~~~~~~~~~~
4403
4404Graphics pipelines support a much more flexible user data mapping:
4405
4406 .. table:: PAL Graphics Shader User Data Registers
4407 :name: pal-graphics-user-data-registers
4408
4409 ============= ================================
4410 User Register Description
4411 ============= ================================
4412 0 Global Internal Table (32-bit pointer)
4413 + Per-Shader Internal Table (32-bit pointer)
4414 + 1-15 Application Controlled User Data
4415 (1-15 Contiguous 32-bit Values in Registers)
4416 + Spill Table (32-bit pointer)
4417 + Draw Index (First Stage Only)
4418 + Vertex Offset (First Stage Only)
4419 + Instance Offset (First Stage Only)
4420 ============= ================================
4421
4422 The placement of the global internal table remains fixed in the first *user
4423 data SGPR register*. Otherwise all parameters are optional, and can be mapped
4424 to any desired *user data SGPR register*, with the following regstrictions:
4425
4426 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
4427 activehardware stage in a graphics pipeline (i.e. where the API vertex
4428 shader runs).
4429
4430 * Application-controlled user data must be mapped into a contiguous range of
4431 user data registers.
4432
4433 * The application-controlled user data range supports compaction remapping, so
4434 only *entries* that are actually consumed by the shader must be assigned to
4435 corresponding *registers*. Note that in order to support an efficient runtime
4436 implementation, the remapping must pack *registers* in the same order as
4437 *entries*, with unused *entries* removed.
4438
4439.. _pal_global_internal_table:
4440
4441Global Internal Table
4442~~~~~~~~~~~~~~~~~~~~~
4443
4444The global internal table is a table of *shader resource descriptors* (SRDs) that
4445define how certain engine-wide, runtime-managed resources should be accessed
4446from a shader. The majority of these resources have HW-defined formats, and it
4447is up to the compiler to write/read data as required by the target hardware.
4448
4449The following table illustrates the required format:
4450
4451 .. table:: PAL Global Internal Table
4452 :name: pal-git-table
4453
4454 ============= ================================
4455 Offset Description
4456 ============= ================================
4457 0-3 Graphics Scratch SRD
4458 4-7 Compute Scratch SRD
4459 8-11 ES/GS Ring Output SRD
4460 12-15 ES/GS Ring Input SRD
4461 16-19 GS/VS Ring Output #0
4462 20-23 GS/VS Ring Output #1
4463 24-27 GS/VS Ring Output #2
4464 28-31 GS/VS Ring Output #3
4465 32-35 GS/VS Ring Input SRD
4466 36-39 Tessellation Factor Buffer SRD
4467 40-43 Off-Chip LDS Buffer SRD
4468 44-47 Off-Chip Param Cache Buffer SRD
4469 48-51 Sample Position Buffer SRD
4470 52 vaRange::ShadowDescriptorTable High Bits
4471 ============= ================================
4472
4473 The pointer to the global internal table passed to the shader as user data
4474 is a 32-bit pointer. The top 32 bits should be assumed to be the same as
4475 the top 32 bits of the pipeline, so the shader may use the program
4476 counter's top 32 bits.
4477
Tony Tye1a0450f2017-08-15 20:47:41 +00004478Unspecified OS
4479--------------
4480
4481This section provides code conventions used when the target triple OS is
4482empty (see :ref:`amdgpu-target-triples`).
Tony Tye69865532017-06-06 20:31:59 +00004483
4484Trap Handler ABI
4485~~~~~~~~~~~~~~~~
4486
4487For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
4488not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
4489instructions are handled as follows:
4490
4491 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
4492 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
4493
4494 =============== =============== ===========================================
4495 Usage Code Sequence Description
4496 =============== =============== ===========================================
4497 llvm.trap s_endpgm Causes wavefront to be terminated.
4498 llvm.debugtrap *none* Compiler warning given that there is no
4499 trap handler installed.
4500 =============== =============== ===========================================
4501
4502Source Languages
4503================
4504
4505.. _amdgpu-opencl:
4506
4507OpenCL
4508------
4509
Tony Tye69865532017-06-06 20:31:59 +00004510When the language is OpenCL the following differences occur:
4511
45121. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye2b4b7fe2018-03-23 18:45:18 +000045132. The AMDGPU backend appends additional arguments to the kernel's explicit
4514 arguments for the AMDHSA OS (see
4515 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye1a0450f2017-08-15 20:47:41 +000045163. Additional metadata is generated
Scott Linder43cbf8d2018-06-21 19:38:56 +00004517 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
Tony Tye69865532017-06-06 20:31:59 +00004518
Tony Tye2b4b7fe2018-03-23 18:45:18 +00004519 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
4520 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
4521
4522 ======== ==== ========= ===========================================
4523 Position Byte Byte Description
4524 Size Alignment
4525 ======== ==== ========= ===========================================
Tony Tye9272c8a2018-03-23 18:58:47 +00004526 1 8 8 OpenCL Global Offset X
4527 2 8 8 OpenCL Global Offset Y
4528 3 8 8 OpenCL Global Offset Z
4529 4 8 8 OpenCL address of printf buffer
4530 5 8 8 OpenCL address of virtual queue used by
4531 enqueue_kernel.
4532 6 8 8 OpenCL address of AqlWrap struct used by
4533 enqueue_kernel.
Tony Tye2b4b7fe2018-03-23 18:45:18 +00004534 ======== ==== ========= ===========================================
Tony Tye69865532017-06-06 20:31:59 +00004535
4536.. _amdgpu-hcc:
4537
4538HCC
4539---
4540
Tony Tye2b4b7fe2018-03-23 18:45:18 +00004541When the language is HCC the following differences occur:
Tony Tye69865532017-06-06 20:31:59 +00004542
45431. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
4544
Scott Linder43cbf8d2018-06-21 19:38:56 +00004545.. _amdgpu-assembler:
4546
Tom Stellard953c6812015-06-13 03:28:10 +00004547Assembler
Tony Tye69865532017-06-06 20:31:59 +00004548---------
Tom Stellard953c6812015-06-13 03:28:10 +00004549
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004550AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye2f2cb6b2017-11-10 20:51:43 +00004551It supports AMDGCN GFX6-GFX9.
Tom Stellard953c6812015-06-13 03:28:10 +00004552
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004553This section describes general syntax for instructions and operands.
4554
4555Instructions
4556~~~~~~~~~~~~
4557
4558.. toctree::
4559 :hidden:
4560
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004561 AMDGPU/AMDGPUAsmGFX7
4562 AMDGPU/AMDGPUAsmGFX8
4563 AMDGPU/AMDGPUAsmGFX9
4564 AMDGPUModifierSyntax
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004565 AMDGPUOperandSyntax
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004566 AMDGPUInstructionSyntax
4567 AMDGPUInstructionNotation
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004568
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004569An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004570
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004571 ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,... <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004572
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004573:doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while
4574:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004575
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004576The order of *operands* and *modifiers* is fixed.
4577Most *modifiers* are optional and may be omitted.
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004578
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004579See detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`,
4580:doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`.
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004581
4582Note that features under development are not included in this description.
4583
4584For more information about instructions, their semantics and supported combinations of
Tony Tye69865532017-06-06 20:31:59 +00004585operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +00004586[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard953c6812015-06-13 03:28:10 +00004587
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004588Operands
Tony Tye69865532017-06-06 20:31:59 +00004589~~~~~~~~
Tom Stellard953c6812015-06-13 03:28:10 +00004590
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004591Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
Tom Stellard953c6812015-06-13 03:28:10 +00004592
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004593Modifiers
4594~~~~~~~~~
Tom Stellard953c6812015-06-13 03:28:10 +00004595
Dmitry Preobrazhensky51120d72018-12-17 17:38:11 +00004596Detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`.
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004597
Tony Tye69865532017-06-06 20:31:59 +00004598Instruction Examples
4599~~~~~~~~~~~~~~~~~~~~
4600
4601DS
Dmitry Preobrazhensky1eaf2d72018-03-12 15:55:08 +00004602++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004603
4604.. code-block:: nasm
4605
4606 ds_add_u32 v2, v4 offset:16
4607 ds_write_src2_b64 v2 offset0:4 offset1:8
4608 ds_cmpst_f32 v2, v4, v6
4609 ds_min_rtn_f64 v[8:9], v2, v[4:5]
4610
4611
4612For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
4613
Tony Tye69865532017-06-06 20:31:59 +00004614FLAT
4615++++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004616
4617.. code-block:: nasm
4618
4619 flat_load_dword v1, v[3:4]
4620 flat_store_dwordx3 v[3:4], v[5:7]
4621 flat_atomic_swap v1, v[3:4], v5 glc
4622 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4623 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4624
4625For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
4626
Tony Tye69865532017-06-06 20:31:59 +00004627MUBUF
4628+++++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004629
4630.. code-block:: nasm
4631
4632 buffer_load_dword v1, off, s[4:7], s1
4633 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4634 buffer_store_format_xy v[1:2], off, s[4:7], s1
4635 buffer_wbinvl1
4636 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4637
4638For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
4639
Tony Tye69865532017-06-06 20:31:59 +00004640SMRD/SMEM
4641+++++++++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004642
4643.. code-block:: nasm
4644
4645 s_load_dword s1, s[2:3], 0xfc
4646 s_load_dwordx8 s[8:15], s[2:3], s4
4647 s_load_dwordx16 s[88:103], s[2:3], s4
4648 s_dcache_inv_vol
4649 s_memtime s[4:5]
4650
4651For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
4652
Tony Tye69865532017-06-06 20:31:59 +00004653SOP1
4654++++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004655
4656.. code-block:: nasm
4657
4658 s_mov_b32 s1, s2
4659 s_mov_b64 s[0:1], 0x80000000
4660 s_cmov_b32 s1, 200
4661 s_wqm_b64 s[2:3], s[4:5]
4662 s_bcnt0_i32_b64 s1, s[2:3]
4663 s_swappc_b64 s[2:3], s[4:5]
4664 s_cbranch_join s[4:5]
4665
4666For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
4667
Tony Tye69865532017-06-06 20:31:59 +00004668SOP2
4669++++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004670
4671.. code-block:: nasm
4672
4673 s_add_u32 s1, s2, s3
4674 s_and_b64 s[2:3], s[4:5], s[6:7]
4675 s_cselect_b32 s1, s2, s3
4676 s_andn2_b32 s2, s4, s6
4677 s_lshr_b64 s[2:3], s[4:5], s6
4678 s_ashr_i32 s2, s4, s6
4679 s_bfm_b64 s[2:3], s4, s6
4680 s_bfe_i64 s[2:3], s[4:5], s6
4681 s_cbranch_g_fork s[4:5], s[6:7]
4682
4683For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
4684
Tony Tye69865532017-06-06 20:31:59 +00004685SOPC
4686++++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004687
4688.. code-block:: nasm
4689
4690 s_cmp_eq_i32 s1, s2
4691 s_bitcmp1_b32 s1, s2
4692 s_bitcmp0_b64 s[2:3], s4
4693 s_setvskip s3, s5
4694
4695For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
4696
Tony Tye69865532017-06-06 20:31:59 +00004697SOPP
4698++++
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004699
4700.. code-block:: nasm
4701
4702 s_barrier
4703 s_nop 2
4704 s_endpgm
4705 s_waitcnt 0 ; Wait for all counters to be 0
4706 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
4707 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
4708 s_sethalt 9
4709 s_sleep 10
4710 s_sendmsg 0x1
4711 s_sendmsg sendmsg(MSG_INTERRUPT)
4712 s_trap 1
4713
4714For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
4715
4716Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru1d6becb2017-01-14 11:37:01 +00004717of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard953c6812015-06-13 03:28:10 +00004718range or acceptable values.
4719
Tony Tye69865532017-06-06 20:31:59 +00004720VALU
4721++++
Tom Stellard953c6812015-06-13 03:28:10 +00004722
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004723For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4724the assembler will automatically use optimal encoding based on its operands.
4725To force specific encoding, one can add a suffix to the opcode of the instruction:
4726
4727* _e32 for 32-bit VOP1/VOP2/VOPC
4728* _e64 for 64-bit VOP3
4729* _dpp for VOP_DPP
4730* _sdwa for VOP_SDWA
4731
4732VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard953c6812015-06-13 03:28:10 +00004733
4734.. code-block:: nasm
4735
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004736 v_mov_b32 v1, v2
4737 v_mov_b32_e32 v1, v2
4738 v_nop
4739 v_cvt_f64_i32_e32 v[1:2], v2
4740 v_floor_f32_e32 v1, v2
4741 v_bfrev_b32_e32 v1, v2
4742 v_add_f32_e32 v1, v2, v3
4743 v_mul_i32_i24_e64 v1, v2, 3
4744 v_mul_i32_i24_e32 v1, -3, v3
4745 v_mul_i32_i24_e32 v1, -100, v3
4746 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4747 v_max_f16_e32 v1, v2, v3
Tom Stellard953c6812015-06-13 03:28:10 +00004748
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004749VOP_DPP examples:
Tom Stellard953c6812015-06-13 03:28:10 +00004750
4751.. code-block:: nasm
4752
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004753 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4754 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4755 v_mov_b32 v0, v0 wave_shl:1
4756 v_mov_b32 v0, v0 row_mirror
4757 v_mov_b32 v0, v0 row_bcast:31
4758 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4759 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4760 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellardac1a45e2015-06-26 21:15:07 +00004761
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004762VOP_SDWA examples:
4763
4764.. code-block:: nasm
4765
4766 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4767 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4768 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4769 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4770 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4771
4772For full list of supported instructions, refer to "Vector ALU instructions".
4773
Konstantin Zhuravlyovd6794f02018-06-22 19:23:18 +00004774.. TODO
4775 Remove once we switch to code object v3 by default.
4776
4777HSA Code Object Directives
4778~~~~~~~~~~~~~~~~~~~~~~~~~~
4779
4780AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4781one can specify them with assembler directives.
4782
4783.hsa_code_object_version major, minor
4784+++++++++++++++++++++++++++++++++++++
4785
4786*major* and *minor* are integers that specify the version of the HSA code
4787object that will be generated by the assembler.
4788
4789.hsa_code_object_isa [major, minor, stepping, vendor, arch]
4790+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4791
4792
4793*major*, *minor*, and *stepping* are all integers that describe the instruction
4794set architecture (ISA) version of the assembly program.
4795
4796*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4797"AMD" and *arch* should always be equal to "AMDGPU".
4798
4799By default, the assembler will derive the ISA version, *vendor*, and *arch*
4800from the value of the -mcpu option that is passed to the assembler.
4801
4802.amdgpu_hsa_kernel (name)
4803+++++++++++++++++++++++++
4804
4805This directives specifies that the symbol with given name is a kernel entry point
4806(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4807
4808.amd_kernel_code_t
4809++++++++++++++++++
4810
4811This directive marks the beginning of a list of key / value pairs that are used
4812to specify the amd_kernel_code_t object that will be emitted by the assembler.
4813The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4814any amd_kernel_code_t values that are unspecified a default value will be
4815used. The default value for all keys is 0, with the following exceptions:
4816
4817- *kernel_code_version_major* defaults to 1.
4818- *machine_kind* defaults to 1.
4819- *machine_version_major*, *machine_version_minor*, and
4820 *machine_version_stepping* are derived from the value of the -mcpu option
4821 that is passed to the assembler.
4822- *kernel_code_entry_byte_offset* defaults to 256.
4823- *wavefront_size* defaults to 6.
4824- *kernarg_segment_alignment*, *group_segment_alignment*, and
4825 *private_segment_alignment* default to 4. Note that alignments are specified
Scott Linderdff71ea2018-11-15 20:46:55 +00004826 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
Konstantin Zhuravlyovd6794f02018-06-22 19:23:18 +00004827
4828The *.amd_kernel_code_t* directive must be placed immediately after the
4829function label and before any instructions.
4830
4831For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4832comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4833
4834Here is an example of a minimal amd_kernel_code_t specification:
4835
4836.. code-block:: none
4837
4838 .hsa_code_object_version 1,0
4839 .hsa_code_object_isa
4840
4841 .hsatext
4842 .globl hello_world
4843 .p2align 8
4844 .amdgpu_hsa_kernel hello_world
4845
4846 hello_world:
4847
4848 .amd_kernel_code_t
4849 enable_sgpr_kernarg_segment_ptr = 1
4850 is_ptr64 = 1
4851 compute_pgm_rsrc1_vgprs = 0
4852 compute_pgm_rsrc1_sgprs = 0
4853 compute_pgm_rsrc2_user_sgpr = 2
4854 kernarg_segment_byte_size = 8
4855 wavefront_sgpr_count = 2
4856 workitem_vgpr_count = 3
4857 .end_amd_kernel_code_t
4858
4859 s_load_dwordx2 s[0:1], s[0:1] 0x0
4860 v_mov_b32 v0, 3.14159
4861 s_waitcnt lgkmcnt(0)
4862 v_mov_b32 v1, s0
4863 v_mov_b32 v2, s1
4864 flat_store_dword v[1:2], v0
4865 s_endpgm
4866 .Lfunc_end0:
4867 .size hello_world, .Lfunc_end0-hello_world
4868
4869Predefined Symbols (-mattr=+code-object-v3)
4870~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004871
Scott Linder43cbf8d2018-06-21 19:38:56 +00004872The AMDGPU assembler defines and updates some symbols automatically. These
4873symbols do not affect code generation.
Tom Stellardac1a45e2015-06-26 21:15:07 +00004874
Scott Linder43cbf8d2018-06-21 19:38:56 +00004875.amdgcn.gfx_generation_number
4876+++++++++++++++++++++++++++++
Tom Stellardac1a45e2015-06-26 21:15:07 +00004877
Scott Linder43cbf8d2018-06-21 19:38:56 +00004878Set to the GFX generation number of the target being assembled for. For
4879example, when assembling for a "GFX9" target this will be set to the integer
4880value "9". The possible GFX generation numbers are presented in
4881:ref:`amdgpu-processors`.
Tom Stellardac1a45e2015-06-26 21:15:07 +00004882
Scott Linder43cbf8d2018-06-21 19:38:56 +00004883.amdgcn.next_free_vgpr
4884++++++++++++++++++++++
Tony Tye69865532017-06-06 20:31:59 +00004885
Scott Linder43cbf8d2018-06-21 19:38:56 +00004886Set to zero before assembly begins. At each instruction, if the current value
4887of this symbol is less than or equal to the maximum VGPR number explicitly
4888referenced within that instruction then the symbol value is updated to equal
4889that VGPR number plus one.
Tom Stellardac1a45e2015-06-26 21:15:07 +00004890
Scott Linder43cbf8d2018-06-21 19:38:56 +00004891May be used to set the `.amdhsa_next_free_vpgr` directive in
4892:ref:`amdhsa-kernel-directives-table`.
Tom Stellardac1a45e2015-06-26 21:15:07 +00004893
Scott Linder43cbf8d2018-06-21 19:38:56 +00004894May be set at any time, e.g. manually set to zero at the start of each kernel.
Tom Stellardac1a45e2015-06-26 21:15:07 +00004895
Scott Linder43cbf8d2018-06-21 19:38:56 +00004896.amdgcn.next_free_sgpr
4897++++++++++++++++++++++
Tom Stellardac1a45e2015-06-26 21:15:07 +00004898
Scott Linder43cbf8d2018-06-21 19:38:56 +00004899Set to zero before assembly begins. At each instruction, if the current value
4900of this symbol is less than or equal the maximum SGPR number explicitly
4901referenced within that instruction then the symbol value is updated to equal
4902that SGPR number plus one.
Nikolay Haustov2a3c7392016-09-20 09:04:51 +00004903
Scott Linder43cbf8d2018-06-21 19:38:56 +00004904May be used to set the `.amdhsa_next_free_spgr` directive in
4905:ref:`amdhsa-kernel-directives-table`.
Tom Stellard4a888082015-06-26 21:58:31 +00004906
Scott Linder43cbf8d2018-06-21 19:38:56 +00004907May be set at any time, e.g. manually set to zero at the start of each kernel.
Tom Stellard4a888082015-06-26 21:58:31 +00004908
Konstantin Zhuravlyovd6794f02018-06-22 19:23:18 +00004909Code Object Directives (-mattr=+code-object-v3)
4910~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tom Stellard4a888082015-06-26 21:58:31 +00004911
Scott Linder43cbf8d2018-06-21 19:38:56 +00004912Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
4913architecture processors, and are not OS-specific. Directives which begin with
4914``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
4915``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
4916:ref:`amdgpu-processors`.
Tom Stellard4a888082015-06-26 21:58:31 +00004917
Scott Linder43cbf8d2018-06-21 19:38:56 +00004918.amdgcn_target <target>
4919+++++++++++++++++++++++
Tom Stellard4a888082015-06-26 21:58:31 +00004920
Scott Linder43cbf8d2018-06-21 19:38:56 +00004921Optional directive which declares the target supported by the containing
4922assembler source file. Valid values are described in
4923:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
4924to validate command-line options such as ``-triple``, ``-mcpu``, and those
4925which specify target features.
Tom Stellard4a888082015-06-26 21:58:31 +00004926
Scott Linder43cbf8d2018-06-21 19:38:56 +00004927.amdhsa_kernel <name>
4928+++++++++++++++++++++
Tom Stellard4a888082015-06-26 21:58:31 +00004929
Scott Linder43cbf8d2018-06-21 19:38:56 +00004930Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
4931``<name>.kd``, in the current location of the current section. Only valid when
4932the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
4933instruction to execute, and does not need to be previously defined.
Tom Stellard4a888082015-06-26 21:58:31 +00004934
Scott Linder43cbf8d2018-06-21 19:38:56 +00004935Marks the beginning of a list of directives used to generate the bytes of a
4936kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
4937Directives which may appear in this list are described in
4938:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
4939be valid for the target being assembled for, and cannot be repeated. Directives
4940support the range of values specified by the field they reference in
4941:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
4942assumed to have its default value, unless it is marked as "Required", in which
4943case it is an error to omit the directive. This list of directives is
4944terminated by an ``.end_amdhsa_kernel`` directive.
Tom Stellard4a888082015-06-26 21:58:31 +00004945
Scott Linder43cbf8d2018-06-21 19:38:56 +00004946 .. table:: AMDHSA Kernel Assembler Directives
4947 :name: amdhsa-kernel-directives-table
Tom Stellard4a888082015-06-26 21:58:31 +00004948
Scott Linder43cbf8d2018-06-21 19:38:56 +00004949 ======================================================== ================ ============ ===================
4950 Directive Default Supported On Description
4951 ======================================================== ================ ============ ===================
4952 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX9 Controls GROUP_SEGMENT_FIXED_SIZE in
4953 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4954 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX9 Controls PRIVATE_SEGMENT_FIXED_SIZE in
4955 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4956 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
4957 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4958 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_PTR in
4959 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4960 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_QUEUE_PTR in
4961 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4962 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
4963 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4964 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_ID in
4965 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4966 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX9 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
4967 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4968 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
4969 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4970 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
4971 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4972 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_X in
4973 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4974 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
4975 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4976 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
4977 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4978 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_INFO in
4979 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4980 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX9 Controls ENABLE_VGPR_WORKITEM_ID in
4981 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4982 Possible values are defined in
4983 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
4984 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX9 Maximum VGPR number explicitly referenced, plus one.
4985 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
4986 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4987 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX9 Maximum SGPR number explicitly referenced, plus one.
4988 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4989 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4990 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX9 Whether the kernel may use the special VCC SGPR.
4991 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4992 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4993 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX9 Whether the kernel may use flat instructions to access
4994 scratch memory. Used to calculate
4995 GRANULATED_WAVEFRONT_SGPR_COUNT in
4996 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4997 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX9 Whether the kernel may trigger XNACK replay.
4998 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4999 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5000 (+xnack)
5001 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_32 in
5002 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5003 Possible values are defined in
5004 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
5005 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_16_64 in
5006 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5007 Possible values are defined in
5008 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
5009 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX9 Controls FLOAT_DENORM_MODE_32 in
5010 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5011 Possible values are defined in
5012 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
5013 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX9 Controls FLOAT_DENORM_MODE_16_64 in
5014 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5015 Possible values are defined in
5016 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
5017 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX9 Controls ENABLE_DX10_CLAMP in
5018 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5019 ``.amdhsa_ieee_mode`` 1 GFX6-GFX9 Controls ENABLE_IEEE_MODE in
5020 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5021 ``.amdhsa_fp16_overflow`` 0 GFX9 Controls FP16_OVFL in
5022 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
5023 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
5024 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
5025 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
5026 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
5027 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
5028 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
5029 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
5030 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
5031 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
5032 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
5033 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
5034 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
5035 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
5036 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
5037 ======================================================== ================ ============ ===================
Tom Stellard4a888082015-06-26 21:58:31 +00005038
Scott Linderdff71ea2018-11-15 20:46:55 +00005039.amdgpu_metadata
5040++++++++++++++++
5041
5042Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
5043note record (see :ref:`amdgpu-elf-note-records-table-v3`).
5044
5045The contents must be in the [YAML]_ markup format, with the same structure and
5046semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
5047
5048This directive is terminated by an ``.end_amdgpu_metadata`` directive.
5049
Konstantin Zhuravlyovd6794f02018-06-22 19:23:18 +00005050Example HSA Source Code (-mattr=+code-object-v3)
5051~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tom Stellard4a888082015-06-26 21:58:31 +00005052
Scott Linder43cbf8d2018-06-21 19:38:56 +00005053Here is an example of a minimal assembly source file, defining one HSA kernel:
5054
Chandler Carruthc0cb45a2018-08-06 01:19:43 +00005055.. code-block:: none
Scott Linder43cbf8d2018-06-21 19:38:56 +00005056
5057 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
5058
5059 .text
5060 .globl hello_world
5061 .p2align 8
5062 .type hello_world,@function
5063 hello_world:
5064 s_load_dwordx2 s[0:1], s[0:1] 0x0
5065 v_mov_b32 v0, 3.14159
5066 s_waitcnt lgkmcnt(0)
5067 v_mov_b32 v1, s0
5068 v_mov_b32 v2, s1
5069 flat_store_dword v[1:2], v0
5070 s_endpgm
5071 .Lfunc_end0:
5072 .size hello_world, .Lfunc_end0-hello_world
5073
5074 .rodata
5075 .p2align 6
5076 .amdhsa_kernel hello_world
5077 .amdhsa_user_sgpr_kernarg_segment_ptr 1
5078 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
5079 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
5080 .end_amdhsa_kernel
5081
Scott Linderdff71ea2018-11-15 20:46:55 +00005082 .amdgpu_metadata
5083 ---
5084 amdhsa.version:
5085 - 1
5086 - 0
5087 amdhsa.kernels:
5088 - .name: hello_world
5089 .symbol: hello_world.kd
5090 .kernarg_segment_size: 48
5091 .group_segment_fixed_size: 0
5092 .private_segment_fixed_size: 0
5093 .kernarg_segment_align: 4
5094 .wavefront_size: 64
5095 .sgpr_count: 2
5096 .vgpr_count: 3
5097 .max_flat_workgroup_size: 256
5098 ...
5099 .end_amdgpu_metadata
Tony Tye69865532017-06-06 20:31:59 +00005100
5101Additional Documentation
5102========================
5103
Konstantin Zhuravlyova05cf7b2017-10-18 17:59:20 +00005104.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
5105.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
5106.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
5107.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
5108.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
5109.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
5110.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
5111.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye69865532017-06-06 20:31:59 +00005112.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
5113.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
5114.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
5115.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
5116.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovcb5868c2017-10-19 17:12:55 +00005117.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Scott Linderdff71ea2018-11-15 20:46:55 +00005118.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
Tony Tye69865532017-06-06 20:31:59 +00005119.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
5120.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
Tony Tye978dec72018-06-14 16:40:10 +00005121.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__