=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX
   assembly language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: text

   !0 = !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float, float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


.. _address_spaces:

Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.
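
As a minimal illustration (not part of the original examples), a pointer into
such a global can also be converted to the generic address space with a plain
``addrspacecast`` and then used like any other pointer:

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

    define i32 @load_first() {
      ; Cast the global-space pointer into the generic address space
      %gen = addrspacecast [4 x i32] addrspace(1)* @g to [4 x i32]*
      %p = getelementptr [4 x i32], [4 x i32]* %gen, i32 0, i32 0
      %v = load i32, i32* %p
      ret i32 %v
    }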


Triples
-------

The NVPTX target uses the module triple to select between 32/64-bit code
generation and the driver-compiler interface to use. The triple architecture
can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
operating system should be one of ``cuda`` or ``nvcl``, which determines the
interface used by the generated code to communicate with the driver.  Most
users will want to use ``cuda`` as the operating system, which makes the
generated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``

Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
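
In LLVM IR, the triple is declared at the top of the module; for example, a
module targeting 64-bit PTX for the CUDA Driver API would contain:

.. code-block:: llvm

    target triple = "nvptx64-nvidia-cuda"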



.. _nvptx_intrinsics:

NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
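
For example, a pointer to a shared-memory buffer can be converted to a generic
pointer and handed to code that is agnostic of the memory space. This is a
sketch; the ``@buf`` array is hypothetical and not part of the original text:

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)

    @buf = internal addrspace(3) global [16 x i8] zeroinitializer

    define i8* @buf_as_generic() {
      %p = getelementptr [16 x i8], [16 x i8] addrspace(3)* @buf, i32 0, i32 0
      ; Convert the shared-space pointer to a generic pointer
      %g = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %p)
      ret i8* %g
    }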


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the target address space of the pointer is
known.  It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds.  These registers
map in the following way to CUDA builtins:

   ============== =====================================
   CUDA Builtin   PTX Special Register Intrinsic
   ============== =====================================
   ``threadIdx``  ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx``   ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim``   ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``    ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============== =====================================
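
For instance, the familiar CUDA expression
``blockIdx.x * blockDim.x + threadIdx.x`` can be written with these intrinsics
as follows (a sketch, not part of the original text):

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_thread_id() {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
      ; gid = ctaid * ntid + tid
      %base  = mul i32 %ctaid, %ntid
      %gid   = add i32 %base, %tid
      ret i32 %gid
    }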


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
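
A typical use is to separate writes and reads of shared memory so that every
thread in the CTA observes the writes of its neighbors. The ``@tile`` buffer
and the neighbor-exchange pattern below are illustrative only:

.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

    @tile = internal addrspace(3) global [64 x float] zeroinitializer

    define float @exchange(float %v) {
      %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %slot = getelementptr [64 x float], [64 x float] addrspace(3)* @tile, i32 0, i32 %tid
      store float %v, float addrspace(3)* %slot
      ; Wait until every thread in the CTA has written its slot
      call void @llvm.nvvm.barrier0()
      ; Read the neighboring slot, wrapping at 64
      %n = add i32 %tid, 1
      %idx = urem i32 %n, 64
      %other = getelementptr [64 x float], [64 x float] addrspace(3)* @tile, i32 0, i32 %idx
      %r = load float, float addrspace(3)* %other
      ret float %r
    }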


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


.. _libdevice:

Linking with Libdevice
======================

The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
implements many common mathematical functions. This library can be used as a
high-performance math library for any compiler using the LLVM NVPTX target.
The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit, and
there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see
`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows a pattern:

.. code-block:: c++

   float my_function(float a) {
     if (__nvvm_reflect("FASTMATH"))
       return my_function_fast(a);
     else
       return my_function_precise(a);
   }

The default value for all unspecified reflection parameters is zero.

The ``NVVMReflect`` pass should be executed early in the optimization
pipeline, immediately after the link stage.  The ``internalize`` pass is also
recommended to remove unused math functions from the resulting PTX.  For an
input IR module ``module.bc``, the following compilation flow is recommended:

1. Save list of external functions in ``module.bc``
2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
3. Internalize all functions not in list from (1)
4. Eliminate all unused internal functions
5. Run ``NVVMReflect`` pass
6. Run standard optimization pipeline

.. note::

   ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
   libdevice functions.  It is possible to link two IR modules that have been
   linked against libdevice using different reflection variables.

Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
often leave behind dead code of the form:

.. code-block:: llvm

   entry:
     ..
     br i1 true, label %foo, label %bar
   foo:
     ..
   bar:
     ; Dead code
     ..

Therefore, it is recommended that ``NVVMReflect`` is executed early in the
optimization pipeline, before dead-code elimination.

The NVPTX TargetMachine knows how to schedule ``NVVMReflect`` at the beginning
of your pass manager; just use the following code when setting up your pass
manager:

.. code-block:: c++

   std::unique_ptr<TargetMachine> TM = ...;
   PassManagerBuilder PMBuilder(...);
   if (TM)
     TM->adjustPassManager(PMBuilder);

Reflection Parameters
---------------------

The libdevice library currently uses the following reflection parameters to
control code generation:

==================== ======================================================
Flag                 Description
==================== ======================================================
``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
==================== ======================================================

The value of this flag is determined by the "nvvm-reflect-ftz" module flag.
The following sets the ftz flag to 1.

.. code-block:: llvm

   !llvm.module.flags = !{!0}
   !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}

(``i32 4`` indicates that the value set here overrides the value in another
module we link with.  See the `LangRef <LangRef.html#module-flags-metadata>`_
for details.)

Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.


Common Issues
=============

ptxas complains of undefined function: __nvvm_reflect
-----------------------------------------------------

When linking with libdevice, the ``NVVMReflect`` pass must be used. See
:ref:`libdevice` for more information.


Tutorial: A Simple Compute Kernel
=================================

To start, let us take a look at a simple compute kernel written directly in
LLVM IR. The kernel implements vector addition, where each thread computes one
element of the output vector C from the input vectors A and B. To make this
easier, we also assume that only a single CTA (thread block) will be launched,
and that it will be one dimensional.


The Kernel
----------

.. code-block:: llvm

    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
    target triple = "nvptx64-nvidia-cuda"

    ; Intrinsic to read X component of thread ID
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    define void @kernel(float addrspace(1)* %A,
                        float addrspace(1)* %B,
                        float addrspace(1)* %C) {
    entry:
      ; What is my ID?
      %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

      ; Compute pointers into A, B, and C
      %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
      %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
      %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id

      ; Read A, B
      %valA = load float, float addrspace(1)* %ptrA, align 4
      %valB = load float, float addrspace(1)* %ptrB, align 4

      ; Compute C = A + B
      %valC = fadd float %valA, %valB

      ; Store back to C
      store float %valC, float addrspace(1)* %ptrC, align 4

      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = !{void (float addrspace(1)*,
                 float addrspace(1)*,
                 float addrspace(1)*)* @kernel, !"kernel", i32 1}


We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:

.. code-block:: text

    # llc -mcpu=sm_20 kernel.ll -o kernel.ptx


.. note::

   If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
   in the module data layout string and use ``nvptx-nvidia-cuda`` as the
   target triple.


The output we get from ``llc`` (as of LLVM 3.4):

.. code-block:: text

    //
    // Generated by LLVM NVPTX Back-End
    //

    .version 3.1
    .target sm_20
    .address_size 64

      // .globl kernel
                                            // @kernel
    .visible .entry kernel(
      .param .u64 kernel_param_0,
      .param .u64 kernel_param_1,
      .param .u64 kernel_param_2
    )
    {
      .reg .f32  %f<4>;
      .reg .s32  %r<2>;
      .reg .s64  %rl<8>;

    // %bb.0:                               // %entry
      ld.param.u64   %rl1, [kernel_param_0];
      mov.u32        %r1, %tid.x;
      mul.wide.s32   %rl2, %r1, 4;
      add.s64        %rl3, %rl1, %rl2;
      ld.param.u64   %rl4, [kernel_param_1];
      add.s64        %rl5, %rl4, %rl2;
      ld.param.u64   %rl6, [kernel_param_2];
      add.s64        %rl7, %rl6, %rl2;
      ld.global.f32  %f1, [%rl3];
      ld.global.f32  %f2, [%rl5];
      add.f32        %f3, %f1, %f2;
      st.global.f32  [%rl7], %f3;
      ret;
    }


Dissecting the Kernel
---------------------

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout
^^^^^^^^^^^

The data layout string determines the size in bits of common data types, their
ABI alignment, and their storage size. For NVPTX, you should use one of the
following:

32-bit PTX:

.. code-block:: llvm

   target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

.. code-block:: llvm

   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"


Target Intrinsics
^^^^^^^^^^^^^^^^^

In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
read the X component of the current thread's ID, which corresponds to a read
of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
intrinsics.  A short list is shown below; please see
``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.


================================================ ====================
Intrinsic                                        CUDA Equivalent
================================================ ====================
``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
``void @llvm.nvvm.barrier0()``                   __syncthreads()
================================================ ====================
564
565Address Spaces
566^^^^^^^^^^^^^^
567
568You may have noticed that all of the pointer types in the LLVM IR example had
569an explicit address space specifier. What is address space 1? NVIDIA GPU
570devices (generally) have four types of memory:
571
572- Global: Large, off-chip memory
573- Shared: Small, on-chip memory shared among all threads in a CTA
574- Local: Per-thread, private memory
575- Constant: Read-only memory shared across all threads
576
577These different types of memory are represented in LLVM IR as address spaces.
578There is also a fifth address space used by the NVPTX code generator that
579corresponds to the "generic" address space. This address space can represent
580addresses in any other address space (with a few exceptions). This allows
581users to write IR functions that can load/store memory using the same
582instructions. Intrinsics are provided to convert pointers between the generic
583and non-generic address spaces.
584
585See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.


Kernel Metadata
^^^^^^^^^^^^^^^

In PTX, a function can be either a `kernel` function (callable from the host
program), or a `device` function (callable only from GPU code).  You can think
of `kernel` functions as entry-points in the GPU program.  To mark an LLVM IR
function as a `kernel` function, we make use of special LLVM metadata.  The
NVPTX back-end will look for a named metadata node called
``nvvm.annotations``.  This named metadata must contain a list of metadata
that describe the IR.  For our purposes, we need to declare a metadata node
that assigns the "kernel" attribute to the LLVM IR function that should be
emitted as a PTX `kernel` function.  These metadata nodes take the form:

.. code-block:: text

   !{<function ref>, metadata !"kernel", i32 1}

For the previous example, we have:

.. code-block:: llvm

   !nvvm.annotations = !{!0}
   !0 = !{void (float addrspace(1)*,
                float addrspace(1)*,
                float addrspace(1)*)* @kernel, !"kernel", i32 1}

Here, we have a single metadata declaration in ``nvvm.annotations``.  This
metadata annotates our ``@kernel`` function with the ``kernel`` attribute.


Running the Kernel
------------------

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device?  The CUDA Driver API provides a convenient mechanism for
loading and JIT compiling PTX to a native GPU device, and launching a kernel.
The API is similar to OpenCL.  A simple example showing how to load and
execute our vector addition code is shown below.  Note that for brevity this
code does not perform much error checking!

.. note::

   You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
   compile PTX to machine code (SASS) for a specific GPU architecture.  Such
   binaries can be loaded by the CUDA Driver API in the same way as PTX.  This
   can be useful for reducing startup time by precompiling the PTX kernels.
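
As a sketch, such an offline compilation step might look like the following
(flag spellings vary between CUDA Toolkit versions, so treat this invocation
as illustrative rather than definitive):

.. code-block:: text

   # ptxas -arch=sm_20 kernel.ptx -o kernel.cubin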


.. code-block:: c++

   #include <iostream>
   #include <fstream>
   #include <cassert>
   #include "cuda.h"


   void checkCudaErrors(CUresult err) {
     assert(err == CUDA_SUCCESS);
   }

   /// main - Program entry point
   int main(int argc, char **argv) {
     CUdevice    device;
     CUmodule    cudaModule;
     CUcontext   context;
     CUfunction  function;
     CUlinkState linker;
     int         devCount;

     // CUDA initialization
     checkCudaErrors(cuInit(0));
     checkCudaErrors(cuDeviceGetCount(&devCount));
     checkCudaErrors(cuDeviceGet(&device, 0));

     char name[128];
     checkCudaErrors(cuDeviceGetName(name, 128, device));
     std::cout << "Using CUDA Device [0]: " << name << "\n";

     int devMajor, devMinor;
     checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
     std::cout << "Device Compute Capability: "
               << devMajor << "." << devMinor << "\n";
     if (devMajor < 2) {
       std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
       return 1;
     }

     std::ifstream t("kernel.ptx");
     if (!t.is_open()) {
       std::cerr << "kernel.ptx not found\n";
       return 1;
     }
     std::string str((std::istreambuf_iterator<char>(t)),
                     std::istreambuf_iterator<char>());

     // Create driver context
     checkCudaErrors(cuCtxCreate(&context, 0, device));

     // Create module for object
     checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

     // Get kernel function
     checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

     // Device data
     CUdeviceptr devBufferA;
     CUdeviceptr devBufferB;
     CUdeviceptr devBufferC;

     checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
     checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
     checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

     float* hostA = new float[16];
     float* hostB = new float[16];
     float* hostC = new float[16];

     // Populate input
     for (unsigned i = 0; i != 16; ++i) {
       hostA[i] = (float)i;
       hostB[i] = (float)(2*i);
       hostC[i] = 0.0f;
     }

     checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
     checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));


     unsigned blockSizeX = 16;
     unsigned blockSizeY = 1;
     unsigned blockSizeZ = 1;
     unsigned gridSizeX  = 1;
     unsigned gridSizeY  = 1;
     unsigned gridSizeZ  = 1;

     // Kernel parameters
     void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

     std::cout << "Launching kernel\n";

     // Kernel launch
     checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                    blockSizeX, blockSizeY, blockSizeZ,
                                    0, NULL, KernelParams, NULL));

     // Retrieve device data
     checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));


     std::cout << "Results:\n";
     for (unsigned i = 0; i != 16; ++i) {
       std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
     }


     // Clean up after ourselves
     delete [] hostA;
     delete [] hostB;
     delete [] hostC;

     // Clean-up
     checkCudaErrors(cuMemFree(devBufferA));
     checkCudaErrors(cuMemFree(devBufferB));
     checkCudaErrors(cuMemFree(devBufferC));
     checkCudaErrors(cuModuleUnload(cudaModule));
     checkCudaErrors(cuCtxDestroy(context));

     return 0;
   }


You will need to link with the CUDA driver and specify the path to cuda.h.

.. code-block:: text

   # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

   Using CUDA Device [0]: GeForce GTX 680
   Device Compute Capability: 3.0
   Launching kernel
   Results:
   0 + 0 = 0
   1 + 2 = 3
   2 + 4 = 6
   3 + 6 = 9
   4 + 8 = 12
   5 + 10 = 15
   6 + 12 = 18
   7 + 14 = 21
   8 + 16 = 24
   9 + 18 = 27
   10 + 20 = 30
   11 + 22 = 33
   12 + 24 = 36
   13 + 26 = 39
   14 + 28 = 42
   15 + 30 = 45

.. note::

   You will likely see a different device identifier based on your hardware.


Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library.  We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
   target triple = "nvptx64-nvidia-cuda"

   ; Intrinsic to read X component of thread ID
   declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
   ; libdevice function
   declare float @__nv_powf(float, float)

   define void @kernel(float addrspace(1)* %A,
                       float addrspace(1)* %B,
                       float addrspace(1)* %C) {
   entry:
     ; What is my ID?
     %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

     ; Compute pointers into A, B, and C
     %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
     %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
     %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id

     ; Read A, B
     %valA = load float, float addrspace(1)* %ptrA, align 4
     %valB = load float, float addrspace(1)* %ptrB, align 4

     ; Compute C = pow(A, B)
     %valC = call float @__nv_powf(float %valA, float %valB)

     ; Store back to C
     store float %valC, float addrspace(1)* %ptrC, align 4

     ret void
   }

   !nvvm.annotations = !{!0}
   !0 = !{void (float addrspace(1)*,
                float addrspace(1)*,
                float addrspace(1)*)* @kernel, !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module


These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools.  In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

   # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
   # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
   # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

   The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
   any undefined variables will default to zero.  It is shown here for
   evaluation purposes.


This gives us the following PTX (excerpt):

.. code-block:: text

    //
    // Generated by LLVM NVPTX Back-End
    //

    .version 3.1
    .target sm_20
    .address_size 64

      // .globl kernel
                                            // @kernel
    .visible .entry kernel(
      .param .u64 kernel_param_0,
      .param .u64 kernel_param_1,
      .param .u64 kernel_param_2
    )
    {
      .reg .pred %p<30>;
      .reg .f32  %f<111>;
      .reg .s32  %r<21>;
      .reg .s64  %rl<8>;

    // %bb.0:                               // %entry
      ld.param.u64   %rl2, [kernel_param_0];
      mov.u32        %r3, %tid.x;
      ld.param.u64   %rl3, [kernel_param_1];
      mul.wide.s32   %rl4, %r3, 4;
      add.s64        %rl5, %rl2, %rl4;
      ld.param.u64   %rl6, [kernel_param_2];
      add.s64        %rl7, %rl3, %rl4;
      add.s64        %rl1, %rl6, %rl4;
      ld.global.f32  %f1, [%rl5];
      ld.global.f32  %f2, [%rl7];
      setp.eq.f32    %p1, %f1, 0f3F800000;
      setp.eq.f32    %p2, %f2, 0f00000000;
      or.pred        %p3, %p1, %p2;
      @%p3 bra       BB0_1;
      bra.uni        BB0_2;
    BB0_1:
      mov.f32        %f110, 0f3F800000;
      st.global.f32  [%rl1], %f110;
      ret;
    BB0_2:                                  // %__nv_isnanf.exit.i
      abs.f32        %f4, %f1;
      setp.gtu.f32   %p4, %f4, 0f7F800000;
      @%p4 bra       BB0_4;
    // %bb.3:                               // %__nv_isnanf.exit5.i
      abs.f32        %f5, %f2;
      setp.le.f32    %p5, %f5, 0f7F800000;
      @%p5 bra       BB0_5;
    BB0_4:                                  // %.critedge1.i
      add.f32        %f110, %f1, %f2;
      st.global.f32  [%rl1], %f110;
      ret;
    BB0_5:                                  // %__nv_isinff.exit.i

      ...

    BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
      mul.f32        %f90, %f107, 0f3FB8AA3B;
      cvt.rzi.f32.f32 %f91, %f90;
      mov.f32        %f92, 0fBF317200;
      fma.rn.f32     %f93, %f91, %f92, %f107;
      mov.f32        %f94, 0fB5BFBE8E;
      fma.rn.f32     %f95, %f91, %f94, %f93;
      mul.f32        %f89, %f95, 0f3FB8AA3B;
      // inline asm
      ex2.approx.ftz.f32 %f88,%f89;
      // inline asm
      add.f32        %f96, %f91, 0f00000000;
      ex2.approx.f32 %f97, %f96;
      mul.f32        %f98, %f88, %f97;
      setp.lt.f32    %p15, %f107, 0fC2D20000;
      selp.f32       %f99, 0f00000000, %f98, %p15;
      setp.gt.f32    %p16, %f107, 0f42D20000;
      selp.f32       %f110, 0f7F800000, %f99, %p16;
      setp.eq.f32    %p17, %f110, 0f7F800000;
      @%p17 bra      BB0_28;
    // %bb.27:
      fma.rn.f32     %f110, %f110, %f108, %f110;
    BB0_28:                                 // %__internal_accurate_powf.exit.i
      setp.lt.f32    %p18, %f1, 0f00000000;
      setp.eq.f32    %p19, %f3, 0f3F800000;
      and.pred       %p20, %p18, %p19;
      @!%p20 bra     BB0_30;
      bra.uni        BB0_29;
    BB0_29:
      mov.b32        %r9, %f110;
      xor.b32        %r10, %r9, -2147483648;
      mov.b32        %f110, %r10;
    BB0_30:                                 // %__nv_powf.exit
      st.global.f32  [%rl1], %f110;
      ret;
    }