=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported since LLVM 3.9. The current release of clang (7.0.0)
supports CUDA 7.0 through 9.2. If you need support for CUDA 10, you will need
to use clang built from r342924 or newer.

Before you build CUDA code, you'll need to have installed the appropriate
driver for your NVIDIA GPU and the CUDA SDK. See `NVIDIA's CUDA installation
guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_
for details. Note that clang `does not support
<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
by many Linux package managers; you probably need to install CUDA in a single
directory from NVIDIA's package.

CUDA compilation is supported on Linux. Compilation on MacOS and Windows may
or may not work and currently has no maintainers. Compilation with CUDA-9.x is
`currently broken on Windows <https://bugs.llvm.org/show_bug.cgi?id=38811>`_.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0, the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple
  archs.

The ``-L`` and ``-l`` flags only need to be passed when linking. When
compiling, you may also need to pass ``--cuda-path=/path/to/cuda`` if you
didn't install the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.
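
For example, a minimal sketch of separate compile and link steps (the install
path and arch values here are assumptions; substitute your own):

.. code-block:: console

  $ clang++ -c axpy.cu -o axpy.o --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_60 \
      --cuda-path=/usr/local/cuda-9.2
  $ clang++ axpy.o -o axpy -L/usr/local/cuda-9.2/lib64 \
      -lcudart_static -ldl -lrt -pthread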

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code; see the sketch after this list.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
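
As promised above, a minimal sketch (a toy example of ours, not from any SDK)
of how contraction can affect results: if ``b`` was computed as ``a * a`` with
ordinary rounding, an fma evaluates ``a * a - b`` with a single rounding and
can return the exact rounding error of ``a * a`` instead of 0.

.. code-block:: c++

  __device__ double residual(double a, double b) {
    // Under -ffp-contract=fast this expression may compile to a single fma,
    // whose unrounded intermediate product makes the result differ from a
    // separately rounded (a * a) - b.
    return a * a - b;
  }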
135
Justin Lebar4574c112016-09-15 02:04:32 +0000136Standard library support
137========================
138
139In clang and nvcc, most of the C++ standard library is not supported on the
140device side.
141
Justin Lebarc53e3842016-09-16 04:14:02 +0000142``<math.h>`` and ``<cmath>``
143----------------------------
Justin Lebar4574c112016-09-15 02:04:32 +0000144
145In clang, ``math.h`` and ``cmath`` are available and `pass
146<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/math_h.cu>`_
147`tests
148<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/cmath.cu>`_
149adapted from libc++'s test suite.
150
151In nvcc ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
152in namespace std (e.g. ``std::sinf``) are not available, and where the standard
153calls for overloads that take integral arguments, these are usually not
154available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``<std::complex>``
------------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.
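
As a minimal sketch (our example) of what clang accepts in device code, but
which would be an error under nvcc:

.. code-block:: c++

  #include <complex>

  __device__ double norm_squared(std::complex<double> z) {
    // real() and imag() are constexpr in C++14, so clang can call them
    // from __device__ code given a supported standard library.
    return z.real() * z.real() + z.imag() * z.imag();
  }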

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr. You can therefore use these in device code,
when compiling with clang.
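
A minimal sketch of what this enables (assuming you compile with
``-std=c++14`` or newer):

.. code-block:: c++

  #include <algorithm>

  __device__ int clamp_to_byte(int x) {
    // std::min and std::max are constexpr in C++14, so clang allows
    // calling them from device code.
    return std::max(0, std::min(x, 255));
  }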

Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.
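
For example (a sketch following the same pattern as above):

.. code-block:: c++

  #if defined(__NVCC__)
  // NVCC is performing this compilation (possibly using clang as a subtool).
  #endif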

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses *split compilation*,
which works roughly as follows:

* Run a preprocessor over the input ``.cu`` file to split it into two source
  files: ``H``, containing source code for the host, and ``D``, containing
  source code for the device.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
    ``P_arch``.

  * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
    ``S_arch``, containing GPU machine code (SASS) for ``arch``.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single "fat binary" file, ``F``.

* Compile ``H`` using an external host compiler (gcc, clang, or whatever you
  like). ``F`` is packaged up into a header file which is force-included into
  ``H``; nvcc generates code that calls into this header to e.g. launch
  kernels.

clang uses *merged parsing*. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the host at this time.

    The output of this step is a ``ptx`` file ``P_arch``.

  * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
    nvcc, clang always generates SASS code.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single fat binary file, ``F``.

* Compile the input ``.cu`` file for host, using clang. ``__device__`` code is
  parsed and must be semantically correct, even though we're not generating
  code for the device at this time.

  ``F`` is passed to this compilation, and clang includes it in a special ELF
  section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)
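
A small sketch of why a single parse can't work: this function means something
different in each device compilation, because ``__CUDA_ARCH__`` takes a
different value in each pass (and is undefined during host compilation).

.. code-block:: c++

  __device__ int arch_specific() {
  #if __CUDA_ARCH__ >= 350
    return 35;  // seen only when compiling for sm_35 and newer
  #else
    return 30;
  #endif
  }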

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

* D functions prefer to call other Ds. HDs are given lower priority.

* Similarly, H functions prefer to call other Hs, or ``__global__`` functions
  (with equal priority). HDs are given lower priority.

* HD functions prefer to call other HDs.

  When compiling for device, HDs will call Ds with lower priority than HD, and
  will call Hs with still lower priority. If it's forced to call an H, the
  program is malformed if we emit code for this HD function. We call this the
  "wrong-side rule"; see the example below.

  The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo();  // calls H overload
    bar();  // calls H overload
  }

  __device__ void test_device() {
    foo();  // calls D overload
    bar();  // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo();  // calls H overload when compiling for host, otherwise D overload
    bar();  // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // does not generate an error. inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device. Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: They aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).
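
A brief sketch (our example) of the same point with a template: like
``inline_hd2`` above, it's an error only if the template is actually
instantiated from the device side.

.. code-block:: c++

  template <typename T>
  __host__ __device__ T pass_through(T t) {
    host_only();  // an error only when instantiated for the device
    return t;
  }

  __host__ void host_caller() { pass_through(42); }      // no error
  __device__ void device_caller() { pass_through(42); }  // error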

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device. What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes. However, destructors cannot be overloaded.
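
For example, this sketch (ours) is legal in clang, which treats the two
identical-signature constructors as distinct overloads:

.. code-block:: c++

  struct Buffer {
    __host__ Buffer() { /* host-side setup */ }
    __device__ Buffer() { /* device-side setup */ }
    // A __host__ ~Buffer() / __device__ ~Buffer() pair would be an error:
    // destructors cannot be overloaded on these attributes.
  };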

Using a Different Class on Host/Device
--------------------------------------

Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };
  #endif

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo(). The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }

We posit that you don't really want to have classes with different members on H
and D. For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes. If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes. Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
  #ifndef __CUDA_ARCH__
  __host__ HostS MakeStruct() { return HostS(); }
  #else
  __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif
  #else
  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
LLVM to make it generate good GPU code. Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything. Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32-bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``; a brief sketch follows this list.)
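
A brief sketch (ours) of forcing both by hand:

.. code-block:: c++

  __device__ __attribute__((always_inline)) float scale(float x, float a) {
    return a * x;
  }

  __device__ void scale8(float *v, float a) {
  #pragma unroll
    for (int i = 0; i < 8; ++i)  // trip count is known, so this fully unrolls
      v[i] = scale(v[i], a);
  }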

Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
The relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.