llvm-mca - LLVM Machine Code Analyzer
=====================================

SYNOPSIS
--------

:program:`llvm-mca` [*options*] [input]

DESCRIPTION
-----------

:program:`llvm-mca` is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with an out-of-order
backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis and
reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe it
directly into :program:`llvm-mca` for analysis:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

.. code-block:: bash

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS
-------

If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
input. Otherwise, it will read from the specified filename.

If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
to standard output if the input is from standard input. If the :option:`-o`
option specifies "``-``", then the output will also be sent to standard output.


.. option:: -help

 Print a summary of command line options.

.. option:: -mtriple=<target triple>

 Specify a target triple string.

.. option:: -march=<arch>

 Specify the architecture for which to analyze the code. It defaults to the
 host default target.

.. option:: -mcpu=<cpuname>

 Specify the processor for which to analyze the code. By default, the cpu name
 is autodetected from the host.

.. option:: -output-asm-variant=<variant id>

 Specify the output assembly variant for the report generated by the tool.
 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
 format for the code printed in the analysis report; a value of 1 selects the
 Intel format.

.. option:: -dispatch=<width>

 Specify a different dispatch width for the processor. The dispatch width
 defaults to the 'IssueWidth' field in the processor scheduling model. If width
 is zero, then the default dispatch width is used.

.. option:: -register-file-size=<size>

 Specify the size of the register file. When specified, this flag limits how
 many physical registers are available for register renaming purposes. A value
 of zero for this flag means "unlimited number of physical registers".

.. option:: -iterations=<number of iterations>

 Specify the number of iterations to run. If this flag is set to 0, then the
 tool sets the number of iterations to a default value (i.e. 100).

.. option:: -noalias=<bool>

 If set, the tool assumes that loads and stores don't alias. This is the
 default behavior.

.. option:: -lqueue=<load queue size>

 Specify the size of the load queue in the load/store unit emulated by the tool.
 By default, the tool assumes an unbounded number of entries in the load queue.
 A value of zero for this flag is ignored, and the default load queue size is
 used instead.

.. option:: -squeue=<store queue size>

 Specify the size of the store queue in the load/store unit emulated by the
 tool. By default, the tool assumes an unbounded number of entries in the store
 queue. A value of zero for this flag is ignored, and the default store queue
 size is used instead.

.. option:: -timeline

 Enable the timeline view.

.. option:: -timeline-max-iterations=<iterations>

 Limit the number of iterations to print in the timeline view. By default, the
 timeline view prints information for up to 10 iterations.

.. option:: -timeline-max-cycles=<cycles>

 Limit the number of cycles in the timeline view. By default, the number of
 cycles is set to 80.

.. option:: -resource-pressure

 Enable the resource pressure view. This is enabled by default.

.. option:: -register-file-stats

 Enable register file usage statistics.

.. option:: -dispatch-stats

 Enable extra dispatch statistics. This view collects and analyzes instruction
 dispatch events, as well as static/dynamic dispatch stall events. This view
 is disabled by default.

.. option:: -scheduler-stats

 Enable extra scheduler statistics. This view collects and analyzes instruction
 issue events. This view is disabled by default.

.. option:: -retire-stats

 Enable extra retire control unit statistics. This view is disabled by default.

.. option:: -instruction-info

 Enable the instruction info view. This is enabled by default.

.. option:: -all-stats

 Print all hardware statistics. This enables extra statistics related to the
 dispatch logic, the hardware schedulers, the register file(s), and the retire
 control unit. This option is disabled by default.

.. option:: -all-views

 Enable all the views.

.. option:: -instruction-tables

 Print resource pressure information based on the static information
 available from the processor model. This differs from the resource pressure
 view because it doesn't require that the code is simulated. It instead prints
 the theoretical uniform distribution of resource pressure for every
 instruction in the sequence.


EXIT STATUS
-----------

:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
---------------------------------------------

:program:`llvm-mca` allows for the optional usage of special code comments to
mark regions of the assembly code to be analyzed. A comment starting with
substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
starting with substring ``LLVM-MCA-END`` marks the end of a code region. For
example:

.. code-block:: none

  # LLVM-MCA-BEGIN My Code Region
    ...
  # LLVM-MCA-END

Multiple regions can be specified provided that they do not overlap. A code
region can have an optional description. If no user-defined region is specified,
then :program:`llvm-mca` assumes a default region which contains every
instruction in the input file. Every region is analyzed in isolation, and the
final performance report is the union of all the reports generated for every
code region.
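
The region rules above can be illustrated with a small Python sketch. This is
not the parser used by :program:`llvm-mca`; the function name
``split_regions``, the ``comment_char`` parameter, and the choice to close an
unterminated region at the end of the input are all assumptions made for
illustration.

.. code-block:: python

  def split_regions(asm_text, comment_char="#"):
      """Group instruction lines by LLVM-MCA-BEGIN / LLVM-MCA-END comments."""
      regions = []      # finished regions as (description, [instruction lines])
      current = None    # the region that is currently open, if any
      unmarked = []     # instructions seen outside every marked region
      saw_marker = False
      for raw in asm_text.splitlines():
          line = raw.strip()
          if not line:
              continue
          if line.startswith(comment_char):
              comment = line[len(comment_char):].strip()
              if comment.startswith("LLVM-MCA-BEGIN"):
                  if current is not None:
                      raise ValueError("regions must not overlap")
                  saw_marker = True
                  description = comment[len("LLVM-MCA-BEGIN"):].strip()
                  current = (description, [])
              elif comment.startswith("LLVM-MCA-END"):
                  if current is None:
                      raise ValueError("LLVM-MCA-END without a matching BEGIN")
                  regions.append(current)
                  current = None
              continue  # any other comment carries no instruction
          if current is not None:
              current[1].append(line)
          else:
              unmarked.append(line)
      if current is not None:
          regions.append(current)  # assumption of this sketch: close at end of input
      if not saw_marker:
          return [("", unmarked)]  # default region: every instruction in the input
      return regions

Calling ``split_regions`` on text with one marked block returns a single
``(description, lines)`` pair, while text without markers yields one unnamed
region holding every instruction.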

Inline assembly directives may be used from source code to annotate the
assembly text:

.. code-block:: c++

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

HOW LLVM-MCA WORKS
------------------

:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. The report can be produced with the
command below, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles. In the absence of loop-carried data dependencies, the
observed IPC tends to a theoretical maximum which can be computed by dividing
the number of instructions of a single iteration by the *Block RThroughput*.

Field *uOps Per Cycle* is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width and
this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed *uOps Per Cycle* should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the *Block RThroughput*.
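
The two ratios and their theoretical maxima can be written compactly. The
symbols are shorthand introduced here for illustration and are not names used
by the tool: :math:`N_{inst}` and :math:`N_{uop}` are the instructions and uOps
of a single iteration, :math:`n` is the number of iterations, :math:`C` is the
total number of cycles, and :math:`T_{block}` is the Block RThroughput.

.. math::

   \mathrm{IPC} = \frac{n \cdot N_{inst}}{C},
   \qquad
   \mathrm{IPC}_{max} = \frac{N_{inst}}{T_{block}}

   \mathrm{uOpsPerCycle} = \frac{n \cdot N_{uop}}{C},
   \qquad
   \mathrm{uOpsPerCycle}_{max} = \frac{N_{uop}}{T_{block}}

Both maxima are approached only in the absence of loop-carried data
dependencies.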

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and *uOps Per Cycle* are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
*Block RThroughput*) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.
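
The arithmetic behind these figures can be reproduced with a short Python
sketch. The function ``summarize`` and its parameter names are hypothetical
helpers, not part of :program:`llvm-mca`; the input numbers are copied from the
report above.

.. code-block:: python

  def summarize(instructions, uops, cycles, iterations,
                block_rthroughput, dispatch_width):
      """Recompute the summary ratios from the raw counters of a report."""
      uops_per_iteration = uops / iterations
      max_uops_per_cycle = uops_per_iteration / block_rthroughput
      return {
          "ipc": instructions / cycles,
          "uops_per_cycle": uops / cycles,
          "max_uops_per_cycle": max_uops_per_cycle,
          # Gap between the dispatch width and the theoretical maximum.
          "dispatch_delta": dispatch_width - max_uops_per_cycle,
      }

  report = summarize(instructions=900, uops=900, cycles=610, iterations=300,
                     block_rthroughput=2.0, dispatch_width=2)
  for name, value in report.items():
      print(f"{name}: {value:.2f}")

With these inputs, the observed ratios round to 1.48, the theoretical maximum
is 1.50, and the delta from the dispatch width is 0.50.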

The second section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instructions in the sequence. For example, in every
iteration the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.
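
As an illustration of how the per-iteration table can be read, the Python
sketch below picks out the resources with the highest pressure. The helper
``busiest_resources`` is hypothetical, and the dictionary holds only the
non-empty columns of the table above.

.. code-block:: python

  def busiest_resources(pressure, tolerance=1e-9):
      """Return the peak pressure and the resources that reach it."""
      peak = max(pressure.values())
      names = sorted(name for name, cycles in pressure.items()
                     if abs(cycles - peak) <= tolerance)
      return peak, names

  # Resource cycles per iteration, copied from the report above.
  pressure = {"JFPA": 2.00, "JFPM": 1.00, "JFPU0": 2.00, "JFPU1": 1.00}
  peak, names = busiest_resources(pressure)
  print(peak, names)

Here JFPA and JFPU0 share the peak of 2.00 resource cycles per iteration, the
two units used by both horizontal adds.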

Timeline View
^^^^^^^^^^^^^
The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
:program:`llvm-mca` using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4

The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations (``-iterations=3``), the iteration
indices range from 0 to 2 inclusive.

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.
| 430 | Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the |
| 431 | scheduler's queue for the operands to become available. By the time vmulps is |
| 432 | dispatched, operands are already available, and pipeline JFPU1 is ready to |
| 433 | serve another instruction. So the instruction can be immediately issued on the |
| 434 | JFPU1 pipeline. That is demonstrated by the fact that the instruction only |
| 435 | spent 1cy in the scheduler's queue. |
| 436 | |
| 437 | There is a gap of 5 cycles between the write-back stage and the retire event. |
| 438 | That is because instructions must retire in program order, so [1,0] has to wait |
| 439 | for [0,2] to be retired first (i.e., it has to wait until cycle 10). |
| 440 | |
| 441 | In the example, all instructions are in a RAW (Read After Write) dependency |
| 442 | chain. Register %xmm2 written by vmulps is immediately used by the first |
| 443 | vhaddps, and register %xmm3 written by the first vhaddps is used by the second |
| 444 | vhaddps. Long data dependencies negatively impact the ILP (Instruction Level |
| 445 | Parallelism). |
| 446 | |
| 447 | In the dot-product example, there are anti-dependencies introduced by |
| 448 | instructions from different iterations. However, those dependencies can be |
| 449 | removed at register renaming stage (at the cost of allocating register aliases, |
Matt Davis | 8073c65 | 2018-07-31 18:59:46 +0000 | [diff] [blame] | 450 | and therefore consuming physical registers). |
Matt Davis | e05bab2 | 2018-07-19 20:33:59 +0000 | [diff] [blame] | 451 | |
| 452 | Table *Average Wait times* helps diagnose performance issues that are caused by |
| 453 | the presence of long latency instructions and potentially long data dependencies |
Andrea Di Biagio | c49e383 | 2018-07-31 15:29:10 +0000 | [diff] [blame] | 454 | which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at |
| 455 | least 1cy between the dispatch event and the issue event. |
Matt Davis | e05bab2 | 2018-07-19 20:33:59 +0000 | [diff] [blame] | 456 | |
| 457 | When the performance is limited by data dependencies and/or long latency |
| 458 | instructions, the number of cycles spent while in the *ready* state is expected |
| 459 | to be very small when compared with the total number of cycles spent in the |
| 460 | scheduler's queue. The difference between the two counters is a good indicator |
| 461 | of how large of an impact data dependencies had on the execution of the |
| 462 | instructions. When performance is mostly limited by the lack of hardware |
| 463 | resources, the delta between the two counters is small. However, the number of |
| 464 | cycles spent in the queue tends to be larger (i.e., more than 1-3cy), |
| 465 | especially when compared to other low latency instructions. |
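
Columns [1] and [3] of the *Average Wait times* table can be recomputed from
the timeline rows, as the Python sketch below shows. It assumes that the time
in the scheduler's queue equals the number of ``=`` characters plus the one
cycle between dispatch and issue mentioned above, which is consistent with the
sample output but is not taken from the tool's source. Column [2] cannot be
derived from the state characters alone. ``average_waits`` is a hypothetical
helper.

.. code-block:: python

  from collections import defaultdict

  def average_waits(rows):
      """rows: iterable of (instruction index, state string) pairs."""
      totals = defaultdict(lambda: [0, 0, 0])  # executions, queue cy, retire-wait cy
      for index, states in rows:
          entry = totals[index]
          entry[0] += 1
          entry[1] += states.count("=") + 1    # assumed: 1cy from dispatch to issue
          entry[2] += states.count("-")        # executed, waiting to retire
      return {index: (n, queue / n, retire / n)
              for index, (n, queue, retire) in totals.items()}

  # State strings copied from the timeline view above.
  rows = [
      (0, "DeeER.    .    ."), (1, "D==eeeER  .    ."), (2, ".D====eeeER    ."),
      (0, ".DeeE-----R    ."), (1, ". D=eeeE---R   ."), (2, ". D====eeeER   ."),
      (0, ".  DeeE-----R  ."), (1, ".  D====eeeER  ."), (2, ".   D======eeeER"),
  ]
  for index, (count, queue, retire) in sorted(average_waits(rows).items()):
      print(f"{index}. {count} {queue:.1f} {retire:.1f}")

The averages for the three instructions round to 1.0, 3.3 and 5.7 cycles in
the queue, and 3.3, 1.0 and 0.0 cycles between write-back and retire, in
agreement with the table.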

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
for 300 iterations of the dot-product example discussed in the previous
sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N instructions issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a full group because the scheduler's queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to
one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by using either the ``-all-stats`` or the
``-dispatch-stats`` command option.
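
The percentages in the *Dispatch Logic* table follow directly from the cycle
counts, and the weighted mean of the histogram gives the average number of
micro opcodes dispatched per cycle. The Python sketch below recomputes both;
``histogram_summary`` is a hypothetical helper and the counts are copied from
the output above.

.. code-block:: python

  def histogram_summary(cycles_by_count):
      """cycles_by_count maps N (uOps dispatched) to the number of cycles."""
      total_cycles = sum(cycles_by_count.values())
      shares = {n: 100.0 * cycles / total_cycles
                for n, cycles in cycles_by_count.items()}
      mean_per_cycle = sum(n * cycles
                           for n, cycles in cycles_by_count.items()) / total_cycles
      return total_cycles, shares, mean_per_cycle

  total, shares, mean = histogram_summary({0: 24, 1: 272, 2: 314})
  print(total)
  print({n: round(share, 1) for n, share in shares.items()})
  print(round(mean, 2))

The counts add up to the 610 simulated cycles, and the mean of 900/610 (about
1.48) agrees with the *uOps Per Cycle* figure from the summary section.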
| 547 | |
| 548 | The next table, *Schedulers*, presents a histogram displaying a count, |
| 549 | representing the number of instructions issued on some number of cycles. In |
Andrea Di Biagio | 614e612 | 2018-08-03 12:44:56 +0000 | [diff] [blame] | 550 | this case, of the 610 simulated cycles, single instructions were issued 306 |
| 551 | times (50.2%) and there were 7 cycles where no instructions were issued. |

The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies. The scheduler
statistics are displayed by using the command option ``-all-stats`` or
``-scheduler-stats``.

The next table, *Retire Control Unit*, presents a histogram showing, for each
number of instructions retired per cycle, how many cycles retired that many
instructions. In this case, of the 610 simulated cycles, two instructions were
retired during the same cycle 399 times (65.4%) and there were 109 cycles where
no instructions were retired. The retire statistics are displayed by using the
command option ``-all-stats`` or ``-retire-stats``.
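
The percentages quoted for the dispatch, scheduler, and retire tables all follow
from the 610 simulated cycles. The short check below reproduces them; the helper
name ``share`` is only illustrative and is not part of :program:`llvm-mca`.

```python
# Cycle counts quoted in the text for the dot-product example.
TOTAL_CYCLES = 610


def share(cycles: int, total: int = TOTAL_CYCLES) -> float:
    """Return the percentage of simulated cycles, rounded to one decimal."""
    return round(100.0 * cycles / total, 1)


one_uop_dispatch = share(272)  # cycles where the dispatch group held one micro opcode
single_issue = share(306)      # cycles where exactly one instruction was issued
dual_retire = share(399)       # cycles where two instructions were retired
no_retire = share(109)         # cycles where nothing was retired

print(one_uop_dispatch, single_issue, dual_retire, no_retire)
# prints: 44.6 50.2 65.4 17.9
```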

The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF). The table shows that of the 900
instructions processed, there were 900 mappings created. Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings. However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example, and
that it was never resource constrained. The register file statistics are
displayed by using the command option ``-all-stats`` or
``-register-file-stats``.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

Instruction Flow
^^^^^^^^^^^^^^^^
This section describes the instruction flow through the default pipeline of
:program:`llvm-mca`, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions.

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The default pipeline only models the out-of-order portion of a processor.
Therefore, the instruction fetch and decode stages are not modeled. Performance
bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that
instructions have all been decoded and placed into a queue before the simulation
starts. Also, :program:`llvm-mca` does not model branch prediction.

Instruction Dispatch
""""""""""""""""""""
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources. The processor dispatch width defaults to the value of
``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.
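
The four conditions above can be read as a single predicate. The following is a
minimal sketch and not the tool's actual implementation: the ``DispatchState``
class, its field names, and the ``can_dispatch`` helper are all hypothetical,
and the sketch assumes one reorder buffer entry per micro opcode and a single
pool of physical registers.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    """Hypothetical snapshot of the resources checked at dispatch time."""
    dispatch_width: int          # e.g. IssueWidth from the scheduling model
    group_size: int              # micro opcodes already in this cycle's group
    free_rob_entries: int        # unused reorder buffer entries
    free_phys_regs: int          # physical registers left for renaming
    free_scheduler_entries: int  # unused scheduler buffer entries


def can_dispatch(state: DispatchState, uops: int, reg_writes: int) -> bool:
    """Return True if an instruction with `uops` micro opcodes and
    `reg_writes` register definitions passes all four dispatch checks."""
    return (
        state.group_size < state.dispatch_width  # group is not yet full
        and state.free_rob_entries >= uops       # enough reorder buffer entries
        and state.free_phys_regs >= reg_writes   # enough registers to rename
        and state.free_scheduler_entries > 0     # schedulers are not full
    )


# Illustrative values only: a 2-wide machine whose scheduler queue is full.
stalled = DispatchState(dispatch_width=2, group_size=0, free_rob_entries=64,
                        free_phys_regs=72, free_scheduler_entries=0)
print(can_dispatch(stalled, uops=1, reg_writes=1))  # prints: False
```

A state like ``stalled`` corresponds to the SCHEDQ stall counter discussed
earlier: every other resource is available, but the scheduler queue is full.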

Scheduling models can optionally specify which register files are available on
the processor. :program:`llvm-mca` uses that information to initialize register
file descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*. By
knowing how many registers are available for renaming, the tool can predict
dispatch stalls caused by the lack of physical registers.

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model. The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order. The
number of entries in the reorder buffer defaults to the value specified by field
``MicroOpBufferSize`` in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.

Instruction Issue
"""""""""""""""""
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution, and it may then be issued (potentially out-of-order).
Instruction latencies are computed by :program:`llvm-mca` with the help of the
scheduling model.

:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.
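
To illustrate the round-robin selection described above, here is a small sketch
of a resource group that hands out its units in rotation. It is not the
resource manager's real code: the class ``RoundRobinGroup`` and the unit names
``JFPU0`` and ``JFPU1`` are assumptions made for the example.

```python
class RoundRobinGroup:
    """Toy resource group that picks available units in round-robin order."""

    def __init__(self, units):
        self.units = list(units)
        self.busy_until = {unit: 0 for unit in self.units}  # cycle each unit frees up
        self.next_index = 0  # where the next search starts

    def select(self, now, cycles=1):
        """Reserve one available unit for `cycles` cycles starting at `now`.

        Returns the unit name, or None if every unit in the group is busy.
        """
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                # Start the next search after this unit to spread the load.
                self.next_index = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["JFPU0", "JFPU1"])  # hypothetical unit names
print(group.select(now=0))  # prints: JFPU0
print(group.select(now=0))  # prints: JFPU1
print(group.select(now=0))  # prints: None (both units busy during cycle 0)
print(group.select(now=1))  # prints: JFPU0
```

Starting each search after the previously selected unit is what keeps usage
evenly distributed when more than one unit is free.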

:program:`llvm-mca`'s scheduler internally groups instructions into three sets:

* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to the
scheduler are placed into either the WaitSet or the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from the WaitSet
to the ReadySet, and if instructions from the ReadySet can be issued to the
underlying pipelines. The algorithm prioritizes older instructions over younger
instructions.
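
The per-cycle movement between the three sets can be sketched as follows. This
is a simplified model under stated assumptions, not the tool's scheduler: the
names ``Instr`` and ``ToyScheduler`` are hypothetical, resource pressure is
reduced to a single ``issue_width`` limit, and an instruction's result becomes
visible to its consumers once its latency has elapsed.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    index: int                              # program order; lower means older
    deps: set = field(default_factory=set)  # indices of producing instructions
    latency: int = 1                        # cycles spent in the IssuedSet
    cycles_left: int = 0


class ToyScheduler:
    def __init__(self, issue_width=1):
        self.issue_width = issue_width
        self.wait_set, self.ready_set, self.issued_set = [], [], []
        self.done = set()  # indices of instructions that finished executing

    def dispatch(self, instr):
        """Place a dispatched instruction in the ReadySet or the WaitSet."""
        ready = instr.deps <= self.done
        (self.ready_set if ready else self.wait_set).append(instr)

    def cycle(self):
        """Advance one cycle and return the indices issued in that cycle."""
        # 1. Advance executing instructions; finished ones leave the IssuedSet.
        still_executing = []
        for instr in self.issued_set:
            instr.cycles_left -= 1
            if instr.cycles_left == 0:
                self.done.add(instr.index)
            else:
                still_executing.append(instr)
        self.issued_set = still_executing
        # 2. Move instructions whose operands are now ready out of the WaitSet.
        self.ready_set += [i for i in self.wait_set if i.deps <= self.done]
        self.wait_set = [i for i in self.wait_set if not i.deps <= self.done]
        # 3. Issue from the ReadySet, oldest instructions first.
        self.ready_set.sort(key=lambda i: i.index)
        issued = self.ready_set[:self.issue_width]
        self.ready_set = self.ready_set[self.issue_width:]
        for instr in issued:
            instr.cycles_left = instr.latency
        self.issued_set += issued
        return [i.index for i in issued]


sched = ToyScheduler(issue_width=1)
sched.dispatch(Instr(0, latency=2))
sched.dispatch(Instr(1, deps={0}))  # waits for instruction 0
sched.dispatch(Instr(2))            # independent of the others
print([sched.cycle() for _ in range(3)])  # prints: [[0], [2], [1]]
```

In the usage example, instruction 2 is issued ahead of the older instruction 1
because instruction 1 is still in the WaitSet; among ready instructions, the
oldest is always chosen first.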

Write-Back and Retire Stage
"""""""""""""""""""""""""""
Issued instructions are moved from the ReadySet to the IssuedSet. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.

When an instruction is executed, the retire control unit flags it as
"ready to retire."

Instructions are retired in program order. The register file is notified of the
retirement so that it can free the physical registers that were allocated for
the instruction during the register renaming stage.
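
In-order retirement with register release can be sketched with a simple queue.
The sketch below is illustrative only: ``RetireControlUnit`` here is a
hypothetical Python class, and the ``max_per_cycle`` limit is an assumption
made for the example rather than a documented parameter.

```python
from collections import deque


class RetireControlUnit:
    """Toy reorder buffer that retires executed instructions in program order."""

    def __init__(self, free_registers):
        self.rob = deque()  # entries are kept in program order
        self.free_registers = free_registers

    def dispatch(self, index, regs_allocated):
        """Reserve a reorder buffer entry and the renamed physical registers."""
        self.free_registers -= regs_allocated
        self.rob.append({"index": index, "regs": regs_allocated, "ready": False})

    def on_executed(self, index):
        """Flag an instruction as ready to retire once it has executed."""
        for entry in self.rob:
            if entry["index"] == index:
                entry["ready"] = True

    def retire(self, max_per_cycle=2):
        """Retire ready instructions from the head of the buffer only."""
        retired = []
        while self.rob and self.rob[0]["ready"] and len(retired) < max_per_cycle:
            entry = self.rob.popleft()
            self.free_registers += entry["regs"]  # registers return to the file
            retired.append(entry["index"])
        return retired


rcu = RetireControlUnit(free_registers=72)
rcu.dispatch(0, regs_allocated=1)
rcu.dispatch(1, regs_allocated=1)
rcu.on_executed(1)
print(rcu.retire())  # prints: [] (instruction 0 is older and has not executed)
rcu.on_executed(0)
print(rcu.retire())  # prints: [0, 1]
```

Because only the head of the buffer can retire, a long-latency instruction
delays the retirement of younger instructions that have already executed.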

Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
To simulate out-of-order execution of memory operations, :program:`llvm-mca`
uses a load/store unit (LSUnit) that models the speculative execution of loads
and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does
   not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias with
store operations (``-noalias=true``). Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option). Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through, etc.) and consequently
to weaken, or strengthen, the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

:program:`llvm-mca` does not know about serializing operations or
memory-barrier-like instructions. The LSUnit conservatively assumes that an
instruction which has both "MayLoad" and unmodeled side effects behaves like a
"soft" load-barrier. That is, it serializes loads without forcing a flush of
the load queue. Similarly, instructions that "MayStore" and have unmodeled side
effects are treated like store barriers. A full memory barrier is a "MayLoad"
and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
it is the best that we can do at the moment with the current information
available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.
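
The six rules above can be expressed as one ordering check over a program-order
queue of memory operations. This is a sketch under simplifying assumptions, not
the LSUnit's implementation: ``MemOp`` and ``may_issue`` are hypothetical
names, loads and stores share a single list, and full memory barriers (which
act as both a load and a store barrier) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    kind: str                 # "load" or "store"
    executed: bool = False
    is_barrier: bool = False  # load barrier or store barrier, depending on kind


def may_issue(queue, position, noalias=True):
    """Return True if the operation at `position` may execute now.

    `queue` lists memory operations in program order (oldest first).
    """
    op = queue[position]
    older_pending = [o for o in queue[:position] if not o.executed]
    if op.kind == "store":
        # Rules 1-3: a store waits for every older load, store, and store
        # barrier, regardless of the noalias setting.
        return not older_pending
    for older in older_pending:
        if older.kind == "load" and older.is_barrier:
            return False  # rule 6: wait for an older load barrier
        if older.kind == "store" and not noalias:
            return False  # rule 5: loads may pass stores only under noalias
    return True  # rule 4: a load may pass older loads


ops = [MemOp("store"), MemOp("load"), MemOp("store")]
print(may_issue(ops, 1))                 # prints: True
print(may_issue(ops, 1, noalias=False))  # prints: False
print(may_issue(ops, 2))                 # prints: False
```

With the default ``noalias=True``, the load in the usage example passes the
older store, while the younger store must wait for both older operations.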