Keith Wyss | 630a4ac | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 1 | ====================================== |
| 2 | XRay Flight Data Recorder Trace Format |
| 3 | ====================================== |
| 4 | |
| 5 | :Version: 1 as of 2017-07-20 |
| 6 | |
| 7 | .. contents:: |
| 8 | :local: |
| 9 | |
| 10 | |
| 11 | Introduction |
| 12 | ============ |
| 13 | |
| 14 | When gathering XRay traces in Flight Data Recorder mode, each thread of an |
| 15 | application will claim buffers to fill with trace data, which at some point |
| 16 | is finalized and flushed. |
| 17 | |
Eric Christopher | 4362e9e | 2018-05-27 09:19:03 +0000 | [diff] [blame] | 18 | A goal of the profiler is to minimize overhead, the flushed data directly |
Keith Wyss | 630a4ac | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 19 | corresponds to the buffer. |
| 20 | |
| 21 | This document describes the format of a trace file. |
| 22 | |
| 23 | |
| 24 | General |
| 25 | ======= |
| 26 | |
| 27 | Each trace file corresponds to a sequence of events in a particular thread. |
| 28 | |
| 29 | The file has a header followed by a sequence of discriminated record types. |
| 30 | |
Sylvestre Ledru | 0fa72c3 | 2017-08-12 15:08:11 +0000 | [diff] [blame] | 31 | The endianness of byte fields matches the endianess of the platform which |
Keith Wyss | 630a4ac | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 32 | produced the trace file. |
| 33 | |
| 34 | |
| 35 | Header Section |
| 36 | ============== |
| 37 | |
| 38 | A trace file begins with a 32 byte header. |
| 39 | |
| 40 | +-------------------+-----------------+----------------------------------------+ |
| 41 | | Field | Size (bytes) | Description | |
| 42 | +===================+=================+========================================+ |
| 43 | | version | ``2`` | Anticipates versioned readers. This | |
| 44 | | | | document describes the format when | |
| 45 | | | | version == 1 | |
| 46 | +-------------------+-----------------+----------------------------------------+ |
| 47 | | type | ``2`` | An enumeration encoding the type of | |
| 48 | | | | trace. Flight Data Recorder mode | |
| 49 | | | | traces have type == 1 | |
| 50 | +-------------------+-----------------+----------------------------------------+ |
| 51 | | bitfield | ``4`` | Holds parameters that are not aligned | |
| 52 | | | | to bytes. Further described below. | |
| 53 | +-------------------+-----------------+----------------------------------------+ |
| 54 | | cycle_frequency | ``8`` | The frequency in hertz of the CPU | |
| 55 | | | | oscillator used to measure duration of | |
| 56 | | | | events in ticks. | |
| 57 | +-------------------+-----------------+----------------------------------------+ |
| 58 | | buffer_size | ``8`` | The size in bytes of the data portion | |
| 59 | | | | of the trace following the header. | |
| 60 | +-------------------+-----------------+----------------------------------------+ |
| 61 | | reserved | ``8`` | Reserved for future use. | |
| 62 | +-------------------+-----------------+----------------------------------------+ |
| 63 | |
| 64 | The bitfield parameter of the file header is composed of the following fields. |
| 65 | |
| 66 | +-------------------+----------------+-----------------------------------------+ |
| 67 | | Field | Size (bits) | Description | |
| 68 | +===================+================+=========================================+ |
| 69 | | constant_tsc | ``1`` | Whether the platform's timestamp | |
| 70 | | | | counter used to record ticks between | |
| 71 | | | | events ticks at a constant frequency | |
| 72 | | | | despite CPU frequency changes. | |
| 73 | | | | 0 == non-constant. 1 == constant. | |
| 74 | +-------------------+----------------+-----------------------------------------+ |
| 75 | | nonstop_tsc | ``1`` | Whether the tsc continues to count | |
| 76 | | | | despite whether the CPU is in a low | |
| 77 | | | | power state. 0 == stop. 1 == non-stop. | |
| 78 | +-------------------+----------------+-----------------------------------------+ |
| 79 | | reserved | ``30`` | Not meaningful. | |
| 80 | +-------------------+----------------+-----------------------------------------+ |
| 81 | |
| 82 | |
| 83 | Data Section |
| 84 | ============ |
| 85 | |
| 86 | Following the header in a trace is a data section with size matching the |
| 87 | buffer_size field in the header. |
| 88 | |
| 89 | The data section is a stream of elements of different types. |
| 90 | |
| 91 | There are a few categories of data in the sequence. |
| 92 | |
| 93 | - ``Function Records``: Function Records contain the timing of entry into and |
| 94 | exit from function execution. Function Records have 8 bytes each. |
| 95 | |
| 96 | - ``Metadata Records``: Metadata records serve many purposes. Mostly, they |
| 97 | capture information that may be too costly to record for each function, but |
| 98 | that is required to contextualize the fine-grained timings. They also are used |
| 99 | as markers for user-defined Event Data payloads. Metadata records have 16 |
| 100 | bytes each. |
| 101 | |
| 102 | - ``Event Data``: Free form data may be associated with events that are traced |
| 103 | by the binary and encode data defined by a handler function. Event data is |
| 104 | always preceded with a marker record which indicates how large it is. |
| 105 | |
| 106 | - ``Function Arguments``: The arguments to some functions are included in the |
| 107 | trace. These are either pointer addresses or primitives that are read and |
| 108 | logged independently of their types in a high level language. To the tracer, |
Eric Christopher | 4362e9e | 2018-05-27 09:19:03 +0000 | [diff] [blame] | 109 | they are all numbers. Function Records that have attached arguments will |
| 110 | indicate their presence on the function entry record. We only support logging |
| 111 | contiguous function argument sequences starting with argument zero, which will |
| 112 | be the "this" pointer for member function invocations. For example, we don't |
| 113 | support logging the first and third argument. |
Keith Wyss | 630a4ac | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 114 | |
| 115 | A reader of the memory format must maintain a state machine. The format makes no |
| 116 | attempt to pad for alignment, and it is not seekable. |
| 117 | |
| 118 | |
| 119 | Function Records |
| 120 | ---------------- |
| 121 | |
| 122 | Function Records have an 8 byte layout. This layout encodes information to |
| 123 | reconstruct a call stack of instrumented function and their durations. |
| 124 | |
| 125 | +---------------+--------------+-----------------------------------------------+ |
| 126 | | Field | Size (bits) | Description | |
| 127 | +===============+==============+===============================================+ |
| 128 | | discriminant | ``1`` | Indicates whether a reader should read a | |
| 129 | | | | Function or Metadata record. Set to ``0`` for | |
| 130 | | | | Function records. | |
| 131 | +---------------+--------------+-----------------------------------------------+ |
| 132 | | action | ``3`` | Specifies whether the function is being | |
| 133 | | | | entered, exited, or is a non-standard entry | |
| 134 | | | | or exit produced by optimizations. | |
| 135 | +---------------+--------------+-----------------------------------------------+ |
| 136 | | function_id | ``28`` | A numeric ID for the function. Resolved to a | |
| 137 | | | | name via the xray instrumentation map. The | |
| 138 | | | | instrumentation map is built by xray at | |
| 139 | | | | compile time into an object file and pairs | |
| 140 | | | | the function ids to addresses. It is used for | |
| 141 | | | | patching and as a lookup into the binary's | |
| 142 | | | | symbols to obtain names. | |
| 143 | +---------------+--------------+-----------------------------------------------+ |
| 144 | | tsc_delta | ``32`` | The number of ticks of the timestamp counter | |
| 145 | | | | since a previous record recorded a delta or | |
| 146 | | | | other TSC resetting event. | |
| 147 | +---------------+--------------+-----------------------------------------------+ |
| 148 | |
| 149 | On little-endian machines, the bitfields are ordered from least significant bit |
| 150 | bit to most significant bit. A reader can read an 8 bit value and apply the mask |
| 151 | ``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned |
| 152 | shift right by ``0x04`` to obtain the function_id field. |
| 153 | |
| 154 | On big-endian machine, the bitfields are written in order from most significant |
| 155 | bit to least significant bit. A reader would read an 8 bit value and unsigned |
| 156 | shift right by 7 bits for the discriminant. The function_id field could be |
| 157 | obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``. |
| 158 | |
| 159 | Function action types are as follows. |
| 160 | |
| 161 | +---------------+--------------+-----------------------------------------------+ |
| 162 | | Type | Number | Description | |
| 163 | +===============+==============+===============================================+ |
| 164 | | Entry | ``0`` | Typical function entry. | |
| 165 | +---------------+--------------+-----------------------------------------------+ |
| 166 | | Exit | ``1`` | Typical function exit. | |
| 167 | +---------------+--------------+-----------------------------------------------+ |
Dean Michael Berris | d6aeff8 | 2017-09-18 06:08:46 +0000 | [diff] [blame] | 168 | | Tail_Exit | ``2`` | An exit from a function due to tail call | |
Keith Wyss | 630a4ac | 2017-08-02 21:47:27 +0000 | [diff] [blame] | 169 | | | | optimization. | |
| 170 | +---------------+--------------+-----------------------------------------------+ |
| 171 | | Entry_Args | ``3`` | A function entry that records arguments. | |
| 172 | +---------------+--------------+-----------------------------------------------+ |
| 173 | |
| 174 | Entry_Args records do not contain the arguments themselves. Instead, metadata |
| 175 | records for each of the logged args follow the function record in the stream. |
| 176 | |
| 177 | |
| 178 | Metadata Records |
| 179 | ---------------- |
| 180 | |
| 181 | Interspersed throughout the buffer are 16 byte Metadata records. For typically |
| 182 | instrumented binaries, they will be sparser than Function records, and they |
| 183 | provide a fuller picture of the binary execution state. |
| 184 | |
| 185 | Metadata record layout is partially record dependent, but they share a common |
| 186 | structure. |
| 187 | |
| 188 | The same bit field rules described for function records apply to the first byte |
| 189 | of MetadataRecords. Within this byte, little endian machines use lsb to msb |
| 190 | ordering and big endian machines use msb to lsb ordering. |
| 191 | |
| 192 | +---------------+--------------+-----------------------------------------------+ |
| 193 | | Field | Size | Description | |
| 194 | +===============+==============+===============================================+ |
| 195 | | discriminant | ``1 bit`` | Indicates whether a reader should read a | |
| 196 | | | | Function or Metadata record. Set to ``1`` for | |
| 197 | | | | Metadata records. | |
| 198 | +---------------+--------------+-----------------------------------------------+ |
| 199 | | record_kind | ``7 bits`` | The type of Metadata record. | |
| 200 | +---------------+--------------+-----------------------------------------------+ |
| 201 | | data | ``15 bytes`` | A data field used differently for each record | |
| 202 | | | | type. | |
| 203 | +---------------+--------------+-----------------------------------------------+ |
| 204 | |
| 205 | Here is a table of the enumerated record kinds. |
| 206 | |
| 207 | +--------+---------------------------+ |
| 208 | | Number | Type | |
| 209 | +========+===========================+ |
| 210 | | 0 | NewBuffer | |
| 211 | +--------+---------------------------+ |
| 212 | | 1 | EndOfBuffer | |
| 213 | +--------+---------------------------+ |
| 214 | | 2 | NewCPUId | |
| 215 | +--------+---------------------------+ |
| 216 | | 3 | TSCWrap | |
| 217 | +--------+---------------------------+ |
| 218 | | 4 | WallTimeMarker | |
| 219 | +--------+---------------------------+ |
| 220 | | 5 | CustomEventMarker | |
| 221 | +--------+---------------------------+ |
| 222 | | 6 | CallArgument | |
| 223 | +--------+---------------------------+ |
| 224 | |
| 225 | |
| 226 | NewBuffer Records |
| 227 | ----------------- |
| 228 | |
| 229 | Each buffer begins with a NewBuffer record immediately after the header. |
| 230 | It records the thread ID of the thread that the trace belongs to. |
| 231 | |
| 232 | Its data segment is as follows. |
| 233 | |
| 234 | +---------------+--------------+-----------------------------------------------+ |
| 235 | | Field | Size (bytes) | Description | |
| 236 | +===============+==============+===============================================+ |
| 237 | | thread_Id | ``2`` | Thread ID for buffer. | |
| 238 | +---------------+--------------+-----------------------------------------------+ |
| 239 | | reserved | ``13`` | Unused. | |
| 240 | +---------------+--------------+-----------------------------------------------+ |
| 241 | |
| 242 | |
| 243 | WallClockTime Records |
| 244 | --------------------- |
| 245 | |
| 246 | Following the NewBuffer record, each buffer records an absolute time as a frame |
| 247 | of reference for the durations recorded by timestamp counter deltas. |
| 248 | |
| 249 | Its data segment is as follows. |
| 250 | |
| 251 | +---------------+--------------+-----------------------------------------------+ |
| 252 | | Field | Size (bytes) | Description | |
| 253 | +===============+==============+===============================================+ |
| 254 | | seconds | ``8`` | Seconds on absolute timescale. The starting | |
| 255 | | | | point is unspecified and depends on the | |
| 256 | | | | implementation and platform configured by the | |
| 257 | | | | tracer. | |
| 258 | +---------------+--------------+-----------------------------------------------+ |
| 259 | | microseconds | ``4`` | The microsecond component of the time. | |
| 260 | +---------------+--------------+-----------------------------------------------+ |
| 261 | | reserved | ``3`` | Unused. | |
| 262 | +---------------+--------------+-----------------------------------------------+ |
| 263 | |
| 264 | |
| 265 | NewCpuId Records |
| 266 | ---------------- |
| 267 | |
| 268 | Each function entry invokes a routine to determine what CPU is executing. |
| 269 | Typically, this is done with readtscp, which reads the timestamp counter at the |
| 270 | same time. |
| 271 | |
| 272 | If the tracing detects that the execution has switched CPUs or if this is the |
| 273 | first instrumented entry point, the tracer will output a NewCpuId record. |
| 274 | |
| 275 | Its data segment is as follows. |
| 276 | |
| 277 | +---------------+--------------+-----------------------------------------------+ |
| 278 | | Field | Size (bytes) | Description | |
| 279 | +===============+==============+===============================================+ |
| 280 | | cpu_id | ``2`` | CPU Id. | |
| 281 | +---------------+--------------+-----------------------------------------------+ |
| 282 | | absolute_tsc | ``8`` | The absolute value of the timestamp counter. | |
| 283 | +---------------+--------------+-----------------------------------------------+ |
| 284 | | reserved | ``5`` | Unused. | |
| 285 | +---------------+--------------+-----------------------------------------------+ |
| 286 | |
| 287 | |
| 288 | TSCWrap Records |
| 289 | --------------- |
| 290 | |
| 291 | Since each function record uses a 32 bit value to represent the number of ticks |
| 292 | of the timestamp counter since the last reference, it is possible for this value |
| 293 | to overflow, particularly for sparsely instrumented binaries. |
| 294 | |
| 295 | When this delta would not fit into a 32 bit representation, a reference absolute |
| 296 | timestamp counter record is written in the form of a TSCWrap record. |
| 297 | |
| 298 | Its data segment is as follows. |
| 299 | |
| 300 | +---------------+--------------+-----------------------------------------------+ |
| 301 | | Field | Size (bytes) | Description | |
| 302 | +===============+==============+===============================================+ |
| 303 | | absolute_tsc | ``8`` | Timestamp counter value. | |
| 304 | +---------------+--------------+-----------------------------------------------+ |
| 305 | | reserved | ``7`` | Unused. | |
| 306 | +---------------+--------------+-----------------------------------------------+ |
| 307 | |
| 308 | |
| 309 | CallArgument Records |
| 310 | -------------------- |
| 311 | |
| 312 | Immediately following an Entry_Args type function record, there may be one or |
| 313 | more CallArgument records that contain the traced function's parameter values. |
| 314 | |
| 315 | The order of the CallArgument Record sequency corresponds one to one with the |
| 316 | order of the function parameters. |
| 317 | |
| 318 | CallArgument data segment: |
| 319 | |
| 320 | +---------------+--------------+-----------------------------------------------+ |
| 321 | | Field | Size (bytes) | Description | |
| 322 | +===============+==============+===============================================+ |
| 323 | | argument | ``8`` | Numeric argument (may be pointer address). | |
| 324 | +---------------+--------------+-----------------------------------------------+ |
| 325 | | reserved | ``7`` | Unused. | |
| 326 | +---------------+--------------+-----------------------------------------------+ |
| 327 | |
| 328 | |
| 329 | CustomEventMarker Records |
| 330 | ------------------------- |
| 331 | |
| 332 | XRay provides the feature of logging custom events. This may be leveraged to |
| 333 | record tracing info for RPCs or similarly trace data that is application |
| 334 | specific. |
| 335 | |
| 336 | Custom Events themselves are an unstructured (application defined) segment of |
| 337 | memory with arbitrary size within the buffer. They are preceded by |
| 338 | CustomEventMarkers to indicate their presence and size. |
| 339 | |
| 340 | CustomEventMarker data segment: |
| 341 | |
| 342 | +---------------+--------------+-----------------------------------------------+ |
| 343 | | Field | Size (bytes) | Description | |
| 344 | +===============+==============+===============================================+ |
| 345 | | event_size | ``4`` | Size of preceded event. | |
| 346 | +---------------+--------------+-----------------------------------------------+ |
| 347 | | absolute_tsc | ``8`` | A timestamp counter of the event. | |
| 348 | +---------------+--------------+-----------------------------------------------+ |
| 349 | | reserved | ``3`` | Unused. | |
| 350 | +---------------+--------------+-----------------------------------------------+ |
| 351 | |
| 352 | |
| 353 | EndOfBuffer Records |
| 354 | ------------------- |
| 355 | |
| 356 | An EndOfBuffer record type indicates that there is no more trace data in this |
| 357 | buffer. The reader is expected to seek past the remaining buffer_size expressed |
| 358 | before the start of buffer and look for either another header or EOF. |
| 359 | |
| 360 | |
| 361 | Format Grammar and Invariants |
| 362 | ============================= |
| 363 | |
| 364 | Not all sequences of Metadata records and Function records are valid data. A |
| 365 | sequence should be parsed as a state machine. The expectations for a valid |
| 366 | format can be expressed as a context free grammar. |
| 367 | |
| 368 | This is an attempt to explain the format with statements in EBNF format. |
| 369 | |
| 370 | - Format := Header ThreadBuffer* EOF |
| 371 | |
| 372 | - ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End |
| 373 | |
| 374 | - BodySequence := NewCPUId | TSCWrap | Function | CustomEvent |
| 375 | |
| 376 | - Function := (Function_Entry_Args CallArgument*) | Function_Other_Type |
| 377 | |
| 378 | - CustomEvent := CustomEventMarker CustomEventUnstructuredMemory |
| 379 | |
| 380 | - End := EndOfBuffer RemainingBufferSizeToSkip |
| 381 | |
| 382 | |
| 383 | Function Record Order |
| 384 | --------------------- |
| 385 | |
| 386 | There are a few clarifications that may help understand what is expected of |
| 387 | Function records. |
| 388 | |
| 389 | - Functions with an Exit are expected to have a corresponding Entry or |
| 390 | Entry_Args function record precede them in the trace. |
| 391 | |
| 392 | - Tail_Exit Function records record the Function ID of the function whose return |
| 393 | address the program counter will take. In other words, the final function that |
| 394 | would be popped off of the call stack if tail call optimization was not used. |
| 395 | |
| 396 | - Not all functions marked for instrumentation are necessarily in the trace. The |
| 397 | tracer uses heuristics to preserve the trace for non-trivial functions. |
| 398 | |
| 399 | - Not every entry must have a traced Exit or Tail Exit. The buffer may run out |
| 400 | of space or the program may request for the tracer to finalize toreturn the |
| 401 | buffer before an instrumented function exits. |