Text file src/cmd/compile/abi-internal.md

     1# Go internal ABI specification
     2
     3Self-link: [go.dev/s/regabi](https://go.dev/s/regabi)
     4
     5This document describes Go’s internal application binary interface
     6(ABI), known as ABIInternal.
     7Go's ABI defines the layout of data in memory and the conventions for
     8calling between Go functions.
     9This ABI is *unstable* and will change between Go versions.
    10If you’re writing assembly code, please instead refer to Go’s
    11[assembly documentation](/doc/asm.html), which describes Go’s stable
    12ABI, known as ABI0.
    13
    14All functions defined in Go source follow ABIInternal.
    15However, ABIInternal and ABI0 functions are able to call each other
    16through transparent *ABI wrappers*, described in the [internal calling
    17convention proposal](https://golang.org/design/27539-internal-abi).
    18
    19Go uses a common ABI design across all architectures.
    20We first describe the common ABI, and then cover per-architecture
    21specifics.
    22
    23*Rationale*: For the reasoning behind using a common ABI across
    24architectures instead of the platform ABI, see the [register-based Go
    25calling convention proposal](https://golang.org/design/40724-register-calling).
    26
    27## Memory layout
    28
    29Go's built-in types have the following sizes and alignments.
    30Many, though not all, of these sizes are guaranteed by the [language
    31specification](/doc/go_spec.html#Size_and_alignment_guarantees).
    32Those that aren't guaranteed may change in future versions of Go (for
    33example, we've considered changing the alignment of int64 on 32-bit).
    34
    35| Type                        | 64-bit |       | 32-bit |       |
    36|-----------------------------|--------|-------|--------|-------|
    37|                             | Size   | Align | Size   | Align |
    38| bool, uint8, int8           | 1      | 1     | 1      | 1     |
    39| uint16, int16               | 2      | 2     | 2      | 2     |
    40| uint32, int32               | 4      | 4     | 4      | 4     |
    41| uint64, int64               | 8      | 8     | 8      | 4     |
    42| int, uint                   | 8      | 8     | 4      | 4     |
    43| float32                     | 4      | 4     | 4      | 4     |
    44| float64                     | 8      | 8     | 8      | 4     |
    45| complex64                   | 8      | 4     | 8      | 4     |
    46| complex128                  | 16     | 8     | 16     | 4     |
    47| uintptr, *T, unsafe.Pointer | 8      | 8     | 4      | 4     |
    48
    49The types `byte` and `rune` are aliases for `uint8` and `int32`,
    50respectively, and hence have the same size and alignment as these
    51types.
    52
    53The layout of `map`, `chan`, and `func` types is equivalent to *T.
    54
    55To describe the layout of the remaining composite types, we first
    56define the layout of a *sequence* S of N fields with types
    57t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>N</sub>.
    58We define the byte offset at which each field begins relative to a
    59base address of 0, as well as the size and alignment of the sequence
    60as follows:
    61
    62```
    63offset(S, i) = 0  if i = 1
    64             = align(offset(S, i-1) + sizeof(t_(i-1)), alignof(t_i))
    65alignof(S)   = 1  if N = 0
    66             = max(alignof(t_i) | 1 <= i <= N)
    67sizeof(S)    = 0  if N = 0
    68             = align(offset(S, N) + sizeof(t_N), alignof(S))
    69```
    70
    71Where sizeof(T) and alignof(T) are the size and alignment of type T,
    72respectively, and align(x, y) rounds x up to a multiple of y.
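
As an illustration, the formulas above can be transcribed into a short Go sketch. The names `field`, `alignUp`, and `layout` are hypothetical and are not part of the compiler; sizes and alignments are supplied by the caller rather than derived from real types.

```go
package main

import "fmt"

// field describes one element of a type sequence by its size and alignment.
type field struct {
	size, align uintptr
}

// alignUp rounds x up to a multiple of y (y must be nonzero).
func alignUp(x, y uintptr) uintptr {
	return (x + y - 1) / y * y
}

// layout returns the byte offset of each field, plus the size and
// alignment of the whole sequence, following the formulas above.
func layout(seq []field) (offsets []uintptr, size, align uintptr) {
	align = 1       // alignof(S) = 1 if N = 0
	var end uintptr // offset just past the previous field
	for _, f := range seq {
		off := alignUp(end, f.align)
		offsets = append(offsets, off)
		end = off + f.size
		if f.align > align {
			align = f.align
		}
	}
	return offsets, alignUp(end, align), align
}

func main() {
	// The sequence uint8, uint64, uint16, using the 64-bit columns of the table.
	offs, size, align := layout([]field{{1, 1}, {8, 8}, {2, 2}})
	fmt.Println(offs, size, align) // [0 8 16] 24 8
}
```

The `uint64` field is pushed to offset 8 by its alignment, and the total size is rounded up from 18 to 24 so that the sequence's size is a multiple of its alignment.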
    73
    74The `interface{}` type is a sequence of 1. a pointer to the runtime type
    75description for the interface's dynamic type and 2. an `unsafe.Pointer`
    76data field.
    77Any other interface type (besides the empty interface) is a sequence
    78of 1. a pointer to the runtime "itab" that gives the method pointers and
    79the type of the data field and 2. an `unsafe.Pointer` data field.
    80An interface can be "direct" or "indirect" depending on the dynamic
    81type: a direct interface stores the value directly in the data field,
    82and an indirect interface stores a pointer to the value in the data
    83field.
    84An interface can only be direct if the value consists of a single
    85pointer word.
    86
    87An array type `[N]T` is a sequence of N fields of type T.
    88
    89The slice type `[]T` is a sequence of a `*[cap]T` pointer to the slice
    90backing store, an `int` giving the `len` of the slice, and an `int`
    91giving the `cap` of the slice.
    92
    93The `string` type is a sequence of a `*[len]byte` pointer to the
    94string backing store, and an `int` giving the `len` of the string.
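
The word counts implied by these definitions can be observed with `unsafe.Sizeof`. This is only a sketch: `headerWords` is a hypothetical helper, and the results reflect the layout described here for the gc toolchain, not a guarantee of the language.

```go
package main

import (
	"fmt"
	"unsafe"
)

// word is the size of a pointer-sized word on the current architecture.
const word = unsafe.Sizeof(uintptr(0))

// headerWords reports the size, in pointer-sized words, of a string,
// a slice, an empty interface, and a non-empty interface.
func headerWords() (str, slice, eface, iface uintptr) {
	var s string
	var b []byte
	var e interface{}
	var err error
	return unsafe.Sizeof(s) / word, unsafe.Sizeof(b) / word,
		unsafe.Sizeof(e) / word, unsafe.Sizeof(err) / word
}

func main() {
	fmt.Println(headerWords()) // 2 3 2 2
}
```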
    95
    96A struct type `struct { f1 t1; ...; fM tM }` is laid out as the
    97sequence t1, ..., tM, tP, where tP is either:
    98
    99- Type `byte` if sizeof(tM) = 0 and any of sizeof(t*i*) ≠ 0.
   100- Empty (size 0 and align 1) otherwise.
   101
   102The padding byte prevents creating a past-the-end pointer by taking
   103the address of the final, empty fM field.
   104
   105Note that user-written assembly code should generally not depend on Go
   106type layout and should instead use the constants defined in
   107[`go_asm.h`](/doc/asm.html#data-offsets).
   108
   109## Function call argument and result passing
   110
   111Function calls pass arguments and results using a combination of the
   112stack and machine registers.
   113Each argument or result is passed either entirely in registers or
   114entirely on the stack.
   115Because access to registers is generally faster than access to the
   116stack, arguments and results are preferentially passed in registers.
   117However, any argument or result that contains a non-trivial array or
   118does not fit entirely in the remaining available registers is passed
   119on the stack.
   120
   121Each architecture defines a sequence of integer registers and a
   122sequence of floating-point registers.
   123At a high level, arguments and results are recursively broken down
   124into values of base types and these base values are assigned to
   125registers from these sequences.
   126
   127Arguments and results can share the same registers, but do not share
   128the same stack space.
   129Beyond the arguments and results passed on the stack, the caller also
   130reserves spill space on the stack for all register-based arguments
   131(but does not populate this space).
   132
   133The receiver, arguments, and results of function or method F are
   134assigned to registers or the stack using the following algorithm:
   135
   1361. Let NI and NFP be the length of integer and floating-point register
   137   sequences defined by the architecture.
   138   Let I and FP be 0; these are the indexes of the next integer and
   139   floating-point register.
   140   Let S, the type sequence defining the stack frame, be empty.
   1411. If F is a method, assign F’s receiver.
   1421. For each argument A of F, assign A.
   1431. Add a pointer-alignment field to S. This has size 0 and the same
   144   alignment as `uintptr`.
   1451. Reset I and FP to 0.
   1461. For each result R of F, assign R.
   1471. Add a pointer-alignment field to S.
   1481. For each register-assigned receiver and argument of F, let T be its
   149   type and add T to the stack sequence S.
   150   This is the argument's (or receiver's) spill space and will be
   151   uninitialized at the call.
   1521. Add a pointer-alignment field to S.
   153
   154Assigning a receiver, argument, or result V of underlying type T works
   155as follows:
   156
   1571. Remember I and FP.
   1581. If T has zero size, add T to the stack sequence S and return.
   1591. Try to register-assign V.
   1601. If step 3 failed, reset I and FP to the values from step 1, add T
   161   to the stack sequence S, and assign V to this field in S.
   162
   163Register-assignment of a value V of underlying type T works as follows:
   164
   1651. If T is a boolean or integral type that fits in an integer
   166   register, assign V to register I and increment I.
   1671. If T is an integral type that fits in two integer registers, assign
   168   the least significant and most significant halves of V to registers
   169   I and I+1, respectively, and increment I by 2.
   1701. If T is a floating-point type and can be represented without loss
   171   of precision in a floating-point register, assign V to register FP
   172   and increment FP.
   1731. If T is a complex type, recursively register-assign its real and
   174   imaginary parts.
   1751. If T is a pointer type, map type, channel type, or function type,
   176   assign V to register I and increment I.
   1771. If T is a string type, interface type, or slice type, recursively
   178   register-assign V’s components (2 for strings and interfaces, 3 for
   179   slices).
   1801. If T is a struct type, recursively register-assign each field of V.
   1811. If T is an array type of length 0, do nothing.
   1821. If T is an array type of length 1, recursively register-assign its
   183   one element.
   1841. If T is an array type of length > 1, fail.
   1851. If I > NI or FP > NFP, fail.
   1861. If any recursive assignment above fails, fail.
   187
   188The above algorithm produces an assignment of each receiver, argument,
   189and result to registers or to a field in the stack sequence.
   190The final stack sequence looks like: stack-assigned receiver,
   191stack-assigned arguments, pointer-alignment, stack-assigned results,
   192pointer-alignment, spill space for each register-assigned argument,
   193pointer-alignment.
   194The following diagram shows what this stack frame looks like on the
   195stack, using the typical convention where address 0 is at the bottom:
   196
   197    +------------------------------+
   198    |             . . .            |
   199    | 2nd reg argument spill space |
   200    | 1st reg argument spill space |
   201    | <pointer-sized alignment>    |
   202    |             . . .            |
   203    | 2nd stack-assigned result    |
   204    | 1st stack-assigned result    |
   205    | <pointer-sized alignment>    |
   206    |             . . .            |
   207    | 2nd stack-assigned argument  |
   208    | 1st stack-assigned argument  |
   209    | stack-assigned receiver      |
   210    +------------------------------+ ↓ lower addresses
   211
   212To perform a call, the caller reserves space starting at the lowest
   213address in its stack frame for the call stack frame, stores arguments
   214in the registers and argument stack fields determined by the above
   215algorithm, and performs the call.
   216At the time of a call, spill space, result stack fields, and result
   217registers are left uninitialized.
   218Upon return, the callee must have stored results to all result
   219registers and result stack fields determined by the above algorithm.
   220
   221There are no callee-save registers, so a call may overwrite any
   222register that doesn’t have a fixed meaning, including argument
   223registers.
   224
   225### Example
   226
   227Consider the function `func f(a1 uint8, a2 [2]uintptr, a3 uint8) (r1
   228struct { x uintptr; y [2]uintptr }, r2 string)` on a 64-bit
   229architecture with hypothetical integer registers R0–R9.
   230
   231On entry, `a1` is assigned to `R0`, `a3` is assigned to `R1` and the
   232stack frame is laid out in the following sequence:
   233
   234    a2      [2]uintptr
   235    r1.x    uintptr
   236    r1.y    [2]uintptr
   237    a1Spill uint8
   238    a3Spill uint8
   239    _       [6]uint8  // alignment padding
   240
   241In the stack frame, only the `a2` field is initialized on entry; the
   242rest of the frame is left uninitialized.
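
One way to check the offsets in this frame is to mirror the sequence as a Go struct and ask `unsafe.Offsetof` for each field. This is an illustrative sketch only: the `frame` type and the helpers `frameOffsets` and `frameSize` are hypothetical, and the zero-sized pointer-alignment fields are omitted because they do not change the layout here.

```go
package main

import (
	"fmt"
	"unsafe"
)

// frame mirrors the stack sequence for f; the field names are illustrative.
type frame struct {
	a2      [2]uintptr // stack-assigned argument
	r1x     uintptr    // r1.x, stack-assigned result
	r1y     [2]uintptr // r1.y, stack-assigned result
	a1Spill uint8      // spill space for register-assigned a1
	a3Spill uint8      // spill space for register-assigned a3
	_       [6]uint8   // alignment padding
}

// frameOffsets returns the offsets of a2, r1x, r1y, a1Spill, and a3Spill.
func frameOffsets() [5]uintptr {
	var fr frame
	return [5]uintptr{
		unsafe.Offsetof(fr.a2),
		unsafe.Offsetof(fr.r1x),
		unsafe.Offsetof(fr.r1y),
		unsafe.Offsetof(fr.a1Spill),
		unsafe.Offsetof(fr.a3Spill),
	}
}

// frameSize returns the total size of the frame.
func frameSize() uintptr {
	return unsafe.Sizeof(frame{})
}

func main() {
	// On a 64-bit architecture: [0 16 24 40 41] 48
	fmt.Println(frameOffsets(), frameSize())
}
```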
   243
   244On exit, `r2.base` is assigned to `R0`, `r2.len` is assigned to `R1`,
   245and `r1.x` and `r1.y` are initialized in the stack frame.
   246
   247There are several things to note in this example.
   248First, `a2` and `r1` are stack-assigned because they contain arrays.
   249The other arguments and results are register-assigned.
   250Result `r2` is decomposed into its components, which are individually
   251register-assigned.
   252On the stack, the stack-assigned arguments appear at lower addresses
   253than the stack-assigned results, which appear at lower addresses than
   254the argument spill area.
   255Only arguments, not results, are assigned a spill area on the stack.
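
The register-versus-stack decisions in this example can be reproduced with a simplified sketch of the assignment algorithm built on the `reflect` package. It is not the compiler's implementation: `regState`, `regAssign`, `assign`, `classify`, and `classifyExample` are hypothetical names, the sketch assumes a 64-bit architecture (so every integer fits in one register), and it only classifies values without building the stack sequence.

```go
package main

import (
	"fmt"
	"reflect"
)

// regState holds the register cursors I and FP and the limits NI and NFP.
type regState struct {
	i, fp   int
	ni, nfp int
}

// regAssign attempts register-assignment of a value of type t and
// reports whether it succeeded.
func (s *regState) regAssign(t reflect.Type) bool {
	switch t.Kind() {
	case reflect.Bool,
		reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64,
		reflect.Uintptr, reflect.Pointer, reflect.UnsafePointer,
		reflect.Map, reflect.Chan, reflect.Func:
		s.i++
	case reflect.Float32, reflect.Float64:
		s.fp++
	case reflect.Complex64, reflect.Complex128:
		s.fp += 2 // real and imaginary parts
	case reflect.String, reflect.Interface:
		s.i += 2 // two pointer-sized components
	case reflect.Slice:
		s.i += 3 // pointer, len, cap
	case reflect.Struct:
		for k := 0; k < t.NumField(); k++ {
			if !s.regAssign(t.Field(k).Type) {
				return false
			}
		}
	case reflect.Array:
		if t.Len() > 1 {
			return false // non-trivial arrays always fail
		}
		if t.Len() == 1 && !s.regAssign(t.Elem()) {
			return false
		}
	}
	return s.i <= s.ni && s.fp <= s.nfp
}

// assign reports whether a value of type t is passed in registers.
func (s *regState) assign(t reflect.Type) bool {
	saved := *s        // remember I and FP
	if t.Size() == 0 { // zero-sized values are stack-assigned
		return false
	}
	if s.regAssign(t) {
		return true
	}
	*s = saved // roll back and fall back to the stack
	return false
}

// classify reports, for each argument and result of the function type ft,
// whether it is register-assigned (true) or stack-assigned (false).
func classify(ft reflect.Type, ni, nfp int) (args, results []bool) {
	s := &regState{ni: ni, nfp: nfp}
	for k := 0; k < ft.NumIn(); k++ {
		args = append(args, s.assign(ft.In(k)))
	}
	s.i, s.fp = 0, 0 // results reuse the same registers
	for k := 0; k < ft.NumOut(); k++ {
		results = append(results, s.assign(ft.Out(k)))
	}
	return args, results
}

// classifyExample classifies the function f from the example, using ten
// integer registers (R0–R9) and no floating-point registers.
func classifyExample() (args, results []bool) {
	var f func(a1 uint8, a2 [2]uintptr, a3 uint8) (r1 struct {
		x uintptr
		y [2]uintptr
	}, r2 string)
	return classify(reflect.TypeOf(f), 10, 0)
}

func main() {
	fmt.Println(classifyExample()) // [true false true] [false true]
}
```

The output matches the example: `a1`, `a3`, and `r2` are register-assigned, while `a2` and `r1` fall back to the stack because they contain arrays of length greater than 1.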
   256
   257### Rationale
   258
   259Each base value is assigned to its own register to optimize
   260construction and access.
   261An alternative would be to pack multiple sub-word values into
   262registers, or to simply map an argument's in-memory layout to
   263registers (this is common in C ABIs), but this typically adds cost to
   264pack and unpack these values.
   265Modern architectures have more than enough registers to pass all
   266arguments and results this way for nearly all functions (see the
   267appendix), so there’s little downside to spreading base values across
   268registers.
   269
   270Arguments that can’t be fully assigned to registers are passed
   271entirely on the stack in case the callee takes the address of that
   272argument.
   273If an argument could be split across the stack and registers and the
   274callee took its address, it would need to be reconstructed in memory,
   275a process that would be proportional to the size of the argument.
   276
   277Non-trivial arrays are always passed on the stack because indexing
   278into an array typically requires a computed offset, which generally
   279isn’t possible with registers.
   280Arrays in general are rare in function signatures (only 0.7% of
   281functions in the Go 1.15 standard library and 0.2% in kubelet).
   282We considered allowing array fields to be passed on the stack while
   283the rest of an argument’s fields are passed in registers, but this
   284creates the same problems as other large structs if the callee takes
   285the address of an argument, and would benefit <0.1% of functions in
   286kubelet (and even these very little).
   287
   288We make exceptions for 0- and 1-element arrays because these don’t
   289require computed offsets, and 1-element arrays are already decomposed
   290in the compiler’s SSA representation.
   291
   292The ABI assignment algorithm above is equivalent to Go’s stack-based
   293ABI0 calling convention if there are zero architecture registers.
   294This is intended to ease the transition to the register-based internal
   295ABI and make it easy for the compiler to generate either calling
   296convention.
   297An architecture may still define register meanings that aren’t
   298compatible with ABI0, but these differences should be easy to account
   299for in the compiler.
   300
   301The assignment algorithm assigns zero-sized values to the stack
   302(assignment step 2) in order to support ABI0-equivalence.
   303While these values take no space themselves, they do result in
   304alignment padding on the stack in ABI0.
   305Without this step, the internal ABI would register-assign zero-sized
   306values even on architectures that provide no argument registers
   307because they don't consume any registers, and hence not add alignment
   308padding to the stack.
   309
   310The algorithm reserves spill space for arguments in the caller’s frame
   311so that the compiler can generate a stack growth path that spills into
   312this reserved space.
   313If the callee has to grow the stack, it may not be able to reserve
   314enough additional stack space in its own frame to spill these, which
   315is why it’s important that the caller do so.
   316These slots also act as the home location if these arguments need to
   317be spilled for any other reason, which simplifies traceback printing.
   318
   319There are several options for how to lay out the argument spill space.
   320We chose to lay out each argument according to its type's usual memory
   321layout but to separate the spill space from the regular argument
   322space.
   323Using the usual memory layout simplifies the compiler because it
   324already understands this layout.
   325Also, if a function takes the address of a register-assigned argument,
   326the compiler must spill that argument to memory in its usual memory
   327layout and it's more convenient to use the argument spill space for
   328this purpose.
   329
   330Alternatively, the spill space could be structured around argument
   331registers.
   332In this approach, the stack growth spill path would spill each
   333argument register to a register-sized stack word.
   334However, if the function takes the address of a register-assigned
   335argument, the compiler would have to reconstruct it in memory layout
   336elsewhere on the stack.
   337
   338The spill space could also be interleaved with the stack-assigned
   339arguments so the arguments appear in order whether they are register-
   340or stack-assigned.
   341This would be close to ABI0, except that register-assigned arguments
   342would be uninitialized on the stack and there's no need to reserve
   343stack space for register-assigned results.
   344We expect separating the spill space to perform better because of
   345memory locality.
   346Separating the space is also potentially simpler for `reflect` calls
   347because this allows `reflect` to summarize the spill space as a single
   348number.
   349Finally, the long-term intent is to remove reserved spill slots
   350entirely – allowing most functions to be called without any stack
   351setup and easing the introduction of callee-save registers – and
   352separating the spill space makes that transition easier.
   353
   354## Closures
   355
   356A func value (e.g., `var x func()`) is a pointer to a closure object.
   357A closure object begins with a pointer-sized program counter
   358representing the entry point of the function, followed by zero or more
   359bytes containing the closed-over environment.
   360
   361Closure calls follow the same conventions as static function and
   362method calls, with one addition. Each architecture specifies a
   363*closure context pointer* register and calls to closures store the
   364address of the closure object in the closure context pointer register
   365prior to the call.
   366
   367## Software floating-point mode
   368
   369In "softfloat" mode, the ABI simply treats the hardware as having zero
   370floating-point registers.
   371As a result, any arguments containing floating-point values will be
   372passed on the stack.
   373
   374*Rationale*: Softfloat mode is about compatibility over performance
   375and is not commonly used.
   376Hence, we keep the ABI as simple as possible in this case, rather than
   377adding additional rules for passing floating-point values in integer
   378registers.
   379
   380## Architecture specifics
   381
   382This section describes per-architecture register mappings, as well as
   383other per-architecture special cases.
   384
   385### amd64 architecture
   386
   387The amd64 architecture uses the following sequence of 9 registers for
   388integer arguments and results:
   389
   390    RAX, RBX, RCX, RDI, RSI, R8, R9, R10, R11
   391
   392It uses X0 – X14 for floating-point arguments and results.
   393
   394*Rationale*: These sequences are chosen from the available registers
   395to be relatively easy to remember.
   396
   397Registers R12 and R13 are permanent scratch registers.
   398R15 is a scratch register except in dynamically linked binaries.
   399
   400*Rationale*: Some operations such as stack growth and reflection calls
   401need dedicated scratch registers in order to manipulate call frames
   402without corrupting arguments or results.
   403
   404Special-purpose registers are as follows:
   405
   406| Register | Call meaning | Return meaning | Body meaning |
   407| --- | --- | --- | --- |
   408| RSP | Stack pointer | Same | Same |
   409| RBP | Frame pointer | Same | Same |
   410| RDX | Closure context pointer | Scratch | Scratch |
   411| R12 | Scratch | Scratch | Scratch |
   412| R13 | Scratch | Scratch | Scratch |
   413| R14 | Current goroutine | Same | Same |
   414| R15 | GOT reference temporary if dynlink | Same | Same |
   415| X15 | Zero value (*) | Same | Scratch |
   416
   417(*) Except on Plan 9, where X15 is a scratch register because SSE
   418registers cannot be used in note handlers (so the compiler avoids
   419using them except when absolutely necessary).
   420
   421*Rationale*: These register meanings are compatible with Go’s
   422stack-based calling convention except for R14 and X15, which will have
   423to be restored on transitions from ABI0 code to ABIInternal code.
   424In ABI0, these are undefined, so transitions from ABIInternal to ABI0
   425can ignore these registers.
   426
   427*Rationale*: For the current goroutine pointer, we chose a register
   428that requires an additional REX byte.
   429While this adds one byte to every function prologue, it is hardly ever
   430accessed outside the function prologue and we expect making more
   431single-byte registers available to be a net win.
   432
   433*Rationale*: We could allow R14 (the current goroutine pointer) to be
   434a scratch register in function bodies because it can always be
   435restored from TLS on amd64.
   436However, we designate it as a fixed register for simplicity and for
   437consistency with other architectures that may not have a copy of the
   438current goroutine pointer in TLS.
   439
   440*Rationale*: We designate X15 as a fixed zero register because
   441functions often have to bulk zero their stack frames, and this is more
   442efficient with a designated zero register.
   443
   444*Implementation note*: Registers with fixed meaning at calls but not
   445in function bodies must be initialized by "injected" calls such as
   446signal-based panics.
   447
   448#### Stack layout
   449
   450The stack pointer, RSP, grows down and is always aligned to 8 bytes.
   451
   452The amd64 architecture does not use a link register.
   453
   454A function's stack frame is laid out as follows:
   455
   456    +------------------------------+
   457    | return PC                    |
   458    | RBP on entry                 |
   459    | ... locals ...               |
   460    | ... outgoing arguments ...   |
   461    +------------------------------+ ↓ lower addresses
   462
   463The "return PC" is pushed as part of the standard amd64 `CALL`
   464operation.
   465On entry, a function subtracts from RSP to open its stack frame and
   466saves the value of RBP directly below the return PC.
   467A leaf function that does not require any stack space may omit the
   468saved RBP.
   469
   470The Go ABI's use of RBP as a frame pointer register is compatible with
   471amd64 platform conventions so that Go can inter-operate with platform
   472debuggers and profilers.
   473
   474#### Flags
   475
   476The direction flag (D) is always cleared (set to the “forward”
   477direction) at a call.
   478The arithmetic status flags are treated like scratch registers and not
   479preserved across calls.
   480All other bits in RFLAGS are system flags.
   481
   482At function calls and returns, the CPU is in x87 mode (not MMX
   483technology mode).
   484
   485*Rationale*: Go on amd64 does not use either the x87 registers or MMX
   486registers. Hence, we follow the SysV platform conventions in order to
   487simplify transitions to and from the C ABI.
   488
   489At calls, the MXCSR control bits are always set as follows:
   490
   491| Flag | Bit | Value | Meaning |
   492| --- | --- | --- | --- |
   493| FZ | 15 | 0 | Do not flush to zero |
   494| RC | 14/13 | 0 (RN) | Round to nearest |
   495| PM | 12 | 1 | Precision masked |
   496| UM | 11 | 1 | Underflow masked |
   497| OM | 10 | 1 | Overflow masked |
   498| ZM | 9 | 1 | Divide-by-zero masked |
   499| DM | 8 | 1 | Denormal operations masked |
   500| IM | 7 | 1 | Invalid operations masked |
   501| DAZ | 6 | 0 | Do not zero de-normals |
   502
   503The MXCSR status bits are callee-save.
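
For reference, the control bits in the table above combine into a single MXCSR control value. The sketch below simply ORs together the bits whose value is 1; the constant names are hypothetical and the status bits (0 through 5) are left out.

```go
package main

import "fmt"

// Bit positions of the MXCSR exception mask flags, taken from the table above.
const (
	maskIM = 1 << 7  // invalid operations masked
	maskDM = 1 << 8  // denormal operations masked
	maskZM = 1 << 9  // divide-by-zero masked
	maskOM = 1 << 10 // overflow masked
	maskUM = 1 << 11 // underflow masked
	maskPM = 1 << 12 // precision masked
)

// goMXCSR is the control configuration at calls: all six exceptions
// masked, with RC, FZ, and DAZ left at zero.
const goMXCSR uint32 = maskIM | maskDM | maskZM | maskOM | maskUM | maskPM

func main() {
	fmt.Printf("%#x\n", goMXCSR) // 0x1f80
}
```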
   504
   505*Rationale*: Having a fixed MXCSR control configuration allows Go
   506functions to use SSE operations without modifying or saving the MXCSR.
   507Functions are allowed to modify it between calls (as long as they
   508restore it), but as of this writing Go code never does.
   509The above fixed configuration matches the process initialization
   510control bits specified by the ELF AMD64 ABI.
   511
   512The x87 floating-point control word is not used by Go on amd64.
   513
   514### arm64 architecture
   515
   516The arm64 architecture uses R0 – R15 for integer arguments and results.
   517
   518It uses F0 – F15 for floating-point arguments and results.
   519
   520*Rationale*: 16 integer registers and 16 floating-point registers are
   521more than enough for passing arguments and results for practically all
   522functions (see Appendix). While there are more registers available,
   523using more registers provides little benefit. Additionally, it will add
   524overhead on code paths where the number of arguments is not statically
   525known (e.g. reflect call), and will consume more stack space when there
   526is only limited stack space available to fit in the nosplit limit.
   527
   528Registers R16 and R17 are permanent scratch registers. They are also
   529used as scratch registers by the linker (Go linker and external
   530linker) in trampolines.
   531
   532Register R18 is reserved and never used. It is reserved for the OS
   533on some platforms (e.g. macOS).
   534
   535Registers R19 – R25 are permanent scratch registers. In addition,
   536R27 is a permanent scratch register used by the assembler when
   537expanding instructions.
   538
   539Floating-point registers F16 – F31 are also permanent scratch
   540registers.
   541
   542Special-purpose registers are as follows:
   543
   544| Register | Call meaning | Return meaning | Body meaning |
   545| --- | --- | --- | --- |
   546| RSP | Stack pointer | Same | Same |
   547| R30 | Link register | Same | Scratch (non-leaf functions) |
   548| R29 | Frame pointer | Same | Same |
   549| R28 | Current goroutine | Same | Same |
   550| R27 | Scratch | Scratch | Scratch |
   551| R26 | Closure context pointer | Scratch | Scratch |
   552| R18 | Reserved (not used) | Same | Same |
   553| ZR  | Zero value | Same | Same |
   554
   555*Rationale*: These register meanings are compatible with Go’s
   556stack-based calling convention.
   557
   558*Rationale*: The link register, R30, holds the function return
   559address at the function entry. For functions that have frames
   560(including most non-leaf functions), R30 is saved to stack in the
   561function prologue and restored in the epilogue. Within the function
   562body, R30 can be used as a scratch register.
   563
   564*Implementation note*: Registers with fixed meaning at calls but not
   565in function bodies must be initialized by "injected" calls such as
   566signal-based panics.
   567
   568#### Stack layout
   569
   570The stack pointer, RSP, grows down and is always aligned to 16 bytes.
   571
   572*Rationale*: The arm64 architecture requires the stack pointer to be
   57316-byte aligned.
   574
   575A function's stack frame, after the frame is created, is laid out as
   576follows:
   577
   578    +------------------------------+
   579    | ... locals ...               |
   580    | ... outgoing arguments ...   |
   581    | return PC                    | ← RSP points to
   582    | frame pointer on entry       |
   583    +------------------------------+ ↓ lower addresses
   584
   585The "return PC" is loaded to the link register, R30, as part of the
   586arm64 `CALL` operation.
   587
   588On entry, a function subtracts from RSP to open its stack frame, and
   589saves the values of R30 and R29 at the bottom of the frame.
   590Specifically, R30 is saved at 0(RSP) and R29 is saved at -8(RSP),
   591after RSP is updated.
   592
   593A leaf function that does not require any stack space may omit the
   594saved R30 and R29.
   595
   596The Go ABI's use of R29 as a frame pointer register is compatible with
   597arm64 architecture requirements so that Go can inter-operate with platform
   598debuggers and profilers.
   599
   600This stack layout is used by both register-based (ABIInternal) and
   601stack-based (ABI0) calling conventions.
   602
   603#### Flags
   604
   605The arithmetic status flags (NZCV) are treated like scratch registers
   606and not preserved across calls.
   607All other bits in PSTATE are system flags and are not modified by Go.
   608
   609The floating-point status register (FPSR) is treated like a scratch
   610register and is not preserved across calls.
   611
   612At calls, the floating-point control register (FPCR) bits are always
   613set as follows:
   614
   615| Flag | Bit | Value | Meaning |
   616| --- | --- | --- | --- |
   617| DN  | 25 | 0 | Propagate NaN operands |
   618| FZ  | 24 | 0 | Do not flush to zero |
   619| RC  | 23/22 | 0 (RN) | Round to nearest, choose even if tied |
   620| IDE | 15 | 0 | Denormal operations trap disabled |
   621| IXE | 12 | 0 | Inexact trap disabled |
   622| UFE | 11 | 0 | Underflow trap disabled |
   623| OFE | 10 | 0 | Overflow trap disabled |
   624| DZE | 9 | 0 | Divide-by-zero trap disabled |
   625| IOE | 8 | 0 | Invalid operations trap disabled |
   626| NEP | 2 | 0 | Scalar operations do not affect higher elements in vector registers |
   627| AH  | 1 | 0 | No alternate handling of de-normal inputs |
   628| FIZ | 0 | 0 | Do not zero de-normals |
   629
   630*Rationale*: Having a fixed FPCR control configuration allows Go
   631functions to use floating-point and vector (SIMD) operations without
   632modifying or saving the FPCR.
   633Functions are allowed to modify it between calls (as long as they
   634restore it), but as of this writing Go code never does.
   635
   636### loong64 architecture
   637
   638The loong64 architecture uses R4 – R19 for integer arguments and results.
   639
   640It uses F0 – F15 for floating-point arguments and results.
   641
   642Registers R20 – R21, R23 – R28, R30 – R31, and F16 – F31 are permanent scratch registers.
   643
   644Register R2 is reserved and never used.
   645
   646Registers R20 and R21 are used by runtime.duffcopy and runtime.duffzero.
   647
   648Special-purpose registers used within Go generated code and Go assembly code
   649are as follows:
   650
   651| Register | Call meaning | Return meaning | Body meaning |
   652| --- | --- | --- | --- |
   653| R0 | Zero value | Same | Same |
   654| R1 | Link register | Link register | Scratch |
   655| R3 | Stack pointer | Same | Same |
   656| R20,R21 | Scratch | Scratch | Used by duffcopy, duffzero |
   657| R22 | Current goroutine | Same | Same |
   658| R29 | Closure context pointer | Same | Same |
   659| R30, R31 | Used by the assembler | Same | Same |
   660
   661*Rationale*: These register meanings are compatible with Go’s stack-based
   662calling convention.
   663
   664#### Stack layout
   665
   666The stack pointer, R3, grows down and is aligned to 8 bytes.
   667
   668A function's stack frame, after the frame is created, is laid out as
   669follows:
   670
   671    +------------------------------+
   672    | ... locals ...               |
   673    | ... outgoing arguments ...   |
   674    | return PC                    | ← R3 points to
   675    +------------------------------+ ↓ lower addresses
   676
   677This stack layout is used by both register-based (ABIInternal) and
   678stack-based (ABI0) calling conventions.
   679
   680The "return PC" is loaded to the link register, R1, as part of the
   681loong64 `JAL` operation.
   682
   683#### Flags
   684All bits in CSR are system flags and are not modified by Go.
   685
   686### ppc64 architecture
   687
   688The ppc64 architecture uses R3 – R10 and R14 – R17 for integer arguments
   689and results.
   690
   691It uses F1 – F12 for floating-point arguments and results.
   692
   693Register R31 is a permanent scratch register in Go.
   694
   695Special-purpose registers used within Go generated code and Go
   696assembly code are as follows:
   697
   698| Register | Call meaning | Return meaning | Body meaning |
   699| --- | --- | --- | --- |
   700| R0  | Zero value | Same | Same |
   701| R1  | Stack pointer | Same | Same |
   702| R2  | TOC register | Same | Same |
   703| R11 | Closure context pointer | Scratch | Scratch |
   704| R12 | Function address on indirect calls | Scratch | Scratch |
   705| R13 | TLS pointer | Same | Same |
   706| R20,R21 | Scratch | Scratch | Used by duffcopy, duffzero |
   707| R30 | Current goroutine | Same | Same |
   708| R31 | Scratch | Scratch | Scratch |
   709| LR  | Link register | Link register | Scratch |
   710*Rationale*: These register meanings are compatible with Go’s
   711stack-based calling convention.
   712
   713The link register, LR, holds the function return
   714address at the function entry and is set to the correct return
   715address before exiting the function. It is also used
   716in some cases as the function address when doing an indirect call.
   717
   718The register R2 contains the address of the TOC (table of contents) which
   719contains data or code addresses used when generating position independent
   720code. Non-Go code generated when using cgo contains TOC-relative addresses
   721which depend on R2 holding a valid TOC. Go code compiled with -shared or
   722-dynlink initializes and maintains R2 and uses it in some cases for
   723function calls; Go code compiled without these options does not modify R2.
   724
   725When making a function call, R12 contains the function address, which the
   726callee uses to generate R2 at the beginning of the function. R12 can be used for
   727other purposes within the body of the function, such as trampoline generation.
   728
   729R20 and R21 are used by duffcopy and duffzero, calls to which may be generated
   730before arguments are saved, so they should not be used for register arguments.
   731
   732The Count register CTR can be used as the call target for some branch instructions.
   733It holds the return address when preemption has occurred.
   734
   735On PPC64, a float32 becomes a float64 in the register when it is loaded. This
   736differs from other platforms, and the internal implementation of reflection
   737must account for it so that float32 arguments are passed correctly.
   738
   739Registers R18 - R29 and F13 - F31 are considered scratch registers.
   740
   741#### Stack layout
   742
   743The stack pointer, R1, grows down and is aligned to 8 bytes in Go, but the
   744alignment is changed to 16 bytes when calling cgo.
   745
   746A function's stack frame, after the frame is created, is laid out as
   747follows:
   748
   749    +------------------------------+
   750    | ... locals ...               |
   751    | ... outgoing arguments ...   |
   752    | 24  TOC register R2 save     | When compiled with -shared/-dynlink
   753    | 16  Unused in Go             | Not used in Go
   754    |  8  CR save                  | nonvolatile CR fields
   755    |  0  return PC                | ← R1 points to
   756    +------------------------------+ ↓ lower addresses
   757
   758The "return PC" is loaded to the link register, LR, as part of the
   759ppc64 `BL` operations.
   760
   761On entry to a non-leaf function, the stack frame size is subtracted from R1 to
   762create its stack frame, and the value of LR is saved at the bottom of the frame.
   763
   764A leaf function that does not require any stack space does not modify R1 and
   765does not save LR.
   766
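As a rough model of the ppc64 frame layout and the two prologue cases described above, the following sketch tracks R1 and a sparse memory map. The constant and function names are hypothetical. The offsets come from the layout diagram, and `fixedSize` is an inference from that diagram, not a value stated in this document. Leaf functions that do need stack space are not modeled.

```go
package main

import "fmt"

// Offsets from R1 of the fixed slots at the bottom of a ppc64 frame,
// as shown in the layout diagram. Names are illustrative.
const (
	offReturnPC = 0  // saved LR
	offCRSave   = 8  // nonvolatile CR fields
	offUnused   = 16 // not used in Go
	offTOCSave  = 24 // R2 save when compiled with -shared/-dynlink
	fixedSize   = 32 // assumed start of the outgoing argument area
)

// enterNonLeaf models the prologue of a non-leaf function: it subtracts the
// frame size from R1 and stores LR at the bottom of the new frame.
// mem is a hypothetical sparse memory keyed by address.
func enterNonLeaf(r1, lr, frameSize uint64, mem map[uint64]uint64) uint64 {
	r1 -= frameSize
	mem[r1+offReturnPC] = lr
	return r1
}

// enterFramelessLeaf models a leaf function that needs no stack space:
// R1 is unchanged and LR is not saved.
func enterFramelessLeaf(r1 uint64) uint64 {
	return r1
}

func main() {
	mem := map[uint64]uint64{}
	r1 := enterNonLeaf(0x1000, 0xabcd, 64, mem)
	fmt.Printf("non-leaf: R1=%#x saved LR=%#x\n", r1, mem[r1+offReturnPC])
	fmt.Printf("leaf:     R1=%#x\n", enterFramelessLeaf(r1))
}
```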
   767*NOTE*: We might need to save the frame pointer on the stack as
   768in the PPC64 ELF v2 ABI so Go can inter-operate with platform debuggers
   769and profilers.
   770
   771This stack layout is used by both register-based (ABIInternal) and
   772stack-based (ABI0) calling conventions.
   773
   774#### Flags
   775
   776The condition register consists of 8 condition code register fields
   777CR0-CR7. Go generated code only sets and uses CR0, commonly set by
   778compare instructions and used to determine the target of a conditional
   779branch. The generated code does not set or use CR1-CR7.
   780
   781The floating point status and control register (FPSCR) is initialized
   782to 0 by the kernel at startup of the Go program and not changed by
   783the Go generated code.
   784
   785### riscv64 architecture
   786
   787The riscv64 architecture uses X10 – X17, X8, X9, X18 – X23 for integer arguments
   788and results.
   789
   790It uses F10 – F17, F8, F9, F18 – F23 for floating-point arguments and results.
   791
   792Special-purpose registers used within Go generated code and Go
   793assembly code are as follows:
   794
   795| Register | Call meaning | Return meaning | Body meaning |
   796| --- | --- | --- | --- |
   797| X0  | Zero value | Same | Same |
   798| X1  | Link register | Link register | Scratch |
   799| X2  | Stack pointer | Same | Same |
   800| X3  | Global pointer | Same | Used by dynamic linker |
   801| X4  | TLS (thread pointer) | TLS | Scratch |
   802| X24,X25 | Scratch | Scratch | Used by duffcopy, duffzero |
   803| X26 | Closure context pointer | Scratch | Scratch |
   804| X27 | Current goroutine | Same | Same |
   805| X31 | Scratch | Scratch | Scratch |
   806
   807*Rationale*: These register meanings are compatible with Go’s
   808stack-based calling convention. The context register will change from X20 to X26,
   809and the duffcopy and duffzero registers will change to X24 and X25, before this register ABI is adopted.
   810X10 – X17, X8, X9, X18 – X23 follow the same order as A0 – A7, S0 – S7 in the platform ABI.
   811F10 – F17, F8, F9, F18 – F23 follow the same order as FA0 – FA7, FS0 – FS7 in the platform ABI.
   812X8 – X23 and F8 – F15 are used by compressed instructions (RVC), which will benefit code size in the future.
   813
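The integer register ordering above can be written out as a short Go sketch. The function names `intArgRegs` and `platformName` are hypothetical. The name mapping assumes the standard RISC-V convention that X10 – X17 are A0 – A7, X8 and X9 are S0 and S1, and X18 – X23 are S2 – S7.

```go
package main

import "fmt"

// intArgRegs returns the riscv64 integer argument registers in the
// assignment order given above: X10-X17, then X8, X9, then X18-X23.
func intArgRegs() []int {
	var regs []int
	for x := 10; x <= 17; x++ {
		regs = append(regs, x)
	}
	regs = append(regs, 8, 9)
	for x := 18; x <= 23; x++ {
		regs = append(regs, x)
	}
	return regs
}

// platformName maps an X register number to its platform ABI name.
// Only the registers used for arguments are covered; anything else
// returns "?" because it is out of scope for this sketch.
func platformName(x int) string {
	switch {
	case x >= 10 && x <= 17:
		return fmt.Sprintf("A%d", x-10)
	case x == 8 || x == 9:
		return fmt.Sprintf("S%d", x-8)
	case x >= 18 && x <= 23:
		return fmt.Sprintf("S%d", x-16)
	}
	return "?"
}

func main() {
	for i, x := range intArgRegs() {
		fmt.Printf("arg %2d: X%-2d (%s)\n", i, x, platformName(x))
	}
}
```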
   814#### Stack layout
   815
   816The stack pointer, X2, grows down and is aligned to 8 bytes.
   817
   818A function's stack frame, after the frame is created, is laid out as
   819follows:
   820
   821    +------------------------------+
   822    | ... locals ...               |
   823    | ... outgoing arguments ...   |
   824    | return PC                    | ← X2 points to
   825    +------------------------------+ ↓ lower addresses
   826
   827The "return PC" is loaded to the link register, X1, as part of the
   828riscv64 `CALL` operation.
   829
   830#### Flags
   831
   832riscv64 has the Zicsr extension for control and status registers (CSRs),
   833which are treated as scratch registers.
   834All bits in CSR are system flags and are not modified by Go.
   835
   836## Future directions
   837
   838### Spill path improvements
   839
   840The ABI currently reserves spill space for argument registers so the
   841compiler can statically generate an argument spill path before calling
   842into `runtime.morestack` to grow the stack.
   843This ensures there will be sufficient spill space even when the stack
   844is nearly exhausted and keeps stack growth and stack scanning
   845essentially unchanged from ABI0.
   846
   847However, this wastes stack space (the median wastage is 16 bytes per
   848call), resulting in larger stacks and increased cache footprint.
   849A better approach would be to reserve stack space only when spilling.
   850One way to ensure enough space is available to spill would be for
   851every function to ensure there is enough space for the function's own
   852frame *as well as* the spill space of all functions it calls.
   853For most functions, this would change the threshold for the prologue
   854stack growth check.
   855For `nosplit` functions, this would change the threshold used in the
   856linker's static stack size check.
   857
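A minimal sketch of the adjusted threshold follows. The function name `stackNeeded` and the numbers in `main` are hypothetical. The sketch assumes that, because only one callee is active at a time, reserving for the callee with the largest spill requirement is sufficient; the text above does not specify this detail.

```go
package main

import "fmt"

// stackNeeded returns the number of bytes a prologue stack check (or the
// linker's nosplit check) would have to guarantee if spill space were
// reserved by the callee: the function's own frame plus the largest spill
// requirement among the functions it calls.
func stackNeeded(frame int, calleeSpill []int) int {
	maxSpill := 0
	for _, s := range calleeSpill {
		if s > maxSpill {
			maxSpill = s
		}
	}
	return frame + maxSpill
}

func main() {
	// Hypothetical function with a 48-byte frame that calls three
	// functions needing 0, 16, and 40 bytes of spill space.
	fmt.Println(stackNeeded(48, []int{0, 16, 40}))
}
```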
   858Allocating spill space in the callee rather than the caller may also
   859allow for faster reflection calls in the common case where a function
   860takes only register arguments, since it would allow reflection to make
   861these calls directly without allocating any frame.
   862
   863The statically-generated spill path also increases code size.
   864It is possible to instead have a generic spill path in the runtime, as
   865part of `morestack`.
   866However, this complicates reserving the spill space, since spilling
   867all possible register arguments would, in most cases, take
   868significantly more space than spilling only those used by a particular
   869function.
   870Some options are to spill to a temporary space and copy back only the
   871registers used by the function, or to grow the stack if necessary
   872before spilling to it (using a temporary space if necessary), or to
   873use a heap-allocated space if insufficient stack space is available.
   874These options all add enough complexity that we will have to make this
   875decision based on the actual code size growth caused by the static
   876spill paths.
   877
   878### Clobber sets
   879
   880As defined, the ABI does not use callee-save registers.
   881This significantly simplifies the garbage collector and the compiler's
   882register allocator, but at some performance cost.
   883A potentially better balance for Go code would be to use *clobber
   884sets*: for each function, the compiler records the set of registers it
   885clobbers (including those clobbered by functions it calls) and any
   886register not clobbered by function F can remain live across calls to
   887F.
   888
   889This is generally a good fit for Go because Go's package DAG allows
   890function metadata like the clobber set to flow up the call graph, even
   891across package boundaries.
   892Clobber sets would require relatively little change to the garbage
   893collector, unlike general callee-save registers.
   894One disadvantage of clobber sets over callee-save registers is that
   895they don't help with indirect function calls or interface method
   896calls, since static information isn't available in these cases.
   897
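One way to picture how clobber sets could flow up the call graph is a fixed-point computation over register bitmasks. This is a sketch under assumptions, not a description of the compiler: `RegSet`, `Func`, and `ClobberSets` are hypothetical names. A callee with no recorded metadata is treated as clobbering every register, which mirrors the limitation with indirect and interface calls noted above.

```go
package main

import "fmt"

// RegSet is a bitmask of registers; bit i set means register i is clobbered.
type RegSet uint64

// Func is a hypothetical per-function record: the registers its own body
// writes, plus the names of the functions it calls directly.
type Func struct {
	Own     RegSet
	Callees []string
}

// ClobberSets computes, for every function, the union of its own clobbers
// and those of everything it can reach. It iterates to a fixed point so
// that recursion and mutual recursion are handled.
func ClobberSets(funcs map[string]Func) map[string]RegSet {
	sets := make(map[string]RegSet, len(funcs))
	for name, f := range funcs {
		sets[name] = f.Own
	}
	for changed := true; changed; {
		changed = false
		for name, f := range funcs {
			s := sets[name]
			for _, c := range f.Callees {
				cs, ok := sets[c]
				if !ok {
					cs = ^RegSet(0) // unknown callee: assume everything is clobbered
				}
				s |= cs
			}
			if s != sets[name] {
				sets[name] = s
				changed = true
			}
		}
	}
	return sets
}

func main() {
	funcs := map[string]Func{
		"leaf":   {Own: 1<<0 | 1<<1},
		"middle": {Own: 1 << 2, Callees: []string{"leaf"}},
		"top":    {Own: 1 << 5, Callees: []string{"middle"}},
	}
	sets := ClobberSets(funcs)
	// A value held in register 3 can stay live across a call to "middle"
	// because register 3 is not in its clobber set.
	live := RegSet(1 << 3)
	fmt.Println(live&sets["middle"] == 0)
}
```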
   898### Large aggregates
   899
   900Go encourages passing composite values by value, and this simplifies
   901reasoning about mutation and races.
   902However, this comes at a performance cost for large composite values.
   903It may be possible to instead transparently pass large composite
   904values by reference and delay copying until it is actually necessary.
   905
   906## Appendix: Register usage analysis
   907
   908In order to understand the impacts of the above design on register
   909usage, we
   910[analyzed](https://github.com/aclements/go-misc/tree/master/abi) the
   911impact of the above ABI on a large code base: cmd/kubelet from
   912[Kubernetes](https://github.com/kubernetes/kubernetes) at tag v1.18.8.
   913
   914The following table shows the impact of different numbers of available
   915integer and floating-point registers on argument assignment:
   916
   917```
   918|      |        |       |      stack args |          spills |     stack total |
   919| ints | floats | % fit | p50 | p95 | p99 | p50 | p95 | p99 | p50 | p95 | p99 |
   920|    0 |      0 |  6.3% |  32 | 152 | 256 |   0 |   0 |   0 |  32 | 152 | 256 |
   921|    0 |      8 |  6.4% |  32 | 152 | 256 |   0 |   0 |   0 |  32 | 152 | 256 |
   922|    1 |      8 | 21.3% |  24 | 144 | 248 |   8 |   8 |   8 |  32 | 152 | 256 |
   923|    2 |      8 | 38.9% |  16 | 128 | 224 |   8 |  16 |  16 |  24 | 136 | 240 |
   924|    3 |      8 | 57.0% |   0 | 120 | 224 |  16 |  24 |  24 |  24 | 136 | 240 |
   925|    4 |      8 | 73.0% |   0 | 120 | 216 |  16 |  32 |  32 |  24 | 136 | 232 |
   926|    5 |      8 | 83.3% |   0 | 112 | 216 |  16 |  40 |  40 |  24 | 136 | 232 |
   927|    6 |      8 | 87.5% |   0 | 112 | 208 |  16 |  48 |  48 |  24 | 136 | 232 |
   928|    7 |      8 | 89.8% |   0 | 112 | 208 |  16 |  48 |  56 |  24 | 136 | 232 |
   929|    8 |      8 | 91.3% |   0 | 112 | 200 |  16 |  56 |  64 |  24 | 136 | 232 |
   930|    9 |      8 | 92.1% |   0 | 112 | 192 |  16 |  56 |  72 |  24 | 136 | 232 |
   931|   10 |      8 | 92.6% |   0 | 104 | 192 |  16 |  56 |  72 |  24 | 136 | 232 |
   932|   11 |      8 | 93.1% |   0 | 104 | 184 |  16 |  56 |  80 |  24 | 128 | 232 |
   933|   12 |      8 | 93.4% |   0 | 104 | 176 |  16 |  56 |  88 |  24 | 128 | 232 |
   934|   13 |      8 | 94.0% |   0 |  88 | 176 |  16 |  56 |  96 |  24 | 128 | 232 |
   935|   14 |      8 | 94.4% |   0 |  80 | 152 |  16 |  64 | 104 |  24 | 128 | 232 |
   936|   15 |      8 | 94.6% |   0 |  80 | 152 |  16 |  64 | 112 |  24 | 128 | 232 |
   937|   16 |      8 | 94.9% |   0 |  16 | 152 |  16 |  64 | 112 |  24 | 128 | 232 |
   938|    ∞ |      8 | 99.8% |   0 |   0 |   0 |  24 | 112 | 216 |  24 | 120 | 216 |
   939```
   940
   941The first two columns show the number of available integer and
   942floating-point registers.
   943The first row shows the results for 0 integer and 0 floating-point
   944registers, which is equivalent to ABI0.
   945We found that any reasonable number of floating-point registers has
   946the same effect, so we fixed it at 8 for all other rows.
   947
   948The “% fit” column gives the fraction of functions where all arguments
   949and results are register-assigned and no arguments are passed on the
   950stack.
   951The three “stack args” columns give the median, 95th and 99th
   952percentile number of bytes of stack arguments.
   953The “spills” columns likewise summarize the number of bytes in
   954on-stack spill space.
   955And “stack total” summarizes the sum of stack arguments and on-stack
   956spill slots.
   957Note that these are three different distributions; for example,
   958there’s no single function that takes 0 stack argument bytes, 16 spill
   959bytes, and 24 total stack bytes.
   960
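The point about separate distributions can be checked with a small made-up data set. The sketch below uses a nearest-rank percentile, which is an assumption; the original analysis may have used a different definition. The three functions and their byte counts are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile of xs (0 < p <= 100).
// xs must be non-empty. A copy is sorted so the caller's slice is untouched.
func percentile(xs []int, p int) int {
	s := append([]int(nil), xs...)
	sort.Ints(s)
	idx := (p*len(s)+99)/100 - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

func main() {
	// Three hypothetical functions: bytes of stack arguments and spill space.
	stackArgs := []int{0, 0, 40}
	spills := []int{16, 24, 0}
	total := make([]int, len(stackArgs))
	for i := range total {
		total[i] = stackArgs[i] + spills[i]
	}
	// The medians are 0, 16, and 24, yet no single function has
	// 0 stack bytes, 16 spill bytes, and 24 total bytes.
	fmt.Println(percentile(stackArgs, 50), percentile(spills, 50), percentile(total, 50))
}
```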
   961From this, we can see that the fraction of functions that fit entirely
   962in registers grows very slowly once it reaches about 90%, though
   963curiously there is a small minority of functions that could benefit
   964from a huge number of registers.
   965Making 9 integer registers available on amd64 puts it in this realm.
   966We also see that the stack space required for most functions is fairly
   967small.
   968While the increasing space required for spills largely balances out
   969the decreasing space required for stack arguments as the number of
   970available registers increases, there is a general reduction in the
   971total stack space required with more available registers.
   972This does, however, suggest that eliminating spill slots in the future
   973would noticeably reduce stack requirements.
