SiFive - August 28, 2017
All Aboard, Part 3: Linker Relaxation in the RISC-V Toolchain
Last week's blog entry discussed relocations and how they apply to the RISC-V toolchain. This week we'll be delving a bit deeper into the RISC-V linker to discuss linker relaxation, a concept so important it has greatly shaped the design of the RISC-V ISA. Linker relaxation is a mechanism for optimizing programs at link-time, as opposed to traditional program optimization which happens at compile-time. This blog will follow an example linker relaxation through the toolchain, demonstrate an example of how linker relaxations meaningfully improve the performance of a real program and introduce a new RISC-V relocation. We'll shy away from discussing the impact of linker relaxations on the RISC-V ISA, until another blog entry.
Just like last time, we'll start with a simple C test program that's not linked against anything else. This program won't perform a sane computation, the goal is just that it's simple enough to get the point across. I'll skip the assembly this time: as this post is about the linker, we can't really discuss anything until we get to an object file. Since you're now an expert in the toolchain, I'll just blast out some commands here:
$ cat test.c
int func(int a) __attribute__((noinline));
int func(int a) {
return a + 1;
}
int _start(int a) {
return func(a);
}
$ riscv64-unknown-linux-gnu-gcc test.c -o test -O3
$ riscv64-unknown-linux-gnu-objdump -d -r test.o
test.o: file format elf64-littleriscv
Disassembly of section .text:
0000000000000000 <func>:
0: 2505 addiw a0,a0,1
2: 8082 ret
0000000000000004 <_start>:
4: 00000317 auipc ra,0x0
4: R_RISCV_CALL func
4: R_RISCV_RELAX *ABS*
8: 00030067 jr ra
You can now see a new RISC-V relocation: R_RISCV_CALL
. This
relocation sits between an auipc
and a jalr
instruction (here
disassembled as the jr
shorthand as this is a tail call) and points to
the symbol that should be the target of the jump, in this case the
func
symbol. The R_RISCV_CALL
relocation is paired with a
R_RISCV_RELAX
relocation, which allows the linker to relax this
relocation pair -- the whole point of this blog!
In order to understand relaxation, we first must examine the RISC-V ISA a
bit. In the RISC-V ISA there are two unconditional control transfer
instructions: jalr
, which jumps to an absolute address as specified by
an immediate offset from a register; and jal
, which jumps to a
pc-relative offset as specified by an immediate. The only differences
between the auipc
+jalr
pair in this object file and a single jal
are that the pair can address a 32-bit signed offset from the current PC
while the jal
can only address a 21-bit signed offset from the current
PC, and that the jal
instruction is half the size (which is a good
proxy for twice the speed).
As the compiler cannot know if the offset between _start
and func
will fit within a 21-bit offset, it is forced to generate the longer
call. We don't want to impose this cost in cases where it's not
necessary, so we instead optimize this case in the linker. Let's look
at the executable to see the result of linker relaxation:
$ riscv64-unknown-linux-gnu-objdump -d -r test
test: file format elf64-littleriscv
Disassembly of section .text:
0000000000010078 <func>:
10078: 2505 addiw a0,a0,1
1007a: 8082 ret
000000000001007c <_start>:
1007c: ffdff06f j 10078 <func>
As you can see, the linker knows that the call from _start
to func
fits within the 21-bit offset of the jal
instruction and converts it
to a single instruction.
The RISC-V Implementation of Linker Relaxation
While the concept of linker relaxation is fairly straight-forward, there
are a lot of tricky details that need to be done correctly in order to
ensure the linker produces the correct symbol addresses everywhere. To
the best of my knowledge, the RISC-V BFD port make the most aggressive
use of linker relaxations: essentially no .text
section symbol
addresses are known until link time. This has a few interesting side
effects:
.align
directives must be handled by the linker for any relaxable sections.- Debug information must be emitted twice: once by the compiler for the object file and once again by the linker for the executable.
All these points probably warrant blog posts of their own -- some of these are planned, others require me to fix some bugs before I feel comfortable talking about them :).
The actual implementation of linker relaxation is, as you'd expect,
fairly esoteric. The code lives in _bfd_riscv_relax_section
inside
binutils-gdb/bfd/elfnn-riscv.c
, which looks roughly like the
following:
_bfd_riscv_relax_section:
if section shouldn't be relaxed:
return DONE
for each relocation:
if relocation is relaxable:
store per-relocation function pointer
read the symbol table
obtain the symbol's address
call the per-relocation function
Essentially, all it's doing is some shared bookkeeping code and then
calling a relocation-specific function to actually relax the relocation.
The relax functions all look somewhat similar, so I'll show an example
of the function that relaxes R_RISCV_CALL
relocation that was
discussed above
_bfd_riscv_relax_call:
compute a pessimistic address range
if relocation doesn't fit into a UJ-type immediate:
return DONE
compute offsets for various short jumps
if RVC is enabled and the relocation fits in a C.JAL:
convert the jump to c.jal
if relocation fits in an JAL:
convert the jump to a jal
if call target is near absolute address 0:
convert the jump to a x0-offset JALR
delete the JALR, as it's not necessary anymore
While this specific function only relaxes the R_RISCV_CALL
relocation,
it follows the pattern that most of the implementations of the
relaxation functions do:
generic_relax_function:
add some slack to the address, as all addresses can move
for each possible instruction to shorten the relocation:
if possible instruction can fit the target address:
replace the relocation
cleanup
return DONE
return DONE
There's one of these functions for each class of relocations that the RISC-V toolchain knows how to relax:
_bfd_riscv_relax_call
: relaxes two-instructionauipc
+jalr
sequences via theR_RISCV_CALL
andR_RISCV_CALL_PLT
relocations._bfd_riscv_relax_lui
: relaxes two-instructionlui
+addi
-like sequences via theR_RISCV_HI20
+R_RISCV_LO12_I
-like relocation pairs. The second instruction/relocation can be any of the various instructions that matches aR_RISCV_LO12_I
orR_RISCV_LO12_S
relocation (addi
,lw
,sd
, etc)._bfd_riscv_relax_pc
: Relaxes two-instructionauipc
+addi
-like sequences via theR_RISCV_PCREL_HI20'+'R_RISCV_PCREL_LO12_I
-like relocation pairs. Much like thelui
case there's a handful of relocation types possible for the second one, all of which are PC-relative._bfd_riscv_relax_tls_le
: Relaxes thread local storage references when using the local executable model. We'll talk about TLS is a later blog, as there's a lot going on here._bfd_riscv_relax_align
: Relaxes.align
directives in text sections. This is another on we'll discuss later, but one specific interesting constraint here is thatR_RISCV_ALIGN
relocations must be relaxed for correctness, which means they're relaxed even when relaxations are otherwise disabled.
Relaxing Against the Global Pointer
It may seem like linker relaxation involves a huge amount of
complexity for a small gain: we trade knowing no .text
section symbol
addresses until link time for shortening a few sequences by a single
instruction. As it turns out, linker relaxation is very important for
getting good performance on real code. For our first time looking at
real code, we'll take a look at the Dhrystone benchmark -- in addition
to being super simple, Dhrystone also spends a lot of time loading from
global variables and therefore benefits very clearly from linker
relaxation.
Let's take a look at the Dhrystone source code first. While it's a bit more complicated than the examples that have been present in this blog so far, if you look closely the code is actually pretty straightforward. Here's the source for one Dhrystone function, along with the definitions of the various global variables it references:
/* Global Variables: */
Boolean Bool_Glob;
char Ch_1_Glob,
Ch_2_Glob;
Proc_4 () /* without parameters */
/*******/
/* executed once */
{
Boolean Bool_Loc;
Bool_Loc = Ch_1_Glob == 'A';
Bool_Glob = Bool_Loc | Bool_Glob;
Ch_2_Glob = 'B';
} /* Proc_4 */
As you can see, the code performs three accesses to global variables in order to do a simple comparison and a logical operation. While this might seem kind of silly, this is what a lot of the Dhrystone benchmark looks like. Since Dhrystone is pretty much the only benchmark that will actually run everywhere (SPECInt won't run on my wristwatch, for example), it's still used as the baseline for many microarchitectural comparisons so we need to make it go fast.
In order to understand the specific relaxation that's being performed in this case, it's probably best to start with the code the toolchain generates before this optimization, which I've copied below:
0000000040400826 <Proc_4>:
40400826: 3fc00797 auipc a5,0x3fc00
4040082a: f777c783 lbu a5,-137(a5) # 8000079d <Ch_1_Glob>
4040082e: 3fc00717 auipc a4,0x3fc00
40400832: f7272703 lw a4,-142(a4) # 800007a0 <Bool_Glob>
40400836: fbf78793 addi a5,a5,-65
4040083a: 0017b793 seqz a5,a5
4040083e: 8fd9 or a5,a5,a4
40400840: 3fc00717 auipc a4,0x3fc00
40400844: f6f72023 sw a5,-160(a4) # 800007a0 <Bool_Glob>
40400848: 3fc00797 auipc a5,0x3fc00
4040084c: 04200713 li a4,66
40400850: f4e78a23 sb a4,-172(a5) # 8000079c <Ch_2_Glob>
40400854: 8082 ret
As you can see, this function consists of 13 instructions, 4 of which
are auipc
instructions. All of these auipc
instructions are used to
calculate the addresses of global variables for a subsequent memory
access, and all of these generated addresses are within a 12-bit offset
of each other. If you're thinking "we only really need one of these
auipc
instructions", you're both right and wrong: while we could
generate a single auipc
(though that requires some GCC work we haven't
done yet and is thus the subject of a future blog post), we can actually
do one better and get by with zero auipc
instructions!
If you've just gone and pored over the RISC-V ISA manual to find an
instruction that loads Ch_1_Glob
(which lives at 0x8000079D
) in a
single instruction then you should give up now, as there isn't one.
There is, of course, a trick -- it is common on register-rich,
addressing-mode-poor ISAs to have a dedicated ABI register known as the
global pointer that contains an address in the .data
segment. The
linker is then capable of relaxing accesses to global variables that
live within a 12-bit signed offset from this value -- essentially we've
just cached the lui
in the global pointer register, optimizing this
common code path.
In order to get a bit more visibility into how this works, let's take a look at a snippet of GCC's default linker script for RISC-V:
/* We want the small data sections together, so single-instruction offsets
can access them all, and initialized data all before uninitialized, so
we can shorten the on-disk segment size. */
.sdata :
{
__global_pointer$ = . + 0x800;
*(.srodata.cst16) *(.srodata.cst8) *(.srodata.cst4) *(.srodata.cst2) *(.srodata .srodata.*)
*(.sdata .sdata.* .gnu.linkonce.s.*)
}
_edata = .; PROVIDE (edata = .);
. = .;
__bss_start = .;
.sbss :
{
*(.dynsbss)
*(.sbss .sbss.* .gnu.linkonce.sb.*)
*(.scommon)
As you can see, the magic __global_pointer$
symbol is defined to point
0x800
bytes past the start of the .sdata
section. The 0x800
magic
number allows signed 12-bit offsets from __global_pointer$
to address
symbols at the start of the .sdata
section. The linker assumes that
if this symbol is defined, then the gp
register contains that value,
which it can then use to relax accesses to global symbols within that
12-bit range. The compiler treats the gp
register as a constant so it
doesn't need to be saved or restored, which means it is generally only
written by _start
, the ELF entry point. Here's an example from the
RISC-V newlib port's crt0.S
file
.option push
.option norelax
1:auipc gp, %pcrel_hi(__global_pointer$)
addi gp, gp, %pcrel_lo(1b)
.option pop
Note that we need to disable relaxations while setting gp
, otherwise
the linker would relax this two-instruction sequence to mv gp, gp
The actual implementation of the relaxation, which lives in
_bfd_riscv_relax_lui
and _bfd_riscv_relax_pc
, is fairly boring.
Like all the other relaxations, it performs some bounds checks, deletes
the unused instruction and then converts the short-offset instruction
to a different type. We may delve deeper into the implementation of
various linker relaxations in future blog posts, but for now I'll just
drop the relaxed output here to demonstrate it actually works:
00000000400003f0 <Proc_4>:
400003f0: 8651c783 lbu a5,-1947(gp) # 80001fbd <Ch_1_Glob>
400003f4: 8681a703 lw a4,-1944(gp) # 80001fc0 <Bool_Glob>
400003f8: fbf78793 addi a5,a5,-65
400003fc: 0017b793 seqz a5,a5
40000400: 00e7e7b3 or a5,a5,a4
40000404: 86f1a423 sw a5,-1944(gp) # 80001fc0 <Bool_Glob>
40000408: 04200713 li a4,66
4000040c: 86e18223 sb a4,-1948(gp) # 80001fbc <Ch_2_Glob>
40000410: 00008067 ret
12-bit Offsets aren't Enough for Anyone
Just to be clear: linker relaxations are an optimization for the common case. The linker transparently emits two-instruction addressing sequences for symbols that it cannot optimize. To demonstrate what happens when the linker can't relax a symbol, let's go through another example:
$ cat relax.c
long near;
long far[2];
long data(void) {
return near | far;
}
int main() {
return data();
}
$ riscv64-unknown-linux-gnu-gcc relax.c -O3 -o relax --save-temps
$ riscv64-unknown-linux-gnu-objdump -d relax.o
relax.o: file format elf64-littleriscv
Disassembly of section .text:
0000000000000000 data:
0: 000007b7 lui a5,0x0
0: R_RISCV_HI20 near
0: R_RISCV_RELAX *ABS*
4: 0007b503 ld a0,0(a5) # 0 data
4: R_RISCV_LO12_I near
4: R_RISCV_RELAX *ABS*
8: 000007b7 lui a5,0x0
8: R_RISCV_HI20 far
8: R_RISCV_RELAX *ABS*
c: 0007b783 ld a5,0(a5) # 0 data
c: R_RISCV_LO12_I far
c: R_RISCV_RELAX *ABS*
10: 8d5d or a0,a0,a5
12: 8082 ret
Disassembly of section .text.startup:
0000000000000000 main:
0: 1141 addi sp,sp,-16
2: e406 sd ra,8(sp)
4: 00000097 auipc ra,0x0
4: R_RISCV_CALL data
4: R_RISCV_RELAX *ABS*
8: 000080e7 jalr ra
c: 60a2 ld ra,8(sp)
e: 2501 sext.w a0,a0
10: 0141 addi sp,sp,16
12: 8082 ret
Here we can see three relocations groups: two HI20
/LO12
relocation
pairs and a CALL
relocation. In this case, the CALL
relocation can
be relaxed as can the HI20
/LO12
pair that references near
, but the
HI20
/LO12
pair that references far
can't. In this case the
linker still functions correctly, producing a single-instruction
addressing sequence for the near
symbol it can relax while just
relocating the addressing sequence for the far
symbol that it can't
reference with a single instruction.
$ riscv64-unknown-linux-gnu-objdump -d -r relax
Disassembly of section .text:
0000000000010330 main:
10330: 1141 addi sp,sp,-16
10332: e406 sd ra,8(sp)
10334: 0b8000ef jal ra,103ec data
10338: 60a2 ld ra,8(sp)
1033a: 2501 sext.w a0,a0
1033c: 0141 addi sp,sp,16
1033e: 8082 ret
00000000000103ec data:
103ec: 8181b503 ld a0,-2024(gp) # 12038 near
103f0: 67e9 lui a5,0x1a
103f2: 0407b783 ld a5,64(a5) # 1a040 far
103f6: 8d5d or a0,a0,a5
103f8: 8082 ret
Though it might be a bit redundant at this point, I already had the following examples written out so I figured I'd just leave them here to be a bit more explicit:
--- relax.o
+++ relax
- 4: 00000097 auipc ra,0x0
- 4: R_RISCV_CALL data
- 4: R_RISCV_RELAX *ABS*
- 8: 000080e7 jalr ra
+ 10334: 0b8000ef jal ra,103ec data
In the example above, we can see the R_RISCV_CALL
relocation is requested.
This relocation is defined to operate over an adjacent auipc
/jalr
pair,
referencing a signed 32-bit PC-relative call target. In this case we were able
to relax this instruction pair to a single jal
instruction as the actual
call target was within a 21-bit signed offset from the current PC. You'll find
that almost all R_RISCV_CALL
relocations will be relaxable, as most code
expresses some amount of call locality.
--- relax.o
+++ relax
- 0: 000007b7 lui a5,0x0
- 0: R_RISCV_HI20 near
- 0: R_RISCV_RELAX *ABS*
- 4: 0007b503 ld a0,0(a5) # 0 data
- 4: R_RISCV_LO12_I near
- 4: R_RISCV_RELAX *ABS*
+ 103ec: 8181b503 ld a0,-2024(gp) # 12038 near
In the example above, we can see a R_RISCV_HI20
/R_RISCV_LO12_I
relocation pair is requested. These relocations are each defined to operate
over a single instruction: the R_RISCV_HI20
relocates the 20-bit offset of
a lui
while the R_RISCV_LO12_I
relocates the 12-bit offset of various
I-type instructions (ld
in this example). In this case we were able to
relax this instruction pair to a single ld
instruction, as the final symbol
address was within a 12-bit offset of gp
, the global pointer.
--- relax.o
+++ relax
- 8: 000007b7 lui a5,0x0
- 8: R_RISCV_HI20 far
- 8: R_RISCV_RELAX *ABS*
- c: 0007b783 ld a5,0(a5) # 0 data
- c: R_RISCV_LO12_I far
- c: R_RISCV_RELAX *ABS*
+ 103f0: 67e9 lui a5,0x1a
+ 103f2: 0407b783 ld a5,64(a5) # 1a040 far
In the example above, we see another R_RISCV_HI20
/R_RISCV_LO12_I
, but
this time we can't relax it as it's not within a 12-bit offset of gp
. Note
that we still generate the correct code for this case by filling out the
relocations. You will get a link-time error whenever it is impossible to
correctly relocate a requested relocation, as otherwise the linked executable
wouldn't produce the correct answer.
Stay tuned, as there's a whole lot more to come on linker relaxations in future blog posts.