SiFive - September 28, 2018

Last Week in RISC-V: Sept 28, 2018

I spent too much time writing my single entry this week so it's all I have. The issue itself is somewhat complex, so I thought it warranted a deep dive.

Relaxing R_RISCV_PCREL_* in the Presence of Addends

Jim recently found an issue with our ELF relocation scheme. While technically it's a somewhat wide-ranging ABI bug, the actual set of cases in which the issue will manifest is somewhat restricted -- in other words: while I'm sure the proper long-term fix for this will be quite involved, it probably isn't biting too many people in practice right now. If you're unsure about how ELF relocations and linker relaxation work on RISC-V systems, some of the early entries in my "All Aboard" series go into greater detail. For this week's summary, I'll restrict myself to the specific issues at hand.

The vast majority of RISC-V relocations follow the pattern of having one relocation to fill out the high 20 bits of a 32-bit address (relocations that follow the naming pattern R_RISCV_*HI20), and a second relocation to fill out the low 12 bits of a 32-bit address (relocations that follow the naming pattern R_RISCV_*LO12*). This R_RISCV*_HI20/R_RISCV*_LO12_* pattern is baked into RISC-V's instruction encodings, with the U format providing the high 20 bits and the I or S formats proving the low 12 bits (hence the names *_LO12_S and *_LO12_I). While the issue at hand applies to all relocations that follow this pattern, the toolchain can handle this extra complexity for the vast majority of these relocations -- that's why it took us so long to notice! It turns out that there is one specific case where the ABI we've chosen where this is difficult to do: the case where multiple LO12 relocations share the same HI20 relocation but have different addends.

Unfortunately this toolchain bug manifests as incorrect code generation from the linker, which is pretty much the worst sort of bug -- nobody understands the linker, so it's a long exercise for users to track down the bug and then a lot of work to fix it. In practice this bug it is somewhat unlikely to actually manifest in real code, but since it's a linker bug it's probably best to understand what is going on with an example before we dig into exactly why this bug is unlikely to manifest.

A Concrete Manifestation of this Bug

Like most of the examples I write it's somewhat contrived, I'll discuss how this applies to more likely user code later. The following assembly program is nonsensical but legal:

.text
.global _start
_start:
.balign 4096
.fill 2048, 1, 0
        1: auipc t0, %pcrel_hi(global)
        addi  t1, t0, %pcrel_lo(1b)+0
        addi  t0, t0, %pcrel_lo(1b)+8

.data
.balign 4096
global:
.fill 2, 8, 0

it's a fairly simple program: it takes the address of a global symbol and an offset into that global symbol. When assembled it generates a single R_RISCV_PCREL_HI20 relocation to fill out the top 20 bits of the auipc, and then a pair of R_RISCV_PCREL_LO12_I relocations to fill out each addi. This program assembles and links correctly, resulting in the pre-linked object file you'd expect

    17fe:	00000297          	auipc	t0,0x0
			17fe: R_RISCV_PCREL_HI20	global
			17fe: R_RISCV_RELAX	*ABS*
    1802:	00028313          	mv	t1,t0
			1802: R_RISCV_PCREL_LO12_I	.L11
			1802: R_RISCV_RELAX	*ABS*
    1806:	00028293          	mv	t0,t0
			1806: R_RISCV_PCREL_LO12_I	.L11+0x8
			1806: R_RISCV_RELAX	*ABS*+0x8

and a sensible executable

   11800:	00003297          	auipc	t0,0x3
   11804:	80028313          	addi	t1,t0,-2048 # 14000 <global>
   11808:	80828293          	addi	t0,t0,-2040

Things start to get interesting when we slightly perturb this program. In this case, let's look at what happens when we change the alignment of the auipc. At the assembly source level this is a straight-forward change (because I carefully constructed the example that way):

.global _start
_start:
.balign 4096
.fill 2050, 1, 0
        1: auipc t0, %pcrel_hi(global)
        addi  t1, t0, %pcrel_lo(1b)+0
        addi  t0, t0, %pcrel_lo(1b)+8

.data
.balign 4096
global:
.fill 2, 8, 0

In this case we generate a pre-linked object file that looks like you'd expect, essentially the same as what's above but with different alignment

     1800:	00000297          	auipc	t0,0x0
 			1800: R_RISCV_PCREL_HI20	global
 			1800: R_RISCV_RELAX	*ABS*
     1804:	00028313          	mv	t1,t0
 			1804: R_RISCV_PCREL_LO12_I	.L11
 			1804: R_RISCV_RELAX	*ABS*
     1808:	00028293          	mv	t0,t0
 			1808: R_RISCV_PCREL_LO12_I	.L11+0x8
 			1808: R_RISCV_RELAX	*ABS*+0x8

The interesting thing here is that this slight input perturbation results in an incorrect binary after linking

   11802:	00002297          	auipc	t0,0x2
   11806:	7fe28313          	addi	t1,t0,2046 # 14000 <global>
   1180a:	80628293          	addi	t0,t0,-2042

Essentially what's going on here is that our relocations are capable of providing 2^32 distinct bit patterns, but in the presence of pair relocations with distinct addends we need to be capable of generating 2^32+1 bit patterns. Like many bugs of this sort, it's quite obvious once it's been described but fairly difficult to find in practice.

Why this Probably isn't Manifesting in Your Program

If you've managed to make it this far you're probably getting pretty worried: a linker bug that manifests as silent incorrect code generation is a pretty hairy thing to deal with. While I agree, it turns out this bug has lurked for so long because it's somewhat unlikely to manifest in real application code under the set of compiler options that are actually sensible to use. We are of course working on fixing the bug (and there's already patches to at least cause it to manifest as an explicit failure to link), but a fully performant fix is somewhat tricky.

This bug can only manifest under a very specific set of conditions:

  • There must be multiple LO12 relocations that share a HI20 relocation but have different addends.
  • The result of the HI20 relocation must be misaligned WRT to final target of the LO12 relocations.

As long as you're writing C code and haven't written your own GCC optimization passes, these conditions are mutually exclusive:

  • GCC doesn't know how to share a single auipc between multiple accesses, so the only way to end up with a shared relocation for the top 20 bits of an address is via lui.
  • GCC always aligns the intermediate results of the lui-based relocations. This alignment is implicit: GCC aligns the target symbol, which for lui-based addressing results in an intermediate result that is always aligned.

Importantly, the second case is not true for our auipc-based relocations. Since auipc includes the current PC as part of the intermediate result, aligning the target symbol is not sufficient to align the intermediate result.

What's the Takeaway for a User of the Toolchain?

This article has been somewhat involved. I wrote it because I think it's an interesting dive into a toolchain bug, but most people reading this probably aren't interested in that level of detail and just want to know what to do. My official recommendation is simple: don't use -mcmodel=medany on an RV32I target. That's actually exactly the same recommendation as I would have provided before we found this bug. The key point here is that -mcmodel=medany doesn't actually provide any benefit on RV32I targets: -mcmodel=medlow can already generate every possible 32-bit address, so all -mcmodel=medany does is add a bit of complexity to the address generation. I used to think that all that did was generate slightly worse code, but now that we've found a correctness bug my recommendation is simply a bit stronger.

The only users who are likely to need -mcmodel=medany are those on 64-bit systems. While you're unlikely to find a bug in -mcmodel=medany -mexplicit-relocs because GCC can't perform constant subexpression elimination on auipc, I'd still recommend avoiding -mcmodel=medany -mexplicit-relocs. While -mcmodel=medany -mexplicit-relocs is necessary to squeeze the last fer percent out of Dhrystone, we've found it actually generates worse code for real programs which is why it's not the toolchain default. If you're really that worried about a few percent performance in Dhrystone then I'd recommend finding a better benchmark.

I also explicitly recommend that users do not try to backport the various patches to fix this bug themselves and instead wait for us to completely verify these bug fixes and add them to our backport branches. The workarounds above are really the right thing to do even without the bug, and due to the subtlety of what's going on here, you're more likely to end up introducing a bug than fixing one.

Events