SiFive - October 09, 2017

All Aboard, Part 6: Booting a RISC-V Linux Kernel

This post begins a short detour into Linux land, during which we'll discuss the RISC-V Linux kernel port. SiFive recently announced the Linux-capable U54-MC RISC-V Core IP, and our Linux port has been submitted to linux-next, Linux's staging branch, so assuming everything goes smoothly we should be merged at the end of the next merge window. Along with Linux, we should soon have the full suite of core system components upstream, both for embedded systems (via binutils, GCC, and newlib) and for larger systems (via binutils, GCC, glibc, and Linux).

This blog series discusses the Linux port as it currently stands, which will naturally involve various aspects of the RISC-V supervisor spec. The hope is that this series will be useful both for people interested in hacking on the RISC-V Linux kernel port and for people porting other operating systems, who can see some of the design decisions we made.

bbl, the Berkeley Boot Loader

The standard RISC-V privilege model contains four modes:

  • User mode is where userspace programs run. This is specified by Part I of the RISC-V ISA manual (the user specification), while all the lower modes are specified by Part II (the supervisor specification).
  • Supervisor mode is where Linux runs.
  • Hypervisor mode is currently left unspecified.
  • Machine mode is the lowest-level and most privileged mode, and is meant to run the machine-specific firmware that might be implemented as microcode on other machines.

One of the overarching goals of the RISC-V supervisor specification is to allow a single kernel image to run on any RISC-V platform. In order to avoid baking assumptions about the hardware into the Linux image, we rely on two abstraction mechanisms:

  • The details of the underlying hardware are described by a device tree, the standard format for describing machines that's used by most modern systems. This specifies the memory map, the configuration of all the harts in the system, and how all the statically allocated devices are attached.
  • The implementation of complicated functionality that would usually be implemented as microcode traps is hidden behind the supervisor binary interface, also known as the SBI. This allows supervisors like Linux to be written against an interface provided by a lower level of the privilege stack, without the hardware complexity of adding a bunch of emulation instructions. (A sketch of what an SBI call looks like from the supervisor's side follows this list.)
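
To make the SBI concrete, here is a minimal sketch of how a supervisor issues an SBI call, modeled on the macro in our port's sbi.h: arguments go in a0-a2, the call number goes in a7, and an ecall instruction traps into the machine-mode code that implements the call. Treat the exact names and call numbers as illustrative.

        /* A minimal sketch of the supervisor side of the SBI, modeled on
         * the macro in our port's sbi.h; details are illustrative. */
        #define SBI_SET_TIMER           0
        #define SBI_CONSOLE_PUTCHAR     1
        #define SBI_CONSOLE_GETCHAR     2
        #define SBI_SHUTDOWN            8

        /* An SBI call looks just like a system call: arguments in a0-a2,
         * the call number in a7, and an ecall to trap into machine mode. */
        #define SBI_CALL(which, arg0, arg1, arg2) ({                    \
                register uintptr_t a0 asm ("a0") = (uintptr_t)(arg0);   \
                register uintptr_t a1 asm ("a1") = (uintptr_t)(arg1);   \
                register uintptr_t a2 asm ("a2") = (uintptr_t)(arg2);   \
                register uintptr_t a7 asm ("a7") = (uintptr_t)(which);  \
                asm volatile ("ecall"                                   \
                              : "+r" (a0)                               \
                              : "r" (a1), "r" (a2), "r" (a7)            \
                              : "memory");                              \
                a0;                                                     \
        })

        static inline void sbi_console_putchar(int ch)
        {
                SBI_CALL(SBI_CONSOLE_PUTCHAR, ch, 0, 0);
        }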

In our RISC-V supervisor-mode implementation, which as far as I know is by far the most widely used one, both the device tree and the SBI are provided by a machine-mode shim known as the Berkeley Boot Loader (or bbl). This is meant to be the first stage of the bootloader that a user might reasonably want to replace, and maps to something like the BIOS or EFI on a PC-style system -- there are ports of both UEFI and LinuxBIOS to RISC-V, but I know very little about them so I'll restrict myself to bbl.

bbl is expected to have been chain-loaded from another boot loader, with its entry point running in machine mode. It is passed a device tree from the prior boot loader stage, and performs the following steps:

  • One hart is selected to be the main hart. The other harts are put to sleep until bbl is ready to transfer control to Linux, at which point they will all be woken up and enter Linux around the same time.
  • The device tree that was passed in from the previous stage is read and filtered. This allows bbl to strip out information that Linux shouldn't be interested in. For example, on SiFive systems we tend to have an extra utility hart that handles various low-level tasks like power management. On our current systems this hart is implemented as a very small core that lacks floating point, caches, and virtual memory. In order to avoid adding a bunch of SiFive-specific logic to Linux, we instead just strip out the harts that Linux can't boot on.
  • All the other harts are woken up so they can set up their PMPs and trap handlers and then enter supervisor mode.
  • The mhartid CSR is read so Linux can be passed a unique per-hart identifier.
  • A PMP entry is set up to allow supervisor mode to access all of memory.
  • Machine-mode trap handlers, including a machine-mode stack, are set up. bbl's machine-mode code needs to handle both unimplemented instructions and machine-mode interrupts.
  • The processor executes an mret to jump from machine mode to supervisor mode (this handoff is sketched after the list).
  • bbl jumps to the start of its payload, which in this case is Linux.
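
The last two steps amount to a few CSR writes followed by an mret. Here's a sketch in C, loosely modeled on bbl's machine-mode code; read_csr, write_csr, INSERT_FIELD, and PRV_S are helpers from bbl's encoding headers, and the details are illustrative rather than exact:

        /* A sketch of the machine -> supervisor handoff, loosely modeled
         * on bbl; treat the details as illustrative. */
        static void enter_supervisor_mode(void (*fn)(uintptr_t, void *),
                                          uintptr_t hartid, void *dtb)
        {
                uintptr_t mstatus = read_csr(mstatus);

                /* mret returns to the mode in mstatus.MPP: set it to S. */
                mstatus = INSERT_FIELD(mstatus, MSTATUS_MPP, PRV_S);
                /* Enter the supervisor with interrupts disabled. */
                mstatus = INSERT_FIELD(mstatus, MSTATUS_MPIE, 0);
                write_csr(mstatus, mstatus);

                /* mret jumps to mepc, so point it at the payload's entry. */
                write_csr(mepc, fn);

                /* Pass the hart ID and device tree per the kernel's entry
                 * convention (a0 and a1, see the next section). */
                register uintptr_t a0 asm ("a0") = hartid;
                register uintptr_t a1 asm ("a1") = (uintptr_t)dtb;
                asm volatile ("mret" : : "r" (a0), "r" (a1));
                __builtin_unreachable();
        }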

Early Boot in Linux

When Linux boots, it expects the system to be in the following state:

  • a0 contains a unique per-hart ID. We currently map these to Linux CPU IDs, so they're expected to be contiguous and close to 0.
  • a1 contains a pointer to the device tree, represented as a binary flattened device tree (DTB).
  • Memory is identity mapped, which bbl accomplishes by not enabling paging.
  • The kernel's ELF image has been loaded correctly, with all the various segments at their correct addresses. This isn't particularly onerous for Linux, as it has a simple ELF image to load.

The first thing Linux needs to do when it boots is handle an impedance mismatch between the RISC-V specification and what Linux expects: RISC-V systems boot harts in an arbitrary order and at arbitrary times, while Linux expects a single hart to boot first and then wake up all other harts. We handle this with the "hart lottery," which is a very short AMO-based sequence that picks the first hart to boot. The rest of the harts spin, waiting for Linux to boot far enough along that they can continue.
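
In head.S the lottery is literally an amoadd.w on a word in memory: the hart that reads back zero wins. The real code is assembly, since it runs before any C environment exists, but the logic, expressed as a C sketch (hart_lottery and the helper below are illustrative), is just:

        /* The hart lottery as a C sketch; the real code is a few assembly
         * instructions in head.S that run before the C runtime exists. */
        static atomic_t hart_lottery = ATOMIC_INIT(0);

        /* True on exactly one hart: the first to increment the counter. */
        static bool won_hart_lottery(void)
        {
                return atomic_fetch_add(1, &hart_lottery) == 0;
        }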

At this point we proceed with a fairly standard Linux early boot process:

  • A linear mapping of all physical memory is set up, with PAGE_OFFSET as the offset.
  • Paging is enabled (the CSR write this boils down to is sketched after this list).
  • The C runtime is set up, which includes the stack pointer and the global pointer.
  • A spin-only trap vector is set up that catches any errors early in the boot process.
  • start_kernel is called to enter the standard Linux boot process.
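
Enabling paging boils down to a single CSR write: the root page table's physical page number and the Sv39 mode go into satp (using the priv-1.10 CSR layout). The real code is the assembly relocate routine in head.S, since the program counter, stack pointer, and global pointer all have to move into the virtual mapping at the same time; enable_paging and SATP_MODE_SV39 below are illustrative names:

        #define SATP_MODE_SV39  (8UL << 60)     /* MODE field: Sv39 */

        /* A sketch of the satp write that turns on paging; the real code
         * is assembly in head.S's relocate routine. */
        static inline void enable_paging(unsigned long root_pa)
        {
                unsigned long satp = (root_pa >> PAGE_SHIFT) | SATP_MODE_SV39;

                asm volatile ("csrw satp, %0" : : "r" (satp) : "memory");
                asm volatile ("sfence.vma" : : : "memory"); /* flush TLBs */
        }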

That concludes the entire assembly early boot section of our port -- by my count, only 71 instructions! We're pretty proud of how little assembly we have in our port; it's a testament to the simplicity of the supervisor ISA specification.

setup_arch

The normal Linux early boot process proceeds until we reach setup_arch, which is the arch-specific setup code that's fairly early in the kernel boot process. On RISC-V systems, setup_arch proceeds to perform the following operations:

  • Enable the EARLY_PRINTK console, if the SBI console driver is enabled. On RISC-V we unconditionally enable early printk: the SBI console is well behaved, so there's no reason not to. Since this happens extremely early in the boot process, you get debugging output pretty much the whole time.
  • The kernel command line is parsed, and the early arch-specific options are dealt with. Here we only bother allowing the user to control the amount of physical memory actually used by Linux.
  • The device tree's memory map is parsed in order to find the kernel image's memory block, which is marked as reserved. The rest of the device tree's memory is released to the kernel for allocation.
  • The memory management subsystem is initialized, including the zero page and various zones. We only support ZONE_NORMAL, so this is quite simple.
  • All the other harts in the system are woken up.
  • The processor's ISA is read from the device tree and used to fill out the HWCAP field in the ELF auxvec, which allows userspace programs to determine which extensions the hardware they're running on supports. For now we assume a homogeneous ISA, as anything else doesn't really map onto the UNIX process model. (The HWCAP logic is sketched after this list.)
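
The last step is a nice example of how much work the device tree does for us: the port reads the "riscv,isa" string from a CPU node and maps each extension letter to a HWCAP bit. A sketch along the lines of the port's riscv_fill_hwcap; treat the details as illustrative:

        /* A sketch of the HWCAP logic: read the "riscv,isa" string from
         * the device tree and map each extension letter to a HWCAP bit. */
        void riscv_fill_hwcap(void)
        {
                static unsigned long isa2hwcap[256];
                struct device_node *node;
                const char *isa;
                size_t i;

                isa2hwcap['i'] = COMPAT_HWCAP_ISA_I;
                isa2hwcap['m'] = COMPAT_HWCAP_ISA_M;
                isa2hwcap['a'] = COMPAT_HWCAP_ISA_A;
                isa2hwcap['f'] = COMPAT_HWCAP_ISA_F;
                isa2hwcap['d'] = COMPAT_HWCAP_ISA_D;
                isa2hwcap['c'] = COMPAT_HWCAP_ISA_C;

                /* We assume a homogeneous ISA, so one CPU node is enough. */
                node = of_find_node_by_type(NULL, "cpu");
                if (!node || of_property_read_string(node, "riscv,isa", &isa))
                        return;

                elf_hwcap = 0;
                for (i = 0; i < strlen(isa); ++i)
                        elf_hwcap |= isa2hwcap[(unsigned char)tolower(isa[i])];
        }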

This is the only RISC-V specific part of the boot process; after this, control returns to the upstream kernel code.

Booting other Harts in an SMP System

The most interesting part of the RISC-V boot process is how we wake up the other harts in the system. Recall that, on RISC-V systems, harts boot by jumping to the kernel's start address at arbitrary times. We do this to simplify the specification of the ISA: if there's no concept of a disabled hart, then there's no need to specify how to power harts on and off. Since power management is frequently a tricky thing to handle, we push it off to machine-specific code and provide a clean interface to the supervisor.

In order to handle this wrinkle, the harts that lose the lottery spin waiting for the main hart to make it far enough along the boot process that the memory they need to run has been allocated -- in this case, a task_struct that lives in the kernel's tp register (a bit of an abuse of the thread pointer, but we just want something the compiler won't mangle) and a stack.

The code to do this is fairly straightforward, even if it is a sizeable chunk of the assembly code in our startup function:

.Lsecondary_start:
        li a1, CONFIG_NR_CPUS
        bgeu a0, a1, .Lsecondary_park

        /* Set trap vector to spin forever to help debug */
        la a3, .Lsecondary_park
        csrw stvec, a3

        /*
         * Compute this hart's slot in the __cpu_up_{stack,task}_pointer
         * arrays: the hart ID scaled by the register width (LGREG).
         */
        slli a3, a0, LGREG
        la a1, __cpu_up_stack_pointer
        la a2, __cpu_up_task_pointer
        add a1, a3, a1
        add a2, a3, a2

        /*
         * This hart didn't win the lottery, so we wait for the winning hart to
         * get far enough along the boot process that it should continue.
         */
.Lwait_for_cpu_up:
        REG_L sp, (a1)
        REG_L tp, (a2)
        beqz sp, .Lwait_for_cpu_up
        beqz tp, .Lwait_for_cpu_up
        fence

        /* Enable virtual memory and relocate to virtual address */
        call relocate

        tail smp_callin

This leaves __cpu_up, the function that boots a target hart by ID, fairly simple as well:

int __cpu_up(unsigned int cpu, struct task_struct *tidle)
{
        tidle->thread_info.cpu = cpu;

        /*
         * On RISC-V systems, all harts boot of their own accord.  Our _start
         * selects the first hart to boot the kernel and causes the remainder
         * of the harts to spin in a loop waiting for their stack pointer to
         * be set up by that main hart.  Writing __cpu_up_stack_pointer
         * signals to the spinning harts that they can continue the boot
         * process.
         */
        smp_mb();
        __cpu_up_stack_pointer[cpu] = task_stack_page(tidle) + THREAD_SIZE;
        __cpu_up_task_pointer[cpu] = tidle;

        while (!cpu_online(cpu))
                cpu_relax();

        return 0;
}

Shutting Down

It seems most natural to discuss shutdown alongside the boot process, but there's not much to it on RISC-V: the generic kernel code cleans up the whole kernel, at which point sbi_shutdown is called to inform the machine-mode code that it should terminate.
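
With the SBI_CALL macro sketched earlier, the supervisor side of that is a one-liner:

        /* The arch-specific shutdown path: a single trap into the
         * machine-mode firmware, which never returns. */
        static inline void sbi_shutdown(void)
        {
                SBI_CALL(SBI_SHUTDOWN, 0, 0, 0);
        }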

That's about all that's involved in booting (and halting) a RISC-V Linux kernel. Since we use device tree and push most of the platform-specific work into the firmware, the process is actually pretty straightforward.

Next week we'll discuss task switching, which is the first thing the kernel does after the boot process completes.

