SiFive - December 11, 2017

All Aboard, Part 9: Paging and the MMU in the RISC-V Linux Kernel

This entry will cover the RISC-V port of Linux's memory management subsystem. Since the vast majority of the memory management code in Linux is architecture-independent, our port's memory management code is mostly limited to interfacing with our MMU, defining our page table format, and interfacing with drivers that have memory allocation constraints.

I will refrain from discussing the RISC-V memory model in this blog, both because it isn't yet finished and because it's complicated enough to warrant its own series of blog posts.

Also, as a side note: for those of you not following the internet, we've gotten our core architecture port into Linus' tree and are slated to release as part of 4.15. This won't be a fully bootable RISC-V system, but it's at least a good first step!

Privilege Levels in RISC-V Systems

The RISC-V ISA defines a stack of execution environments. While each environment is designed to be classically virtualizable, in standard systems each level in the stack is designed to provide the next level's execution environment. Starting from the least-privileged level in the stack, the execution environments are:

  • User-mode software executes in an AEE (Application Execution Environment). On Linux systems, the AEE is also known as the user ABI: the set of system calls supported by the kernel. The AEE also includes the entire user ISA, since user-mode programs are expected to be able to execute instructions other than just scall.
  • Supervisor-mode software executes in an SEE (Supervisor Execution Environment). This environment consists of the supervisor-mode instructions and CSRs defined by the privileged ISA document, along with the SBI. Supervisor-mode software is expected to provide multiple AEEs; Linux provides one AEE to each process.
  • Hypervisor-mode software executes in an HEE (Hypervisor Execution Environment), and is expected to provide multiple SEEs. The hypervisor-mode section of the privileged ISA document is still being written, so we'll ignore this for now.
  • Machine-mode software executes in an MEE (Machine Execution Environment), and is expected to provide one higher-level execution context. Since the privileged ISA makes implementing the U, S, and H extensions optional (thus allowing for M, M+U, M+S+U, and M+H+S+U systems), it's expected that different machine-mode software implementations will provide an HEE, an SEE, or an AEE.

While it's fairly standard to provide execution environment stacks that do not match this hierarchy, the software executing in each environment can't tell the difference. That's not to say that user-mode software is entirely portable: for example, the Linux AEE is different from the FreeBSD AEE because they provide different system calls. The intent is simply that programs written to execute in the Linux AEE can't tell if they're executing on Linux on hardware, in Linux running in Spike, or in QEMU's user-mode emulation. None of this is a new concept; it's just a bit more explicitly stated in the RISC-V ISA specification than it is on many other architectures.

Since RISC-V is classically virtualizable at every level of the privilege stack, no explicit hardware support is necessary to provide any execution environment: for example, QEMU's user-mode emulation provides an AEE on systems that have no hardware support for any RISC-V ISA. While these systems can be made reasonably performant, the main purpose of RISC-V is to enable hardware implementations of the ISA -- for example, even though a Xeon running QEMU will probably be the fastest implementation of the RISC-V Linux AEE for the foreseeable future, it would be more appropriate to run a hardware implementation of RISC-V on my wristwatch.

The RISC-V ISA documents are designed to allow the software implementations at various levels of the privilege stack to use the execution environment they're written against in order to provide the execution environment above them. Some of these are so obvious you probably haven't noticed that we've been talking about them for a while: for example, it's assumed that the hardware handles executing an addi instruction in userspace without the kernel's intervention because it would be silly not to.

We've been able to put off the discussion of privileged levels until now because the vast majority of the design is obvious: all of the user instructions are handled by the hardware without supervisor intervention except for scall, which just transfers control to the kernel's single trap entry point -- see my previous blog post All Aboard, Part 7: Entering and Exiting the Linux Kernel on RISC-V for details. Since application execution environments provide the illusion that user-space programs have access to a big flat address space we've been able to more or less ignore memory when discussing user applications. As is common with computing systems, memory is the hard part -- thus, we only really need to discuss RISC-V privilege modes when talking about paging.

For this blog post, we'll be focusing on supervisor code running on reasonable systems -- thus we won't do things like emulating unsupported instructions from userspace. The focus of the blog will be on how to provide a RISC-V application execution environment given a supervisor execution environment.

The RISC-V Application-Class Supervisor Execution Environments

Supervisor programs, like Linux, execute on a supervisor execution environment. Much like how the user-level ISA leaves many of the specifics of the AEE to be implemented in different ways on different platforms (system calls on Linux vs BSD, for example), the privileged ISA doesn't specify all the details of the SEE that application-class supervisors (like Linux or BSD) can expect -- those will be specified as part of the platform specification.

In this blog, I'll quickly go over a few key aspects of the RISC-V supervisor execution environment for application-class supervisors. This environment is designed to support UNIX-style operating systems running in supervisor mode, emulating POSIX-compliant application execution environments. Some highlights of the proposed (with the caveat that I'm not in the platform specification working group, so this is all just my guess) requirements for the application-class SEE are:

  • Either the RV32I or RV64I base ISAs, along with the M, A, and C extensions. The F and D extensions are optional but paired together, leaving the possible standard ISAs for application-class SEEs as RV32IMAC, RV32IMAFDC (RV32GC), RV64IMAC, and RV64IMAFDC (RV64GC).
  • On RV32I-based systems, support for Sv32 page-based virtual memory.
  • On RV64I-based systems, support for at least Sv48 page-based virtual memory.
  • Upon entering the SEE, the PMAs are set such that memory accesses are point-to-point strongly ordered between harts and devices.
  • An SBI that implements various fences, timers, and a console.
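As a toy illustration of the first requirement, here's a sketch of a check for whether an ISA string names one of the four standard application-class combinations. This is a hypothetical user-space helper for illustration only, not kernel code, and real ISA strings can carry version numbers and non-standard extensions that it ignores:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `isa` is one of the four standard
 * application-class ISA combinations listed above, 0 otherwise. */
static int is_app_class_isa(const char *isa)
{
    static const char *ok[] = {
        "rv32imac", "rv32imafdc",   /* RV32GC */
        "rv64imac", "rv64imafdc",   /* RV64GC */
    };
    for (size_t i = 0; i < sizeof(ok) / sizeof(ok[0]); i++)
        if (strcmp(isa, ok[i]) == 0)
            return 1;
    return 0;
}
```

Note that F and D come as a pair here: an ISA string like "rv64imafc" is rejected because the proposed requirements don't allow single-precision floating point without double.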

The application-class SEE, as specified by the upcoming RISC-V platform specification, is the contract between standard Linux distributions and hardware vendors -- of course these restrictions don't apply for the embedded space, where many of them would be onerous. In practice: if you expect users to be able to swap out the boot media on your platform, then you should meet the requirements of the application-class SEE.

The RISC-V Linux Application Execution Environments

Supervisor-mode software on RISC-V uses a supervisor execution environment in order to provide one or multiple application execution environments. Fundamentally, an AEE (like any execution environment) is simply the definition of the next state of the machine upon every instruction's execution. On RISC-V systems, the AEE depends on:

  • The ISA string, which determines what the vast majority of instructions do as well as which registers constitute the machine's current state.
  • The supervisor's user-visible ABI, which determines what the scall instruction does. This is different from the C compiler's ABI, which defines the interface between different components of the application.
  • The contents of the entire memory address space.

In an idealized world, each process consists of its own independent AEE, with Linux multiplexing these on top of a single SEE. Of course, there are all sorts of problems with this model in practice, but none of them are specific to RISC-V systems. The concept of a self-contained and well-defined AEE is still useful from a standards standpoint, and we hope to progress on properly specifying the RISC-V Linux AEE family (as well as AEEs for other POSIX-like systems) as we progress with our ports.

Paging on RISC-V Systems

After that lengthy digression into the definition of RISC-V's privileged modes, we can finally get to the whole point of this blog post: paging on RISC-V systems. Paging is the main mechanism used to provide user mode with the illusion of having an AEE -- like most things in computer architecture, it turns out that memory is the tricky part.

One of the nice things about designing an ISA at the time RISC-V was designed is that so many different solutions to difficult problems have been tried that we pretty much know what to do now. Thus, we arrived at a pretty standard page-based virtual memory system when designing the RISC-V supervisor virtual memory interface. The exact page table formats and such are listed in the relevant RISC-V ISA manuals so I won't go through them here, but there are a few highlights:

  • Pages are 4KiB at the leaf node, and it's possible to map large contiguous regions with every level of the page table.
  • RV32I-based systems can have up to 34-bit physical addresses with a two-level page table.
  • RV64I-based systems can have multiple virtual address widths, starting at 39 bits (Sv39) and extending upward in 9-bit increments (Sv48, with wider modes reserved for the future).
  • Mappings must be synchronized via the sfence.vma instruction.
  • There are bits for global mappings, supervisor-only, read/write/execute, and accessed/dirty.
  • There is a single valid bit; when it is clear, the remaining XLEN-1 bits of the entry are available to store software flags in an otherwise-unused page table entry. Additionally, there are two bits of software flags in mapped pages.
  • Address space identifiers are 9 bits on RV32I and 16 bits on RV64I, and they're a hint, so a valid implementation is to ignore them.
  • The accessed and dirty bits are strongly ordered with respect to accesses from the same hart, but are optional (with a trap-based mechanism when unsupported).
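To make the layout concrete, here's a user-space sketch (not kernel code) of the Sv39 leaf PTE format and virtual address split described above: the V/R/W/X/U/G/A/D flag bits live in bits 0-7, two software (RSW) bits in 8-9, and the physical page number starts at bit 10, while a virtual address splits into three 9-bit VPN fields above a 12-bit page offset. The physical address used below is just an illustrative value:

```c
#include <stdint.h>

#define PTE_V (1UL << 0)  /* valid */
#define PTE_R (1UL << 1)  /* readable */
#define PTE_W (1UL << 2)  /* writable */
#define PTE_X (1UL << 3)  /* executable */
#define PTE_U (1UL << 4)  /* user-accessible */
#define PTE_G (1UL << 5)  /* global mapping */
#define PTE_A (1UL << 6)  /* accessed */
#define PTE_D (1UL << 7)  /* dirty */

/* Build a leaf PTE: the physical page number goes in bits 10 and up. */
static uint64_t make_leaf_pte(uint64_t pa, uint64_t flags)
{
    return ((pa >> 12) << 10) | flags | PTE_V;
}

/* Recover the physical address a leaf PTE maps to. */
static uint64_t pte_to_pa(uint64_t pte)
{
    return (pte >> 10) << 12;
}

/* Split an Sv39 virtual address into the three 9-bit VPN fields that
 * index the three levels of the page table walk, plus the page offset. */
static void split_va(uint64_t va, unsigned vpn[3], unsigned *off)
{
    *off = va & 0xfff;
    for (int i = 0; i < 3; i++)
        vpn[i] = (va >> (12 + 9 * i)) & 0x1ff;
}
```

The same flag layout is shared by all levels of the table: a non-leaf entry is simply one with V set and R/W/X all clear, which is what makes mapping large contiguous regions at every level possible.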

The Linux implementation of paging is functional but not complete: we're missing support for ASIDs, for example. Like many things in our port, these extra features will come with time.
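For concreteness, the paging mode and ASID both live in the RV64 satp CSR: MODE in bits 63:60 (8 selects Sv39, 9 selects Sv48), the ASID in bits 59:44, and the physical page number of the root page table in bits 43:0. A user-space sketch of that encoding (the root table address below is just an illustrative value, not anything our port uses):

```c
#include <stdint.h>

#define SATP_MODE_SV39 8ULL
#define SATP_MODE_SV48 9ULL

/* Pack a paging mode, ASID, and root page table physical address into
 * the RV64 satp layout: MODE[63:60] | ASID[59:44] | PPN[43:0]. */
static uint64_t make_satp(uint64_t mode, uint64_t asid, uint64_t root_pa)
{
    return (mode << 60) | ((asid & 0xffff) << 44) | (root_pa >> 12);
}
```

Since our port doesn't use ASIDs yet, it effectively always writes a zero ASID field, which is legal because implementations are allowed to ignore ASIDs entirely.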

Handling Device DMA

RISC-V does not currently define an IOMMU, so device accesses are performed in a single linear address space provided by the SEE (aka, physical memory). Combined with the lack of a mechanism to modify PMAs, this makes device IO on RISC-V very simple: we essentially just don't do anything specific to our ISA.

Handling 32-bit DMA Regions

Some devices only support 32-bit addressing even when attached to a system with longer physical addresses. Since RISC-V lacks an IOMMU, we handle these devices by using kernel bounce buffers. This is correct but slow: while it may be fine for SoC-style systems where the set of devices is well known at elaboration time, as more complicated RISC-V systems become available we will eventually need to standardize a mechanism for virtualizing device addressing.

Our bounce buffer mechanism simply uses the standard mechanisms provided by Linux, so there isn't anything RISC-V specific about it. We provide a 32-bit ZONE_DMA, allow allocating from that, and use bounce buffers to handle ioremap() for already-allocated pages outside the legal region.
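The bounce buffer idea itself is simple enough to sketch in a few lines. The following is a toy user-space illustration, not the kernel's actual bounce buffer code, and the "physical" addresses are simulated values: if a buffer sits above the 4GiB boundary a 32-bit device can address, copy it into a buffer the device can reach and hand the device that copy instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA32_LIMIT (1ULL << 32)

static uint8_t low_pool[256];     /* stands in for low (ZONE_DMA) memory */

/* Return the buffer the "device" should use for a transfer of `len`
 * bytes, bouncing through low memory when the simulated physical
 * address `sim_pa` is out of the device's 32-bit reach. */
static uint8_t *map_for_device(uint8_t *buf, uint64_t sim_pa, size_t len)
{
    if (sim_pa + len <= DMA32_LIMIT)
        return buf;                     /* directly addressable: no copy */
    memcpy(low_pool, buf, len);         /* bounce into low memory */
    return low_pool;
}
```

The cost is visible even in the sketch: every transfer through the bounce path pays an extra memcpy (and another one on the way back for device-to-memory transfers), which is why an IOMMU or other address-virtualization mechanism becomes attractive as systems grow.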