

School of Computer Science & Engineering

**COMP9242 Advanced Operating Systems** 

#### 2020 T2 Week 05b **Microkernel Design & Implementation The 25-year quest for the right API** @GernotHeiser



## **Copyright Notice**

#### These slides are distributed under the Creative Commons Attribution 3.0 License

- You are free:
  - to share—to copy, distribute and transmit the work
  - to remix-to adapt the work
- under the following conditions:
  - Attribution: You must attribute the work (but not in any way that suggests that the author endorses you or your use of the work) as follows:

"Courtesy of Gernot Heiser, UNSW Sydney"

The complete license text can be found at http://creativecommons.org/licenses/by/3.0/legalcode



#### L4 Microkernels – Deployed by the Billions





© Gernot Heiser 2019 – CC Attribution License



# L4: The Quest for a Real Microkernel



#### L4: The Quest for a Real Microkernel



A concept is tolerated inside the microkernel only if moving it outside the kernel, i.e. permitting competing implementations, would prevent the implementation of the system's required functionality. [Liedtke, SOSP'95]



#### L4: 25 Years High Performance Microkernels



COMP9242 2020T2 W05b: Microkernel D&I

© Gernot Heiser 2019 – CC Attribution License



#### L4 IPC Performance Over the Years

| Name      | Year | Processor                       | MHz   | Cycles | μs   |
|-----------|------|---------------------------------|-------|--------|------|
| Original  | 1993 | i486                            | 50    | 250    | 5.00 |
| Original  | 1997 | Pentium                         | 160   | 121    | 0.75 |
| L4/MIPS   | 1997 | MIPS R4700                      | 100   | 86     | 0.86 |
| L4/Alpha  | 1997 | Alpha 21064                     | 433   | 45     | 0.10 |
| Hazelnut  | 2002 | Pentium 4                       | 1,400 | 2,000  | 1.38 |
| Pistachio | 2005 | Itanium                         | 1,500 | 36     | 0.02 |
| OKL4      | 2007 | Arm XScale 255                  | 400   | 151    | 0.64 |
| NOVA      | 2010 | x86 i7 Bloomfield (32-bit)      | 2,660 | 288    | 0.11 |
| seL4      | 2013 | ARM11                           | 532   | 188    | 0.35 |
| seL4      | 2018 | x86 i7 Haswell (64-bit)         | 3,400 | 442    | 0.13 |
| seL4      | 2018 | Arm Cortex A9                   | 1,000 | 303    | 0.30 |
| seL4      | 2020 | RISC-V HiFive (64-bit, no ASID) | 1,500 | 500    | 0.33 |



### Minimality: Source Lines of Code (SLOC)

| Architecture | C/C++                                                   | asm                                                               | total                                                                                                               |
|--------------|---------------------------------------------------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| i486         | 0 k                                                     | 6.4 k                                                             | 6.4 k                                                                                                               |
| Alpha        | 0 k                                                     | 14.2 k                                                            | 14.2 k                                                                                                              |
| MIPS64       | 6.0 k                                                   | 4.5 k                                                             | 10.5 k                                                                                                              |
| x86          | 10.0 k                                                  | 0.8 k                                                             | 10.8 k                                                                                                              |
| x86          | 22.4 k                                                  | 1.4 k                                                             | 23.0 k                                                                                                              |
| ARMv5        | 7.6 k                                                   | 1.4 k                                                             | 9.0 k                                                                                                               |
| ARMv6        | 15.0 k                                                  | 0.0 k                                                             | 15.0 k                                                                                                              |
| x86          | 36.2 k                                                  | 1.1 k                                                             | 37.6 k                                                                                                              |
| ARMv6        | 9.7 k                                                   | 0.5 k                                                             | 10.2 k                                                                                                              |
|              | i486<br>Alpha<br>MIPS64<br>x86<br>x86<br>ARMv5<br>ARMv5 | i4860 kAlpha0 kMIPS646.0 kx8610.0 kx8622.4 kARMv57.6 kARMv615.0 k | i4860 k6.4 kAlpha0 k14.2 kMIPS646.0 k4.5 kx8610.0 k0.8 kx8622.4 k1.4 kARMv57.6 k1.4 kARMv615.0 k0.0 kx8636.2 k1.1 k |



# What Have We Learnt in 25 Years?



#### Issues With 2G Microkernels

- L4 solved microkernel performance [Härtig et al, SOSP'97] left a number of issues unsolved
- Problem: ad-hoc approach to security and resource management
  - Global thread name space  $\Rightarrow$  covert channels [Shapiro'03]
  - Threads as IPC targets  $\Rightarrow$  insufficient encapsulation
  - Single kernel memory pool  $\Rightarrow$  DoS attacks
  - No delegation of authority  $\Rightarrow$  impacts flexibility, performance
  - Unprincipled management of time

Solved by capabilities



#### **Traditional L4: Recursive Address Spaces**



COMP9242 2020T2 W05b: Microkernel D&I

#### Issues With 2G Microkernels

- L4 solved microkernel performance [Härtig et al, SOSP'97] left a number of issues unsolved
- Problem: ad-hoc approach to security and resource management
  - Global thread name space ⇒ covert channels [Shapiro'03]
  - Threads as IPC targets  $\Rightarrow$  insufficient encapsulation
  - Single kernel memory pool  $\Rightarrow$  DoS attacks
  - No delegation of authority  $\Rightarrow$  impacts the ribility, performance
  - Unprincipled management of time

Solved by seL4 memory management model



## **Direct vs Indirect IPC Addressing**

- Direct: Queue senders/messages at receiver
  - Need unique thread IDs
  - Kernel guarantees identity of sender
    - useful for authentication
- Indirect: Mailbox/port object
  - Just a user-level handle for the kernel-level queue
  - Extra object type extra weight?
  - Communication partners are anonymous
    - Need separate mechanism for authentication





#### Other Issues with L4 IPC Adressing



#### Issues With 2G Microkernels

- L4 solved microkernel performance [Härtig et al, SOSP'97] left a number of issues unsolved
- Problem: ad-hoc approach to security and resource management
  - Global thread name space  $\Rightarrow$  covert channels [Shapiro'03]
  - Threads as IPC targets ⇒ insufficient encapsulation
  - Single kernel memory pool  $\Rightarrow$  DoS attacks

Solved by caps & endpoints

• No delegation of authority  $\Rightarrow$  impacts flexibility, performance

• Unprincipled management of time

Examine later





# Other Design & Implementation Issues





#### Sender address space



16 COMP9242 2020T2 W05b: Microkernel D&I







#### IPC Fastpath: Send Phase of Call

| Running         | <u>1</u> ) -<br>2)<br>3) | <ul> <li>Prologue</li> <li>Save minimal state, get args</li> <li>Identify destination</li> <li>Cap lookup;<br/>get endpoint; check queue</li> <li>Get receiver TCB</li> <li>Check receiver can still run</li> <li>Check receiver priority is ≥ ours</li> </ul> | Wait to receive<br>312 cycles<br>on Arm A9                                                              |
|-----------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
|                 | 4)                       | <ul> <li>Mark sender blocked and enqueue</li> <li>Block caller on reply object</li> <li>Donate scheduling context</li> </ul>                                                                                                                                   | <ul><li>Direct process switch:</li><li>no scheduler invocation</li><li>sched-context donation</li></ul> |
| Wait to receive | 5)<br>6)                 | <ul> <li>Switch to receiver</li> <li>Copy virtual message registers</li> <li>Epilogue (restore &amp; return)</li> </ul>                                                                                                                                        | Running                                                                                                 |
|                 |                          |                                                                                                                                                                                                                                                                |                                                                                                         |





## How About Real-Time Support?

- Kernel runs with interrupts disabled
  - No concurrency control ⇒ simpler kernel
    - Easier reasoning about correctness
    - Better average-case performance

How about longrunning system calls?





## sel4 Incremental Consistency Paradigm





## sel4 Example: Destroying IPC Endpoint





### sel4 Difficult Example: Revoking Badge













### L4 Scheduler Optimisation: Lazy Scheduling



•





- Frequent blocking/unblocking in IPC-based systems
- Many ready-queue manipulations





#### Scheduler Optimisation: Direct Process Switch



#### **Remember: Delegation of Critical Sections**



#### Sel4 MCS Model: Scheduling Contexts





## Sel4 Delegation with Scheduling Contexts



OS mechanism for managing time [Lyons et al, EuroSys'18]

31 COMP9242 2020T2 W05b: Microkernel D&I



## Mixed-Criticality Support

#### For *mixed-criticality systems* (MCS), OS must provide:

• Temporal isolation, to force jobs to adhere to declared WCET

Solved by scheduling contexts

• Mechanisms for safely sharing resources across criticalities

What if budget expires while shared server executing on Low's scheduling context?

Crit: Low Client, Clie

COMP9242 2020T2 W05b: Microkernel D&I





#### Policy-free mechanism for dealing with budget depletion

Possible actions:

- Provide emergency budget to leave critical section
- Cancel operation & roll-back server
- Reduce priority of low-crit client (together with one of the above)
- Implement priority inheritance (if you must...)



#### Issues With 2G Microkernels

- L4 solved microkernel performance [Härtig et al, SOSP'97] left a number of issues unsolved
- Problem: ad-hoc approach to security and resource management
  - Global thread name space  $\Rightarrow$  covert channels [Shapiro'03]
  - Threads as IPC targets  $\Rightarrow$  insufficient encapsulation
  - Single kernel memory pool  $\Rightarrow$  DoS attacks
  - No delegation of authority  $\Rightarrow$  impacts flexibility, performance
  - Unprincipled management of time\_

Solved by scheduling contexts & time-out exceptions



## Lessons & Principles

35 COMP9242 2020T2 W05b: Microkernel D&I

© Gernot Heiser 2019 – CC Attribution License



### **Original L4 Design and Implementation**

#### Implement. Tricks [SOSP'93]

- Process kernel
- Virtual TCB array
   Modified
- Lazy scheduling
- Direct process switch
- Non-preemptible
- Non-pertable

- Retained
- Non-standard calling convention
- Assembler

#### **Design Decisions [SOSP'95]**

- Synchronous IPC
- Rich message structure, arbitrary out-of-line messages
- Zero-copy register messages
- User-mode page-fault handlers
- Threads as IPC destinations
- IPC timeouts
- Hierarchical IPC control
- User-mode device drivers
- Process hierarchy
- Recursive address-space construction





## **Reflecting on Changes**

#### Original L4 design had two major shortcomings:

- 1. Insufficient/impractical resource control
  - Poor/non-existent control over kernel memory use
  - Inflexible & costly process hierarchies (policy!)
  - Arbitrary limits on number of address spaces and threads (policy!)
  - Poor information hiding (IPC addressed to threads)
  - Insufficient mechanisms for authority delegation
- 2. Over-optimised IPC abstraction, mangles:
  - Communication
  - Synchronisation
  - Memory management sending mappings
  - Scheduling time-slice donation





- Fully delegatable access control
- All resource management is subject to user-defined policies
  - Applies to kernel resources too!
- Performance on par with best-performing L4 kernels
  - Prerequisite for real-world deployment!
- Suitability for real-time use
  - Important for safety-critical systems
- Suitable for *formal verification* 
  - Requires small size, avoid complex constructs

Largely in line with traditional L4 approach!



#### A Thirty-Year Dream!

1. Introduction

Operating R. S Systems Edit

#### R. Stockton Gaines Editor

#### Specification and Verification of the UCLA Unix† Security Kernel

Bruce J. Walker, Richard A. Kemmerer, and Gerald J. Popek University of California, Los Angeles

Data Secure Unix, a kernel structured operating system, was constructed as part of an ongoing effort at UCLA to develop procedures by which operating systems can be produced and shown secure. Program verification methods were extensively applied as a constructive means of demonstrating security enforcement.

Here we report the specification and verification experience in producing a secure operating system. The work represents a significant attempt to verify a largescale, production level software system, including all aspects from initial specification to verification of implemented code.

Key Words and Phrases: verification, security, operating systems, protection, programming methodology, ALPHARD, formal specifications, Unix, security kernel

CR Categories: 4.29, 4.35, 6.35

Early attempts to make operating systems secure merely found and fixed flaws in existing systems. As these efforts failed, it became clear that piecemeal alterations were unlikely ever to succeed [20]. A more systematic method was required, presumably one that controlled the system's design and implementation. Then secure operation could be demonstrated in a stronger sense than an ingenuous claim that the last bug had been eliminated, particularly since production systems are rarely static, and errors easily introduced.

Our research seeks to develop means by which an operating system can be shown data secure, meaning that direct access to data must be possible only if the recorded protection policy permits it. The two major components of this task are: (1) developing system architectures that minimize the amount and complexity of software involved in both protection decisions and enforcement, by isolating them into kernel modules; and (2) applying extensive verification methods to that kernel software in order to prove that our of data security criterion is met. This paper reports on the latter part, the verification experience. Those interested in architectural issues should see [23]. Related work includes the PSOS operating system project at SRI [25] which uses the hierarchical design methodology described by Robinson and Levitt in [26], and efforts to prove communications software at the University of Texas [31].

Every verification step, from the development of toplevel specifications to machine-aided proof of the Pascal code, was carried out. Although these steps were not completed for all portions of the kernel, most of the job was done for much of the kernel. The remainder is clearly more of the same. We therefore consider the project essentially complete. In this paper, as each verification step is discussed, an estimate of the completed portion of that step is given, together with an indication of the amount of work required for completion. One should realize that it is essential to carry the verification process through the steps of actual code-level proofs because most security flaws in real systems are found at this level [20]. Security flaws were found in our system during verification, despite the fact that the implementation was written carefully and tested extensively. An example of Our research seeks to develop means by which an operating system can be shown data secure, meaning that direct access to data must be possible only if the recorded protection policy permits it. The two major components

| Communications | February 1980 |  |  |
|----------------|---------------|--|--|
| of             | Volume 23     |  |  |
| the ACM        | Number 2      |  |  |

39 COMP9242 2020T2 W05b: Microkernel D&I

