Toro kernel
A dedicated unikernel for microservices
Friday, May 12, 2023
Virtio-vsock driver speeds up transmission by making one copy
Wednesday, January 11, 2023
Running MPI applications as a unikernel
This post aims at giving some context for the presentation I will give at FOSDEM in the HPC room. Recently, I have been working on implementing MPI functions on top of the Toro unikernel. The goal of this PoC is to understand the benefits of running an MPI application on top of a highly parallel unikernel like Toro. This post first presents what a unikernel is, then explains why Toro is different from other unikernels. I finish by presenting the benchmark I am using to evaluate the current implementation and by discussing future work.
A unikernel is a minimalistic operating system that is compiled together with the user application and aims at reducing OS interference. The kernel and the user application share the same view of memory. The communication between the kernel and the application happens through simple function calls instead of syscalls. Both the application and the kernel run in the most privileged ring, i.e., ring 0 on x86. There are several existing unikernels like OSv, NanoVM, MirageOS and Unikraft. These projects focus on allowing a POSIX application to be compiled within the unikernel. The resulting binary is thus deployed without the need for an OS.
Toro is a non-POSIX unikernel that requires modifying the application. Most of its functions are named like those in the POSIX standard but the implementation is slightly different. Note that deploying an application as a VM impacts its performance. A unikernel allows MPI applications to run closer to the hardware, thus reducing the performance degradation due to virtualization.
Toro runs one instance of the kernel per core and most of the kernel data is per-CPU. When a thread needs to allocate memory, it gets it from a pool that is local to the CPU on which the thread runs. That pool is created during initialization by allocating a fixed amount of memory per core. The scheduler is cooperative and is based on the SysThreadSwitch() function, which yields the current processor from the current thread to the scheduler. Since the scheduler is not preemptive, most race conditions on kernel data are avoided by simply disabling IRQs when needed.
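As a rough illustration, a cooperative, per-core scheduler behaves like the following sketch in C. This is not Toro's Pascal implementation; the round-robin policy, the ucontext API and MAX_THREADS are assumptions used only for illustration:

#include <ucontext.h>

#define MAX_THREADS 8                      /* hypothetical per-core limit */

struct core_state {
    ucontext_t threads[MAX_THREADS];       /* per-CPU data: no locks required */
    int current;                           /* index of the running thread */
    int count;                             /* number of threads on this core */
};

static struct core_state cpu;              /* one instance of this per core */

/* Cooperative yield: the running thread hands the processor to the scheduler,
   which here simply resumes the next ready thread on the same core. */
void SysThreadSwitch(void)
{
    int prev = cpu.current;
    cpu.current = (cpu.current + 1) % cpu.count;
    swapcontext(&cpu.threads[prev], &cpu.threads[cpu.current]);
}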
An MPI application is deployed on several nodes and executes in parallel. Among these instances, there is the root instance that acts as the conductor of the orchestra. These instances require mechanisms to synchronise and to communicate. For example, the MPI_Barrier() function acts as a barrier at which every instance waits until all instances have reached it. Instances can cross the barrier only after all the other instances have reached it. MPI_Send() and MPI_Recv() are functions to send and receive data among the different instances. These primitives are used to build more complex functions like MPI_Bcast() and MPI_Reduce(). The former broadcasts data from the root to the other instances and the latter sends data from all instances to the root, where it is combined by using a reduction function like SUM or MIN.
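As a small, generic illustration (standard MPI, not Toro-specific code), the snippet below uses these primitives: every instance waits at the barrier, the root then broadcasts a value, and a SUM reduction collects the result on the root:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* every instance waits here */

    value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* root sends 42 to all */

    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);  /* SUM on root */

    if (rank == 0)
        printf("sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}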
In Toro, each instance of the MPI application is executed by a thread on a dedicated core. The number of instances is thus equal to the number of available physical cores. The threads execute without kernel interference until they require IO.
To benchmark the MPI implementation in Toro, we have chosen the OSU microbenchmarks (see https://mvapich.cse.ohio-state.edu/benchmarks/). These are a set of MPI applications that are used to benchmark different implementations of the MPI standard. They are simple benchmarks, so porting them to Toro is straightforward. In particular, we focus on the benchmarks that stress the MPI_Barrier(), MPI_Bcast() and MPI_Reduce() functions. These functions are benchmarked by relying on the following tests: osu_barrier, osu_bcast and osu_reduce.
The osu_barrier test measures the time that MPI_Barrier() takes. This function makes the thread wait until it is able to cross the barrier. The instructions after the barrier are executed only after all threads have reached the barrier and are thus allowed to cross it.
The osu_bcast and osu_reduce benchmarks require communication among the instances. The first broadcasts packets from the root core to the others. The second collects all the data in the root core for processing. Both benchmarks stress the communication involving the root core.
The communication between instances relies on a dedicated device named virtio-bus, which is a mechanism to send data among cores by using virtqueues. Each core has a reception queue and a transmission queue for every other core in the system. For example, if core #0 wants to send a packet to core #1, it uses the transmission virtqueue that is dedicated to sending from core #0 to core #1. The MPI functions MPI_Send() and MPI_Recv() are built on top of this device.
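The per-core queue layout can be pictured roughly as follows. This is a hedged sketch in C, not Toro's actual data structures; MAX_CORES and the field names are assumptions:

#define MAX_CORES 8                            /* hypothetical number of cores */

struct virtqueue;                              /* the queue itself, defined elsewhere */

struct core_queues {
    struct virtqueue *tx[MAX_CORES];           /* tx[i]: queue for sending to core i */
    struct virtqueue *rx[MAX_CORES];           /* rx[i]: queue for receiving from core i */
};

static struct core_queues queues[MAX_CORES];   /* one entry per core */

/* Sending from core src to core dst picks the transmission queue that src
   dedicates to dst, e.g., core #0 -> core #1 uses queues[0].tx[1]. */
static struct virtqueue *tx_queue(int src, int dst)
{
    return queues[src].tx[dst];
}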
A virtqueue has three rings: the buffer ring, the available ring and the used ring. The buffer ring contains a set of buffers that are used to store data. The available ring contains buffer descriptors that have been produced by the backend. The used ring contains buffer descriptors that have been consumed by the frontend. In our case, the core that sends a packet is the producer and the destination core is the consumer. The current implementation does not rely on an irq to notify the destination core. Instead, the destination core has to poll the available ring to get a new packet. After that, it puts the buffer in the used ring.
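Conceptually, the polling receive path looks like the following simplified sketch (not Toro's implementation; the ring layout, the ring size of 256 and the function name are assumptions):

struct buffer {
    void *data;
    unsigned int len;
};

struct ring {
    volatile unsigned int head;        /* producer index, written by the sender core */
    unsigned int tail;                 /* consumer index, owned by the receiver core */
    struct buffer *buffers[256];       /* hypothetical ring size */
};

/* The destination core polls the available ring until the sender publishes a
   buffer, consumes it and returns the descriptor through the used ring; no
   irq is involved. */
static struct buffer *recv_poll(struct ring *avail, struct ring *used)
{
    while (avail->tail == avail->head)
        ;                              /* busy-wait for a new packet */

    struct buffer *buf = avail->buffers[avail->tail % 256];
    avail->tail++;

    used->buffers[used->head % 256] = buf;
    used->head++;
    return buf;
}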
Let's see what happens when we run the osu_barrier.c benchmark in a VM with three cores on a host with two physical cores (2.4 GHz). We use QEMU microvm as the device model. This is a cheap host in the OVH cloud. A sketch of the benchmark in C looks like the following:
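(The listing below is a minimal, hedged sketch of an osu_barrier-like loop rather than the exact OSU source: ITERATIONS, the use of __rdtsc() to read the TSC and the final MPI_Reduce() calls are assumptions made for illustration.)

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <mpi.h>

#define ITERATIONS 1000

int main(int argc, char **argv)
{
    int rank, size;
    uint64_t total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < ITERATIONS; i++) {
        uint64_t start = __rdtsc();          /* cycles before the barrier */
        MPI_Barrier(MPI_COMM_WORLD);
        total += __rdtsc() - start;          /* cycles spent in the barrier */
    }

    uint64_t avg = total / ITERATIONS;
    uint64_t min, max, sum;

    /* Collect the min, max and average latency among all instances on the root. */
    MPI_Reduce(&avg, &min, 1, MPI_UINT64_T, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&avg, &max, 1, MPI_UINT64_T, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&avg, &sum, 1, MPI_UINT64_T, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_BARRIER: min_time: %llu cycles, max_time: %llu cycles, avg_time: %llu cycles\n",
               (unsigned long long)min, (unsigned long long)max,
               (unsigned long long)(sum / size));

    MPI_Finalize();
    return 0;
}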
The benchmark measures the latency of MPI_Barrier() and then outputs the min, max and average values among the different instances:
MPI_BARRIER: min_time: 84 cycles, max_time: 1144877 cycles, avg_time: 605571 cycles
The time is measured in CPU cycles; at 2.4 GHz, the min_time of 84 cycles corresponds to roughly 35 ns. In this simple test, we observe huge variations in the max and average measures. While two of the threads output values around the min_time, a third thread is responsible for the large variation in the max and average measures. The reason may be the use of more VCPUs than physical CPUs, which prevents the threads from running in parallel. Also, this host in the OVH cloud might be deployed as a nested host, which impacts performance.
These are still early results that need to be rerun on a physical host with more cores and with a larger number of tests in order to compare with existing implementations. The idea is to complete these tests and to show more concrete examples during the presentation at FOSDEM. I hope to see you there!
Follow me on Twitter: https://twitter.com/ToroKernel
Monday, September 05, 2022
Profiling and Optimizing Toro unikernel with Qprofiler
Hello everyone, in a recent video (you can watch it at https://youtu.be/DXmjZFIB_N0), I presented how I use Qprofiler to profile and then optimize the Toro unikernel. I would like to elaborate on what this tool is and how to use it. You can read more about Qprofiler at http://torokerneleng.blogspot.com/2019/07/qprofiler-profiler-for-guests-in-qemukvm.html. This is a simple tool that samples a guest to find out what code the guest is running. By doing this over a period of time, we get a list of pointers together with a counter that indicates the number of times the guest was at that position. This gives us a rough idea of where the guest is spending most of its time and, in case the performance is not good, lets us optimize the functions that dominate the overall running time. The sampling does not require any instrumentation of the code. Qprofiler only requires the symbols to get the name of a function from a pointer in memory.
In the video, we use Qprofiler to profile the example that shows how cores communicate by using VirtIO. When we run Qprofiler, we get the following output:
64 SYSTHREADSWITCH at Process.pas:1244
35 WRITE_PORTB at Arch.pas:459
33 SCHEDULING at Process.pas:1134
24 GETGSOFFSET at Arch.pas:429
The output shows us that most of the time the guest is switching context. Note that Qprofiler only samples core #0. In particular, we see in the video that the time is mostly spent storing the registers. During a context switch, the registers are first stored and then, when the thread is switched back, the registers are restored. This requires pushing all registers onto the stack and then popping all of them.
The current source code is a bit conservative and stores/loads all registers. However, we only need to store/restore the registers that the x86-64 ABI requires the callee to preserve, i.e., the non-volatile registers (rbx, rbp and r12 to r15); the caller already assumes that the volatile registers may be clobbered across the call to SysThreadSwitch(). By removing the storing/restoring of the volatile registers, we profile again and get the following output:
58 SCHEDULING at Process.pas:1141
47 WRITE_PORTB at Arch.pas:460
10 RECVFROM at VirtIOBus.pas:166
8 SYSTHREADSWITCH at Process.pas:1228
We can see that the function dominating the running time is now different. We have effectively reduced the duration of the SysThreadSwitch function by removing the unnecessary storing and restoring of registers.
You can watch the whole video at https://youtu.be/DXmjZFIB_N0 and you can also give Qprofiler a try by cloning it from https://github.com/torokernel/qprofiler.
Thursday, August 18, 2022
Recent improvement in accessing Per-CPU variables
In Toro, all kernel data structures are per-CPU variables. This is because each instance of the kernel on each core is independent from the others. This also has the benefit that access to kernel data structures is lock-less, thus avoiding the use of spin-locks and their contention.
For example, each core has an entry in the CPU[] array that contains information about the threads running on that core. Each core also has the DedicateFilesystem[] array with information about the filesystems that the core is allowed to access.
In general, when a core wants to access a per-CPU variable, it first gets the core id by calling the GetApicId() function, which returns the id of the lapic. Then, it uses that value as an index into a per-CPU array, e.g., CPU[id]. The whole operation requires two steps. Moreover, getting the lapic id requires an access to the memory-mapped region of the lapic.
In recent work, we improved the access to per-CPU variables by using the %gs register of each core. This is an improvement that has been used in Linux for a long time (see https://lwn.net/Articles/198184/). The access to a per-CPU variable can be done in a single instruction if we use the %gs register to keep a pointer to an array of the per-CPU variables. Then, we only require an offset to find the entry of the variable that we are looking for. This is faster than issuing GetApicId() and then using the result as an index. To get the value of a per-CPU variable in the %rax register, the function becomes a single assembler instruction:
mov %rax, %gs:offset
This is implemented in Toro by relying on a single GDT that is shared among all the cores. At boot time, we reserve a number of descriptors that are used for the %gs descriptor of each core. Each core loads into %gs the corresponding descriptor by relying on its lapic id. Through this descriptor, each core points %gs at an array of pointers, one for each per-CPU variable, e.g., CPU, CurrentThread, CoreId, etc. Each variable is represented by a different offset, i.e., a different entry in the table.
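As an illustration, the following sketch in C (not Toro's actual Pascal code) shows how such a gs-relative read can be expressed, assuming that the CoreId value itself is stored at a hypothetical COREID_OFFSET in the per-core table pointed to by %gs:

#include <stdint.h>

#define COREID_OFFSET 16   /* hypothetical offset of the CoreId entry in the per-CPU table */

/* Read a per-CPU variable through %gs in a single instruction: %gs points to
   the per-core table set up at boot, so no lapic access and no array indexing
   is needed (unlike the GetApicId() + CPU[id] path). */
static inline uint64_t GetCoreId(void)
{
    uint64_t id;
    __asm__ volatile ("movq %%gs:%c1, %0" : "=r"(id) : "i"(COREID_OFFSET));
    return id;
}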
But how fast is this improvement? I've compared GetApicId() with GetCoreId(): the former relies on the lapic whereas the latter relies on the per-CPU variable. I've measured a 10x improvement when using the per-CPU variable. This only tests a single invocation of the function; since the function is heavily used in the code, I expect a bigger overall improvement.