Toro kernel

A dedicated unikernel for microservices

Friday, May 12, 2023

Virtio-vsock driver speeds up transmission by making only one copy

This post explains the improvements that drop the number of copies required to send data over AF_VSOCK to one. The current mechanism to send packets through virtio-vsock makes two copies before handing the data to the virtio-vsock device. The first copy is from the user's application to the socket buffer: SysSocketSend() copies the content from the user into a Packet structure, which contains the content together with the vsock header. The packet is then passed to the driver through the SysNetworkSend() function, which only accepts packet structures. When the driver gets the packet, the second copy is from the packet to the corresponding virtqueue's ring, i.e., the available ring of the transmission queue of the virtio-vsock device. During this second copy, we get a free pre-allocated buffer, copy the content of the packet into it and then enqueue it into the available ring.
To drop the number to one, we remove the need for the second copy by reusing the packet's buffer in the virtio-vsock driver. This modification has been addressed in the GitHub issue https://github.com/torokernel/torokernel/issues/363. Instead of allocating a new buffer, the virtio-vsock driver reuses the packet's buffer. Note that the available ring contains indexes that identify descriptors in the descriptor ring; we make the descriptor point directly to the packet's buffer. After the buffer is consumed by the device, the driver tells the kernel that the packet has been sent and the kernel simply releases the memory allocated for that packet. The mechanism to send packets does not wait for the packet to be sent: it fills the available ring of the transmission queue with packets and returns immediately to the user.
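
To illustrate the idea, here is a minimal sketch in C of what the one-copy transmit path could look like. The structures and names below are simplifications for illustration only; they are not the actual Toro code, which is written in FreePascal, and the notification of the device is omitted.

    #include <stddef.h>
    #include <stdint.h>

    struct packet {
        void  *data;          /* vsock header followed by the payload copied from the user */
        size_t size;
    };

    struct desc {
        void  *addr;
        size_t len;
    };

    struct virtqueue {
        struct desc *descs;   /* descriptor ring */
        uint16_t    *avail;   /* available ring: indexes into the descriptor ring */
        uint16_t     avail_idx;
        uint16_t     num;     /* number of entries in the rings */
    };

    /* One-copy transmit: instead of copying the packet into a pre-allocated
       buffer, point a free descriptor directly at the packet's buffer. */
    static void vsock_tx_enqueue(struct virtqueue *vq, struct packet *pkt)
    {
        uint16_t id = vq->avail_idx % vq->num;   /* simplified free-descriptor choice */

        vq->descs[id].addr = pkt->data;          /* reuse the packet's buffer: no copy */
        vq->descs[id].len  = pkt->size;
        vq->avail[vq->avail_idx % vq->num] = id; /* publish the descriptor */
        vq->avail_idx++;
        /* return immediately; when the device reports the buffer as used,
           the kernel frees the memory allocated for the packet */
    }
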
We have benchmarked these changes but the new mechanism does not show significant improvements. We use ab to compare two instances of the WebServer appliance, an appliance that runs a web server over virtio-vsock and reads files by using virtio-fs. This is the appliance that hosts Toro's website. For the benchmark, we use the following command:

ab -c 1 -n 1000 http://127.0.0.1:4000/images/apporikernel.png (133 kb)

In this test, the number of requests per second is 2.27 [#/sec] (mean) without the changes whereas it is 2.28 [#/sec] (mean) with the changes. The time per request is 440.995 [ms] (mean) without the changes whereas it is 438.213 [ms] (mean) with them. During this benchmark, we use socat to translate TCP connections into vsock connections.
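
For reference, the TCP-to-vsock bridge can be set up with a socat command along the following lines; the guest CID (3) and vsock port (80) are placeholders and not necessarily the values used in this benchmark, and this requires a socat build with AF_VSOCK support:

    socat TCP-LISTEN:4000,fork,reuseaddr VSOCK-CONNECT:3:80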

We then removed socat and launched a ping-pong test with different packet sizes. These are the results with the changes:
request size:  8192 bytes, time:  1138.86 ms
request size:  16384 bytes, time:  1169.71 ms

These are the results without the changes:
request size:  8192 bytes, time:  1131.54 ms
request size:  16384 bytes, time:  1165.18 ms

For some reason, the instance that makes two copies takes slightly less time than the instance that makes a single copy. We need more benchmarks to understand why the improvement has no effect on these tests. The next step would be to achieve zero copy by reusing the user's buffer directly. This is how it is already done in the virtio-fs driver, in which the user's buffer is shared with the device.

Wednesday, January 11, 2023

Running MPI applications as a unikernel

This post aims at giving some context to the presentation I will give at FOSDEM in the HPC room. Recently, I have been working on implementing MPI functions on top of the Toro unikernel. The goal of this PoC is to understand the benefits of running an MPI application on top of a highly parallel unikernel like Toro. This post first presents what a unikernel is, then it explains why Toro is different from other unikernels. I finish by presenting the benchmark I am using to evaluate the current implementation, and the future work.

A unikernel is a minimalistic operating system that is compiled together with the user application and aims at reducing OS interference. The kernel and the user application share the same view of the memory. The communication between the kernel and the application uses simple function calls instead of syscalls. Both the application and the kernel run in the most privileged ring, i.e., ring 0 on x86. There are several existing unikernels like OSv, NanoVM, MirageOS and Unikraft. These projects focus on allowing a POSIX application to be compiled within the unikernel. The resulting binary is thus deployed without the need of an OS.

Toro is a non-POSIX unikernel that requires modifying the application. Most of the functions are named like those in the POSIX standard but the implementation is slightly different. Note that deploying an application as a VM impacts its performance. A unikernel allows MPI applications to run closer to the hardware, thus reducing the performance degradation due to virtualization.

Toro runs an instance of the kernel per core and most of the kernel data is per-CPU. When a thread requires memory, it allocates it from a pool that is local to the CPU on which the thread runs. That pool is created during initialization by reserving a fixed amount of memory per core. The scheduler is cooperative and is based on the SysThreadSwitch() function, which yields the current processor from the current thread to the scheduler. The scheduler is not preemptive. As a result, most race conditions on kernel data are avoided by simply disabling IRQs when needed.
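
To give an idea of why this removes the need for locking, the following is a hedged C sketch of a per-core memory pool; the names, sizes and the bump-pointer strategy are assumptions made for illustration and do not reflect Toro's actual allocator, which is written in FreePascal.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_CORES 8
    #define POOL_SIZE (4 * 1024 * 1024)   /* fixed amount reserved per core at boot */

    struct pool {
        uint8_t mem[POOL_SIZE];           /* memory reserved for this core */
        size_t  next;                     /* bump pointer */
    };

    /* One pool per core: a thread only touches the pool of the core it runs on,
       and the scheduler is cooperative, so no spin-lock is needed. */
    static struct pool pools[MAX_CORES];

    static void *local_alloc(unsigned core, size_t size)
    {
        struct pool *p = &pools[core];

        if (p->next + size > POOL_SIZE)
            return NULL;                  /* pool exhausted */
        void *ptr = &p->mem[p->next];
        p->next += size;
        return ptr;
    }
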

An MPI application is deployed on several nodes and its instances execute in parallel. Among these instances, there is a root instance that acts as the conductor of the orchestra. The instances require mechanisms to synchronise and to communicate. For example, the MPI_Barrier() function acts as a barrier at which every instance waits until all instances have reached it; instances can cross the barrier only after all the other instances have arrived. MPI_Send() and MPI_Recv() are functions to send and receive data among the different instances. These primitives are used to build more complex functions like MPI_Bcast() and MPI_Reduce(). The former broadcasts data from the root node to all the other instances, and the latter sends data from every instance to the root for processing by using a reduction function like SUM or MIN.
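
For readers less familiar with MPI, the small program below illustrates these primitives using the standard MPI C API; note that Toro's own functions keep similar names but slightly different signatures, as the benchmark sketch later in this post shows.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            value = 42;                                    /* only the root knows the value */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root broadcasts it to everyone */

        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); /* everyone sends to the root */
        MPI_Barrier(MPI_COMM_WORLD);                       /* all instances wait for each other */

        if (rank == 0)
            printf("value: %d, sum of ranks: %d\n", value, sum);

        MPI_Finalize();
        return 0;
    }
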

In Toro, each instance of the MPI application is executed by a thread in a dedicated core. The number of instances is thus equal to the number of available physical cores. The threads execute without kernel interference until they require IO.  

To benchmark the MPI implementation in Toro, we have chosen the OSU microbenchmarks (see https://mvapich.cse.ohio-state.edu/benchmarks/). These are a set of MPI applications that are used to benchmark different implementations of the MPI standard. They are simple benchmarks, which makes porting them to Toro straightforward. In particular, we focus on the benchmarks that stress the MPI_Barrier(), MPI_Bcast() and MPI_Reduce() functions. These functions are benchmarked by relying on the following tests: osu_barrier, osu_bcast and osu_reduce.

The osu_barrier benchmark measures the time that MPI_Barrier() takes. This function makes the calling thread wait until it is able to cross the barrier. The instructions after the barrier are executed only after all threads have reached it and are thus allowed to cross it.

The osu_bcast and osu_reduce benchmarks require communication among the instances. The first broadcasts packets from the root core to the other cores; the second collects all the data in the root core for processing. Both benchmarks stress the communication with the root core.

The communication between instances relies on a dedicated device named virtio-bus, which is a mechanism to send data among cores by using virtqueues. Each core has a reception queue and a transmission queue for every other core in the system. For example, if core #0 wants to send a packet to core #1, it uses the transmission virtqueue that is specific to the path from core #0 to core #1. The MPI functions MPI_Send() and MPI_Recv() are built on top of this device.

A virtqueue has three rings: the buffer ring, the available ring and the used ring. The buffer ring contains a set of buffers that are used to store data. The available ring contains buffer descriptors that have been produced by the sending core, and the used ring contains buffer descriptors that have already been consumed by the receiving core. In other words, the core that sends a packet is the producer and the destination core is the consumer. The current implementation does not rely on an irq to notify the destination core: the destination core has to poll the available ring to get a new packet. After that, it puts the buffer back in the used ring.
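
The following is a hedged C sketch of this core-to-core path; the queue layout, sizes and names are assumptions made for illustration and do not mirror the actual implementation in VirtIOBus.pas (memory-ordering and volatile qualifiers are also omitted for brevity).

    #include <stddef.h>
    #include <stdint.h>

    #define NR_CORES 4
    #define RING_LEN 64

    struct virtqueue {
        void    *bufs[RING_LEN];      /* buffer ring: pre-allocated buffers */
        uint16_t avail[RING_LEN];     /* available ring: filled by the sending core */
        uint16_t used[RING_LEN];      /* used ring: filled by the receiving core */
        uint16_t avail_idx;           /* next slot the sender will fill */
        uint16_t avail_seen;          /* next slot the receiver will consume */
        uint16_t used_idx;            /* next slot in the used ring */
    };

    /* One transmission queue per (source, destination) pair of cores. */
    static struct virtqueue tx[NR_CORES][NR_CORES];

    /* Sender side, e.g., the core behind MPI_Send(): publish a buffer. */
    static void bus_send(unsigned src, unsigned dst, uint16_t buf_id)
    {
        struct virtqueue *vq = &tx[src][dst];

        vq->avail[vq->avail_idx % RING_LEN] = buf_id;
        vq->avail_idx++;
        /* no irq is raised: the destination core polls the available ring */
    }

    /* Receiver side, e.g., the core behind MPI_Recv(): poll for a buffer. */
    static uint16_t bus_recv(unsigned src, unsigned dst)
    {
        struct virtqueue *vq = &tx[src][dst];

        while (vq->avail_seen == vq->avail_idx)
            ;                                         /* busy-wait until a packet arrives */
        uint16_t buf_id = vq->avail[vq->avail_seen % RING_LEN];
        vq->avail_seen++;
        vq->used[vq->used_idx % RING_LEN] = buf_id;   /* hand the buffer back via the used ring */
        vq->used_idx++;
        return buf_id;
    }
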

Let's see what happens when we run the osu_barrier.c benchmark in a VM with three cores on a host with two physical cores (2.4 GHz). We use QEMU microvm as the device model. This is a cheap host in the OVH cloud. The sketch of the benchmark in C is the following:

    /* variables such as i, sum, min_time, max_time, avg_time and root are
       assumed to be declared and initialized earlier in the benchmark */
    for (i = 0; i < 100; i++) {
        start = Mpi_Wtime();              /* timestamp before the barrier */
        Mpi_Barrier(MPI_COMM_WORLD);      /* wait until all instances arrive */
        end = Mpi_Wtime();                /* timestamp after crossing it */
        sum += (int)(end - start);
    }
    sum /= 100;                           /* average latency for this instance */
    Mpi_Reduce(&sum, &min_time, 1, MPI_MIN, root);   /* minimum among instances */
    Mpi_Reduce(&sum, &max_time, 1, MPI_MAX, root);   /* maximum among instances */
    Mpi_Reduce(&sum, &avg_time, 1, MPI_SUM, root);   /* summed, then averaged at the root */

The benchmark measures the latency of MPI_Barrier() and then outputs the min, max and average values among the different instances:

MPI_BARRIER: min_time: 84 cycles, max_time: 1144877 cycles, avg_time: 605571 cycles

The time is measured in CPU cycles, which corresponds to about 34 ns for the min_time. In this simple test, we observe huge variations in the max and average measures. While two of the threads output values around the min_time, a third thread is responsible for the large variation in the max and average measures. The reason may be the use of more VCPUs than physical CPUs, which prevents the threads from running in parallel. Also, this is a host in the OVH cloud that might be deployed as a nested host, which can also impact performance.

These are still early results that need to be rerun on a physical host with more cores and with a larger number of tests in order to compare with existing implementations. The idea is to complete these tests and show more concrete examples during the presentation at FOSDEM. I hope to see you there!

Toro kernel is open source on GitHub: https://github.com/torokernel/torokernel
Follow me on Twitter: https://twitter.com/ToroKernel

Monday, September 05, 2022

Profiling and Optimizing Toro unikernel with Qprofiler

Hello everyone, in a recent video (you can watch it at https://youtu.be/DXmjZFIB_N0), I presented how I use Qprofiler to profile and then optimize the Toro unikernel. I would like to elaborate on what this tool is and how to use it. You can read more about Qprofiler at http://torokerneleng.blogspot.com/2019/07/qprofiler-profiler-for-guests-in-qemukvm.html. It is a simple tool that samples a guest to find out what code the guest is running. By doing this over a period of time, we get a list of pointers together with counters that indicate the number of times the guest was at each position. This gives us a rough idea of where the guest spends most of its time and, in case the performance is not good, lets us optimize the functions that dominate the overall running time. The sampling does not require any instrumentation in the code; Qprofiler only requires the symbols to get the name of the function from a pointer in memory.
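
Conceptually, the sampling loop boils down to something like the hedged sketch below. Here read_guest_rip() and symbol_for() are hypothetical placeholders for "ask QEMU where the guest currently is" and "map a pointer to a function name using the symbols"; they are not Qprofiler's actual API.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical helpers, declared only: the real tool obtains the guest's
       instruction pointer through QEMU and resolves it with the kernel symbols. */
    uint64_t read_guest_rip(void);
    const char *symbol_for(uint64_t rip);

    struct sample { const char *name; unsigned hits; };

    void profile(int nsamples)
    {
        struct sample hist[128] = {{0}};

        for (int i = 0; i < nsamples; i++) {
            const char *name = symbol_for(read_guest_rip());
            for (int j = 0; j < 128; j++) {
                if (hist[j].name == NULL || strcmp(hist[j].name, name) == 0) {
                    hist[j].name = name;          /* count hits per function */
                    hist[j].hits++;
                    break;
                }
            }
            usleep(1000);                         /* sample period */
        }
        for (int j = 0; j < 128 && hist[j].name; j++)
            printf("%u %s\n", hist[j].hits, hist[j].name);
    }
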

In the video, we use Qprofiler to profile the example that shows how cores communicate by using VirtIO. When we run Qprofiler, we get the following output:

64 SYSTHREADSWITCH at Process.pas:1244 

35 WRITE_PORTB at Arch.pas:459 

33 SCHEDULING at Process.pas:1134 

24 GETGSOFFSET at Arch.pas:429

The output shows that most of the time the guest is switching context. Note that Qprofiler only samples core #0. In particular, we see in the video that the time is mostly spent storing the registers. During a context switch, the registers are first stored and then, when the thread is switched back, they are restored. This requires pushing all the registers onto the stack and then popping them all.

The current source code is a bit conservative and stores/restores all registers. However, we should only store/restore the registers that the x86-64 ABI requires the callee to preserve, i.e., the non-volatile registers (%rbx, %rbp, %r12 to %r15). By removing the storing/restoring of the remaining, volatile registers, we profile again and get the following output:

58 SCHEDULING at Process.pas:1141

47 WRITE_PORTB at Arch.pas:460

10 RECVFROM at VirtIOBus.pas:166

8 SYSTHREADSWITCH at Process.pas:1228

We can see that the function that dominates the running time is now different. We have effectively reduced the duration of the SysThreadSwitch function by removing the unnecessary storing and restoring of registers.

You can watch the whole video at https://youtu.be/DXmjZFIB_N0. You can also give Qprofiler a try by cloning it from https://github.com/torokernel/qprofiler.


Thursday, August 18, 2022

Recent improvement in accessing Per-CPU variables

In Toro, all kernel data structures are per-CPU variables. This is because each instance of the kernel on each core is independent from the others. This also has the benefit that access to kernel data structures is lock-less, thus preventing the use of spin-locks and their contention.

For example, each core has an entry in the CPU[] array that contains information about the threads running on that core. Each core also has the DedicateFilesystem[] array with information about the filesystems that the core is allowed to access.

In general, when a core wants to access a per-CPU variable, it first gets its core id by calling the GetApicId() function, which returns the id of the lapic. Then, it uses that value as an index into a per-CPU array, e.g., CPU[id]. The whole operation thus requires two steps. Moreover, getting the lapic id requires an access to the memory-mapped region of the lapic.
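
In a hedged C sketch (the real code is FreePascal and the structures differ), the original two-step access looks roughly like this, with GetApicId() standing for the read of the lapic id from its memory-mapped registers:

    #include <stdint.h>

    #define MAX_CORES 64

    struct cpu_info { void *current_thread; /* ... */ };

    static struct cpu_info CPU[MAX_CORES];

    /* Stub: in Toro this reads the id from the lapic's memory-mapped region. */
    static uint32_t GetApicId(void) { return 0; }

    static struct cpu_info *get_local_cpu(void)
    {
        uint32_t id = GetApicId();   /* step 1: which core are we running on? */
        return &CPU[id];             /* step 2: index the per-CPU array */
    }
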

In recent work, we improved the access to per-CPU variables by using the %gs register of each core. This is an improvement that has been used in Linux for a long time (see https://lwn.net/Articles/198184/). The access to a per-CPU variable can be done in one single instruction if we use %gs to keep a pointer to an array of the per-CPU variables. Then, we only need an offset to find the entry of the variable we are looking for. This is faster than issuing GetApicId() and then using its result as an index. To get the value of a per-CPU variable in the %rax register, the access becomes a single assembler instruction:

mov %gs:offset, %rax

This is implemented in Toro by relying on a single GDT that is shared among all the cores. At boot time, we reserve a number of descriptors that are used for the %gs descriptor of each core. Each core loads into %gs the corresponding descriptor by relying on its lapic id. Through this descriptor, each core's %gs points to an array of pointers, one for each per-CPU variable, e.g., CPU, CurrentThread, CoreId, etc. Each variable is represented by a different offset, i.e., a different entry in the table.
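
A hedged sketch of the table that %gs ends up pointing to; the exact layout and the offsets below are assumptions made for illustration:

    #include <stdint.h>

    /* Per-core table of pointers, one entry per per-CPU variable.
       %gs of each core points to its own copy of this table. */
    struct percpu_table {
        void *cpu;              /* offset 0x00: this core's entry in CPU[] */
        void *current_thread;   /* offset 0x08: thread currently running here */
        void *core_id;          /* offset 0x10: this core's id */
    };

    /* With %gs loaded, reading CurrentThread is a single load, e.g. in AT&T
       syntax:  mov %gs:0x08, %rax  */
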

But how fast is this improvement? I have compared GetApicId() with GetCoreId(): the former relies on the lapic whereas the latter relies on the per-CPU variable. I measured a 10x improvement when using the per-CPU variable. This only tests a single invocation of the function; since the function is heavily used in the code, I expect a bigger overall improvement.