A dedicated unikernel for microservices

Friday, May 12, 2023

Virtio-vsock driver speeds up transmission by making one copy

This post explains the changes that reduce to one the number of copies required to send data over AF_VSOCK. The current mechanism to send packets through virtio-vsock makes two copies before sending them through the virtio-vsock device. The first copy is from the user's application to the socket buffer: SysSocketSend() copies the content from the user into a Packet structure that contains the content and the vsock header. The packet is then handed to the driver through the SysNetworkSend() function, which only accepts packet structures. When the driver gets the packet, the second copy is from the packet to the corresponding virtqueue's ring, i.e., the available ring of the transmission queue of the virtio-vsock device. During this second copy, we get a free pre-allocated buffer, copy the content into it and then enqueue it into the available ring.
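To make the two copies explicit, here is a rough sketch in C of the transmission path described above. The actual kernel is written in FreePascal, so the structures and helper names below are only illustrative, not Toro's real code:

    #include <string.h>

    #define PACKET_SIZE 4096                        /* hypothetical payload size */

    struct vsock_hdr { unsigned int src_port, dst_port, len; };   /* illustrative */

    struct Packet {
        struct vsock_hdr hdr;                       /* vsock header built by the kernel */
        char             data[PACKET_SIZE];         /* payload copied from the user */
        unsigned int     len;
    };

    /* Copy #1: SysSocketSend() copies the user's buffer into the Packet. */
    void socket_send(struct Packet *pkt, const char *user_buf, unsigned int len)
    {
        pkt->len = len;
        memcpy(pkt->data, user_buf, len);
    }

    /* Copy #2: the driver copies the payload into a pre-allocated virtqueue
     * buffer, which is then enqueued into the available ring. */
    void driver_send(char *vq_buffer, const struct Packet *pkt)
    {
        memcpy(vq_buffer, pkt->data, pkt->len);
        /* ... enqueue vq_buffer into the available ring of the tx queue ... */
    }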
To drop the number of copies to one, we remove the need for the second copy by reusing the packet's buffer in the virtio-vsock driver. This modification is tracked in the GitHub issue https://github.com/torokernel/torokernel/issues/363. Instead of allocating a new buffer, the virtio-vsock driver reuses the packet's buffer. Note that the available ring contains indexes that identify descriptors in the descriptor ring; we make the descriptor point to the packet's buffer. After the buffer has been consumed by the device, the driver tells the kernel that the packet has been sent and the kernel simply releases the memory allocated for that packet. The mechanism to send packets does not wait for the packet to be sent: it fills the available ring of the transmission queue with packets and returns immediately to the user.
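As a contrast with the sketch above, the one-copy path could look as follows: copy #2 disappears because the descriptor is made to point directly at the packet's buffer. Again, the names are illustrative and the Packet structure is the one from the previous sketch:

    struct vq_desc { void *addr; unsigned int len; };   /* illustrative descriptor */

    /* The driver no longer copies: the descriptor points to the packet's buffer.
     * The packet is only released after the device has consumed the buffer. */
    void driver_send_one_copy(struct vq_desc *desc, struct Packet *pkt)
    {
        desc->addr = pkt->data;
        desc->len  = pkt->len;
        /* ... push the descriptor index into the available ring and return
         * immediately; the kernel frees the Packet once it has been sent ... */
    }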
We have benchmarked these changes but they do not show significant improvements. We use ab to compare two instances of the WebServer appliance, an appliance that runs a web server over virtio-vsock and fetches files by using virtio-fs. This is the appliance that hosts Toro's website. For the benchmark, we use the following command:

ab -c 1 -n 1000 http://127.0.0.1:4000/images/apporikernel.png (133 kb)

In this test, the requests per second is 2.27 [#/sec] (mean) without the changes whereas it is 2.28 [#/sec] (mean) with the changes. The time per request is 440.995 [ms] (mean) without the changes whereas it is 438.213 [ms] (mean) with them. During this benchmark, we use socat to translate TCP connections to vsock connections.

We then removed socat and launched a ping-pong test with different packet sizes. These are the results with the changes:
request size:  8192 bytes, time:  1138.86 ms
request size:  16384 bytes, time:  1169.71 ms

These are the results without the changes:
request size:  8192 bytes, time:  1131.54 ms
request size:  16384 bytes, time:  1165.18 ms

For some reason, the instance that makes two copies takes less time than the instance with a single copy. We need more benchmarks to understand why this improvement has no effect on these tests. The next step would be to achieve zero copy by reusing the user's buffer, which is how the virtio-fs driver works: the user's buffer is shared with the device.

Wednesday, January 11, 2023

Running MPI applications as a unikernel

This post aims at giving some context to the presentation I will give at FOSDEM in the HPC room. Recently, I have been working on implementing MPI functions on top of the Toro unikernel. The goal of this PoC is to understand the benefits of running an MPI application on top of a highly parallel unikernel like Toro. The post first presents what a unikernel is, then it explains why Toro is different from other unikernels, and it finishes by presenting the benchmark I am using to compare the current implementation as well as future work.

A unikernel is a minimalistic operating system that is compiled together with the user application and aims at reducing OS interference. The kernel and the user application share the same view of the memory. The communication between the kernel and the application happens through simple function calls instead of syscalls. Both the application and the kernel run in the most privileged ring, i.e., ring 0 on x86. There are several existing unikernels like OSv, NanoVM, MirageOS and Unikraft. These projects focus on compiling a POSIX application within the unikernel so that the resulting binary can be deployed without the need of an OS.

Toro is a non-POSIX unikernel that requires modifying the application. Most of the functions are named like those in the POSIX standard but the implementation is slightly different. Note that deploying an application as a VM impacts its performance. A unikernel allows MPI applications to run closer to the hardware, thus reducing the performance degradation due to virtualization.

Toro runs an instance of the kernel per core and most of the kernel data is per-CPU. When a thread requires memory, it allocates it from a pool that is local to the CPU on which the thread runs. That pool is created during initialization by allocating a fixed amount of memory per core. The scheduler is cooperative and is based on the SysThreadSwitch() function, which yields the current processor from the current thread to the scheduler. Since the scheduler is not preemptive, most race conditions on kernel data can be avoided by simply disabling irqs when needed.
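For instance, a thread that must wait for an event does not get preempted; it polls and explicitly hands the core back to the scheduler. A minimal sketch in C of that pattern, where only SysThreadSwitch() is a name taken from this post (the kernel itself is written in FreePascal and the rest is illustrative):

    /* Yields the current core from the running thread to the cooperative scheduler. */
    extern void SysThreadSwitch(void);

    /* Cooperative wait: poll a flag and yield the core between checks. */
    void wait_for_flag(volatile int *flag)
    {
        while (!*flag)
            SysThreadSwitch();   /* the scheduler resumes this thread later */
    }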

An MPI application is deployed on several nodes and executes in parallel. Among these instances, there is a root instance that acts as the conductor of the orchestra. The instances require mechanisms to synchronise and communicate. For example, the MPI_Barrier() function acts as a barrier at which every instance waits until all instances have reached it; instances can cross the barrier only after all the others have arrived. MPI_Send() and MPI_Recv() are functions to send and receive data among the different instances. These primitives are used to build more complex functions like MPI_Bcast() and MPI_Reduce(). The former broadcasts data from the root node to all the other instances and the latter collects data at the root for processing by using a reduction function like SUM or MIN.
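To make the roles of these primitives concrete, the following small program uses the standard MPI C API (Toro's implementation exposes similar, but not identical, calls): the root broadcasts a value to every instance and then the values are collected back at the root with a SUM reduction.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0, sum = 0;
        const int root = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == root)
            value = 42;

        MPI_Barrier(MPI_COMM_WORLD);                             /* wait for all instances */
        MPI_Bcast(&value, 1, MPI_INT, root, MPI_COMM_WORLD);     /* root -> all            */
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, root,
                   MPI_COMM_WORLD);                              /* all -> root            */

        if (rank == root)
            printf("sum = %d\n", sum);

        MPI_Finalize();
        return 0;
    }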

In Toro, each instance of the MPI application is executed by a thread in a dedicated core. The number of instances is thus equal to the number of available physical cores. The threads execute without kernel interference until they require IO.  

To benchmark the MPI implementation in Toro, we have chosen the OSU microbenchmarks (see https://mvapich.cse.ohio-state.edu/benchmarks/). These are a set of MPI applications that are used to benchmark different implementations of the MPI standard. They are simple benchmarks, which makes porting them to Toro straightforward. In particular, we focus on the benchmarks that stress the MPI_Barrier(), MPI_Bcast() and MPI_Reduce() functions. These functions are benchmarked by relying on the following tests: osu_barrier, osu_bcast and osu_reduce.

The osu_barrier benchmark measures the time that MPI_Barrier() takes. This function makes the calling thread wait until it is able to cross the barrier: the instructions after the barrier are executed only after all threads have reached it and are thus allowed to cross it.

The osu_bcast and osu_reduce benchmarks require communication among the instances. The first one broadcasts packets from the root core to the others. The second one collects all the data in the root core for processing. Both benchmarks stress the communication with the root core.

The communication between instances relies on a dedicated device named virtio-bus, which is a mechanism to send data among cores by using virtqueues. Each core has a reception queue and a transmission queue for every other core in the system. For example, if core #0 wants to send a packet to core #1, it uses the transmission virtqueue that is dedicated to sending from core #0 to core #1. The MPI functions MPI_Send() and MPI_Recv() are built on top of this device.
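A hypothetical way to picture this layout in C: with NR_CPUS cores, each source core keeps one transmission queue per destination, so picking the queue is a direct lookup (the names are illustrative, the actual driver is written in FreePascal; the virtqueue structure itself is sketched below):

    #define NR_CPUS 4                      /* hypothetical number of cores */

    struct virtqueue;                      /* defined in the ring sketch below */

    /* One transmission queue per (source, destination) pair of cores. */
    struct virtqueue *tx_queue[NR_CPUS][NR_CPUS];

    static inline struct virtqueue *tx_queue_for(int src, int dst)
    {
        return tx_queue[src][dst];         /* queue dedicated to src -> dst traffic */
    }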

A virtqueue has three rings: the buffer ring, the available ring and the used ring. The buffer ring contains a set of buffers that are used to store the data. The available ring contains the buffer descriptors that have been produced, and the used ring contains the buffer descriptors that have already been consumed. In our case, the core that sends a packet is the producer and the destination core is the consumer. The current implementation does not rely on an irq to notify the destination core: the destination core has to poll the available ring to get a new packet. After that, it puts the buffer in the used ring.
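A minimal sketch of how the destination core could poll its reception queue, assuming a ring of 256 entries and the hypothetical structures below (Toro's actual virtio-bus driver is written in FreePascal):

    #include <stddef.h>

    #define RING_SIZE 256                            /* hypothetical ring size */
    #define BUF_SIZE  4096                           /* hypothetical buffer size */

    struct ring {
        volatile unsigned short idx;                 /* next slot the producer fills */
        unsigned short          desc[RING_SIZE];     /* descriptor indexes */
    };

    struct virtqueue {
        char           buffers[RING_SIZE][BUF_SIZE]; /* buffer ring: the data itself */
        struct ring    avail;                        /* filled by the sending core */
        struct ring    used;                         /* filled back by the destination core */
        unsigned short last_seen;                    /* last avail entry consumed */
    };

    /* Poll the available ring: return the buffer of a new packet, or NULL. */
    char *poll_rx_queue(struct virtqueue *vq, unsigned short *desc_id)
    {
        if (vq->last_seen == vq->avail.idx)
            return NULL;                             /* nothing new, keep polling */
        *desc_id = vq->avail.desc[vq->last_seen++ % RING_SIZE];
        return vq->buffers[*desc_id];
    }

    /* Hand the consumed buffer back to the sender through the used ring. */
    void return_buffer(struct virtqueue *vq, unsigned short desc_id)
    {
        vq->used.desc[vq->used.idx % RING_SIZE] = desc_id;
        vq->used.idx = vq->used.idx + 1;
    }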

Let's see what happens when we run the osu_barrier.c benchmark in a VM with three cores in a host with two physical cores (2.4 GHz). We use QEMU microvm as a device model. This is a cheap host in the OVH cloud. The sketch of the benchmark in C is the following:

    /* measure the latency of the barrier over 100 iterations */
    for (i = 0; i < 100; i++) {
        start = Mpi_Wtime();
        Mpi_Barrier(MPI_COMM_WORLD);     /* wait until every instance reaches the barrier */
        end = Mpi_Wtime();
        sum += (int)(end - start);
    }
    sum /= 100;                          /* average latency of this instance */
    /* gather the min, max and sum of the per-instance averages on the root */
    Mpi_Reduce(&sum, &min_time, 1, MPI_MIN, root);
    Mpi_Reduce(&sum, &max_time, 1, MPI_MAX, root);
    Mpi_Reduce(&sum, &avg_time, 1, MPI_SUM, root);

The benchmark measures the latency of MPI_Barrier() and then outputs the min, max and average values among the different instances:

MPI_BARRIER: min_time: 84 cycles, max_time: 1144877 cycles, avg_time: 605571 cycles

The time is measured in CPU cycles, which corresponds to about 34 ns for the min_time. In this simple test, we observe huge variations in the max and average measures. While two of the threads output values around the min_time, a third thread is responsible for the large variation in the max and average. The reason may be the use of more VCPUs than physical CPUs, which prevents the threads from running in parallel. Also, this host in the OVH cloud might itself be deployed as a nested host, thus impacting performance.

These are still early results that need to be rerun on a physical host with more cores and with a larger number of tests in order to compare with existing implementations. The idea is to complete these tests and to show more concrete examples during the presentation at FOSDEM. I hope to see you there!

Toro kernel is open source on GitHub: https://github.com/torokernel/torokernel
Follow me on Twitter: https://twitter.com/ToroKernel