This post aims to give some context for the presentation I will give at FOSDEM in the HPC room. Recently, I have been working on implementing MPI functions on top of the Toro unikernel. The goal of this PoC is to understand the benefits of running an MPI application on top of a highly parallel unikernel like Toro. The post starts by presenting what a unikernel is, then explains why Toro is different from other unikernels, and finishes by presenting the benchmark I am using to evaluate the current implementation and the future work.
A unikernel is a minimalistic operating system that is compiled together with the user application and aims at reducing OS interference. The kernel and the user application share the same view of the memory. The communication between the kernel and the application happens through simple function calls instead of syscalls. Both the application and the kernel run in the most privileged ring, i.e., ring 0 on x86. There are several existing unikernels like OSv, NanoVM, MirageOS and Unikraft. These projects focus on allowing a POSIX application to be compiled within the unikernel. The resulting binary is thus deployed without the need for an OS.
Toro is a non-POSIX unikernel that requires modifying the application. Most of the functions are named like those in the POSIX standard, but the implementation is slightly different. Note that deploying an application as a VM impacts its performance. A unikernel allows MPI applications to run closer to the hardware, thus reducing the performance degradation due to virtualization.
Toro runs one instance of the kernel per core, and most of the kernel data is per-CPU. When a thread requires memory, it allocates it from a pool that is local to the CPU on which the thread runs. That pool is created during initialization by allocating a fixed amount of memory per core. The scheduler is cooperative and is based on the SysThreadSwitch() function, which yields the current processor from the current thread to the scheduler. Since the scheduler is not preemptive, most race conditions on kernel data can be avoided by simply disabling IRQs when needed.
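For instance, a thread that waits for a condition is never preempted; it has to yield the core explicitly. The following C-style sketch only illustrates this idea; the signature of SysThreadSwitch() and the surrounding code are assumptions, not Toro's actual kernel code.

/* C-style sketch of a cooperative wait: the thread polls a condition and
 * explicitly hands the core back to the scheduler between checks. */
extern void SysThreadSwitch(void);   /* yields the current core to the scheduler */

void wait_until_ready(volatile int *ready)
{
    while (*ready == 0) {
        /* No preemption: nothing else runs on this core until we yield */
        SysThreadSwitch();
    }
}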
An MPI application is deployed on several nodes and executes in parallel. Among these instances, the root instance acts as the conductor of the orchestra. These instances require mechanisms to synchronise and to communicate. For example, the MPI_Barrier() function acts as a barrier at which every instance waits until all instances reach it. Instances can cross the barrier only after all the other instances have reached it. MPI_Send() and MPI_Recv() are functions to send and receive data among the different instances. These primitives are used to build more complex functions like MPI_Bcast() and MPI_Reduce(). The former broadcasts data from the root node to all the other instances, and the latter sends the data of every instance to the root, which combines it by using a reduction function like SUM or MIN.
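To illustrate how the basic primitives can be combined into a collective, here is a minimal sketch of a naive linear broadcast built only from MPI_Send() and MPI_Recv(); this is just one possible scheme and not necessarily how Toro implements MPI_Bcast().

/* Illustrative sketch: a naive linear broadcast built only from
 * MPI_Send()/MPI_Recv(), as one possible way to implement a broadcast. */
#include <mpi.h>

void naive_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    int rank, size;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* The root sends the buffer to every other instance, one by one */
        for (int i = 0; i < size; i++)
            if (i != root)
                MPI_Send(buf, count, type, i, 0, comm);
    } else {
        /* Every other instance receives the buffer from the root */
        MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
    }
}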
In Toro, each instance of the MPI application is executed by a thread on a dedicated core. The number of instances is thus equal to the number of available physical cores. The threads execute without kernel interference until they require I/O.
To benchmark the MPI implementation in Toro, we have chosen the OSU microbenchmarks (see https://mvapich.cse.ohio-state.edu/benchmarks/). These are a set of MPI applications used to benchmark different implementations of the MPI standard. They are simple benchmarks, so porting them to Toro is straightforward. In particular, we focus on the benchmarks that stress the MPI_Barrier(), MPI_Bcast() and MPI_Reduce() functions, namely the osu_barrier, osu_bcast and osu_reduce tests.
The osu_barrier benchmark measures the time that MPI_Barrier() takes. This function makes the calling thread wait until it is able to cross the barrier. The instructions after the barrier are executed only once all threads have reached the barrier and are thus allowed to cross it.
The osu_bcast and osu_reduce benchmarks require communication among the instances. The first broadcasts packets from the root core to all the other cores; the second collects data from all cores at the root core for processing. Both benchmarks concentrate the communication on the root core.
The communication between instances relies on a dedicated device named virtio-bus, which is a mechanism to send data among cores by using virtqueues. Each core has a reception queue and a transmission queue for every other core in the system. For example, if core #0 wants to send a packet to core #1, it uses the transmission virtqueue that is dedicated to sending from core #0 to core #1. The MPI functions MPI_Send() and MPI_Recv() are built on top of this device.
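As an illustration of this layout, one way to organize the queues is a per-core table indexed by the peer core. The structure and the virtqueue_push() helper below are hypothetical and only show the idea, not Toro's actual code.

/* Hypothetical sketch: each core owns one TX and one RX virtqueue per peer
 * core; the names and the virtqueue_push() helper are illustrative only. */
#define MAX_CORES 64

struct virtqueue;                      /* opaque queue handle */

struct core_queues {
    struct virtqueue *tx[MAX_CORES];   /* tx[j]: send from this core to core j */
    struct virtqueue *rx[MAX_CORES];   /* rx[j]: receive from core j */
};

static struct core_queues queues[MAX_CORES];

/* Enqueue a buffer descriptor in a virtqueue (assumed helper) */
void virtqueue_push(struct virtqueue *vq, const void *buf, int len);

/* Send a buffer from core src to core dst over the dedicated TX queue,
 * e.g. from core #0 to core #1: bus_send(0, 1, buf, len); */
void bus_send(int src, int dst, const void *buf, int len)
{
    virtqueue_push(queues[src].tx[dst], buf, len);
}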
A virtqueue has three rings: the buffer ring, the available ring and the used ring. The buffer ring contains a set of buffers that are used to store data. The available ring contains buffer descriptors produced by the backend, and the used ring contains buffer descriptors that have already been consumed by the frontend. In our case, the core that sends a packet is the producer and the destination core is the consumer. The current implementation does not rely on an IRQ to notify the destination core; instead, the destination core has to poll the available ring to get a new packet. After that, it puts the buffer back in the used ring.
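A minimal sketch of the receive path under this scheme could look like the following; the ring layout, the sizes and the names are assumptions and do not reflect Toro's actual implementation.

/* Hypothetical sketch of the polling receive path: the ring layout, the
 * sizes and the names are assumptions and do not reflect Toro's code. */
#include <string.h>

#define RING_SIZE 256

struct buffer_desc { void *data; int len; };

struct ring {
    struct buffer_desc *desc[RING_SIZE];
    volatile unsigned int head;     /* advanced by the producer */
    unsigned int tail;              /* advanced by the consumer */
};

struct vqueue { struct ring avail; struct ring used; };

int bus_recv(struct vqueue *vq, void *out, int maxlen)
{
    /* No IRQ: the destination core polls the available ring until the
     * sending core (the producer) publishes a new buffer descriptor. */
    while (vq->avail.tail == vq->avail.head)
        ;   /* busy-wait */

    struct buffer_desc *d = vq->avail.desc[vq->avail.tail % RING_SIZE];
    int len = d->len < maxlen ? d->len : maxlen;
    memcpy(out, d->data, len);
    vq->avail.tail++;

    /* Return the consumed buffer through the used ring */
    vq->used.desc[vq->used.head % RING_SIZE] = d;
    vq->used.head++;
    return len;
}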
Let's see what happens when we run the osu_barrier.c benchmark in a VM with three cores on a host with two physical cores (2.4 GHz). We use QEMU microvm as the device model. This is a cheap host in the OVH cloud. A sketch of the benchmark in C is shown below.
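This is a simplified reconstruction of osu_barrier; the rdtsc() helper, the iteration count and the output formatting are illustrative rather than the exact original source.

/* Simplified sketch of osu_barrier: the barrier latency is measured in CPU
 * cycles; the rdtsc() helper and the iteration count are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

#define ITERATIONS 1000

static uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    uint64_t total = 0, latency, min_time, max_time, sum_time;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Measure the time spent in MPI_Barrier() over many iterations */
    for (int i = 0; i < ITERATIONS; i++) {
        uint64_t start = rdtsc();
        MPI_Barrier(MPI_COMM_WORLD);
        total += rdtsc() - start;
    }
    latency = total / ITERATIONS;

    /* Gather the min, max and average latency at the root instance */
    MPI_Reduce(&latency, &min_time, 1, MPI_UINT64_T, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&latency, &max_time, 1, MPI_UINT64_T, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&latency, &sum_time, 1, MPI_UINT64_T, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_BARRIER: min_time: %llu cycles, max_time: %llu cycles, avg_time: %llu cycles\n",
               (unsigned long long)min_time, (unsigned long long)max_time,
               (unsigned long long)(sum_time / nprocs));

    MPI_Finalize();
    return 0;
}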
The benchmark measures the latency of MPI_Barrier() and then outputs the min, max and average values among the different instances:
MPI_BARRIER: min_time: 84 cycles, max_time: 1144877 cycles, avg_time: 605571 cycles
The time is measured in CPU cycles; at 2.4 GHz, the min_time of 84 cycles corresponds to roughly 35 ns. In this simple test, we observe huge variations in the max and average measures. While two of the threads output values around the min_time, a third thread is responsible for the large variation in the max and average measures. The reason may be the use of more VCPUs than physical CPUs, which prevents the threads from running in parallel. Also, this host in the OVH cloud might be deployed as a nested host, which further impacts performance.
These are still early results that need to be rerun on a physical host with more cores and with a larger number of runs in order to compare with existing implementations. The idea is to complete these tests and to show more concrete results during the presentation at FOSDEM. I hope to see you there!
Follow me on Twitter: https://twitter.com/ToroKernel