A dedicated unikernel for microservices

Monday, September 05, 2022

Profiling and Optimizing Toro unikernel with Qprofiler

Hello everyone. In a recent video (you can watch it at https://youtu.be/DXmjZFIB_N0), I presented how I use Qprofiler to profile and then optimize the Toro unikernel. In this post, I would like to elaborate on what this tool is and how to use it. You can read more about Qprofiler at http://torokerneleng.blogspot.com/2019/07/qprofiler-profiler-for-guests-in-qemukvm.html. It is a simple tool that samples a guest to find out what code the guest is running. By doing this over a period of time, we get a list of pointers, each with a counter that indicates the number of times the guest was found at that position. This gives us a rough idea of where the guest spends most of its time and, if the performance is not good, lets us optimize the functions that dominate the overall running time. The sampling does not require any instrumentation of the code. Qprofiler only requires the symbols to map a pointer in memory to the name of a function.
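To make the idea concrete, here is a minimal sketch of how sampled pointers could be turned into per-function counters. It is not Qprofiler's code: the symbol structure, the function names and the addresses are all hypothetical, and a real profiler would read the symbols from the binary's debug information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol-table entry: a function name and its address range. */
struct symbol {
    const char *name;
    uint64_t start;   /* first address of the function */
    uint64_t end;     /* one past the last address of the function */
    unsigned hits;    /* number of samples that landed in this function */
};

/* Return the symbol that contains addr, or NULL if none does. */
struct symbol *lookup(struct symbol *syms, size_t nsyms, uint64_t addr)
{
    for (size_t i = 0; i < nsyms; i++) {
        if (addr >= syms[i].start && addr < syms[i].end)
            return &syms[i];
    }
    return NULL;
}

/* Attribute each sampled instruction pointer to a function and count it.
 * Returns the number of samples that could not be resolved. */
unsigned count_samples(struct symbol *syms, size_t nsyms,
                       const uint64_t *samples, size_t nsamples)
{
    unsigned unresolved = 0;
    for (size_t i = 0; i < nsamples; i++) {
        struct symbol *s = lookup(syms, nsyms, samples[i]);
        if (s)
            s->hits++;
        else
            unresolved++;
    }
    return unresolved;
}

/* Return the symbol with the most hits, i.e., the hottest function. */
struct symbol *hottest(struct symbol *syms, size_t nsyms)
{
    assert(nsyms > 0);
    struct symbol *best = &syms[0];
    for (size_t i = 1; i < nsyms; i++) {
        if (syms[i].hits > best->hits)
            best = &syms[i];
    }
    return best;
}
```

Sorting the symbols by their hits field would give a listing similar in shape to the profiler output shown in this post.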

In the video, we use Qprofiler to profile the example that shows how cores communicate by using VirtIO. When we run Qprofiler, we get the following output:

64 SYSTHREADSWITCH at Process.pas:1244 

35 WRITE_PORTB at Arch.pas:459 

33 SCHEDULING at Process.pas:1134 

24 GETGSOFFSET at Arch.pas:429

The output shows that the guest spends most of its time switching context. Note that Qprofiler only samples core #0. In particular, we see in the video that the time is mostly spent storing the registers. During a context switch, the registers are first stored and then, when the thread is switched back in, they are restored. This requires pushing all registers onto the stack and later popping all of them.

The current source code is a bit conservative and stores/loads all registers. However, we only need to store/restore the registers that the x86-64 ABI says a callee must not clobber. These are the non-volatile registers. After removing the storing/restoring of the remaining, volatile registers, we profile again and get the following output:

58 SCHEDULING at Process.pas:1141

47 WRITE_PORTB at Arch.pas:460

10 RECVFROM at VirtIOBus.pas:166

8 SYSTHREADSWITCH at Process.pas:1228

We can see that a different function now dominates the running time. We have effectively reduced the duration of the SysThreadSwitch function by removing the unnecessary storing and restoring of registers.
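As a rough illustration of how much work is saved, the sketch below compares a frame that holds every general-purpose register with one that holds only the non-volatile ones. It assumes the System V AMD64 calling convention, which is an assumption of this sketch; under the Microsoft x64 convention the non-volatile set also includes rdi, rsi and xmm6-xmm15. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame pushed by a "save everything" switch: the 15 general-purpose
 * registers other than rsp (rsp is handled by switching stacks). */
struct full_frame {
    uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
};

/* Frame that keeps only the registers a callee must preserve under the
 * System V AMD64 ABI (assumed here). */
struct callee_saved_frame {
    uint64_t rbx, rbp, r12, r13, r14, r15;
};

/* Number of 64-bit registers stored in a frame of the given size. */
size_t regs_in_frame(size_t frame_bytes)
{
    return frame_bytes / sizeof(uint64_t);
}

/* Pushes (and matching pops) avoided per switch by the smaller frame. */
size_t pushes_avoided(void)
{
    return regs_in_frame(sizeof(struct full_frame)) -
           regs_in_frame(sizeof(struct callee_saved_frame));
}
```

Under these assumptions the frame shrinks from 120 to 48 bytes, so each switch avoids 9 pushes and 9 pops.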

The whole video is at https://youtu.be/DXmjZFIB_N0, and you can give Qprofiler a try by cloning it from https://github.com/torokernel/qprofiler

Thursday, August 18, 2022

Recent improvement in accessing Per-CPU variables

In Toro, all kernel data structures are per-CPU variables. This is because the instance of the kernel on each core is independent of the others. This also has the benefit that access to kernel data structures is lock-free, which avoids spin-locks and their contention.

For example, each core has an entry in the array CPU[] that contains information about the threads that are running in that core. Also each core has the DedicateFilesystem[] array with information about the filesystems that a core is allowed to access. 

In general, when a core wants to access a per-CPU variable, it first gets the core id by calling GetApicId(), which returns the id of the lapic. Then, it uses that value as an index into a per-CPU array, e.g., CPU[id]. The whole operation requires two steps. Also, getting the lapic id requires an access to the memory-mapped region of the lapic.

In recent work, we improved the access to per-CPU variables by using the %gs register of each core. This technique has been used in Linux for a long time (see https://lwn.net/Articles/198184/). The access to a per-CPU variable can be done in a single instruction if we use the %gs register to keep a pointer to an array of the per-CPU variables. Then, we only need an offset to find the entry of the variable that we are looking for. This is faster than calling GetApicId() and then using the result as an index. To get the value of a per-CPU variable in the %rax register, the function becomes a single assembler instruction:

mov %gs:offset, %rax

This is implemented in Toro by relying on a single GDT that is shared among all the cores. At boot time, we reserve a number of descriptors that are used for the %gs descriptor of each core. Each core loads into %gs its own descriptor, which it selects by using its lapic id. Through that descriptor, %gs points to an array of pointers, one for each per-CPU variable, e.g., CPU, CurrentThread, CoreId, etc. Each variable is identified by a different offset, i.e., a different entry in the table.
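The difference between the two access paths can be sketched in plain C. This is only a user-space simulation: the %gs base is modelled with an ordinary pointer, the lapic read with a variable, and all names (percpu_table, gs_base, run_on_core, the slot names) are hypothetical rather than taken from Toro.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CORES 4

/* Offsets (table slots) of each per-CPU variable; hypothetical names. */
enum percpu_slot { PERCPU_CORE_ID, PERCPU_CURRENT_THREAD, PERCPU_NR_SLOTS };

/* One table of per-CPU values per core. In the kernel, the %gs register of
 * each core would point at that core's own table. */
uintptr_t percpu_table[MAX_CORES][PERCPU_NR_SLOTS];

/* Stand-in for the %gs base of the core that is currently running. */
uintptr_t *gs_base;

/* Stand-in for the lapic id register, which is memory-mapped in hardware. */
unsigned simulated_lapic_id;

unsigned get_apic_id(void)
{
    return simulated_lapic_id;
}

/* Old path: read the lapic id, then index the per-CPU array (two steps). */
uintptr_t read_percpu_indexed(enum percpu_slot slot)
{
    unsigned id = get_apic_id();
    return percpu_table[id][slot];
}

/* New path: a single load relative to the per-core base. */
uintptr_t read_percpu_gs(enum percpu_slot slot)
{
    return gs_base[slot];
}

/* Boot-time setup of the table that belongs to one core. */
void percpu_init(unsigned core)
{
    percpu_table[core][PERCPU_CORE_ID] = core;
    percpu_table[core][PERCPU_CURRENT_THREAD] = 0;
}

/* Pretend that execution moves to the given core. */
void run_on_core(unsigned core)
{
    simulated_lapic_id = core;
    gs_base = percpu_table[core];
}
```

Both paths return the same value; the second one simply skips the lapic read and the index computation.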

But how fast is this improvement? I compared GetApicId() with GetCoreId(). The former relies on the lapic whereas the latter relies on the per-CPU variable. I measured a 10x improvement when using the per-CPU variable. This measures only a single invocation of the function. The function is heavily used in the code, so I expect a bigger overall improvement.

Saturday, July 30, 2022

Implementing MPI_REDUCE in Toro unikernel

This post presents the implementation of MPI_REDUCE in the Toro unikernel. This is an MPI operation that applies a reduction operation over the vectors contributed by all tasks. In this example, each task has a vector of 64 elements that is initialised with the rank of the task, i.e., the id of the core on which the task executes. Each task sends this vector to the root core for reduction by using the MPI_SUM operation. The root core collects all vectors and sums them element by element. This results in a new vector of 64 elements. For example, if each task's vector has three elements that are all 1, i.e., [1, 1, 1], and we have 3 cores, the resulting vector is [3, 3, 3].
In this example, VirtIO is used to send the data among the cores. Each core has its own queue to exchange data with the other cores in the system. During the reduction, each core sends its data to the root core. The exchange happens without the use of lock operations. 
Let’s see how the example works. 

void main(){
    int rank, i;
    int r[VECTOR_LEN];
    int s[VECTOR_LEN];
    rank = GetRank();
    for (i=0; i < VECTOR_LEN; i++){
        r[i] = rank;
    }
    printf("hello from core", rank);

First, the program initialises the vector r[] by using the rank. The rank is the id of the core on which the program is running. Note that an instance of this program executes on each core in the system. Toro only supports threads and the memory-space is flat, i.e., cores share the page-directory.

    // reduce an array of size VECTOR_LEN
    // by using the MPI_SUM operation
    Mpi_Reduce(r, s, VECTOR_LEN, MPI_SUM, root);    

Then, the program uses the Mpi_Reduce() function with the MPI_SUM operation over the r[] vector. The root core collects all vectors and applies the MPI_SUM operation. The resulting vector is stored in s[]. This vector can only be read by the root core. Mpi_Reduce() relies on VirtIO to send data to the root core.

  if (rank == root){
     for (i=0;  i < VECTOR_LEN; i++){
       // Sum = ((N - 1) * N) / 2
       if (s[i] != (((GetCores()-1) * GetCores()) / 2)){
           printf("failed!, core:", rank);
           // stop at the first mismatch so that i != VECTOR_LEN below
           break;
       }
     }
     if (i == VECTOR_LEN){
        printf("success! value:", s[i-1]);
     }
  }
}

Finally, the root core checks that the result is the expected one. In this case, since each core sends a vector filled with its rank, every element of the resulting vector is simply the sum of all the ranks. I tried this example with 5 cores and the result is the following video, enjoy!
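The arithmetic behind that check can be reproduced on a single machine. The sketch below is not Toro's Mpi_Reduce(): it simulates the root's element-wise sum in a loop, and the helper names (reduce_sum, simulate_reduce, expected_sum) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_LEN 64

/* Element-wise sum of one contribution into the accumulator. */
void reduce_sum(int *acc, const int *contrib, size_t len)
{
    for (size_t i = 0; i < len; i++)
        acc[i] += contrib[i];
}

/* Simulate the reduction at the root for `cores` tasks, where the task with
 * rank k contributes a vector filled with k. The result is written to s. */
void simulate_reduce(int *s, size_t len, int cores)
{
    int r[VECTOR_LEN];
    assert(len <= VECTOR_LEN);
    for (size_t i = 0; i < len; i++)
        s[i] = 0;
    for (int rank = 0; rank < cores; rank++) {
        for (size_t i = 0; i < len; i++)
            r[i] = rank;
        reduce_sum(s, r, len);
    }
}

/* Expected value of every element: 0 + 1 + ... + (N-1) = ((N-1) * N) / 2. */
int expected_sum(int cores)
{
    return ((cores - 1) * cores) / 2;
}
```

With 5 cores, every element of the result is 0 + 1 + 2 + 3 + 4 = 10, which matches ((5 - 1) * 5) / 2.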

Sunday, July 03, 2022

Using VirtIO for inter-core communication

This article presents a new device named VirtIOBus that allows cores to communicate by relying on VirtIO. The design is simple: each core has one reception queue for every other core in the system. For example, in a three-core system (N = 3), each core has N-1 = 2 reception queues. When core #0 wants to send a packet to core #1, core #0 gets a descriptor from the available ring of the corresponding reception virtqueue of core #1, fills the buffer, and then enqueues the descriptor in the used ring. To notify core #1, core #0 can either trigger an inter-core interrupt or simply continue until core #1 gets the buffer. In the current implementation, interrupts are disabled, and core #1 just has to poll the used rings to get new buffers. The enqueue and dequeue operations do not require atomic operations or any lock mechanism.
The API is built of two functions:
  • SendTo(core, buf)
  • RecvFrom(core, buf)
When a core wants to send data, it just has to set the destination core and the buffer from which the content is copied. To get a new buffer, the core invokes RecvFrom, which returns the oldest buffer in the queue. If the queue is empty, the function blocks until a new buffer arrives.
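The reason no lock is needed is that each queue has exactly one sender and one receiver. The sketch below shows that idea with a simplified single-producer/single-consumer ring. It is not the VirtIO available/used ring layout and not Toro's code: the names and sizes are hypothetical, and it uses C11 acquire/release loads and stores only to express the ordering portably (on x86-64 these compile to ordinary moves, without lock-prefixed instructions).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 8    /* must be a power of two */
#define BUF_SIZE   32

/* One queue per (sender, receiver) pair: one producer and one consumer. */
struct spsc_ring {
    _Atomic unsigned head;   /* next slot to write; only the sender updates it */
    _Atomic unsigned tail;   /* next slot to read; only the receiver updates it */
    char slots[RING_SLOTS][BUF_SIZE];
};

/* Copy buf into the ring. Returns 0 on success, -1 if the ring is full
 * or the buffer does not fit in a slot. */
int ring_send(struct spsc_ring *q, const void *buf, size_t len)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS || len > BUF_SIZE)
        return -1;
    memcpy(q->slots[head % RING_SLOTS], buf, len);
    /* Publish the slot only after its content has been written. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 0;
}

/* Copy the oldest buffer into out. Returns 0 on success, -1 if the ring is
 * empty (a blocking receive would keep polling instead of returning). */
int ring_try_recv(struct spsc_ring *q, void *out, size_t len)
{
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail || len > BUF_SIZE)
        return -1;
    memcpy(out, q->slots[tail % RING_SLOTS], len);
    /* Hand the slot back to the sender. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 0;
}
```

A blocking receive in the style of RecvFrom could be built on top of this by calling ring_try_recv in a loop until it succeeds, which corresponds to the polling described above.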
You can see the use of these functions in the example here. In this example, data is broadcast from core #0 to all cores in a three-core system. Core #0 sends a “ping” to core #1 and core #2, and then core #1 and core #2 each respond with a “pong”. The output is as follows:

VirtIOBus: Core[1]->Core[0] queue has been initiated
VirtIOBus: Core[2]->Core[0] queue has been initiated
VirtIOBus: Core[0]->Core[1] queue has been initiated
VirtIOBus: Core[2]->Core[1] queue has been initiated
VirtIOBus: Core[0]->Core[2] queue has been initiated
VirtIOBus: Core[1]->Core[2] queue has been initiated
Core[0] -> Core[1]: ping
Core[1] -> Core[0]: pong
Core[0] -> Core[2]: ping
Core[2] -> Core[0]: pong

In the future, the current mechanism will be used for the implementation of MPI functions like MPI_Send() and MPI_Recv(). The motivation is to port MPI applications to the Toro unikernel, so stay tuned!

P.S.: You can find a demo here.

Thursday, March 31, 2022

Notes about a hypervisor-agnostic implementation of VirtIO

This post presents some very informal notes about the requirements for a virtio backend that would be independent of the hypervisor/OS. Some of these requirements are:

* The library shall require an API that the hypervisor has to provide. 

* The library shall be flexible enough to cover different use cases. For example, the library shall be able to work in a type-II hypervisor but also in a type-I hypervisor.

The use-cases are: 

1. The backend running as a user-space process like in KVM and QEMU, 

2. The backend running as a user-space process in Dom0 like in Xen, 

3. The backend running as a VM, e.g., JailHouse. 

I found that other works have already highlighted the need for a virtio backend that would be independent of the VMM and the hypervisor, e.g., the Stratos project. In such a backend, the hypervisor would be abstracted away by a common interface, i.e., a driver. This would impose some requirements on the hypervisor and on the interface that the hypervisor needs to expose to be able to plug in such a backend. Also, the way that virtio devices are implemented may vary. For example, a device could run in a user process, in a thread, or in a VM.
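One way to picture such a common interface is a table of callbacks that each hypervisor fills in. The sketch below is purely illustrative: every name in it (hv_ops, map_guest_memory, register_mmio_trap, notify_frontend, the mock_* helpers) is hypothetical and does not come from Stratos, QEMU or any existing library. The mock hypervisor exists only to exercise the interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback invoked when the frontend writes to a trapped region. */
typedef void (*mmio_handler)(uint64_t offset, uint32_t value, void *opaque);

/* Hypothetical interface that a hypervisor (or VMM) implements for the backend. */
struct hv_ops {
    /* Make a range of frontend memory accessible to the backend. */
    void *(*map_guest_memory)(uint64_t guest_addr, size_t len);
    /* Ask to be called back on writes to an MMIO range. */
    int (*register_mmio_trap)(uint64_t base, size_t len, mmio_handler h, void *opaque);
    /* Notify the frontend, e.g., by injecting an interrupt. */
    void (*notify_frontend)(unsigned irq);
};

/* Hypervisor-independent state of one virtio device. */
struct virtio_backend {
    const struct hv_ops *hv;
    uint32_t last_written;   /* last value the frontend wrote to the region */
    unsigned irq;
};

/* MMIO write handler: record the value and notify the frontend. */
void backend_on_mmio_write(uint64_t offset, uint32_t value, void *opaque)
{
    struct virtio_backend *be = opaque;
    (void)offset;
    be->last_written = value;
    be->hv->notify_frontend(be->irq);
}

/* Plug the backend into whatever hypervisor provides `hv`. */
int backend_init(struct virtio_backend *be, const struct hv_ops *hv,
                 uint64_t mmio_base, size_t mmio_len, unsigned irq)
{
    be->hv = hv;
    be->last_written = 0;
    be->irq = irq;
    return hv->register_mmio_trap(mmio_base, mmio_len, backend_on_mmio_write, be);
}

/* Mock hypervisor: remembers the registered handler and the last irq sent. */
mmio_handler mock_handler;
void *mock_opaque;
unsigned mock_last_irq;

void *mock_map(uint64_t guest_addr, size_t len)
{
    (void)guest_addr;
    (void)len;
    return NULL;   /* the mock has no guest memory to map */
}

int mock_trap(uint64_t base, size_t len, mmio_handler h, void *opaque)
{
    (void)base;
    (void)len;
    mock_handler = h;
    mock_opaque = opaque;
    return 0;
}

void mock_notify(unsigned irq)
{
    mock_last_irq = irq;
}

const struct hv_ops mock_hv = { mock_map, mock_trap, mock_notify };

/* Simulate the frontend writing to the trapped region. */
void mock_guest_write(uint64_t offset, uint32_t value)
{
    mock_handler(offset, value, mock_opaque);
}
```

With this split, the device logic only talks to hv_ops, so the same backend code could in principle sit behind a user-space VMM, a Dom0 process or another VM, provided each one supplies its own implementation of the callbacks.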

If we deal with a hypervisor that does not provide such an interface, we still need some sort of mechanism for the backend and the frontend to communicate. This may require some sort of synchronization mechanism, maybe by extending the mmio layout.

I have found three cases: 

1. Type-II hypervisor (KVM), in which the VMM runs as a user application and the backend runs as part of the VMM. The backend only needs to register some callbacks to trap accesses to IO regions.

2. Type-I hypervisor like Xen (HVM). This is the same case, although the VMM runs in a different VM.

3. Type-I**, in which there is no such VM-exit mechanism and the backend can't see all of the frontend's memory. Here, more requirements fall on the hypervisor. One possible solution is to share a region of memory between the frontend and the backend, and let the frontend allocate and manage that memory, which would be used for io-buffers.

It could be interesting to run the backend as a VM. For that use-case, the VMs need some way to communicate. One possible way would be to share a region of memory between the VMs and, in addition, implement some sort of ring bell mechanism between the VMs.

In my experiments, I added some extra bits in the mmio layout that allow setting the device and the driver as blocked or resumed. These bits are used for the device-status initialization only. After that, notifications for the vrings are done by using inter-VM irqs.

In my PoC, each frontend gets a region of memory in which the virtio layout and the io-buffers are mapped. This is defined statically when the guest is created. The memory for io-buffers is also allocated at that time.  

If BE and FE are implemented as VMs, VM exits play no role because we would need the hypervisor in that case. Also, it would be nice to be able to specify the BE as a driver for Linux.