A dedicated unikernel for microservices

Friday, July 05, 2019

QProfiler: A profiler for guests in QEMU/KVM

In this article, I am going to talk about QProfiler, a tool to profile a guest running on top of QEMU/KVM. The source code is hosted at https://github.com/torokernel/qprofiler. I started this project because I was interested in profiling Toro running as a guest on QEMU/KVM. Roughly speaking, profiling means counting how often each function is executed, which gives an idea of where the execution time is spent. I am not an expert in this area but I will sum up my research. There are two mechanisms to profile:
   1) by counting how often each function is invoked.
   2) by sampling a process and counting which function is executed in that time.
Mechanism 1) is intrusive since the code must be modified: the executable must be compiled with the "-pg" option, which makes each function invoke mcount() and thus counts the number of times each function is executed. The main benefit of mechanism 2) is that it can profile a process without any modification. However, the result may be inaccurate and is limited by the maximum sampling frequency. In my case, I decided to use mechanism 2) by implementing a script that samples a VM through the QEMU Monitor Protocol (QMP). The script reads the %rip and %rbp registers, which makes it possible to identify the current function and its caller. It would also be possible to get a full backtrace, but that remains future work. The only change to the code is to compile with the “-g” option so that debugging information is added to the binary. Then, by using addr2line, it is possible to get the name of a function from an address. The script accepts as parameters the duration of the sampling and the sampling period. For example, if the script samples for 10 seconds with a period of 1 second, we end up with 10 samples. A sketch of this idea is shown below.
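
To make the idea concrete, here is a minimal Python sketch of such a sampler, assuming QEMU was started with a QMP socket (e.g., -qmp unix:/tmp/qmp.sock,server,nowait); the socket path, binary name and parsing are illustrative and not the actual QProfiler code:

  import json, re, socket, subprocess, time
  from collections import Counter

  QMP_SOCK = "/tmp/qmp.sock"   # assumed QMP socket path
  KERNEL_ELF = "toro.elf"      # assumed binary compiled with -g
  DURATION, PERIOD = 10, 1     # sample for 10 s, one sample per second

  def qmp_command(f, cmd, args=None):
      # Send a QMP command and wait for its "return", skipping async events
      msg = {"execute": cmd}
      if args:
          msg["arguments"] = args
      f.write(json.dumps(msg) + "\n")
      f.flush()
      while True:
          reply = json.loads(f.readline())
          if "return" in reply:
              return reply["return"]

  s = socket.socket(socket.AF_UNIX)
  s.connect(QMP_SOCK)
  f = s.makefile("rw")
  f.readline()                          # consume the QMP greeting
  qmp_command(f, "qmp_capabilities")    # enter command mode

  samples = Counter()
  for _ in range(DURATION // PERIOD):
      # "info registers" is an HMP command exposed through QMP
      regs = qmp_command(f, "human-monitor-command",
                         {"command-line": "info registers"})
      rip = re.search(r"RIP=([0-9a-f]+)", regs).group(1)
      # Resolve the address to a function name with addr2line
      func = subprocess.check_output(
          ["addr2line", "-f", "-e", KERNEL_ELF, rip]).split()[0]
      samples[func.decode()] += 1
      time.sleep(PERIOD)

  for func, count in samples.most_common():
      print(f"{func}: {count} samples")
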
Using QProfiler on StaticWebServer shows that 96% of the time the guest is executing Move(). This means that most of the time the application is copying data from one block to another. For example, this happens when a new packet arrives and its content is moved to the user’s buffer. In other words, the networking stack is not well optimized and there are too many copies between the kernel’s buffers and the user’s buffers.
There are still open questions regarding the use of this mechanism:
  - How fast can the script sample?
  - How does QMP actually work? Does it affect the guest's execution?
  - Would it be more accurate to count the number of times each function is invoked?

Tuesday, July 02, 2019

Speeding Up the Booting Time of a Toro Appliance

Toro is a unikernel written in Freepascal that compiles together with the user application and makes it possible to build appliances that can be executed in any modern hypervisor or cloud provider. Some use cases require appliances to boot fast, e.g., deploying microservices on demand or rebooting after a crash. In these use cases, appliances must be created and initialized quickly enough to keep the quality of service. In this article, we present the work done to speed up the booting time of a Toro appliance. The article begins by explaining how Toro boots up and how we improved the current mechanism by using a multiboot kernel. Then, we present three approaches, namely QBOOT, NEMU and Firecracker, that optimize some aspects of the Virtual Machine Monitor (VMM) to speed up the booting time of an appliance.

What do we call “booting time”?

We call booting time the time until KernelMain() is invoked. During that time, the following steps take place:
    1. The VMM is initialized, e.g., device model initialization and BIOS. This happens on the host side.
    2. The bootloader starts to execute, e.g., CPUs are initialized, paging is enabled, the kernel is loaded into memory.
    3. The kernel starts to execute, i.e., KernelMain() is invoked.
In this article, the booting time accounts for everything before point 3, i.e., steps 1 and 2.

How does the current bootloader work in Toro?

In Toro, the user application is a normal Pascal program in which the programmer decides which units to use. The user application and the kernel are compiled together, resulting in an ELF64 binary. From this binary, the building process generates an image that can be used to boot up a VM or a bare-metal host. The generation of the image is based on Build (see https://github.com/torokernel/torokernel/blob/master/builder/build.pas). This application takes an executable and a bootloader, and combines them into a RAW image. The source code of the bootloader can be found at https://github.com/torokernel/torokernel/blob/master/boot/x86_64.s. It is a simple bootloader that:
    • enables long mode and paging
    • wakes up all the cores
    • loads the kernel into memory
    • jumps to the kernel main.     
The benefit of using a RAW image is that the bootloader is simple: it just reads continuous blocks from the disk and puts them into memory, so no filesystem is needed. It also makes it possible to boot the image in most hypervisors without extra work. In addition, the RAW image can be converted to the VMDK format to launch a VM in VirtualBox or Hyper-V. The main drawback of using a RAW image is that it is big, since it is a copy of the kernel as laid out in memory. This increases the time to load the kernel into memory. Typically, an image is 4 MB and takes about 1.5 s to boot up. A sketch of how the bootloader and the kernel are combined into such an image is shown below.
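
As a rough sketch of the idea behind Build (the real tool is written in Pascal and writes the kernel as laid out in memory), the following Python snippet pads the bootloader to a sector boundary and appends the kernel so that the bootloader can read it as continuous blocks; the file names and sector size are assumptions:

  SECTOR = 512

  def build_raw_image(bootloader_path, kernel_path, image_path):
      with open(bootloader_path, "rb") as f:
          boot = f.read()
      with open(kernel_path, "rb") as f:
          kernel = f.read()
      # Pad the bootloader to a sector boundary so the kernel starts at a
      # known block; the bootloader then reads these continuous blocks.
      pad = (-len(boot)) % SECTOR
      with open(image_path, "wb") as f:
          f.write(boot + b"\x00" * pad + kernel)

  build_raw_image("x86_64.bin", "toro.elf", "toro.img")
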

The multiboot approach

One way to reduce the size of the binary, and thus the booting time, is to leverage the Multiboot specification to generate a multiboot kernel and then use an existing bootloader to boot it up. To do this, the binary must be linked in a specific way: it needs a multiboot header that allows the multiboot bootloader to find the different sections and load them into memory. The user application and the kernel are still compiled together, but the result is a multiboot binary; a sketch of such a header is shown below.
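
For illustration, the following sketch packs a Multiboot v1 header with the address fields enabled (flag bit 16); the addresses are made up, since the real values come from the linker script:

  import struct

  MAGIC = 0x1BADB002          # Multiboot v1 magic
  FLAGS = 0x00010000          # bit 16: header_addr..entry_addr are valid
  CHECKSUM = -(MAGIC + FLAGS) & 0xFFFFFFFF   # magic+flags+checksum == 0

  header = struct.pack(
      "<8I",
      MAGIC, FLAGS, CHECKSUM,
      0x00100000,             # header_addr: where this header is loaded
      0x00100000,             # load_addr: start of the .text segment
      0x00200000,             # load_end_addr: end of .data
      0x00280000,             # bss_end_addr: end of .bss
      0x00100020,             # entry_addr: where the loader jumps to
  )
  print(header.hex())
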

VMMs like QEMU have the option to boot directly from a multiboot kernel. However, QEMU implements an old version of the Multiboot specification and only supports 32-bit kernels. Some magic is therefore necessary to embed a 64-bit kernel into a 32-bit one. This magic is done by the script at https://github.com/torokernel/torokernel/blob/master/builder/BuildMultibootKernel.sh

The following figure illustrates what happens when the parameter “-kernel” is passed to QEMU. The kernel binary has to contain an extra section named MultibootHeader. QEMU uses that section to get information during boot, e.g., the starting address of the bootloader. QEMU then loads the .text and .data sections into memory and jumps to the starting address of the bootloader. In the figure, MultibootLoader is actually part of .text but we split it for the sake of simplicity. When the bootloader starts to execute, the CPU is already in protected mode and paging is enabled. Since these steps are already done, the code of the bootloader can be simplified, thus saving booting time: it just has to enable long mode and wake up the cores. A sketch of how a loader locates the multiboot header is shown below.
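
As an illustration of how a loader locates that header, the Multiboot specification requires it to be longword-aligned within the first 8192 bytes of the image; the following sketch performs such a scan (QEMU additionally parses the ELF sections, which we omit here):

  import struct

  MULTIBOOT_MAGIC = 0x1BADB002

  def find_multiboot_header(image_path):
      with open(image_path, "rb") as f:
          head = f.read(8192)
      # The header must be longword-aligned in the first 8192 bytes
      for offset in range(0, len(head) - 11, 4):
          magic, flags, checksum = struct.unpack_from("<3I", head, offset)
          if magic == MULTIBOOT_MAGIC and \
             (magic + flags + checksum) & 0xFFFFFFFF == 0:
              return offset
      return None

  print(find_multiboot_header("toro.bin"))
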



By using the multiboot approach, the kernel binary is reduced to 145 KB and the booting time drops to about 450 ms, an improvement by roughly a factor of 3. The main drawback of this approach is that we need a VMM that supports loading a multiboot kernel; otherwise, we need a bootloader that supports the Multiboot specification, like GRUB.

Playing with the Virtual Machine Monitor

In this section, we present three approaches that improve the booting time by optimizing the VMM. Roughly speaking, these approaches simplify some aspect of the VMM, e.g., the loading of the kernel, the device model and/or the BIOS. The following figure illustrates the possible components that make up a VMM. In this figure, the VMM is in charge of the device emulation and the BIOS. It communicates with the KVM driver, which can also provide in-kernel device emulation.



In the following, we briefly present each approach.

QBOOT
It is a minimal x86 firmware for QEMU to boot Linux (see http://github.com/bonzini/qboot). In the authors' words, it is “a couple hardware initialization runtimes written mostly from scratch but with good help from SeaBIOS source code”.

NEMU
It is based on QEMU and only supports x86-64 and aarch64. It proposes a reduced device model that focuses on non-emulated devices to reduce the VMM’s footprint and attack surface. It introduces a new machine type named “virt”, which is thinner and only boots from UEFI.

Firecracker
It is a lightweight VMM implemented in Rust, developed by Amazon Web Services to accelerate the speed and efficiency of services like AWS Lambda and AWS Fargate. The kernel binary must be an ELF64. When the kernel starts to execute, the CPU is already in long mode and the page tables are set up the way Linux expects, which greatly simplifies the bootloader. A sketch of how a kernel is booted through Firecracker's API is shown below.
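
As a sketch of that flow, the following snippet drives Firecracker's REST API over its Unix socket to configure and start a microVM; the socket path, kernel path and machine configuration are illustrative, not Toro's actual setup:

  import json, socket

  API_SOCK = "/tmp/firecracker.socket"   # assumed API socket path

  def api_put(path, body):
      # Firecracker exposes its API as HTTP over a Unix socket
      payload = json.dumps(body).encode()
      request = (f"PUT {path} HTTP/1.1\r\nHost: localhost\r\n"
                 f"Content-Type: application/json\r\n"
                 f"Content-Length: {len(payload)}\r\n\r\n").encode() + payload
      s = socket.socket(socket.AF_UNIX)
      s.connect(API_SOCK)
      s.sendall(request)
      print(s.recv(4096).decode())       # expect "204 No Content"
      s.close()

  # Point Firecracker at the ELF64 kernel; it is entered directly in long mode
  api_put("/boot-source", {"kernel_image_path": "toro.elf",
                           "boot_args": "console=ttyS0"})
  api_put("/machine-config", {"vcpu_count": 1, "mem_size_mib": 128})
  api_put("/actions", {"action_type": "InstanceStart"})
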

To evaluate these approaches, we measure the time it takes the kernel to start to execute, i.e., the time from when the VM is launched until KernelMain() is invoked; a sketch of such a measurement is shown below. To know more about this work, you can check issue #276 on GitHub. By using QBOOT in QEMU, Toro takes 135 ms to boot up. In the case of NEMU, it takes 95 ms. In the case of Firecracker, Toro takes only 17 ms to boot up. Note that a simple “echo ‘Hello World’” on the same machine takes about 2.62 ms to execute.
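
The exact measurement setup is described in issue #276; as a rough sketch of the idea under QEMU, one can timestamp the launch and wait for a marker on the serial console (the marker string and a serial print at KernelMain() are assumptions):

  import subprocess, time

  cmd = ["qemu-system-x86_64", "-kernel", "toro.bin", "-nographic"]

  start = time.monotonic()
  vm = subprocess.Popen(cmd, stdout=subprocess.PIPE)
  for line in vm.stdout:
      if b"KernelMain" in line:          # assumed marker printed at startup
          elapsed_ms = (time.monotonic() - start) * 1000
          print(f"booting time: {elapsed_ms:.1f} ms")
          break
  vm.kill()
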

Conclusions

We presented different approaches to speed up the booting time of a Toro appliance. Booting time is important when we want to launch appliances on demand or reboot an appliance after it has crashed. In the case of Toro, we showed that by using a multiboot kernel the size of the binary can be reduced from 4 MB to about 150 KB and the booting time from 1.5 s to about 0.5 s. From there, further improvements can be achieved by optimizing the VMM. Such improvements work on different components of the VMM, like the device model or the BIOS. For example, by using Firecracker, we are able to boot Toro in 17 ms.