A dedicated unikernel for microservices

Monday, September 05, 2022

Profiling and Optimizing Toro unikernel with Qprofiler

Hello everyone, in a recent video (you can watch it at https://youtu.be/DXmjZFIB_N0), I presented how I use Qprofiler to profile and then optimize the Toro unikernel. In this post, I would like to elaborate on what this tool is and how to use it. You can read more about Qprofiler at http://torokerneleng.blogspot.com/2019/07/qprofiler-profiler-for-guests-in-qemukvm.html. Qprofiler is a simple tool that samples a guest to find out what code the guest is running. By doing this over a period of time, we get a list of pointers, each with a counter that indicates the number of times the guest was sampled at that position. This gives us a rough idea of where the guest spends most of its time and, if the performance is not good, lets us optimize the functions that dominate the overall running time. The sampling does not require any instrumentation of the code. Qprofiler only requires the symbols to map a pointer in memory to the name of a function.
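To make the idea of "pointers plus counters" concrete, here is a minimal sketch in Python of how sampled pointers could be turned into a per-function histogram. This is not Qprofiler's code: the symbol table, its addresses, the sample values and the helper names `resolve` and `histogram` are all invented for illustration. Only the function names are real Toro functions that show up in the profiles later in this post.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
# The addresses are invented; a real table would come from the guest's symbols.
SYMBOLS = [
    (0x1000, "GETGSOFFSET"),
    (0x1400, "WRITE_PORTB"),
    (0x2000, "SCHEDULING"),
    (0x2800, "SYSTHREADSWITCH"),
]
STARTS = [start for start, _ in SYMBOLS]


def resolve(pointer):
    """Return the function with the closest start address at or below pointer."""
    index = bisect_right(STARTS, pointer) - 1
    return SYMBOLS[index][1] if index >= 0 else "<unknown>"


def histogram(samples):
    """Count how many sampled pointers fall inside each function."""
    return Counter(resolve(pointer) for pointer in samples)


# Invented samples standing in for instruction pointers read from the guest.
samples = [0x2810, 0x2A00, 0x1404, 0x2004, 0x2900, 0x0500]
for name, hits in histogram(samples).most_common():
    print(hits, name)
```

The sketch treats every pointer above the last symbol as belonging to that symbol, which a real tool would have to bound with the function sizes.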

In the video, we use Qprofiler to profile the example that shows how cores communicate by using VirtIO. When we run Qprofiler, we get the following output:

64 SYSTHREADSWITCH at Process.pas:1244 

35 WRITE_PORTB at Arch.pas:459 

33 SCHEDULING at Process.pas:1134 

24 GETGSOFFSET at Arch.pas:429

The output shows that the guest spends most of its time switching context. Note that Qprofiler only samples core #0. In particular, we see in the video that the time is mostly spent storing the registers. During a context switch, the registers are first stored and then, when the thread is switched back in, they are restored. This requires pushing all the registers onto the stack and later popping all of them.

The current source code is a bit conservative and stores/restores all registers. However, we only need to store/restore the registers that the x86-64 ABI says a called function must not clobber. These are the non-volatile registers. The calling code already assumes that the other, volatile registers may be overwritten by a call, so they do not need to survive the switch. After removing the storing/restoring of the volatile registers, we profile again and get the following output:

58 SCHEDULING at Process.pas:1141

47 WRITE_PORTB at Arch.pas:460

10 RECVFROM at VirtIOBus.pas:166

8 SYSTHREADSWITCH at Process.pas:1228
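The two listings can also be compared numerically. The sketch below, in Python, parses both outputs as they appear in this post and computes the share of samples per function. The helper names `parse` and `shares` are made up for this example, and the percentages are relative to the lines listed here only, so they are an approximation if the full profile contains more entries.

```python
import re

# The two profiles exactly as listed in this post.
BEFORE = """\
64 SYSTHREADSWITCH at Process.pas:1244
35 WRITE_PORTB at Arch.pas:459
33 SCHEDULING at Process.pas:1134
24 GETGSOFFSET at Arch.pas:429
"""

AFTER = """\
58 SCHEDULING at Process.pas:1141
47 WRITE_PORTB at Arch.pas:460
10 RECVFROM at VirtIOBus.pas:166
8 SYSTHREADSWITCH at Process.pas:1228
"""

# One profile line: "<count> <FUNCTION> at <file>:<line>".
LINE = re.compile(r"^(\d+)\s+(\w+)\s+at\s+(\S+):(\d+)$")


def parse(text):
    """Return a dict mapping function name to its sample count."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name = match.group(2)
            counts[name] = counts.get(name, 0) + int(match.group(1))
    return counts


def shares(counts):
    """Return each function's fraction of the listed samples."""
    total = sum(counts.values())
    return {name: hits / total for name, hits in counts.items()}


before, after = shares(parse(BEFORE)), shares(parse(AFTER))
for name in sorted(set(before) | set(after)):
    print(f"{name:16} {before.get(name, 0):6.1%} -> {after.get(name, 0):6.1%}")
```

Working with shares rather than raw counts makes the two runs comparable even though they do not contain the same number of samples.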

We can see that a different function now dominates the running time. We have effectively reduced the time spent in the SysThreadSwitch function by removing the unnecessary storing and restoring of registers.
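To give a feeling for how many registers this kind of change can skip, here is a small sketch in Python that splits the general-purpose registers into non-volatile and volatile sets. It is only an illustration under stated assumptions: this post does not say which calling convention Toro follows or exactly which registers were removed, so the sketch lists the sets for both the System V AMD64 and the Microsoft x64 conventions. It also ignores floating-point and vector state, and it leaves out rsp, which a context switch typically handles on its own when it changes stacks. The helper names are made up.

```python
# General-purpose registers on x86-64, excluding rsp (assumed to be handled
# separately by the stack switch itself).
GENERAL_PURPOSE = [
    "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]

# Non-volatile (callee-saved) general-purpose registers per calling convention.
# Which of the two applies to Toro is an assumption left open here.
NON_VOLATILE = {
    "sysv": {"rbx", "rbp", "r12", "r13", "r14", "r15"},
    "win64": {"rbx", "rbp", "rsi", "rdi", "r12", "r13", "r14", "r15"},
}


def registers_to_save(convention):
    """Registers that must survive the switch under the given convention."""
    keep = NON_VOLATILE[convention]
    return [reg for reg in GENERAL_PURPOSE if reg in keep]


def registers_skipped(convention):
    """Volatile registers whose push/pop pair can be dropped."""
    keep = NON_VOLATILE[convention]
    return [reg for reg in GENERAL_PURPOSE if reg not in keep]


for convention in NON_VOLATILE:
    print(convention,
          "save:", registers_to_save(convention),
          "skip:", registers_skipped(convention))
```

Under these assumptions, a conservative switch that saves all 15 registers would drop to 6 (System V) or 8 (Microsoft x64) push/pop pairs.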

You can watch the whole video at https://youtu.be/DXmjZFIB_N0. You can also give Qprofiler a try by cloning it from https://github.com/torokernel/qprofiler.