A dedicated unikernel for microservices

Friday, July 05, 2019

QProfiler: A profiler for guests in QEMU/KVM

In this article, I am going to talk about QProfiler, a tool to profile a guest running on top of QEMU/KVM. The source code is hosted at https://github.com/torokernel/qprofiler. I started this project because I was interested in profiling Toro running as a guest on QEMU/KVM. Roughly speaking, profiling measures how often each function is executed, which gives an idea of where the execution time is spent. I am not an expert in this area, but I will sum up my research. There are two mechanisms to profile:
   1) by counting how often each function is invoked.
   2) by sampling a process periodically and recording which function is executing at each sample.
Mechanism 1) is intrusive because the code must be modified: the executable must be compiled with the "-pg" option, which makes each function invoke mcount() and thus counts the number of times each function is executed. The main benefit of mechanism 2) is that it can profile a process without any modification. However, the result may be inaccurate, and its resolution is limited by the maximum sampling frequency.
In my case, I decided to use mechanism 2) by implementing a script that samples a VM through the QEMU Machine Protocol (QMP). The script reads the %rip and %rbp registers, which makes it possible to identify the current function and the function that invoked it. It is also possible to get a full backtrace, but that remains a TODO. The only change to the build is compiling with the "-g" option to add debugging information to the binary. Then addr2line can be used to get the name of the function from an address.
The script accepts the duration of the sampling and the sampling interval as parameters. For example, if the script samples for 10 seconds at an interval of 1 second, we end up with 10 samples.
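To make the sampling loop concrete, here is a minimal sketch in Python. It is not the actual QProfiler code. It assumes that QEMU was started with a QMP UNIX socket (for example "-qmp unix:/tmp/qmp.sock,server,nowait"), that the guest is x86_64, and that the registers are read with the monitor command "info registers" sent through the QMP command human-monitor-command. The names parse_registers, qmp_command and sample_guest are hypothetical helpers chosen for this example.

```python
import json
import re
import socket
import time

# Matches 64-bit register dumps such as "RIP=0000000000401a2c" in the text
# printed by the monitor command "info registers" on an x86_64 guest.
REG_RE = re.compile(r"\b(RIP|RBP)=([0-9a-fA-F]{16})")


def parse_registers(text):
    """Return {'RIP': int, 'RBP': int} using the first match of each register."""
    regs = {}
    for name, value in REG_RE.findall(text):
        regs.setdefault(name, int(value, 16))
    return regs


def qmp_command(stream, name, **arguments):
    """Send one QMP command and return its 'return' payload, skipping events."""
    msg = {"execute": name}
    if arguments:
        msg["arguments"] = arguments
    stream.write(json.dumps(msg) + "\n")
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        reply = json.loads(line)
        if "error" in reply:
            raise RuntimeError(reply["error"])
        if "return" in reply:
            return reply["return"]


def sample_guest(path, duration, interval):
    """Collect (rip, rbp) pairs every `interval` seconds for `duration` seconds."""
    samples = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)  # assumed path of the QMP socket
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())            # greeting sent by QEMU
        qmp_command(stream, "qmp_capabilities")  # leave negotiation mode
        for _ in range(round(duration / interval)):
            dump = qmp_command(stream, "human-monitor-command",
                               **{"command-line": "info registers"})
            regs = parse_registers(dump)
            samples.append((regs["RIP"], regs["RBP"]))
            time.sleep(interval)
    return samples
```

With these assumptions, sample_guest("/tmp/qmp.sock", 10, 1) would issue 10 register reads, one per second. Each read is a full round trip to QEMU, so the real interval between samples is somewhat longer than the requested one.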
Using QProfiler on StaticWebServer shows that the guest spends 96% of the time executing Move(). This means that most of the time the application is copying data from one block to another. For example, this happens when a new packet arrives and its content is moved to the user's buffer. It suggests that the networking is not well optimized and that there are too many copies between the kernel's buffers and the user's buffers.
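A percentage like the one above comes from turning the sampled addresses into function names and counting them. The following sketch shows one way to do that in Python. It assumes the binary was built with "-g" and that addr2line is available on the host; function_shares and addr2line_resolver are hypothetical names for this example, not part of QProfiler.

```python
import subprocess
from collections import Counter


def addr2line_resolver(binary):
    """Build a resolver that asks addr2line for the function at an address."""
    def resolve(address):
        out = subprocess.run(
            ["addr2line", "-f", "-e", binary, hex(address)],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.splitlines()[0]  # with -f, the first line is the function name
    return resolve


def function_shares(addresses, resolve):
    """Return [(function, percent_of_samples)] sorted by decreasing share."""
    counts = Counter(resolve(addr) for addr in addresses)
    total = sum(counts.values())
    return [(name, 100.0 * n / total) for name, n in counts.most_common()]
```

The resolver is passed in as a parameter so that the counting can be checked without a real binary. Note that the shares are fractions of the samples, so they only approximate execution time when enough samples are taken.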
There are still open questions regarding the use of this mechanism:
  - How fast can the script sample?
  - How does QMP actually work? And does it affect the guest execution?
  - Would it be more accurate to count the number of times a function is invoked?
