A dedicated unikernel for microservices

Friday, December 08, 2017

Reducing CPU usage on VMs that run Toro

Over the last few days, I worked on reducing the CPU usage of Toro. I observed that VMs running Toro consume 100% of the CPU, which makes any solution based on Toro unusable in production. I identified four situations in which an idle loop is the issue:

  1. During a spin lock
  2. When there is no thread in the ready state
  3. When there are no threads at all
  4. When threads only poll a variable

Cases 1, 2 and 3 are in the kernel code. Case 4, however, arises when a user thread does idle work by polling a variable. The solution for this case is harder since the scheduler has to figure out that the thread is only polling. Intel proposes different mechanisms to reduce the impact of idle loops [1, 2]. In particular, I was interested in the use of the monitor/mwait instructions. However, I found they are not well supported across hypervisors, so I had to rely on the hlt (halt) and pause instructions instead. I want to highlight that hlt is a privileged instruction, so only ring 0 can use it. However, since in Toro both the kernel and the application run in ring 0, hlt can be used by either the kernel or the user. The following cases correspond to its use by the kernel.
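
As a rough illustration of the two instructions, here is a sketch in C with GCC inline assembly (Toro itself is written in FreePascal, so these wrapper names are mine, not Toro's):

/* PAUSE: hints to the CPU that this is a spin-wait loop so it can
   relax the pipeline; it can be executed at any privilege level. */
static inline void cpu_pause(void)
{
    __asm__ volatile ("pause");
}

/* HLT: stops the CPU until the next interrupt arrives. It is a
   privileged instruction, so it works here only because Toro runs
   both the kernel and the application in ring 0. */
static inline void cpu_halt(void)
{
    __asm__ volatile ("hlt");
}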

First, I tackled case 1 by introducing the pause instruction inside the loop. This relaxes the CPU while a thread is trying to get exclusive access to a resource. Cases 2 and 3 were improved by using hlt, which simply halts the CPU until the next interrupt. To tackle case 4, I proposed two APIs that let a thread tell the scheduler when it is polling a variable. When the scheduler figures out that all threads on a core are polling, it simply halts that core.
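
To make these fixes concrete, here is a minimal sketch in C that builds on the cpu_pause/cpu_halt wrappers above; the spin lock layout and the thread_poll_begin/thread_poll_end names are invented for illustration and are not Toro's actual APIs:

#include <stdatomic.h>
#include <stdbool.h>

/* Wrappers repeated from the sketch above. */
static inline void cpu_pause(void) { __asm__ volatile ("pause"); }
static inline void cpu_halt(void)  { __asm__ volatile ("hlt"); }

/* Case 1: a spin lock that relaxes the CPU with PAUSE while waiting. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        cpu_pause();   /* spin-wait hint instead of burning cycles */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Cases 2 and 3: when no thread is in the ready state (or no thread
   exists at all), halt the core; the next timer or device interrupt
   wakes it up again. */
static void scheduler_idle(void)
{
    cpu_halt();
}

/* Case 4: hypothetical API pair (names invented for this sketch) that
   a thread uses to tell the scheduler it is only polling a variable.
   Once every thread on a core is flagged as polling, the scheduler
   can halt that core exactly as in scheduler_idle(). */
static void thread_poll_begin(void) { /* mark current thread as polling */ }
static void thread_poll_end(void)   { /* mark it runnable for real work */ }

static _Atomic bool data_ready;

static void wait_for_data(void)
{
    thread_poll_begin();
    while (!atomic_load_explicit(&data_ready, memory_order_acquire))
        cpu_pause();
    thread_poll_end();
}

Note that the polling loop still executes pause on every iteration, so the CPU is relaxed even before the scheduler decides the whole core can be halted.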

I tested this on my bare-metal host at Scaleway (4 cores, 8 GB RAM, 2 GHz) running KVM with a VM running Toro. I also installed Monitorix to monitor the state of the host. To stress the system, I generated HTTP traffic and monitored the CPU usage. A few seconds after the stress stops, the CPU usage of the qemu process is only about 1%, and it goes up and down with the load. Running top, I get a pattern like this:

%CPU  %MEM  TIME+    COMMAND
 6.6   0.6  0:46.92  qemu-system-x86
63.8   0.6  0:48.84  qemu-system-x86 (stress)
99.7   0.6  0:51.84  qemu-system-x86 (stress)
45.5   0.6  0:53.21  qemu-system-x86 (stress)
 2.3   0.6  0:53.28  qemu-system-x86
 4.0   0.6  0:53.40  qemu-system-x86
 6.6   0.6  0:53.60  qemu-system-x86

This is not always the case: sometimes Toro takes longer to go idle, which may happen when a socket is not closed correctly and only ends by timeout. I still need to experiment more to measure how much reactivity the system loses while a core is halted. This recent work, however, seems very promising!

[1] https://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/IdleTalk.pdf
[2] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3, Sections 8.10.2 and 8.10.4
