A dedicated unikernel for microservices

Thursday, August 18, 2022

Recent improvement in accessing Per-CPU variables

In Toro, all kernel data structures are per-CPU variables. This is because the instance of the kernel running on each core is independent of the others. It also has the benefit that access to kernel data structures is lock-free, which avoids spin-locks and their contention.

For example, each core has an entry in the array CPU[] that contains information about the threads that are running on that core. Each core also has the DedicateFilesystem[] array with information about the filesystems that the core is allowed to access.

In general, when a core wants to access a per-CPU variable, it first gets its core id by calling GetApicId(), which returns the id of the lapic. It then uses that value as an index into a per-CPU array, e.g., CPU[id]. The whole operation takes two steps, and getting the lapic id requires an access to the memory-mapped region of the lapic.

In recent work, we improved the access to per-CPU variables by using the %gs register of each core. Linux has used this technique for a long time (see https://lwn.net/Articles/198184/). Access to a per-CPU variable can be done in a single instruction if the %gs register keeps a pointer to an array of the per-CPU variables. We then need only an offset to find the entry of the variable we are looking for. This is faster than calling GetApicId() and then using the result as an index. To get the value of a per-CPU variable into the %rax register, the function becomes a single assembler instruction:

mov %gs:offset, %rax

This is implemented in Toro by relying on a single GDT that is shared among all the cores. At boot time, we reserve a number of descriptors, one for the %gs descriptor of each core. Each core uses its lapic id to select its descriptor and loads it into %gs. As a result, %gs on each core points to a table of pointers, one per per-CPU variable, e.g., CPU, CurrentThread, CoreId, etc. Each variable occupies a different entry in the table and is therefore reached through a different offset.

But how much faster is it? I've compared GetApicId() with GetCoreId(). The former relies on the lapic whereas the latter relies on the per-CPU variable. I've measured a 10x improvement when using the per-CPU variable. This only tests a single invocation of the function. The function is heavily used throughout the code, so I expect a bigger overall improvement.
