A dedicated kernel for multi-threading applications.


Sunday, April 17, 2011

Memory Protection in a multicore environment

This post is part of the final paper of Matias Vara, “Paralelizacion de Algoritmos Numericos con TORO Kernel”, written to obtain the degree in Electronic Engineering from Universidad de La Plata. These theoretical documents help to understand the kernel design.

Introduction

When a kernel is designed for a multicore system, shared memory must be protected against concurrent write accesses. Memory protection increases kernel code complexity and decreases operating system performance. If two or more processors access the same data at the same time, mutual exclusion must be enforced to protect the shared data.

In a mono-processor multitasking system the scheduler periodically switches tasks, so the only risk is that the scheduler takes a task off the CPU while it is modifying shared data. Protection in this case is easy: disable the scheduler while the task is in a critical section, then enable it again afterwards.
In a multiprocessor system that solution cannot be applied. When tasks run in parallel, two or more tasks may execute the same line of code at the same time; hence, the scheduler state does not matter.

Resources protection

To protect resources in a multiprocessing system we need atomic operations. These are implemented in a single assembler instruction, although that instruction may take several clock cycles.

Atomic operations

In every processor, aligned word-sized read and write operations are atomic. This means that while the operation is executing, nobody else is using that memory area.
For certain kinds of operations the processor must lock the memory bus; for this purpose the #LOCK signal is provided, which is asserted during critical memory operations. While this signal is asserted, accesses from other processors are blocked.
Memory bus access is non-deterministic: all processors compete for the bus and the first one to request it gets it, so in a system with many processors the bus becomes a bottleneck.
But why do we need atomic operations? Suppose that we have to increment a counter; the Pascal source is:

counter := counter +1;

If this line is executed at the same time on several processors, the result will be incorrect unless the increment is atomic.
If two processors each increment the counter once starting from 0, the correct final value is 2; using atomic operations the processors access the variable one at a time and the result is correct. The time spent on synchronization increases with the number of processors. The most common atomic operations are "test-and-set" and "compare-and-swap".

Impact of atomic operations

In a system with a few processors, atomic operations are not a big deal and they are a fast solution to the shared memory problem; however, if we increase the number of processors they become a bottleneck.
On a computer with 8 cores at 1.45 GHz [1], the average instruction takes 0.24 ns while an atomic increment takes 42.09 ns, roughly 175 times more. The time wasted on locking becomes critical.


[1] Paul E. McKenney: RCU vs. Locking Performance on Different Types of CPUs.
http://www.rdrop.com/users/paulmck/RCU/LCA2004.02.13a.pdf, 2004

Thursday, December 30, 2010

x86 ring protection on Toro

In the x86 architecture there are 4 privilege levels, called rings. Ring 0 is the most privileged level and ring 3 the least. In a typical OS the kernel runs in ring 0 and user applications run in ring 3.

Ring 0 descriptors are used by the kernel and ring 3 descriptors by user code. The GDT supports up to 8192 descriptors, but the OS typically uses only 4: two for kernel text and data, and two for user text and data. With these descriptors the kernel can access all of memory, for example 4 GB in 32-bit mode.

When the OS uses privilege levels, the processor has to check that every operation is valid, and this mechanism adds latency. In a multitasking OS this protection is essential: it protects the kernel code and data.

But what happens with a dedicated multithreaded application? It runs alone in the system and it was written carefully; do we need protection in this case? If not, we can simplify the OS considerably.
For example, if we want to implement syscalls, we don't need traps. If the kernel and the user application are in the same ring, we can simply use a "call" instruction to reach a kernel function. Current OSes use interrupts to implement syscalls, but they are too expensive: don't forget that every syscall jumps from ring 3 to ring 0 and back.

Likewise, when a user application is running and an interrupt arrives, the processor has to jump from ring 3 to ring 0, which is also expensive; in the general case a kernel procedure handles the interrupt. If the user application runs at the same level as the kernel, we don't waste time on these transitions.

On TORO, the kernel and the user application both run in ring 0. And because the kernel and the application are compiled together, syscalls are easy to implement: they are just "call" instructions.

Matias E. Vara