Toro kernel: Translation of paper

Hello, here is the translation of the paper published in SL Ezine. Enjoy.

Modern systems, in which parallelism and multiprocessing have reached their limits, require programmers to think about new methodologies in order to maximize the performance of current hardware. Here I will discuss two issues for operating systems with a high degree of multiprocessing: access to the memory bus and access to shared resources.

The main memory-performance issue appears when many processors want to access memory concurrently through a shared memory bus. One way to reduce contention at the memory bus level is to give every processor its own integrated memory controller, as in AMD systems built on HyperTransport technology, so that each processor has faster access to a dedicated region of memory. Accessing a memory region that is not attached to the local CPU incurs a small penalty and requires one or two hops.

This arrangement follows a NUMA (Non-Uniform Memory Access) model. Such an architecture requires the kernel-level memory allocator of the operating system to be rewritten: when a thread is running on a specific CPU, the allocator is in charge of returning a free block of memory located in the dedicated region handled by that processor.

Optimizing memory access in this way brings another problem to the table: access to shared memory. One way to handle it is with atomic operations (on x86, instructions that carry the “lock” prefix). A locked instruction prevents other CPUs from accessing the same region of memory and, indirectly, can keep them off the memory bus altogether while the operation is performed. Protection can be built from many such “locks”, but system performance degrades rapidly, and the degradation becomes more obvious as more processors attempt to access the shared resource.

One solution is to dedicate resources to processors, where resources are block devices, network devices or filesystems. Each resource then needs protection only at the level of the local processor that handles it. This can be done in a nice and easy way with a cooperative threading scheduler, for example one that runs threads in round-robin order: a thread keeps the CPU until it yields voluntarily, so no other local thread can interrupt it in the middle of an operation on the resource.
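The idea of protecting a CPU-local resource through cooperative scheduling can be sketched in a few lines. This is only an illustrative simulation in Python, not code from any real kernel: the names `CooperativeScheduler`, `writer` and `run_demo` are hypothetical, threads are modelled as generators, and a plain list stands in for a block device dedicated to one CPU.

```python
from collections import deque


class CooperativeScheduler:
    """Round-robin scheduler for a single CPU; threads are Python generators."""

    def __init__(self):
        self.ready = deque()

    def spawn(self, thread):
        self.ready.append(thread)

    def run(self):
        # Run each thread until it yields, then move it to the back of the queue.
        while self.ready:
            thread = self.ready.popleft()
            try:
                next(thread)
            except StopIteration:
                continue  # thread finished, drop it
            self.ready.append(thread)


def writer(name, disk, blocks):
    """Hypothetical thread that appends records to a resource owned by this CPU."""
    for i in range(blocks):
        # No lock needed: nothing else on this CPU runs until we yield.
        disk.append(f"{name}:{i}")
        yield  # explicit scheduling point


def run_demo():
    disk = []  # stands in for a block device dedicated to this CPU
    sched = CooperativeScheduler()
    sched.spawn(writer("A", disk, 2))
    sched.spawn(writer("B", disk, 3))
    sched.run()
    return disk
```

Calling `run_demo()` interleaves the two threads only at their `yield` points, so each append to `disk` completes before another thread gets the CPU.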

At this level a new issue arises: communication between processors. To send a message from one CPU to another without atomic lock instructions, the communication system can be implemented as a matrix, where every processor has an array of slots, one for each of the other processors. When slot #N in the local CPU's array is set to “null”, it means that CPU #N has no message waiting for the local CPU. When CPU #N needs to send a message to CPU #1, it sets slot #N in CPU #1's array; CPU #1 then picks up the message and finally resets the slot to “null”. The communication system can be extended from this simple base.
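The slot matrix described above can be modelled as follows. This is a minimal single-threaded sketch under stated assumptions, not an actual kernel implementation: the class `MessageMatrix` and its methods are hypothetical names, `None` plays the role of “null”, and the rule that a sender must wait while its slot is still occupied is an assumption added so that a pending message is never overwritten.

```python
class MessageMatrix:
    """One slot per (receiver, sender) pair; None means no message pending."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        # slots[receiver][sender] is written by the sender and cleared
        # by the receiver.
        self.slots = [[None] * num_cpus for _ in range(num_cpus)]

    def send(self, sender, receiver, message):
        """Post a message; return False if the previous one is still pending."""
        if self.slots[receiver][sender] is not None:
            return False  # assumption: the sender retries later
        self.slots[receiver][sender] = message
        return True

    def receive(self, receiver):
        """Collect pending messages for receiver and reset each slot to None."""
        inbox = []
        for sender in range(self.num_cpus):
            message = self.slots[receiver][sender]
            if message is not None:
                inbox.append((sender, message))
                self.slots[receiver][sender] = None
        return inbox
```

For example, after `m = MessageMatrix(4)` and `m.send(3, 1, "ping")`, a call to `m.receive(1)` returns the message together with the sender's number and frees the slot. On real hardware the message contents would also have to be visible in memory before the slot is set, which this Python model does not capture.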

These techniques are most effective when the kernel of the operating system implements a NURA (Non-Uniform Resource Access) model.

One implementation of this model is the TORO operating system, available at http://toro.sourceforge.net

TORO is an innovative operating system that runs both the kernel and the user application server at the same ring level. The threads of the user application server are distributed evenly across all CPUs and run independently in parallel.

The memory model chosen is NUMA without paging.

During initialization, memory is divided proportionally among the processors installed in the system. When a thread needs memory, the memory allocator returns a free block from the region that belongs to the CPU on which the thread is running.
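A per-CPU allocator of this kind might look like the sketch below. It is an illustration only, not TORO's allocator: the class `NumaAllocator` is a hypothetical name, memory is assumed to be split into equal regions, and a simple bump pointer (with no way to free blocks) stands in for whatever free-block bookkeeping a real kernel would use.

```python
class NumaAllocator:
    """Splits a flat address range into one region per CPU."""

    def __init__(self, total_bytes, num_cpus):
        region = total_bytes // num_cpus  # assumption: equal share per CPU
        # Each CPU gets [start, limit) plus a pointer to its next free byte.
        self.regions = [
            {"next": cpu * region, "limit": (cpu + 1) * region}
            for cpu in range(num_cpus)
        ]

    def alloc(self, cpu, size):
        """Return the start address of a block in cpu's region, or None."""
        r = self.regions[cpu]
        if r["next"] + size > r["limit"]:
            return None  # local region exhausted; this sketch never borrows remotely
        addr = r["next"]
        r["next"] += size
        return addr
```

With `a = NumaAllocator(4096, 4)`, every address returned by `a.alloc(2, ...)` falls inside the third 1024-byte region, so a thread running on CPU #2 only ever receives memory local to that CPU. Because each region is touched only by its own CPU, the allocator needs no lock.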

In the same way, TORO can dedicate a resource, e.g. a FileSystem, to a specific processor. Only that CPU can then access that instance of the FileSystem.

The scheduler is based on the cooperative threading model. Thanks to this design, TORO can migrate threads between CPUs and send messages between threads without using any lock instruction.

TORO is well suited to dedicated systems that run high-performance application servers, such as web servers and database servers. The neat part for programmers is that the application server is compiled together with the kernel of the OS: when the system runs, the application server executes in kernel mode alongside the kernel, at the same ring level, with direct access to all resources and without the overhead of switching privilege levels, which maximizes the performance of the overall system.

Matias E. Vara

Saturday, October 20, 2007
