A dedicated kernel for multi-threading applications.

Sunday, March 13, 2011

e1000 driver for TORO

I have just started the implementation of an e1000 driver, for the Intel Gigabit NIC and compatibles. I am using the Minix 3 source as a reference and QEMU as emulator (since version 0.12.0 it emulates the e1000). The detection procedure is complete, as you can see in the picture; the code is uploaded to SVN.

Cheers!
Matias E. Vara
www.torokernel.org

Sunday, March 06, 2011

Memory organization in a multicore system II

Continuation of the article Memory organization in a multicore system.

Non-uniform memory architectures

In a NUMA system each processor is assigned a memory region that it can access faster than the others, but every processor can still access every memory location. Message passing is used for remote memory accesses. The programmer sees a contiguous memory space; the hardware provides that abstraction.

The pioneer in NUMA technology was Sequent Computer Systems, which introduced NUMA in the '90s. The company was later acquired by IBM and the technology was implemented in Power processors.

In addition, IBM made its own NUMA implementation called SE (Shared Everything). This implementation is present in Power6 processors.

Intel's NUMA implementation is called QuickPath Interconnect. It allows memory to be shared between the processors and is transparent to the Operating System. Each processor has a point-to-point controller.

AMD's implementation uses fast links called "HyperTransport links". In this implementation each processor has its own memory controller and local memory. The processors are connected to each other through coherent HyperTransport links. Furthermore, each processor has a bidirectional non-coherent bus for I/O devices.

With a point-to-point controller, a processor can access its own memory region faster than the others, and there is significant latency when it tries to access remote memory. We therefore have two kinds of memory: local and remote.

Matias E. Vara
www.torokernel.org

Saturday, January 15, 2011

Memory organization in a Multicore system

This paper is part of my final project, "Parallel Algorithm with TORO kernel", Electronic Engineering, Universidad Nacional de La Plata. In the coming months I will publish more papers about my final project. Enjoy!

Memory organization in a Multicore system

Nowadays, "Uniform Memory Access" is the common way to access memory (see SMP). In this kind of architecture, every processor can read every byte of memory, and the processors are independent. A shared bus is used and the processors compete for it, but only one can read or write at a time; only one processor can access a given byte at a given moment. For programmers, memory access is transparent.


In 1995 Intel released the Pentium Pro, its first SMP-capable processor, and its memory bus was called the Front Side Bus.

It is a bidirectional bus; it is simple and very cheap, and in theory it scales well.

Intel's next step was to partition the FSB into two independent buses, but cache coherency became a bottleneck.

In 2007 a bus per processor was implemented.

This kind of architecture is used by Intel's Atom, Celeron, Pentium and Core 2.

In a system with many cores, the traffic through the FSB is heavy. The FSB doesn't scale: it has a limit of 16 processors per bus. So the FSB is a wall for the new multicore technology.

We can have a CPU that executes instructions fast, but we waste time if we cannot fetch and decode them fast. In the best case, we lose at least one extra cycle reading from memory.

Since 2001 the FSB has been replaced with point-to-point interconnects such as HyperTransport or Intel QuickPath Interconnect. That changed the memory model to non-uniform memory access.

Matias E. Vara
www.torokernel.org


Thursday, December 30, 2010

x86 ring protection on Toro

On the x86 architecture there are 4 ring levels. Ring 0 is the most privileged level and ring 3 is the least. In an OS, the kernel runs in ring 0 and user applications run in ring 3.

Ring 0 descriptors are used by the kernel and ring 3 descriptors by user code. The GDT supports up to 8192 descriptors, but the OS typically uses just 4: two for the kernel's text and data, and two for the user's text and data. With these descriptors the kernel can access all memory, for example 4 GB in 32-bit mode.

When the OS uses privilege levels, the processor has to check that every operation is valid, and these mechanisms add latency. In a multitasking OS protection is essential: it protects the kernel's code and data.

But what happens with a dedicated multithreaded application? It runs alone in the system and was written carefully; do we need protection in this case? Under that assumption, we can reduce the OS a lot.
For example, if we want to implement syscalls, we don't need traps. If the kernel and the user application are in the same ring, we can just use a "call" instruction to invoke a kernel function. Nowadays OSes use interrupts to implement syscalls, but these are too expensive. Don't forget that we are jumping from ring 3 to ring 0.

Similarly, when a user application is running and an interrupt happens, the processor has to jump from ring 3 to ring 0, which is expensive. In the general case a kernel procedure handles the interrupt. If the user application runs at the same level as the kernel, we don't spend time on these latencies.

On TORO, the kernel and the user application both run in ring 0. And because the kernel and the app are compiled together, syscalls are implemented easily: just a "call" instruction.

Matias E. Vara