A dedicated kernel for multi-threading applications.


Friday, September 21, 2012

Toro's article in Microelectronic Congress 2012

Here is the link to the article accepted by the Microelectronic Congress 2012; its title is "Memory module designing for embedded purpose". It is about the Toro Kernel memory module and covers the improvements made when an operating system runs in a dedicated environment such as an embedded system.
This year the congress will take place in the city of Rosario, Argentina. Unfortunately, this time I will not be able to present the article as I did last year.
As before, the article is in Spanish, in accordance with the congress's rules.



Matias E. Vara
www.torokernel.org

Monday, April 30, 2012

Toro's article in Microelectronic Congress

Here is the link to the article that I presented at the Microelectronic Congress 2011, Faculty of Engineering, University of La Plata, Argentina. The only thing is that it is in Spanish. It talks about the TORO project and shows the benchmarks that I made for my graduation project. Enjoy!

Matias E. Vara
www.torokernel.org

Monday, December 12, 2011

Fixed an important bug in the emigrate procedure

This is just a brief post about a recent change in the way Toro migrates threads.
Previously, when a thread running on core #0 wanted to create a new thread on core #1, the ThreadCreate function allocated the TThread structure, the TLS and the stack, and then migrated the whole TThread structure to core #1.
The main problem with this mechanism was that all the memory blocks were allocated on the parent core. This is a serious violation of the NUMA model: the TThread structure, the TLS and the stack are no longer local memory.
Thus, I rewrote the way threads are migrated. When a thread wants to create a new one remotely, Toro still invokes ThreadCreate, BUT it is executed on the remote core. Instead of migrating the TThread structure, Toro now migrates a set of arguments to be passed to ThreadCreate. When ThreadCreate finishes, the parent thread retrieves the TThreadID value, or nil if it fails.
As we can see, while a local thread is created immediately when ThreadCreate is invoked, a remote thread pays two steps of latency: one to migrate the parameters and another to retrieve the result.
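For illustration only, here is a minimal C sketch of this remote-creation scheme. All the names (creation_msg, send_to_core, wait_until_done, thread_create_local, handle_creation_msg) are hypothetical and are not Toro's real API; the sketch only shows the idea that the arguments travel to the target core, which then performs every allocation locally.

/* Hypothetical sketch: only the creation arguments migrate to the
 * target core; that core runs the local creation routine, so the
 * thread structure, TLS and stack are allocated from its own bank. */
#include <stddef.h>

typedef void (*thread_entry_t)(void *arg);
typedef struct thread *thread_id_t;          /* stand-in for TThreadID   */

struct creation_msg {                        /* arguments that migrate   */
    thread_entry_t entry;
    void          *arg;
    size_t         stack_size;
    thread_id_t    result;                   /* filled in by remote core */
    volatile int   done;
};

/* Assumed primitives (not implemented here): a per-core mailbox and a
 * purely local creation routine that runs on the target core. */
void send_to_core(int core, struct creation_msg *msg);
void wait_until_done(struct creation_msg *msg);
thread_id_t thread_create_local(thread_entry_t entry, void *arg,
                                size_t stack_size);

/* Parent side: pays two latency steps, one to migrate the parameters
 * and one to retrieve the result (the thread id, or NULL on failure). */
thread_id_t thread_create_remote(int target_core, thread_entry_t entry,
                                 void *arg, size_t stack_size)
{
    struct creation_msg msg = { entry, arg, stack_size, NULL, 0 };

    send_to_core(target_core, &msg);   /* step 1: arguments travel       */
    wait_until_done(&msg);             /* step 2: result travels back    */
    return msg.result;
}

/* Target side: runs when the mailbox message arrives; every allocation
 * it triggers is local to this core's memory bank. */
void handle_creation_msg(struct creation_msg *msg)
{
    msg->result = thread_create_local(msg->entry, msg->arg, msg->stack_size);
    msg->done = 1;
}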


Matias E. Vara
www.torokernel.org 
      

Tuesday, April 05, 2011

Memory organization in a multicore system: Conclusion.

From the programmer's point of view, access to local and remote memory is transparent, so a NUMA machine can run an SMP-style system without any problem. However, the OS must assign memory efficiently to take advantage of this technology.
In the SMP case, memory administration is easy to implement, while in NUMA it is not. The system has to assign memory depending on the CPU where the process is running: every CPU has its own memory bank, and system performance is poor if there are more remote accesses than local ones.
Windows has supported NUMA since the 2003 version and Linux since 2.6.x. Both provide syscalls to exploit NUMA.
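As a rough illustration of the Linux side, the following sketch uses libnuma (link with -lnuma) to keep a worker and its buffer on the same node. The node number and buffer size are arbitrary examples of mine, not values taken from TORO.

/* Sketch of NUMA-aware allocation with Linux's libnuma (gcc ... -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    int node = 0;                     /* pick a node for this worker      */
    size_t size = 4 * 1024 * 1024;    /* 4 MiB buffer                     */

    numa_run_on_node(node);                    /* run on CPUs of the node */
    char *buf = numa_alloc_onnode(size, node); /* memory on the same node */
    if (buf == NULL)
        return 1;

    memset(buf, 0, size);             /* touches only local memory        */
    numa_free(buf, size);
    return 0;
}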
The TORO kernel is optimized for NUMA, keeping modern processors in mind. Supporting NUMA properly means exploiting the dedicated per-processor buses, and in high performance environments these improvements must not be forgotten.


Matias E. Vara
www.torokernel.org

Saturday, January 15, 2011

Memory organization in a Multicore system

This paper is part of my final project, "Parallel Algorithm with TORO kernel", Electronic Engineering, Universidad Nacional de La Plata. In the coming months I will publish more papers about my final project. Enjoy!

Memory organization in a Multicore system

Currently, Uniform Memory Access is the common way to access memory (see SMP). In this kind of architecture, every processor can read every byte of memory, and the processors are independent. A shared bus is used and the processors compete for it, but only one can read or write at a time; in this environment only one processor can access a given byte at a given moment. For programmers, memory access is transparent.


In 1995 Intel released the Pentium Pro, its first processor designed for SMP, and its memory bus was called the Front Side Bus (FSB).

The FSB is a bidirectional bus; it is simple and cheap, and in theory it scales well.

Intel's next step was to partition the FSB into two independent buses, but cache coherency became a bottleneck.

In 2007, a bus per processor was implemented.

This kind of architecture is used by Intel's Atom, Celeron, Pentium and Core 2 processors.

In a system with many cores, the traffic through the FSB is heavy. The FSB does not scale and it has a limit of 16 processors per bus, so the FSB is a wall for the new multicore technology.

We can have a CPU that executes instructions quickly, but we waste time if we cannot fetch and decode them just as quickly. In the best case, we lose at least one extra cycle reading from memory.

Since 2001 the FSB has been progressively replaced with point-to-point interconnects such as HyperTransport or Intel QuickPath Interconnect. That changed the memory model to Non-Uniform Memory Access (NUMA).
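To see this non-uniformity from software, a small sketch like the following (using Linux's libnuma, which is my assumption, since the post does not mention it) prints the distance matrix between nodes: a node's distance to itself is reported as 10, while remote nodes show larger values.

/* Sketch: print the NUMA distance matrix with libnuma (link with -lnuma). */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    int max = numa_max_node();
    for (int i = 0; i <= max; i++) {
        for (int j = 0; j <= max; j++)
            printf("%4d", numa_distance(i, j));  /* 10 = local, more = remote */
        printf("\n");
    }
    return 0;
}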

Matias E. Vara
www.torokernel.org