A dedicated unikernel for microservices

Saturday, July 30, 2022

Implementing MPI_REDUCE in Toro unikernel

This post presents the implementation of MPI_REDUCE in Toro unikernel. This is an MPI operation that allows the application of a reduction operation over a vector. In this example, each task has a vector of 64 elements that is initialised with the rank of the task, i.e., the id of the core in which the task executes. Each task sends to the root core this vector for reduction by using the MPI_SUM operation. The root core collects all vectors and applies a matrix sum. This results in a new vector of 64 elements. For example, if the vector has three elements with all 1’: [1, 1, 1], and we have 3 cores, the resulting vector would be: [3, 3, 3]. 
In this example, VirtIO is used to send the data among the cores. Each core has its own queue to exchange data with the other cores in the system. During the reduction, each core sends its data to the root core. The exchange happens without the use of lock operations. 
Let’s see how the example works. 

void main(){
    int rank, i;
    int r[VECTOR_LEN];
    int s[VECTOR_LEN];
    rank = GetRank();
    for (i=0; i < VECTOR_LEN; i++){
        r[i] = rank;
    }
    printf("hello from core", rank);

First, the program initialises the vector r[] by using the rank. The rank is the core id in which the program is running. Note that an instance of this program executes for each core in the system. Toro only supports threads and the memory-space is flat, i.e., cores share the page-directory.

 
    // reduce an array of size VECTOR_LEN
    // by using the MPI_SUM operation
    Mpi_Reduce(r, s, VECTOR_LEN, MPI_SUM, root);    

Then, the program uses the MPI_Reduce() function with the MPI_SUM operation over the r[] vector. The root core collects all vectors and applies the MPI_SUM operation. The resulting vector is stored in s[]. This vector can only be read by the root core. Mpi_Reduce() relies on VirtIO to send data to the root core.


  if (rank == root){
     for (i=0;  i < VECTOR_LEN; i++){
       // Sum = ((N - 1) * N) / 2
       if (s[i] != (((GetCores()-1) * GetCores()) / 2)){
           printf("failed!, core:", rank);
           break;
       }
     }
     if (i == VECTOR_LEN){
        printf("success! value:", s[i-1]);
     }
    }
}

Finally, the root core checks that the result is the expected. In this case, since each core sends a vector with its rank, the resulting vector is simply the sum of its ranks. I tried this example with 5 cores and the result is the following video, enjoy! 

No comments: