Mini-ranks
Sébastien Boisvert
2012-10-17, Argonne National Laboratory

Disclaimer: this document contains ideas and blueprints, but the true design
is in the source code with corresponding comments therein.

2012-10-17 10:00 UTC/GMT-5:

This idea was discussed during a meeting with Professor Rick Stevens.

The concept of mini-rank in electronics was introduced in 2008 in the paper

"Mini-rank: Adaptive DRAM architecture for improving memory power efficiency"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.4308&rep=rep1&type=pdf


Here, we use the concept in the context of MPI ranks, not memory ranks.

The processing machinery of a computer is hierarchical with
this progression:

	socket -> processor -> core -> thread

The MPI programming model only gives MPI ranks that send messages to each
other. Usually a MPI Rank is mapped to a single process. And a process runs
on a single core.

Figure 1: The MPI programming model.

            +--------------------+
	    |   MPI_COMM_WORLD   |           MPI communicator
            +---------+----------+
	              |
    +------+------+---+--+------+------+
    |      |      |      |      |      |
  +---+  +---+  +---+  +---+  +---+  +---+ 
  | 0 |  | 1 |  | 2 |  | 3 |  | 4 |  | 5 |    MPI ranks
  +---+  +---+  +---+  +---+  +---+  +---+


Mini ranks can be thought as ranks within ranks.


Figure 2: The MPI programming model, with mini ranks.

            +--------------------+
	    |   MPI_COMM_WORLD   |           MPI communicator
            +---------+----------+
	              |
    +------+------+---+--+------+------+
    |      |      |      |      |      |
  +---+  +---+  +---+  +---+  +---+  +---+ 
  | 0 |  | 1 |  | 2 |  | 3 |  | 4 |  | 5 |    MPI ranks   (1 RankProcess.cpp instance per rank)
  +---+  +---+  +---+  +---+  +---+  +---+                    with the thread from main() for MPI calls
  |   |  |   |  |   |  |   |  |   |  |   |
  | 0 |  | 4 |  | 8 |  |12 |  |16 |  |20 |  |                
  | 1 |  | 5 |  | 9 |  |13 |  |17 |  |21 |  | => mini-ranks
  | 2 |  | 6 |  |10 |  |14 |  |18 |  |22 |  |
  | 3 |  | 7 |  |11 |  |15 |  |19 |  |23 |  |  (1 MiniRank.cpp instance per minirank (in 1 pthread))
  |   |  |   |  |   |  |   |  |   |  |   |  |   (1 Application running with its RayPlatform ComputeCore.cpp)
  +---+  +---+  +---+  +---+  +---+  +---+       

Current programming model:

Each MPI rank runs one 

class Machine.cpp (in Ray)
this class create application plugins and starts the main loop

the main loop is in RayPlatform.




Future software design:


class VirtualMachine.cpp  (runs as 1 MPI rank)

             (1 pthread doing the MPI calls)
		- picks up stuff from mini rank outbox
		- copy these buffers into another one that will be used by
		Mailman
		- deposit incoming messages into the inbox of miniranks


	class Machine.cpp  (running a mini rank in 1 pthread)
		plugins see only a inbox and a outbox

		buffers of mini ranks are pristine, they can not be dirty
		they can also call barrier on RayPlatform

		class ComputeCore.cpp  (with the main loop)

			- add locks on inbox and outbox.


As of Ray v2.0.0, each MPI rank is running as a Machine.cpp

Machine.cpp is not in RayPlatform however.


## Adapters

For adapters, the current implementation is that each plugin is a singleton.
This must be changed. Instead, each static variable in a plugin must be an
array of plugin.

Then, the callbacks will take an extra parameter: the index of the minirank
within the rank.

The current code was reverted to the virtual methods.


## Message reception

Designed in collaboration with Fangfang Xia

1. rank receives message with MPI
2. rank locks the corresponding minirank inbox
3. rank push the message to the minirank inbox
4. rank unlocks the minirank inbox
5. The minirank locks its inbox
6. The minirank consumes its message
7. The minirank unlocks its inbox

## Sending a message

Designed in collaboration with Fangfang Xia

1. The minirank puts a message in its outbox
2. The minirank locks the rank outbox
3. The minirank pushes onto the rank outbox
4. The minirank unlocks the rank outbox
5. The rank locks its outbox
6. The rank pushes messages onto the network
7. The rank unlocks its outbox



1 important aspect is that the rank has no buffering system at all.
All the buffering systems are in the miniranks. For MPI_Recv, the virtual
machine uses the buffer from the corresponding minirank. 
For MPI_Isend, the virtual machine uses the buffer from the minirank.
The dirty/cleaning subsystem is local to each minirank.


For a system with 8 nodes, 2 processors per node, and with 4 cores per processor,
we have 16 processors and a total of 64 cores.

Case 1:
It can be run like with the classic MPI programming model using 1 MPI rank per
core -- 64 MPI ranks. Thus we would have 64 processes.


mpiexec -n 64 Ray -mini-ranks-per-rank 1

  (the default for -mini-ranks-per-rank is 1)

 - 8 nodes
 - 16 processors
 - 64 cores
 - 64 MPI ranks
 - 64 processes
 - 64 threads (1-to-1)

Case 2:
An alternative is to have 1 MPI rank per node. Each MPI rank has 7 miniranks.
Each minirank runs in a thread. There is one extra thread for communication too.

mpiexec -n 8 -bynode Ray -mini-ranks-per-rank 7

 - 8 nodes
 - 16 processors
 - 64 cores
 - 8 MPI ranks
 - 8 processes
 - 8*7=54 mini-rank threads
 - 8 comm. threads (1-to-1)

Case 3:
We can also use 1 MPI rank per processor, and 3 miniranks per rank.

mpiexec -n 16 -bynode Ray -mini-ranks-per-rank 3

 - 8 nodes
 - 16 processors
 - 64 cores
 - 16 MPI ranks
 - 16 processes
 - 16*3=48 mini-rank threads
 - 16*1=16 comm. threads (1-to-1)


With Open-MPI, the --bind-to-socket will be very nice.
The -bynode option is necessary.

2012-10-17 20:00 UTC/GMT-5:00:

Tasks:

- [DONE]add VirtualMachine
- [DONE] add MiniRank
- propagate mini rank numbers instead of ranks
- compile Ray without errors
- port plugin to the old adapter architecture
- implement the case with 1 minirank per rank (no pthread)


2012-10-18 10:44 UTF/GMT -5:00:



An example of layout on the Colosse super computer at Laval University:

mpiexec -n 8 -bynode Ray -mini-ranks-per-rank 7


- assumes no HyperThreading(R)
- assumes 8 nodes, 2 sockets / node, 1 CPU / socket, 4 cores / CPU, 1 thread / core 
- assumes 8 MPI ranks, 7 mini-ranks / MPI rank
- total of 56 miniranks, 8 MPI ranks

The plugin code only know about mini-ranks. So the plugin code is portable.

   MPI rank 0
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 0 (addr [0,0], 1 pthread)
      mini-rank 1 (addr [0,1], 1 pthread)
      mini-rank 2 (addr [0,2], 1 pthread)
      mini-rank 3 (addr [0,3], 1 pthread)
      mini-rank 4 (addr [0,4], 1 pthread)
      mini-rank 5 (addr [0,5], 1 pthread)
      mini-rank 6 (addr [0,6], 1 pthread)

   MPI rank 1
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 7 (addr [1,0], 1 pthread)
      mini-rank 8 (addr [1,1], 1 pthread)
      mini-rank 9 (addr [1,2], 1 pthread)
      mini-rank 10 (addr [1,3], 1 pthread)
      mini-rank 11 (addr [1,4], 1 pthread)
      mini-rank 12 (addr [1,5], 1 pthread)
      mini-rank 13 (addr [1,6], 1 pthread)

   MPI rank 2
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 14 (addr [2,0], 1 pthread)
      mini-rank 15 (addr [2,1], 1 pthread)
      mini-rank 16 (addr [2,2], 1 pthread)
      mini-rank 17 (addr [2,3], 1 pthread)
      mini-rank 18 (addr [2,4], 1 pthread)
      mini-rank 19 (addr [2,5], 1 pthread)
      mini-rank 20 (addr [2,6], 1 pthread)

   MPI rank 3
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 21 (addr [3,0], 1 pthread)
      mini-rank 22 (addr [3,1], 1 pthread)
      mini-rank 23 (addr [3,2], 1 pthread)
      mini-rank 24 (addr [3,3], 1 pthread)
      mini-rank 25 (addr [3,4], 1 pthread)
      mini-rank 26 (addr [3,5], 1 pthread)
      mini-rank 27 (addr [3,6], 1 pthread)

   MPI rank 4
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 28 (addr [4,0], 1 pthread)
      mini-rank 29 (addr [4,1], 1 pthread)
      mini-rank 30 (addr [4,2], 1 pthread)
      mini-rank 31 (addr [4,3], 1 pthread)
      mini-rank 32 (addr [4,4], 1 pthread)
      mini-rank 33 (addr [4,5], 1 pthread)
      mini-rank 34 (addr [4,6], 1 pthread)

   MPI rank 5
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 35 (addr [5,0], 1 pthread)
      mini-rank 36 (addr [5,1], 1 pthread)
      mini-rank 37 (addr [5,2], 1 pthread)
      mini-rank 38 (addr [5,3], 1 pthread)
      mini-rank 39 (addr [5,4], 1 pthread)
      mini-rank 40 (addr [5,5], 1 pthread)
      mini-rank 41 (addr [5,6], 1 pthread)

   MPI rank 6
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 42 (addr [6,0], 1 pthread)
      mini-rank 43 (addr [6,1], 1 pthread)
      mini-rank 44 (addr [6,2], 1 pthread)
      mini-rank 45 (addr [6,3], 1 pthread)
      mini-rank 46 (addr [6,4], 1 pthread)
      mini-rank 47 (addr [6,5], 1 pthread)
      mini-rank 48 (addr [6,6], 1 pthread)

   MPI rank 7
      control thread (communication thread, don't use pthread because it's the main thread)
      mini-rank 49 (addr [7,0], 1 pthread)
      mini-rank 50 (addr [7,1], 1 pthread)
      mini-rank 51 (addr [7,2], 1 pthread)
      mini-rank 52 (addr [7,3], 1 pthread)
      mini-rank 53 (addr [7,4], 1 pthread)
      mini-rank 54 (addr [7,5], 1 pthread)
      mini-rank 55 (addr [7,6], 1 pthread)


2012-10-29 GMT -4:00

The secret sauce is about synchronization.

This 2003 Ph.D. thesis pretty much summarizes the state-of-the-art (in 2003).

  http://www.cse.chalmers.se/~tsigas/papers/Yi-Thesis.pdf


As a starting point, there is a lot of explanations on this project, with associated code
in the public domain.

  http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular


The future of high-performance computing is with lockless non-blocking distributed algorithms.

- lockless => choose atomic operations over locks;
- non-blocking => choose trylock over lock
- distributed => choose message passing over shared structures


