- •IBM Research
 - •Blue Gene/L
 
Network Progress Calls
NAMD makes progress engine calls from the compute loops
– Typical frequency is 10000 cycles, dynamically tunable
for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();
    const CompAtom &p_i = p_0[i];
    // …
    // Compute pairlists
    for (k = 0; k < npairi; ++k) {
        // Compute forces
    }
}
void CmiNetworkProgress() {
    new_time = rts_get_timebase();
    // Less than PERIOD cycles since the last poll: return without polling.
    // lastProgress is left unchanged here so that the period can elapse.
    if (new_time < lastProgress + PERIOD) {
        return;
    }
    lastProgress = new_time;
    AdvanceCommunication();
}
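The throttling idea above can be tried out without Blue Gene hardware. The following is a minimal sketch, assuming a simulated cycle counter in place of rts_get_timebase(); the names ProgressThrottle, maybeProgress, and countPolls are hypothetical and are not part of Charm++ or NAMD.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a timebase-driven progress throttle.
struct ProgressThrottle {
    std::uint64_t period;        // minimum cycles between polls (the slide quotes ~10000)
    std::uint64_t lastProgress;  // cycle count at the last poll
    std::uint64_t polls;         // number of times the network was actually polled

    explicit ProgressThrottle(std::uint64_t p)
        : period(p), lastProgress(0), polls(0) {}

    // Called from the compute loop with the current cycle count.
    // Returns true when it decided to poll the network.
    bool maybeProgress(std::uint64_t now) {
        if (now < lastProgress + period) {
            return false;  // too soon: skip and leave lastProgress unchanged
        }
        lastProgress = now;
        ++polls;  // a real driver would advance communication here
        return true;
    }
};

// Simulate a compute loop whose iterations each cost cyclesPerIter cycles
// and count how many polls the throttle lets through.
inline std::uint64_t countPolls(std::uint64_t iterations,
                                std::uint64_t cyclesPerIter,
                                std::uint64_t period) {
    ProgressThrottle throttle(period);
    std::uint64_t now = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        now += cyclesPerIter;
        throttle.maybeProgress(now);
    }
    return throttle.polls;
}
```

With 100 iterations of 1000 cycles each and a period of 10000 cycles, countPolls reports 10 polls, so the inner loop pays for a network poll on only one iteration in ten.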
MPI Scalability
Charm++ MPI Driver
–Iprobe-based implementation
–Higher progress overhead of MPI_Test
–Statically pinned FIFOs for point-to-point communication
© 2005 IBM Corporation
Charm++ Native Driver
BGX Message Layer (developed by George Almasi)
–Lower progress overhead
–Active messages
• Easily design complex communication protocols
–Dynamic FIFO mapping
–Low overhead remote memory access
–Interrupts
–Charm++ BGX driver was developed by Chao Huang over this summer
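To illustrate what "active messages" means here, the following is a minimal sketch of handler-based dispatch: each packet names a handler that the receiver runs on arrival. Packet, HandlerRegistry, add, and dispatch are hypothetical names for illustration only and do not reflect the actual BGX message layer interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical packet: a handler index plus an opaque payload.
struct Packet {
    std::size_t handlerId;
    std::vector<int> payload;
};

class HandlerRegistry {
public:
    using Handler = std::function<void(const std::vector<int>&)>;

    // Register a handler and return the id that senders place in the packet.
    std::size_t add(Handler h) {
        handlers_.push_back(std::move(h));
        return handlers_.size() - 1;
    }

    // Receiver side: run the handler named by the packet.
    // Returns false when the id is not registered.
    bool dispatch(const Packet& p) const {
        if (p.handlerId >= handlers_.size()) {
            return false;
        }
        handlers_[p.handlerId](p.payload);
        return true;
    }

private:
    std::vector<Handler> handlers_;
};
```

A protocol built this way adds a new message type by registering one more handler, which is one reason active messages make complex communication protocols easier to design.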
BG/L Msglayer
[Figure: block diagram of the BG/L message layer. Recoverable labels:
– Messages: SpadMessage, TreeMessage, TorusMessage, TorusDirectMessage<> (Templates)
– Post
– Msg Queues: Collective msg queue; Torus msg queues 0, 1, 2, …, n-1; Scratchpad msg queue
– Advance loop
– Packets: TreePacket, TorusPacket; dynamically routed packet, deterministically routed packet; FIFO pinning
– Network: Coll. network FIFO; Torus FIFOs I0 and I1 (each 0, 1, 2, H), R0 and R1 (each x+, x-, y+, y-, z+, z-, H)
– Dispatching: Torus pkt. registry (0, 1, 2, …, p); Coll. pkt. disp.]
( This slide is taken from G. Almási’s talk on the “new” msglayer. )
Optimized Multicast
pinFifo Algorithms
–Decide which of the 6 FIFOs to use when sending a message to {x,y,z,t}
–Cones, Chessboard
Dynamic FIFO mapping
–A special send queue whose messages can go out through whichever FIFO is not full
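The difference between pinning a message to one FIFO and mapping it dynamically can be sketched as follows. This is an illustrative model only: the FIFO depth kFifoCapacity and the names FifoBank, sendPinned, and sendDynamic are assumptions, and the Cones and Chessboard pinning algorithms are not reproduced here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumFifos = 6;      // the slide mentions 6 FIFOs
constexpr std::size_t kFifoCapacity = 4;  // hypothetical depth, in packets

struct FifoBank {
    std::array<std::size_t, kNumFifos> used{};  // packets queued per FIFO

    // Pinned send: only the given FIFO may be used; fails when it is full.
    bool sendPinned(std::size_t fifo) {
        if (used[fifo] >= kFifoCapacity) {
            return false;
        }
        ++used[fifo];
        return true;
    }

    // Dynamic mapping: take the first FIFO that still has room.
    // Returns the chosen FIFO index, or -1 when every FIFO is full.
    int sendDynamic() {
        for (std::size_t f = 0; f < kNumFifos; ++f) {
            if (used[f] < kFifoCapacity) {
                ++used[f];
                return static_cast<int>(f);
            }
        }
        return -1;
    }
};
```

In this model a sender pinned to FIFO 0 stalls after 4 packets, while the dynamic sender keeps injecting until all 24 slots across the 6 FIFOs are occupied.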
Communication Pattern in PME
[Figure: PME communication pattern; the two labels in the figure each read "108 procs".]
PME
Plane decomposition for 3D-FFT
PME objects placed close to patch objects on the torus
PME optimized through an asynchronous all-to-all with dynamic FIFO mapping
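A plane decomposition assigns whole grid planes to processors, and the transpose between FFT phases is where the all-to-all arises. The sketch below assumes a simple block distribution and one message per pair of source and destination plane owners; planeOwner and transposeMessages are hypothetical helpers, not NAMD functions, and the 108 figures are taken from the labels on the previous slide.

```cpp
#include <cassert>
#include <cstddef>

// Block distribution of `planes` grid planes over `procs` processors.
// Returns the processor that owns plane `p` (assumes p < planes).
inline std::size_t planeOwner(std::size_t p, std::size_t planes,
                              std::size_t procs) {
    std::size_t perProc = (planes + procs - 1) / procs;  // ceil(planes / procs)
    return p / perProc;
}

// Messages in one transpose step, assuming every source plane owner
// sends one piece to every destination plane owner.
inline std::size_t transposeMessages(std::size_t srcProcs,
                                     std::size_t dstProcs) {
    return srcProcs * dstProcs;
}
```

Under these assumptions, 108 plane owners on each side exchange 108 × 108 = 11664 messages per transpose, which is why overlapping that all-to-all with computation matters.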
Performance Results
BGX Message layer vs MPI
Fully non-blocking version performed below par on MPI
– Polling overhead high for a list of posted receives
BGX message layer works well with asynchronous communication
ApoA1 Benchmark: NAMD Co-Processor Mode Performance (ms/step)
# Nodes | Cutoff: Msglayer | Cutoff: MPI* | with PME: Msglayer | with PME: MPI*
4       | 2250             | 2250         |                    |
32      | 314              | 316          | 356                | 371
128     | 85               | 91.6         | 103                |
512     | 22.7             | 23.8         | 26.7               | 27.8
1024    | 13.2             | 13.9         | 14.4               | 17.3
2048    | 7.9              | 8.1          | 9.7                | 10.2
4096    | 4.8              | 4.9          | 6.8                | 7.3
Message layer has sender-side blocking communication here
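The step times above can be turned into speedup and parallel efficiency relative to the 4-node run. The helper names speedup and efficiency below are hypothetical; the arithmetic uses only numbers from the cutoff columns of the table.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a run relative to a baseline run: baseline time / run time.
inline double speedup(double baseMs, double ms) {
    return baseMs / ms;
}

// Parallel efficiency relative to the baseline: speedup divided by the
// factor by which the node count grew.
inline double efficiency(double baseMs, int baseNodes, double ms, int nodes) {
    return speedup(baseMs, ms) * baseNodes / nodes;
}
```

For the cutoff runs, going from 4 nodes (2250 ms/step) to 4096 nodes gives a speedup of 2250 / 4.8 ≈ 469 with the message layer, or about 46% efficiency relative to the 4-node run, and 2250 / 4.9 ≈ 459 with MPI, or about 45%.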
Blocking vs Overlap
ApoA1 Benchmark in Co-Processor Mode (ms/step)
# Nodes | Cutoff: Blocking Sender | Cutoff: Non-Blocking | with PME: Blocking Sender | with PME: Non-Blocking
32      | 314                     | 313                  | 356                       | 347
128     | 85                      | 82                   | 103                       | 97.2
512     | 22.7                    | 21.7                 | 26.7                      | 23.7
1024    | 13.2                    | 11.9                 | 14.4                      | 13.8
2048    | 7.9                     | 7.3                  | 9.7                       | 8.6
4096    | 4.8                     | 4.3                  | 6.8                       | 6.2
8192    | -                       | 3.7                  | -                         | -
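The benefit of overlap is easier to read as a percentage reduction in step time. The helper percentGain below is a hypothetical name; the inputs are taken directly from the table above.

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction in step time when moving from a blocking sender
// to non-blocking communication.
inline double percentGain(double blockingMs, double nonBlockingMs) {
    return 100.0 * (blockingMs - nonBlockingMs) / blockingMs;
}
```

At 32 nodes the cutoff run gains only about 0.3% (314 to 313 ms/step), while at 4096 nodes it gains about 10.4% (4.8 to 4.3 ms/step) and the PME run about 8.8% (6.8 to 6.2 ms/step), so the gain from overlap is larger at the higher node count.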
