- •Contents
- •List of Figures
- •List of Tables
- •Acknowledgments
- •Introduction to MPI
- •Overview and Goals
- •Background of MPI-1.0
- •Background of MPI-1.1, MPI-1.2, and MPI-2.0
- •Background of MPI-1.3 and MPI-2.1
- •Background of MPI-2.2
- •Who Should Use This Standard?
- •What Platforms Are Targets For Implementation?
- •What Is Included In The Standard?
- •What Is Not Included In The Standard?
- •Organization of this Document
- •MPI Terms and Conventions
- •Document Notation
- •Naming Conventions
- •Semantic Terms
- •Data Types
- •Opaque Objects
- •Array Arguments
- •State
- •Named Constants
- •Choice
- •Addresses
- •Language Binding
- •Deprecated Names and Functions
- •Fortran Binding Issues
- •C Binding Issues
- •C++ Binding Issues
- •Functions and Macros
- •Processes
- •Error Handling
- •Implementation Issues
- •Independence of Basic Runtime Routines
- •Interaction with Signals
- •Examples
- •Point-to-Point Communication
- •Introduction
- •Blocking Send and Receive Operations
- •Blocking Send
- •Message Data
- •Message Envelope
- •Blocking Receive
- •Return Status
- •Passing MPI_STATUS_IGNORE for Status
- •Data Type Matching and Data Conversion
- •Type Matching Rules
- •Type MPI_CHARACTER
- •Data Conversion
- •Communication Modes
- •Semantics of Point-to-Point Communication
- •Buffer Allocation and Usage
- •Nonblocking Communication
- •Communication Request Objects
- •Communication Initiation
- •Communication Completion
- •Semantics of Nonblocking Communications
- •Multiple Completions
- •Non-destructive Test of status
- •Probe and Cancel
- •Persistent Communication Requests
- •Send-Receive
- •Null Processes
- •Datatypes
- •Derived Datatypes
- •Type Constructors with Explicit Addresses
- •Datatype Constructors
- •Subarray Datatype Constructor
- •Distributed Array Datatype Constructor
- •Address and Size Functions
- •Lower-Bound and Upper-Bound Markers
- •Extent and Bounds of Datatypes
- •True Extent of Datatypes
- •Commit and Free
- •Duplicating a Datatype
- •Use of General Datatypes in Communication
- •Correct Use of Addresses
- •Decoding a Datatype
- •Examples
- •Pack and Unpack
- •Canonical MPI_PACK and MPI_UNPACK
- •Collective Communication
- •Introduction and Overview
- •Communicator Argument
- •Applying Collective Operations to Intercommunicators
- •Barrier Synchronization
- •Broadcast
- •Example using MPI_BCAST
- •Gather
- •Examples using MPI_GATHER, MPI_GATHERV
- •Scatter
- •Examples using MPI_SCATTER, MPI_SCATTERV
- •Example using MPI_ALLGATHER
- •All-to-All Scatter/Gather
- •Global Reduction Operations
- •Reduce
- •Signed Characters and Reductions
- •MINLOC and MAXLOC
- •All-Reduce
- •Process-local reduction
- •Reduce-Scatter
- •MPI_REDUCE_SCATTER_BLOCK
- •MPI_REDUCE_SCATTER
- •Scan
- •Inclusive Scan
- •Exclusive Scan
- •Example using MPI_SCAN
- •Correctness
- •Introduction
- •Features Needed to Support Libraries
- •MPI's Support for Libraries
- •Basic Concepts
- •Groups
- •Contexts
- •Intra-Communicators
- •Group Management
- •Group Accessors
- •Group Constructors
- •Group Destructors
- •Communicator Management
- •Communicator Accessors
- •Communicator Constructors
- •Communicator Destructors
- •Motivating Examples
- •Current Practice #1
- •Current Practice #2
- •(Approximate) Current Practice #3
- •Example #4
- •Library Example #1
- •Library Example #2
- •Inter-Communication
- •Inter-communicator Accessors
- •Inter-communicator Operations
- •Inter-Communication Examples
- •Caching
- •Functionality
- •Communicators
- •Windows
- •Datatypes
- •Error Class for Invalid Keyval
- •Attributes Example
- •Naming Objects
- •Formalizing the Loosely Synchronous Model
- •Basic Statements
- •Models of Execution
- •Static communicator allocation
- •Dynamic communicator allocation
- •The General case
- •Process Topologies
- •Introduction
- •Virtual Topologies
- •Embedding in MPI
- •Overview of the Functions
- •Topology Constructors
- •Cartesian Constructor
- •Cartesian Convenience Function: MPI_DIMS_CREATE
- •General (Graph) Constructor
- •Distributed (Graph) Constructor
- •Topology Inquiry Functions
- •Cartesian Shift Coordinates
- •Partitioning of Cartesian structures
- •Low-Level Topology Functions
- •An Application Example
- •MPI Environmental Management
- •Implementation Information
- •Version Inquiries
- •Environmental Inquiries
- •Tag Values
- •Host Rank
- •IO Rank
- •Clock Synchronization
- •Memory Allocation
- •Error Handling
- •Error Handlers for Communicators
- •Error Handlers for Windows
- •Error Handlers for Files
- •Freeing Errorhandlers and Retrieving Error Strings
- •Error Codes and Classes
- •Error Classes, Error Codes, and Error Handlers
- •Timers and Synchronization
- •Startup
- •Allowing User Functions at Process Termination
- •Determining Whether MPI Has Finished
- •Portable MPI Process Startup
- •The Info Object
- •Process Creation and Management
- •Introduction
- •The Dynamic Process Model
- •Starting Processes
- •The Runtime Environment
- •Process Manager Interface
- •Processes in MPI
- •Starting Processes and Establishing Communication
- •Reserved Keys
- •Spawn Example
- •Manager-worker Example, Using MPI_COMM_SPAWN.
- •Establishing Communication
- •Names, Addresses, Ports, and All That
- •Server Routines
- •Client Routines
- •Name Publishing
- •Reserved Key Values
- •Client/Server Examples
- •Ocean/Atmosphere - Relies on Name Publishing
- •Simple Client-Server Example.
- •Other Functionality
- •Universe Size
- •Singleton MPI_INIT
- •MPI_APPNUM
- •Releasing Connections
- •Another Way to Establish MPI Communication
- •One-Sided Communications
- •Introduction
- •Initialization
- •Window Creation
- •Window Attributes
- •Communication Calls
- •Examples
- •Accumulate Functions
- •Synchronization Calls
- •Fence
- •General Active Target Synchronization
- •Lock
- •Assertions
- •Examples
- •Error Handling
- •Error Handlers
- •Error Classes
- •Semantics and Correctness
- •Atomicity
- •Progress
- •Registers and Compiler Optimizations
- •External Interfaces
- •Introduction
- •Generalized Requests
- •Examples
- •Associating Information with Status
- •MPI and Threads
- •General
- •Initialization
- •Introduction
- •File Manipulation
- •Opening a File
- •Closing a File
- •Deleting a File
- •Resizing a File
- •Preallocating Space for a File
- •Querying the Size of a File
- •Querying File Parameters
- •File Info
- •Reserved File Hints
- •File Views
- •Data Access
- •Data Access Routines
- •Positioning
- •Synchronism
- •Coordination
- •Data Access Conventions
- •Data Access with Individual File Pointers
- •Data Access with Shared File Pointers
- •Noncollective Operations
- •Collective Operations
- •Seek
- •Split Collective Data Access Routines
- •File Interoperability
- •Datatypes for File Interoperability
- •Extent Callback
- •Datarep Conversion Functions
- •Matching Data Representations
- •Consistency and Semantics
- •File Consistency
- •Random Access vs. Sequential Files
- •Progress
- •Collective File Operations
- •Type Matching
- •Logical vs. Physical File Layout
- •File Size
- •Examples
- •Asynchronous I/O
- •I/O Error Handling
- •I/O Error Classes
- •Examples
- •Subarray Filetype Constructor
- •Requirements
- •Discussion
- •Logic of the Design
- •Examples
- •MPI Library Implementation
- •Systems with Weak Symbols
- •Systems Without Weak Symbols
- •Complications
- •Multiple Counting
- •Linker Oddities
- •Multiple Levels of Interception
- •Deprecated Functions
- •Deprecated since MPI-2.0
- •Deprecated since MPI-2.2
- •Language Bindings
- •Overview
- •Design
- •C++ Classes for MPI
- •Class Member Functions for MPI
- •Semantics
- •C++ Datatypes
- •Communicators
- •Exceptions
- •Mixed-Language Operability
- •Problems With Fortran Bindings for MPI
- •Problems Due to Strong Typing
- •Problems Due to Data Copying and Sequence Association
- •Special Constants
- •Fortran 90 Derived Types
- •A Problem with Register Optimization
- •Basic Fortran Support
- •Extended Fortran Support
- •The mpi Module
- •No Type Mismatch Problems for Subroutines with Choice Arguments
- •Additional Support for Fortran Numeric Intrinsic Types
- •Language Interoperability
- •Introduction
- •Assumptions
- •Initialization
- •Transfer of Handles
- •Status
- •MPI Opaque Objects
- •Datatypes
- •Callback Functions
- •Error Handlers
- •Reduce Operations
- •Addresses
- •Attributes
- •Extra State
- •Constants
- •Interlanguage Communication
- •Language Bindings Summary
- •Groups, Contexts, Communicators, and Caching Fortran Bindings
- •External Interfaces C++ Bindings
- •Change-Log
- •Bibliography
- •Examples Index
- •MPI Declarations Index
- •MPI Function Index
CHAPTER 11. ONE-SIDED COMMUNICATIONS
11.4.1 Fence
MPI_WIN_FENCE(assert, win)
  IN  assert    program assertion (integer)
  IN  win       window object (handle)
int MPI_Win_fence(int assert, MPI_Win win)

MPI_WIN_FENCE(ASSERT, WIN, IERROR)
    INTEGER ASSERT, WIN, IERROR

{void MPI::Win::Fence(int assert) const (binding deprecated, see Section 15.2)}
The MPI call MPI_WIN_FENCE(assert, win) synchronizes RMA calls on win. The call is collective on the group of win. All RMA operations on win originating at a given process and started before the fence call will complete at that process before the fence call returns. They will be completed at their target before the fence call returns at the target. RMA operations on win started by a process after the fence call returns will access their target window only after MPI_WIN_FENCE has been called by the target process.

The call completes an RMA access epoch if it was preceded by another fence call and the local process issued RMA communication calls on win between these two calls. The call completes an RMA exposure epoch if it was preceded by another fence call and the local window was the target of RMA accesses between these two calls. The call starts an RMA access epoch if it is followed by another fence call and by RMA communication calls issued between these two fence calls. The call starts an exposure epoch if it is followed by another fence call and the local window is the target of RMA accesses between these two fence calls. Thus, the fence call is equivalent to calls to a subset of post, start, complete, wait.

A fence call usually entails a barrier synchronization: a process completes a call to MPI_WIN_FENCE only after all other processes in the group entered their matching call. However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a call with assert = MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.

The assert argument is used to provide assertions on the context of the call that may be used for various optimizations. This is described in Section 11.4.4. A value of assert = 0 is always valid.

Advice to users. Calls to MPI_WIN_FENCE should both precede and follow calls to put, get or accumulate that are synchronized with fence calls. (End of advice to users.)
11.4.2 General Active Target Synchronization
MPI_WIN_START(group, assert, win)
  IN  group     group of target processes (handle)
  IN  assert    program assertion (integer)
  IN  win       window object (handle)
int MPI_Win_start(MPI_Group group, int assert, MPI_Win win)
MPI_WIN_START(GROUP, ASSERT, WIN, IERROR)
    INTEGER GROUP, ASSERT, WIN, IERROR

{void MPI::Win::Start(const MPI::Group& group, int assert) const (binding deprecated, see Section 15.2)}
Starts an RMA access epoch for win. RMA calls issued on win during this epoch must access only windows at processes in group. Each process in group must issue a matching call to MPI_WIN_POST. RMA accesses to each target window will be delayed, if necessary, until the target process executed the matching call to MPI_WIN_POST. MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed, but is not required to.
The assert argument is used to provide assertions on the context of the call that may be used for various optimizations. This is described in Section 11.4.4. A value of assert = 0 is always valid.
MPI_WIN_COMPLETE(win)
  IN  win    window object (handle)
int MPI_Win_complete(MPI_Win win)
MPI_WIN_COMPLETE(WIN, IERROR)
INTEGER WIN, IERROR
{void MPI::Win::Complete() const (binding deprecated, see Section 15.2)}
Completes an RMA access epoch on win started by a call to MPI_WIN_START. All RMA communication calls issued on win during this epoch will have completed at the origin when the call returns.
MPI_WIN_COMPLETE enforces completion of preceding RMA calls at the origin, but not at the target. A put or accumulate call may not have completed at the target when it has completed at the origin.
Consider the sequence of calls in the example below.

Example 11.4
    MPI_Win_start(group, flag, win);
    MPI_Put(..., win);
    MPI_Win_complete(win);
The call to MPI_WIN_COMPLETE does not return until the put call has completed at the origin; and the target window will be accessed by the put operation only after the call to MPI_WIN_START has matched a call to MPI_WIN_POST by the target process.

This still leaves much choice to implementors. The call to MPI_WIN_START can block until the matching call to MPI_WIN_POST occurs at all target processes. One can also have implementations where the call to MPI_WIN_START is nonblocking, but the call to MPI_PUT blocks until the matching call to MPI_WIN_POST occurred; or implementations where the first two calls are nonblocking, but the call to MPI_WIN_COMPLETE blocks until the call to MPI_WIN_POST occurred; or even implementations where all three calls can complete before any target process called MPI_WIN_POST: the data put must be buffered, in this last case, so as to allow the put to complete at the origin ahead of its completion at the target. However, once the call to MPI_WIN_POST is issued, the sequence above must complete, without further dependencies.
MPI_WIN_POST(group, assert, win)
  IN  group     group of origin processes (handle)
  IN  assert    program assertion (integer)
  IN  win       window object (handle)
int MPI_Win_post(MPI_Group group, int assert, MPI_Win win)

MPI_WIN_POST(GROUP, ASSERT, WIN, IERROR)
    INTEGER GROUP, ASSERT, WIN, IERROR

{void MPI::Win::Post(const MPI::Group& group, int assert) const (binding deprecated, see Section 15.2)}

Starts an RMA exposure epoch for the local window associated with win. Only processes in group should access the window with RMA calls on win during this epoch. Each process in group must issue a matching call to MPI_WIN_START. MPI_WIN_POST does not block.

MPI_WIN_WAIT(win)
  IN  win    window object (handle)

int MPI_Win_wait(MPI_Win win)

MPI_WIN_WAIT(WIN, IERROR)
    INTEGER WIN, IERROR

{void MPI::Win::Wait() const (binding deprecated, see Section 15.2)}
Completes an RMA exposure epoch started by a call to MPI_WIN_POST on win. This call matches calls to MPI_WIN_COMPLETE(win) issued by each of the origin processes that were granted access to the window during this epoch. The call to MPI_WIN_WAIT will block until all matching calls to MPI_WIN_COMPLETE have occurred. This guarantees that all these origin processes have completed their RMA accesses to the local window. When the call returns, all these RMA accesses will have completed at the target window.
PROCESS 0     PROCESS 1     PROCESS 2     PROCESS 3
              post(0)       post(0,3)
start(1,2)                                start(2)
put(1)                                    put(2)
put(2)
complete()                                complete()
              wait()        wait()
Figure 11.4: Active target communication. Dashed arrows represent synchronizations and solid arrows represent data transfer.
Figure 11.4 illustrates the use of these four functions. Process 0 puts data in the windows of processes 1 and 2 and process 3 puts data in the window of process 2. Each start call lists the ranks of the processes whose windows will be accessed; each post call lists the ranks of the processes that access the local window. The figure illustrates a possible timing for the events, assuming strong synchronization; in a weak synchronization, the start, put or complete calls may occur ahead of the matching post calls.
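The pattern in Figure 11.4 obeys a simple matching rule: process j appears in process i's start group exactly when i appears in j's post group. The sketch below checks this rule with the groups hardcoded from the figure; the 0/1 matrix encoding and the function name are illustrative assumptions, not part of the standard.

```c
#include <assert.h>

/* Row i of start_grp marks the targets listed by process i's start call;
   row i of post_grp marks the origins listed by process i's post call.
   Values are taken from Figure 11.4. */
enum { N = 4 };

static const int start_grp[N][N] = {
    {0, 1, 1, 0},   /* process 0: start(1,2) */
    {0, 0, 0, 0},   /* process 1: no start   */
    {0, 0, 0, 0},   /* process 2: no start   */
    {0, 0, 1, 0},   /* process 3: start(2)   */
};

static const int post_grp[N][N] = {
    {0, 0, 0, 0},   /* process 0: no post    */
    {1, 0, 0, 0},   /* process 1: post(0)    */
    {1, 0, 0, 1},   /* process 2: post(0,3)  */
    {0, 0, 0, 0},   /* process 3: no post    */
};

/* Matching rule: j is in i's start group iff i is in j's post group. */
static int pattern_consistent(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (start_grp[i][j] != post_grp[j][i])
                return 0;
    return 1;
}
```

An inconsistent pattern, for example a start call naming a target that never posts, would make the check fail and corresponds to an erroneous program.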
MPI_WIN_TEST(win, flag)
  IN   win     window object (handle)
  OUT  flag    success flag (logical)
int MPI_Win_test(MPI_Win win, int *flag)
MPI_WIN_TEST(WIN, FLAG, IERROR)
INTEGER WIN, IERROR
LOGICAL FLAG
{bool MPI::Win::Test() const (binding deprecated, see Section 15.2)}
This is the nonblocking version of MPI_WIN_WAIT. It returns flag = true if all accesses to the local window by the group to which it was exposed by the corresponding MPI_WIN_POST call have been completed as signalled by matching MPI_WIN_COMPLETE calls, and flag = false otherwise. In the former case MPI_WIN_WAIT would have returned immediately. The effect of return of MPI_WIN_TEST with flag = true is the same as the effect of a return of MPI_WIN_WAIT. If flag = false is returned, then the call has no visible effect.

MPI_WIN_TEST should be invoked only where MPI_WIN_WAIT can be invoked. Once the call has returned flag = true, it must not be invoked anew, until the window is posted anew.
Assume that window win is associated with a "hidden" communicator wincomm, used for communication by the processes of win. The rules for matching of post and start calls
and for matching complete and wait calls can be derived from the rules for matching sends and receives, by considering the following (partial) model implementation.
MPI_WIN_POST(group,0,win): initiate a nonblocking send with tag tag0 to each process in group, using wincomm. No need to wait for the completion of these sends.

MPI_WIN_START(group,0,win): initiate a nonblocking receive with tag tag0 from each process in group, using wincomm. An RMA access to a window in target process i is delayed until the receive from i is completed.

MPI_WIN_COMPLETE(win): initiate a nonblocking send with tag tag1 to each process in the group of the preceding start call. No need to wait for the completion of these sends.

MPI_WIN_WAIT(win): initiate a nonblocking receive with tag tag1 from each process in the group of the preceding post call. Wait for the completion of all receives.

No races can occur in a correct program: each of the sends matches a unique receive, and vice versa.
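The model implementation above can be replayed in a toy, single-process simulation that records the sends and receives each call would post and then verifies the no-races claim. This is an illustrative sketch, not an implementation: the Msg type, helper names, and tag values are assumptions.

```c
#include <assert.h>

#define TAG0 0    /* post -> start handshake */
#define TAG1 1    /* complete -> wait handshake */
#define MAXMSG 64

typedef struct { int src, dst, tag, matched; } Msg;

static Msg sends[MAXMSG], recvs[MAXMSG];
static int nsend, nrecv;

/* MPI_WIN_POST(group,0,win): nonblocking send with tag0 to each origin. */
static void model_post(int me, const int *group, int n)
{
    for (int i = 0; i < n; i++)
        sends[nsend++] = (Msg){ me, group[i], TAG0, 0 };
}

/* MPI_WIN_START(group,0,win): nonblocking receive with tag0 from each target. */
static void model_start(int me, const int *group, int n)
{
    for (int i = 0; i < n; i++)
        recvs[nrecv++] = (Msg){ group[i], me, TAG0, 0 };
}

/* MPI_WIN_COMPLETE(win): send with tag1 to each target of the preceding start. */
static void model_complete(int me, const int *group, int n)
{
    for (int i = 0; i < n; i++)
        sends[nsend++] = (Msg){ me, group[i], TAG1, 0 };
}

/* MPI_WIN_WAIT(win): receive with tag1 from each origin of the preceding post. */
static void model_wait(int me, const int *group, int n)
{
    for (int i = 0; i < n; i++)
        recvs[nrecv++] = (Msg){ group[i], me, TAG1, 0 };
}

/* No-races claim: each send matches a unique receive, and vice versa. */
static int all_matched(void)
{
    if (nsend != nrecv)
        return 0;
    for (int i = 0; i < nsend; i++) {
        int hit = -1;
        for (int j = 0; j < nrecv; j++)
            if (!recvs[j].matched && recvs[j].src == sends[i].src &&
                recvs[j].dst == sends[i].dst && recvs[j].tag == sends[i].tag) {
                hit = j;
                break;
            }
        if (hit < 0)
            return 0;
        recvs[hit].matched = 1;
    }
    return 1;
}
```

Replaying the Figure 11.4 pattern (process 1 posts for origin 0, process 2 posts for origins 0 and 3, processes 0 and 3 start, put, and complete) leaves every recorded send matched to a unique receive.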
Rationale. The design for general active target synchronization requires the user to provide complete information on the communication pattern, at each end of a communication link: each origin specifies a list of targets, and each target specifies a list of origins. This provides maximum flexibility (hence, efficiency) for the implementor: each synchronization can be initiated by either side, since each "knows" the identity of the other. This also provides maximum protection from possible races. On the other hand, the design requires more information than RMA needs, in general: in general, it is sufficient for the origin to know the rank of the target, but not vice versa. Users that want more "anonymous" communication will be required to use the fence or lock mechanisms. (End of rationale.)
Advice to users. Assume a communication pattern that is represented by a directed graph G = (V, E), where V = {0, ..., n-1} and (i,j) ∈ E if origin process i accesses the window at target process j. Then each process i issues a call to MPI_WIN_POST(ingroup_i, ...), followed by a call to MPI_WIN_START(outgroup_i, ...), where outgroup_i = {j : (i,j) ∈ E} and ingroup_i = {j : (j,i) ∈ E}. A call is a no-op, and can be skipped, if the group argument is empty. After the communications calls, each process that issued a start will issue a complete. Finally, each process that issued a post will issue a wait.

Note that each process may call with a group argument that has different members. (End of advice to users.)
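The recipe in this advice can be sketched as a small helper that derives each process's post and start groups from the edge list of G. The graph, the process count, and the helper name below are made-up examples for illustration, not part of the standard.

```c
#include <assert.h>

enum { NPROC = 4, NEDGE = 3 };

/* (origin, target) pairs: origin i accesses the window at target j. */
static const int E[NEDGE][2] = { {0, 1}, {0, 2}, {3, 2} };

/* After the call, outgroup[i][j] == 1 iff process i must list j in its
   MPI_WIN_START group, and ingroup[i][j] == 1 iff process i must list j
   in its MPI_WIN_POST group. */
static void build_groups(int outgroup[NPROC][NPROC], int ingroup[NPROC][NPROC])
{
    for (int i = 0; i < NPROC; i++)
        for (int j = 0; j < NPROC; j++)
            outgroup[i][j] = ingroup[i][j] = 0;
    for (int e = 0; e < NEDGE; e++) {
        int i = E[e][0], j = E[e][1];
        outgroup[i][j] = 1;   /* origin i targets j's window  */
        ingroup[j][i] = 1;    /* target j is accessed by i    */
    }
}
```

By construction, j appears in i's outgroup exactly when i appears in j's ingroup, which is the matching condition required for the post and start calls to pair up.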