Institute for Advanced Simulation (IAS)

MPIX: Blue Gene/Q specific extensions to MPI

IBM provides extensions to MPICH2 that ease the use of the Blue Gene/Q hardware. The names of these extensions start with MPIX instead of MPI. A C/C++ interface is available for all extensions; a Fortran interface is available for a subset.

Usage

C/C++

In order to use the extensions, please include mpix.h:

#include <mpix.h>

and compile your program with the usual MPI compiler wrappers (mpixlc_r, mpixlcxx_r, ...).

FORTRAN

Besides 'include mpif.h', no extra 'use' or 'include' statement is needed. Just compile your program with the usual MPI compiler wrappers (mpixlf90_r, mpixlf95_r, ...).

Supported MPIX functions

The API of the available MPIX functions is described below, and some usage examples are given. Unless stated otherwise, the examples were executed with the following LoadLeveler job script:

# @job_name = mpix-test
# @comment = "mpix-test"
# @environment = COPY_ALL
# @job_type = bluegene
# @bg_size = 64
# @bg_connectivity = torus
# @wall_clock_limit = 00:10:00
# @queue

runjob -p 64 : ./mpix_test.x

Below, the syntax of the functions/subroutines is given for C and Fortran; examples are provided in C only.

MPIX_Cart_comm_create

FORTRAN
MPIX_CART_COMM_CREATE(INTEGER cart_comm, INTEGER ierr)
C
int MPIX_Cart_comm_create(MPI_Comm *cart_comm)
Info
OUT cart_comm (new Cartesian communicator)

Return codes
MPI_SUCCESS
MPI_ERR_TOPOLOGY

Collective operation on MPI_COMM_WORLD

Description
Creates a Cartesian communicator that mimics the exact hardware on which it is run. The dimension of this communicator is

n = numdim + 1 = hw.torus_dimension + 1

as can be obtained using MPIX_Torus_ndims, MPIX_Hardware or MPI_Cartdim_get. It will only work properly if the application runs on all nodes of a partition, i.e. if you reserve 512 nodes, your application must not use fewer nodes. Because of MPICH2 dimension ordering, the associated arrays (i.e. coords, sizes, and periods) are in [A,B,C,D,E,T] order.

Regardless of the task mapping used, the coordinates of each rank in this communicator always match its hardware coordinates, without any need to modify the source code. This is in contrast to the coordinates obtained with the standard MPI_Cart_create.

Example A
MPI_Comm cart_comm;
int my_rank_cart, numdims;
int *cartcoords;

MPIX_Cart_comm_create(&cart_comm);
MPI_Comm_rank(cart_comm, &my_rank_cart);
MPI_Cartdim_get(cart_comm, &numdims);
cartcoords = malloc(numdims * sizeof(int));
MPI_Cart_coords(cart_comm, my_rank_cart,
                numdims, cartcoords);

Example B
MPI_Comm cart_comm_std;
int i, my_rank_cart_std, numdims, reorder = 1;
int *cartcoords_std, *periods, *dims;
MPIX_Hardware_t hw;

MPIX_Torus_ndims(&numdims);
MPIX_Hardware(&hw);
periods = malloc((numdims+1) * sizeof(int));
dims = malloc((numdims+1) * sizeof(int));

for (i = 0; i <= numdims; i++) {
  periods[i] = 0;
  if (i == numdims) {
    dims[i] = 64;   /* processes per node (T dimension) */
  }
  else {
    dims[i] = hw.Size[i];
  }
}

MPI_Cart_create(MPI_COMM_WORLD, numdims+1,
                dims, periods, reorder,
                &cart_comm_std);
MPI_Comm_rank(cart_comm_std, &my_rank_cart_std);
cartcoords_std = malloc((numdims+1) * sizeof(int));
MPI_Cart_coords(cart_comm_std, my_rank_cart_std,
                numdims+1, cartcoords_std);

Results
When running the above examples on 64 nodes of JUQUEEN with the given --mapping option for runjob, the table below shows the coordinates obtained for rank 2049:

Mapping            Example A             Example B                 HW Coords
--mapping ABCDET   cartcoord[0] = 1      cartcoord_std[0] = 1      coord[0] = 1
                   cartcoord[1] = 0      cartcoord_std[1] = 0      coord[1] = 0
                   cartcoord[2] = 0      cartcoord_std[2] = 0      coord[2] = 0
                   cartcoord[3] = 0      cartcoord_std[3] = 0      coord[3] = 0
                   cartcoord[4] = 0      cartcoord_std[4] = 0      coord[4] = 0
                   cartcoord[5] = 1      cartcoord_std[5] = 1      coord[5] = 1
--mapping DCATBE   cartcoord[0] = 0      cartcoord_std[0] = 1      coord[0] = 0
                   cartcoord[1] = 0      cartcoord_std[1] = 0      coord[1] = 0
                   cartcoord[2] = 0      cartcoord_std[2] = 0      coord[2] = 0
                   cartcoord[3] = 1      cartcoord_std[3] = 0      coord[3] = 1
                   cartcoord[4] = 1      cartcoord_std[4] = 0      coord[4] = 1
                   cartcoord[5] = 0      cartcoord_std[5] = 1      coord[5] = 0

The results show that, regardless of the mapping used, cart_comm (Example A) always yields the hardware coordinates (HW Coords) of the rank.

MPIX_Comm_rank2global

FORTRAN
MPIX_COMM_RANK2GLOBAL(INTEGER comm,
INTEGER crank, INTEGER grank,
INTEGER ierr)
C
int MPIX_Comm_rank2global(MPI_Comm comm, int crank,
int *grank)
Info
IN comm (Communicator associated with crank)
IN crank (Input rank)
OUT grank (Rank in MPI_COMM_WORLD)

Return codes
MPI_SUCCESS
Returns an error on failure detection

Description
Determines the rank-in-COMM_WORLD grank of the process associated with rank-in-comm crank.

MPIX_Comm_update

C
int MPIX_Comm_update(MPI_Comm comm, int optimize)
Info
IN comm (Communicator to be optimized/deoptimized)
IN optimize (=1 optimize, =0 deoptimize)

Description
Optimize/deoptimize a communicator by adding/stripping platform specific optimizations (i.e. class routes support for efficient bcast/reductions).

MPIX_Dump_stacks

C
void MPIX_Dump_stacks();

Description
Prints the current system stack; the output is directed to stderr. The first frame (this function itself) is discarded to make the trace look nicer.

MPIX_Get_last_algorithm_name

C
int MPIX_Get_last_algorithm_name
(MPI_Comm comm, char *protocol, int length);
Info
IN comm (Communicator)
OUT protocol (Name of protocol)
IN length (Length of protocol string)

Description
Returns the name of the most recently used collective protocol. The maximum length of the protocol string is 100 characters.

MPIX_Hardware

C
int MPIX_Hardware(MPIX_Hardware_t *hw)
Info
OUT hw (Struct with hardware information)

Description
Returns information about the hardware the application is running on. The struct MPIX_Hardware_t has the following components:

unsigned prank;      // Physical rank of node
unsigned psize;      // Size of the partition
unsigned ppn;        // Processes per node
unsigned coreID;     // Process id (0..63)
unsigned clockMHz;   // Frequency in MHz
unsigned memSize;    // Node memory in MB
unsigned torus_dimension;               // Dimension of torus
unsigned Size[MPIX_TORUS_MAX_DIMS];     // Max coords on torus
unsigned Coords[MPIX_TORUS_MAX_DIMS];   // This node's coords
unsigned isTorus[MPIX_TORUS_MAX_DIMS];  // Wraparound links?
unsigned rankInPset; // Zero on Blue Gene/Q
unsigned sizeOfPset; // Zero on Blue Gene/Q
unsigned idOfPset;   // Zero on Blue Gene/Q

Example
MPIX_Hardware_t hw;
MPIX_Hardware(&hw);

When running the above example on 64 nodes of JUQUEEN, one obtains the following results for rank 2049 (mapping ABCDET):

hw.prank (physical rank of node): 192
hw.psize (size of partition) : 4096
hw.ppn (processes per node) : 64
hw.coreID (ID of core -> 0..63) : 0
hw.clockMHz (Clock speed) : 1600
hw.memSize (memSize) : 16384
hw.torus_dimension (dimension of torus) : 5
hw.Size[0] (max. dimension A) : 2
hw.Size[1] (max. dimension B) : 2
hw.Size[2] (max. dimension C) : 4
hw.Size[3] (max. dimension D) : 2
hw.Size[4] (max. dimension E) : 2
hw.Coords[0] (A coordinate of node): 0
hw.Coords[1] (B coordinate of node): 0
hw.Coords[2] (C coordinate of node): 0
hw.Coords[3] (D coordinate of node): 1
hw.Coords[4] (E coordinate of node): 1
hw.isTorus[0] (wraparounds in A?) : 1
hw.isTorus[1] (wraparounds in B?) : 1
hw.isTorus[2] (wraparounds in C?) : 1
hw.isTorus[3] (wraparounds in D?) : 1
hw.isTorus[4] (wraparounds in E?) : 1

MPIX_IO_distance

FORTRAN
MPIX_IO_DISTANCE(INTEGER io_distance)
C
int MPIX_IO_distance()
Info
OUT io_distance (Distance to I/O node
in Hops)
Description
Returns the distance to the associated I/O node in number of hops.

MPIX_IO_link_id

FORTRAN
MPIX_IO_LINK_ID(INTEGER io_link_id)
C
int MPIX_IO_link_id()
Info
OUT io_link_id (ID of the link to corresponding
I/O node)
Description
Returns the ID of the link to the associated I/O node.

MPIX_IO_node_id

FORTRAN
MPIX_IO_NODE_ID(INTEGER io_node_id)
C
int MPIX_IO_node_id()
Info
OUT io_node_id (ID of the corresponding
I/O node)
Description
Returns the ID of the associated I/O node.

MPIX_Progress_quiesce

C
int MPIX_Progress_quiesce(double timeout)
Info
IN timeout (Maximum time (seconds) to wait;
=0 for internal default)
Return codes
MPI_SUCCESS (Network appears to be quiesced)
MPI_ERR_PENDING (Network did not quiesce)
MPI_ERR_OTHER (Error(s) encountered; network
state unknown)

Description
Waits for the network to quiesce.

MPIX_Pset_diff_comm_create

FORTRAN
MPIX_PSET_DIFF_COMM_CREATE
(INTEGER pset_comm_diff,
INTEGER ierr)
C
int MPIX_Pset_diff_comm_create
(MPI_Comm *pset_comm_diff)
Info
OUT pset_comm_diff (New communicator)

Return codes
MPI_SUCCESS
Error code in case of failure

Collective operation on MPI_COMM_WORLD

Description
Returns a communicator containing only MPI ranks that run on nodes attached to different I/O bridge nodes. The name of this function was chosen for backwards compatibility, since psets no longer exist on Blue Gene/Q. If this function is needed on a communicator other than MPI_COMM_WORLD, please use MPIX_Pset_diff_comm_create_from_parent.

MPIX_Pset_diff_comm_create_from_parent

FORTRAN
MPIX_PSET_DIFF_COMM_CREATE_FROM_PARENT
(INTEGER parent_comm,
INTEGER pset_comm_diff,
INTEGER ierr)
C
int MPIX_Pset_diff_comm_create_from_parent
(MPI_Comm parent_comm,
MPI_Comm *pset_comm_diff)
Info
IN parent_comm (Old communicator)
OUT pset_comm_diff (New communicator)

Return codes
MPI_SUCCESS
Error code in case of failure

Collective operation on parent_comm

Description
Returns a communicator containing only MPI ranks that run on nodes attached to different I/O bridge nodes. The name of this function was chosen for backwards compatibility, since psets no longer exist on Blue Gene/Q. If this function is needed on MPI_COMM_WORLD, MPIX_Pset_diff_comm_create can be used.

MPIX_Pset_same_comm_create

FORTRAN
MPIX_PSET_SAME_COMM_CREATE
(INTEGER pset_comm_same,
INTEGER ierr)
C
int MPIX_Pset_same_comm_create
(MPI_Comm *pset_comm_same)
Info
OUT pset_comm_same (New communicator)

Return codes
MPI_SUCCESS
Error code in case of failure

Collective operation on MPI_COMM_WORLD

Description
Returns a communicator containing only MPI ranks that run on nodes attached to the same I/O bridge node. The name of this function was chosen for backwards compatibility, since psets no longer exist on Blue Gene/Q. If this function is needed on a communicator other than MPI_COMM_WORLD, please use MPIX_Pset_same_comm_create_from_parent.

MPIX_Pset_same_comm_create_from_parent

FORTRAN
MPIX_PSET_SAME_COMM_CREATE_FROM_PARENT
(INTEGER parent_comm,
INTEGER pset_comm_same,
INTEGER ierr)
C
int MPIX_Pset_same_comm_create_from_parent
(MPI_Comm parent_comm,
MPI_Comm *pset_comm_same)
Info
IN parent_comm (Old communicator)
OUT pset_comm_same (New communicator)

Return codes
MPI_SUCCESS
Error code in case of failure

Collective operation on parent_comm

Description
Returns a communicator containing only MPI ranks that run on nodes attached to the same I/O bridge node. The name of this function was chosen for backwards compatibility, since psets no longer exist on Blue Gene/Q. If this function is needed on MPI_COMM_WORLD, MPIX_Pset_same_comm_create can be used.

MPIX_Rank2torus

C
int MPIX_Rank2torus(int rank, int *coords)
Info
IN rank (MPI rank to convert)
OUT coords (Coordinates of rank)

Description
Converts an MPI rank into physical coordinates (ABCDE coordinates plus core ID T). It is the inverse of MPIX_Torus2rank.

Example
int my_rank, numdims;
int *coords;
MPIX_Torus_ndims(&numdims);
coords = malloc((numdims+1) * sizeof(int));
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPIX_Rank2torus(my_rank, coords);

Results
When running the above example on 64 nodes of JUQUEEN, one obtains the following results for rank 2049 in MPI_COMM_WORLD (mapping DCATBE):


coord[0] (A coordinate): 0
coord[1] (B coordinate): 0
coord[2] (C coordinate): 0
coord[3] (D coordinate): 1
coord[4] (E coordinate): 1
coord[5] (T coordinate): 0

MPIX_Torus_ndims

C
int MPIX_Torus_ndims(int *numdim)
Info
OUT numdim (Actual dimensions of torus)

Description
Determines the number of physical hardware dimensions. This does not include the T coordinate for the core.

MPIX_Torus2rank

C
int MPIX_Torus2rank(int *coords, int *rank)
Info
IN coords (Coords to be converted)
OUT rank (Rank corresponding to coords)

Description
Converts a set of coordinates (physical coordinates plus core/thread ID T) to an MPI rank. It is the inverse of MPIX_Rank2torus.

