Objective: To implement several versions of a broadcast collective (i.e., a one-to-many communication), and to compare them in various platform setups.
This module consists of 4 activities, each described in its own tab above, which should be done in sequence:
In all activities above you will compare your implementation with MPI_Bcast, for different platform configurations.
Throughout this module, you will be adding code to a "skeleton program" (download bcast_skeleton.c).
This program performs a broadcast of 10^8 bytes (~95 MiB) and takes the following command-line arguments:
You should not modify bcast_skeleton.c outside of the "TO IMPLEMENT: BEGIN" and "TO IMPLEMENT: END" comments. See the comments therein for all the information needed to implement your code. Note that you can opt to write your code directly in between these two comments, or to #include it; it's up to you. Regardless, the goal is to have a single program, bcast_skeleton, that is executed with different command-line arguments for all activities in this module.
The skeleton program does the following for you: parsing of command-line arguments, creation of broadcast data, checking that the broadcast operation was done correctly, timing of the broadcast operation.
You should turn in a single archive (or a GitHub repo) with:
Look at the section of code in bcast_skeleton.c between the TO IMPLEMENT: BEGIN and TO IMPLEMENT: END comments. The comments list the relevant variables that you will use in your code. You will note, in that section, that bcast_skeleton.c does an #include of bcast_solution.c. So you should create a file bcast_solution.c in which you will write all the code for this assignment. This is more convenient than modifying bcast_skeleton.c directly.
In bcast_solution.c, implement a broadcast implementation named naive_bcast (first command-line argument). This implementation simply has the root of the broadcast (the process with rank 0) perform a single point-to-point MPI_Send of the whole buffer to each other process (which performs an MPI_Recv). In other words, the processes are logically organized as a mono-directional star with the process of rank 0 at the center. The second command-line argument (chunk size) is ignored.
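For concreteness, here is a minimal sketch of what naive_bcast could look like; the variable names (buffer, num_bytes, rank, num_procs) are placeholders and may differ from the ones listed in bcast_skeleton.c:

    #include <mpi.h>

    /* Sketch of naive_bcast: rank 0 sends the whole buffer to every other rank.
       Variable names are placeholders; use the ones provided by bcast_skeleton.c. */
    void naive_bcast(char *buffer, int num_bytes, int rank, int num_procs)
    {
        if (rank == 0) {
            for (int dst = 1; dst < num_procs; dst++)
                MPI_Send(buffer, num_bytes, MPI_CHAR, dst, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(buffer, num_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }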
Augment your code in file bcast_solution.c with a broadcast implementation named ring_bcast (first command-line argument). In this implementation, the process of rank 0 (the root of the broadcast) does a single MPI_Send of the whole buffer to the process of rank 1, the process of rank 1 then does a single MPI_Send to the process of rank 2, etc. In other words, the processes are logically organized as a mono-directional ring. The last process should not send the buffer back to the root process (the root process does not call MPI_Recv). The second command-line argument (chunk size) is ignored.
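For concreteness, here is a minimal sketch of ring_bcast, using the same placeholder variable names as above:

    #include <mpi.h>

    /* Sketch of ring_bcast: every rank except the root receives the whole buffer
       from its predecessor; every rank except the last forwards it to its
       successor. Variable names are placeholders. */
    void ring_bcast(char *buffer, int num_bytes, int rank, int num_procs)
    {
        if (rank > 0)
            MPI_Recv(buffer, num_bytes, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        if (rank < num_procs - 1)
            MPI_Send(buffer, num_bytes, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD);
    }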
Before you proceed with the evaluation of your two implementations, implement a third broadcast implementation, named default_bcast, that simply has all processes call MPI_Bcast. This is the MPI implementation of the broadcast collective communication, which in this module we otherwise implement "by hand" using point-to-point communications.
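This one is essentially a one-liner; a sketch (again with placeholder variable names):

    #include <mpi.h>

    /* Sketch of default_bcast: all ranks call the MPI collective, with rank 0
       as the root. Variable names are placeholders. */
    void default_bcast(char *buffer, int num_bytes)
    {
        MPI_Bcast(buffer, num_bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
    }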
We compare the performance of the three implementations on a 50-processor physical ring. Although some supercomputers in the old days were designed with a ring network topology, this is no longer the case. The main drawback of a physical ring is that it has a very large diameter (i.e., there can be ~n/2 hops between two processors in an n-processor ring). The main advantages are that the degree is low (2), which implies low cost, and that routing is straightforward. For now we assume a simple physical ring so as to better understand broadcast performance.
It is pretty simple to generate a SimGrid platform file for a ring and the corresponding hostfile. You can download a Python script (generate_xml_ring_and_hostfile.py) that generates these two files given a number of processors passed as a command-line argument. Just in case, here are the files generated using this script for 50 processors: ring_50.xml and hostfile_50.txt.
The way to invoke the program is as follows:
The --cfg=smpi/bcast:mpich flag above specifies that we simulate MPI_Bcast (for the default_bcast implementation) as implemented in MPICH. Other options are possible, but it's okay to stick with this implementation.
Answer the following questions:
The MPI_Wtime function is convenient for determining the current (simulated) date. This function returns the date as a double precision number (and is in fact already used in bcast_skeleton.c).
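For reference, the usual timing pattern looks like this (bcast_skeleton.c already does this around the broadcast call, so this is only a reminder; the function name is a placeholder):

    #include <stdio.h>
    #include <mpi.h>

    /* Timing a section of code with MPI_Wtime. */
    void time_example(void)
    {
        double start = MPI_Wtime();
        /* ... operation to time ... */
        double elapsed = MPI_Wtime() - start;
        printf("Elapsed (simulated) time: %.6f seconds\n", elapsed);
    }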
Warning: SMPI implements sophisticated simulation models that capture many real-world effects (network protocol idiosyncrasies, MPI implementation idiosyncrasies). These effects will be seen in your experiments, just as they would on a real-world platform, and they tend to make performance behavior more difficult to understand. For instance, if you modify the buffer size, you will see non-linear effects on performance (e.g., broadcasting a buffer twice as large will not require exactly twice as much time, broadcasting a buffer 1 byte larger may increase broadcast time significantly).
One of the reasons why the default MPI broadcast is fast is that it does pipelining of communication. Augment your code in bcast_solution.c with a new broadcast implementation called pipelined_ring_bcast. This implementation uses the chunk size passed as the second command-line argument to split the broadcast data buffer into multiple chunks. These chunks are communicated in sequence along the ring. In the first step, process #0 sends chunk #0 to process #1; in the second step, process #1 sends chunk #0 to process #2 and process #0 sends chunk #1 to process #1; and so on. As a result of this pipelining, provided the chunk size is small enough, all network links can be utilized simultaneously. As in the previous activity, use MPI_Send and MPI_Recv for communications.
Note that the chunk size may not divide the message size, in which case the last chunk will be smaller.
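For concreteness, here is one possible way to structure the chunking loop; it is only a sketch with placeholder variable names (buffer, num_bytes, chunk_size, rank, num_procs):

    #include <mpi.h>

    /* Sketch of pipelined_ring_bcast: the buffer is split into chunks of at most
       chunk_size bytes; each rank receives chunk #i from its predecessor and then
       forwards it to its successor. The last chunk may be smaller. Variable names
       are placeholders. */
    void pipelined_ring_bcast(char *buffer, int num_bytes, int chunk_size,
                              int rank, int num_procs)
    {
        for (int offset = 0; offset < num_bytes; offset += chunk_size) {
            int this_chunk = num_bytes - offset;
            if (this_chunk > chunk_size)
                this_chunk = chunk_size;
            if (rank > 0)
                MPI_Recv(buffer + offset, this_chunk, MPI_CHAR, rank - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (rank < num_procs - 1)
                MPI_Send(buffer + offset, this_chunk, MPI_CHAR, rank + 1, 0,
                         MPI_COMM_WORLD);
        }
    }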
Run pipelined_ring_bcast using chunk sizes of 100,000 bytes, 500,000 bytes, 1,000,000 bytes, 5,000,000 bytes, 10,000,000 bytes, 50,000,000 bytes, and 100,000,000 bytes (i.e., 1 chunk), on a 20-processor ring, a 35-processor ring, and a 50-processor ring. Report the simulated execution times. Remember that the generate_xml_ring_and_hostfile.py Python script can be used to generate platform files and hostfiles.
Answer the following questions:
Warning: You will likely see some "odd" behavior. For instance, as the chunk size increases you may see the broadcast time decrease, increase, and then decrease again. Or, if you try very small chunk sizes, smaller than the ones above, you could see that for some particular chunk sizes the broadcast time is pretty low, if perhaps not the lowest across the board. These effects occur because our simulated MPI captures many effects of network protocols and MPI implementations. These effects take us away from simple analytical models of communication performance, in which there is no weirdness, but which simply don't match the reality of communication behavior in real-world parallel platforms. As a result, experimental results are often confusing and it is difficult to draw valid conclusions without looking at a lot of results.
In our current implementation of the broadcast, we use MPI_Send and MPI_Recv. One drawback of this approach is that a process can only either send or receive data at any given time, not both. However, it would make sense for a process to send chunk #i to its successor while receiving chunk #i+1 from its predecessor. Intuitively, this should make it possible to hide network overheads better and perhaps to achieve higher data transfer rates.
Augment your code in bcast_solution.c with a new broadcast implementation called asynchronous_pipelined_ring_bcast. This implementation builds on the pipelined_ring_bcast implementation but uses the following MPI asynchronous communication primitives: MPI_Isend and MPI_Wait.
Although in principle simple, adding asynchronous communication is often not done correctly by beginner MPI programmers. Here are common mistakes to avoid when implementing asynchronous_pipelined_ring_bcast (see the sketch below):
- Each MPI_Isend must have a matching MPI_Wait call.
- The MPI_Isend for chunk #i must be issued only after the call to MPI_Recv for chunk #i (since one must wait to have received a chunk to forward it to one's successor).
- An MPI_Isend immediately followed by its matching MPI_Wait call is equivalent to a single MPI_Send call, i.e., it does not implement asynchronous communication!

Run asynchronous_pipelined_ring_bcast using chunk sizes of 100,000 bytes, 500,000 bytes, 1,000,000 bytes, 5,000,000 bytes, 10,000,000 bytes, 50,000,000 bytes, and 100,000,000 bytes (i.e., 1 chunk) on a 50-processor ring. Report the simulated execution times. Remember that the generate_xml_ring_and_hostfile.py Python script can be used to generate platform files and hostfiles.
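By way of illustration, here is one possible structure that follows the rules above; it is only a sketch (placeholder variable names), and other correct organizations are possible:

    #include <mpi.h>

    /* Sketch of asynchronous_pipelined_ring_bcast: receives remain blocking, but
       the forward of chunk #i is posted with MPI_Isend so that it can be in
       flight while chunk #i+1 is being received. Each MPI_Isend is matched by an
       MPI_Wait, but only on the next iteration (or after the loop), so that the
       Isend/Wait pair does not degenerate into a plain MPI_Send. Variable names
       are placeholders. */
    void asynchronous_pipelined_ring_bcast(char *buffer, int num_bytes,
                                           int chunk_size, int rank, int num_procs)
    {
        MPI_Request req = MPI_REQUEST_NULL;

        for (int offset = 0; offset < num_bytes; offset += chunk_size) {
            int this_chunk = num_bytes - offset;
            if (this_chunk > chunk_size)
                this_chunk = chunk_size;

            if (rank > 0)  /* receive chunk from the predecessor (blocking) */
                MPI_Recv(buffer + offset, this_chunk, MPI_CHAR, rank - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            if (rank < num_procs - 1) {
                /* complete the previous chunk's send before posting a new one
                   (MPI_Wait on MPI_REQUEST_NULL is a no-op) */
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                MPI_Isend(buffer + offset, this_chunk, MPI_CHAR, rank + 1, 0,
                          MPI_COMM_WORLD, &req);
            }
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete the last outstanding send */
    }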
Answer the following questions:
Does the use of MPI_Isend help? By how much?
Augment your code in bcast_solution.c with an implementation called asynchronous_pipelined_bintree_bcast that implements the broadcast along a binary tree, splitting the message into chunks for pipelining and using asynchronous communication. The root of the broadcast, the process of rank #0, is also the root of the binary tree. Note that the binary tree may not be complete if the number of processors is not of the form 2^n − 1.
Use a level order traversal numbering of the nodes in your tree: rank 0 has rank 1 as its left child and rank 2 as its right child, rank 1 has rank 3 as its left child and rank 4 as its right child, etc. The reason why we impose this order is that this is also the order used to number processes in the platform description file used for experimental evaluations (see Step #2). The idea is that the logical binary tree of processes is identical to the physical binary tree of processors (process #i runs on processor #i).
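With this level order numbering, each rank can compute its tree neighbors directly. For example (helper names are placeholders; the chunked, asynchronous forwarding itself follows the same idea as in the ring version, except that each chunk is forwarded to up to two children):

    /* With level order numbering, the parent of rank r > 0 is (r - 1) / 2 and its
       children are 2r + 1 and 2r + 2, when those ranks exist. Helper names are
       placeholders. */
    int bintree_parent(int rank)
    {
        return (rank == 0) ? -1 : (rank - 1) / 2;
    }

    int bintree_left_child(int rank, int num_procs)
    {
        int child = 2 * rank + 1;
        return (child < num_procs) ? child : -1;
    }

    int bintree_right_child(int rank, int num_procs)
    {
        int child = 2 * rank + 2;
        return (child < num_procs) ? child : -1;
    }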
Report the (simulated) wall-clock time of default_bcast, asynchronous_pipelined_ring_bcast, and asynchronous_pipelined_bintree_bcast, on a 50-processor ring platform and a 50-processor binary tree platform. For asynchronous_pipelined_ring_bcast and asynchronous_pipelined_bintree_bcast use the "best chunksize" you determined in Activity #3 for the 50-processor ring platform.
You are provided with a Python script (generate_xml_bintree_and_hostfile.py) that generates the XML file (and hostfile) for the binary tree platform. As with the ring script, the number of processors is passed as a command-line argument. Just in case, here are the files generated using this script for 50 processors: bintree_50.xml and hostfile_50.txt.
Answer the following questions:
Report the (simulated) wall-clock time of default_bcast, asynchronous_pipelined_ring_bcast, and asynchronous_pipelined_bintree_bcast, on the following two platforms:
Answer the following question:
Congratulations on completing this module. At this point you have learned how to use point-to-point MPI communication, both synchronous and asynchronous; you have learned about the most common collective communication primitive, the broadcast; you have observed first hand the connection between communication patterns and the underlying network topology; and you have learned how pipelining and asynchronous communication can boost communication performance.
As part of our experimental evaluations, you have observed the overall good performance of the default MPI_Bcast implementation. It would be very enlightening to look at open-source MPI implementations and inspect the code (although it is intricate) to understand how the broadcast is implemented. In general, few developers implement a broadcast by hand. However, in some cases (see upcoming modules), it can be useful to use a by-hand implementation of a broadcast to achieve better performance (in general, better overlap of communication and computation).
At the end of your report, write a brief "essay" explaining what you found surprising/challenging, and what solutions you used to solve your difficulties. Are you happy with your implementations? Do you see room for improvement?