Objective: To analyze (mostly experimentally) the performance and scalability of an implementation of the outer-product algorithm for multiplying matrices.
This module uses the program developed in the Matrix Multiplication module in Topic #3. You must have completed that module to do this one.
To make the simulation results consistent regardless of your computer, we first calibrate the simulation. More specifically, we simulate a parallel platform in which each processor computes at a rate of 200 Gflops. Since SMPI measures compute times on your own machine, we need to tell SMPI how fast your machine is, which is done by passing a particular floating point value to smpirun using the --cfg=smpi/host-speed:VALUE command-line option.
A Python script called calibrate_flops.py is provided, which you should run once on your idle computer (or in your Docker container). This script performs a binary search to find the necessary calibration factor, which is printed to the screen and should be passed to the --cfg=smpi/host-speed command-line option. Of course, if you change computers, you'll have to re-run the script to find a new calibration factor.
This module consists of 2 activities, each described in its own tab above, which should be done in sequence:
You should turn in a single archive (or a github repo) with:
In this step we run experiments on a simulated 1600-host cluster with the cluster_1600.xml platform file and accompanying hostfile_1600.txt hostfile.
First, augment your matmul program so that it prints out the wallclock time to perform the matrix multiplication, in seconds (using MPI_Wtime is a good idea).
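A minimal sketch of this kind of instrumentation is shown below; do_multiplication() is just a placeholder for your actual outer-product code, not something prescribed by the assignment:

    /* timing_sketch.c -- sketch: time a (placeholder) computation with MPI_Wtime. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_multiplication(void) {
        /* stand-in for the real block-based matrix multiplication */
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; i++) x += 1.0;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);          /* start everyone together */
        double start = MPI_Wtime();
        do_multiplication();
        MPI_Barrier(MPI_COMM_WORLD);          /* wait until all processes are done */
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            printf("Wallclock time: %.6f seconds\n", elapsed);

        MPI_Finalize();
        return 0;
    }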
Run your program for N = 1600 on the above platform using p = 1, 4, 16, 64, 100, and 400 processes. For each number of processes, record the simulated wallclock time, averaged over 5 trials.
Now, modify the platform file to set the network latency to 0 and the bandwidth to a very large value, so as to simulate an ideal platform in which the network is close to infinitely fast. Re-run the experiments above.
Plot speed-up curves (speedup vs. number of processors) for the two series of experiments above. What parallel efficiency can you achieve with p = 400 processes on the ideal vs. the non-ideal platform? What do you conclude about the application’s prospects for this matrix size on a real machine?
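As a reminder (using T(p) to denote the wallclock time measured with p processes; the notation S, E, T is introduced here and not part of the assignment statement), speedup and parallel efficiency are:

    S(p) = T(1) / T(p), \qquad E(p) = S(p) / p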
Note: These simulations take time and memory. You can decrease the scale as needed if you can't run them on your machine. Just for reference, they run in well under 10 minutes on the author's laptop.
Instrument your code so that at the end of execution, in addition to the wallclock time, it also reports the time spent in communication (which you measure via additional calls to MPI_Wtime). Using p=16 processes, run experiments with the original (i.e., with the original network) platform file for N = 8X for X ∈ {100,110,120,...,400} (i.e., N ranging from 800 to 3200 in steps of 80), with 5 trials per experiment.
This runs in about 10 minutes on the author's laptop.
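One possible way to accumulate the communication time is sketched below; bcast_block() and the variable names are illustrative assumptions, and MPI_Bcast stands in for whatever communication your outer-product code actually performs:

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: wrap each communication call with MPI_Wtime() and accumulate
     * the difference, so only time spent inside MPI calls is counted. */
    static double comm_time = 0.0;

    static void bcast_block(double *block, int count, int root, MPI_Comm comm) {
        double t0 = MPI_Wtime();
        MPI_Bcast(block, count, MPI_DOUBLE, root, comm);
        comm_time += MPI_Wtime() - t0;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double block[4] = {1, 2, 3, 4};            /* stand-in for a matrix block */
        bcast_block(block, 4, 0, MPI_COMM_WORLD);  /* call wherever you broadcast */

        if (rank == 0)
            printf("Communication time: %.6f seconds\n", comm_time);
        MPI_Finalize();
        return 0;
    }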
On the same plot, show the computation time and the communication time vs. the matrix size. What do the trends look like? Do they make sense w.r.t. some asymptotic models of computation and communication volumes? Is this good news or bad news for multiplying large matrices on our platform?
Experiments with large matrices are necessary to study scalability, but unfortunately these experiments are CPU-intensive, whether on a real platform or in simulation. One advantage of simulation is that one can play tricks to reduce CPU expense, essentially by not performing actual computation. This is something researchers often do in simulation, and in this module you're doing it too.
When SMPI encounters a basic block of computation in your program, it executes the block and times it to generate the correct delay in simulation. If this code takes 1 hour to run on your machine, then the simulation will take at least 1 hour. If you are simulating 100 processes each computing for 1 hour, then the simulation will take 100 hours! This is because SMPI runs on a single core, running MPI processes in round-robin fashion. This sounds bad, but SMPI provides several ways to mitigate the problem. For instance, one can tell the simulator: "Execute/time the code only the first time you encounter it." A more radical option is to tell the simulator: "When you encounter this basic block, don't run it at all. Instead, I am telling you how many flops it involves; simply compute the delay by dividing this number of flops by the compute speed." In both cases, the code no longer computes correct results. But the idea is that for deterministic basic blocks, such as the computational part of matrix multiplication, the general performance behavior is preserved, and a simple re-calibration will lead to sufficiently accurate simulation results (this has been shown in SMPI research papers and by other researchers before SMPI even existed).
In this step we use the second option above, i.e., avoiding computation altogether. This is done by wrapping the computational basic block in the SMPI_SAMPLE_FLOPS macro, passing it the number of flops that the block would perform; keeping the original code inside the macro's basic block (even though it is no longer executed) is likely a good idea, rather than simply deleting it. The FLOP_CALIBRATION_FACTOR constant is the constant factor hidden in the O(N³) asymptotic complexity of the multiplication. In step #3 we use calibration to determine an appropriate value.
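As an illustration only (the exact flop expression depends on how your code decomposes the matrices; here b is assumed to be the dimension of the square block each process multiplies at each step), the wrapped kernel could look like:

    /* Sketch: inject the flop count instead of executing the block multiplication.
     * b*b*b reflects the cubic cost of a b x b block multiply, scaled by the
     * calibration constant. */
    SMPI_SAMPLE_FLOPS(FLOP_CALIBRATION_FACTOR * (double)b * b * b) {
        /* original triple loop, kept in place but no longer executed */
        for (int i = 0; i < b; i++)
            for (int j = 0; j < b; j++)
                for (int k = 0; k < b; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }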
Modify your program to do the above, for the basic block that multiplies matrix blocks. At this point, since your code doesn't compute anything useful, you can also comment out the part of the code that initializes the matrices and checks the validity of the results (in fact you really want to comment out the latter to avoid "invalid results" messages). For now, use an arbitrary value for FLOP_CALIBRATION_FACTOR so that you can compile and debug your program, and pay no attention to the simulated performance.
Besides CPU time, another limiting factor for scalable simulation is memory space. If I want to use SMPI to simulate 100 processes that each use 1GiB of heap space, then I need a single machine with 100GiB of RAM! Again, this sounds bad, which is why SMPI allows simulated processes to use the same heap allocations! This leads to nonsensical computation, since processes overwrite each other's updates in the same allocated memory zones. But, for deterministic applications such as matrix multiplication, overall performance behavior is sufficiently preserved to lead to sufficiently accurate simulation results (again, see SMPI research papers and those referenced therein).
The above is straightforward with SMPI: simply replace calls to malloc with SMPI_SHARED_MALLOC, and calls to free with SMPI_SHARED_FREE.
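For example (a sketch only; block_dim is an assumed name for your per-process block dimension):

    /* Sketch: allocate a matrix block from SMPI's shared heap instead of the
     * regular heap, so that all simulated processes reuse the same memory. */
    double *block = (double *) SMPI_SHARED_MALLOC(block_dim * block_dim * sizeof(double));

    /* ... use the block as before (its contents are now meaningless) ... */

    SMPI_SHARED_FREE(block);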
Do the above in your program. You now have a program that computes nothing useful at all, but still outputs a simulated wallclock time that research shows is sufficiently accurate to draw meaningful conclusions for our purposes! The point is that this program can be executed faster and for larger matrices. We now have an interesting simulation performance trade-off. With many processes, we can scale up the memory: for instance, say the memory space needed for the (sequential) matrix multiplication is 100GiB. You can't run the simulation of this computation on your laptop. However, if you simulate a parallel execution with 100 processes, the above memory trick means you only need 1GiB of RAM to run the simulation. On the other hand, with many processes, the time to simulate the broadcasts with large data sizes is large (due to the sheer number of individual point-to-point messages). So, some simulations will take too much space, and some simulations will take too much time. Nevertheless, our CPU and memory simulation "tricks" significantly widen the range of simulation experiments that can be done in practice.
One problem with our CPU trick is that the simulated wallclock time will no longer be coherent with our simulation platform, i.e., it no longer matches our --cfg=smpi/host-speed command-line option since we use the SMPI_SAMPLE_FLOPS macro. Instead, we need to determine a reasonable value for FLOP_CALIBRATION_FACTOR.
Run your original (i.e., without the "simulation tricks") program from the matrix multiplication module in Topic #3, using cluster_1600.xml and hostfile_1600.txt, for a matrix of size 2000x2000, with p=4 processes.
Then empirically determine a value of FLOP_CALIBRATION_FACTOR that leads to the same (or nearly the same) simulated elapsed time for your program with the "simulation tricks". Doing a binary search on FLOP_CALIBRATION_FACTOR via a script is great, of course, but you can probably just do a quick trial-and-error manual search. Once you've found this value, hard-code it into your program, and voilà.
Now that your simulation is fully calibrated, run simulations using cluster_1600.xml and hostfile_1600.txt, with p=100 processes, using matrices of increasing sizes, up to as large as you can run them on your machine (in terms of memory and time), with 5 trials per matrix size.
It may be a good idea to do a quick back-of-the-envelope memory footprint calculation to make sure you don’t exceed your RAM.
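For instance, assuming the program holds three N×N matrices of 8-byte doubles (and ignoring the shared-allocation trick above, which greatly reduces how much of this actually hits RAM), a rough estimate is:

    \text{memory} \approx 3 \times 8 \times N^2 \ \text{bytes}, \quad \text{e.g., } N = 20000 \Rightarrow \approx 9.6\ \text{GB}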
Plot an average efficiency vs. matrix size curve. What do you conclude in terms of the scalability of the application? Are we in good shape if matrices are large?
Congratulations for making it this far. You have learned how to measure the computation and communication performance shares of a message-passing application, how to look at performance results and draw conclusions about the scalability of the application, and how to use simulation to run large-scale experiments without the need for a large-scale machine.
In your README file write a brief "essay" explaining what you found surprising/challenging, and what solutions you used to solve your difficulties. Are you happy with your implementations? Do you see room for improvement?