Topic #4 Module: Matrix Multiplication

Overview

Objective: To analyze (mostly experimentally) the performance and scalability of an implementation of the outer-product algorithm for multiplying matrices.

Prerequisite

This module uses the program developed in the Matrix Multiplication module in Topic #3. You must have completed that module to do this one.

Simulation Calibration

To make the simulation results consistent regardless of your computer, we first calibrate the simulation. More specifically, we simulate a parallel platform in which each processor computes at a rate of 200 Gflops. Since SMPI measures compute times on your own machine, we need to tell SMPI how fast your machine is, which is done by passing a particular floating point value to smpirun using the --cfg=smpi/host-speed:VALUE command-line option.

A Python script called calibrate_flops.py is provided, which you should run once on your idle computer (or in your Docker container). This script performs a binary search to find the necessary calibration factor, which is printed to the screen and should be passed to the --cfg=smpi/host-speed command-line option. Of course, if you change computers, you'll have to re-run the script to find a new calibration factor.

Roadmap

This module consists of 2 activities, each described in its own tab above, which should be done in sequence:

  • Activity #1: Observe the impact of network performance on application performance.
  • Activity #2: Quantify algorithm scalability.

What to turn in

You should turn in a single archive (or a GitHub repo) with:

  • All source code
  • XML platform files and hostfiles (see details in the activities)
  • A Makefile that compiles all executables (and has a 'clean' target!)

Activity #1: Observe the impact of network performance on application performance

In this activity we measure and understand the impact of network communication on application performance. This activity consists of the two steps below.

In this step we run experiments on a simulated 1600-host cluster with the cluster_1600.xml platform file and accompanying hostfile_1600.txt hostfile.

First, augment your matmul program so that it prints out the wallclock time to perform the matrix multiplication, in seconds (using MPI_Wtime is a good idea).
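
For instance, the timing code could look something like the sketch below (rank is whatever variable holds the MPI rank in your program):

  double start, end;
  MPI_Barrier(MPI_COMM_WORLD);   /* so that all processes start timing together */
  start = MPI_Wtime();
  /* ... the outer-product matrix multiplication ... */
  end = MPI_Wtime();
  if (rank == 0) {               /* rank obtained earlier via MPI_Comm_rank */
    printf("wallclock time: %.6f seconds\n", end - start);
  }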

Run your program for N = 1600 on the above platform using p = 1, 4, 16, 64, 100, and 400 processes. For each number of processes, record the simulated wallclock time, averaged over 5 trials.

Now, modify the platform file to set the network latency to 0 and the bandwidth to a very large value, so as to simulate an ideal platform in which the network is close to infinitely fast. Re-run the experiments above.

Plot speedup curves (speedup vs. number of processes) for the two series of experiments above. What parallel efficiency (i.e., speedup divided by the number of processes) can you achieve with p = 400 processes on the ideal vs. the non-ideal platform? What do you conclude about the application’s prospects for this matrix size on a real machine?

Note: These simulations take time and memory. You can decrease the scale as needed if you can't run them on your machine. Just for reference, these simulations run well under 10 minutes on the author's laptop.

Instrument your code so that at the end of execution, in addition to the wallclock time, it also reports the time spent in communication (which you measure via additional calls to MPI_Wtime). Using p=16 processes, run experiments with the original platform file (i.e., with the original network) for N = 8X for X ∈ {100,110,120,...,400}, with 5 trials per experiment. This runs in about 10 minutes on the author's laptop.
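
One simple way to do this is to accumulate the elapsed time around each communication call, along the lines of the sketch below (buffer, count, root, and row_comm are placeholders for whatever your broadcasts actually use):

  double comm_time = 0.0;               /* accumulated over the whole execution */

  /* ... around each communication call ... */
  double t0 = MPI_Wtime();
  MPI_Bcast(buffer, count, MPI_DOUBLE, root, row_comm);
  comm_time += MPI_Wtime() - t0;

  /* ... at the end of execution ... */
  if (rank == 0)
    printf("communication time: %.6f seconds\n", comm_time);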

On the same plot, show the computation time and the communication time vs. the matrix size. What do the trends look like? Do they make sense w.r.t. some asymptotic models of computation and communication volumes? Is this good news or bad news for multiplying large matrices on our platform?

Activity #2: Quantify algorithm scalability

In this activity we experiment with large matrices, to see whether performance can scale reasonably well with large numbers of processors. In steps #1 and #2 below we play "simulation tricks" to avoid running simulations that are too memory- and CPU-intensive. In step #3 we re-calibrate the simulation, and in step #4 we re-run experiments to quantify application scalability.

Experiments with large matrices are necessary to study scalability, but unfortunately these experiments are CPU-intensive, whether on a real platform or in simulation. One advantage of simulation is that one can play tricks to reduce CPU costs, essentially by not performing actual computation. This is something researchers often do in simulation, and in this module you're doing it too.

When SMPI encounters a basic block of computation in your program, it executes the block and times it to generate the correct delay in simulation. If this code takes 1 hour to run on your machine, then the simulation will take at least 1 hour. If you are simulating 100 processes each computing for 1 hour, then the simulation will take 100 hours! This is because SMPI runs on a single core, running MPI processes in round-robin fashion. This sounds bad, but SMPI provides several ways to mitigate the problem. For instance, one can tell the simulator: "Execute/time the code only the first time you encounter it." A more radical option is to tell the simulator: "When you encounter this basic block, don't run it at all. Instead, I am telling you how many flops it involves; simply compute the delay by dividing this number of flops by the compute speed." In both cases, the code no longer computes correct results. But the idea is that for deterministic basic blocks, such as the computational part of matrix multiplication, the general performance behavior is preserved, and a simple re-calibration will lead to sufficiently accurate simulation results (this has been shown in SMPI research papers, and by other researchers before SMPI even existed).

In this step we use the second option above, i.e., avoiding computation altogether. Consider the sample basic block below:

for (i=0; i < N; i++) {
  sum = sum * 3.0 * x * y;
}
This loop performs O(N) floating point operations, and in your SMPI code you can replace it by:
  double flops = (double)N * (double)FLOP_CALIBRATION_FACTOR;
  SMPI_SAMPLE_FLOPS(flops) {
    // for (i=0; i < N; i++) {
    //   sum = sum * 3.0 * x * y;
    // }
  }
  
(Note that the original code is simply commented out within the SMPI_SAMPLE_FLOPS macro / basic block rather than deleted, which is likely a good idea.) The FLOP_CALIBRATION_FACTOR constant is the constant factor hidden in the O(N) asymptotic complexity. In step #3 we use calibration to determine an appropriate value for it.

Modify your program to do the above, for the basic block that multiplies matrix blocks. At this point, since your code doesn't compute anything useful, you can also comment out the part of the code that initializes the matrices and checks the validity of the results (in fact, you really want to comment out the latter to avoid "invalid results" messages). For now, use an arbitrary value for FLOP_CALIBRATION_FACTOR so that you can compile and debug your program, and pay no attention to the simulated performance.
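
For instance, if the basic block in question is the classic triple loop that multiplies two block_size x block_size blocks (an O(block_size^3) computation), the replacement could look like the sketch below (block_size, A, B, and C are placeholders for the names used in your own code):

  double flops = (double)block_size * (double)block_size * (double)block_size
                 * (double)FLOP_CALIBRATION_FACTOR;
  SMPI_SAMPLE_FLOPS(flops) {
    // for (i = 0; i < block_size; i++)
    //   for (j = 0; j < block_size; j++)
    //     for (k = 0; k < block_size; k++)
    //       C[i * block_size + j] += A[i * block_size + k] * B[k * block_size + j];
  }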

Besides CPU time, another limiting factor for scalable simulation is memory space. If I want to use SMPI to simulate 100 processes, each of which uses 1 GiB of heap space, then I need a single machine with 100 GiB of RAM! Again, this sounds bad, which is why SMPI allows simulated processes to share the same heap allocations! This leads to nonsensical computation since processes overwrite each other's updates in the same allocated memory zones. But, for deterministic applications such as matrix multiplication, the overall performance behavior is preserved well enough to lead to sufficiently accurate simulation results (again, see SMPI research papers and those referenced therein).

The above is straightforward with SMPI:

  • Replace malloc by SMPI_SHARED_MALLOC;
  • Replace free by SMPI_SHARED_FREE (see the sketch below).
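
For instance, allocating and freeing a matrix block might become (a sketch; A_block and block_size are placeholder names):

  // double *A_block = malloc(block_size * block_size * sizeof(double));
  double *A_block = SMPI_SHARED_MALLOC(block_size * block_size * sizeof(double));

  /* ... use A_block as before ... */

  // free(A_block);
  SMPI_SHARED_FREE(A_block);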

Do the above in your program. You now have a program that computes nothing useful at all, but still outputs a simulated wallclock time that research shows is sufficiently accurate to draw meaningful conclusions for our purpose! The point is that this program can be executed faster and for larger matrices.

We now have an interesting simulation performance trade-off. With a lot of processes, we can scale up the problem size in terms of memory. For instance, say the memory space needed for the (sequential) matrix multiplication is 100 GiB. You can't run the simulation of this computation on your laptop. However, if you simulate a parallel execution with 100 processes, the above memory trick makes it so that you only need 1 GiB of RAM to run the simulation. However, with a lot of processes, the time to simulate the broadcasts with large data sizes is large (due to the sheer number of individual point-to-point messages). So, some simulations will take too much space, and some simulations will take too much time. Nevertheless, our CPU and memory simulation "tricks" significantly widen the range of simulation experiments that can be done in practice.

One problem with our CPU trick is that the simulated wallclock time will no longer be coherent with our simulated platform, i.e., it no longer matches our --cfg=smpi/host-speed command-line option since we use the SMPI_SAMPLE_FLOPS macro. Instead, we need to determine a reasonable value for FLOP_CALIBRATION_FACTOR.

Run your original program (i.e., without the "simulation tricks") from the matrix multiplication module in Topic #3, using cluster_1600.xml and hostfile_1600.txt, for a matrix of size 2000x2000, with p=4 processes. Then empirically determine a value of FLOP_CALIBRATION_FACTOR that leads to the same (or close) simulated elapsed time for your program with the "simulation tricks". Doing a binary search on FLOP_CALIBRATION_FACTOR via a script is great, of course, but you can probably just do a quick trial-and-error manual search. Once you've found this value, hard-code it into your program, and voilà.

Now that your simulation is fully calibrated, run simulations, using cluster_1600.xml and hostfile_1600.txt, with p=100 processes, using matrices of increasing sizes, up to as large as you can run on your machine (in terms of memory and time), with 5 trials per matrix size. It may be a good idea to do a quick back-of-the-envelope memory footprint calculation to make sure you don’t exceed your RAM.
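
For instance, here is one possible back-of-the-envelope estimate, assuming (as in the outer-product algorithm) that each process stores one (N/√p) x (N/√p) block of A, B, and C in double precision; the matrix size below is just an example:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
    double N = 40000;                          /* example matrix dimension */
    double p = 100;                            /* number of processes */
    double block = N / sqrt(p);                /* block dimension per process */
    double bytes = 3.0 * block * block * 8.0;  /* A, B, and C blocks, 8 bytes per double */
    /* Thanks to SMPI_SHARED_MALLOC, this per-process footprint is also roughly
       the physical memory that the simulation itself needs. */
    printf("~%.2f GiB per process\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
  }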

Plot an average efficiency vs. matrix size curve. What do you conclude in terms of the scalability of the application? Are we in good shape if matrices are large?

What have we learned?

Congratulations on making it this far. You have learned how to measure the computation and communication performance shares of a message-passing application, how to look at performance results and draw conclusions about the scalability of the application, and how to use simulation to run large-scale experiments without the need for a large-scale machine.

What was difficult?

In your README file write a brief "essay" explaining what you found surprising/challenging, and what solutions you used to solve your difficulties. Are you happy with your implementations? Do you see room for improvement?

The End