Topic #4 Module: LU Factorization

Overview

Objective: To analyze (mostly experimentally) the performance of an implementation of the LU factorization algorithm and try to improve it.

Prerequisite

This module uses the program developed in the LU Factorization module in Topic #3. You must have completed that module to do this one.

Simulation Calibration

To make the simulation results consistent regardless of your computer, we first calibrate the simulation. More specifically, we simulate a parallel platform in which each processor computes at a rate of 400 Gflops. Since SMPI measures compute times on your own machine, we need to tell SMPI how fast your machine is, which is done by passing a particular floating point value to smpirun via the --cfg=smpi/running-power:VALUE command-line option.

A Python script called calibrate_flops.py is provided, which you should run once on your (otherwise idle) computer. This script performs a binary search to find the necessary calibration factor, which is printed to the screen and should be passed via the --cfg=smpi/running-power command-line option. Of course, if you switch to a different computer, you'll have to re-run the script to find a new calibration factor.

Roadmap

This module consists of 3 activities, each described in its own tab above, which should be done in sequence:

  • Activity #1: Quantify the load imbalance of the implementation at large scale.
  • Activity #2: Improve the load balance.
  • Activity #3: Determine the impact of network communication on scalability.

What to turn in

You should turn in a single archive (or a github repo) with:

  • All source code
  • XML platform files and hostfiles (see details in the activities)
  • A Makefile that compiles all executables (and has a 'clean' target!)

Activity #1: Quantify the load imbalance

In this activity we investigate whether the original implementation achieves good load balance, i.e., whether the processes have close to the same amount of work to do. We wish to do this investigation for large matrices, so as to better observe load balancing behaviors. In steps #1 and #2 below we play "simulation tricks" to avoid running overly memory- and cpu-intensive simulations. In step #3 we re-calibrate the simulation, and in step #4 we run experiments to quantify the load imbalance.

Step #1: The cpu trick

Experiments with large matrices are necessary to study scalability, but unfortunately these experiments are cpu-intensive, whether on a real platform or in simulation. One advantage of simulation is that one can play tricks to reduce cpu costs, essentially by not performing actual computation. This is something researchers often do in simulation, and in this module you're doing it too.

When SMPI encounters a basic block of computation in your program, it executes the block and times it so as to generate the correct delay in simulation. If this code takes 1 hour to run on your machine, then the simulation will take at least 1 hour. If you are simulating 100 processes that each compute for 1 hour, then the simulation will take 100 hours! This is because SMPI runs on a single core, executing the MPI processes in round-robin fashion. This sounds bad, but SMPI provides several ways to mitigate the problem. For instance, one can tell the simulator: "Execute/time this code only the first time you encounter it." A more radical option is to tell the simulator: "When you encounter this basic block, don't run it at all. Instead, I am telling you how many flops it involves, and you can simply compute the delay by dividing this number of flops by the compute speed." In both cases, the code no longer computes correct results. But the idea is that for deterministic basic blocks, such as the column computations and updates in the LU factorization implementation, the general performance behavior is preserved, and a simple re-calibration will lead to sufficiently accurate simulation results (this has been shown in SMPI research papers, and by other researchers before SMPI even existed).

In this step we use the second option above, i.e., avoiding computation altogether. Consider the sample basic block below:

for (i=0; i < N; i++) {
  sum = sum * 3.0 * x * y;
}
This loop performs O(N) floating point operations, and in your SMPI code you can replace it by:
  double flops = (double)N * (double)FLOP_CALIBRATION_FACTOR;
  SMPI_SAMPLE_FLOPS(flops) {
    // for (i=0; i < N; i++) {
    //   sum = sum * 3.0 * x * y;
    // }
  }
  
(Note that the original code is simply commented out within the SMPI_SAMPLE_FLOPS basic block, which is likely a better idea than deleting it outright.) The FLOP_CALIBRATION_FACTOR constant is the constant factor hidden in the O(N) asymptotic complexity. In step #3 we use calibration to determine an appropriate value for it.

Modify your program to do the above for the basic block that computes the k-th column and for the basic block that updates the lower-right part of the matrix (i.e., the two computational for loops in the sequential version of the code, which you parallelized when completing the LU Factorization module in Topic #3). At this point, since your code doesn't compute anything useful, you can also comment out the part of the code that initializes the matrices and the part that checks the validity of the results (in fact you really want to comment out the latter to avoid "invalid results" messages). For now, use an arbitrary value for FLOP_CALIBRATION_FACTOR so that you can compile and debug your program, and pay no attention to the simulated performance. A sketch of what this could look like is shown below.
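
For concreteness, here is a minimal, hypothetical sketch of what the two wrapped basic blocks could look like; the variable names (A, k, N, my_num_columns, ...) and the exact flop-count expressions are illustrative and must be adapted to your own lu.c and data distribution:

  /* Basic block #1: compute the k-th column (divisions by the pivot). */
  double flops_col = (double)(N - k - 1) * (double)FLOP_CALIBRATION_FACTOR;
  SMPI_SAMPLE_FLOPS(flops_col) {
    // for (i = k+1; i < N; i++)
    //   A[i][k] = A[i][k] / A[k][k];
  }

  /* Basic block #2: update the lower-right part of the matrix, restricted
     to the columns owned by this process. Here my_num_columns is assumed
     to be the number of owned columns to the right of column k. */
  double flops_update = (double)(N - k - 1) * (double)my_num_columns
                        * (double)FLOP_CALIBRATION_FACTOR;
  SMPI_SAMPLE_FLOPS(flops_update) {
    // for (j = ...; j < N; j++)       /* each owned column j > k */
    //   for (i = k+1; i < N; i++)
    //     A[i][j] = A[i][j] - A[i][k] * A[k][j];
  }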

Step #2: The memory trick

Besides cpu time, another limiting factor for scalable simulation is memory space. If I want to use SMPI to simulate 100 processes that each use 1GiB of heap space, then I need a single machine with 100GiB of RAM! Again, this sounds bad, which is why SMPI allows simulated processes to share the same heap allocations! This leads to nonsensical computation, since processes overwrite each other's updates in the same allocated memory zones. But for a deterministic application such as our LU factorization, the overall performance behavior is sufficiently preserved to lead to sufficiently accurate simulation results (again, see the SMPI research papers and those referenced therein).

The above is straightforward with SMPI:

  • Replace malloc by SMPI_SHARED_MALLOC;
  • Replace free by SMPI_SHARED_FREE.
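
For instance, here is a minimal sketch, assuming the matrix (or rather, each process's share of it) is stored in one contiguous allocation; the size expression and names are illustrative:

  /* Hypothetical sketch: all simulated processes share this allocation,
     so its contents are meaningless, but the timing behavior is preserved. */
  double *A = (double *)SMPI_SHARED_MALLOC((size_t)N * my_num_columns * sizeof(double));
  /* ... LU factorization ... */
  SMPI_SHARED_FREE(A);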

Do the above in your program. You now have a program that computes nothing useful at all, but that still outputs a simulated wallclock time which research shows is sufficiently accurate to draw meaningful conclusions for our purpose! The point is that this program can be executed faster and for larger matrices. We now face an interesting simulation performance trade-off. With many processes, the memory trick lets us scale the problem size up: for instance, say the memory space needed for the (sequential) LU factorization is 100GiB. You can't run the simulation of this computation on your laptop. However, if you simulate a parallel execution with 100 processes, the memory trick makes it so that you only need 1GiB of RAM to run the simulation. On the other hand, with many processes, the time to simulate the broadcasts of large data sizes becomes large (due to the sheer number of individual point-to-point messages). So some simulations will take too much space, and some simulations will take too much time. Nevertheless, our cpu and memory simulation "tricks" significantly widen the range of simulation experiments that can be done in practice.

Step #3: Re-calibrating the simulation

One problem with our cpu trick is that the simulated wall-clock time is no longer coherent with our simulated platform, i.e., it no longer matches our --cfg=smpi/running-power command-line option, since we now use the SMPI_SAMPLE_FLOPS macro. Instead, we need to determine a reasonable value for FLOP_CALIBRATION_FACTOR.

Run your original program (i.e., without the "simulation tricks") from the LU Factorization module in Topic #3, using cluster_1600.xml and hostfile_1600.txt, for a matrix of size 2000x2000 with p=4 processes. Then empirically determine a value of FLOP_CALIBRATION_FACTOR that leads to the same (or a close) simulated elapsed time for your program with the "simulation tricks" enabled. Doing a binary search on FLOP_CALIBRATION_FACTOR via a script is great of course, but you can probably just do a quick trial-and-error manual search. Once you've found this value, hard-code it into your program, and voilà.
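
For instance, once found, the value can simply be hard-coded as a compile-time constant (the numeric value below is purely hypothetical; yours will differ):

  /* Hypothetical calibration factor, found by matching the simulated
     elapsed time of the "tricked" program to that of the original one. */
  #define FLOP_CALIBRATION_FACTOR 2.0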

Step #4: Quantifying the load imbalance

Now that your simulation is fully calibrated, augment your implementation (in lu.c) so that at the end of the execution the program prints out the elapsed time as well as the percentage idle time. The idle time corresponds to the column-update phase (i.e., at a given iteration some processes will have fewer columns to update than others, and will thus remain idle until all processes are ready to move on). As usual, it's a good idea to use MPI_Wtime to measure time, and it's ok to add calls to MPI_Barrier to make it easier to measure the idle time at each process. The percentage idle time is computed as the sum of all the individual idle times (which you can obtain using MPI_Reduce), divided by the product of the number of processes and the elapsed time. A sketch of one possible way to do this is shown below.
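
Here is a minimal, hypothetical sketch of this measurement; variable names (rank, num_procs, ...) and the loop structure are illustrative and should be adapted to your lu.c:

  /* Hypothetical sketch: measure per-process idle time at each barrier. */
  double idle_time = 0.0;
  double start = MPI_Wtime();
  for (k = 0; k < N; k++) {
    /* ... compute the k-th column, broadcast it, update my columns ... */
    double t = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);    /* wait until everybody is done updating */
    idle_time += MPI_Wtime() - t;   /* time this process spent waiting */
  }
  double elapsed = MPI_Wtime() - start;
  double total_idle = 0.0;
  MPI_Reduce(&idle_time, &total_idle, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    printf("Elapsed time: %.3f s; idle time: %.2f%%\n",
           elapsed, 100.0 * total_idle / (num_procs * elapsed));
  }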

Run your code using cluster_1600.xml and hostfile_1600.txt, but edit cluster_1600.xml to make the network very fast (zero latency, huge bandwidth). This will allow us to take the network out of the equation for now. Run your code for a 32000x32000 matrix with p = 1, 4, 8, 16, and 32 processes. On the same graph plot the elapsed time and the percentage idle time (using two different left and right vertical axes).

In your README file describe what you observe. Is the idle time significant? Does the performance scale well as the number of processes increases? Can you explain why, given what the implementation is doing?

Activity #2: Improve the load balance

In this activity we modify our implementation to improve the load balancing, and re-run simulations to observe the performance benefit.

The reason for the poor load balance of the original implementation is that, after some point, a process runs out of work to do (because the application proceeds from left to right along the columns of the matrix). This is an artifact of our data distribution. A solution is to use a cyclic distribution. More precisely, modify your program, calling it lu_roundrobin.c, so that the columns of the matrix are assigned to the processes in Round-Robin fashion (rank 0 processes columns 0, p, 2p, 3p, ..., rank 1 processes columns 1, p+1, 2p+1, 3p+1, ...); a sketch of the index arithmetic is given below. Of course, you must still check that the computed results are correct before moving on to the experiments in the next step (i.e., disable the "simulation tricks" while testing correctness).
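
In case it helps, here is a minimal sketch of the index arithmetic for such a cyclic distribution (the helper names are purely illustrative):

  /* Hypothetical helpers: column j belongs to rank j % p and is that
     rank's (j / p)-th local column; conversely, the l-th local column
     of a rank is global column l * p + rank. */
  int owner_of_column(int j, int p)        { return j % p; }
  int local_index(int j, int p)            { return j / p; }
  int global_index(int l, int rank, int p) { return l * p + rank; }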

With the "simulation tricks" from Activity #1 enabled, we can now re-run the same experiments as that in Step #4 of Activity #1. That is, run your code using cluster_1600.xml and hostfile_1600.txt, but edit cluster_1600.xml to make the network very fast (zero latency, huge bandwidth). Run your code for a 32000x32000 matrix with p = 1, 4, 8, 16, and 32 processes. On the same graph plot the elapsed time and the percentage idle time (using two different left and right vertical axes). It would be likely useful to plot the results from Activity #1, Step #4 on this same graph as well for easy comparison.

In your README file describe what you observe. Does the cyclic distribution reduce idle time significantly?

Activity #3: Determine the impact of network communication

In this activity we observe the effect of communication on overall application performance.

Using your last implementation, i.e., the one that achieves good load balancing, run your program for a 32000x32000 matrix with p = 1, 4, 8, 16, 32, 64, and 128 processes, using the original cluster_1600.xml platform (i.e., with its non-ideal network). Plot the parallel efficiency vs. the number of processes (recall that the parallel efficiency is T(1) / (p * T(p)), where T(p) is the elapsed time with p processes). In your README file describe what you observe. What do you conclude in terms of the scalability of your application? Would you say there is work that remains to be done? Any ideas?

What have we learned?

Congratulations on making it this far. You have learned how to measure load imbalance in a message-passing application, how to implement a cyclic data distribution, how to look at performance results and draw conclusions about the scalability of an application, and how to use simulation to run large-scale experiments without the need for a large-scale machine. The performance of the application can be further improved, for instance by overlapping communication and computation. In fact, many techniques have been developed to improve the performance of this particular application.

What was difficult?

In your README file write a brief "essay" explaining what you found surprising/challenging, and what solutions you used to solve your difficulties. Are you happy with your implementations? Do you see room for improvement?

The End