Topic #3 Module: LU Factorization

Overview

Objective: To implement a Gaussian elimination algorithm (without pivoting) to perform an LU factorization of a square matrix in a distributed-memory manner using message passing.

1-D Data Distribution

For this algorithm we use a simple 1-D block data distribution. We consider an N × N matrix, A, whose elements are single-precision floating-point numbers (float). The execution takes place on p processes, where p divides N. Each process thus holds an N × N/p block column of the matrix. In other words, the process with rank 0 holds columns 0 to N/p − 1, the process with rank 1 holds columns N/p to 2N/p − 1, etc.
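
For example, a process can derive the global index of its first column directly from its rank. The sketch below is only illustrative (the function and variable names, such as allocate_block_column and cols_per_proc, are not prescribed by the assignment); it allocates one block column stored contiguously in row-major order:

    #include <stdlib.h>

    /* Illustrative sketch: allocate this process's N x (N/p) block column
     * of A, stored contiguously in row-major order. */
    float *allocate_block_column(int N, int p, int rank, int *first_col)
    {
        int cols_per_proc = N / p;           /* p divides N by assumption */
        *first_col = rank * cols_per_proc;   /* first global column owned */
        return malloc((size_t)N * cols_per_proc * sizeof(float));
    }

Each entry of the block can then be initialized from the matrix definition given below, using the global column index (*first_col plus the local column index).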

Roadmap

This module consists of 1 activity:

  • Activity #1: Have each MPI process allocate and initialize its own block column of matrix A, and have the processes perform the LU factorization.

What to turn in

You should turn in a single archive (or a GitHub repo) with:

  • All source code
  • XML platform files and hostfiles (see details in the activities)
  • A Makefile that compiles all executables (and has a 'clean' target!)

Implement a program called lu (lu.c) that implements the Gaussian elimination (without pivoting) algorithm. You may have covered this algorithm in a course, and it is of course well-documented in various texts. Regardless, a sequential implementation, lu_sequential.c, is provided, so you can use it as a starting point in a "let's parallelize an existing sequential implementation" scenario. The overall algorithm at iteration k is:

  • If I hold column k (global) of the matrix, update the column and broadcast it to all other processes.
  • For all the columns I hold that come after column k, update their lower portion.
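
A minimal sketch of one such iteration is given below, assuming the block-column layout described above; the names (A for the local N × N/p block in row-major order, col_buf for a length-N scratch buffer, cols_per_proc for N/p) are illustrative, not prescribed:

    /* Sketch of the main loop of the factorization (no pivoting). */
    for (int k = 0; k < N - 1; k++) {
        int owner = k / cols_per_proc;          /* rank holding global column k */

        if (rank == owner) {
            int lk = k - rank * cols_per_proc;  /* local index of column k */
            /* Update column k: divide its sub-diagonal entries by the pivot. */
            for (int i = k + 1; i < N; i++)
                A[i * cols_per_proc + lk] /= A[k * cols_per_proc + lk];
            /* Copy the column into a contiguous buffer for the broadcast. */
            for (int i = 0; i < N; i++)
                col_buf[i] = A[i * cols_per_proc + lk];
        }

        /* Every process receives column k (the owner already has it). */
        MPI_Bcast(col_buf, N, MPI_FLOAT, owner, MPI_COMM_WORLD);

        /* Update the lower portion of every local column whose global
         * index is greater than k. */
        for (int lj = 0; lj < cols_per_proc; lj++) {
            int j = rank * cols_per_proc + lj;  /* global column index */
            if (j <= k)
                continue;
            for (int i = k + 1; i < N; i++)
                A[i * cols_per_proc + lj] -= col_buf[i] * A[k * cols_per_proc + lj];
        }
    }

Note that, for simplicity, this sketch broadcasts the whole column even though only rows k+1 to N−1 are needed by the update.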

To make it easy to test your implementation, we compute the LU factorization of a particular matrix defined as follows:

(note that the indices start at 0).

The matrix, once the Gaussian elimination is complete, will have all 1's in its lower triangular portion, and i² values on its diagonal and in its upper triangular portion. Run lu_sequential.c with a small value of N to see this structure. Your code should check that these are the values computed by your implementation (each process can check its own block column, and the error counts can be combined with MPI_Reduce at the process of rank 0, which then prints an overall "success" or "failure" message).
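
A possible sketch of this check is shown below; expected_value(i, j) is a hypothetical helper that returns the value predicted above for entry (i, j), and the fragment assumes <math.h>, <stdio.h>, and the same illustrative names as before:

    /* Count the errors in my own block column... */
    int local_errors = 0;
    for (int lj = 0; lj < cols_per_proc; lj++) {
        int j = rank * cols_per_proc + lj;      /* global column index */
        for (int i = 0; i < N; i++)
            if (fabsf(A[i * cols_per_proc + lj] - expected_value(i, j)) > 1e-3f)
                local_errors++;
    }

    /* ... and sum all the error counts at the process of rank 0. */
    int total_errors = 0;
    MPI_Reduce(&local_errors, &total_errors, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%s\n", total_errors == 0 ? "SUCCESS" : "FAILURE");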

You can use any XML platform file and accompanying hostfile to run your MPI program, since this module is solely about correctness and not performance. For instance, you can use cluster_crossbar_64.xml as a platform file and hostfile_64.txt as a hostfile.

For debugging purposes, remember that it's a good idea to insert calls to MPI_Barrier so that debugging output from the program can be structured step-by-step.
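
For example, a small (hypothetical) helper can flush each process's output for the current step and then block on a barrier, so no process starts printing for the next step before every process is done with this one:

    #include <stdio.h>
    #include <mpi.h>

    /* Hypothetical debugging helper: print a message for the current
     * step, flush it, then wait for all other processes. */
    void debug_step(int rank, int k, const char *msg)
    {
        printf("[rank %d] step %d: %s\n", rank, k, msg);
        fflush(stdout);
        MPI_Barrier(MPI_COMM_WORLD);
    }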

Hints:

  • Use the standard MPI_Bcast() function to perform the broadcasts.
  • It's a very good idea to implement local2global() and global2local() functions that translate between local and global indices (a sketch follows this list).
  • Recall that sending a column is not very convenient because matrices are traditionally stored in row-major fashion in C, thus requiring either an additional buffer or an MPI derived datatype. Alternatively, you can store your matrix in column-major fashion and do the necessary mental gymnastics to parallelize the row-major sequential program...
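
As a sketch of the last two hints (function names and signatures are suggestions, not part of the assignment), the index translations for this block-column distribution reduce to simple arithmetic:

    /* Translate between local and global column indices, with
     * cols_per_proc = N/p columns per process. */
    int local2global(int local_col, int rank, int cols_per_proc)
    {
        return rank * cols_per_proc + local_col;
    }

    int global2local(int global_col, int cols_per_proc)
    {
        return global_col % cols_per_proc;
    }

    /* Rank of the process that holds a given global column. */
    int column_owner(int global_col, int cols_per_proc)
    {
        return global_col / cols_per_proc;
    }

If you prefer the derived-datatype route over copying into a contiguous buffer, one possibility (again, only a sketch) is a vector type describing a single column of a row-major N × N/p block:

    MPI_Datatype column_type;
    /* N blocks of 1 float each, consecutive blocks separated by a stride
     * of cols_per_proc floats (i.e., one row of the local block). */
    MPI_Type_vector(N, 1, cols_per_proc, MPI_FLOAT, &column_type);
    MPI_Type_commit(&column_type);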

What have we learned?

Congratulations on making it this far. You have learned to implement a moderately complex algorithm using a 1-D data distribution. This algorithm and its implementation are very typical of parallel numerical linear algebra implementations, but they are a bit simplistic. In fact, many optimization techniques are known to make the parallelization of LU factorization more efficient, in particular by hiding communication latencies. Hiding these latencies is one of the themes of Topic #4.

What was difficult?

In a README file, write a brief "essay" explaining what you found surprising/challenging, and what solutions you used to overcome your difficulties. Are you happy with your implementation? Do you see room for improvement?

The End