Topic #2 Module: Reduction Operation

A "reduce" operation is essentially the inverse of a broadcast plus some computation: an associative/commutative operator is applied to data items located at each process, and a final combined result is eventually available at a root process (e.g., process of rank 0).

An interesting module would consist in exploring efficient reduction algorithms, i.e., efficient orchestration of communications (the time to apply the operator can be ignored for simplicity).

Here is a possible setting:

Processors are interconnected via a crossbar switch
At a give time step a process (one per processor) can either send or receive, but not both
There are m data items at each process, and these can be split into chunks for pipelining
A seemingly simply, yet still interesting case, is to assume uniform chunks.
Interesting effects may be achieved by allowing chunks of different sizes.

Here are three approaches that could be considered:

Binomial tree without chunking:

Step 0: Each odd numbered process sends to its left neighbor
Step 1: Each even-numbered process sends to its 2nd left neighbor
...

Pipeline tree:

Chunk the data
Step 0: process n sends chunk 0 to process n − 1
Step 1: process n − 1 sends chunk 0 to process n − 2, and process n sends chunk 1 to process n − 1
...

Binary reduction tree:

Chunk the data
Step 0: processes n and n−1 send chunk 0 to process n−2 (two communications that happen in sequence), processes n−3 and n−4 send chunk 0 to process n−5, ...
Step 1: processes n and n−1 send chunk 1 to process n−2 (two communications that happen in sequence), processes n−3 and n-4 send chunk 0 to process n−5, ...
...

There are many other options to explore and compare. The comparison could be both analytical (wall-clock time as of the number of processors, the number of messages, the chunk size, the communication latency, and the communication bandwidth) and in simulation with an MPI implementation.

In his research, Brad Lowery (see these slides), proposes another strategy he calls a Greedy Tree, which is optimal for uniform segmentation (based on previous work). A great activity option for this module would be to implement the Greedy Tree and to reproduce Lowery's findings.