A "reduce" operation is essentially the inverse of a broadcast plus some computation: an associative/commutative operator is applied to data items located at each process, and a final combined result is eventually available at a root process (e.g., process of rank 0).
An interesting module would consist in exploring efficient reduction algorithms, i.e., efficient orchestration of communications (the time to apply the operator can be ignored for simplicity).
Here is a possible setting:
Here are three approaches that could be considered:
Binomial tree without chunking:
Pipeline tree:
Binary reduction tree:
There are many other options to explore and compare. The comparison could be both analytical (wall-clock time as of the number of processors, the number of messages, the chunk size, the communication latency, and the communication bandwidth) and in simulation with an MPI implementation.
In his research, Brad Lowery (see these slides), proposes another strategy he calls a Greedy Tree, which is optimal for uniform segmentation (based on previous work). A great activity option for this module would be to implement the Greedy Tree and to reproduce Lowery's findings.