skip to content
Ball's Blog

Ring All Reduce Algorithm

/ 1 min read

Updated:
Table of Contents

Summary

AllReduce is a collective communication operation that performs the reduce function for all the data.

AllReduce

Let’s talk about the most efficient algorithm for allReduce. How should the data move?

Assumptions

Assume we have 3 devices, and we want to perform allReduce operation. Each device is connected to PCIe switch, and the link is full-duplex with 60GB/s bandwidth.

Assumption Topology

We assume each device has 180GB of data.

Naive Approach

The naive approach is to collect all the data to single device, and then perform the reduce operation. After that, we send the result to all the other devices.

Naive Approach

Naive approach takes 12 seconds to complete the allReduce operation.

Big problem with naive approach is that it doesn’t utilize the full bandwidth of the links.

Ring Algorithm

The ring approach is to let each device send its data to the next device in the ring.

Ring Algorithm

Ring approach takes 4 seconds to complete the allReduce operation.

As you can see, ring approach utilizes the full bandwidth of the links.

In general, for NN devices, WW bytes/s bandwidth, and BB bytes of data, ring approach takes 2(N1)NBW\frac{2(N-1)}{N} \cdot \frac{B}{W} seconds to complete the allReduce operation.