Data-parallel distributed training framework. Three questions: how it distributes training, why that works, and the arch + algorithm.
1. How — data parallelism. Data is split across servers; each server holds a full copy of the model.
2. Why it works ⭐ Each step starts from identical weights. The forward pass computes a local loss and doesn't change the weights; the backward pass produces local gradients that differ per server. Those gradients are then averaged across servers (all-reduce), and every server applies the same averaged gradient to the same starting weights — so all replicas end the step identical and stay in sync.
3. Arch + algorithm. Older systems used parameter servers +
workers (now deprecated); Horovod instead uses ring-allreduce with tensor
fusion. Ring-allreduce takes 2(N-1) steps — two passes around
the ring (a reduce-scatter then an all-gather) — and each step sends only
1/N of the data, so bandwidth usage stays low and independent of
the number of servers. Tensor fusion batches small tensors before sending,
improving bandwidth efficiency and average latency: it slightly delays the
first tensor (which waits to be batched) but cuts the queueing time of the
tensors that follow.