Tech Papers

~ papers i read, with notes & summaries ~

Horovod

May 2026 Parallelism / Distributed Training for AI simple

Data-parallel distributed training framework. Three questions: how it distributes training, why that works, and the arch + algorithm.

1. How — data parallelism. Data is split across servers; each server holds a full copy of the model.

2. Why it works ⭐ Each step starts from identical weights. The forward pass computes a local loss and doesn't change the weights; the backward pass produces local gradients that differ per server. Those gradients are then averaged across servers (all-reduce), and every server applies the same averaged gradient to the same starting weights — so all replicas end the step identical and stay in sync.

3. Arch + algorithm. Older systems used parameter servers + workers (now deprecated); Horovod instead uses ring-allreduce with tensor fusion. Ring-allreduce takes 2(N-1) steps — two passes around the ring (a reduce-scatter then an all-gather) — and each step sends only 1/N of the data, so bandwidth usage stays low and independent of the number of servers. Tensor fusion batches small tensors before sending, improving bandwidth efficiency and average latency: it slightly delays the first tensor (which waits to be batched) but cuts the queueing time of the tensors that follow.

PyTorch (NeurIPS 2019)

June 2026 Coding Framework todo

PyTorch's imperative style makes it far easier to grasp how deep learning frameworks work conceptually than TensorFlow's graph-based approach.

Orca (OSDI 2022)

June 2026 Inference Serving todo

The foundational paper for understanding LLM serving. PagedAttention, DistServe, and Splitwise all build on the surfaces Orca introduces.

ZeRO (SC 2020)

June 2026 Memory Optimization todo

Clearly explains why memory is the bottleneck and how to fix it systematically. Later work (FlashAttention, ZeRO-Offload) builds on this foundation.