
Ep 23: Tushar Krishna | Large-Scale Distributed AI Systems and Tooling

Summary

In this episode, hosts Dr. Suvinay Subramanian and Dr. Lisa Hsu sit down with Dr. Tushar Krishna, an Associate Professor in the School of Electrical and Computer Engineering at Georgia Tech. A member of the ISCA, MICRO, and HPCA Halls of Fame, Dr. Krishna is a leading expert in the design of modern large-scale distributed AI systems. His career has seen him transition from a focus on networks-on-chip (NoCs) during his PhD at MIT and his time at Intel to designing the complex communication fabrics and memory hierarchies that power today's most advanced AI accelerators.

The discussion explores the evolution of computer architecture in the age of Artificial Intelligence, specifically how the "ImageNet moment" changed the way architects approach interconnects. Dr. Krishna explains his philosophy that "everything is a network," whether it is communication between processing elements on a single die or communication between thousands of GPUs across a data center. He highlights the unique opportunities presented by AI workloads, which offer more deterministic and predictable traffic patterns compared to traditional general-purpose computing.

A significant portion of the episode is dedicated to the "tooling ecosystem" required to design these massive systems. Dr. Krishna discusses the development of ASTRA-sim, Chakra, and Garnet—tools that allow researchers to navigate a design space spanning nanoseconds on a chip to milliseconds across a rack. He also shares his insights on the necessity of cross-stack co-design and the challenge of keeping hardware flexible as models rapidly evolve from CNNs to Transformers and Large Language Models (LLMs).
