In this episode, hosts Dr. Suvinay Subramanian and Dr. Lisa Hsu sit down with Dr. Tushar Krishna, an Associate Professor in the School of Electrical and Computer Engineering at Georgia Tech. A member of the ISCA, MICRO, and HPCA Halls of Fame, Dr. Krishna is a leading expert in the design of modern large-scale distributed AI systems. His career has seen him transition from a focus on networks-on-chip (NoCs) during his PhD at MIT and his time at Intel to designing the complex communication fabrics and memory hierarchies that power today’s most advanced AI accelerators.
The discussion explores the evolution of computer architecture in the age of Artificial Intelligence, specifically how the "ImageNet moment" changed the way architects approach interconnects. Dr. Krishna explains his philosophy that "everything is a network," whether it is communication between processing elements on a single die or communication between thousands of GPUs across a data center. He highlights the unique opportunities presented by AI workloads, which offer more deterministic and predictable traffic patterns compared to traditional general-purpose computing.
A significant portion of the episode is dedicated to the "tooling ecosystem" required to design these massive systems. Dr. Krishna discusses the development of Astra-sim, Chakra, and Garnet—tools that allow researchers to navigate a design space spanning nanoseconds on a chip to milliseconds across a rack. He also shares his insights on the necessity of cross-stack co-design and the challenges of maintaining flexibility in hardware as models rapidly evolve from CNNs to Transformers and Large Language Models (LLMs).
Chapters
00:00:00 — Introduction
00:04:48 — The Journey from Many-Core NoCs to AI Acceleration
00:07:19 — The ImageNet Moment and the Shift to Domain-Specific Interconnects
00:08:21 — Moving Up the Stack: Data Flow and Flexible Accelerators
00:16:17 — The Challenges of Distributed AI and Collective Communication
00:19:55 — Cross-Stack Co-design and the Search for Proper Abstractions
00:26:26 — The Interplay of Link Technologies and Physical Constraints
00:34:54 — Building a Tooling Ecosystem for Multi-Scale Simulation
00:41:18 — Astra-sim and the Chakra Standard for Workload Benchmarking
00:49:49 — The Philosophy of "Everything is a Network"
00:58:48 — Commercialization and the Future of Architectural Tools
01:03:31 — Words of Wisdom for the Next Generation of Architects
Takeaways
Deterministic Traffic Patterns: Unlike general-purpose many-core systems, AI workloads provide a unique opportunity for architects because their traffic patterns are highly predictable and deterministic once the workload is partitioned.
The "Everything is a Network" Philosophy: Architectural principles are scale-invariant; the same fundamental challenges of mapping, data flow, and interconnects apply whether designing a network-on-chip or a multi-rack data center fabric.
Operator-Based Abstraction: To balance the need for high-performance co-design with the need for future-proofing, architects have settled on "operators" (like matrix multiplication or vector operations) as the primary abstraction layer for hardware-software contracts.
The Necessity of Multi-Scale Tooling: Modern AI systems require a "plug-and-play" simulation ecosystem (like Astra-sim and Chakra) because individual simulators cannot simultaneously model the cycle-level details of on-chip wires and the millisecond-scale latencies of global networks.
Value of Engineering in Research: For aspiring computer architects, building robust tools and engaging in "deep engineering" is just as vital as innovative research, as these skill sets are highly transferable across the rapidly evolving technology landscape.