In this episode of the Computer Architecture Podcast, hosts Dr. Suvinay Subramanian and Dr. Lisa Hsu are joined by Dr. Kim Hazelwood, the West Coast Head of Engineering at Facebook AI Research (FAIR). Dr. Hazelwood brings a rich background to the discussion, having previously served as a tenured associate professor at the University of Virginia, a software engineer at Google, and Director of Systems Research at Yahoo Labs.
The conversation delves into the fascinating and rapidly evolving world of systems for Machine Learning (ML). Dr. Hazelwood shares insights from her current role at FAIR, where she is spearheading new ML research efforts, and discusses the significant shift from her previous work in infrastructure to a more research-focused position. The episode explores the unique challenges and considerations in designing and deploying ML systems at scale, contrasting them with traditional computer architecture research paradigms. Dr. Hazelwood also touches upon her personal journey, career pivots, and her philosophy on navigating the dynamic landscape of technology research and development.
Listeners will gain a deeper understanding of the complexities involved in building robust and efficient ML systems, the importance of a diverse and adaptable skill set in this field, and the future trends shaping computer architecture in the age of AI. The discussion also highlights the critical interplay between hardware, software, and the end-user experience in the context of ML applications.
Chapters
00:00 — Introduction to Dr. Kim Hazelwood and Her Role at Facebook AI Research
01:02 — What Motivates Dr. Hazelwood: A Glimpse into Her Daily Life and New Role
02:16 — Challenges in Deploying Real Systems at Scale for ML
03:46 — Unique Attributes of Designing Systems for ML vs. Classic Computer Architecture
04:51 — The Diversity of ML Workloads and the Importance of MLPerf
05:55 — Understanding Workload Diversity: Beyond CNNs and RNNs
07:43 — The Concept of "Reliability" in ML Systems
09:13 — Over-Indexing in Research: Parallels Between ML and SPEC Benchmarks
10:58 — How Architects Can Keep a Pulse on Evolving Workloads
11:17 — The Role of Recommendation Systems and Data Dashboards at Facebook
15:00 — The Importance of Tools, Frameworks, and the Overall Ecosystem for ML
23:33 — Dr. Hazelwood's Career Journey: Pivots, Adaptability, and Hazelwood's Law
29:18 — The Future of Systems for ML and Computer Architecture
34:34 — The Growing Importance of Responsible AI and Environmental Impact
38:01 — Assembling and Leading Diverse Teams in ML Systems Research
39:00 — The Role of Visibility and Feedback in Managing Compute Resources
Takeaways
ML Workload Diversity: ML systems must cater to a vast diversity of workloads beyond common examples like CNNs and RNNs, including areas like recommendation systems, which are critical for platforms like Facebook.
Reliability in ML: The definition of "reliability" in ML systems extends beyond traditional hardware fault tolerance to include the ability of a model to train successfully over extended periods (e.g., days or weeks) without failure.
Adaptability and "Hazelwood's Law": In a rapidly evolving field like ML, it's crucial to identify and focus on underserved research areas ("gaps") rather than piling onto already-popular topics, ensuring a more holistic advancement of the field.
User-Centric Design for ML Systems: Designing ML systems, including hardware and software, requires a strong focus on the end-user experience, making tools and abstractions easy to use even for non-experts like fresh college graduates.
The Evolving Nature of ML Systems: The tools, frameworks, and even the fundamental understanding of ML systems are constantly changing, necessitating a flexible and iterative approach to research, development, and even performance evaluation (e.g., cycle-accurate simulation is often impractical for long-running ML training).