Efficient LLM Deployment at the Edge Through Quantization
July 16 @ 7:00 pm - 9:00 pm
LOCATION (hybrid: attend in person or join via Zoom, your choice)
Hacker Dojo
855 Maude Ave
Mountain View, CA 94043
(For faster sign-in, read the [Hacker Dojo policies](https://tinyurl.com/9cn8sevt). When you sign up, state "I accept the Hacker Dojo policies.")
If you want to join remotely, you can submit questions via the Zoom Q&A. Zoom link:
[https://acm-org.zoom.us/j/93173066826?pwd=cDQzTi9JNG1lT1Nkb3JISnJwUGc1Zz09](https://acm-org.zoom.us/j/93173066826?pwd=cDQzTi9JNG1lT1Nkb3JISnJwUGc1Zz09)
AGENDA
6:30 Doors open, food
7:00 SFBayACM upcoming events, speaker introduction
7:10 Presentation starts
8:15-8:30 Finish, depending on Q&A
Abstract:
The widespread adoption of large language models (LLMs) has sparked a revolution in the development of innovative solutions, with inference expected to account for 90% of the costs associated with LLM applications, compared to only 10% for training. This cost disparity, along with the environmental impact of inference and data privacy concerns, has underscored the need for optimization at the edge. Quantization has emerged as a crucial technique, offering significant performance gains in computation and memory usage. In this presentation, we will delve into modern quantization techniques that facilitate the deployment of LLMs at the edge. We will explore popular methods, including AWQ, SmoothQuant, and Block Quantization, and examine their trade-offs and optimizations. As a case study, we will use popular open-source models such as Llama, OPT, and Mistral, together with llama.cpp, a well-regarded C++ implementation, to analyze the impact of quantization on model performance and provide insights into best practices for achieving overall efficiency in LLM deployments.
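To give a flavor of the block quantization idea mentioned in the abstract, below is a minimal NumPy sketch of symmetric per-block int8 quantization. The function names, the block size of 64, and the round-to-nearest scheme are illustrative assumptions for this sketch; they are not the specific formats used by llama.cpp or the exact methods the talk will cover.

```python
import numpy as np

def quantize_blockwise_int8(weights: np.ndarray, block_size: int = 64):
    """Symmetric per-block int8 quantization: each block of `block_size`
    values gets its own scale, which limits the impact of outliers."""
    flat = weights.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size
    flat = np.pad(flat, (0, pad))                      # pad so every block is full
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                          # avoid divide-by-zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise_int8(q: np.ndarray, scales: np.ndarray, shape):
    """Reconstruct approximate float32 weights from int8 blocks and per-block scales."""
    flat = (q.astype(np.float32) * scales).ravel()
    return flat[: np.prod(shape)].reshape(shape)

# Example: quantize a small random weight matrix and check the reconstruction error
w = np.random.randn(128, 64).astype(np.float32)
q, s = quantize_blockwise_int8(w)
w_hat = dequantize_blockwise_int8(q, s, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Because each block carries its own scale, a single large outlier only degrades the precision of its own block rather than the whole tensor, which is one of the trade-offs the talk compares across methods.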
Speaker Bio:
Dwith Chenna is a seasoned Research and Development professional specializing in algorithm development and optimization within computer vision, deep learning, and EdgeAI. With extensive experience in creating state-of-the-art, performance-critical perception systems, he has a deep understanding of optimizing deep learning models and AI inference on resource-constrained hardware such as digital signal processors. Dwith excels in evaluating embedded algorithms for performance and accuracy, focusing on key metrics like latency, memory, bandwidth, and power consumption. His expertise includes developing tooling and automation for these optimizations, as well as quantizing and tuning deep learning models to meet stringent performance requirements, significantly enhancing the efficiency of generative AI at the edge.