Amazon
Edge LLM - Reducing LLM Memory Footprint to < 2GB
Large language models such as Llama 7B and Mistral 7B are good candidates for running on edge devices such as the Orange Pi 5. However, these models place heavy demands on general-purpose compute and memory. Offloading compute to the GPU or NPU and reducing memory utilization makes them much easier to deploy on resource-constrained platforms. The memory reduction can be accomplished by applying downstream task-specific fine-tuning, quantization, and pruning, and in some cases by training a smaller model using knowledge distillation from the original large model. The compute offload can be accomplished using SDKs such as OpenCL and RKNN.

This student team will take the above models, apply the identified techniques, and reduce model memory usage as much as possible while keeping the impact on accuracy or perplexity to less than a 5 percentage point drop from the original model. The model will run on the Orange Pi platform, leveraging the GPU and NPU, with a performance of more than 1 token per second of output.

Design parameters and performance targets this student team will work to incorporate include:
Original models: Llama 7B, Mistral 7B
Tasks: storytelling, summarization, math Q&A
Hardware: Orange Pi 5
Model compression techniques: fine-tuning, quantization, pruning, and optional knowledge distillation
Final compressed model performance: less than a 5 percentage point drop from the original model on metrics such as accuracy and perplexity

The outcomes this student team will work to accomplish are compressed models running on the Orange Pi 5 (GPU & NPU), with less than 2 GB of RAM usage and more than 80% of operators offloaded to the GPU and NPU. A minimal sketch of the core compression steps follows the milestones below.

Intermediate milestones this student team will work to meet include:
1. Llama 7B, Mistral 7B - weights quantized to 4 bits, less than a 5 percentage point drop in accuracy as measured on the dataset
2. Llama 7B, Mistral 7B - weights quantized to 4 bits and pruned to 50%, less than a 15 percentage point drop in accuracy as measured on the dataset
3. Llama 7B, Mistral 7B - weights quantized to 4 bits, pruned to 50%, and fine-tuned to recover accuracy, less than a 5 percentage point drop in accuracy
4. Llama 7B, Mistral 7B - artifact from step 3 serialized using PyTorch JIT and run on a laptop
5. Llama 7B, Mistral 7B - artifact from step 3 serialized and run on the Orange Pi 5 GPU and NPU
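The compression pipeline in milestones 1, 2, and 4 can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the team's implementation: it assumes the model is available as a plain nn.Module whose Linear layers hold the bulk of the weights, it simulates 4-bit quantization ("fake quantization") so accuracy and perplexity can be measured before packing weights into a real int4 format, and the pruning amount and the filename compressed_llm.pt are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def quantize_weight_4bit(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-output-channel 4-bit quantization of a weight matrix.
    Returns the dequantized ("fake-quantized") tensor so accuracy can be
    evaluated; a deployable build would store packed int4 weights instead."""
    qmax = 7  # symmetric signed 4-bit range [-7, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

def compress_model(model: nn.Module, prune_amount: float = 0.5) -> nn.Module:
    """Milestones 1-2: prune 50% of each Linear layer's weights by magnitude,
    then fake-quantize the remaining weights to 4 bits."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=prune_amount)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor
            with torch.no_grad():
                module.weight.copy_(quantize_weight_4bit(module.weight))
    return model

# Milestone 4 (sketch): trace and serialize with PyTorch JIT. This assumes the
# model's forward pass returns logits as a plain tensor; Hugging Face models
# typically need return_dict=False or a thin wrapper before tracing.
# example_ids = torch.randint(0, 32000, (1, 16))
# torch.jit.trace(compress_model(model), example_ids).save("compressed_llm.pt")
```

Milestone 3 would then fine-tune the pruned, quantized checkpoint (for example with a parameter-efficient method such as LoRA) on the storytelling, summarization, and math Q&A tasks to recover the lost accuracy before serialization.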
Faculty Adviser
Larry Arnstein,
Affiliate Professor,
Electrical & Computer Engineering
Students
Chaitanya Mullapudi
Delilah Yan
James Yao
Lucas Xie
Sai Jayanth Kalisi
Vincent Wang
Yao Zhang