Industry & alumni

Amazon

Edge LLM - Reducing LLM Memory Footprint to < 2GB

Llama 7B and Mistral 7B are large language models that are good candidates for running on edge devices such as the Orange Pi 5. However, these models place high demands on general-purpose compute and memory resources. If the compute is offloaded to a GPU or NPU and memory utilization is reduced, the models become easier to deploy on resource-constrained platforms. The memory reduction can be accomplished by applying downstream task-specific fine-tuning, quantization, pruning, and, in some cases, training a smaller model using knowledge distillation from the original large model. The compute offload can be accomplished using SDKs such as OpenCL and RKNN. This student team will take the models above, apply the identified techniques, and reduce model memory usage as much as possible while keeping the impact on accuracy or perplexity to less than a 5 percentage point drop from the original model. The model will run on the Orange Pi platform, leveraging the GPU and NPU, at a performance of more than 1 token per second of output.

Design parameters and performance targets this student team will work to incorporate include:

- Original models: Llama 7B, Mistral 7B
- Tasks: storytelling, summarization, math Q&A
- Hardware: Orange Pi 5
- Model compression techniques: fine-tuning, quantization, pruning, and optional knowledge distillation
- Final compressed model performance: less than a 5 percentage point drop from the original model on metrics such as accuracy and perplexity

The outcomes this student team will work to accomplish are compressed models running on the Orange Pi 5 (GPU and NPU), with less than 2 GB of RAM usage and more than 80% of operators offloaded to the GPU and NPU.

Intermediate milestones this student team will work to meet include (illustrative sketches of each step follow this list):

1. Llama 7B, Mistral 7B: weights quantized to 4 bits, with less than a 5 percentage point drop in accuracy as measured on the dataset
2. Llama 7B, Mistral 7B: weights quantized to 4 bits and pruned to 50%, with less than a 15 percentage point drop in accuracy as measured on the dataset
3. Llama 7B, Mistral 7B: weights quantized to 4 bits, pruned to 50%, and fine-tuned to recover accuracy, with less than a 5 percentage point drop in accuracy
4. Llama 7B, Mistral 7B: the artifact from step 3 serialized using PyTorch JIT and run on a laptop
5. Llama 7B, Mistral 7B: the artifact from step 3 serialized and run on the Orange Pi 5 GPU and NPU
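Milestone 1 calls for 4-bit weight quantization. Below is a minimal sketch of symmetric, group-wise quantization in plain PyTorch; in practice the team would likely use a dedicated method such as GPTQ or AWQ, and the tensor shapes and function names here are purely illustrative.

    import torch

    def quantize_4bit(weight, group_size=64):
        # Symmetric group-wise quantization: each group of `group_size`
        # weights shares one scale, and codes live in [-8, 7] (4 bits).
        out_f, in_f = weight.shape
        w = weight.reshape(out_f, in_f // group_size, group_size)
        scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
        codes = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        # A real deployment would pack two 4-bit codes per byte;
        # int8 storage is kept here for clarity.
        return codes, scale

    def dequantize_4bit(codes, scale, shape):
        return (codes.float() * scale).reshape(shape)

    # Quantize one toy weight matrix and check the reconstruction error.
    w = torch.randn(4096, 4096)
    codes, scale = quantize_4bit(w)
    w_hat = dequantize_4bit(codes, scale, w.shape)
    print("mean abs error:", (w - w_hat).abs().mean().item())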
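Milestone 2 adds 50% pruning. A sketch using PyTorch's built-in pruning utilities on a single stand-in linear layer; the real project would iterate over the Llama/Mistral projection layers and might prefer structured or global pruning.

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(4096, 4096)  # stand-in for one transformer projection

    # Zero out the 50% of weights with the smallest L1 magnitude.
    prune.l1_unstructured(layer, name="weight", amount=0.5)

    # Fold the pruning mask into the weight tensor permanently.
    prune.remove(layer, "weight")

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"sparsity: {sparsity:.2%}")  # ~50.00%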
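Milestone 3 fine-tunes the quantized, pruned model to recover accuracy and tracks perplexity, which is simply the exponential of the mean token-level cross-entropy. The sketch below assumes a Hugging Face-style causal LM interface where passing labels returns the shifted loss; that interface is an assumption, not a prescribed part of the project.

    import math
    import torch

    @torch.no_grad()
    def perplexity(model, token_ids):
        # Causal-LM loss with labels == inputs is the average
        # cross-entropy over predicted next tokens.
        loss = model(input_ids=token_ids, labels=token_ids).loss
        return math.exp(loss.item())

    def finetune_step(model, optimizer, token_ids):
        # One recovery fine-tuning step on task data (storytelling,
        # summarization, math Q&A) after quantization and pruning.
        loss = model(input_ids=token_ids, labels=token_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()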
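Milestone 4 serializes the compressed model with PyTorch JIT so it can be loaded without the original Python class definitions. A minimal sketch, with a toy module standing in for the compressed model and a placeholder file name:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 16), nn.ReLU()).eval()

    # Trace with a representative input and save a self-contained artifact.
    example = torch.randn(1, 16)
    torch.jit.trace(model, example).save("compressed_model.pt")

    # On the laptop (or any target with libtorch): reload and run.
    loaded = torch.jit.load("compressed_model.pt")
    print(loaded(example).shape)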
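Milestone 5 moves the artifact to the Orange Pi 5, whose RK3588 SoC exposes its NPU through Rockchip's RKNN SDK. The sketch below follows the general shape of the RKNN-Toolkit2 conversion flow; the exact API should be verified against Rockchip's documentation, and the file paths are placeholders.

    from rknn.api import RKNN  # RKNN-Toolkit2; API assumed, verify locally

    rknn = RKNN()
    rknn.config(target_platform="rk3588")  # Orange Pi 5's SoC

    # Convert an ONNX export of the compressed model (placeholder path).
    rknn.load_onnx(model="compressed_model.onnx")
    rknn.build(do_quantization=False)  # weights already quantized upstream

    # Emit the NPU-ready artifact to deploy on the board.
    rknn.export_rknn("compressed_model.rknn")
    rknn.release()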

Faculty Adviser

Larry Arnstein, Affiliate Professor, Electrical & Computer Engineering

Students

Chaitanya Mullapudi
Delilah Yan
James Yao
Lucas Xie
Sai Jayanth Kalisi
Vincent Wang
Yao Zhang

Related News

Fri, 09/20/2024 | UW Civil & Environmental Engineering

Smarter irrigation for a greener UW

A new project combines satellite data with ground sensors to conserve water and create a more sustainable campus environment.

Mon, 09/09/2024 | UW Mechanical Engineering

Testing an in-home mobility system

Through innovative capstone projects, engineering students worked with community members on an adaptable mobility system.

Mon, 08/19/2024 | UW Mechanical Engineering

Students strive to ensure accurate AED shock dosage

ShockSafe, developed by students with the help of mentors from Philips and Engineering Innovation in Health (EIH), can distinguish between children and adults during cardiac arrest emergencies.

Wed, 08/07/2024 | Snohomish County News

Snohomish County, University of Washington partnership boosts efficiency in enterprise scanning center

UW Industrial and Systems Engineering Capstone Project set to save Snohomish County over $40,000 annually.