Industry & alumni

Amazon

Edge LLM - Reducing LLM Memory Footprint to < 2GB

Llama 7B and Mistral 7B are large language models that are good candidates for running on edge devices such as the Orange Pi 5. However, these models place high demands on general-purpose compute and memory resources. If the compute is offloaded to a GPU or NPU and memory utilization is reduced, the models become easier to deploy on resource-constrained platforms. The memory reduction can be accomplished by applying downstream task-specific fine-tuning, quantization, pruning, and, in some cases, training a smaller model using knowledge distillation from the original large model. The compute offload can be accomplished using SDKs such as OpenCL and RKNN. This student team will take the models above, apply the identified techniques, and reduce model memory usage as much as possible while keeping the impact on accuracy or perplexity to less than a 5 percentage point drop from the original model. The model will run on the Orange Pi platform, leveraging the GPU and NPU, at a performance of more than 1 token per second of output.

Design parameters and performance targets this student team will work to incorporate include:

- Original models: Llama 7B, Mistral 7B
- Tasks: storytelling, summarization, math Q&A
- Hardware: Orange Pi 5
- Model compression techniques: fine-tuning, quantization, pruning, and optional knowledge distillation
- Final compressed model performance: less than a 5 percentage point drop from the original model on metrics such as accuracy and perplexity

The outcomes this student team will work to accomplish are compressed models running on the Orange Pi 5 (GPU and NPU), with less than 2 GB of RAM usage and more than 80% of operators offloaded to the GPU and NPU.

Intermediate milestones this student team will work to meet include (illustrative sketches of each step follow this list):

1. Llama 7B, Mistral 7B: weights quantized to 4 bits, with less than a 5 percentage point drop in accuracy as measured on the dataset
2. Llama 7B, Mistral 7B: weights quantized to 4 bits and pruned to 50%, with less than a 15 percentage point drop in accuracy as measured on the dataset
3. Llama 7B, Mistral 7B: weights quantized to 4 bits, pruned to 50%, and fine-tuned to recover accuracy, with less than a 5 percentage point drop in accuracy
4. Llama 7B, Mistral 7B: the artifact from step 3 serialized using PyTorch JIT and run on a laptop
5. Llama 7B, Mistral 7B: the artifact from step 3 serialized and run on the Orange Pi 5 GPU and NPU
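Milestone 1 calls for 4-bit weight quantization. Below is a minimal sketch of symmetric, group-wise quantization in plain PyTorch; in practice the team would likely use a dedicated method such as GPTQ or AWQ, and the tensor shapes and function names here are purely illustrative.

    import torch

    def quantize_4bit(weight, group_size=64):
        # Symmetric group-wise quantization: each group of `group_size`
        # weights shares one scale, and codes live in [-8, 7] (4 bits).
        out_f, in_f = weight.shape
        w = weight.reshape(out_f, in_f // group_size, group_size)
        scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
        codes = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        # A real deployment would pack two 4-bit codes per byte;
        # int8 storage is kept here for clarity.
        return codes, scale

    def dequantize_4bit(codes, scale, shape):
        return (codes.float() * scale).reshape(shape)

    # Quantize one toy weight matrix and check the reconstruction error.
    w = torch.randn(4096, 4096)
    codes, scale = quantize_4bit(w)
    w_hat = dequantize_4bit(codes, scale, w.shape)
    print("mean abs error:", (w - w_hat).abs().mean().item())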
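Milestone 2 adds 50% pruning. A sketch using PyTorch's built-in pruning utilities on a single stand-in linear layer; the real project would iterate over the Llama/Mistral projection layers and might prefer structured or global pruning.

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(4096, 4096)  # stand-in for one transformer projection

    # Zero out the 50% of weights with the smallest L1 magnitude.
    prune.l1_unstructured(layer, name="weight", amount=0.5)

    # Fold the pruning mask into the weight tensor permanently.
    prune.remove(layer, "weight")

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"sparsity: {sparsity:.2%}")  # ~50.00%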
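Milestone 3 fine-tunes the quantized, pruned model to recover accuracy and tracks perplexity, which is simply the exponential of the mean token-level cross-entropy. The sketch below assumes a Hugging Face-style causal LM interface where passing labels returns the shifted loss; that interface is an assumption, not a prescribed part of the project.

    import math
    import torch

    @torch.no_grad()
    def perplexity(model, token_ids):
        # Causal-LM loss with labels == inputs is the average
        # cross-entropy over predicted next tokens.
        loss = model(input_ids=token_ids, labels=token_ids).loss
        return math.exp(loss.item())

    def finetune_step(model, optimizer, token_ids):
        # One recovery fine-tuning step on task data (storytelling,
        # summarization, math Q&A) after quantization and pruning.
        loss = model(input_ids=token_ids, labels=token_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()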
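Milestone 4 serializes the compressed model with PyTorch JIT so it can be loaded without the original Python class definitions. A minimal sketch, with a toy module standing in for the compressed model and a placeholder file name:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 16), nn.ReLU()).eval()

    # Trace with a representative input and save a self-contained artifact.
    example = torch.randn(1, 16)
    torch.jit.trace(model, example).save("compressed_model.pt")

    # On the laptop (or any target with libtorch): reload and run.
    loaded = torch.jit.load("compressed_model.pt")
    print(loaded(example).shape)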
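Milestone 5 moves the artifact to the Orange Pi 5, whose RK3588 SoC exposes its NPU through Rockchip's RKNN SDK. The sketch below follows the general shape of the RKNN-Toolkit2 conversion flow; the exact API should be verified against Rockchip's documentation, and the file paths are placeholders.

    from rknn.api import RKNN  # RKNN-Toolkit2; API assumed, verify locally

    rknn = RKNN()
    rknn.config(target_platform="rk3588")  # Orange Pi 5's SoC

    # Convert an ONNX export of the compressed model (placeholder path).
    rknn.load_onnx(model="compressed_model.onnx")
    rknn.build(do_quantization=False)  # weights already quantized upstream

    # Emit the NPU-ready artifact to deploy on the board.
    rknn.export_rknn("compressed_model.rknn")
    rknn.release()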

Faculty Adviser

Larry Arnstein, Affiliate Professor, Electrical & Computer Engineering

Students

Chaitanya Mullapudi
Delilah Yan
James Yao
Lucas Xie
Sai Jayanth Kalisi
Vincent Wang
Yao Zhang

Related News

Fri, 09/20/2024 | UW Civil & Environmental Engineering

Smarter irrigation for a greener UW

A new project combines satellite data with ground sensors to conserve water and create a more sustainable campus environment.

Mon, 09/09/2024 | UW Mechanical Engineering

Testing an in-home mobility system

Through innovative capstone projects, engineering students worked with community members on an adaptable mobility system.

Mon, 08/19/2024 | UW Mechanical Engineering

Students strive to ensure accurate AED shock dosage

ShockSafe, developed by students with the help of mentors from Philips and Engineering Innovation in Health (EIH), can distinguish between children and adults during cardiac arrest emergencies.

Wed, 08/07/2024 | Snohomish County News

Snohomish County, University of Washington partnership boosts efficiency in enterprise scanning center

UW Industrial and Systems Engineering Capstone Project set to save Snohomish County over $40,000 annually.