Meta has unveiled its first quantized versions of the Llama 3.2 1B and 3B models, optimized for speed and memory efficiency with the goal of running large language models (LLMs) on mobile and edge devices. These quantized models are designed to meet the growing demand for high-performance, low-footprint AI applications that run directly on devices, including popular smartphones. Built in collaboration with key industry partners such as Arm, MediaTek, and Qualcomm, the models aim to make AI more accessible, private, and efficient by reducing dependence on extensive cloud resources.
This release is part of Meta’s ongoing commitment to enhance on-device AI by providing developers with high-quality, safe, and efficient tools to deploy AI across various mobile and edge devices. The quantized Llama models offer a 2-4x speedup in inference and a significant reduction in memory footprint, allowing for enhanced AI performance on constrained devices without sacrificing quality.
Quantization is a technique that reduces the precision of the model’s weights and activations, making it more compact and resource-efficient. For the Llama 3.2 1B and 3B models, quantization brings about two key benefits:
1. Speed and Efficiency: The quantized models achieve a 2-4x speedup compared to the standard models and reduce memory usage by approximately 41%.
2. Smaller Model Size: Quantized Llama models are 56% smaller on average, which makes it feasible to deploy them on resource-limited devices like smartphones.
By leveraging quantization, Meta ensures that Llama models maintain quality and safety while becoming accessible for a wider range of on-device applications. For example, the quantized models provide efficient, privacy-centered AI for applications that do not require an internet connection, keeping user interactions entirely on the device.
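To make the idea concrete, here is a minimal, illustrative sketch of a symmetric 8-bit weight round-trip in PyTorch. It is not Meta's production scheme (which, as described later, uses 4-bit groupwise weights and 8-bit activations), but it shows how reducing precision shrinks storage at the cost of a small approximation error.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = w.abs().max() / 127.0                      # largest magnitude maps to 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                            # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print("max abs error:", (w - w_hat).abs().max().item())
print("BF16 storage:", w.numel() * 2, "bytes")         # 2 bytes per BF16 weight
print("int8 storage:", q.numel() * 1, "bytes")         # 1 byte per int8 weight
```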
Meta’s quantized Llama models employ two advanced quantization methods:
1. Quantization-Aware Training (QAT) with LoRA Adaptors (QLoRA): This method optimizes model accuracy in low-precision environments by simulating quantization effects during training. Meta implemented LoRA (Low-Rank Adaptation) adaptors in this approach to enhance model accuracy and adaptability in constrained settings.
2. SpinQuant: A state-of-the-art post-training quantization technique, SpinQuant focuses on portability and efficiency, making it suitable for scenarios where access to training datasets is limited or not possible. SpinQuant operates independently of the training process, which is ideal for developers who need to quantize their fine-tuned models for specific hardware without retraining.
Both quantization methods are incorporated into the Llama Stack reference implementation using PyTorch’s ExecuTorch framework, ensuring compatibility with a variety of devices. With these methods, Meta provides developers the flexibility to choose between QAT + LoRA for accuracy or SpinQuant for ease of use and portability.
Meta’s quantized Llama models are designed with mobile hardware in mind, specifically targeting Qualcomm and MediaTek SoCs with Arm CPUs. Performance has been optimized using Arm’s Kleidi AI library, which provides custom AI kernels for mobile CPUs. Meta’s benchmarks on Android devices such as the OnePlus 12 show substantial improvements:
- Decode Latency: A 2.5x improvement.
- Prefill Latency: A 4.2x improvement.
- Memory Efficiency: A 41% reduction in memory usage.
- Model Size: A reduction of 56% compared to the original BF16 models.
For developers, these improvements mean faster and more efficient AI applications on mobile devices, with quick responses and lower memory usage in short-context scenarios of up to 8,000 tokens. The models have also demonstrated consistent performance across other devices, including the Samsung S24+ for the 1B and 3B models and the Samsung S22 for the 1B model. While Meta has verified that these models run accurately on iOS, detailed performance metrics for Apple devices are still under review.
Meta’s quantization setup for Llama 3.2 models involves a sophisticated framework designed for both model quality and efficient processing on Arm CPUs. The quantization scheme includes three key elements:
1. 4-Bit Groupwise Quantization for Transformer Layers: Weights in the linear layers of every transformer block are quantized to 4-bit precision with a group size of 32, while activations use 8-bit dynamic quantization.
2. 8-Bit Quantization for the Classification Layer: The final classification layer is quantized at an 8-bit level, ensuring accurate outputs while keeping memory requirements low.
3. 8-Bit Quantization for Embeddings: Embedding layers use 8-bit quantization per channel, enabling efficient use of memory while retaining model fidelity.
This tailored quantization setup enables Meta to balance performance and model size, creating an LLM that is optimized for mobile CPUs without compromising on quality or safety.
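As a concrete illustration of the weight side of this scheme, the sketch below performs symmetric 4-bit groupwise quantization with a group size of 32 in plain PyTorch. It is a simplified, illustrative version of the idea rather than the ExecuTorch kernel implementation, and it omits the 8-bit dynamic activation quantization and the packing of two 4-bit values per byte.

```python
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 32):
    """Symmetric 4-bit groupwise quantization: one scale per group of
    `group_size` consecutive weights along the input dimension."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group; signed 4-bit values span [-8, 7], so scale to 7.
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    # A real kernel would pack two 4-bit values into each byte; we keep int8 here.
    return q, scales

def dequantize_4bit_groupwise(q, scales, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(256, 256)                     # toy linear-layer weight
q, scales = quantize_4bit_groupwise(w)
w_hat = dequantize_4bit_groupwise(q, scales, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```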
Quantization-Aware Training (QAT) with LoRA adaptors, or QLoRA, simulates the effects of quantization during training, helping the model learn to operate well under low-precision conditions. For the Llama 3.2 models, Meta started from BF16 checkpoints produced by supervised fine-tuning (SFT) and then ran additional SFT rounds with QAT. LoRA adaptors were applied to all layers within the transformer block, with the adaptors’ weights and activations kept in BF16 to preserve quality.
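The sketch below illustrates these two ingredients together in a single toy layer: the base weight is fake-quantized in the forward pass with a straight-through estimator (so training "feels" 4-bit precision), while small LoRA matrices stay in higher precision (BF16 in Meta's setup). The QATLoRALinear module and its details are hypothetical and conceptual, not Meta's training code, which applies QAT and the LoRA adaptors as described above.

```python
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int = 4, group_size: int = 32):
    """Simulate groupwise integer quantization in the forward pass.
    The detach trick is a straight-through estimator: the forward value is
    the quantized weight, but gradients flow to the original weight."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax)
    w_hat = (q * scales).reshape(w.shape)
    return w + (w_hat - w).detach()

class QATLoRALinear(nn.Module):
    """Toy layer: base weight trained under fake 4-bit quantization, plus
    low-rank LoRA matrices kept in higher precision. In practice the base
    weight may be frozen while only the adaptors train."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach().clone())
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        w_q = fake_quantize(self.weight)                  # simulated 4-bit weight
        return x @ w_q.T + (x @ self.lora_a.T) @ self.lora_b.T

layer = QATLoRALinear(nn.Linear(128, 128))
print(layer(torch.randn(2, 128)).shape)                   # torch.Size([2, 128])
```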
The QLoRA process includes a final fine-tuning stage using Direct Preference Optimization (DPO), which further refines the model’s performance and brings it close to the quality of the original BF16 model, but with a significantly smaller memory footprint.
Using torchao APIs, developers can apply QAT to their own Llama base models and add LoRA adaptors for customized use cases, gaining the benefits of quantization while conserving computational resources.
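A rough sketch of what that flow could look like is shown below. The import path and class name (Int8DynActInt4WeightQATQuantizer from torchao's prototype QAT module) are assumptions based on recent torchao releases and may differ in your installed version, so treat this as a sketch rather than copy-paste code; LoRA adaptors would be layered on separately.

```python
import torch.nn as nn
# Import path and class name are assumptions based on torchao's prototype QAT
# API; they may differ in the torchao version you have installed.
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Toy stand-in for a model; in practice this would be a BF16 Llama 3.2
# checkpoint loaded via torchtune or transformers.
model = nn.Sequential(nn.Linear(256, 256), nn.SiLU(), nn.Linear(256, 256))

quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=32)
model = quantizer.prepare(model)    # swap in fake-quantized linear layers

# ... run supervised fine-tuning here: quantization is simulated in every
# forward pass so the weights adapt to 4-bit precision ...

model = quantizer.convert(model)    # replace with actually quantized modules
```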
For developers who may not have access to the training data required for QAT, Meta offers SpinQuant, an advanced post-training quantization method. While SpinQuant does not achieve the same level of accuracy as QAT + LoRA, it provides flexibility by enabling quantization without retraining. This portability makes it a valuable tool for scenarios where computational resources are constrained or data availability is limited.
SpinQuant uses rotation matrices calibrated on a small WikiText dataset to smooth outliers and enable effective quantization. These matrices are optimized to fit within Meta’s established quantization scheme, ensuring that the quantized model remains efficient and effective even without additional training.
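The key property behind this approach can be shown in a few lines: multiplying the weights by an orthogonal (rotation) matrix and rotating the activations consistently leaves the layer's output mathematically unchanged, while spreading activation outliers across many dimensions so they quantize with less error. The sketch below demonstrates the equivalence with a random rotation; SpinQuant's contribution is learning good rotations (calibrated on WikiText, as noted above) rather than picking them at random.

```python
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, d)          # a layer's weight matrix
x = torch.randn(1, d)          # an activation vector...
x[0, 3] = 40.0                 # ...with an injected outlier that hurts quantization

# A random orthogonal matrix; SpinQuant instead learns its rotations on a
# small calibration set (WikiText).
R, _ = torch.linalg.qr(torch.randn(d, d))

# Rotating activations and weights consistently preserves the output:
# (x R) (W R)^T = x R R^T W^T = x W^T
y_ref = x @ W.T
y_rot = (x @ R) @ (W @ R).T
print("max output difference:", (y_ref - y_rot).abs().max().item())  # ~0 (round-off)

# The rotation spreads the outlier's energy across all dimensions, so the
# rotated activation fits an integer grid with far less clipping/rounding error.
print("max |x| before:", x.abs().max().item(),
      "| after rotation:", (x @ R).abs().max().item())
```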
Meta conducted comprehensive evaluations of the quantized Llama 3.2 models, comparing SpinQuant and QLoRA with the original BF16 baseline. The results, using PyTorch’s ExecuTorch framework on Arm CPUs, highlight the advantages of quantization for both speed and memory efficiency:
- Time-to-First-Token (TTFT): On the OnePlus 12, TTFT was significantly reduced, enhancing the responsiveness of applications.
- Decode Latency: Improved by 2.5x on average.
- Prefill Latency: Improved by 4.2x on average.
- Memory and Model Size: Reduced by 41% and 56%, respectively.
These benchmarks, validated on multiple devices including the OnePlus 12, Samsung S24+, and Samsung S22, provide developers with reproducible performance metrics that confirm the models’ capabilities on popular mobile devices. For future releases, Meta is collaborating with industry partners to incorporate NPUs into the ExecuTorch framework, further enhancing performance for Llama models on compatible devices.
Meta’s quantized Llama models represent a significant leap toward democratizing AI access on mobile and edge devices. This release aligns with Meta’s goal to empower the global developer community by offering models that prioritize efficiency, privacy, and ease of use. As Llama continues to gain traction, Meta sees its quantized models as a pathway for developers to create fast, memory-efficient applications that respect user privacy by processing data entirely on-device. With a 10x growth rate in community adoption, Llama has set a new standard for open, modifiable, and cost-efficient AI solutions. As developers embrace Meta’s lightweight models, they’ll unlock new possibilities for AI-driven experiences that are both fast and private, enabling personalized interactions in real-time without needing extensive cloud resources.
Meta’s release of quantized Llama models on llama.com and Hugging Face also highlights the importance of collaboration, with partners such as Arm, MediaTek, and Qualcomm playing a crucial role in optimizing the models for mobile platforms. Meta’s vision for the future of Llama centers on responsible innovation, community-driven progress, and a commitment to open-source principles.
Download and Experiment with Llama Models Today
To get started, developers can access Llama 3.2 1B and 3B quantized models on llama.com or Hugging Face, making it easy to experiment with high-performance AI models optimized for mobile and edge devices. With quantized Llama models, the future of fast, private, on-device AI has arrived, ready for developers to create the next wave of mobile-powered AI applications.
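As one possible starting point, the snippet below fetches a checkpoint with the huggingface_hub client. The repository ID shown is an assumed name for one of the quantized variants; check llama.com or the Meta Llama organization on Hugging Face for the exact IDs of the QLoRA and SpinQuant releases, and note that access requires accepting the Llama license.

```python
# Illustrative download via huggingface_hub; the repo_id below is an assumed
# name for one quantized variant -- confirm the exact ID on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8",  # assumed repo name
    # token="hf_...",  # needed once you have accepted the Llama license
)
print("Model files downloaded to:", local_dir)
```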