
Compressing the Giants: How Quantization, Pruning, and Distillation Make LLMs Practical

A clear and simple guide to how quantization, pruning, and knowledge distillation make large language models smaller, faster, and easier to run on everyday devices.


Large language models are powerful. But they are huge. Bigger models need more storage and memory. That makes them costly to run. It also makes them hard to use on phones, laptops, or other small devices.

Model compression shrinks models so they still work well but use far less memory and compute. That lets models run on ordinary hardware and keeps data private by doing work on-device. The main tools are quantization, pruning, and knowledge distillation.

1. Quantization — lower the number precision

Quantization means storing model numbers with fewer bits. Instead of 32-bit floating point (fp32), we use 8-bit or 4-bit integers. That cuts size a lot.

Why it helps:

  • Less memory. Going from 32 to 8 bits cuts weight storage by 4x. 4-bit storage is about an eighth of fp32 in principle, though real 4-bit formats carry some scale metadata on top (see the back-of-the-envelope sketch after this list).

  • Faster math. Integer ops are simpler and often faster than floating point ops.
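As a rough back-of-the-envelope, here is what those ratios mean for the weights of a 7B-parameter model. This counts weights only; activations, the KV cache, and per-block scale overhead in real formats are ignored.

```python
# Rough weight-memory estimate for a 7B-parameter model at different precisions.
# Ignores activations, KV cache, and per-block scale overhead in real formats.
params = 7_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name:>5}: {gib:6.1f} GiB")
# fp32 ~ 26.1 GiB, fp16 ~ 13.0 GiB, int8 ~ 6.5 GiB, int4 ~ 3.3 GiB
```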

How it works, simply:

  1. Pick a range for real values (clip from α to β).

  2. Compute a scale to map that range into integers.

  3. If the range is not centered on zero, add a zero-point offset so that real zero maps exactly to an integer.

Two common schemes:

  • Symmetric (scale only): zero maps to zero. Simple.

  • Asymmetric (scale + zero point): handles uneven ranges more accurately.
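To make the scale and zero-point steps concrete, here is a minimal NumPy sketch of both schemes. It quantizes a single array; real toolchains add calibration, clipping choices, and block structure on top of this.

```python
import numpy as np

def quantize_asymmetric(x, num_bits=8):
    """Affine (scale + zero point) quantization to unsigned integers."""
    qmin, qmax = 0, 2**num_bits - 1
    alpha, beta = x.min(), x.max()                 # clipping range [alpha, beta]
    scale = (beta - alpha) / (qmax - qmin)
    # Integer that represents real 0.0, kept inside the valid range.
    zero_point = int(np.clip(np.round(qmin - alpha / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

def quantize_symmetric(x, num_bits=8):
    """Scale-only quantization to signed integers; real 0.0 maps to integer 0."""
    qmax = 2**(num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

x = np.random.randn(1000).astype(np.float32)
q, s, zp = quantize_asymmetric(x)
print("max abs error:", np.abs(dequantize(q, s, zp) - x).max())
```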

When quantization happens:

  • Post-Training Quantization (PTQ): Convert a trained model. Fast and easy. Going below 4 bits often hurts accuracy.

  • Quantization Aware Training (QAT): Simulate low precision during training so the model learns to compensate for the rounding. Better results at extreme compression.
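QAT is often implemented with "fake quantization" plus a straight-through estimator: the forward pass rounds the weights, and the backward pass treats the rounding as the identity so gradients still flow. A simplified PyTorch sketch (real QAT frameworks also calibrate or learn the ranges):

```python
import torch

def fake_quantize(w, num_bits=4):
    """Simulate low-precision weights in the forward pass (symmetric, per tensor)."""
    qmax = 2**(num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: gradients flow as if quantization were identity.
    return w + (w_q - w).detach()

# Inside a training step, a layer would use the fake-quantized weights, e.g.:
# y = torch.nn.functional.linear(x, fake_quantize(layer.weight), layer.bias)
```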

Bits and granularity:

  • Q8 (8-bit): Close to full precision.

  • Q4 (4-bit): Good size/accuracy trade-off.

  • Q2 (2-bit): Very small, but accuracy can drop.

You can quantize per tensor or per channel; per-channel keeps a separate scale for each channel, so it is more precise. Some advanced schemes go further and cluster weight values into a small codebook (K-means weight sharing), while popular formats such as llama.cpp's K-quants mix block sizes and precisions to balance speed and quality.
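The difference between per-tensor and per-channel is just how many scales you keep. A rough NumPy illustration, where one channel with a much wider range would otherwise dominate the single per-tensor scale:

```python
import numpy as np

def mean_int8_error(w, scale):
    q = np.clip(np.round(w / scale), -128, 127)
    return np.abs(q * scale - w).mean()

w = np.random.randn(64, 256).astype(np.float32)    # one weight matrix, 64 output channels
w[0] *= 10.0                                        # one channel with a much larger range

per_tensor_scale = np.abs(w).max() / 127
per_channel_scale = np.abs(w).max(axis=1, keepdims=True) / 127

print("per-tensor  mean error:", mean_int8_error(w, per_tensor_scale))
print("per-channel mean error:", mean_int8_error(w, per_channel_scale))
# Per-channel scales adapt to each row, so the small-range channels lose far less precision.
```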


2. Pruning — remove what’s not needed

Pruning cuts parts of a model that don’t add much. Think of trimming a bush to keep it healthy.

Two types:

  • Unstructured pruning: Zero out individual weights. This makes the model sparse. It saves parameters but needs special hardware or libraries to get speedups.

  • Structured pruning: Drop whole pieces — attention heads, neurons, or layers. This gives clear size and speed wins without sparse math.

Example: cutting a few attention heads or dropping layers can shrink the model while keeping most of its ability.
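A minimal sketch of both ideas on a plain weight matrix: magnitude pruning for the unstructured case, and dropping whole rows (a stand-in for neurons or attention heads) for the structured case. Real pipelines usually fine-tune afterwards to recover accuracy.

```python
import numpy as np

w = np.random.randn(512, 512).astype(np.float32)

# Unstructured: zero out the 50% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(w), 0.5)
w_sparse = np.where(np.abs(w) >= threshold, w, 0.0)
print("sparsity:", (w_sparse == 0).mean())          # ~0.5, but the matrix shape is unchanged

# Structured: drop entire output rows (stand-ins for neurons or attention heads)
# with the smallest L2 norm. The matrix actually shrinks, so dense math gets faster.
row_norms = np.linalg.norm(w, axis=1)
keep = np.argsort(row_norms)[len(row_norms) // 4:]  # remove the weakest 25% of rows
w_pruned = w[np.sort(keep)]
print("new shape:", w_pruned.shape)                 # (384, 512)
```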

3. Knowledge Distillation — teach a smaller model

Distillation trains a smaller “student” model to copy a bigger “teacher” model.

Two common ways:

  • Soft targets: The student learns from the teacher’s full output distribution (the softmax over its logits, usually softened with a temperature), not just the hard labels. These softer signals carry more nuance. Training uses a loss like KL divergence between the two distributions; see the sketch after this list.

  • Synthetic data: The teacher generates example inputs and outputs. The student trains on those pairs. Many small models started this way.
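A minimal PyTorch sketch of the soft-target loss, assuming we already have teacher and student logits for a batch. The temperature T and the mixing weight alpha are typical but illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix KL divergence against the teacher's softened distribution with the
    usual cross-entropy against the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # T*T rescales the soft term so its gradients stay comparable to the hard term.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# student_logits, teacher_logits: (batch, vocab); labels: (batch,)
```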

A good outcome: a well-trained student can match, or occasionally even beat, the teacher when the teacher carried a lot of redundancy. Because the student learns the teacher’s behavior rather than its internals, distillation can smooth out noisy structure and improve generalization.

The trade-offs

You can mix quantization, pruning, and distillation. That usually gives the best size cuts. But you must balance three things: size, speed, and accuracy.

Quick comparison:

  • fp32: Best accuracy. Biggest cost.

  • fp16: Half the size of fp32. Good trade-off.

  • int8 (Q8): Small and usually near-original accuracy.

  • int4 (Q4): Much smaller. Some accuracy loss. Useful for many apps.

  • int2 (Q2): Very small and fast. Accuracy can suffer. Good for ultra-low-power uses.

Bottom line

Compression makes large models usable on normal devices. It lowers cost and helps privacy by enabling local inference. Quantization, pruning, and distillation each shrink models in different ways. Used together, they let powerful models run on laptops, phones, and edge devices without losing too much quality.