
Measuring Model Quantisation Quality with KL Divergence

KL divergence against a known-good reference answers “how much did this quant change the model’s behaviour?” rather than “how good is this model overall?”. What KLD measures: KL divergence measures how much one probability distribution disagrees with another. At each token position, both the reference and the quantised model emit a distribution over the full vocabulary (~248k tokens for Qwen-class models). The reference might say "~80% likely the, ~5% likely a, …"; the quant says something slightly different. KLD compares the two per position and averages. ...
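The per-position comparison the excerpt describes can be sketched in a few lines. This is a minimal illustration, not the post's actual evaluation harness; the three-token vocabulary and probabilities are made up for demonstration:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats between two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 3-token vocabulary at a single position: reference model vs quantised model.
ref   = [0.80, 0.05, 0.15]
quant = [0.78, 0.07, 0.15]

per_position = [kl_divergence(ref, quant)]        # in practice, one value per token
mean_kld = sum(per_position) / len(per_position)  # averaged over all positions
```

Identical distributions give a KLD of exactly zero; the further the quant's next-token probabilities drift from the reference's, the larger the averaged value.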

April 28, 2026 · 12 min · 2548 words · Sam McLeod

Bringing K/V Context Quantisation to Ollama

Explaining the concept of K/V context cache quantisation, why it matters, and the journey to integrate it into Ollama. Why K/V Context Cache Quantisation Matters: The introduction of K/V context cache quantisation in Ollama is significant, offering users a range of benefits:
• Run Larger Models: With reduced VRAM demands, users can now run larger, more powerful models on their existing hardware.
• Expand Context Sizes: Larger context sizes allow LLMs to consider more information, leading to potentially more comprehensive and nuanced responses. For tasks like coding, where longer context windows are beneficial, K/V quantisation can be a game-changer.
• Reduce Hardware Utilisation: Freeing up memory lets users run LLMs closer to the limits of their hardware.
Running the K/V context cache at Q8_0 quantisation effectively halves the VRAM required for the context compared to the default F16, with minimal quality impact on the generated outputs, while Q4_0 cuts it down to roughly a third (at the cost of some noticeable quality reduction). ...
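The savings above can be estimated with a back-of-the-envelope calculation. The layer, head, and dimension values below are illustrative, not a specific model's config; the ~8.5 and ~4.5 bits per element reflect the per-block scale overhead of the GGUF Q8_0 and Q4_0 formats:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # 2x for the K and V tensors, one pair per transformer layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

gib = 1024 ** 3
# Hypothetical model: 32 layers, 8 KV heads, head dim 128, 32k context.
f16 = kv_cache_bytes(32, 8, 128, 32768, 2.0)      # F16:  16 bits/elem
q8  = kv_cache_bytes(32, 8, 128, 32768, 8.5 / 8)  # Q8_0: ~8.5 bits/elem
q4  = kv_cache_bytes(32, 8, 128, 32768, 4.5 / 8)  # Q4_0: ~4.5 bits/elem
print(f"F16 {f16/gib:.2f} GiB, Q8_0 {q8/gib:.2f} GiB, Q4_0 {q4/gib:.2f} GiB")
```

For this hypothetical configuration the cache drops from 4 GiB at F16 to just over 2 GiB at Q8_0, matching the "effectively halves" figure, with Q4_0 smaller still.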

December 4, 2024 · 12 min · Sam McLeod

Code, Chaos, and Copilots (AI/LLM Talk July 2024)

Code, Chaos, and Copilots is a talk I gave in July 2024 as an intro to how I use AI/LLMs to augment my capabilities every day. Topics covered: what I use AI/LLMs for, prompting tips, codegen workflow, picking the right models, model formats, context windows, quantisation, model servers, inference parameters, clients & tools, and getting-started cheat-sheets. Download Slide Deck. Disclaimer: I’m not an ML engineer or data scientist; as such, the information presented here is based on my understanding of the subject and may not be 100% accurate or complete. ...

July 18, 2024 · Sam McLeod

Understanding AI/LLM Quantisation Through Interactive Visualisations

AI models (“LLMs” in this case) are inherently large and computationally demanding, which often poses challenges for deployment and use. ...

July 17, 2024 · Sam McLeod