Bringing K/V Context Quantisation to Ollama

Explaining the concept of K/V context cache quantisation, why it matters, and the journey to integrate it into Ollama.

Why K/V Context Cache Quantisation Matters

The introduction of K/V context cache quantisation in Ollama is significant, offering users a range of benefits:

• Run Larger Models: With reduced VRAM demands, users can now run larger, more powerful models on their existing hardware.
• Expand Context Sizes: Larger context sizes allow LLMs to consider more information, leading to potentially more comprehensive and nuanced responses. For tasks like coding, where longer context windows are beneficial, K/V quantisation can be a game-changer.
• Reduce Hardware Utilisation: Frees up memory and lets users run LLMs closer to the limits of their hardware.

Running the K/V context cache at Q8_0 quantisation effectively halves the VRAM required for the context compared to the default F16, with minimal impact on output quality, while Q4_0 cuts it down to just one third (at the cost of some noticeable quality reduction). ...
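
As a rough back-of-the-envelope illustration (not taken from the post itself), the sketch below estimates the K/V cache footprint for each cache type. The layer count, KV-head count and head dimension are assumed values for a typical ~32B model, and the 8.5 / 4.5 bits-per-element figures approximate llama.cpp's q8_0 and q4_0 block formats including their scale factors:

```python
# Rough K/V context cache size estimate, comparing f16, q8_0 and q4_0 cache types.
# Model dimensions (64 layers, 8 KV heads, head_dim 128) are illustrative assumptions.
BITS_PER_ELEMENT = {"f16": 16, "q8_0": 8.5, "q4_0": 4.5}  # approx., incl. block scales

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, cache_type: str) -> float:
    """Bytes for the K and V caches: 2 tensors x layers x context x kv_heads x head_dim."""
    elements = 2 * n_layers * ctx_len * n_kv_heads * head_dim
    return elements * BITS_PER_ELEMENT[cache_type] / 8

for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(64, 8, 128, ctx_len=32768, cache_type=cache_type) / 2**30
    print(f"{cache_type}: ~{gib:.1f} GiB")
```

With a 32K context this works out to roughly 8 GiB at F16, ~4.3 GiB at Q8_0 and ~2.3 GiB at Q4_0, which lines up with the halving / one-third figures above.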

December 4, 2024 · 12 min · Sam McLeod

LLM FAQ

“Should I run a larger parameter model, or a higher quality smaller model of the same family?”

TL;DR: Larger parameter model [lower quantisation quality] > Smaller parameter model [higher quantisation quality]

E.g. Qwen2.5 32B Q3_K_M > Qwen2.5 14B Q8_0

Caveats:

• Don’t go lower than Q3_K_M or IQ2_M, especially if the model is under ~30B parameters.
• This applies to two models of the same family and version (e.g. Qwen2.5 Coder).

Longer answer: Check out the Code Chaos and Copilots slide deck. ...
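
To make the comparison concrete, here is a small sketch (not from the FAQ) of the approximate weight-only VRAM for the example pairing above. The bits-per-weight averages for Q3_K_M and Q8_0 are assumed ballpark figures for llama.cpp-style quantisation, not exact values:

```python
# Back-of-the-envelope weight footprint for the FAQ's example pairing.
# Bits-per-weight figures are rough assumed averages for llama.cpp quant formats.
BPW = {"Q3_K_M": 3.9, "Q8_0": 8.5}

def weight_gib(params_billions: float, quant: str) -> float:
    """Approximate GiB for the weights alone (excludes K/V cache and runtime overhead)."""
    return params_billions * 1e9 * BPW[quant] / 8 / 2**30

print(f"Qwen2.5 32B Q3_K_M: ~{weight_gib(32, 'Q3_K_M'):.1f} GiB")
print(f"Qwen2.5 14B Q8_0:   ~{weight_gib(14, 'Q8_0'):.1f} GiB")
```

Both land in a similar VRAM budget (roughly 14–15 GiB of weights), which is why the larger, more heavily quantised model is usually the better pick.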

5 min · Sam McLeod

LLM vRAM Estimator

0 min · 0 words · Sam McLeod