Bringing K/V Context Quantisation to Ollama

Explaining the concept of K/V context cache quantisation, why it matters, and the journey to integrate it into Ollama.

Why K/V Context Cache Quantisation Matters

The introduction of K/V context cache quantisation in Ollama is significant, offering users a range of benefits:

• Run Larger Models: With reduced VRAM demands, users can now run larger, more powerful models on their existing hardware.
• Expand Context Sizes: Larger context sizes allow LLMs to consider more information, leading to potentially more comprehensive and nuanced responses. For tasks like coding, where longer context windows are beneficial, K/V quantisation can be a game-changer.
• Reduce Hardware Utilisation: Freeing up memory, or allowing users to run LLMs closer to the limits of their hardware.

Running the K/V context cache at Q8_0 quantisation effectively halves the VRAM required for the context compared to the default F16, with minimal quality impact on the generated outputs, while Q4_0 cuts it down to just one third (at the cost of some noticeable quality reduction). ...
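To make those savings concrete, here is a minimal Python sketch (not from the post) that estimates K/V cache size for an assumed 8B-class model. The per-element sizes for Q8_0 and Q4_0 follow the llama.cpp block layouts (32-value blocks with a 16-bit scale); the model dimensions are illustrative assumptions, not figures from the article.

```python
# Rough K/V cache size estimate to illustrate the F16 vs Q8_0 vs Q4_0 savings.
# Model dimensions below are assumptions resembling a typical 8B-class model
# (32 layers, 8 KV heads, head dim 128); substitute your model's real values.

BYTES_PER_ELEMENT = {
    "f16": 2.0,       # default: 16-bit floats
    "q8_0": 34 / 32,  # 32-value block: 32 x int8 + 2-byte scale
    "q4_0": 18 / 32,  # 32-value block: 16 bytes of packed 4-bit values + 2-byte scale
}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Keys + values stored for every layer, KV head and token position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEMENT[cache_type]
    return per_token * context_len

if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                              context_len=32_768, cache_type=cache_type)
        print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

With these assumed dimensions, a 32K-token context drops from roughly 4 GiB at F16 to about 2.1 GiB at Q8_0 and 1.1 GiB at Q4_0, consistent with the "half" and "one third" figures above.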

December 4, 2024 · 12 min · Sam McLeod

Code, Chaos, and Copilots (AI/LLM Talk July 2024)

Code, Chaos, and Copilots is a talk I gave in July 2024 as an intro to how I use AI/LLMs to augment my capabilities every day. Topics covered include: what I use AI/LLMs for, prompting tips, my codegen workflow, picking the right models, model formats, context windows, quantisation, model servers, inference parameters, clients & tools, and getting-started cheat-sheets. Download Slide Deck. Disclaimer: I'm not an ML engineer or data scientist; as such, the information presented here is based on my understanding of the subject and may not be 100% accurate or complete. ...

July 18, 2024 · Sam McLeod

Understanding AI/LLM Quantisation Through Interactive Visualisations

AI models ("LLMs" in this case) are inherently large and computationally demanding, which often poses challenges for deployment and use. ...

July 17, 2024 · Sam McLeod