Vibe Coding vs Agentic Coding

Picture this: A business leader overhears their engineering team discussing “vibe coding” and immediately imagines developers throwing prompts at ChatGPT until something works, shipping whatever emerges to production. The term alone, “vibe coding”, conjures images of seat-of-the-pants development that would make any CTO break out in a cold sweat. This misunderstanding is creating a real problem. Whilst vibe coding represents genuine creative exploration that has its place, the unfortunate terminology is causing some business leaders to conflate all AI-assisted or AI-accelerated development with haphazard experimentation. I fear that engineers who use sophisticated AI coding agents, such as advanced agentic coding tools like Cline, to deliver production-quality solutions are finding their approaches questioned or dismissed entirely. ...

June 6, 2025 · 7 min · 1422 words · Sam McLeod
Agentic Coding Development Flow

My Plan, Document, Act, Review flow for Agentic Software Development

I follow a simple yet effective flow for agentic coding that helps me develop software efficiently with AI coding agents while keeping them on track, focused on the task at hand and ensuring they have access to the right tools and information. The flow is simple: Setup -> Plan -> Act -> Review and Iterate.

• Setup - Ensure the effective agent rules and tools are enabled.
• Plan - Build a detailed plan based on your goals, requirements and ideation with the coding agent.
• Act - Perform the development tasks, in phases.
• Review and Iterate - Review the work, updating the plan and iterating as required.

🕵 Setup

Ensure any directories or files you don’t want Cline to read are excluded by adding them to a .clineignore file in the root of your project (a sketch is included below).

🛠️ Tools

The effective use of tools is critical to the success and cost effectiveness of agentic coding. The MCP servers (tools) I frequently use are available here: sammcj/agentic-coding#mcp-servers ...
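As a concrete illustration, a minimal .clineignore might look like the sketch below. The entries are illustrative assumptions rather than a recommended set, and I'm assuming the file follows .gitignore-style patterns:

```gitignore
# .clineignore - paths the coding agent should not read (illustrative examples)
node_modules/
dist/
build/
.env
*.pem
secrets/
coverage/
```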

April 28, 2025 · 10 min · 2035 words · Sam McLeod
LLM Sampling Methods Comparison

Comprehensive Guide to LLM Sampling Parameters

Large Language Models (LLMs) like those used in Ollama don’t generate text deterministically - they use probabilistic sampling to select the next token based on the model’s prediction probabilities. How these probabilities are filtered and adjusted before sampling significantly impacts the quality of generated text. This guide explains the key sampling parameters and how they affect your model’s outputs, along with recommended settings for different use cases.

[Figure: Ollama Sampling Diagram]

Sampling Methods Comparison

Example Ollama Sampling Settings Table

| Setting        | General | Coding | Coding Alt | Factual/Precise | Creative Writing | Creative Chat |
|----------------|---------|--------|------------|-----------------|------------------|---------------|
| min_p          | 0.05    | 0.05   | 0.9        | 0.1             | 0.05             | 0.05          |
| temperature    | 0.7     | 0.2    | 0.2        | 0.3             | 1.0              | 0.85          |
| top_p          | 0.9     | 0.9    | 1.0        | 0.8             | 0.95             | 0.95          |
| mirostat       | 0       | 0      | 0          | 0               | 0                | 0             |
| repeat_penalty | 1.1     | 1.05   | 1.05       | 1.05            | 1.0              | 1.15          |
| top_k          | 40      | 40     | 0*         | 0*              | 0                | 0             |

*For factual/precise use cases: some guides recommend Top K = 40, but Min P generally provides better adaptive filtering. Consider using Min P alone with a higher value (0.1) for most factual use cases. ...
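To show how these settings map onto an actual request, here is a minimal sketch using the ollama Python client with the “Coding” column from the table above; the model name and prompt are placeholder assumptions, and the option keys follow Ollama's standard options names:

```python
# Minimal sketch: applying the "Coding" sampling settings from the table above
# via the ollama Python client (model name and prompt are illustrative).
import ollama

response = ollama.generate(
    model="qwen2.5-coder:32b",  # placeholder model name
    prompt="Write a function that parses an ISO 8601 date string.",
    options={
        "temperature": 0.2,
        "min_p": 0.05,
        "top_p": 0.9,
        "top_k": 40,
        "repeat_penalty": 1.05,
        "mirostat": 0,
    },
)
print(response["response"])
```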

April 25, 2025 · 19 min · 3960 words · Sam McLeod

Agentic Coding - Live Demo / Brownbag

Apologies for the video quality; Google Meet/Hangouts records at a very low resolution and bitrate.

Links mentioned in the video:
• Cline
• Roo Code (Cline fork with some experimental features)
• MCP: https://modelcontextprotocol.io/introduction
• The package-version MCP server I created: https://github.com/sammcj/mcp-package-version
• https://smithery.ai (index of MCP servers)
• https://mcp.so (index of MCP servers)
• https://glama.ai/mcp/servers (index of MCP servers)

February 7, 2025 · Sam McLeod

Bringing K/V Context Quantisation to Ollama

Explaining the concept of K/V context cache quantisation, why it matters and the journey to integrate it into Ollama.

Why K/V Context Cache Quantisation Matters

The introduction of K/V context cache quantisation in Ollama is significant, offering users a range of benefits:
• Run Larger Models: With reduced VRAM demands, users can now run larger, more powerful models on their existing hardware.
• Expand Context Sizes: Larger context sizes allow LLMs to consider more information, leading to potentially more comprehensive and nuanced responses. For tasks like coding, where longer context windows are beneficial, K/V quantisation can be a game-changer.
• Reduce Hardware Utilisation: Freeing up memory or allowing users to run LLMs closer to the limits of their hardware.

Running the K/V context cache at Q8_0 quantisation effectively halves the VRAM required for the context compared to the default F16 with minimal quality impact on the generated outputs, while Q4_0 cuts it down to just one third (at the cost of some noticeable quality reduction). ...
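For context, enabling this is a server-side setting; a minimal sketch (assuming an Ollama version that includes K/V cache quantisation) looks like the following, with flash attention enabled as a prerequisite:

```shell
# Minimal sketch: enabling K/V context cache quantisation on the Ollama server.
# Assumes an Ollama version that ships the feature; q8_0 roughly halves the
# context cache VRAM vs the default f16, q4_0 reduces it to roughly one third.
export OLLAMA_FLASH_ATTENTION=1    # flash attention is required for a quantised K/V cache
export OLLAMA_KV_CACHE_TYPE=q8_0   # options: f16 (default), q8_0, q4_0
ollama serve
```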

December 4, 2024 · 12 min · Sam McLeod

LLM FAQ

“Should I run a larger parameter model, or a higher quality smaller model of the same family?”

TL;DR: Larger parameter model [lower quantisation quality] > smaller parameter model [higher quantisation quality]. E.g. Qwen2.5 32B Q3_K_M > Qwen2.5 14B Q8_0.

Caveats: Don’t go lower than Q3_K_M or IQ2_M, especially if the model is under ~30B parameters. This is in the context of two models of the same family and version (e.g. Qwen2.5 Coder).

Longer answer: check out the Code Chaos and Copilots slide deck. ...
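To make the size-versus-quantisation trade-off concrete, here is a rough back-of-the-envelope sketch; the bits-per-weight figures are assumed approximations for the GGUF quantisation types and exclude the context cache and runtime overhead:

```python
# Rough sketch: approximate weight-only VRAM for the two options in the TL;DR.
# Bits-per-weight values are assumed approximations for GGUF quant types and
# ignore the K/V context cache and runtime overhead.
GiB = 1024**3

def approx_weight_vram(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM (GiB) needed for the model weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / GiB

print(f"32B @ Q3_K_M (~3.9 bpw): {approx_weight_vram(32, 3.9):.1f} GiB")  # ~14.5 GiB
print(f"14B @ Q8_0   (~8.5 bpw): {approx_weight_vram(14, 8.5):.1f} GiB")  # ~13.9 GiB
# Both land in a similar VRAM budget, and the larger model at the lower
# quantisation generally gives better results (within the caveats above).
```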

5 min · Sam McLeod

LLM vRAM Estimator

0 min · 0 words · Sam McLeod