Patching NVIDIA's driver and vLLM to enable P2P on consumer GPUs

NVIDIA artificially restricts peer-to-peer (P2P) GPU communication to their enterprise cards. Turns out this is a software limitation, not a hardware one. I patched my drivers to remove it, hacked vLLM to take advantage of it, and got a 10-30% throughput improvement running Qwen 3.5 35b on dual RTX 3090s.

February 25, 2026 · 10 min · 1963 words · Sam McLeod

Code, Chaos, and Copilots (AI/LLM Talk July 2024)

Code, Chaos, and Copilots is a talk I gave in July 2024 as an intro to how I use AI/LLMs to augment my capabilities every day. What I use AI/LLMs for Prompting tips Codegen workflow Picking the right models Model formats Context windows Quantisation Model servers Inference parameters Clients & tools Getting started cheat-sheets Download Slide Deck Disclaimer I’m not a ML Engineer or data scientist, As such, the information presented here is based on my understanding of the subject and may not be 100% accurate or complete. ...

July 18, 2024 · Sam McLeod

Understanding AI/LLM Quantisation Through Interactive Visualisations

AI models (“LLMs” in this case) have inherently large sizes and computational requirements that often pose challenges for deployment and use. ...

July 17, 2024 · Sam McLeod

LLM vRAM Estimator

0 min · 0 words · Sam McLeod