Claude Code has the best agentic coding UX in the business right now — the way it plans, edits, runs, and recovers from errors is genuinely a step above what most open-source coding agents pull off. The harness is the moat.
So here’s a fun question: can you keep the harness and swap the brain?
Turns out, yes. Credit to the original post that surfaced the trick: point Claude Code at a local vLLM server and let your own GPUs do the inference while Anthropic’s CLI does the orchestration. I’ve been digging into it from the angle of someone who already runs a self-hosted LLM stack in production, and it’s worth breaking down what’s actually happening, why the config matters, and what this unlocks if you’re building on top of self-hosted models.
The trick in one paragraph
Claude Code respects four environment variables that almost nobody talks about: ANTHROPIC_BASE_URL, ANTHROPIC_DEFAULT_OPUS_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, and ANTHROPIC_DEFAULT_HAIKU_MODEL. Set the base URL to your own server and remap the three model names to whatever your local server calls its model, and the CLI happily sends every request to your box instead of api.anthropic.com. It does not know it has been redirected.
The launch command ends up looking like this:
ANTHROPIC_BASE_URL=http://localhost:8000 \
ANTHROPIC_DEFAULT_OPUS_MODEL=MiniMax-M2.7 \
ANTHROPIC_DEFAULT_SONNET_MODEL=MiniMax-M2.7 \
ANTHROPIC_DEFAULT_HAIKU_MODEL=MiniMax-M2.7 \
claude
That is the whole client side. The interesting work is on the server.
The vLLM server, and why every flag matters
Here is the Docker Compose that the original post uses, running lukealonso/MiniMax-M2.7-NVFP4 on a dual RTX Pro 6000 box. They’re getting around 70 tokens/sec with a ~196K context window:
services:
  llm-server:
    image: vllm/vllm-openai:cu130-nightly
    container_name: minimax-m2.7-server
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - HF_HOME=/root/.cache/huggingface
      - NCCL_P2P_LEVEL=4
      - SAFETENSORS_FAST_GPU=1
      - VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
      - VLLM_USE_FLASHINFER_MOE_FP4=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
    ports:
      - "8000:8000"
    volumes:
      - $HF_HOME:/root/.cache/huggingface
    ipc: host
    command:
      - "lukealonso/MiniMax-M2.7-NVFP4"
      - "--trust-remote-code"
      - "--served-model-name"
      - "MiniMax-M2.7"
      - "--gpu-memory-utilization"
      - "0.95"
      - "--max-num-seqs"
      - "16"
      - "--enable-chunked-prefill"
      - "--enable-prefix-caching"
      - "--max-num-batched-tokens"
      - "16384"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "minimax_m2"
      - "--reasoning-parser"
      - "minimax_m2"
      - "--quantization"
      - "modelopt_fp4"
      - "--kv-cache-dtype"
      - "fp8"
      - "--dtype"
      - "auto"
      - "--attention-backend"
      - "FLASHINFER"
      - "--load-format"
      - "fastsafetensors"
      - "--tensor-parallel-size"
      - "2"
      - "--port"
      - "8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 100
      start_period: 300s
    networks:
      - llm-net

networks:
  llm-net:
    driver: bridge
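Once the container reports healthy, one sanity check is worth doing before launching the CLI: the string you pass to --served-model-name is exactly what the ANTHROPIC_DEFAULT_*_MODEL variables have to match. A minimal sketch against vLLM’s OpenAI-compatible model list (Node 18+ with global fetch assumed; adjust the host and model name for your setup):

```ts
// check-model.ts — confirm the served model name matches the env-var mapping.
// Assumes the vLLM container above is reachable on localhost:8000.
const BASE_URL = "http://localhost:8000";

const res = await fetch(`${BASE_URL}/v1/models`);
if (!res.ok) throw new Error(`server not ready: ${res.status}`);

const body = (await res.json()) as { data: { id: string }[] };
const names = body.data.map((m) => m.id);
console.log("served models:", names);

// This string must match ANTHROPIC_DEFAULT_{OPUS,SONNET,HAIKU}_MODEL exactly.
if (!names.includes("MiniMax-M2.7")) {
  throw new Error("MiniMax-M2.7 not served; check --served-model-name");
}
```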
Most tutorials just paste this kind of config and move on. I want to call out the parts that are doing real work, because if you tweak this for your own model you need to know which knobs actually matter.
The tool-call and reasoning parsers
This is the non-obvious bit, and it is the reason most “I pointed Claude Code at my local model and it kind of worked” attempts fall apart after a few turns:
--enable-auto-tool-choice
--tool-call-parser minimax_m2
--reasoning-parser minimax_m2
Claude Code is an agent. It expects the model to emit clean tool calls in a parseable format, then it executes those tool calls (file reads, edits, shell commands) and feeds results back. If the model mixes tool calls into prose, or wraps them in the wrong delimiters, the harness can’t extract them and the loop breaks. vLLM ships per-model parsers that know how to extract structured tool calls from the model’s raw output. If your model doesn’t have a parser, this whole approach falls over no matter how good the model is at coding.
So when you swap to a different model, the first thing to check is: does vLLM have a tool-call parser for it? Qwen2.5 and Qwen3 do. Llama 3.1+ does. MiniMax-M2 does (which is why this works). A lot of older or more obscure models don’t, and you’ll find that out the hard way.
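A cheap way to verify the parser before wiring up Claude Code is to send the OpenAI-compatible endpoint a request with a single tool defined and confirm the reply carries a structured tool_calls array rather than tool syntax dumped into the message text. A rough sketch, using a made-up read_file tool; only the shape of the response matters:

```ts
// tool-call-smoke-test.ts — verify the tool-call parser returns structured calls.
// The "read_file" tool is a hypothetical example for testing the response shape.
const res = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "MiniMax-M2.7",
    messages: [{ role: "user", content: "Read the file src/index.ts" }],
    tools: [{
      type: "function",
      function: {
        name: "read_file",
        description: "Read a file from the workspace",
        parameters: {
          type: "object",
          properties: { path: { type: "string" } },
          required: ["path"],
        },
      },
    }],
  }),
});

const data = await res.json();
const msg = data.choices[0].message;

// With a working parser you should see structured tool_calls here,
// not tool-call markup left inline in msg.content.
console.log(JSON.stringify(msg.tool_calls ?? msg.content, null, 2));
```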
The quantization and KV-cache stack
--quantization modelopt_fp4
--kv-cache-dtype fp8
--attention-backend FLASHINFER
This is how a big MoE model fits on two cards and still gives you ~196K context. NVFP4 weights, FP8 KV cache, and FlashInfer attention. The combination is what makes long-context agentic coding actually viable on a workstation — coding agents send huge prompts because they include file contents, repo structure, conversation history, and tool results. If your context is small or your KV cache is fat, you run out of headroom fast.
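The back-of-the-envelope sizing is worth doing for whatever model you run: KV cache per token is roughly 2 (K and V) × layers × KV heads × head dim × bytes per element, so FP8 halves it relative to FP16. The numbers below are illustrative placeholders, not MiniMax-M2’s actual architecture; read the real values from your model’s config.json:

```ts
// kv-cache-math.ts — rough KV-cache sizing. The architecture numbers below are
// ILLUSTRATIVE PLACEHOLDERS, not MiniMax-M2's real config; take them from your
// model's config.json (num_hidden_layers, num_key_value_heads, head_dim).
const layers = 60;        // placeholder
const kvHeads = 8;        // placeholder (GQA models keep this small)
const headDim = 128;      // placeholder
const bytesPerElem = 1;   // fp8 = 1 byte; fp16 would be 2

// K and V, per layer, per token
const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElem;

const contextTokens = 196_000;
const totalGiB = (bytesPerToken * contextTokens) / 1024 ** 3;

console.log(`${bytesPerToken} bytes/token -> ~${totalGiB.toFixed(1)} GiB for ${contextTokens} tokens`);
// Doubling bytesPerElem (fp16 KV cache) doubles that figure, which is often the
// difference between fitting next to the weights and not fitting at all.
```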
Chunked prefill + prefix caching
--enable-chunked-prefill
--enable-prefix-caching
These two flags matter specifically for agentic coding workloads more than for regular chat. Coding agents re-send almost-identical prompts on every turn — the system prompt, the tool definitions, the file context, the conversation so far, plus a small delta. Without prefix caching, vLLM reprocesses the entire 100K-token prompt every single turn. With it, the shared prefix is reused and only the new tokens get processed. This is the difference between “usable” and “watching paint dry.”
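You can see the effect directly rather than take it on faith: send the same long prefix twice and compare wall-clock time. With prefix caching on, the second request should skip most of the prefill. A rough sketch, where the repeated filler simply stands in for the repo context a coding agent would send:

```ts
// prefix-cache-check.ts — same long prefix twice; the second request should be
// noticeably faster because the prefill is served from the prefix cache.
const longContext = "function add(a, b) { return a + b }\n".repeat(2000); // stand-in for repo context

async function timedRequest(question: string): Promise<number> {
  const start = performance.now();
  await fetch("http://localhost:8000/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "MiniMax-M2.7",
      messages: [
        { role: "system", content: "You are a code assistant." },
        { role: "user", content: longContext + "\n\n" + question },
      ],
      max_tokens: 16,
    }),
  });
  return performance.now() - start;
}

console.log("cold prefill:", await timedRequest("What does add do?"), "ms");
console.log("warm prefill:", await timedRequest("What does add return?"), "ms");
```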
The honest caveat
The original author flagged this and it’s worth repeating: inference being local does not mean your data is local. Claude Code still phones home to Anthropic for auth, telemetry, and other client-side concerns even when every model call is going to your own server. If you have a hard requirement for fully air-gapped agentic coding — regulated industries, sensitive client codebases — this approach doesn’t get you there. OpenCode, Aider, and a handful of other open-source agents are cleaner on that axis, even if their UX isn’t quite as polished.
For most people though, the bottleneck isn’t telemetry — it’s cost and control over the model. And on those two, this setup is a real win.
What this unlocks if you’re already running self-hosted LLMs
This is where it gets interesting for me, because at QCall we already operate a self-hosted LLM stack in production — Qwen-class models on GCP, an OpenAI-compatible Node.js proxy in front of vLLM, PostgreSQL-backed token billing for tenants. The whole point of that infrastructure is to give clients a usable LLM API at predictable cost and with data residency they control.
Adding an Anthropic-shaped endpoint to a proxy like that is mostly a translation problem, not an inference problem. The two API shapes overlap heavily — the differences are mechanical:
- Anthropic uses a top-level `system` field; OpenAI puts the system message inside the messages array.
- Tool calls are serialized differently: Anthropic uses `tool_use` and `tool_result` content blocks; OpenAI uses `tool_calls` with `function` objects and a separate `tool` role.
- Stop reasons map differently — `end_turn`, `tool_use`, `max_tokens` on the Anthropic side; `stop`, `tool_calls`, `length` on the OpenAI side.
- Streaming events have different names and structures.
Once you handle those, your existing OpenAI-compatible proxy can answer to /v1/messages as well as /v1/chat/completions, and Claude Code can be pointed straight at your billing layer. Tenants get the agentic coding UX, you bill per token against your own GPU capacity, and you don’t have to write a coding agent.
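To make “mostly a translation problem” concrete, here is roughly what the request-side mapping looks like. This is a simplified sketch, not the proxy we actually run; it ignores tool blocks and streaming, which the real mapping has to handle as described above:

```ts
// anthropic-to-openai.ts — minimal request translation sketch (no tools, no streaming).
// Not production code: the real proxy also has to map tool_use/tool_result blocks
// and the streaming event formats described above.
interface AnthropicMessagesRequest {
  model: string;
  system?: string;
  max_tokens: number;
  messages: { role: "user" | "assistant"; content: string }[];
}

interface OpenAIChatRequest {
  model: string;
  max_tokens: number;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

function toOpenAI(req: AnthropicMessagesRequest): OpenAIChatRequest {
  return {
    model: req.model,
    max_tokens: req.max_tokens,
    messages: [
      // Anthropic's top-level `system` field becomes the first chat message.
      ...(req.system ? [{ role: "system" as const, content: req.system }] : []),
      ...req.messages,
    ],
  };
}

// Stop-reason mapping for the response path (OpenAI finish_reason -> Anthropic stop_reason).
const stopReasonMap: Record<string, string> = {
  stop: "end_turn",
  tool_calls: "tool_use",
  length: "max_tokens",
};
```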
What I’d watch out for if you try this
- Tool-call parser support is the gating factor. Don’t pick a model based on benchmark scores alone. If vLLM doesn’t have a parser for it, the agent loop won’t survive contact with reality. Check before you commit.
- Prefix caching is non-negotiable for coding agents. Without it, every turn feels glacial because the same 80K of context gets reprocessed from scratch.
- Watch your `--max-num-seqs`. Sixteen is fine for a single developer. If you’re putting this behind a multi-tenant proxy, you want to think hard about concurrency, queueing, and per-tenant rate limits — otherwise one heavy user starves everyone else.
- Tool-calling quality varies wildly across models. Even when the parser exists, some models hallucinate arguments, skip required fields, or refuse to call tools when they should. Burn a few hours on real tasks before declaring victory.
- The CLI is still proprietary. Anthropic could change the env-var contract or harden the client at any point. This is a useful workflow today, not a foundation to bet a product on.
The bigger picture
The pattern here — keep the great client, swap the model — is going to keep showing up. Coding agents, voice agents, document agents: the harness and the model are increasingly separable, and the open ecosystem is catching up fast on the model side while the closed ecosystem still has the edge on UX.
If you’re running self-hosted infrastructure already, this is one of those small wins that costs almost nothing to set up and gives you a genuinely better developer experience on your own hardware. Worth half an afternoon.