Claude Code has the best agentic coding UX in the business right now — the way it plans, edits, runs, and recovers from errors is genuinely a step above what most open-source coding agents pull off. The harness is the moat.
So here’s a fun question: can you keep the harness and swap the brain?
Turns out, yes. Credit to the original post that surfaced the trick: point Claude Code at a local vLLM server and let your own GPUs do the inference while Anthropic’s CLI does the orchestration. I’ve been digging into it from the angle of someone who already runs a self-hosted LLM stack in production, and it’s worth breaking down what’s actually happening, why the config matters, and what this unlocks if you’re building on top of self-hosted models.
The trick in one paragraph
Claude Code respects four environment variables that almost nobody talks about: ANTHROPIC_BASE_URL, ANTHROPIC_DEFAULT_OPUS_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, and ANTHROPIC_DEFAULT_HAIKU_MODEL. Set the base URL to your own server and remap the three model names to whatever your local server calls its model, and the CLI happily sends every request to your box instead of api.anthropic.com. It does not know it has been redirected.
The launch command ends up looking like this:
ANTHROPIC_BASE_URL=http://localhost:8000 \
ANTHROPIC_DEFAULT_OPUS_MODEL=MiniMax-M2.7 \
ANTHROPIC_DEFAULT_SONNET_MODEL=MiniMax-M2.7 \
ANTHROPIC_DEFAULT_HAIKU_MODEL=MiniMax-M2.7 \
claude
That is the whole client side. The interesting work is on the server.
The vLLM server, and why every flag matters
Here is the Docker Compose that the original post uses, running lukealonso/MiniMax-M2.7-NVFP4 on a dual RTX Pro 6000 box. They’re getting around 70 tokens/sec with a ~196K context window:
services:
  llm-server:
    image: vllm/vllm-openai:cu130-nightly
    container_name: minimax-m2.7-server
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - HF_HOME=/root/.cache/huggingface
      - NCCL_P2P_LEVEL=4
      - SAFETENSORS_FAST_GPU=1
      - VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
      - VLLM_USE_FLASHINFER_MOE_FP4=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
    ports:
      - "8000:8000"
    volumes:
      - $HF_HOME:/root/.cache/huggingface
    ipc: host
    command:
      - "lukealonso/MiniMax-M2.7-NVFP4"
      - "--trust-remote-code"
      - "--served-model-name"
      - "MiniMax-M2.7"
      - "--gpu-memory-utilization"
      - "0.95"
      - "--max-num-seqs"
      - "16"
      - "--enable-chunked-prefill"
      - "--enable-prefix-caching"
      - "--max-num-batched-tokens"
      - "16384"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "minimax_m2"
      - "--reasoning-parser"
      - "minimax_m2"
      - "--quantization"
      - "modelopt_fp4"
      - "--kv-cache-dtype"
      - "fp8"
      - "--dtype"
      - "auto"
      - "--attention-backend"
      - "FLASHINFER"
      - "--load-format"
      - "fastsafetensors"
      - "--tensor-parallel-size"
      - "2"
      - "--port"
      - "8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 100
      start_period: 300s
    networks:
      - llm-net

networks:
  llm-net:
    driver: bridge
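Once the container reports healthy, one sanity check is worth doing before launching the CLI: the string you pass to --served-model-name is exactly what the ANTHROPIC_DEFAULT_*_MODEL variables have to match. A minimal sketch against vLLM’s OpenAI-compatible model list (Node 18+ with global fetch assumed; adjust the host and model name for your setup):

```ts
// check-model.ts — confirm the served model name matches the env-var mapping.
// Assumes the vLLM container above is reachable on localhost:8000.
const BASE_URL = "http://localhost:8000";

const res = await fetch(`${BASE_URL}/v1/models`);
if (!res.ok) throw new Error(`server not ready: ${res.status}`);

const body = (await res.json()) as { data: { id: string }[] };
const names = body.data.map((m) => m.id);
console.log("served models:", names);

// This string must match ANTHROPIC_DEFAULT_{OPUS,SONNET,HAIKU}_MODEL exactly.
if (!names.includes("MiniMax-M2.7")) {
  throw new Error("MiniMax-M2.7 not served; check --served-model-name");
}
```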
Most tutorials just paste this kind of config and move on. I want to call out the parts that are doing real work, because if you tweak this for your own model you need to know which knobs actually matter.
The tool-call and reasoning parsers
This is the non-obvious bit, and it is the reason most “I pointed Claude Code at my local model and it kind of worked” attempts fall apart after a few turns:
--enable-auto-tool-choice
--tool-call-parser minimax_m2
--reasoning-parser minimax_m2
Claude Code is an agent. It expects the model to emit clean tool calls in a parseable format, then it executes those tool calls (file reads, edits, shell commands) and feeds results back. If the model mixes tool calls into prose, or wraps them in the wrong delimiters, the harness can’t extract them and the loop breaks. vLLM ships per-model parsers that know how to extract structured tool calls from the model’s raw output. If your model doesn’t have a parser, this whole approach falls over no matter how good the model is at coding.
So when you swap to a different model, the first thing to check is: does vLLM have a tool-call parser for it? Qwen2.5 and Qwen3 do. Llama 3.1+ does. MiniMax-M2 does (which is why this works). A lot of older or more obscure models don’t, and you’ll find that out the hard way.
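A cheap way to verify the parser before wiring up Claude Code is to send the OpenAI-compatible endpoint a request with a single tool defined and confirm the reply carries a structured tool_calls array rather than tool syntax dumped into the message text. A rough sketch, using a made-up read_file tool; only the shape of the response matters:

```ts
// tool-call-smoke-test.ts — verify the tool-call parser returns structured calls.
// The "read_file" tool is a hypothetical example for testing the response shape.
const res = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "MiniMax-M2.7",
    messages: [{ role: "user", content: "Read the file src/index.ts" }],
    tools: [{
      type: "function",
      function: {
        name: "read_file",
        description: "Read a file from the workspace",
        parameters: {
          type: "object",
          properties: { path: { type: "string" } },
          required: ["path"],
        },
      },
    }],
  }),
});

const data = await res.json();
const msg = data.choices[0].message;

// With a working parser you should see structured tool_calls here,
// not tool-call markup left inline in msg.content.
console.log(JSON.stringify(msg.tool_calls ?? msg.content, null, 2));
```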
The quantization and KV-cache stack
--quantization modelopt_fp4
--kv-cache-dtype fp8
--attention-backend FLASHINFER
This is how a big MoE model fits on two cards and still gives you ~196K context. NVFP4 weights, FP8 KV cache, and FlashInfer attention. The combination is what makes long-context agentic coding actually viable on a workstation — coding agents send huge prompts because they include file contents, repo structure, conversation history, and tool results. If your context is small or your KV cache is fat, you run out of headroom fast.
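The back-of-the-envelope sizing is worth doing for whatever model you run: KV cache per token is roughly 2 (K and V) × layers × KV heads × head dim × bytes per element, so FP8 halves it relative to FP16. The numbers below are illustrative placeholders, not MiniMax-M2’s actual architecture; read the real values from your model’s config.json:

```ts
// kv-cache-math.ts — rough KV-cache sizing. The architecture numbers below are
// ILLUSTRATIVE PLACEHOLDERS, not MiniMax-M2's real config; take them from your
// model's config.json (num_hidden_layers, num_key_value_heads, head_dim).
const layers = 60;        // placeholder
const kvHeads = 8;        // placeholder (GQA models keep this small)
const headDim = 128;      // placeholder
const bytesPerElem = 1;   // fp8 = 1 byte; fp16 would be 2

// K and V, per layer, per token
const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElem;

const contextTokens = 196_000;
const totalGiB = (bytesPerToken * contextTokens) / 1024 ** 3;

console.log(`${bytesPerToken} bytes/token -> ~${totalGiB.toFixed(1)} GiB for ${contextTokens} tokens`);
// Doubling bytesPerElem (fp16 KV cache) doubles that figure, which is often the
// difference between fitting next to the weights and not fitting at all.
```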
Chunked prefill + prefix caching
--enable-chunked-prefill
--enable-prefix-caching
These two flags matter specifically for agentic coding workloads more than for regular chat. Coding agents re-send almost-identical prompts on every turn — the system prompt, the tool definitions, the file context, the conversation so far, plus a small delta. Without prefix caching, vLLM reprocesses the entire 100K-token prompt every single turn. With it, the shared prefix is reused and only the new tokens get processed. This is the difference between “usable” and “watching paint dry.”
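You can see the effect directly rather than take it on faith: send the same long prefix twice and compare wall-clock time. With prefix caching on, the second request should skip most of the prefill. A rough sketch, where the repeated filler simply stands in for the repo context a coding agent would send:

```ts
// prefix-cache-check.ts — same long prefix twice; the second request should be
// noticeably faster because the prefill is served from the prefix cache.
const longContext = "function add(a, b) { return a + b }\n".repeat(2000); // stand-in for repo context

async function timedRequest(question: string): Promise<number> {
  const start = performance.now();
  await fetch("http://localhost:8000/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "MiniMax-M2.7",
      messages: [
        { role: "system", content: "You are a code assistant." },
        { role: "user", content: longContext + "\n\n" + question },
      ],
      max_tokens: 16,
    }),
  });
  return performance.now() - start;
}

console.log("cold prefill:", await timedRequest("What does add do?"), "ms");
console.log("warm prefill:", await timedRequest("What does add return?"), "ms");
```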
The honest caveat
The original author flagged this and it’s worth repeating: inference being local does not mean your data is local. Claude Code still phones home to Anthropic for auth, telemetry, and other client-side concerns even when every model call is going to your own server. If you have a hard requirement for fully air-gapped agentic coding — regulated industries, sensitive client codebases — this approach doesn’t get you there. OpenCode, Aider, and a handful of other open-source agents are cleaner on that axis, even if their UX isn’t quite as polished.
For most people though, the bottleneck isn’t telemetry — it’s cost and control over the model. And on those two, this setup is a real win.
What this unlocks if you’re already running self-hosted LLMs
This is where it gets interesting for me, because at QCall we already operate a self-hosted LLM stack in production — Qwen-class models on GCP, an OpenAI-compatible Node.js proxy in front of vLLM, PostgreSQL-backed token billing for tenants. The whole point of that infrastructure is to give clients a usable LLM API at predictable cost and with data residency they control.
Adding an Anthropic-shaped endpoint to a proxy like that is mostly a translation problem, not an inference problem. The two API shapes overlap heavily — the differences are mechanical:
- Anthropic uses a top-level `system` field; OpenAI puts the system message inside the messages array.
- Tool calls are serialized differently: Anthropic uses `tool_use` and `tool_result` content blocks; OpenAI uses `tool_calls` with `function` objects and a separate `tool` role.
- Stop reasons map differently — `end_turn`, `tool_use`, `max_tokens` on the Anthropic side; `stop`, `tool_calls`, `length` on the OpenAI side.
- Streaming events have different names and structures.
Once you handle those, your existing OpenAI-compatible proxy can answer to /v1/messages as well as /v1/chat/completions, and Claude Code can be pointed straight at your billing layer. Tenants get the agentic coding UX, you bill per token against your own GPU capacity, and you don’t have to write a coding agent.
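To make “mostly a translation problem” concrete, here is roughly what the request-side mapping looks like. This is a simplified sketch, not the proxy we actually run; it ignores tool blocks and streaming, which the real mapping has to handle as described above:

```ts
// anthropic-to-openai.ts — minimal request translation sketch (no tools, no streaming).
// Not production code: the real proxy also has to map tool_use/tool_result blocks
// and the streaming event formats described above.
interface AnthropicMessagesRequest {
  model: string;
  system?: string;
  max_tokens: number;
  messages: { role: "user" | "assistant"; content: string }[];
}

interface OpenAIChatRequest {
  model: string;
  max_tokens: number;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

function toOpenAI(req: AnthropicMessagesRequest): OpenAIChatRequest {
  return {
    model: req.model,
    max_tokens: req.max_tokens,
    messages: [
      // Anthropic's top-level `system` field becomes the first chat message.
      ...(req.system ? [{ role: "system" as const, content: req.system }] : []),
      ...req.messages,
    ],
  };
}

// Stop-reason mapping for the response path (OpenAI finish_reason -> Anthropic stop_reason).
const stopReasonMap: Record<string, string> = {
  stop: "end_turn",
  tool_calls: "tool_use",
  length: "max_tokens",
};
```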
What I’d watch out for if you try this
- Tool-call parser support is the gating factor. Don’t pick a model based on benchmark scores alone. If vLLM doesn’t have a parser for it, the agent loop won’t survive contact with reality. Check before you commit.
- Prefix caching is non-negotiable for coding agents. Without it, every turn feels glacial because the same 80K of context gets reprocessed from scratch.
- Watch your `--max-num-seqs`. Sixteen is fine for a single developer. If you’re putting this behind a multi-tenant proxy, you want to think hard about concurrency, queueing, and per-tenant rate limits — otherwise one heavy user starves everyone else.
- Tool-calling quality varies wildly across models. Even when the parser exists, some models hallucinate arguments, skip required fields, or refuse to call tools when they should. Burn a few hours on real tasks before declaring victory.
- The CLI is still proprietary. Anthropic could change the env-var contract or harden the client at any point. This is a useful workflow today, not a foundation to bet a product on.
The bigger picture
The pattern here — keep the great client, swap the model — is going to keep showing up. Coding agents, voice agents, document agents: the harness and the model are increasingly separable, and the open ecosystem is catching up fast on the model side while the closed ecosystem still has the edge on UX.
If you’re running self-hosted infrastructure already, this is one of those small wins that costs almost nothing to set up and gives you a genuinely better developer experience on your own hardware. Worth half an afternoon.