Qwen3-8B speculative decoding
This example sets up n-gram (prompt-lookup) speculative decoding and measures the performance gain. Speculation proposes several tokens per decode step and verifies them in one forward pass, so when the output repeats the prompt, most proposed tokens are accepted at once. On a copy-heavy workload the speculative engine reaches about 2.4 times the output token throughput of the same model without speculation, at about 40% of the time per output token. The benchmark below has the numbers.
This recipe was run end to end on GKE; the InferenceClass, InferenceCluster,
and ModelDeployment are the exact manifests from that run, which served a valid
completion and produced the benchmark below. The EKS platform shape is the standard
single-L4 recipe. It passes server validation but was not served in this run.
Apply the platform side first, then the ML side. The GKE InferenceCluster
carries a GCP project placeholder to edit before applying.
Setup
Qwen3-8B is an 8.2B dense chat model, served as one Standalone vLLM engine on a
single NVIDIA L4 with no cache and weights pulled straight from Hugging Face. The
deployment shape is incidental here. The speculative config is what matters.
Modelplane supports n-gram (prompt-lookup) speculative decoding, which proposes
tokens by matching the prompt and so needs no draft model or second set of weights.
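The lookup itself is simple enough to sketch in a few lines. What follows is a minimal illustration, not vLLM's implementation: it splits text on whitespace instead of using model tokens, the function name `ngram_propose` and its argument order are made up for this example, and picking the most recent earlier match is an arbitrary choice. It takes the longest suffix of the context (between the minimum and maximum length) that also occurs earlier, and proposes the words that followed that earlier occurrence.

```sh
#!/bin/sh
# ngram_propose CONTEXT MIN_N MAX_N K  (hypothetical helper, for illustration only)
# CONTEXT is the prompt plus everything generated so far, as whitespace-separated words.
# Prints up to K proposed words, or nothing when no suffix of MIN_N..MAX_N words matches.
ngram_propose() {
  printf '%s\n' "$1" | awk -v min_n="$2" -v max_n="$3" -v k="$4" '
    {
      # Collect the words of every input line into one array.
      for (i = 1; i <= NF; i++) tok[++n] = $i
    }
    END {
      # Try the longest suffix first, then shorter ones.
      for (len = max_n; len >= min_n; len--) {
        if (len >= n) continue
        # Scan earlier positions, most recent first.
        for (start = n - len; start >= 1; start--) {
          same = 1
          for (j = 0; j < len; j++) {
            if (tok[start + j] != tok[n - len + 1 + j]) { same = 0; break }
          }
          if (!same) continue
          # Propose the words that followed the earlier occurrence.
          out = ""
          for (p = start + len; p <= n && p < start + len + k; p++) {
            out = out (out == "" ? "" : " ") tok[p]
          }
          print out
          exit
        }
      }
    }'
}

# The prompt contains "total = 0 for item in items:"; the model has so far
# written "subtotal = 0 for". The suffix "= 0 for" matches the prompt.
ngram_propose 'total = 0 for item in items: subtotal = 0 for' 2 4 3
# prints: item in items:
```

The renamed variable breaks the 4-word match, so the sketch falls back to a 3-word suffix and still recovers the copied continuation. The engine then verifies such a proposal in one forward pass and keeps only the prefix the model agrees with.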
Platform
# InferenceClass for the L4 shape on EKS, validated serving Qwen3-8B with n-gram
# speculative decoding.
#
# One NVIDIA L4 on a g6.2xlarge. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod. n-gram speculation adds no
# weights (no draft model), so the L4's VRAM budget is the same as the plain
# Qwen3-8B recipe and the tier is reused unchanged.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-l4-1x-g6
spec:
  description: "EKS g6.2xlarge, 1x NVIDIA L4"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6.2xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
    - name: gpu
      claim: DRA
      driver: gpu.nvidia.com
      deviceClassName: gpu.nvidia.com
      count: 1
      attributes:
        architecture: { string: Ada Lovelace }
      capacity:
        # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
        # nominal 24GB.
        memory: { value: "23034Mi" }
# EKS InferenceCluster with one L4 node pool. No clusterSelector targets it; the
# ModelDeployment matches on device capacity alone, so it lands here or on any
# other compatible cluster in the fleet. Single node, no fabric or capacity
# reservation - a Standalone L4 shape that runs on the local kind control plane.
#
# minNodeCount is 1, not 0: the model claims its L4 through DRA, which binds
# against a per-node ResourceSlice the NVIDIA driver publishes only on a running
# GPU node. With zero GPU nodes there is no slice to match, so scale-up from zero
# stalls. Keeping one L4 warm avoids that cold-start (validated on GKE; the DRA
# claim is cloud-agnostic). Raise it to 2 to run the benchmark baseline too.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-l4-single
  labels:
    modelplane.ai/cloud: eks
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
  nodePools:
    - name: gpu-l4
      className: eks-l4-1x-g6
      nodeCount: 1
      zones:
        - us-west-2a
      minNodeCount: 1
      maxNodeCount: 4
# InferenceClass for the L4 shape on GKE, validated serving Qwen3-8B with n-gram
# speculative decoding.
#
# One NVIDIA L4 on a g2-standard-8. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod. n-gram speculation adds no
# weights (no draft model), so the L4's VRAM budget is the same as the plain
# Qwen3-8B recipe and the tier is reused unchanged.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-l4-1x-g2
spec:
  description: "GKE g2-standard-8, 1x NVIDIA L4"
  provisioning:
    provider: GKE
    gke:
      machineType: g2-standard-8
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
    - name: gpu
      claim: DRA
      driver: gpu.nvidia.com
      deviceClassName: gpu.nvidia.com
      count: 1
      attributes:
        architecture: { string: Ada Lovelace }
      capacity:
        # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
        # nominal 24GB.
        memory: { value: "23034Mi" }
# GKE InferenceCluster with one L4 node pool. Replace the project ID before
# applying. No clusterSelector targets it; the ModelDeployment matches on device
# capacity alone, so it lands here or on any other compatible cluster. Single
# node, no fabric or capacity reservation - a Standalone L4 shape that runs on
# the local kind control plane.
#
# minNodeCount is 1, not 0: the model claims its L4 through DRA, and a DRA claim
# can only bind against a ResourceSlice that the NVIDIA driver publishes per GPU
# node. With zero GPU nodes there is no slice, so the cluster autoscaler cannot
# tell that a new node would satisfy the claim and refuses to scale up from zero
# (it logs "cannot allocate all claims"). Keeping one L4 warm avoids that
# cold-start. Raise it to 2 to run the benchmark's baseline engine alongside.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gke-l4-single
  labels:
    modelplane.ai/cloud: gke
    modelplane.ai/region: us-central
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project  # Replace with your GCP project ID.
      region: us-central1
  nodePools:
    - name: gpu-l4
      className: gke-l4-1x-g2
      nodeCount: 1
      zones:
        - us-central1-a
      minNodeCount: 1
      maxNodeCount: 4
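# Set GCP_PROJECT to your own project ID first (the variable name is just the one used here).
curl -fsSL /examples/examples/qwen3-8b-speculative-decoding/inference-cluster-gke.yaml \
  | sed "s/my-gcp-project/${GCP_PROJECT:?set GCP_PROJECT to your project ID}/" \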
  | kubectl apply -f -
Deployment
# Qwen3-8B served on a single NVIDIA L4 by vLLM with n-gram (prompt-lookup)
# speculative decoding, validated end to end (the model layer is cloud-agnostic;
# the same manifest serves on EKS and GKE).
#
# n-gram speculation proposes the next tokens by matching a short suffix of what
# has been generated so far against earlier text in the prompt, then verifies the
# guess in one forward pass. It needs no draft model and no second set of weights,
# so it stays a single Standalone engine with no ModelCache. That is deliberate:
# Modelplane cannot yet stage a separate draft model on cache (modelplaneai/
# modelplane#281), so this is the speculative flavor that works today.
#
# --speculative-config
#     method=ngram with num_speculative_tokens=5 proposes up to 5 tokens per
#     step; prompt_lookup_min/max=2..4 set the n-gram suffix lengths matched
#     against the prompt. It pays off only when output repeats the input, e.g.
#     editing a pasted code block, where most output tokens are copied verbatim.
# --default-chat-template-kwargs
#     Turns thinking off. Qwen3 thinks by default, and a <think> block is novel
#     text absent from the prompt, so prompt-lookup cannot accelerate it. Off,
#     the output is mostly the copied code, which is exactly what n-gram speeds
#     up.
# --max-model-len / --gpu-memory-utilization
#     L4 fit, not correctness. n-gram adds only a small proposal buffer, no
#     weights, so the budget matches the plain Qwen3-8B recipe.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b-spec
  namespace: ml-team
spec:
  # One replica, matched to any compatible InferenceCluster by device capacity.
  replicas: 1
  engines:
    - name: qwen3-8b-spec
      members:
        # A single self-contained vLLM pod. The container named "engine" is the
        # inference server; its image and args pass through verbatim.
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  # An 8B model needs most of an L4. >=20Gi selects the L4 (which reports
                  # ~23Gi) without over-constraining. DRA evaluates this CEL against the
                  # InferenceClass device, then against the GPU's ResourceSlice on bind.
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - "--model=Qwen/Qwen3-8B"
                    # The id clients pass as "model" in OpenAI requests.
                    - "--served-model-name=qwen3-8b-spec"
                    # Cap the context so the KV cache fits beside the weights on the L4.
                    - "--max-model-len=16384"
                    - "--gpu-memory-utilization=0.92"
                    # Enable n-gram speculative decoding (no draft model, no cache).
                    - "--speculative-config={\"method\": \"ngram\", \"num_speculative_tokens\": 5, \"prompt_lookup_max\": 4, \"prompt_lookup_min\": 2}"
                    # Thinking off, so output copies the prompt and prompt-lookup pays off.
                    - "--default-chat-template-kwargs={\"enable_thinking\": false}"
# Exposes the qwen3-8b-spec deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from
# status.address:
# kubectl get ms qwen3-8b-spec -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b-spec
  namespace: ml-team
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen3-8b-spec
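The `>=20Gi` selector in the ModelDeployment can be sanity-checked against the capacity the InferenceClass declares. The sketch below redoes that comparison in plain shell arithmetic by converting both quantities to MiB. The helper name `to_mib` is made up for this example and handles only the Mi and Gi suffixes that appear in this recipe; the real comparison is done by the CEL expression, not by a script like this.

```sh
#!/bin/sh
# to_mib QUANTITY  (hypothetical helper): convert an Mi- or Gi-suffixed quantity to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

reported=$(to_mib 23034Mi)   # capacity declared in the InferenceClass
required=$(to_mib 20Gi)      # threshold in the ModelDeployment's CEL selector

if [ "$reported" -ge "$required" ]; then
  echo "match: ${reported}Mi >= ${required}Mi"
else
  echo "no match: ${reported}Mi < ${required}Mi"
fi
# prints: match: 23034Mi >= 20480Mi
```

The reported 23034Mi is about 22.5Gi, so a threshold written as the nominal 24Gi (24576Mi) would not match an L4. That is why the selector leaves headroom below the reported value.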
Speculation is active when the engine logs its SpeculativeConfig at startup
(method='ngram'). n-gram pays off when output copies the input, so the call
below pastes a code block and asks for a small edit. Most output tokens are then
matched straight from the prompt:
ADDR=$(kubectl get ms qwen3-8b-spec -n ml-team -o jsonpath='{.status.address}')
curl -s "$ADDR/v1/chat/completions" -H 'Content-Type: application/json' -d '{
  "model": "qwen3-8b-spec",
  "messages": [{"role":"user","content":"Return this Python function unchanged except rename the variable `total` to `subtotal`. Output only the code.\n\ndef cart(items):\n    total = 0\n    for item in items:\n        total += item.price\n    return total"}],
  "max_tokens": 200, "temperature": 0 }'
Benchmark
Speculation is a speed optimization, so the example measures it rather than
asserting it. The comparison deploys a second engine, qwen3-8b-base, that is
identical to the speculative one except that it drops --speculative-config, then
runs the same workload against both on identical L4 hardware. The
InferenceCluster autoscales to a second node, so each engine gets its own L4.
# Baseline Qwen3-8B for the speculative-decoding benchmark: the same model on the
# same L4, identical to model-deployment.yaml except for its name, its served
# model name, and the dropped --speculative-config. Deploy it alongside the
# speculative engine to measure the decode-speed delta on copy-heavy output, then
# delete it. It is not part of serving, only of the comparison.
#
# Keeping every other flag identical (image, --max-model-len, thinking off) is
# what makes the benchmark fair: the only variable between the two engines is
# n-gram speculation.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b-base
  namespace: ml-team
spec:
  replicas: 1
  engines:
    - name: qwen3-8b-base
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  # Same >=20Gi L4 match as the speculative engine, so both land on an
                  # identical NVIDIA L4. The InferenceCluster autoscales to a second node.
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - "--model=Qwen/Qwen3-8B"
                    - "--served-model-name=qwen3-8b-base"
                    - "--max-model-len=16384"
                    - "--gpu-memory-utilization=0.92"
                    # No --speculative-config: this is the unaccelerated baseline.
                    - "--default-chat-template-kwargs={\"enable_thinking\": false}"
# Exposes the qwen3-8b-base deployment as an OpenAI-compatible URL, so the
# benchmark can hit the baseline engine the same way it hits the speculative one.
# Read its address from status.address:
# kubectl get ms qwen3-8b-base -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b-base
  namespace: ml-team
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen3-8b-base
The workload is the case n-gram accelerates. Each prompt pastes a code block and
asks for a small rename. The model then reproduces the block almost verbatim, so
most output tokens are matched from the prompt. A generic chat dataset would show
no gain, because nothing in the output repeats the input. Drive both engines with
vllm bench serve at --max-concurrency 1, where decode latency dominates and
speculation has the most to give:
# A throwaway client pod with the vLLM CLI (the benchmark client needs no GPU).
kubectl run bench -n ml-team --image=vllm/vllm-openai:v0.23.0 \
  --restart=Never --command -- sleep inf
kubectl wait --for=condition=Ready pod/bench -n ml-team --timeout=10m
# The base image ships vLLM but not its bench extras; add the one the custom
# dataset loader needs.
kubectl exec -n ml-team bench -- pip install -q pandas
# Copy-heavy dataset: code blocks paired with a small edit instruction.
kubectl exec -n ml-team bench -- sh -c 'cat > /tmp/edits.jsonl <<"EOF"
{"prompt": "Return this Python function unchanged except rename `total` to `subtotal`. Output only code.\n\ndef cart(items):\n    total = 0\n    for item in items:\n        total += item.price\n    return total"}
{"prompt": "Return this Python function unchanged except rename `n` to `count`. Output only code.\n\ndef fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"}
{"prompt": "Return this Python class unchanged except rename `items` to `entries`. Output only code.\n\nclass Stack:\n    def __init__(self):\n        self.items = []\n    def push(self, x):\n        self.items.append(x)\n    def pop(self):\n        return self.items.pop()"}
EOF'
# Benchmark each engine. --model must match its served-model-name.
for svc in qwen3-8b-base qwen3-8b-spec; do
  ADDR=$(kubectl get ms "$svc" -n ml-team -o jsonpath='{.status.address}')
  echo "== $svc =="
  kubectl exec -n ml-team bench -- vllm bench serve \
    --backend openai-chat --endpoint /v1/chat/completions \
    --base-url "$ADDR" --model "$svc" --tokenizer Qwen/Qwen3-8B \
    --dataset-name custom --dataset-path /tmp/edits.jsonl \
    --custom-output-len 256 --num-prompts 30 --max-concurrency 1 --seed 0
done
--model is the served-model-name (what the request carries); --tokenizer is the
real Hugging Face repository, which the client needs to count tokens locally. Without it,
vllm bench serve tries to fetch the served name from the Hub and fails.
Measured on a single L4 per engine (vllm/vllm-openai:v0.23.0, Qwen3-8B, 30
copy-heavy prompts, concurrency 1, identical 1045 generated tokens each run):
| Metric | Baseline | n-gram speculative |
|---|---|---|
| Output token throughput (tok/s) | 16.10 | 39.01 |
| Mean TPOT (ms/token) | 60.20 | 24.21 |
| Wall-clock for 30 requests (seconds) | 64.91 | 26.78 |
Speculation cuts the time per output token by about 60% and lifts output throughput
about 2.4 times, because most output tokens are copied from the prompt and get
verified together in one step. For the speculative run vllm bench serve reports a
65% draft acceptance rate and a mean acceptance length of 4.27. With 5 proposed
tokens per step, a 65% acceptance rate is about 3.3 accepted draft tokens; the
mean acceptance length also counts the one token the verifying forward pass
produces itself, so each step commits a little over 4 tokens. Inter-token latency
(ITL) is not a useful lens here: speculation emits tokens in bursts, so ITL stays
flat even as TPOT and throughput improve.
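The headline ratios follow directly from the table, and the arithmetic is quick to redo. The sketch below recomputes them with awk from the measured values. The last line assumes that the mean acceptance length counts one token from the verifying pass on top of the accepted draft tokens; that is an assumption about how the metric is defined, worth checking against the vLLM version in use. The function name `bench_ratios` is made up for this example.

```sh
#!/bin/sh
# bench_ratios (hypothetical helper): recompute the ratios from the measured table values.
bench_ratios() {
  awk 'BEGIN {
    base_tps  = 16.10; spec_tps  = 39.01   # output token throughput, tok/s
    base_tpot = 60.20; spec_tpot = 24.21   # mean TPOT, ms/token
    base_wall = 64.91; spec_wall = 26.78   # wall-clock for 30 requests, seconds
    accept_rate = 0.65; k = 5              # draft acceptance rate, num_speculative_tokens

    printf "throughput speedup: %.2fx\n", spec_tps / base_tps
    printf "TPOT ratio: %.2f\n", spec_tpot / base_tpot
    printf "wall-clock speedup: %.2fx\n", base_wall / spec_wall
    printf "accepted draft tokens per step: %.2f\n", accept_rate * k
    # Assumption: acceptance length = accepted draft tokens + 1 verifier token.
    printf "tokens committed per step: %.2f\n", 1 + accept_rate * k
  }'
}

bench_ratios
# prints:
# throughput speedup: 2.42x
# TPOT ratio: 0.40
# wall-clock speedup: 2.42x
# accepted draft tokens per step: 3.25
# tokens committed per step: 4.25
```

The 4.25 derived from the rounded 65% rate sits close to the reported 4.27, which is consistent with that reading of the metric.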
The engine logs the same accept stats live during a run:
kubectl logs -n ml-team -l modelplane.ai/deployment=qwen3-8b-spec \
  | grep "SpecDecoding metrics"
Delete the baseline once measured; it exists only for the comparison:
kubectl delete -f docs/manifests/examples/qwen3-8b-speculative-decoding/model-service-baseline.yaml \
  -f docs/manifests/examples/qwen3-8b-speculative-decoding/model-deployment-baseline.yaml
kubectl delete pod bench -n ml-team