Qwen3-8B speculative decoding

This example sets up n-gram (prompt-lookup) speculative decoding and measures the performance gain. Speculation proposes several tokens per decode step and verifies them in one forward pass, so when the output repeats the prompt, most proposed tokens are accepted at once. On a copy-heavy workload the speculative engine reaches about 2.4 times the output token throughput of the same model without speculation, at about 40% of the time per output token. The benchmark below has the numbers.

This recipe was run end to end on GKE; the InferenceClass, InferenceCluster, and ModelDeployment are the exact manifests from that run, which served a valid completion and produced the benchmark below. The EKS platform shape is the standard single-L4 recipe; it passes server validation but was not served in this run. Apply the platform side first, then the ML side. Before applying the GKE InferenceCluster, replace its GCP project placeholder.

Setup

Qwen3-8B is an 8.2B dense chat model, served as one Standalone vLLM engine on a single NVIDIA L4 with no cache and weights pulled straight from Hugging Face. The deployment shape is incidental here. The speculative config is what matters. Modelplane supports n-gram (prompt-lookup) speculative decoding, which proposes tokens by matching the prompt and so needs no draft model or second set of weights.
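The propose-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration, not vLLM's implementation: `propose_ngram`, `verify_greedy`, and the toy `next_token` model are hypothetical names, tokens are plain integers, and verification calls the model once per position for readability, where a real engine scores all positions in one batched forward pass.

```python
from typing import Callable, List, Sequence


def propose_ngram(context: Sequence[int], min_n: int = 2, max_n: int = 4,
                  k: int = 5) -> List[int]:
    """Propose up to k tokens that followed an earlier copy of the context's suffix."""
    # Try the longest suffix first: a longer match is a stronger sign of copying.
    for n in range(min(max_n, len(context) - 1), min_n - 1, -1):
        suffix = list(context[-n:])
        # Stop before the suffix's own position so it cannot match itself.
        for start in range(len(context) - n):
            if list(context[start:start + n]) == suffix:
                return list(context[start + n:start + n + k])
    return []  # no match: this step decodes a single token as usual


def verify_greedy(context: Sequence[int], draft: Sequence[int],
                  next_token: Callable[[List[int]], int]) -> List[int]:
    """Keep the draft prefix the model agrees with, plus one token of its own."""
    emitted: List[int] = []
    for tok in draft:
        target = next_token(list(context) + emitted)
        if target != tok:
            emitted.append(target)  # the model's correction ends the step
            return emitted
        emitted.append(tok)
    # Whole draft accepted; the model still contributes one extra token.
    emitted.append(next_token(list(context) + emitted))
    return emitted


# Toy "model" (hypothetical) that copies tokens 3..8 from the prompt, then emits 0.
reference = [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 0]


def next_token(seq: List[int]) -> int:
    return reference[len(seq)]


context = reference[:10]                 # prompt plus two generated tokens
draft = propose_ngram(context)           # suffix [3, 4] matches the prompt
emitted = verify_greedy(context, draft, next_token)
print(draft, emitted)
```

In this toy run four of the five drafted tokens are accepted and the model's own token replaces the fifth, so one step emits five tokens instead of one. The defaults mirror the recipe's prompt_lookup_min, prompt_lookup_max, and num_speculative_tokens values.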

Platform

# InferenceClass for the L4 shape on EKS. It passes server validation; the
# Qwen3-8B n-gram speculative decoding run itself was served on GKE.
#
# One NVIDIA L4 on a g6.2xlarge. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod. n-gram speculation adds no
# weights (no draft model), so the L4's VRAM budget is the same as the plain
# Qwen3-8B recipe and the tier is reused unchanged.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-l4-1x-g6
spec:
  description: "EKS g6.2xlarge, 1x NVIDIA L4"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6.2xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }

# EKS InferenceCluster with one L4 node pool. No clusterSelector targets it; the
# ModelDeployment matches on device capacity alone, so it lands here or on any
# other compatible cluster in the fleet. Single node, no fabric or capacity
# reservation - a Standalone L4 shape that runs on the local kind control plane.
#
# minNodeCount is 1, not 0: the model claims its L4 through DRA, which binds
# against a per-node ResourceSlice the NVIDIA driver publishes only on a running
# GPU node. With zero GPU nodes there is no slice to match, so scale-up from zero
# stalls. Keeping one L4 warm avoids that cold-start (validated on GKE; the DRA
# claim is cloud-agnostic). Raise it to 2 to run the benchmark baseline too.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-l4-single
  labels:
    modelplane.ai/cloud: eks
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
  nodePools:
  - name: gpu-l4
    className: eks-l4-1x-g6
    nodeCount: 1
    zones:
    - us-west-2a
    minNodeCount: 1
    maxNodeCount: 4

# InferenceClass for the L4 shape on GKE, validated serving Qwen3-8B with n-gram
# speculative decoding.
#
# One NVIDIA L4 on a g2-standard-8. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod. n-gram speculation adds no
# weights (no draft model), so the L4's VRAM budget is the same as the plain
# Qwen3-8B recipe and the tier is reused unchanged.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-l4-1x-g2
spec:
  description: "GKE g2-standard-8, 1x NVIDIA L4"
  provisioning:
    provider: GKE
    gke:
      machineType: g2-standard-8
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }

# GKE InferenceCluster with one L4 node pool. Replace the project ID before
# applying. No clusterSelector targets it; the ModelDeployment matches on device
# capacity alone, so it lands here or on any other compatible cluster. Single
# node, no fabric or capacity reservation - a Standalone L4 shape that runs on
# the local kind control plane.
#
# minNodeCount is 1, not 0: the model claims its L4 through DRA, and a DRA claim
# can only bind against a ResourceSlice that the NVIDIA driver publishes per GPU
# node. With zero GPU nodes there is no slice, so the cluster autoscaler cannot
# tell that a new node would satisfy the claim and refuses to scale up from zero
# (it logs "cannot allocate all claims"). Keeping one L4 warm avoids that
# cold-start. Raise it to 2 to run the benchmark's baseline engine alongside.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gke-l4-single
  labels:
    modelplane.ai/cloud: gke
    modelplane.ai/region: us-central
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project  # Replace with your GCP project ID.
      region: us-central1
  nodePools:
  - name: gpu-l4
    className: gke-l4-1x-g2
    nodeCount: 1
    zones:
    - us-central1-a
    minNodeCount: 1
    maxNodeCount: 4

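# PROJECT_ID is a placeholder variable; set it to your GCP project ID. The sed
# swaps it in for the my-gcp-project placeholder before the manifest is applied.
PROJECT_ID=your-project-id
curl -fsSL /examples/examples/qwen3-8b-speculative-decoding/inference-cluster-gke.yaml \
  | sed "s/my-gcp-project/${PROJECT_ID}/" \
  | kubectl apply -f -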

Deployment

# Qwen3-8B served on a single NVIDIA L4 by vLLM with n-gram (prompt-lookup)
# speculative decoding, validated end to end on GKE. The model layer is
# cloud-agnostic, so the same manifest is meant to serve on EKS as well.
#
# n-gram speculation proposes the next tokens by matching a short suffix of what
# has been generated so far against earlier text in the prompt, then verifies the
# guess in one forward pass. It needs no draft model and no second set of weights,
# so it stays a single Standalone engine with no ModelCache. That is deliberate:
# Modelplane cannot yet stage a separate draft model on cache (modelplaneai/
# modelplane#281), so this is the speculative flavor that works today.
#
#   --speculative-config            method=ngram with num_speculative_tokens=5
#                                   proposes up to 5 tokens per step;
#                                   prompt_lookup_min/max=2..4 set the n-gram
#                                   suffix lengths matched against the prompt.
#                                   It pays off only when output repeats the
#                                   input - e.g. editing a pasted code block,
#                                   where most output tokens are copied verbatim.
#   --default-chat-template-kwargs  turns thinking off. Qwen3 thinks by default,
#                                   and a <think> block is novel text absent from
#                                   the prompt, so prompt-lookup cannot accelerate
#                                   it. Off, the output is mostly the copied code,
#                                   which is exactly what n-gram speeds up.
#   --max-model-len / --gpu-memory-utilization  L4 fit, not correctness. n-gram
#                                   adds only a small proposal buffer, no weights,
#                                   so the budget matches the plain Qwen3-8B recipe.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b-spec
  namespace: ml-team
spec:
  # One replica, matched to any compatible InferenceCluster by device capacity.
  replicas: 1
  engines:
  - name: qwen3-8b-spec
    members:
    # A single self-contained vLLM pod. The container named "engine" is the
    # inference server; its image and args pass through verbatim.
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # An 8B model needs most of an L4. >=20Gi selects the L4 (which reports
          # 23034Mi, about 22.5Gi) without over-constraining. DRA evaluates this
          # CEL against the InferenceClass device, then against the GPU's
          # ResourceSlice on bind.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen3-8B"
            # The id clients pass as "model" in OpenAI requests.
            - "--served-model-name=qwen3-8b-spec"
            # Cap the context so the KV cache fits beside the weights on the L4.
            - "--max-model-len=16384"
            - "--gpu-memory-utilization=0.92"
            # Enable n-gram speculative decoding (no draft model, no cache).
            - "--speculative-config={\"method\": \"ngram\", \"num_speculative_tokens\": 5, \"prompt_lookup_max\": 4, \"prompt_lookup_min\": 2}"
            # Thinking off, so output copies the prompt and prompt-lookup pays off.
            - "--default-chat-template-kwargs={\"enable_thinking\": false}"

# Exposes the qwen3-8b-spec deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from
# status.address:
#   kubectl get ms qwen3-8b-spec -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b-spec
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b-spec
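The L4 fit behind the deployment's device selector, --max-model-len, and --gpu-memory-utilization can be checked with rough arithmetic. The sketch below assumes bf16 weights (2 bytes per parameter) and a KV layout of 36 layers, 8 KV heads, and head dimension 128 for Qwen3-8B. Those figures are assumptions not stated in this recipe, and the estimate ignores activations and runtime overhead.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

usable_vram = 23034 * MIB        # capacity declared in the InferenceClass
selector_floor = 20 * GIB        # the CEL selector's >= 20Gi threshold
matches_selector = usable_vram >= selector_floor

budget = 0.92 * usable_vram      # --gpu-memory-utilization=0.92
weights = 8.2e9 * 2              # ASSUMPTION: bf16, 2 bytes per parameter

# ASSUMPTION: 36 layers, 8 KV heads, head dim 128, K and V both stored in bf16.
kv_bytes_per_token = 36 * 8 * 128 * 2 * 2
kv_full_context = 16384 * kv_bytes_per_token   # --max-model-len=16384

headroom = budget - weights - kv_full_context

print(f"usable VRAM      {usable_vram / GIB:6.2f} GiB (selector match: {matches_selector})")
print(f"engine budget    {budget / GIB:6.2f} GiB")
print(f"weights          {weights / GIB:6.2f} GiB")
print(f"KV, one full seq {kv_full_context / GIB:6.2f} GiB")
print(f"headroom         {headroom / GIB:6.2f} GiB")
```

Under these assumptions the weights take about 15.3 GiB of a roughly 20.7 GiB budget, and one full 16384-token sequence needs 2.25 GiB of KV cache, which is why the context is capped rather than left at the model's maximum.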

Speculation is active when the engine logs its SpeculativeConfig at startup (method='ngram'). n-gram pays off when output copies the input, so the call below pastes a code block and asks for a small edit; most output tokens are then matched straight from the prompt:

ADDR=$(kubectl get ms qwen3-8b-spec -n ml-team -o jsonpath='{.status.address}')
curl -s "$ADDR/v1/chat/completions" -H 'Content-Type: application/json' -d '{
  "model": "qwen3-8b-spec",
  "messages": [{"role":"user","content":"Return this Python function unchanged except rename the variable `total` to `subtotal`. Output only the code.\n\ndef cart(items):\n    total = 0\n    for item in items:\n        total += item.price\n    return total"}],
  "max_tokens": 200, "temperature": 0 }'

Benchmark

Speculation is a speed optimization, so the example measures it rather than asserting it. The comparison deploys a second engine, qwen3-8b-base, that is identical to the speculative one except that it drops --speculative-config, then runs the same workload against both on identical L4 hardware. The InferenceCluster autoscales to a second node, so each engine gets its own L4.

# Baseline Qwen3-8B for the speculative-decoding benchmark: the same model on the
# same L4, identical to model-deployment.yaml apart from its names and the dropped
# --speculative-config. Deploy it alongside the speculative engine to measure the
# decode-speed delta on copy-heavy output, then delete it - it is not part of
# serving, only of the comparison.
#
# Keeping every other flag identical (image, --max-model-len, thinking off) is
# what makes the benchmark fair: the only variable between the two engines is
# n-gram speculation.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b-base
  namespace: ml-team
spec:
  replicas: 1
  engines:
  - name: qwen3-8b-base
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # Same >=20Gi L4 match as the speculative engine, so both land on an
          # identical NVIDIA L4. The InferenceCluster autoscales to a second node.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen3-8B"
            - "--served-model-name=qwen3-8b-base"
            - "--max-model-len=16384"
            - "--gpu-memory-utilization=0.92"
            # No --speculative-config: this is the unaccelerated baseline.
            - "--default-chat-template-kwargs={\"enable_thinking\": false}"

# Exposes the qwen3-8b-base deployment as an OpenAI-compatible URL, so the
# benchmark can hit the baseline engine the same way it hits the speculative one.
# Read its address from status.address:
#   kubectl get ms qwen3-8b-base -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b-base
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b-base
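A quick way to confirm that the two engines differ only where intended is to diff their args. The sketch below copies both arg lists from the manifests on this page; `flags` and `diff_flags` are hypothetical helper names, not part of Modelplane or vLLM.

```python
from typing import Dict, List, Optional, Tuple

SPEC_ARGS = [
    "--model=Qwen/Qwen3-8B",
    "--served-model-name=qwen3-8b-spec",
    "--max-model-len=16384",
    "--gpu-memory-utilization=0.92",
    '--speculative-config={"method": "ngram", "num_speculative_tokens": 5, '
    '"prompt_lookup_max": 4, "prompt_lookup_min": 2}',
    '--default-chat-template-kwargs={"enable_thinking": false}',
]

BASE_ARGS = [
    "--model=Qwen/Qwen3-8B",
    "--served-model-name=qwen3-8b-base",
    "--max-model-len=16384",
    "--gpu-memory-utilization=0.92",
    '--default-chat-template-kwargs={"enable_thinking": false}',
]


def flags(args: List[str]) -> Dict[str, str]:
    """Split each --flag=value argument into a flag name and its value."""
    return dict(arg.split("=", 1) for arg in args)


def diff_flags(a: List[str],
               b: List[str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return the flags whose values differ, with None for a missing flag."""
    fa, fb = flags(a), flags(b)
    return {name: (fa.get(name), fb.get(name))
            for name in sorted(fa.keys() | fb.keys())
            if fa.get(name) != fb.get(name)}


for name, (spec, base) in diff_flags(SPEC_ARGS, BASE_ARGS).items():
    print(f"{name}: spec={spec!r} base={base!r}")
```

Only --served-model-name and --speculative-config show up in the diff. The served name is a label, so n-gram speculation is the one behavioral difference between the engines.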

The workload is the case n-gram accelerates. Each prompt pastes a code block and asks for a small rename. The model then reproduces the block almost verbatim, so most output tokens are matched from the prompt. A generic chat dataset would show little or no gain, because little of the output repeats the input. Drive both engines with vllm bench serve at --max-concurrency 1, where decode latency dominates and speculation has the most to give:

# A throwaway client pod with the vLLM CLI (the benchmark client needs no GPU).
kubectl run bench -n ml-team --image=vllm/vllm-openai:v0.23.0 \
  --restart=Never --command -- sleep inf
kubectl wait --for=condition=Ready pod/bench -n ml-team --timeout=10m
# The base image ships vLLM but not its bench extras; add the one the custom
# dataset loader needs.
kubectl exec -n ml-team bench -- pip install -q pandas

# Copy-heavy dataset: code blocks paired with a small edit instruction.
kubectl exec -n ml-team bench -- sh -c 'cat > /tmp/edits.jsonl <<"EOF"
{"prompt": "Return this Python function unchanged except rename `total` to `subtotal`. Output only code.\n\ndef cart(items):\n    total = 0\n    for item in items:\n        total += item.price\n    return total"}
{"prompt": "Return this Python function unchanged except rename `n` to `count`. Output only code.\n\ndef fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"}
{"prompt": "Return this Python class unchanged except rename `items` to `entries`. Output only code.\n\nclass Stack:\n    def __init__(self):\n        self.items = []\n    def push(self, x):\n        self.items.append(x)\n    def pop(self):\n        return self.items.pop()"}
EOF'

# Benchmark each engine. --model must match its served-model-name.
for svc in qwen3-8b-base qwen3-8b-spec; do
  ADDR=$(kubectl get ms "$svc" -n ml-team -o jsonpath='{.status.address}')
  echo "== $svc =="
  kubectl exec -n ml-team bench -- vllm bench serve \
    --backend openai-chat --endpoint /v1/chat/completions \
    --base-url "$ADDR" --model "$svc" --tokenizer Qwen/Qwen3-8B \
    --dataset-name custom --dataset-path /tmp/edits.jsonl \
    --custom-output-len 256 --num-prompts 30 --max-concurrency 1 --seed 0
done

--model is the served-model-name (what the request carries); --tokenizer is the real Hugging Face repository, which the client needs to count tokens locally. Without it, vllm bench serve tries to fetch the served name from the Hub and fails.
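Before benchmarking a different dataset, it can help to estimate how copy-heavy it is. The sketch below is a rough proxy only: `copy_ratio` is a hypothetical helper, it splits on whitespace instead of using the model's tokenizer, and the expected outputs are written by hand, not generated by the model.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copy_ratio(prompt: str, output: str, n: int = 2) -> float:
    """Fraction of the output's n-grams that also occur in the prompt."""
    prompt_grams = ngrams(prompt.split(), n)
    out_tokens = output.split()
    out_grams = [tuple(out_tokens[i:i + n])
                 for i in range(len(out_tokens) - n + 1)]
    if not out_grams:
        return 0.0
    return sum(g in prompt_grams for g in out_grams) / len(out_grams)


prompt = (
    "Return this Python function unchanged except rename `total` to `subtotal`. "
    "Output only code.\n\ndef cart(items):\n    total = 0\n"
    "    for item in items:\n        total += item.price\n    return total"
)
# Hand-written stand-ins for what a model might return.
edited = (
    "def cart(items):\n    subtotal = 0\n    for item in items:\n"
    "        subtotal += item.price\n    return subtotal"
)
chat = "Sure! Here is a short poem about autumn leaves falling gently."

print(f"edit task: {copy_ratio(prompt, edited):.2f}")
print(f"chat task: {copy_ratio(prompt, chat):.2f}")
```

A high ratio suggests the n-gram proposer will find matches to draft from; a ratio near zero suggests speculation will mostly miss. The default n=2 matches the recipe's prompt_lookup_min.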

Measured on a single L4 per engine (vllm/vllm-openai:v0.23.0, Qwen3-8B, 30 copy-heavy prompts, concurrency 1, identical 1045 generated tokens each run):

Metric	Baseline	n-gram speculative
Output token throughput (tok/s)	16.10	39.01
Mean TPOT (ms/token)	60.20	24.21
Wall-clock for 30 requests (seconds)	64.91	26.78
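The table's figures can be cross-checked against each other. This short sketch uses only the numbers reported above and recomputes the ratios and the tokens-per-second implied by the wall-clock time.

```python
TOKENS = 1045  # generated tokens per run, as reported above

baseline = {"throughput": 16.10, "tpot_ms": 60.20, "wall_s": 64.91}
spec = {"throughput": 39.01, "tpot_ms": 24.21, "wall_s": 26.78}

speedup = spec["throughput"] / baseline["throughput"]
tpot_ratio = spec["tpot_ms"] / baseline["tpot_ms"]
wall_ratio = baseline["wall_s"] / spec["wall_s"]

# Throughput should equal generated tokens divided by wall-clock time.
implied_base = TOKENS / baseline["wall_s"]
implied_spec = TOKENS / spec["wall_s"]

print(f"throughput speedup: {speedup:.2f}x")
print(f"TPOT ratio:         {tpot_ratio:.2f}")
print(f"wall-clock speedup: {wall_ratio:.2f}x")
print(f"implied tok/s:      {implied_base:.2f} baseline, {implied_spec:.2f} speculative")
```

Throughput and wall-clock time agree on a speedup of about 2.4, and the speculative TPOT is about 0.40 of the baseline. TPOT improves slightly more than throughput, which is consistent with TPOT excluding the time to first token while throughput covers the whole request.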

Speculation cuts the time per output token by about 60% and lifts output throughput about 2.4 times, because most output tokens are copied from the prompt and get verified together in one step. For the speculative run vllm bench serve reports a 65% draft acceptance rate and a mean acceptance length of 4.27. The two figures agree if the acceptance length includes the one token the verifier produces itself on every step: 65% of 5 proposed tokens is about 3.3 accepted, plus 1, so each verification step emits about 4.3 tokens. Inter-token latency (ITL) is not a useful lens here: speculation emits tokens in bursts, so ITL stays flat even as TPOT and throughput improve.
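The gap between ITL and TPOT under bursty output can be illustrated with a toy stream. The sketch assumes the client records one ITL sample per streamed chunk and computes TPOT as (latency - TTFT) / (output tokens - 1). The 60 ms step time and the burst size of 4 are hypothetical values chosen for illustration, not measurements from this run.

```python
from typing import Tuple


def stream_metrics(step_ms: float, tokens_per_step: int,
                   steps: int) -> Tuple[float, float]:
    """Return (mean ITL, TPOT) in ms for a stream of equal-sized bursts."""
    # One chunk arrives per decode step and carries tokens_per_step tokens.
    arrivals = [step_ms * (i + 1) for i in range(steps)]
    ttft = arrivals[0]
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps)
    total_tokens = tokens_per_step * steps
    tpot = (arrivals[-1] - ttft) / (total_tokens - 1)
    return mean_itl, tpot


# Both streams emit 40 tokens at the same hypothetical 60 ms per step.
base_itl, base_tpot = stream_metrics(step_ms=60.0, tokens_per_step=1, steps=40)
spec_itl, spec_tpot = stream_metrics(step_ms=60.0, tokens_per_step=4, steps=10)

print(f"baseline:    ITL {base_itl:.1f} ms, TPOT {base_tpot:.1f} ms")
print(f"speculative: ITL {spec_itl:.1f} ms, TPOT {spec_tpot:.1f} ms")
```

Both streams show the same gap between chunks, but the bursty one delivers four tokens per gap, so its TPOT is far lower. That is the reason to compare engines on TPOT and throughput, not on ITL.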

The engine logs the same accept stats live during a run:

kubectl logs -n ml-team -l modelplane.ai/deployment=qwen3-8b-spec \
  | grep "SpecDecoding metrics"

Delete the baseline once measured; it exists only for the comparison:

kubectl delete -f docs/manifests/examples/qwen3-8b-speculative-decoding/model-service-baseline.yaml \
               -f docs/manifests/examples/qwen3-8b-speculative-decoding/model-deployment-baseline.yaml
kubectl delete pod bench -n ml-team