# Qwen3-8B speculative decoding

An 8.2B dense chat model on a single L4 with n-gram speculative decoding.

Source: /examples/qwen3-8b-speculative-decoding/

<!-- vale write-good.Passive = NO -->
This example sets up n-gram (prompt-lookup) speculative decoding and measures the
performance gain. Speculation proposes several tokens per decode step and verifies
them in one forward pass, so when the output repeats the prompt, most proposed
tokens are accepted at once. On a copy-heavy workload the speculative engine
reaches about 2.4 times the output token throughput of the same model without
speculation, at about 40% of the time per output token. The benchmark below has the
numbers.

This recipe was run end to end on GKE. The `InferenceClass`, `InferenceCluster`,
and `ModelDeployment` are the exact manifests from that run, which served a valid
completion and produced the benchmark below. The EKS platform shape is the standard
single-L4 recipe. It passes server validation but was not served in this run.
Apply the platform side first, then the ML side. The GKE `InferenceCluster`
carries a GCP project placeholder to edit before applying.

## Setup

Qwen3-8B is an 8.2B dense chat model, served as one `Standalone` vLLM engine on a
single NVIDIA L4 with no cache and weights pulled straight from Hugging Face. The
deployment shape is incidental here. The speculative config is what matters.
Modelplane supports n-gram (prompt-lookup) speculative decoding, which proposes
tokens by matching the most recent tokens against earlier text in the prompt and
copying what followed, so it needs no draft model or second set of weights.
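
The lookup step is simple enough to sketch. The shell function below is a simplified illustration, not vLLM's implementation: it treats whitespace-separated words as stand-ins for model tokens, and the name `propose_ngram` and its arguments are hypothetical. It takes the last `n` tokens of the context, finds their most recent earlier occurrence, and proposes up to `k` tokens that followed it.

```bash
# propose_ngram CONTEXT N K   (hypothetical helper, for illustration only)
# CONTEXT is the prompt plus the output so far, as space-separated tokens.
# Prints up to K tokens that followed the most recent earlier occurrence of
# the last N tokens, or nothing when there is no match.
propose_ngram() {
  printf '%s\n' "$1" | awk -v n="$2" -v k="$3" '
    {
      total = split($0, tok, " ")
      if (total <= n) exit
      # Scan candidate start positions, most recent first.
      for (start = total - n; start >= 1; start--) {
        hit = 1
        for (j = 0; j < n; j++)
          if (tok[start + j] != tok[total - n + 1 + j]) { hit = 0; break }
        if (!hit) continue
        # Copy the tokens that followed the match as the proposal.
        out = ""
        for (j = start + n; j < start + n + k && j <= total; j++)
          out = out (out == "" ? "" : " ") tok[j]
        print out
        exit
      }
    }'
}

# The prompt holds the function; the output so far has re-emitted its header.
propose_ngram 'def cart ( items ) : total = 0 def cart ( items ) :' 2 3
# prints: total = 0
```

The verification pass then keeps the proposed tokens only up to the first one that disagrees with the model's own choice, so a wrong guess costs speed, not correctness.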

## Platform

{{< tabs >}}
{{< tab "EKS" >}}
{{< manifests "examples/qwen3-8b-speculative-decoding/inference-class-eks.yaml" >}}

{{< manifests "examples/qwen3-8b-speculative-decoding/inference-cluster-eks.yaml" >}}
{{< /tab >}}
{{< tab "GKE" >}}
{{< manifests "examples/qwen3-8b-speculative-decoding/inference-class-gke.yaml" >}}

{{< manifests path="examples/qwen3-8b-speculative-decoding/inference-cluster-gke.yaml" apply="false" >}}

{{< editCode >}}
```bash
curl -fsSL {{< manifest-url "examples/qwen3-8b-speculative-decoding/inference-cluster-gke.yaml" >}} \
  | sed 's/my-gcp-project/$@<your-gcp-project-id>$@/' \
  | kubectl apply -f -
```
{{< /editCode >}}
{{< /tab >}}
{{< /tabs >}}

## Deployment

{{< manifests "examples/qwen3-8b-speculative-decoding/model-deployment.yaml" >}}

{{< manifests "examples/qwen3-8b-speculative-decoding/model-service.yaml" >}}

Speculation is active when the engine logs its `SpeculativeConfig` at startup
(`method='ngram'`). n-gram speculation pays off when the output copies the input.
The call below pastes a code block and asks for a small edit, so most output
tokens are matched straight from the prompt:

```bash
ADDR=$(kubectl get ms qwen3-8b-spec -n ml-team -o jsonpath='{.status.address}')
curl -s "$ADDR/v1/chat/completions" -H 'Content-Type: application/json' -d '{
  "model": "qwen3-8b-spec",
  "messages": [{"role":"user","content":"Return this Python function unchanged except rename the variable `total` to `subtotal`. Output only the code.\n\ndef cart(items):\n    total = 0\n    for item in items:\n        total += item.price\n    return total"}],
  "max_tokens": 200, "temperature": 0 }'
```

## Benchmark

Speculation is a speed optimization, so the example measures it rather than
asserting it. The comparison deploys a second engine, `qwen3-8b-base`, that is
identical to the speculative one except that it drops `--speculative-config`, then
runs the same workload against both on identical L4 hardware. The
`InferenceCluster` autoscales to a second node, so each engine gets its own L4.

{{< manifests "examples/qwen3-8b-speculative-decoding/model-deployment-baseline.yaml" >}}

{{< manifests "examples/qwen3-8b-speculative-decoding/model-service-baseline.yaml" >}}

The workload is the case n-gram accelerates. Each prompt pastes a code block and
asks for a small rename. The model then reproduces the block almost verbatim, so
most output tokens are matched from the prompt. A generic chat dataset would show
little or no gain, because little of its output repeats the input. Drive both
engines with `vllm bench serve` at `--max-concurrency 1`, where decode latency
dominates and speculation has the most to give:

```bash
# A throwaway client pod with the vLLM CLI (the benchmark client needs no GPU).
kubectl run bench -n ml-team --image=vllm/vllm-openai:v0.23.0 \
  --restart=Never --command -- sleep inf
kubectl wait --for=condition=Ready pod/bench -n ml-team --timeout=10m
# The base image ships vLLM but not its bench extras; add the one the custom
# dataset loader needs.
kubectl exec -n ml-team bench -- pip install -q pandas

# Copy-heavy dataset: code blocks paired with a small edit instruction.
kubectl exec -n ml-team bench -- sh -c 'cat > /tmp/edits.jsonl <<"EOF"
{"prompt": "Return this Python function unchanged except rename `total` to `subtotal`. Output only code.\n\ndef cart(items):\n    total = 0\n    for item in items:\n        total += item.price\n    return total"}
{"prompt": "Return this Python function unchanged except rename `n` to `count`. Output only code.\n\ndef fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"}
{"prompt": "Return this Python class unchanged except rename `items` to `entries`. Output only code.\n\nclass Stack:\n    def __init__(self):\n        self.items = []\n    def push(self, x):\n        self.items.append(x)\n    def pop(self):\n        return self.items.pop()"}
EOF'

# Benchmark each engine. --model must match its served-model-name.
for svc in qwen3-8b-base qwen3-8b-spec; do
  ADDR=$(kubectl get ms "$svc" -n ml-team -o jsonpath='{.status.address}')
  echo "== $svc =="
  kubectl exec -n ml-team bench -- vllm bench serve \
    --backend openai-chat --endpoint /v1/chat/completions \
    --base-url "$ADDR" --model "$svc" --tokenizer Qwen/Qwen3-8B \
    --dataset-name custom --dataset-path /tmp/edits.jsonl \
    --custom-output-len 256 --num-prompts 30 --max-concurrency 1 --seed 0
done
```

`--model` is the `served-model-name` (what the request carries). `--tokenizer` is the
real Hugging Face repository, which the client needs to count tokens locally. Without it,
`vllm bench serve` tries to fetch the served name from the Hub and fails.

Measured on a single L4 per engine (`vllm/vllm-openai:v0.23.0`, Qwen3-8B, 30
copy-heavy prompts, concurrency 1, the same 1045 generated tokens in each run):

| Metric | Baseline | n-gram speculative |
|---|---|---|
| Output token throughput (tok/s) | 16.10 | 39.01 |
| Mean TPOT (ms/token) | 60.20 | 24.21 |
| Wall-clock for 30 requests (seconds) | 64.91 | 26.78 |

Speculation cuts the time per output token by about 60% and lifts output throughput
about 2.4 times, because most output tokens are copied from the prompt and get
verified together in one step. For the speculative run `vllm bench serve` reports a
65% draft acceptance rate and a mean acceptance length of 4.27. With 5 proposed
tokens per step, that is about 3 accepted draft tokens plus the one token the
verification pass itself produces, so each step commits about 4 tokens. Inter-token
latency (ITL) is not a useful lens here: speculation emits tokens in bursts, so ITL
stays flat even as TPOT and throughput improve.
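
The ratios quoted here can be recomputed from the table. The sketch below uses hypothetical helper names (`ratio`, `tokens_per_step`). The second helper assumes that the reported mean acceptance length counts the accepted draft tokens plus the one token each verification pass yields on its own, which is an assumption about how the metric is defined rather than something this run confirmed.

```bash
# ratio A B: print A / B to two decimal places.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

# tokens_per_step RATE K: expected tokens committed per verification step,
# assuming RATE * K accepted drafts plus one token from the verifier itself.
tokens_per_step() {
  awk -v r="$1" -v k="$2" 'BEGIN { printf "%.2f\n", 1 + r * k }'
}

ratio 39.01 16.10        # output throughput, speculative / baseline: 2.42
ratio 24.21 60.20        # mean TPOT, speculative / baseline: 0.40
ratio 64.91 26.78        # wall-clock, baseline / speculative: 2.42
tokens_per_step 0.65 5   # 4.25, close to the reported 4.27
```

The small gap between 4.25 and 4.27 is consistent with the acceptance rate being rounded to 65%.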

The engine logs the same accept stats live during a run:

```bash
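# --tail=-1: with a label selector, kubectl logs returns only the last 10 lines by default.
kubectl logs -n ml-team -l modelplane.ai/deployment=qwen3-8b-spec --tail=-1 \
  | grep "SpecDecoding metrics"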
```

Delete the baseline once measured; it exists only for the comparison:

```bash
kubectl delete -f docs/manifests/examples/qwen3-8b-speculative-decoding/model-service-baseline.yaml \
               -f docs/manifests/examples/qwen3-8b-speculative-decoding/model-deployment-baseline.yaml
kubectl delete pod bench -n ml-team
```
<!-- vale write-good.Passive = YES -->