<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Examples on Modelplane Docs</title><link>/examples/</link><description>Recent content in Examples on Modelplane Docs</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 01 Jan 0001 00:00:00 +0000</lastBuildDate><atom:link href="/examples/index.xml" rel="self" type="application/rss+xml"/><item><title>Qwen3-8B</title><link>/examples/qwen3-8b/</link><pubDate/><guid>/examples/qwen3-8b/</guid><description>&lt;!-- vale write-good.Passive = NO --&gt;
&lt;p&gt;An 8.2B dense chat model on a single NVIDIA L4. The smallest recipe: one
&lt;code&gt;Standalone&lt;/code&gt; engine, no cache, weights pulled straight from Hugging Face.&lt;/p&gt;
&lt;p&gt;This recipe was run end to end; the &lt;code&gt;InferenceClass&lt;/code&gt; and &lt;code&gt;ModelDeployment&lt;/code&gt; are
the exact manifests from that run. Apply the platform side first, then the ML
side.&lt;/p&gt;
&lt;h2 id="platform"&gt;Platform &lt;a class="anchor-link" id="platform" href="#platform" aria-label="Link to this section: Platform"&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="gdoc-manifest"&gt;
&lt;div class="code-card"&gt;
&lt;div class="code-card__header"&gt;
&lt;span class="code-card__name"&gt;inference-class.yaml&lt;/span&gt;
&lt;div class="code-card__actions"&gt;&lt;button class="code-card__btn code-card__apply" type="button" data-copy="kubectl apply -f /examples/examples/qwen3-8b/inference-class.yaml" aria-label="Copy kubectl apply command" title="Copy kubectl apply command"&gt;
&lt;svg class="icon-main" width="19" height="19" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.6" stroke-linecap="round" aria-hidden="true" focusable="false"&gt;
&lt;circle cx="12" cy="12" r="7.4"/&gt;
&lt;circle cx="12" cy="12" r="2.4"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(51.43 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(102.86 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(154.29 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(205.71 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(257.14 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(308.57 12 12)"/&gt;
&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;&lt;button class="code-card__btn code-card__copy" type="button" aria-label="Copy contents" title="Copy contents"&gt;
&lt;svg class="icon-main" width="15" height="15" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M4 1.5H3a2 2 0 0 0-2 2V14a2 2 0 0 0 2 2h10a2 2 0 0 0 2-2V3.5a2 2 0 0 0-2-2h-1v1h1a1 1 0 0 1 1 1V14a1 1 0 0 1-1 1H3a1 1 0 0 1-1-1V3.5a1 1 0 0 1 1-1h1z"/&gt;&lt;path d="M9.5 1a.5.5 0 0 1 .5.5v1a.5.5 0 0 1-.5.5h-3a.5.5 0 0 1-.5-.5v-1a.5.5 0 0 1 .5-.5zm-3-1A1.5 1.5 0 0 0 5 1.5v1A1.5 1.5 0 0 0 6.5 4h3A1.5 1.5 0 0 0 11 2.5v-1A1.5 1.5 0 0 0 9.5 0z"/&gt;&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="code-card__body"&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# InferenceClass for the L4 shape, validated serving Qwen3-8B on EKS.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# One NVIDIA L4 on an EKS g6.xlarge. The single GPU is a claim: DRA device;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# the scheduler matches a ModelDeployment&amp;#39;s nodeSelector against its declared&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# capacity and DRA binds it to the serving pod.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;modelplane.ai/v1alpha1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;InferenceClass&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;eks-l4-1x-g6&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;EKS g6.xlarge, 1x NVIDIA L4&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provisioning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;EKS&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;instanceType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;g6.xlarge&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;diskSizeGb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;accelerator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;nvidia-l4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;gpu&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;DRA&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;gpu.nvidia.com&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;deviceClassName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;gpu.nvidia.com&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;architecture&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;{&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Ada Lovelace }&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;# The L4&amp;#39;s real usable VRAM as the NVIDIA DRA driver reports it, not the&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;# nominal 24GB.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;{&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;23034Mi&amp;#34;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;}&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="gdoc-manifest"&gt;
&lt;div class="code-card"&gt;
&lt;div class="code-card__header"&gt;
&lt;span class="code-card__name"&gt;inference-cluster.yaml&lt;/span&gt;
&lt;div class="code-card__actions"&gt;&lt;button class="code-card__btn code-card__apply" type="button" data-copy="kubectl apply -f /examples/examples/qwen3-8b/inference-cluster.yaml" aria-label="Copy kubectl apply command" title="Copy kubectl apply command"&gt;
&lt;svg class="icon-main" width="19" height="19" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.6" stroke-linecap="round" aria-hidden="true" focusable="false"&gt;
&lt;circle cx="12" cy="12" r="7.4"/&gt;
&lt;circle cx="12" cy="12" r="2.4"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(51.43 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(102.86 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(154.29 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(205.71 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(257.14 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(308.57 12 12)"/&gt;
&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;&lt;button class="code-card__btn code-card__copy" type="button" aria-label="Copy contents" title="Copy contents"&gt;
&lt;svg class="icon-main" width="15" height="15" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M4 1.5H3a2 2 0 0 0-2 2V14a2 2 0 0 0 2 2h10a2 2 0 0 0 2-2V3.5a2 2 0 0 0-2-2h-1v1h1a1 1 0 0 1 1 1V14a1 1 0 0 1-1 1H3a1 1 0 0 1-1-1V3.5a1 1 0 0 1 1-1h1z"/&gt;&lt;path d="M9.5 1a.5.5 0 0 1 .5.5v1a.5.5 0 0 1-.5.5h-3a.5.5 0 0 1-.5-.5v-1a.5.5 0 0 1 .5-.5zm-3-1A1.5 1.5 0 0 0 5 1.5v1A1.5 1.5 0 0 0 6.5 4h3A1.5 1.5 0 0 0 11 2.5v-1A1.5 1.5 0 0 0 9.5 0z"/&gt;&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="code-card__body"&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# An EKS InferenceCluster with one L4 node pool, labeled for the&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# ModelDeployment&amp;#39;s clusterSelector to target.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;modelplane.ai/v1alpha1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;InferenceCluster&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;eks-l4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;modelplane.ai/region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;us&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;EKS&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;us-west-2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;nodePools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;gpu-l4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;className&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;eks-l4-1x-g6&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;nodeCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;minNodeCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;maxNodeCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;zones&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;us-west-2a&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="deployment"&gt;Deployment &lt;a class="anchor-link" id="deployment" href="#deployment" aria-label="Link to this section: Deployment"&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="gdoc-manifest"&gt;
&lt;div class="code-card"&gt;
&lt;div class="code-card__header"&gt;
&lt;span class="code-card__name"&gt;model-deployment.yaml&lt;/span&gt;
&lt;div class="code-card__actions"&gt;&lt;button class="code-card__btn code-card__apply" type="button" data-copy="kubectl apply -f /examples/examples/qwen3-8b/model-deployment.yaml" aria-label="Copy kubectl apply command" title="Copy kubectl apply command"&gt;
&lt;svg class="icon-main" width="19" height="19" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.6" stroke-linecap="round" aria-hidden="true" focusable="false"&gt;
&lt;circle cx="12" cy="12" r="7.4"/&gt;
&lt;circle cx="12" cy="12" r="2.4"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(51.43 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(102.86 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(154.29 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(205.71 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(257.14 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(308.57 12 12)"/&gt;
&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;&lt;button class="code-card__btn code-card__copy" type="button" aria-label="Copy contents" title="Copy contents"&gt;
&lt;svg class="icon-main" width="15" height="15" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M4 1.5H3a2 2 0 0 0-2 2V14a2 2 0 0 0 2 2h10a2 2 0 0 0 2-2V3.5a2 2 0 0 0-2-2h-1v1h1a1 1 0 0 1 1 1V14a1 1 0 0 1-1 1H3a1 1 0 0 1-1-1V3.5a1 1 0 0 1 1-1h1z"/&gt;&lt;path d="M9.5 1a.5.5 0 0 1 .5.5v1a.5.5 0 0 1-.5.5h-3a.5.5 0 0 1-.5-.5v-1a.5.5 0 0 1 .5-.5zm-3-1A1.5 1.5 0 0 0 5 1.5v1A1.5 1.5 0 0 0 6.5 4h3A1.5 1.5 0 0 0 11 2.5v-1A1.5 1.5 0 0 0 9.5 0z"/&gt;&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="code-card__body"&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# Qwen3-8B served on a single NVIDIA L4, validated end to end on EKS.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# An 8.2B dense model is a single Standalone engine: one self-contained vLLM&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# pod, no ModelCache, weights pulled straight from Hugging Face. The flags carry&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# real meaning beyond fit:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# --tool-call-parser=hermes the parser for Qwen3 dense (qwen3_xml is&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# for Qwen3-Coder, not this model). Qwen3&amp;#39;s&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# tool-use template ships in the tokenizer,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# so no --chat-template is needed.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# --reasoning-parser=qwen3 with&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# --default-chat-template-kwargs turns thinking off. Qwen3 thinks by&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# default, burying a one-line answer under a&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# &amp;lt;think&amp;gt; block and forbidding greedy decode.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# --max-model-len / --gpu-memory-utilization L4 fit, not correctness.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# No --port or --host: Modelplane&amp;#39;s routing expects the engine on its default&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# :8000 with a /health probe, and passes args through verbatim.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;modelplane.ai/v1alpha1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ModelDeployment&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;qwen3-8b&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ml-team&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;clusterSelector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;matchLabels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;modelplane.ai/region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;us&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;engines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;qwen3-8b&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;members&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Standalone&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;nodeSelector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;gpu&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;selectors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;cel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="sd"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; device.capacity[&amp;#34;gpu.nvidia.com&amp;#34;].memory.compareTo(quantity(&amp;#34;20Gi&amp;#34;)) &amp;gt;= 0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;engine&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;vllm/vllm-openai:v0.23.0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;--model=Qwen/Qwen3-8B&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;--served-model-name=qwen&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;--max-model-len=16384&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;--gpu-memory-utilization=0.92&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;--reasoning-parser=qwen3&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;&amp;#34;--default-chat-template-kwargs={\&amp;#34;enable_thinking\&amp;#34;: &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;}&lt;span class="s2"&gt;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; - &amp;#34;&lt;/span&gt;--&lt;span class="l"&gt;enable-auto-tool-choice&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;--tool-call-parser=hermes&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="gdoc-manifest"&gt;
&lt;div class="code-card"&gt;
&lt;div class="code-card__header"&gt;
&lt;span class="code-card__name"&gt;model-service.yaml&lt;/span&gt;
&lt;div class="code-card__actions"&gt;&lt;button class="code-card__btn code-card__apply" type="button" data-copy="kubectl apply -f /examples/examples/qwen3-8b/model-service.yaml" aria-label="Copy kubectl apply command" title="Copy kubectl apply command"&gt;
&lt;svg class="icon-main" width="19" height="19" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.6" stroke-linecap="round" aria-hidden="true" focusable="false"&gt;
&lt;circle cx="12" cy="12" r="7.4"/&gt;
&lt;circle cx="12" cy="12" r="2.4"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(51.43 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(102.86 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(154.29 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(205.71 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(257.14 12 12)"/&gt;
&lt;line x1="12" y1="9.6" x2="12" y2="2.6" transform="rotate(308.57 12 12)"/&gt;
&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;&lt;button class="code-card__btn code-card__copy" type="button" aria-label="Copy contents" title="Copy contents"&gt;
&lt;svg class="icon-main" width="15" height="15" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M4 1.5H3a2 2 0 0 0-2 2V14a2 2 0 0 0 2 2h10a2 2 0 0 0 2-2V3.5a2 2 0 0 0-2-2h-1v1h1a1 1 0 0 1 1 1V14a1 1 0 0 1-1 1H3a1 1 0 0 1-1-1V3.5a1 1 0 0 1 1-1h1z"/&gt;&lt;path d="M9.5 1a.5.5 0 0 1 .5.5v1a.5.5 0 0 1-.5.5h-3a.5.5 0 0 1-.5-.5v-1a.5.5 0 0 1 .5-.5zm-3-1A1.5 1.5 0 0 0 5 1.5v1A1.5 1.5 0 0 0 6.5 4h3A1.5 1.5 0 0 0 11 2.5v-1A1.5 1.5 0 0 0 9.5 0z"/&gt;&lt;/svg&gt;
&lt;svg class="icon-check" width="16" height="16" viewBox="0 0 16 16" fill="currentColor" aria-hidden="true" focusable="false"&gt;&lt;path d="M13.485 1.929a.75.75 0 0 1 .086 1.057l-7 8.5a.75.75 0 0 1-1.117.063l-3.5-3.5a.75.75 0 0 1 1.06-1.06l2.92 2.92 6.494-7.894a.75.75 0 0 1 1.057-.086z"/&gt;&lt;/svg&gt;
&lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="code-card__body"&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# Exposes the qwen3-8b deployment&amp;#39;s endpoints as a single OpenAI-compatible URL.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# Modelplane labels each composed ModelEndpoint with the deployment name, so this&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# selector reaches every replica. Read the public address from status.address:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# kubectl get ms qwen3-8b -n ml-team -o jsonpath=&amp;#39;{.status.address}&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;modelplane.ai/v1alpha1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ModelService&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;qwen3-8b&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ml-team&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;matchLabels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;modelplane.ai/deployment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;qwen3-8b&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;!-- vale write-good.Passive = YES --&gt;</description></item><item><title>Qwen3-Coder-480B</title><link>/examples/qwen3-coder/</link><pubDate/><guid>/examples/qwen3-coder/</guid><description>&lt;!-- vale write-good.Passive = NO --&gt;
&lt;p&gt;A 480B code MoE (35B active). Two validated shapes: the BF16 weights span two
H200 nodes as a gang over EFA, served from a &lt;code&gt;ModelCache&lt;/code&gt;; the FP8 checkpoint
fits one node, so it runs as a single &lt;code&gt;Standalone&lt;/code&gt; engine on SGLang with no
cache.&lt;/p&gt;
&lt;p&gt;Both shapes were run end to end; the &lt;code&gt;InferenceClass&lt;/code&gt; and &lt;code&gt;ModelDeployment&lt;/code&gt; are
the exact manifests from those runs. Apply the platform side first, then the ML
side. The &lt;code&gt;InferenceCluster&lt;/code&gt; carries an EC2 capacity reservation placeholder to
edit before applying.&lt;/p&gt;</description></item><item><title>Kimi-K2</title><link>/examples/kimi-k2/</link><pubDate/><guid>/examples/kimi-k2/</guid><description>&lt;!-- vale write-good.Passive = NO --&gt;
&lt;p&gt;A 1T MoE (1 trillion parameters) served prefill/decode disaggregated across two
H200 nodes: two engines, one per phase, with Modelplane composing the llm-d
routing layer between them. This recipe serves an INT4 quantization of the
model; the native FP8 weights need four such nodes.&lt;/p&gt;
&lt;p&gt;This recipe was run end to end; the &lt;code&gt;InferenceClass&lt;/code&gt; and &lt;code&gt;ModelDeployment&lt;/code&gt; are
the exact manifests from that run. Apply the platform side first, then the ML
side. The &lt;code&gt;InferenceCluster&lt;/code&gt; carries an EC2 capacity reservation placeholder to
edit before applying.&lt;/p&gt;</description></item><item><title>Llama-3.1-8B</title><link>/examples/llama-3.1-8b/</link><pubDate/><guid>/examples/llama-3.1-8b/</guid><description>&lt;!-- vale write-good.Passive = NO --&gt;
&lt;p&gt;An 8B dense chat model on a single NVIDIA L4. The entry recipe: one &lt;code&gt;Standalone&lt;/code&gt;
engine, no cache, public weights from a Hugging Face mirror. It carries no
&lt;code&gt;clusterSelector&lt;/code&gt;, so device capacity alone matches it to any compatible L4 in
the fleet.&lt;/p&gt;
&lt;p&gt;This recipe was run end to end on GKE; the &lt;code&gt;InferenceClass&lt;/code&gt;, &lt;code&gt;InferenceCluster&lt;/code&gt;,
and &lt;code&gt;ModelDeployment&lt;/code&gt; are the exact manifests from that run. The EKS platform
shape is the standard single-L4 recipe. It passes server validation but was not
served in this run. Apply the platform side first, then the ML side. The GKE
&lt;code&gt;InferenceCluster&lt;/code&gt; carries a GCP project placeholder to edit before applying.&lt;/p&gt;</description></item><item><title>Collecting engine metrics</title><link>/examples/collecting-engine-metrics/</link><pubDate/><guid>/examples/collecting-engine-metrics/</guid><description>&lt;!-- vale write-good.Passive = NO --&gt;
&lt;p&gt;Scraping an inference engine&amp;rsquo;s Prometheus metrics, shown on the smallest serving
shape: a 0.5B Qwen chat model on one NVIDIA L4. vLLM publishes metrics at
&lt;code&gt;/metrics&lt;/code&gt; on its serving port with no extra flag, and Modelplane runs a
Prometheus on every workload cluster with &lt;code&gt;PodMonitor&lt;/code&gt; discovery open across
namespaces, so scraping the engine is a &lt;code&gt;PodMonitor&lt;/code&gt; plus a &lt;code&gt;port-forward&lt;/code&gt;. The
model is only the subject; the same wiring fits any engine, with the SGLang,
leader/worker, and prefill/decode differences noted at the end.&lt;/p&gt;</description></item><item><title>Qwen3-8B speculative decoding</title><link>/examples/qwen3-8b-speculative-decoding/</link><pubDate/><guid>/examples/qwen3-8b-speculative-decoding/</guid><description>&lt;!-- vale write-good.Passive = NO --&gt;
&lt;p&gt;This example sets up n-gram (prompt-lookup) speculative decoding and measures the
performance gain. Speculation proposes several tokens per decode step and verifies
them in one forward pass, so when the output repeats the prompt most proposed
tokens are accepted at once. On a copy-heavy workload the speculative engine
reaches about 2.4 times the output token throughput of the same model without
speculation, at about half the time per output token. The benchmark below has the
numbers.&lt;/p&gt;</description></item></channel></rss>