A 12-turn, multi-topic conversation run against ARK Qwen 30B on 1× Blackwell GPU. Same model, same GPU, same runtime; only the stateful flag flips. The numbers below show what the KV-cache actually buys you, measured in latency, token volume, and cumulative GPU prefill.
Stateful mode sets `extra_body.ark_stateful = <session-id>` so the KV-cache for that session stays resident on the GPU. Stateless mode re-sends the full conversation history on every call, which is how every OpenAI-compatible third-party API works.

| Model | Provider | Params | TTFT @ turn 12 | Tokens sent @ turn 12 | Cumulative tokens (12 turns) | Stateful speed-up |
|---|---|---|---|---|---|---|
| Qwen 30B featured | ARK runtime | 30B | 0.228s / 4.046s | 37 / 10,168 | 438 / 57,414 | 18× |
| Llama 3 8B | ARK runtime | 8B | 0.090s / 0.817s | 60 / 6,565 | 713 / 41,306 | 9× |
| Bielik | ARK runtime | — | 0.146s / 1.519s | 43 / 8,579 | 489 / 50,661 | 10× |
| Llama 3.3 70B | Groq (cloud) | 70B | 0.883s | 7,863 | 47,976 | — |
| Llama 3.3 70B | Together (cloud) | 70B | 0.680s | 7,805 | 47,676 | — |
| GPT-4o | OpenAI (cloud) | — | 0.811s | 6,893 | 41,801 | — |
Reading the numbers. For ARK runtime rows, cells show stateful / stateless; the speed-up column is stateless TTFT divided by stateful TTFT at turn 12. Cloud providers were tested stateless only; an apples-to-apples Llama 3.3 70B run on ARK is in scope for the April 2026 suite. All values are averages across 5 topics. Cumulative tokens are per single 12-turn conversation.
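A minimal sketch of the difference in request shape between the two modes. The `ark_stateful` flag name comes from this post; the assumption (consistent with the token counts above, e.g. 37 vs 10,168 tokens at turn 12) is that a stateful call sends only the newest turn, because the server's resident KV-cache already holds the rest. The session id and message contents here are illustrative.

```python
def build_request(history, session_id=None):
    """Return (messages, extra_body) for the next chat.completions call."""
    if session_id is not None:
        # Stateful: the session's KV-cache stays resident on the GPU,
        # so (assumption) only the newest message is sent, tagged with
        # the session id via extra_body.
        return history[-1:], {"ark_stateful": session_id}
    # Stateless: re-send the full history; the server re-prefills all of it.
    return list(history), {}

history = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up question"},
]

stateless_msgs, _ = build_request(history)
stateful_msgs, extra = build_request(history, session_id="sess-42")
print(len(stateless_msgs), len(stateful_msgs), extra)
# → 3 1 {'ark_stateful': 'sess-42'}
```

With the `openai` Python SDK, the actual call would pass these straight through: `client.chat.completions.create(model=..., messages=messages, extra_body=extra)` — `extra_body` is the SDK's standard hook for provider-specific fields.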
The benchmark script is `scripts/stateful_benchmark.py`, built on the `openai` SDK. It takes one argument: the provider name (`ark`, `openai`, `groq`, `together`).
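A sketch of how the script's provider argument might map to OpenAI-compatible endpoints. The cloud base URLs below are the providers' documented OpenAI-compatible endpoints; the local ARK URL and the exact structure of the real script are assumptions.

```python
import argparse

# Provider → base URL for the OpenAI-compatible endpoint.
# The ARK URL is a hypothetical local deployment; the cloud URLs are
# the providers' published OpenAI-compatible endpoints.
PROVIDERS = {
    "ark": "http://localhost:8000/v1",
    "openai": "https://api.openai.com/v1",
    "groq": "https://api.groq.com/openai/v1",
    "together": "https://api.together.xyz/v1",
}

def parse_provider(argv):
    """Parse the single positional provider argument."""
    parser = argparse.ArgumentParser(description="12-turn stateful benchmark")
    parser.add_argument("provider", choices=sorted(PROVIDERS))
    return parser.parse_args(argv).provider

provider = parse_provider(["groq"])
print(provider, PROVIDERS[provider])
# → groq https://api.groq.com/openai/v1
```

Only the `ark` provider would set `extra_body.ark_stateful`; the cloud providers are driven stateless, matching the table above.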