Research / Stateful vs Stateless · Nov 2025

Stateful vs stateless, on the same ARK runtime.

A 12-turn multi-topic conversation run against ARK Qwen 30B on 1× Blackwell GPU. Same model, same GPU, same runtime — only the stateful flag flips. What the KV-cache actually buys you, measured in latency, token volume, and cumulative GPU prefill.

What KV-cache reuse actually saves.
Four views of the same run. Summary at turn 12 → latency over time → tokens per turn → cumulative GPU work.
Model
Qwen 30B
Hardware
1× Blackwell GPU
Topics × turns
5 × 12
Run date
November 2025
Three-tile summary at turn 12: ARK Stateful delivers 18× faster time to first token, 275× fewer prompt tokens per turn, and 131× less cumulative GPU prefill across the 12-turn conversation.
At turn 12: 18× faster time to first token, 275× fewer tokens per turn, 131× less cumulative GPU prefill across the run.
Line chart: Time to first token across 12 conversation turns. ARK Stateful stays flat at about 0.22 seconds. Stateless climbs from 0.26 seconds at turn 1 to 4.05 seconds at turn 12. The widening gap is the prefill tax.
Stateful latency stays flat. Stateless re-prefills the full history on every turn — the widening gap is the prefill tax.
Line chart: Prompt tokens sent per turn across 12 turns. ARK Stateful sends about 37 new tokens each turn. Stateless re-sends the full growing conversation history, reaching 10,168 tokens by turn 12.
Stateless re-sends the full growing history on every turn. Stateful only sends new user tokens; the KV-cache on the GPU holds the rest.
Line chart: Cumulative prompt tokens prefilled across 12 turns. ARK Stateful totals 438 tokens by turn 12 — hugs the x-axis. ARK Stateless compounds to 57,414 tokens — 131× more GPU prefill for the same conversation.
Totals across the full conversation. Stateless GPU work compounds on every turn; stateful stops paying the tax after turn 1.
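The compounding in that last chart follows directly from the arithmetic: stateless prompt cost at each turn is the running history, while stateful prefills each token exactly once. A minimal sketch with toy per-turn token counts (illustrative numbers, not the measured ones):

```python
def prefill_totals(turns):
    """turns: list of (user_tokens, assistant_tokens) per conversation turn.

    Returns (stateful_total, stateless_total) cumulative prefill tokens.
    """
    history = 0     # tokens accumulated in the conversation so far
    stateful = 0
    stateless = 0
    for user, assistant in turns:
        stateful += user              # stateful sends only the new user turn
        stateless += history + user   # stateless re-prefills all prior history
        history += user + assistant   # both roles grow the history
    return stateful, stateless

# Toy shape: 12 turns of ~40 user tokens and ~800 assistant tokens each.
s, sl = prefill_totals([(40, 800)] * 12)
```

With these toy numbers the stateless total is roughly two orders of magnitude above the stateful one — the same compounding pattern as the measured 131×, since the stateless cost grows with the square of the turn count.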
Same runtime, four hardware classes.
A second test in the same November 2025 suite: ARK aggregate throughput on Meta Llama 3.3 70B at a 4-GPU configuration, across four hardware tiers. Same model, same ARK runtime, same measurement approach — only the GPUs change.
Horizontal bar chart: Meta Llama 3.3 70B aggregate throughput in a 4-GPU setup. H100 (GH100 datacenter) at 49.3 tokens/sec, RTX 5080-class (GB203 Blackwell) at 40.4, RTX 3090-class (GA102 Ampere) at 24.0, RTX 3060-class (GA106 entry Ampere) at 17.3. A 2.9× range between entry consumer and datacenter; the same 70B model runs on all four.
A 2.9× range between an entry consumer GPU and a datacenter H100 — the same 70B model runs on all four tiers, no code-path change.
How the run was set up.
Conversation shape
Five topics (Barcelona travel, car purchase, AI band launch, fitness plan, Python learning). Each topic runs a 12-turn dialogue where every follow-up question depends on prior turns — mirroring agentic workloads, not one-shot chat.
What is measured
Time to first token (TTFT), prompt tokens per call, cumulative tokens across the 12 turns. Measured on the client side using the OpenAI-compatible streaming API surface.
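Client-side TTFT can be sketched as follows: the timer starts before the request is issued and stops at the first streamed chunk, so queueing and prefill are both captured. The `start_stream` callable here is an illustrative stand-in for whatever issues the streaming call (e.g. a lambda wrapping `client.chat.completions.create(..., stream=True)` from the openai SDK).

```python
import time

def measure_ttft(start_stream):
    """start_stream: zero-arg callable that issues the request and
    returns an iterator of streamed chunks."""
    t0 = time.perf_counter()      # start the clock before the request goes out
    stream = start_stream()
    first = next(iter(stream))    # block until the first token arrives
    return time.perf_counter() - t0, first
```

The key detail is starting the clock before the call, not after the stream object is returned — otherwise connection setup and server-side prefill are silently excluded.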
Stateful vs stateless
Stateful sets extra_body.ark_stateful = <session-id> so the KV-cache for that session stays resident on the GPU. Stateless re-sends the full conversation history on every call, which is how every OpenAI-compatible third-party API works.
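The flag is the only difference between the two modes. A minimal sketch of the two request shapes, assuming the OpenAI-compatible chat format; everything except `extra_body.ark_stateful` (documented above) is illustrative:

```python
def build_request(history, new_user_msg, stateful, session_id="demo-session"):
    """history: prior [{"role": ..., "content": ...}] messages."""
    if stateful:
        # Only the new turn goes over the wire; the GPU-resident
        # KV-cache for this session holds everything prior.
        return {
            "messages": [{"role": "user", "content": new_user_msg}],
            "extra_body": {"ark_stateful": session_id},
        }
    # Stateless: the full growing history is re-sent (and re-prefilled)
    # on every call.
    return {"messages": history + [{"role": "user", "content": new_user_msg}]}
```

In the stateless branch the payload grows linearly with the conversation; in the stateful branch it stays constant at one message per call.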
Numbers reported
Values shown are averages across the 5 topics for each turn index. Raw per-topic traces are available via the Reproduce section below.
The full test matrix.
This report focuses on Qwen 30B on ARK. The same benchmark suite also ran against two other models on ARK and three cloud-hosted reference providers — identical prompts, identical measurement script.
Model          Provider          Params  TTFT @ turn 12   Tokens @ turn 12  Cumulative tokens (12 turns)  Stateful speed-up
Llama 3 8B     ARK runtime       8B      0.090s / 0.817s  60 / 6,565        713 / 41,306                  9×
Bielik         ARK runtime       —       0.146s / 1.519s  43 / 8,579        489 / 50,661                  10×
Llama 3.3 70B  Groq (cloud)      70B     0.883s           7,863             47,976                        —
Llama 3.3 70B  Together (cloud)  70B     0.680s           7,805             47,676                        —
GPT-4o         OpenAI (cloud)    —       0.811s           6,893             41,801                        —

Reading the numbers. For ARK runtime rows, cells show stateful / stateless. Cloud providers were tested stateless only — an apples-to-apples Llama 3.3 70B run on ARK is in scope for the April 2026 suite. All values are averages across 5 topics. Cumulative tokens are per single 12-turn conversation.

Run this on your own hardware.
We're preparing a public reproducer kit — script, prompt set, and output schema — so every number on this page can be independently verified on your own GPUs. Available through a POC engagement today; the self-serve package is in progress.
Script
scripts/stateful_benchmark.py
Python 3.10+, openai SDK. Takes one arg: the provider name (ark, openai, groq, together).
Prompt set
5 topics × up to 12 turns, embedded in the script
Identical prompts across all providers — comparison is apples to apples.
Output
Per-turn CSV with TTFT, token counts, cumulative time
Same CSV schema for every provider so results drop into the same charts.