Research / Stateful vs Stateless · Nov 2025

Stateful vs stateless, on the same ARK runtime.

A 12-turn multi-topic conversation run against ARK Qwen 30B on 1× Blackwell GPU. Same model, same GPU, same runtime — only the stateful flag flips. What the KV-cache actually buys you, measured in latency, token volume, and cumulative GPU prefill.

What KV-cache reuse actually saves.
Four views of the same run. Summary at turn 12 → latency over time → tokens per turn → cumulative GPU work.
Model
Qwen 30B
Hardware
1× Blackwell GPU
Topics × turns
5 × 12
Run date
November 2025
Three-tile summary at turn 12: ARK Stateful delivers 18× faster time to first token, 275× fewer prompt tokens per turn, and 131× less cumulative GPU prefill across the 12-turn conversation.
At turn 12: 18× faster time to first token, 275× fewer tokens per turn, 131× less cumulative GPU prefill across the run.
Line chart: Time to first token across 12 conversation turns. ARK Stateful stays flat at about 0.22 seconds. Stateless climbs from 0.26 seconds at turn 1 to 4.05 seconds at turn 12. The widening gap is the prefill tax.
Stateful latency stays flat. Stateless re-prefills the full history on every turn — the widening gap is the prefill tax.
Line chart: Prompt tokens sent per turn across 12 turns. ARK Stateful sends about 37 new tokens each turn. Stateless re-sends the full growing conversation history, reaching 10,168 tokens by turn 12.
Stateless re-sends the full growing history on every turn. Stateful only sends new user tokens; the KV-cache on the GPU holds the rest.
Line chart: Cumulative prompt tokens prefilled across 12 turns. ARK Stateful totals 438 tokens by turn 12 — hugs the x-axis. ARK Stateless compounds to 57,414 tokens — 131× more GPU prefill for the same conversation.
Totals across the full conversation. Stateless GPU work compounds on every turn; stateful stops paying the tax after turn 1.
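The compounding in that last chart follows directly from the arithmetic: stateless prompt cost at each turn is the running history, while stateful prefills each token exactly once. A minimal sketch with toy per-turn token counts (illustrative numbers, not the measured ones):

```python
def prefill_totals(turns):
    """turns: list of (user_tokens, assistant_tokens) per conversation turn.

    Returns (stateful_total, stateless_total) cumulative prefill tokens.
    """
    history = 0     # tokens accumulated in the conversation so far
    stateful = 0
    stateless = 0
    for user, assistant in turns:
        stateful += user              # stateful sends only the new user turn
        stateless += history + user   # stateless re-prefills all prior history
        history += user + assistant   # both roles grow the history
    return stateful, stateless

# Toy shape: 12 turns of ~40 user tokens and ~800 assistant tokens each.
s, sl = prefill_totals([(40, 800)] * 12)
```

With these toy numbers the stateless total is roughly two orders of magnitude above the stateful one — the same compounding pattern as the measured 131×, since the stateless cost grows with the square of the turn count.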
Same runtime, four hardware classes.
A second test in the same November 2025 suite: ARK aggregate throughput on Meta Llama 3.3 70B at a 4-GPU configuration, across four hardware tiers. Same model, same ARK runtime, same measurement approach — only the GPUs change.
Horizontal bar chart: Meta Llama 3.3 70B aggregate throughput in a 4-GPU setup. H100 (GH100 datacenter) at 49.3 tokens/sec, RTX 5080-class (GB203 Blackwell) at 40.4, RTX 3090-class (GA102 Ampere) at 24.0, RTX 3060-class (GA106 entry Ampere) at 17.3. A 2.9× range between entry consumer and datacenter; the same 70B model runs on all four.
A 2.9× range between an entry consumer GPU and a datacenter H100 — the same 70B model runs on all four tiers, no code-path change.
How the run was set up.
Conversation shape
Five topics (Barcelona travel, car purchase, AI band launch, fitness plan, Python learning). Each topic runs a 12-turn dialogue where every follow-up question depends on prior turns — mirroring agentic workloads, not one-shot chat.
What is measured
Time to first token (TTFT), prompt tokens per call, cumulative tokens across the 12 turns. Measured on the client side using the OpenAI-compatible streaming API surface.
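Client-side TTFT can be sketched as follows: the timer starts before the request is issued and stops at the first streamed chunk, so queueing and prefill are both captured. The `start_stream` callable here is an illustrative stand-in for whatever issues the streaming call (e.g. a lambda wrapping `client.chat.completions.create(..., stream=True)` from the openai SDK).

```python
import time

def measure_ttft(start_stream):
    """start_stream: zero-arg callable that issues the request and
    returns an iterator of streamed chunks."""
    t0 = time.perf_counter()      # start the clock before the request goes out
    stream = start_stream()
    first = next(iter(stream))    # block until the first token arrives
    return time.perf_counter() - t0, first
```

The key detail is starting the clock before the call, not after the stream object is returned — otherwise connection setup and server-side prefill are silently excluded.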
Stateful vs stateless
Stateful sets extra_body.ark_stateful = <session-id> so the KV-cache for that session stays resident on the GPU. Stateless re-sends the full conversation history on every call, which is how every OpenAI-compatible third-party API works.
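The flag is the only difference between the two modes. A minimal sketch of the two request shapes, assuming the OpenAI-compatible chat format; everything except `extra_body.ark_stateful` (documented above) is illustrative:

```python
def build_request(history, new_user_msg, stateful, session_id="demo-session"):
    """history: prior [{"role": ..., "content": ...}] messages."""
    if stateful:
        # Only the new turn goes over the wire; the GPU-resident
        # KV-cache for this session holds everything prior.
        return {
            "messages": [{"role": "user", "content": new_user_msg}],
            "extra_body": {"ark_stateful": session_id},
        }
    # Stateless: the full growing history is re-sent (and re-prefilled)
    # on every call.
    return {"messages": history + [{"role": "user", "content": new_user_msg}]}
```

In the stateless branch the payload grows linearly with the conversation; in the stateful branch it stays constant at one message per call.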
Numbers reported
Values shown are averages across the 5 topics for each turn index. Raw per-topic traces are available via the Reproduce section below.
The full test matrix.
This report focuses on Qwen 30B on ARK. The same benchmark suite also ran against two other models on ARK and three cloud-hosted reference providers — identical prompts, identical measurement script.
Model          Provider          Params  TTFT @ turn 12   Tokens @ turn 12  Cumulative tokens (12 turns)  Stateful speed-up
Llama 3 8B     ARK runtime       8B      0.090s / 0.817s  60 / 6,565        713 / 41,306                  9×
Bielik         ARK runtime       —       0.146s / 1.519s  43 / 8,579        489 / 50,661                  10×
Llama 3.3 70B  Groq (cloud)      70B     0.883s           7,863             47,976                        —
Llama 3.3 70B  Together (cloud)  70B     0.680s           7,805             47,676                        —
GPT-4o         OpenAI (cloud)    —       0.811s           6,893             41,801                        —

Reading the numbers. For ARK runtime rows, cells show stateful / stateless. Cloud providers were tested stateless only — an apples-to-apples Llama 3.3 70B run on ARK is in scope for the April 2026 suite. All values are averages across 5 topics. Cumulative tokens are per single 12-turn conversation.

Run this on your own hardware.
We're preparing a public reproducer kit — script, prompt set, and output schema — so every number on this page can be independently verified on your own GPUs. Available through a POC engagement today; the self-serve package is in progress.
Script
scripts/stateful_benchmark.py
Python 3.10+, openai SDK. Takes one arg: the provider name (ark, openai, groq, together).
Prompt set
5 topics × up to 12 turns, embedded in the script
Identical prompts across all providers — comparison is apples to apples.
Output
Per-turn CSV with TTFT, token counts, cumulative time
Same CSV schema for every provider so results drop into the same charts.