Introduction
Agentic AI is no longer a lab curiosity. According to Deloitte’s State of AI in the Enterprise 2026 (N=3,235 director-to-C-suite respondents), 74% of enterprises plan to deploy agentic AI within two years, and 85% expect to customise those agents to their own workflows. But only 21% report having mature governance for autonomous agents. The gap between ambition and operational readiness is widening — and the bottleneck is, increasingly, the inference layer underneath.
Most of today’s inference stacks — vLLM, TensorRT-LLM, Ollama, and the public hyperscaler APIs they power — were engineered for a world that no longer exists: a human typing into a chat box, one question at a time. Agents break almost every assumption that architecture was built on. This post is about what the new assumptions should be.
The chat-era mismatch
The canonical inference API is stateless by design. Each call is self-contained: the client sends the entire conversation history, the server runs prefill across all of it, decodes a response, and throws the working state away. That model worked beautifully for chat, where a user takes seconds or minutes between turns, context rarely exceeds a few thousand tokens, and one call corresponds to one billable interaction.
An agent does not behave like a human. An agent runs a planning loop, calls tools, reads their output, reasons over the combined context, decides the next action, and does it again — often dozens or hundreds of times inside a single task. The conversation history grows every step. And at every step, a stateless runtime re-prefills the entire accumulated context from scratch.
Stateless inference charges the agent a prefill tax on every turn. The longer the plan, the heavier the tax: by turn fifty, the tokens from the first turn have been prefilled fifty times.
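To make the tax concrete, here is a back-of-envelope simulation. The starting context size and per-turn growth below are illustrative assumptions, not measurements:

```python
# Back-of-envelope: cumulative prefill work in a stateless agent loop.
# Illustrative assumptions: the task starts with 500 context tokens, and
# each turn appends ~150 tokens of model output plus tool results.

BASE_TOKENS = 500
GROWTH_PER_TURN = 150

def stateless_prefill_total(turns: int) -> int:
    # Turn t re-prefills everything accumulated so far: base + t * growth.
    return sum(BASE_TOKENS + t * GROWTH_PER_TURN for t in range(turns))

for turns in (10, 50):
    print(f"{turns} turns -> {stateless_prefill_total(turns):,} tokens prefilled")
# 10 turns -> 11,750 tokens prefilled
# 50 turns -> 208,750 tokens prefilled
```

Per-turn work grows linearly with the accumulated context, so the cumulative total grows quadratically with plan length.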
The practical consequences stack up fast. Time-to-first-token balloons as the context grows. GPU compute is consumed re-processing tokens that have not changed. Token budgets explode on metered APIs. Users perceive agents that “feel slow” because the loop is paying for the same work over and over. And on the security side, the batching tricks that inference servers use to claw back some of that efficiency introduce cross-session risk — a problem no enterprise CISO is willing to inherit.
The numbers nobody can ignore
Agents are coming. The infrastructure is not ready.
Four datapoints define the inflection: adoption is accelerating (74% plan to deploy within two years), customisation is the norm (85%), governance is immature (only 21% report maturity), and sovereignty already filters procurement (77% weigh country of origin). The fix for all four has to live at the runtime layer.
ARK is architected around a different default: keep state where it belongs — inside the runtime. The next four sections describe the pillars that make that practical.
Pillar 01 — Stateful by design
ARK supports session-level KV-cache persistence at the compute node layer. In plain terms: the attention key-value tensors computed during prefill are retained between calls within a session. When the next call arrives, only the new tokens get prefilled — the previously computed context is already on the GPU.
The default session idle timeout is fifteen seconds. If the next call arrives within that window, the agent loop keeps benefiting from the cached state. If it does not, the KV-cache is released and the session falls back cleanly to stateless. There is no dangling state, no unbounded GPU memory growth, and no cache that outlives its relevance.
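The mechanics are easiest to see in miniature. The sketch below is illustrative pseudocode for the retention-and-timeout behaviour just described, not ARK's implementation; plain Python objects stand in for GPU-resident KV tensors:

```python
# Illustrative sketch of session-scoped KV retention with a 15 s idle
# timeout, not ARK's code.

import time
from dataclasses import dataclass, field

IDLE_TIMEOUT_S = 15.0

@dataclass
class Session:
    prefilled_tokens: int = 0                       # context already on the GPU
    last_seen: float = field(default_factory=time.monotonic)

sessions: dict[str, Session] = {}

def tokens_to_prefill(session_id: str, total_context_tokens: int) -> int:
    """How many NEW tokens this call must prefill."""
    now = time.monotonic()
    sess = sessions.get(session_id)
    if sess is None or now - sess.last_seen > IDLE_TIMEOUT_S:
        sess = sessions[session_id] = Session()     # clean fallback to stateless
    new_tokens = total_context_tokens - sess.prefilled_tokens
    sess.prefilled_tokens = total_context_tokens
    sess.last_seen = now
    return new_tokens

def release_idle() -> None:
    # Evict expired sessions: no dangling state, no unbounded memory growth.
    cutoff = time.monotonic() - IDLE_TIMEOUT_S
    for sid in [s for s, v in sessions.items() if v.last_seen < cutoff]:
        del sessions[sid]
```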
The measured impact, verified against our November 2025 test suite (a back-of-envelope cumulative sketch follows the list):
- Token reduction: ~4,150 tokens per call (stateless) collapses to ~46 tokens per call (stateful) — 98.9% fewer prefill tokens.
- Time-to-first-token: ~1.07 s (stateless) drops to ~0.14 s (stateful) — 87% lower TTFT between steps.
- GPU compute freed: the eliminated prefill is real compute recovered — the same cluster now hosts more concurrent sessions without adding hardware.
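Compounding those per-call figures over a typical agent loop makes the scale obvious. Treating them as flat per-turn averages over a fifty-turn plan is a simplification, but a useful one:

```python
# Compounding the per-call figures above over a fifty-turn agent loop.
# Simplification: treat each measurement as a flat per-turn average.

TURNS = 50
STATELESS_TOKENS, STATEFUL_TOKENS = 4_150, 46   # prefill tokens per call
STATELESS_TTFT, STATEFUL_TTFT = 1.07, 0.14      # seconds to first token

tokens_saved = TURNS * (STATELESS_TOKENS - STATEFUL_TOKENS)
ttft_saved = TURNS * (STATELESS_TTFT - STATEFUL_TTFT)

print(f"{tokens_saved:,} prefill tokens avoided")   # 205,200 tokens
print(f"{ttft_saved:.1f} s of waiting removed")     # 46.5 s per task
```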
For customers running their own hardware, the primary value is not a lower token bill, because they are not billed per token. It is infrastructure density: more concurrent sessions on the same cluster, more headroom for burst traffic, and a deferred hardware replacement cycle.
Pillar 02 — Session-isolated, not session-leaky
Statefulness is only useful if it is safe. ARK enforces strict isolation at the session boundary: KV-caches (past-key-values), attention states, logits, sampling RNGs, and in-flight tokens never cross sessions. The architecture structurally prevents cross-session data leakage, even under aggressive batching or multi-tenant operation.
That matters doubly for agents. Agentic workloads are long-lived, often operate over sensitive context (customer data, financial positions, patient records, internal planning), and are increasingly multi-tenant within a single deployment. A runtime that cuts corners on isolation to claw back batch efficiency is a runtime that cannot be defended in a CISO review. ARK’s session-level isolation is a first-class architectural property, not a configurable mitigation.
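One way to picture isolation as an architectural property rather than a mitigation: every piece of generation-scoped state lives inside a single per-session object, so there is simply no shared mutable state for a batch to leak through. The sketch below is an illustration of that discipline, not ARK's internals:

```python
# Illustration of isolation-by-construction: all generation-scoped state
# is owned by exactly one per-session object, so batched execution has no
# shared mutable state to leak across sessions.

import random
from dataclasses import dataclass, field

@dataclass
class SessionState:
    kv_cache: list = field(default_factory=list)     # attention K/V, per session
    in_flight: list = field(default_factory=list)    # tokens being decoded
    rng: random.Random = field(default_factory=random.Random)  # never shared

def decode_step(batch: dict[str, SessionState]) -> None:
    # A kernel may fuse the math across the batch, but every read and write
    # is indexed strictly by session; nothing is addressable across rows.
    for _sid, state in batch.items():
        state.in_flight.append(state.rng.randrange(50_000))  # mock next token
```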
Pillar 03 — Built for autonomous work
Agents do not pause politely while a compute node reboots. An agent mid-plan whose inference runtime crashes has lost its working memory and its partial reasoning, and has often left downstream tool effects half-applied. The operational cost of restart-to-recover is, in most enterprise settings, unacceptable.
ARK’s architecture keeps running as long as a complete shard set exists across surviving hardware. The runtime can survive up to 99% hardware failure: throughput degrades, but the platform does not crash. This is achieved through shard redundancy, dynamic session routing, stateless orchestration, and non-blocking failure isolation — a combination competing runtimes do not provide. A compute node dropping mid-generation does not kill the agent. Rolling upgrades happen without pausing the workload. Burst capacity is added by attaching GPUs at runtime, no model reload required.
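The availability rule ("keep running as long as a complete shard set exists") can be sketched in a few lines. The topology below is hypothetical, purely to illustrate the invariant:

```python
# Hypothetical topology illustrating the invariant: the platform serves
# traffic as long as the surviving nodes still cover every model shard.

NUM_SHARDS = 4

# node -> shards it hosts; each shard is replicated on two nodes
nodes = {
    "gpu-a": {0, 1}, "gpu-b": {2, 3},
    "gpu-c": {0, 2}, "gpu-d": {1, 3},
}

def complete_shard_set(alive: set[str]) -> bool:
    hosted = set().union(*(nodes[n] for n in alive)) if alive else set()
    return hosted >= set(range(NUM_SHARDS))

alive = set(nodes)
alive.discard("gpu-b")                  # a node drops mid-generation
assert complete_shard_set(alive)        # gpu-c and gpu-d still cover 2 and 3
# Sessions on gpu-b are re-routed; throughput dips, the platform stays up.
```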
Pillar 04 — Deployed where agents act
The Deloitte survey also reports that 77% of companies now factor an AI solution’s country of origin into vendor selection, and 83% treat data residency as at least moderately important to strategic planning. For agents that touch regulated data, sovereignty is no longer a nice-to-have — it is the default procurement filter.
ARK is designed to be deployed where the agent needs to act: on-prem, in sovereign clouds, inside regulated perimeters. Agents that read customer records, move money, or interact with classified systems stay inside the organisation’s trust boundary — not inside a foreign hyperscaler’s tenancy model. The same runtime, the same OpenAI-compatible API, the same performance profile — wherever your agents legally need to live.
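Because the API surface is OpenAI-compatible, repointing an existing agent at an in-perimeter deployment is mostly a one-line change. The endpoint URL, API key, and model name below are placeholders for deployment-specific configuration:

```python
# Standard OpenAI client pointed at a self-hosted, in-perimeter endpoint.
# base_url, api_key, and model are placeholders, not real values.

from openai import OpenAI

client = OpenAI(
    base_url="https://ark.example.internal/v1",  # on-prem / sovereign endpoint
    api_key="deployment-specific",               # auth varies by perimeter
)

resp = client.chat.completions.create(
    model="your-deployed-model",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```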
Best-fit workloads
Stateful inference pays off most where calls are rapid and machine-paced within the persistence window. In practice, that means:
- Agentic multi-step reasoning chains — planners, tool-using agents, retrieval loops.
- IDE-integrated code assistance — active coding sessions where each edit prompts the model again.
- Document processing pipelines — chunked sequential analysis over long documents or corpora.
- Live meeting and call intelligence — real-time transcription, summarisation, and extraction over a rolling window.
- High-frequency support and live chat — back-and-forth that takes seconds per turn, not minutes.
For workloads where calls are minutes or hours apart, the KV-cache is released cleanly and the runtime falls back to stateless — no harm done. For workloads inside the window, the gap between ARK and a conventional stack widens with every turn.
Conclusion
Agentic AI is not a new application on top of the same infrastructure. It is a new class of workload that exposes architectural assumptions baked into chat-era inference stacks a decade ago. Re-prefilling the same tokens on every turn, crashing a whole workload when one node fails, leaking state across sessions in exchange for batch efficiency, and routing regulated data through foreign tenancies — those are not edge cases. For an agent workload, they are the normal operating conditions.
The fix is not an optimisation layer on top. It is a runtime built around a different set of assumptions: state persists, sessions isolate, nodes fail safely, and sovereignty is the default. That is what ARK is. And that is why we think the inference substrate for agentic AI looks less like what we have had, and more like what comes next.
Planning an agent deployment? Measure the loop latency on your workload.
We’ll run ARK and a stateless baseline side-by-side on your hardware in a two-week POC.