Q2 2026: Speech, Embeddings & Coding Plans

Three months of updates: speech generation and Audio Studio, OpenAI-compatible embeddings, OCR, DevPass coding plans, chat subscription plans, enterprise IAM and master keys, SOC 2 Type II, 40+ new models, and much more.

June 30, 2026

Most teams want one API for everything they build with — text, speech, embeddings, images, video — without juggling a separate vendor and SDK for each. Q2 was about closing that gap. LLM Gateway added speech generation and embeddings as first-class OpenAI-compatible endpoints, shipped fixed-price coding plans and chat subscriptions, tightened enterprise access controls, earned SOC 2 Type II, and brought 40+ new models online. Here's everything that shipped from April through June.

By the Numbers

The quarter in traffic, across every project on the platform:

27,038,440 requests routed
207.9B tokens processed — 195.8B input, 120B cached, 11.8B output

DeepSeek V4, Grok 4.1 Fast, and Gemini 3 Flash drove the most volume. But the bigger story is how much of it is no longer plain text: Gemini Embedding 2 ranks second by request count, and GPT Image 2 lands in the top ten — proof that embeddings, images, and audio now move serious traffic through the gateway.

Top 10 models by tokens

#	Model	Tokens
1	deepseek-v4-pro	48.6B
2	grok-4-1-fast-non-reasoning	29.2B
3	gemini-3-flash-preview	27.0B
4	deepseek-v4-flash	23.0B
5	claude-sonnet-4-6	7.7B
6	claude-opus-4-8	6.4B
7	gemini-3-pro-image-preview	6.2B
8	gemini-embedding-2	5.8B
9	glm-5.2	5.1B
10	claude-opus-4-6	4.6B

Top 10 models by requests

#	Model	Requests
1	grok-4-1-fast-non-reasoning	10,605,050
2	gemini-embedding-2	5,887,455
3	gemini-3-flash-preview	2,406,179
4	gemini-3-pro-image-preview	2,122,872
5	deepseek-v4-flash	1,158,043
6	gemini-3.1-flash-image-preview	668,547
7	gpt-image-2	543,958
8	deepseek-v4-pro	540,795
9	grok-4-1-fast	300,313
10	deepseek-v3.2	289,820

Speech & Audio

Audio Studio in the LLM Gateway Playground with multi-track speech generation

Text-to-speech now runs through the same gateway as the rest of your stack:

/v1/audio/speech — An OpenAI-compatible speech endpoint backed by ElevenLabs, Google Gemini, and more, so you switch voices and providers without changing your code.
ElevenLabs provider — Native text-to-speech with per-character pricing tracked on every request.
Google audio — Audio support for Google models, wired into the Playground.
Audio Studio — A dedicated workspace in the Playground for generating and previewing speech.

Read the speech docs

Embeddings

OpenAI-compatible embeddings turning text into vectors for semantic search and RAG

/v1/embeddings is now OpenAI-compatible across providers, so retrieval and semantic search work without provider-specific glue:

Google embeddings via gemini-embedding-001, plus Google Vertex embeddings
Same-provider key fallback — Embedding requests fail over to your other keys on the same provider
Routing metadata and key health included in embedding responses, just like chat completions

Read the embeddings docs

OCR

/v1/ocr — Extract structured text from documents and images with mistral-ocr-latest.
Chat OCR — The Playground reads text out of uploaded images directly in a conversation.

Read the OCR docs

Video Generation

ByteDance Seedance video generation models in the LLM Gateway model selector

We expanded video beyond Q1's launch with new models and input modes:

ByteDance Seedance 2.0, 2.0 Fast, and 1.5 Pro — including reference-video input and first/last-frame control on Seedance 2.0
Alibaba Wan 2.6 — text-to-video
MiniMax Hailuo 2.3
AtlasCloud Kling v3
Grok Imagine Video 1.5 — promoted out of preview

Image Generation

Image Studio in the LLM Gateway Playground generating images with GPT Image 2

gpt-image-2 — Added from OpenAI and via Azure OpenAI, with quality and size pass-through for accurate, resolution-based pricing.
Reve — A new image-generation provider.

Responses API

/v1/responses — Full support for OpenAI's Responses API.
/v1/responses/compact — A compact variant for smaller payloads.
item_reference resolution — Input items referenced by ID are resolved server-side.

DevPass: Coding Plans

DevPass gives you a fixed monthly price for coding agents like Claude Code, Codex, Cline, and Cursor — frontier models without metered per-token billing. Q2 turned it into a complete product:

Restricted to coding agents and root-model routing — Plans cover inference for supported agents, keeping pricing predictable
Annual billing alongside monthly
Invoices and shared billing details across the dashboard
Public DevPass profiles to show off what you've built
Social and passkey sign-in
Cancellation feedback flow and lifecycle notifications
New integration guides for Pi, Continue, Hermes, and Cursor plan mode

Get your DevPass

Chat Subscription Plans & Playground

Chat subscription plans, service tiers, and the SDK sandbox

The chat Playground gained Starter, Plus, and Pro subscription plans plus a wave of workflow features:

Forking, message editing, and chat reset — Branch a conversation or rewind it
Temporary chats that leave no history, and pinned chats in the sidebar
Public share links with a redesigned share dialog and OpenGraph images
Chat history search across every conversation
Comparison mode persistence — Your multi-model setup sticks between sessions
AI chat support replacing Crisp, with suggested answers and one-click human escalation
Image, Video, and Audio Studios plus a Canvas page for longer-form work

Open the Playground

Routing & Reliability

Routing got smarter about cost, latency, and stickiness:

Per-request and per-project routing strategy — Choose how the gateway picks providers at either level
Sticky session routing via the x-session-id header, so a conversation stays on one provider
Stable preferred-provider routing for predictable provider selection
Image-aware token estimates feed auto-routing for more accurate cost weighting
Provider service tiers — Flex and priority tiers (including Vertex), gated to your own provider keys
Faster provider-downtime reaction and AWS Bedrock region routing with a global default

Read the routing docs

Enterprise & Security

Integrated guardrails and custom rules for enterprise safety policies

SOC 2 Type II — LLM Gateway completed its SOC 2 Type II audit. Read the announcement
IAM rules — Restrict API keys by IP CIDR range (Enterprise)
Master keys — Provision and manage keys programmatically, with dedicated IAM rule routes
Per-key custom model catalog — Enterprise organizations expose a curated model list per key
Per-project routing overrides — Pin providers and policies at the project level
Provider compliance policies and legal metadata surfaced per provider
Guardrails redact action — Mask sensitive content instead of blocking it outright
Enterprise trial and lifted seat, project, and key limits for enterprise plans

Explore Enterprise

API Key Lifecycle

TTL expiration — Set an expiry on any API key
Roll secret — Rotate a key's secret without changing its ID or breaking integrations
See our API key rotation guide for the full pattern

Embeddable Payments SDK

For platforms that resell or meter LLM usage to their own users:

Embeddable end-user wallets — Give your users their own credit balances
SDK settings and sandbox test keys for safe local development
Opt-in preview behind a feature flag

Read the payments SDK docs

New Models

Q2 added more than 40 models across providers:

Anthropic

Claude Opus 4.8 (Anthropic and AWS Bedrock)
Claude Opus 4.7 with adaptive thinking
Adaptive thinking for Opus 4.6
Claude Sonnet 4.6 with a 1M-token context window
Claude Fable 5 (Anthropic and AWS Bedrock)

OpenAI

GPT-5.5 family
gpt-image-2 (OpenAI and Azure OpenAI)

xAI

Grok 4.3 and Grok Build 0.1, plus grok-4.20 via Vertex AI
Grok Imagine Video 1.5

DeepSeek

DeepSeek V4 Pro and V4 Flash across Alibaba, Novita, and CanopyWave
DeepSeek V4 in Alibaba's Singapore region
Reasoning enabled for DeepSeek V3.2 on Novita

Open & Frontier Models

GLM-5.1 and GLM-5.2 across Z.ai, EmberCloud, Together AI, and Novita
Kimi K2.6, K2.7 Highspeed, and K2.7 Code across Moonshot, CanopyWave, Novita, and Together AI
MiniMax M3 and tool calling on MiniMax M2.7
Qwen3.6 (Max Preview, Plus, 35B-A3B) and Qwen3.7 (Max, Plus) across Alibaba and Novita
Gemma 4, Gemini 3.5 Flash, and Gemini 3.1 Flash Lite
Nemotron 3 Ultra 550B, Xiaomi MiMo, and Sakana fugu-ultra

Explore all models

New Providers

ElevenLabs — Text-to-speech
Reve — Image generation
DeepInfra — Inference provider
Bluestone and extended Together AI coverage
vertex-anthropic and a discounted anthropic-discount provider
Azure AI Foundry — Grok models, gpt-oss-120b, and custom Foundry deployment names
Vertex AI partner models — 13 new OpenAI-compatible mappings

Browse all providers

Analytics & Admin

Usage analytics and cost breakdown in the LLM Gateway dashboard

Organization analytics — Member, API-key, and per-source usage breakdowns
Hourly history rollups — Faster long-range charts, with hourly buckets beyond 24 hours in the admin dashboard
Model categorization and weekly fair-use caps for premium-tier models
A steady stream of admin dashboard improvements: cost-share views, sortable provider tables, error breakdowns by source, and custom date ranges

Billing & Payments

Cache-write billing for Anthropic, AWS Bedrock, and Alibaba
International payment fee handling
Credit top-up minimum raised to $10

Deployment

Helm chart — Self-host LLM Gateway on Kubernetes with a maintained chart. See the self-hosting guide.

Docs, SEO & Comparisons

Fumadocs upgrade with Ask AI — Ask questions against the docs in natural language
Enriched llms.txt and a sitemap page for AI crawlers
Refreshed comparison pages with provider logos and OpenGraph images
Community model ratings — Rate any model after 100 requests
New enterprise SEO pages and a growing library of guides

Explore all models | Try the Playground | Get started now