All use cases

Use case

RAG & document Q&A

Mix the cheapest embeddings with the best generation models — behind one API, with caching and fallback.

Embeddings and chat, one API

Generate embeddings and answers through the same OpenAI-compatible endpoint and key. No separate integrations for each stage of the pipeline.

Long-context generation on demand

When a query pulls in a lot of retrieved context, route to a long-context model like Gemini 3.1 Pro without changing your code.

Cache the context you resend

Prompt caching cuts the cost of the system prompt and retrieved passages that repeat across a conversation or a batch of questions.

Cost analytics per pipeline

See exactly what embedding and generation cost per query, so you can tune chunking, retrieval and model choice against real numbers.

RAG is a pipeline, and each stage wants a different model

A retrieval-augmented app does at least two model-bound jobs: embed content and queries, then generate an answer from the retrieved context. The economics of those jobs are opposite — embeddings should be cheap and run at scale, while generation should be smart and sometimes long-context. Wiring up a separate provider for each stage is how RAG codebases get messy.

LLM Gateway collapses that into one OpenAI-compatible API: embeddings and chat, every provider, one key.

One integration for embed and generate

1import OpenAI from "openai";2
3const client = new OpenAI({4  baseURL: "https://api.llmgateway.io/v1",5  apiKey: process.env.LLM_GATEWAY_API_KEY,6});7
8// 1. Embed the query with an affordable embeddings model9const embedding = await client.embeddings.create({10  model: "openai/text-embedding-3-large",11  input: userQuery,12});13
14// 2. Generate the answer with a strong model, given retrieved context15const answer = await client.chat.completions.create({16  model: "google-ai-studio/gemini-3.1-pro-preview", // long-context when retrieval is large17  messages: [18    { role: "system", content: systemPrompt },19    { role: "user", content: `${retrievedContext}\n\nQuestion: ${userQuery}` },20  ],21});

Both stages, one endpoint, one key — and you can change either model independently as better or cheaper options appear.

Match the model to the query

Short, simple questions don't need a flagship model or a huge context window. Long, citation-heavy questions do. Routing per request lets you keep the cheap path cheap and reserve long-context generation for the queries that actually need it.

Stop paying twice for the same context

RAG prompts are repetitive by design: the same system instructions, and often the same top passages, across a conversation or a batch run. Prompt caching means those repeated tokens don't cost full price every time — a real saving at scale.

Tune against real numbers

Because every embedding and generation call is logged with tokens and dollar cost, you can measure the actual price of a change to your chunk size, retrieval depth or model choice — instead of guessing.

Frequently asked questions

Can I generate embeddings and answers through one API?

Yes. LLM Gateway exposes embeddings and chat completions through the same OpenAI-compatible endpoint and key, so your retrieval and generation stages share one integration instead of two separate provider SDKs.

How do I handle queries with a lot of retrieved context?

Route those requests to a long-context model. Because switching models is a one-line change to the model string, you can send short queries to a fast, cheap model and long-context queries to a model like Gemini 3.1 Pro without restructuring your app.

Does caching help RAG cost?

Often, yes. RAG prompts resend the same system instructions and frequently the same retrieved passages. Prompt caching avoids paying full price for those repeated tokens, which adds up across high query volumes.

Can I switch embedding or generation providers later?

Yes. The gateway abstracts the provider behind a stable API, so you can move between embedding or generation models — to chase quality or price — without rewriting your pipeline.

One API for every model

Route across 200+ models with fallback, caching and per-request cost analytics. Start free in minutes.