LLM Gateway for RAG & Document Q&A

RAG is a pipeline, and each stage wants a different model

A retrieval-augmented app does at least two model-bound jobs: embedding content and queries, then generating an answer from the retrieved context. The economics of those jobs are opposite: embeddings should be cheap and run at scale, while generation should be smart and sometimes long-context. Wiring up a separate provider for each stage is how RAG codebases get messy.

LLM Gateway collapses both stages into one OpenAI-compatible API: embeddings and chat, every provider, one key.

One integration for embed and generate

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.llmgateway.io/v1",
  apiKey: process.env.LLM_GATEWAY_API_KEY,
});

// 1. Embed the query with an affordable embeddings model
const embedding = await client.embeddings.create({
  model: "openai/text-embedding-3-large",
  input: userQuery,
});

// 2. Generate the answer with a strong model, given retrieved context
const answer = await client.chat.completions.create({
  model: "google-ai-studio/gemini-3.1-pro-preview", // long-context when retrieval is large
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: `${retrievedContext}\n\nQuestion: ${userQuery}` },
  ],
});
```

Both stages, one endpoint, one key — and you can change either model independently as better or cheaper options appear.
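The snippet leaves out the retrieval step that sits between the two calls. A minimal sketch of that step, assuming chunk embeddings are already computed and held in memory, could rank chunks by cosine similarity to the query vector. The `Chunk` type and the `topK` helper are hypothetical names for illustration. A production pipeline would more likely use a vector database.

```typescript
// Hypothetical shape for a stored chunk: its text plus a precomputed embedding.
interface Chunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding, best first.
function topK(queryEmbedding: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

With the OpenAI SDK, the query vector is available as `embedding.data[0].embedding`, and joining the `text` of the returned chunks gives a string that can serve as `retrievedContext`.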

Match the model to the query

Short, simple questions don't need a flagship model or a huge context window. Long, citation-heavy questions do. Routing per request lets you keep the cheap path cheap and reserve long-context generation for the queries that actually need it.
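One way to sketch per-request routing is to estimate the prompt size and pick the model string from it. Everything here is an assumption for illustration: the cheap model ID is a placeholder, the threshold is arbitrary, and the four-characters-per-token estimate is a rough heuristic for English text, not a tokenizer. Only the long-context model ID comes from the example above.

```typescript
// Placeholder ID for a fast, cheap model (assumption; substitute one your account offers).
const CHEAP_MODEL = "openai/gpt-4o-mini";
// Long-context model ID taken from the example above.
const LONG_CONTEXT_MODEL = "google-ai-studio/gemini-3.1-pro-preview";

// Hypothetical cutoff; tune it against your own logged token counts.
const LONG_CONTEXT_THRESHOLD_TOKENS = 8000;

// Rough heuristic: about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Pick the model string for one request based on estimated prompt size.
function chooseModel(retrievedContext: string, userQuery: string): string {
  const promptTokens =
    estimateTokens(retrievedContext) + estimateTokens(userQuery);
  return promptTokens > LONG_CONTEXT_THRESHOLD_TOKENS
    ? LONG_CONTEXT_MODEL
    : CHEAP_MODEL;
}
```

The returned string can be passed directly as the `model` field of `client.chat.completions.create`, so the rest of the request stays the same on both paths.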

Stop paying twice for the same context

RAG prompts are repetitive by design: the same system instructions, and often the same top passages, across a conversation or a batch run. Prompt caching means those repeated tokens don't cost full price every time — a real saving at scale.
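Provider-side prompt caches generally match on the leading tokens of a prompt, so a cache-friendly prompt puts the stable material first and the part that changes last. The sketch below assumes that prefix-based behavior applies to the model being called, which should be confirmed in the gateway and provider documentation. `buildMessages` and `sharedPrefixLength` are hypothetical helpers, not part of any SDK.

```typescript
type ChatMessage = { role: "system" | "user"; content: string };

// Build messages so that repeated content (instructions, passages) forms
// an identical prefix across requests and only the question varies at the end.
function buildMessages(
  systemPrompt: string,
  passages: string[],
  userQuery: string,
): ChatMessage[] {
  // Keep passages in a stable order; reordering them changes the prefix.
  const context = passages.join("\n\n");
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: `${context}\n\nQuestion: ${userQuery}` },
  ];
}

// Count how many leading characters two strings have in common.
// Useful as a quick check of how much of a prompt could be reused.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}
```

Two requests built from the same instructions and passages then differ only after the final `Question:` marker, which is the layout the original snippet already uses.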

Tune against real numbers

Because every embedding and generation call is logged with tokens and dollar cost, you can measure the actual price of a change to your chunk size, retrieval depth or model choice — instead of guessing.
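As a rough illustration of that kind of comparison, the sketch below computes the cost of one generation call from token counts. The `prompt_tokens` and `completion_tokens` fields follow the `usage` object in OpenAI-compatible chat responses. The prices and the two usage records are made-up numbers for illustration only. Real figures would come from the logged calls.

```typescript
// Token counts as reported in the `usage` field of a chat completion.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Price per million tokens in USD (hypothetical values, not real pricing).
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Dollar cost of a single call given its usage and the model's price.
function callCostUsd(usage: Usage, price: Price): number {
  return (
    (usage.prompt_tokens * price.inputPerMTok +
      usage.completion_tokens * price.outputPerMTok) /
    1e6
  );
}

// Illustrative comparison: doubling retrieval depth doubles the prompt tokens.
const examplePrice: Price = { inputPerMTok: 1.0, outputPerMTok: 4.0 };
const depth4: Usage = { prompt_tokens: 2000, completion_tokens: 300 };
const depth8: Usage = { prompt_tokens: 4000, completion_tokens: 300 };

const extraCostPerQuery =
  callCostUsd(depth8, examplePrice) - callCostUsd(depth4, examplePrice);
console.log(`Extra cost per query: $${extraCostPerQuery.toFixed(4)}`);
```

Multiplying the per-query difference by expected query volume shows whether the deeper retrieval is worth its price for a given quality gain.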

Frequently asked questions

Can I generate embeddings and answers through one API?

Yes. LLM Gateway exposes embeddings and chat completions through the same OpenAI-compatible endpoint and key, so your retrieval and generation stages share one integration instead of two separate provider SDKs.

How do I handle queries with a lot of retrieved context?

Route those requests to a long-context model. Because switching models is a one-line change to the model string, you can send short queries to a fast, cheap model and long-context queries to a model like Gemini 3.1 Pro without restructuring your app.

Does caching help RAG cost?

Often, yes. RAG prompts resend the same system instructions and frequently the same retrieved passages. Prompt caching avoids paying full price for those repeated tokens, which adds up across high query volumes.

Can I switch embedding or generation providers later?

Yes. The gateway abstracts the provider behind a stable API, so you can move between embedding or generation models — to chase quality or price — without rewriting your pipeline.

RAG & document Q&A: embeddings and chat through one API, long-context generation on demand, caching for the context you resend, and cost analytics per pipeline.