RAG is a pipeline, and each stage wants a different model
A retrieval-augmented app does at least two model-bound jobs: embed content and queries, then generate an answer from the retrieved context. The economics of those jobs are opposite — embeddings should be cheap and run at scale, while generation should be smart and sometimes long-context. Wiring up a separate provider for each stage is how RAG codebases get messy.
LLM Gateway collapses that into one OpenAI-compatible API: embeddings and chat, every provider, one key.
One integration for embed and generate
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.llmgateway.io/v1",
  apiKey: process.env.LLM_GATEWAY_API_KEY,
});

// userQuery, systemPrompt and retrievedContext come from your application.

// 1. Embed the query with an affordable embeddings model
const embedding = await client.embeddings.create({
  model: "openai/text-embedding-3-large",
  input: userQuery,
});

// 2. Generate the answer with a strong model, given retrieved context
const answer = await client.chat.completions.create({
  model: "google-ai-studio/gemini-3.1-pro-preview", // long-context when retrieval is large
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: `${retrievedContext}\n\nQuestion: ${userQuery}` },
  ],
});
```

Both stages, one endpoint, one key, and you can change either model independently as better or cheaper options appear.
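Between the two calls sits the retrieval step that produces `retrievedContext`. The snippet does not show it, so here is a minimal sketch, assuming chunk embeddings are already held in memory and ranked by cosine similarity. The `IndexedChunk` shape and the `topK` helper are hypothetical names for illustration; a real app would more likely query a vector database.

```typescript
// Hypothetical shape for an indexed chunk: its text plus a precomputed embedding.
interface IndexedChunk {
  text: string;
  vector: number[];
}

// Cosine similarity between two vectors of equal length.
// Returns 0 when either vector has zero magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vector lengths differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the k chunks most similar to the query vector, best match first.
function topK(queryVector: number[], chunks: IndexedChunk[], k: number): IndexedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

With the OpenAI-style response shape, the query vector is `embedding.data[0].embedding`, and joining the `text` of the returned chunks gives a string that can serve as `retrievedContext`.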
Match the model to the query
Short, simple questions don't need a flagship model or a huge context window. Long, citation-heavy questions do. Routing per request lets you keep the cheap path cheap and reserve long-context generation for the queries that actually need it.
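One way to sketch per-request routing is a small function that looks at the size of the question and the retrieved context and returns a model ID. The thresholds and the lighter model ID below are placeholders chosen for illustration, not recommendations; only the long-context ID comes from the example above.

```typescript
// Placeholder ID: substitute whichever small, inexpensive model you use.
const CHEAP_MODEL = "your-provider/your-small-model";
// Taken from the generation example earlier in this article.
const LONG_CONTEXT_MODEL = "google-ai-studio/gemini-3.1-pro-preview";

// Hypothetical limits, measured in characters for simplicity.
interface RouteThresholds {
  maxCheapQuestionChars: number;
  maxCheapContextChars: number;
}

const DEFAULT_THRESHOLDS: RouteThresholds = {
  maxCheapQuestionChars: 300,
  maxCheapContextChars: 8000,
};

// Choose the cheap model only when the question and the context are both
// small and the caller has not asked for a citation-heavy answer.
function pickModel(
  question: string,
  retrievedContext: string,
  needsCitations: boolean,
  thresholds: RouteThresholds = DEFAULT_THRESHOLDS
): string {
  const isSmall =
    question.length <= thresholds.maxCheapQuestionChars &&
    retrievedContext.length <= thresholds.maxCheapContextChars;
  return isSmall && !needsCitations ? CHEAP_MODEL : LONG_CONTEXT_MODEL;
}
```

The returned string can be passed as `model` to `client.chat.completions.create`. Character counts are a rough stand-in for tokens; a tokenizer-based count would be more precise if the thresholds matter.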
Stop paying twice for the same context
RAG prompts are repetitive by design: the same system instructions, and often the same top passages, across a conversation or a batch run. Prompt caching means those repeated tokens don't cost full price every time — a real saving at scale.
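Provider-side prompt caching commonly matches on a shared prompt prefix, so it can help to keep the stable parts of the prompt first and byte-identical across requests. The sketch below assumes that behavior; how caching is triggered for a given provider is not covered here. `RetrievedPassage` and `buildMessages` are hypothetical names.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical passage shape: a stable identifier plus the passage text.
interface RetrievedPassage {
  id: string;
  text: string;
}

// Build messages so that everything before "Question:" is identical whenever
// the system prompt and the set of passages are the same.
function buildMessages(
  systemPrompt: string,
  passages: RetrievedPassage[],
  question: string
): ChatMessage[] {
  // Sort a copy by id so retrieval order does not change the serialized prefix.
  const stableContext = passages
    .slice()
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
    .map((p) => `[${p.id}] ${p.text}`)
    .join("\n\n");

  return [
    { role: "system", content: systemPrompt },
    // The variable part (the question) goes last.
    { role: "user", content: `${stableContext}\n\nQuestion: ${question}` },
  ];
}
```

Sorting by id trades away relevance ordering for a stable prefix. If passage order affects answer quality in your setup, keep the retrieval order and accept fewer cache hits.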
Tune against real numbers
Because every embedding and generation call is logged with tokens and dollar cost, you can measure the actual price of a change to your chunk size, retrieval depth or model choice — instead of guessing.
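To compare two configurations, one approach is to tag each run with a label and total the logged tokens and cost per label. The sketch below assumes you have copied or exported the logged calls into records of the hypothetical `CallRecord` shape; the field names are illustrative and not the gateway's log schema.

```typescript
// Hypothetical record for one logged call.
interface CallRecord {
  experiment: string; // label you assign, for example "chunk-512"
  stage: "embed" | "generate";
  totalTokens: number;
  costUsd: number;
}

interface ExperimentTotals {
  calls: number;
  totalTokens: number;
  costUsd: number;
}

// Sum calls, tokens and dollar cost for each experiment label.
function totalsByExperiment(records: CallRecord[]): Record<string, ExperimentTotals> {
  const totals: Record<string, ExperimentTotals> = {};
  for (const record of records) {
    const existing: ExperimentTotals | undefined = totals[record.experiment];
    const current = existing ?? { calls: 0, totalTokens: 0, costUsd: 0 };
    current.calls += 1;
    current.totalTokens += record.totalTokens;
    current.costUsd += record.costUsd;
    totals[record.experiment] = current;
  }
  return totals;
}
```

Running the same set of test queries under each label keeps the comparison fair: the difference in `costUsd` between two labels is then the price of the change being tested.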