RAG as operational memory: context engineering for software

RAG is usually explained as a technique to “connect an LLM to your documents.” That definition is fine to get going, but it breaks the moment you try to take it to a real system.

What I’m sharing here isn’t the theory — it’s the process: how I ran into the problems one by one while building RAG systems for software (source code, technical documentation, glossaries, change history, configuration, tickets, and operational knowledge), and which decision solved each one. I can’t name projects for confidentiality reasons, but I can share the technical journey and the mistakes I’d avoid if I started over.

The conclusion I landed on fits in one line:

The brief: give context about software that never stops changing

The first problem showed up before I wrote a single line of code. The knowledge that mattered changed all the time:

new commits every day;
documentation that aged badly;
conventions that only live in the code;
different configuration per environment;
modules only part of the team understands;
execution state that depends on the exact moment of the query.

Training or fine-tuning a model for every change made no sense. The idea that organized the design comes from the original RAG paper (Lewis et al., 2020): combine the model’s parametric memory (its linguistic and reasoning capability) with an external, non-parametric memory queried through retrieval, which supplies the up-to-date, specific, traceable knowledge.

That separation was the first architectural decision: I don’t ask the model to remember what changes, I hand it over fresh on every query. But retrieving well turned out to be much harder than I expected.

The first bottleneck: the model is only as good as what it retrieves

Retrieval is the stage that decides which fragments of external knowledge reach the model before it generates an answer. The basic flow is straightforward:

The user asks a question.
The system turns it into a searchable representation.
It searches an index for the most relevant documents or chunks.
It selects a small set of results, the top K.
It inserts those results into the prompt.
The model answers using the question and the retrieved context.

In vector search, both the question and the chunks become embeddings: vectors that represent approximate meaning. The retriever compares the similarity between the question vector and the index vectors; if two are close, they’re assumed to be talking about the same thing.

Figure: 2D projection of the embedding space (the real one has hundreds of dimensions). Each chunk is a vector, and the retriever returns the K closest to the question vector.

The full flow, from question to prompt, looks like this:

question: "why is this integration failing"
        -> embedding of the question
        -> search in the vector index
        -> top 5 chunks: API contracts, config, known errors
        -> those chunks enter the prompt
        -> the LLM drafts an explanation

The retriever doesn’t understand or write: its only job is to shrink the search space, from a corpus of tens of thousands of chunks to a small top K —between 5 and 12 in my systems— that probably contains the evidence. That’s why it’s the most delicate piece of the whole system.
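The "closest K vectors" step can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the distance measure and toy 2D vectors in place of real embeddings; the chunk names and values are hypothetical, and a real system would use a vector index rather than a full scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=5):
    """Return the ids of the k chunks whose vectors are closest to the query."""
    scored = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy 2D vectors standing in for real embeddings (hypothetical values).
index = {
    "api_contract": [0.9, 0.1],
    "staging_config": [0.7, 0.3],
    "onboarding_guide": [0.1, 0.9],
}
question_vec = [1.0, 0.0]
print(top_k(question_vec, index, k=2))
```

Nothing here understands the question: the function only ranks by geometric closeness, which is why everything upstream (chunking, metadata, query wording) matters so much.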

The architecture I ended up with

As I solved problems, the system settled into a pattern with separate responsibilities: loaders that normalize heterogeneous sources, chunking that splits them into retrievable units, metadata that adds structure, a hybrid retriever, tools for what’s live, and a prompt builder that assembles the context with rules and citations.

Figure: the center of the architecture isn’t the LLM; it’s the context built for it.

The underlying lesson is that the LLM isn’t at the center. The center is the context. If it arrives well built, the prompt can be simple; if it arrives badly, you end up trying to fix with instructions what was really a data problem.

Cutting by tokens broke the meaning: structure-based chunking

Naive chunking cuts every N tokens with a fixed overlap. For narrative documentation that’s enough; for software, it isn’t. The first bad answers came from chunks that split a function in half or merged two unrelated ideas.

I attacked it by letting structure define the unit. In code and technical systems, a useful chunk is usually:

a class or a public function;
a module;
a configuration section;
a glossary entry;
a documentation section;
a changelog block;
a small file worth keeping whole.

The rule that stuck with me:

A chunk should resemble the mental unit a person would use to answer the question.

If someone asks about a failing integration, they don’t want five arbitrary pieces of text. They want to know which component is involved, from which layer, over which contract, and with which configuration. That’s why I prefer to parse when the corpus has structure; and when there’s no parser, I at least use path, extension, folder conventions, and simple syntactic boundaries. The embedding shouldn’t have to guess that services/, docs/decisions/, or config/ mean different things: the system already knows that.
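As an illustration of letting structure define the unit, here is a minimal sketch that splits a source file into one chunk per top-level function or class. It assumes Python source and the standard `ast` module purely for demonstration; the sample code and the `services/billing/client.py` path are hypothetical, and other languages would need their own parser.

```python
import ast

def chunk_python_source(source: str, file_path: str) -> list[dict]:
    """Split a Python module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file_path": file_path,
                "symbol": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                # Exact source text of the definition, never cut mid-function.
                "text": ast.get_source_segment(source, node),
            })
    return chunks

SAMPLE = '''
def create_invoice(amount):
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"amount": amount}

class BillingClient:
    def send(self, invoice):
        return True
'''

chunks = chunk_python_source(SAMPLE, "services/billing/client.py")
for chunk in chunks:
    print(chunk["kind"], chunk["symbol"])
```

Each chunk carries its symbol name and path from the start, so the metadata does not have to be reconstructed later from the text.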

The index returned noise: metadata before the vector

One of the biggest quality jumps came from no longer treating each chunk as plain text. A useful chunk almost always needs metadata:

{
  "source": "code",
  "file_path": "services/billing/client.ts",
  "kind": "service_client",
  "module": "billing",
  "symbol": "createInvoice",
  "source_layer": "application",
  "chunk_index": 0
}

That metadata let me filter before retrieving: “only our own code”, “only integrations”, “only citable sources”, “only this module”, “only this corpus version”, “only files allowed for the user’s role”. It cuts noise, improves precision, and as a bonus improves security, because permissions are applied before the content reaches the prompt.
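Filtering before retrieval can be as simple as a predicate over the metadata. The sketch below is one possible shape, assuming chunks are plain dictionaries with the fields shown earlier; the `allowed_modules` permission rule and the sample data are hypothetical.

```python
def filter_chunks(chunks, allowed_modules, **required):
    """Keep chunks the caller may see and whose metadata matches every required field."""
    return [
        chunk for chunk in chunks
        if chunk["module"] in allowed_modules
        and all(chunk.get(key) == value for key, value in required.items())
    ]

chunks = [
    {"id": "a", "source": "code", "module": "billing", "kind": "service_client"},
    {"id": "b", "source": "docs", "module": "billing", "kind": "guide"},
    {"id": "c", "source": "code", "module": "payroll", "kind": "service_client"},
]

# The user's role only grants access to the billing module (hypothetical rule).
candidates = filter_chunks(chunks, allowed_modules={"billing"}, source="code")
print([chunk["id"] for chunk in candidates])
```

Only the surviving candidates are scored by similarity, so a chunk the user may not read never competes for a place in the prompt.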

The combined impact of structure-based chunking, metadata, and hybrid retrieval was the single biggest jump of the whole project:

Correct evidence in top K:

token-based chunking, plain text: ~4/10
structure + metadata + hybrid: ~9/10

Approximate numbers, measured on my internal test set of real questions. The order of magnitude is what matters, not the decimal.

The corpus changes every day: keeping the index fresh

The problem that opened the project —software that never stops changing— came back as an uncomfortable question: what good is excellent retrieval over last week’s index? Re-indexing the whole corpus on every change stopped being viable as soon as it grew past a few thousand chunks.

What worked was incremental ingestion:

each chunk stores a hash of its content and its source;
a change —a commit, a documentation edit, a new configuration— triggers re-ingestion only for the affected files;
only what changed gets re-chunked and re-embedded; the rest of the index is left untouched;
orphaned chunks, whose file or section no longer exists, are removed in the same pass.

Every update bumps the corpus version —the same one that later becomes part of the cache key. And there’s a hidden cost worth knowing before you pick an embedding model: switching it invalidates the entire index, because vectors from different models aren’t comparable. Re-embedding the whole corpus is the most expensive migration in the system, and index versioning also exists to make it possible without downtime: the old index stays live while the new one is built.
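The hash comparison behind incremental ingestion can be sketched as a planning step that decides what to re-embed and what to delete. This assumes SHA-256 over the chunk text and a simple id-to-hash mapping as the stored state; the chunk ids and contents are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_ingestion(indexed: dict[str, str], current: dict[str, str]):
    """Compare stored hashes with current content.

    indexed: chunk id -> hash stored in the index
    current: chunk id -> current text from the source
    Returns (ids to re-embed, orphaned ids to remove).
    """
    to_embed = sorted(
        chunk_id for chunk_id, text in current.items()
        if indexed.get(chunk_id) != content_hash(text)
    )
    orphans = sorted(chunk_id for chunk_id in indexed if chunk_id not in current)
    return to_embed, orphans

indexed = {
    "billing/client.ts#createInvoice": content_hash("old body"),
    "billing/client.ts#cancelInvoice": content_hash("same body"),
    "billing/legacy.ts#oldExport": content_hash("removed"),
}
current = {
    "billing/client.ts#createInvoice": "new body",
    "billing/client.ts#cancelInvoice": "same body",
    "billing/client.ts#refundInvoice": "brand new",
}
to_embed, orphans = plan_ingestion(indexed, current)
print(to_embed)
print(orphans)
```

Unchanged chunks never reach the embedding model, which is where most of the cost of a full re-index goes.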

Exact symbols got lost: hybrid retrieval

Vector search is good for semantic intent: “where is the final price computed?” can find code that doesn’t contain those words. But it fails with identifiers, method names, paths, classes, exact errors, hashes, and strings — and in software those details matter enormously.

Figure: hybrid retrieval. Intent through vectors, exactness through lexical search, control through metadata, then rerank.

The fix was to combine signals instead of picking one:

vector search for intent;
exact match or BM25 for symbols;
structural filters by metadata;
reranking when the initial top K brings too much noise;
manual injection of critical context when I know it must not be lost.

That last point matters: if the user asks from a specific screen, ticket, or file, sometimes it’s worth guaranteeing that the chunk from that document enters the prompt even when its score isn’t the highest. It isn’t gaming the result; it’s acknowledging that the system has signals the embedding doesn’t see.
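One common way to combine a vector ranking with a lexical one is reciprocal rank fusion. The text above does not say which fusion method was used, so treat this as an assumed technique swapped in for illustration; the chunk ids, the constant `k=60`, and the rule that pinned chunks always go first are likewise hypothetical choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

def build_top_k(vector_hits, lexical_hits, pinned, top_k):
    """Pinned chunks go first; the remaining slots are filled from the fused ranking."""
    fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
    result = list(pinned)
    for chunk_id in fused:
        if chunk_id not in result:
            result.append(chunk_id)
    return result[:top_k]

vector_hits = ["pricing_doc", "create_invoice", "tax_rules"]
lexical_hits = ["create_invoice", "invoice_errors"]
print(build_top_k(vector_hits, lexical_hits, pinned=["ticket_context"], top_k=3))
```

A chunk that appears in both rankings rises above one that tops a single list, and the pinned chunk keeps its place regardless of score.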

There’s one more technique I’d add from day one: rewriting the query before searching. Real questions arrive vague (“why is this failing”, “checkout is broken”), and betting everything on a single embedding of that phrase is fragile. Generating two or three variants (the question rephrased in domain terms, the likely symbols, the suspected module) and retrieving with all of them improves recall without touching the index. I got to this late, at the end of the journey; it’s among the first things I’d set up in the next system.
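Retrieving with several variants amounts to running the search once per variant and merging the hits. A minimal sketch, assuming the variants have already been produced (in practice probably by a model call, not shown here); the lookup table and chunk ids are hypothetical stand-ins for a real retriever.

```python
def retrieve_with_variants(variants, search, top_k):
    """Run every query variant and merge the hits, keeping first-seen order."""
    merged = []
    for variant in variants:
        for chunk_id in search(variant):
            if chunk_id not in merged:
                merged.append(chunk_id)
    return merged[:top_k]

# Stand-in for a real retriever: a fixed lookup table (hypothetical data).
FAKE_INDEX = {
    "checkout is broken": ["checkout_readme"],
    "payment authorization fails during checkout": ["payments_client", "checkout_readme"],
    "PaymentGateway.authorize error": ["payments_client", "gateway_errors"],
}

def fake_search(query):
    return FAKE_INDEX.get(query, [])

variants = list(FAKE_INDEX)  # original question plus two rewrites
print(retrieve_with_variants(variants, fake_search, top_k=5))
```

The vague original question alone would have surfaced a single chunk; the rewrites bring in the ones that mention the actual symbols.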

The index doesn’t know what’s happening now: tools

Some questions the index should never answer: “what changed in the last deploy?”, “what error shows up in the sanitized logs?”, “who touched this line?”, “what configuration is active right now?”. That isn’t indexable knowledge, it’s live state.

I solved it with very concrete tools — and the smaller and more auditable, the better:

get_recent_deploys     git_log
get_sanitized_logs     git_blame
get_related_tickets    git_show
search_commits         recent_activity

The split became clear: the index answers “what does this mean” or “where is it documented”; the tools answer “what’s happening now.” That separation cut hallucination a lot, because the model stopped imagining state and history: now it queries them.

An optimization I liked was running some tools before calling the model. If a support question almost always needs service state, recent deploys, and related tickets, the backend runs them in parallel and puts them in the prompt as base context. That removes two or three tool-calling rounds (each one several seconds in my case), cuts perceived latency to less than half on typical support questions, and keeps the model from forgetting to look at the obvious.
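Running the usual tools in parallel before the model call might look like the sketch below. It assumes `asyncio` and reuses three tool names from the list above; their bodies are hypothetical stubs, and the choice to report a failing tool as "unavailable" instead of aborting is an assumption, not something the text specifies.

```python
import asyncio

async def get_recent_deploys():
    await asyncio.sleep(0.01)  # stands in for a real read-only call
    return ["deploy-1", "deploy-2"]

async def get_related_tickets():
    await asyncio.sleep(0.01)
    return ["ticket-7"]

async def get_sanitized_logs():
    await asyncio.sleep(0.01)
    raise TimeoutError("log backend did not answer")

async def gather_base_context():
    """Run the usual tools concurrently; a failing tool is reported, not fatal."""
    tools = {
        "recent_deploys": get_recent_deploys,
        "related_tickets": get_related_tickets,
        "sanitized_logs": get_sanitized_logs,
    }
    results = await asyncio.gather(*(tool() for tool in tools.values()), return_exceptions=True)
    context = {}
    for name, result in zip(tools, results):
        context[name] = f"unavailable: {result}" if isinstance(result, Exception) else result
    return context

base_context = asyncio.run(gather_base_context())
print(base_context)
```

The total wait is roughly that of the slowest tool, not the sum of all three, and the model starts with the state already in its prompt.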

Answers you couldn’t verify: prompt-as-contract and citations

With good context in place, the next problem was trust. The system prompt couldn’t be a list of vague wishes; it had to work as an operational contract: answer in the expected language, don’t make things up without evidence, prefer local sources over generic theory, cite only what’s present in the context, use tools when the question depends on current state.

What took me longest to learn were the negative rules, and every one of them came from a real mistake.

Citations were the other front. Asking for “include sources” isn’t enough: the model invents formats, cites chunks it didn’t use, or mixes references. I solved it by making them structured, as a contract between retriever, prompt, and UI:

The retriever hands over each chunk with a controlled, citable tag.
The prompt only allows those formats.
The final answer is cleaned before it’s shown.
The backend reconstructs the citation list from valid tags.
If there’s no source, there’s no citation.

In practice it looks like this:

# each chunk arrives with a controlled, citable tag
[[src:billing/client.ts#createInvoice]]
  "The billing client validates the amount..."

# the model can only cite using those tags
"...validated before issuing
 [[src:billing/client.ts#createInvoice]]."

# the backend ignores anything that isn't a known tag
# and rebuilds the citation as a link to the file or commit

For internal knowledge this is key: if someone makes a decision based on an answer, they need to be able to open the file, commit, or document that backs it up.
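The backend side of that contract can be sketched with a regular expression and a set of known tags. This is a minimal illustration assuming the `[[src:...]]` format shown above; the numbered `[1]` rendering and the `billing/retry.ts#policy` tag (standing in for an invented citation) are hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"\[\[src:([^\]]+)\]\]")

def extract_citations(answer: str, known_tags: set[str]):
    """Return the cleaned answer and the ordered list of valid citations."""
    citations = []

    def replace(match):
        tag = match.group(1)
        if tag not in known_tags:
            return ""  # unknown tag: drop it instead of showing an invented source
        if tag not in citations:
            citations.append(tag)
        return f"[{citations.index(tag) + 1}]"

    cleaned = TAG_PATTERN.sub(replace, answer)
    return cleaned, citations

known = {"billing/client.ts#createInvoice"}
answer = (
    "The amount is validated before issuing [[src:billing/client.ts#createInvoice]]. "
    "Retries are configured elsewhere [[src:billing/retry.ts#policy]]."
)
cleaned, citations = extract_citations(answer, known)
print(cleaned)
print(citations)
```

Because the citation list is rebuilt from tags the retriever actually handed over, a source the model made up can never reach the UI as a link.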

What breaks in production: caching, fallback, and security

The decisions above made the system good on paper. Taking it to production surfaced three more problems.

Caching. Caching by the question text seemed obvious, until the same question started meaning different things depending on the corpus version, active context, operational state, or the user’s permissions. The cache key had to include those factors, starting with the corpus version.
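A cache key that reflects those factors could be built like this. The sketch assumes the key is a SHA-256 hash over a normalized JSON payload; the field names and sample values are hypothetical, and operational state is left out on purpose, on the assumption that answers depending on live state are better not cached at all.

```python
import hashlib
import json

def cache_key(question, corpus_version, active_context, permissions):
    """Hash everything that can change the meaning of the answer."""
    payload = {
        "question": " ".join(question.lower().split()),  # normalize case and spacing
        "corpus_version": corpus_version,
        "active_context": active_context,
        "permissions": sorted(permissions),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

key_a = cache_key("Why is checkout failing?", "v41", "ticket-123", {"billing"})
key_b = cache_key("why is checkout  failing?", "v41", "ticket-123", {"billing"})
key_c = cache_key("Why is checkout failing?", "v42", "ticket-123", {"billing"})
print(key_a == key_b, key_a == key_c)
```

Bumping the corpus version changes every key at once, so stale answers expire without an explicit invalidation step.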

Fallback. Providers fail, models rate-limit, context grows long. The pattern that saved me was a degraded mode: if the LLM doesn’t respond, return the most relevant retrieved passages with a clear message. It isn’t elegant, but the user isn’t blocked, the sources are still available, and observability records the real error. That’s what separates a demo from a usable tool.
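The degraded mode is essentially a try/except around the model call. A minimal sketch, assuming the model call and the error logger are passed in as plain functions; `failing_llm`, the notice wording, and the response shape are hypothetical.

```python
def answer_with_fallback(question, passages, call_llm, log_error):
    """Try the model; on failure, return the retrieved passages with a clear notice."""
    try:
        return {"mode": "full", "text": call_llm(question, passages), "sources": passages}
    except Exception as error:  # provider down, rate limit, context too long...
        log_error(error)  # observability still records the real failure
        notice = "The assistant is unavailable right now. These are the most relevant passages:"
        return {"mode": "degraded", "text": notice, "sources": passages}

def failing_llm(question, passages):
    raise TimeoutError("provider timed out")

errors = []
result = answer_with_fallback(
    "why is this integration failing",
    ["billing/client.ts#createInvoice"],
    call_llm=failing_llm,
    log_error=errors.append,
)
print(result["mode"], len(errors))
```

The retrieval work is not wasted when the provider fails: the user still gets the evidence, just without the drafted explanation.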

Security. Indexing repos and internal documentation, and exposing read tools, creates another way to access the organization’s knowledge. The measures I consider minimal: path allowlists; a denylist for .env, secrets, and raw logs; separation by project; access control before retrieval; read-only tools by default; auditing of questions, tools, and errors; and a review of which sources are citable.
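The path rules can be checked at ingestion time with simple glob patterns. This sketch assumes `fnmatch`-style matching, where `*` also matches path separators; the concrete patterns are hypothetical and would need tuning for a real repository layout.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: only these folders are indexable...
ALLOWED = ["services/*", "docs/*", "config/*"]
# ...and these are never indexed, even inside an allowed folder.
DENIED = ["*.env", "*.env.*", "*secrets*", "*.log"]

def is_indexable(path: str) -> bool:
    """A path is indexable only if it is allowlisted and matches no denylist pattern."""
    if any(fnmatchcase(path, pattern) for pattern in DENIED):
        return False
    return any(fnmatchcase(path, pattern) for pattern in ALLOWED)

for path in ["services/billing/client.ts", "config/.env", "docs/debug.log", "scripts/deploy.sh"]:
    print(path, is_indexable(path))
```

The denylist is evaluated first, so a secret inside an allowed folder is still rejected, and anything outside the allowlist is excluded by default.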

How I evaluated it

Asking the chat things and checking whether it “sounds good” works for a demo, not for production. I ended up testing in three layers:

Retrieval tests: given a question, the top K must contain the expected chunk.
Answer tests: the answer must mention mandatory points and exclude forbidden claims.
Tool tests: each tool must return stable, bounded JSON with no out-of-scope data.

A retrieval test needs no framework: it’s a table of cases and one assertion.

# retrieval-tests.yaml
- question: "where is the amount validated before invoicing?"
  must_retrieve: "services/billing/client.ts#createInvoice"
  top_k: 8

- question: "what timeout does the payments integration use in staging?"
  must_retrieve: "config/staging.yml#payments"
  must_not_retrieve: "config/production.yml"
  top_k: 8

The runner executes each question against the retriever and fails if the expected chunk doesn’t show up in the top K —or if a forbidden one does. It runs in CI in seconds, because it never calls the LLM: it only tests the search. This set is the one behind the numbers in this post: it started with about twenty real questions and grew with every bad answer.
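A runner for those cases fits in a few lines. The sketch below writes the cases inline as dictionaries to avoid a YAML dependency, and uses a stand-in retriever; `fake_retrieve` and its return values are hypothetical, and treating `must_not_retrieve` as a prefix match on the chunk id is an assumption about how file-level exclusions would work.

```python
def run_case(case, retrieve):
    """Return a list of failure messages for one case; empty means the case passed."""
    hits = retrieve(case["question"], case["top_k"])
    failures = []
    if case["must_retrieve"] not in hits:
        failures.append(f"missing {case['must_retrieve']}")
    forbidden = case.get("must_not_retrieve")
    if forbidden and any(hit.startswith(forbidden) for hit in hits):
        failures.append(f"forbidden {forbidden}")
    return failures

CASES = [  # same shape as the YAML cases, written inline
    {"question": "where is the amount validated before invoicing?",
     "must_retrieve": "services/billing/client.ts#createInvoice", "top_k": 8},
    {"question": "what timeout does the payments integration use in staging?",
     "must_retrieve": "config/staging.yml#payments",
     "must_not_retrieve": "config/production.yml", "top_k": 8},
]

def fake_retrieve(question, top_k):
    # Stand-in retriever (hypothetical): a real one would query the index.
    if "staging" in question:
        return ["config/staging.yml#payments", "config/production.yml#payments"][:top_k]
    return ["services/billing/client.ts#createInvoice"][:top_k]

report = {case["question"]: run_case(case, fake_retrieve) for case in CASES}
for question, failures in report.items():
    print("FAIL" if failures else "PASS", question, failures)
```

With this stand-in retriever the second case fails on purpose: the staging chunk is found, but a production chunk leaks into the results, which is exactly the kind of regression the forbidden rule exists to catch.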

And I kept every bad answer as a regression case. In RAG, many improvements are invisible until something breaks; tests keep you from re-breaking what you’d already learned.

What helped most, and what I’d do differently

Ranked by real return, the order is clear:

Domain-specific metadata — improves retrieval, security, and explanation at once.
Structure-based chunking — keeps the index from being shapeless text.
Small tools — resolve live state and history without bloating the corpus.
Strict citations — make the answer verifiable.
Cache with corpus version — cuts cost without serving stale information.
Degraded fallback — turns LLM failures into tolerable failures.

What returned the least was trying to solve with prompting what were really data problems. If I started another system tomorrow, I’d begin by defining the real questions it must answer, design the metadata before choosing a model, write retrieval tests from week one, and keep the corpus small until I understood what hurts. The expensive mistake isn’t indexing too little, it’s indexing a lot without knowing whether it helps.

Checklist: RAG to production

The actionable version of everything above, in the order I’d review it:

Define the real questions the system must answer before picking a stack.
Design the chunk metadata before choosing a model.
Chunk by structure, not by tokens.
Combine vector, lexical, and filters; don’t bet on a single signal.
Write retrieval tests from week one and run them in CI.
Solve live state with small, read-only tools — not by indexing it.
Make citations structured and verifiable, not decorative.
Version the corpus, include it in the cache key, and plan for incremental ingestion.
Apply access control before retrieval and treat the corpus as untrusted input.
Keep the corpus small until the tests tell you what’s missing.

Closing

RAG truly helped me once I stopped treating it as an AI integration and started treating it as a layer of context engineering. The important work isn’t in calling the model, but in deciding what enters the context, what doesn’t, what gets queried live, what gets cited, what gets cached, what gets blocked by permissions, and what gets answered honestly when there’s no evidence.

That’s the idea I keep: a good RAG system doesn’t replace technical judgment. It encodes it.

Two fronts were only sketched here and deserve posts of their own: how to evaluate a RAG beyond “it sounds good”, and the security of an internal RAG when the corpus is untrusted input. They’re next on the list.

If you’re building something like this —or fighting a RAG that sounds good but isn’t trustworthy— it’s exactly the kind of problem I work on. Feel free to reach out.

References

Patrick Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020.
Shunyu Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, 2022.
Akari Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, 2023.
Penghao Zhao et al., Retrieval-Augmented Generation for AI-Generated Content: A Survey, 2024.