
Building a Chatbot That Answers Your Infrastructure Questions
Building a chatbot for infrastructure questions sounds straightforward until you decide what counts as an answer. If it can name a service, quote a runbook, and point to the place where a change gets verified, it is useful. If it starts guessing about prod topology, it becomes a liability quickly.
What this chatbot should actually answer
I would keep the first version narrow. Good questions are the ones that already have answers in internal material:
- “Which cluster runs the billing API?”
- “What is the rollback step for the queue worker?”
- “Where is the Redis connection string documented?”
- “What changed in the last deploy window?”
Bad questions are open-ended or need live operational judgment:
- “Is the system healthy?”
- “Should I restart this service?”
- “What is wrong with customer 1432's request?”
That split matters because the bot is not an operator. It is a retrieval layer over your system knowledge.
Pick the data sources before you pick the model
Start with docs, runbooks, and config snapshots
I usually start with sources that are stable, textual, and already reviewed by humans:
- architecture docs
- incident runbooks
- service ownership pages
- sanitized config snapshots
- deploy notes
- Terraform or Kubernetes manifests, if you treat them as read-only evidence
The model is the easy part. The quality comes from source selection and update cadence. If the docs are stale, the bot will be confidently stale too.
Define what stays out of scope
You should explicitly exclude:
- secrets and credentials
- private keys
- ephemeral access tokens
- raw customer data
- interactive shell output from production hosts
- anything that can trigger action without approval
Do not let the chatbot retrieve secrets “for convenience.” If it can see them, assume they will eventually leak into logs, prompts, or user-visible answers.
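One way to enforce this is a redaction pass that runs on every document before it is chunked or embedded. The sketch below is a minimal example under stated assumptions: `redactSecrets` and `SECRET_PATTERNS` are hypothetical names, and the three patterns are illustrative, not a complete scanner. A dedicated secret scanner in the ingest pipeline would likely catch more.

```javascript
// Hypothetical redaction pass, run on each document before chunking and embedding.
// The patterns are illustrative examples, not a complete secret scanner.
const SECRET_PATTERNS = [
  // PEM-encoded private key blocks
  {
    name: "private-key",
    regex: /-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g,
  },
  // AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits
  { name: "aws-access-key-id", regex: /\bAKIA[0-9A-Z]{16}\b/g },
  // key=value or key: value assignments whose key looks sensitive
  {
    name: "credential-assignment",
    regex: /\b(password|passwd|secret|token|api[_-]?key)\b(\s*[:=]\s*)\S+/gi,
  },
];

function redactSecrets(text) {
  const found = [];
  let clean = text;
  for (const { name, regex } of SECRET_PATTERNS) {
    clean = clean.replace(regex, (match, key, sep) => {
      found.push(name);
      // For assignments, keep the key name so the document still reads sensibly.
      // For patterns without capture groups, `key` is the numeric match offset.
      return typeof key === "string"
        ? `${key}${sep}[REDACTED]`
        : `[REDACTED:${name}]`;
    });
  }
  return { clean, found };
}
```

Returning `found` alongside the cleaned text lets the ingest job fail loudly, or at least alert, when a document that was supposed to be sanitized still contains a match.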
Architecture that keeps answers grounded
Retrieval layer and indexing strategy
The simplest reliable pattern is retrieval-augmented generation:
- chunk your documents
- embed the chunks
- retrieve the top matches for a question
- ask the model to answer only from that context
For infrastructure docs, chunk by semantic section instead of fixed size alone. A runbook step, a service definition, or a config block should usually stay intact. If one is split across chunks, retrieval can return half a procedure with no sign that the rest exists.
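As a rough illustration of section-based chunking, here is one possible `splitByHeading` for Markdown sources. It is a sketch, not a full Markdown parser: it assumes ATX-style `#` headings and backtick code fences, and the returned shape (`id`, `heading`, `text`) is an assumption chosen for this example.

```javascript
// Three backticks, built at runtime so this example can live inside a fenced block.
const FENCE = "`".repeat(3);

// Hypothetical splitByHeading: one chunk per Markdown heading section.
// Lines inside fenced code blocks are never treated as headings, so a config
// block or shell snippet with "#" comments stays in the same chunk as its step.
function splitByHeading(markdown) {
  const chunks = [];
  let current = { heading: "(preamble)", lines: [] };
  let inFence = false;

  // Store the section collected so far, skipping empty ones.
  const flush = () => {
    const text = current.lines.join("\n").trim();
    if (text) {
      chunks.push({ id: chunks.length, heading: current.heading, text });
    }
  };

  for (const line of markdown.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      current = { heading: match[2].trim(), lines: [line] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return chunks;
}
```

If a section grows too long for the embedding model, a second pass could split it further on paragraph boundaries while copying the heading into each piece.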
Metadata helps more than people expect:
| Metadata | Why it matters |
|---|---|
| service | filters answers to the right system |
| environment | avoids mixing dev and prod |
| updatedAt | helps with freshness checks |
| sourceType | docs, runbook, config, incident note |
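A small sketch of how that metadata might be used after retrieval. `filterByMetadata` is a hypothetical helper, and it assumes each match carries a `metadata` object with `service` and `environment` fields. Many vector stores can apply an equivalent filter at query time, which is usually preferable when available.

```javascript
// Hypothetical post-retrieval filter. Each match is assumed to look like
// { text, metadata: { service, environment, ... } }.
function filterByMetadata(matches, { service, environment } = {}) {
  return matches.filter(({ metadata }) => {
    // Only apply a constraint when the caller actually supplied it.
    if (service && metadata.service !== service) return false;
    if (environment && metadata.environment !== environment) return false;
    return true;
  });
}
```

For example, `filterByMetadata(matches, { service: "billing-api", environment: "prod" })` would drop a staging chunk even when its wording is nearly identical to the production one.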
Prompt contract and answer format
The bot should return structured answers, not freeform essays. I like a format like:
- short answer
- supporting evidence
- source citations
- confidence or uncertainty
- follow-up action if the answer is incomplete
That contract gives you room to say “I do not know” without sounding broken.
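To make the contract enforceable, the response can be checked before it reaches the user. The validator below is a sketch: the field names follow the format listed above, while `validateAnswer` and the three confidence levels are assumptions made for this example.

```javascript
// Assumed confidence vocabulary; adjust to whatever your contract specifies.
const CONFIDENCE_LEVELS = ["high", "medium", "low"];

// Returns a list of contract violations. An empty list means the response is acceptable.
function validateAnswer(response) {
  const problems = [];
  if (typeof response.answer !== "string" || response.answer.trim() === "") {
    problems.push("answer must be a non-empty string");
  }
  if (!Array.isArray(response.citations)) {
    problems.push("citations must be an array");
  }
  if (!CONFIDENCE_LEVELS.includes(response.confidence)) {
    problems.push("confidence must be one of: " + CONFIDENCE_LEVELS.join(", "));
  }
  // An answer with no citations is only acceptable when it is flagged as uncertain.
  if (
    Array.isArray(response.citations) &&
    response.citations.length === 0 &&
    response.confidence !== "low"
  ) {
    problems.push("uncited answers must be marked low confidence");
  }
  return problems;
}
```

A response that fails validation can be replaced with a plain "I do not know" plus a pointer to the owning team, which keeps malformed model output away from users.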
Caching, freshness, and fallback behavior
Infrastructure changes often enough that freshness matters. Cache retrieved context briefly, not indefinitely. Decide on a trust window for each kind of source, and if a source is older than that window, flag it in the answer instead of silently relying on it.
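A freshness check can be as small as comparing a source's age against a per-type window. In this sketch, `freshnessNote`, `TRUST_WINDOW_DAYS`, and the day counts are all placeholders, not recommendations. It assumes each source carries `path`, `sourceType`, and an `updatedAt` value that `Date` can parse.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical trust windows in days, per source type. Tune these to your own cadence.
const TRUST_WINDOW_DAYS = { runbook: 30, config: 7, docs: 90 };
const DEFAULT_WINDOW_DAYS = 30;

// Reports how old a source is and whether it falls outside its trust window.
function freshnessNote(source, now = new Date()) {
  const ageDays = Math.floor((now - new Date(source.updatedAt)) / DAY_MS);
  const windowDays = TRUST_WINDOW_DAYS[source.sourceType] ?? DEFAULT_WINDOW_DAYS;
  return {
    ageDays,
    stale: ageDays > windowDays,
    note: `${source.path} last updated ${ageDays} days ago`,
  };
}
```

The `note` string can be attached to the answer whenever `stale` is true, so the reader sees the age of the evidence next to the claim.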
Fallback behavior should be boring:
- if retrieval fails, say so
- if sources conflict, show both
- if confidence is low, point to the owning team or runbook
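Those three rules translate into a short decision function. This is a sketch with hypothetical names: `chooseFallback` and its input fields (`matches`, `conflict`, `confidence`, `owner`) are assumptions about what the retrieval and answer steps hand over.

```javascript
// Decides whether to answer normally or fall back, following the rules above.
function chooseFallback({ matches, conflict, confidence, owner }) {
  if (!matches || matches.length === 0) {
    return {
      kind: "no-results",
      message: "I could not find a source that answers this.",
    };
  }
  if (conflict) {
    return {
      kind: "conflict",
      message: "Sources disagree. Showing both.",
      sources: matches.map((m) => m.metadata.path),
    };
  }
  if (confidence === "low") {
    return {
      kind: "handoff",
      message: `I am not confident in this answer. Check with ${owner ?? "the owning team"}.`,
    };
  }
  return { kind: "answer" };
}
```

Keeping the branches explicit in code, instead of leaving them to the prompt, means the fallback behavior stays the same when the model or prompt changes.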
Implementation sketch in JavaScript
Ingesting infrastructure knowledge
A basic ingest flow in JavaScript can scan Markdown docs, normalize them, and store embeddings plus metadata.
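```javascript
// loadDocs, splitByHeading, and vectorStore stand in for your own loader,
// chunker, and vector database client. Top-level await assumes an ES module.
const files = await loadDocs("./infra-docs");

for (const file of files) {
  // Chunk by section so runbook steps and config blocks stay intact.
  const chunks = splitByHeading(file.content);

  for (const chunk of chunks) {
    await vectorStore.upsert({
      // Path plus chunk ID gives each chunk a unique key.
      id: `${file.path}:${chunk.id}`,
      text: chunk.text,
      metadata: {
        path: file.path,
        service: file.service,
        environment: file.environment,
        updatedAt: file.updatedAt
      }
    });
  }
}
```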
The important part is not the SDK. It is the metadata. Without it, you cannot separate “the staging API” from “the production API” when the wording is similar.
Serving a question with retrieved context
At query time, fetch the top matches and pass only those into the prompt.
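```javascript
// vectorStore, buildPrompt, and llm are placeholders for your own clients and helpers.
async function answerInfraQuestion(question) {
  // Retrieve only the closest chunks; everything else stays out of the prompt.
  const matches = await vectorStore.search(question, { topK: 5 });

  // Number each chunk so the model can cite it, and carry the source path
  // and update time through for citations and freshness notes.
  const context = matches.map((m, i) => ({
    id: i + 1,
    text: m.text,
    source: m.metadata.path,
    updatedAt: m.metadata.updatedAt
  }));

  const prompt = buildPrompt({ question, context });
  return llm.generate(prompt);
}
```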
I would keep the prompt strict: answer only from the supplied context, cite each claim, and refuse to invent details.
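A strict prompt along those lines might be assembled as follows. This is one possible `buildPrompt`, not a tuned prompt: the wording and the exact refusal phrase are assumptions, and it expects `context` entries with `id`, `text`, `source`, and `updatedAt`.

```javascript
// Builds a grounded prompt: numbered sources first, then the rules, then the question.
function buildPrompt({ question, context }) {
  const sources = context
    .map((c) => `[${c.id}] (${c.source}, updated ${c.updatedAt})\n${c.text}`)
    .join("\n\n");

  return [
    "You answer questions about internal infrastructure.",
    "Use only the numbered sources below.",
    "Cite the source number after each claim, like [1].",
    "If the sources do not contain the answer, reply exactly: I do not know.",
    "",
    "Sources:",
    sources,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Fixing the refusal phrase makes refusals easy to detect downstream, for example in regression tests.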
Returning citations and uncertainty
Citations should be visible in the UI and machine-readable in the response. A practical shape is:
```javascript
const response = {
  answer: "The billing API runs in cluster-2.",
  citations: [
    { source: "docs/billing.md", excerpt: "Billing API is deployed to cluster-2" }
  ],
  confidence: "medium",
  note: "docs/billing.md last updated 18 days ago"
};
```
If you cannot point to evidence, the bot should say that plainly. That is better than a polished hallucination.
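One way to check that evidence really exists is to confirm that every cited excerpt appears in the chunks that were retrieved. The sketch below assumes the response shape shown above and context entries with `source` and `text`. `checkCitations` is a hypothetical helper, and exact substring matching is a deliberately strict choice that will reject paraphrased excerpts.

```javascript
// Lowercase and collapse whitespace so line wrapping does not break matching.
const normalize = (s) => s.toLowerCase().replace(/\s+/g, " ").trim();

// Verifies that each citation's excerpt occurs in retrieved text from the cited source.
function checkCitations(response, context) {
  // Combine all retrieved text per source path.
  const bySource = new Map();
  for (const c of context) {
    const existing = bySource.get(c.source) ?? "";
    bySource.set(c.source, existing + " " + normalize(c.text));
  }

  const unsupported = response.citations.filter((cite) => {
    const text = bySource.get(cite.source);
    return !text || !text.includes(normalize(cite.excerpt));
  });

  return {
    // An answer with no citations at all is not considered grounded.
    grounded: response.citations.length > 0 && unsupported.length === 0,
    unsupported,
  };
}
```

When `grounded` is false, the safer move is to replace the answer with a plain statement that no supporting source was found.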
Testing for accuracy, drift, and bad assumptions
Golden questions and regression checks
Write a small set of goldens before launch. Include questions that should be easy, ambiguous, and impossible. Then re-run them whenever prompts, embeddings, or source data change.
Good test cases include:
- exact runbook lookup
- service ownership questions
- questions with outdated wording
- questions that intentionally cross environments
Track whether the bot cites the right source and whether it answers the question you asked, not the adjacent one.
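A golden can be a small record of what a correct response must contain, scored by a pure function so it is easy to run in CI. Everything here is illustrative: the `goldens` entries, the field names, and `scoreGolden` are assumptions, and the refusal check assumes the bot is instructed to say "I do not know" when it has no evidence.

```javascript
// Illustrative goldens: one answerable question and one that must be refused.
const goldens = [
  {
    question: "Which cluster runs the billing API?",
    expectSource: "docs/billing.md",
    expectText: "cluster-2",
  },
  { question: "What is the root password for prod?", expectRefusal: true },
];

// Scores one response against one golden and explains any failure.
function scoreGolden(golden, response) {
  const citations = response.citations ?? [];

  if (golden.expectRefusal) {
    const refused = citations.length === 0 && /do not know/i.test(response.answer);
    return {
      pass: refused,
      reason: refused ? "refused as expected" : "answered a question it should refuse",
    };
  }
  if (!citations.some((c) => c.source === golden.expectSource)) {
    return { pass: false, reason: `did not cite ${golden.expectSource}` };
  }
  if (!response.answer.includes(golden.expectText)) {
    return { pass: false, reason: `answer is missing "${golden.expectText}"` };
  }
  return { pass: true, reason: "cited the expected source and text" };
}
```

Checking the cited source separately from the answer text catches the case where the bot gives the right answer for the wrong reason.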
Conflicting sources and stale data
This is where infra bots usually fail. One doc says the service is in AWS. Another says it moved to GCP last quarter. The bot needs a policy:
- prefer newest source
- surface the conflict
- ask for manual verification when the conflict affects production
If you do not encode this, the model will quietly pick one.
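The policy above can be encoded in a few lines. This sketch assumes a hypothetical `claims` array with one entry per source, each holding the `value` that source asserts, its `updatedAt`, and its `environment`. Extracting those values from free text is the hard part and is not shown. The `"prod"` label is also an assumption about how environments are tagged.

```javascript
// Applies the conflict policy: prefer newest, surface disagreement,
// and require manual verification when production is involved.
function resolveConflict(claims) {
  if (claims.length === 0) return null;

  // Newest source first. The spread avoids mutating the caller's array.
  const sorted = [...claims].sort(
    (a, b) => new Date(b.updatedAt) - new Date(a.updatedAt)
  );
  const preferred = sorted[0];
  const conflict = new Set(sorted.map((c) => c.value)).size > 1;

  return {
    preferred,
    conflict,
    // Older sources that disagree with the preferred one, shown to the user.
    alternatives: conflict ? sorted.filter((c) => c.value !== preferred.value) : [],
    needsManualVerification:
      conflict && sorted.some((c) => c.environment === "prod"),
  };
}
```

Returning the alternatives, not just the winner, is what lets the answer show both sides instead of quietly picking one.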
Operational guardrails
Access control and secrets hygiene
Use the same access model the docs already have. If a user cannot open a runbook in your wiki, the chatbot should not reveal it either. Also sanitize source material before indexing. Remove secrets, tokens, and any value that should never appear in a prompt or log.
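A minimal sketch of enforcing that at query time, assuming each chunk's metadata was stamped at ingest with the groups allowed to read its source document. `filterByAccess`, `allowedGroups`, and `user.groups` are hypothetical names. A real deployment would more likely ask the wiki or identity provider directly.

```javascript
// Drops retrieved chunks the asking user is not allowed to read.
function filterByAccess(matches, user) {
  const groups = new Set(user.groups);
  return matches.filter((m) => {
    const allowed = m.metadata.allowedGroups;
    // Fail closed: a chunk with no recorded access list is treated as unreadable.
    if (!Array.isArray(allowed) || allowed.length === 0) return false;
    return allowed.some((g) => groups.has(g));
  });
}
```

The filter has to run before the prompt is built. Once restricted text is in the model's context, no instruction can reliably keep it out of the answer.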
Logging, rate limits, and human escalation
Log the question, retrieved document IDs, and answer metadata. Never log raw secrets, and avoid logging retrieved context text unless you have a clear retention policy for it.
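A log entry built along those lines might look like the sketch below. `buildLogEntry` and its fields are assumptions. It expects each match to expose an `id` and the response to follow the citation shape described earlier.

```javascript
// Builds an audit record that references documents by ID instead of copying their text.
function buildLogEntry({ question, matches, response, userId }) {
  return {
    timestamp: new Date().toISOString(),
    userId,
    question,
    // IDs are enough to reconstruct what the bot saw, without storing the content.
    documentIds: matches.map((m) => m.id),
    confidence: response.confidence,
    citationCount: response.citations.length,
    // Deliberately omitted: chunk text, the full prompt, and the raw model output.
  };
}
```

If questions themselves may contain sensitive values, they could be passed through the same redaction used at ingest before being logged.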
Rate limits stop abuse, but escalation matters more. If the bot sees a production incident question, route it toward the human on call instead of pretending to diagnose. A good chatbot supports the ops team. It does not replace them.
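A crude first pass at that routing can be done with keyword patterns before retrieval even runs. The patterns below are illustrative guesses and will miss plenty of phrasings. `routeQuestion` and `INCIDENT_PATTERNS` are hypothetical names, and a classifier or an explicit "this is an incident" button would likely be more dependable.

```javascript
// Illustrative patterns for questions that need live operational judgment.
const INCIDENT_PATTERNS = [
  /\b(outage|incident|down|failing|5\d\d|sev[ -]?\d)\b/i,
  /\bshould i (restart|roll ?back|scale|kill)\b/i,
  /\bis\b.*\bhealthy\b/i,
  /\bwhat('s| is) wrong\b/i,
];

// Sends incident-style questions to a human and lets everything else through.
function routeQuestion(question) {
  const isIncident = INCIDENT_PATTERNS.some((re) => re.test(question));
  return isIncident
    ? {
        route: "escalate",
        message: "This looks like a live operational question. Please contact the on-call engineer.",
      }
    : { route: "answer" };
}
```

False positives are cheap here: an unnecessary pointer to on-call costs far less than a confident misdiagnosis during an incident.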
What to improve after the first version
After launch, I would tune in this order:
1. retrieval quality
2. chunking strategy
3. source freshness checks
4. answer formatting
5. conflict handling
People often start by tuning prompts. In practice, the biggest gains usually come from better source curation and stricter retrieval rules.
Conclusion
A useful infrastructure chatbot is less about “AI” and more about evidence. If you can keep it grounded in current docs, restrict its scope, and force it to show where each answer came from, it becomes a real internal tool instead of a demo.
The rule I keep coming back to is simple: if the bot cannot cite it, it should not claim it.


