As teams move to agentic dev, one question we’re hearing a lot is “Where does system/organizational truth live?” One interesting solution to this problem is to use the same git repository that engineers and agents already rely on.

Wikis and similar systems (Confluence, Notion, and the rest) are orthogonal to the system itself. Pages fall behind quickly, because updating them is secondary labor. In a traditional SDLC, this isn’t the end of the world. Humans can ignore out-of-date documentation and rely on the bits they know to be reliable, filling in the gaps by asking their manager or emailing the product team. An agent won’t do that! An agent treats text pasted from outdated docs as 100% true.

Ultimately, this is less a model problem than a documentation and memory problem.

The repository as the system of record

One solution to this problem is to make the repository the system of record. Requirements and rationales live alongside the modules they describe, and git history and PR comments explain what changed along the way. Jira and Confluence shift from being the source of truth to useful reports generated from what lives in the repo.

Several forces stack in favor of this locality, especially with agents in the loop. The simplest is colocation: docs that live next to src/billing/ change when billing changes. Local docs aren’t orthogonal to the work the way wiki articles are. And because they ride the same branches, the honesty machinery you already run (pull requests, human and agent review, CI checks, linters) can govern documentation too.

Locality also changes what a diagram is worth. A box-and-arrow PNG in Confluence is opaque to an agent, but a Mermaid flowchart beside the service entrypoint shows up in the diff and can be read by the tools. Underneath all of this is a question of what the artifact is optimized for. When agents and engineers reason about a change, the question is usually what a module does and what must not break, not where the payments wiki lives. A repository is organized around that question; a topical wiki fights it.

Karpathy’s markdown library: compile, don’t chunk

We’re not the only ones thinking this way. Andrej Karpathy recently described an approach to building an LLM knowledge base that bypasses classic RAG for many real workloads. You dump raw material into a folder, let an agent compile it into interlinked Markdown, and run periodic lint passes so the library stays consistent. At team scale, he argued, structured text plus indices beats opaque embedding search that returns plausible fragments with weak provenance.

That pattern rhymes with system documentation too. You already have the raw material of a knowledge base scattered everywhere: git history, tickets, Slack threads, and code. You’re just not compiling it into agent-readable truth that lives beside the work. Karpathy’s point is that the model can be your librarian. Once the pattern is established, summaries, cross-links, and cleanup become ongoing maintenance, done entirely agentically, rather than a one-time migration project. This doesn’t mean vector databases are dead; it means the interesting bet for many product engineering teams is versioned Markdown in git, maintained with the same habits as code.

What you should do Monday

We are intentionally not prescribing a full-on “move your tickets into git” or “replace Jira with YAML tasks” project. Ticket workflow is a separate fight, and many teams are not ready for it. This thesis calls for different experiments around documentation and shared memory.

Try this: pick one bounded subsystem and require that its functional and architectural documentation live in Markdown/Mermaid stored alongside the code. Treat wiki pages as deprecated for that slice. Then add a lightweight gate that makes drift visible: a CI check that fails when certain paths change but their docs don’t, or an agent lint step that compares README claims to test names and public API surfaces.
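As a starting point, the path-based gate can be a few lines of Python. This is a minimal sketch, not a finished CI step: the `DOC_RULES` mapping, the paths in it, and the `find_doc_drift` name are all assumptions for illustration, and it only checks that a doc file was touched, not that its content is right.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: source glob -> the doc file that should change with it.
DOC_RULES = {
    "src/billing/*": "src/billing/README.md",
    "src/auth/*": "src/auth/README.md",
}

def find_doc_drift(changed_files, rules=DOC_RULES):
    """Return (pattern, doc) pairs where code changed but its doc did not."""
    changed = set(changed_files)
    drift = []
    for pattern, doc in rules.items():
        # A change to the doc itself does not count as a code change.
        code_touched = any(
            fnmatchcase(path, pattern) and path != doc for path in changed
        )
        if code_touched and doc not in changed:
            drift.append((pattern, doc))
    return drift

print(find_doc_drift(["src/billing/invoice.py", "src/auth/README.md"]))
```

In CI you would feed it the output of something like `git diff --name-only` against the target branch and fail the job when the returned list is non-empty.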

Run a Karpathy-style compile pass to flesh out this new library: export related decisions from Slack into a raw/ folder, then run a bounded agent job to produce interlinked docs/ with backlinks. Review the agent’s work and tweak based on what it missed or under/overemphasized. Measure whether onboarding and agent sessions get faster, more accurate, or just plain easier.
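One cheap way to review the compiled library is to index its links. The sketch below is our own illustration, not Karpathy’s tooling: it assumes a flat docs/ folder where pages link to each other by filename, and the `build_backlinks` name and sample pages are hypothetical. It returns the backlinks for each page plus any links that point at pages that don’t exist.

```python
import re
from collections import defaultdict

# Matches inline Markdown links to .md files, with an optional #anchor,
# e.g. [billing](billing.md) or [auth](auth.md#tokens).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+\.md)(?:#[^)]*)?\)")

def build_backlinks(pages):
    """pages: {filename: markdown text}. Returns (backlinks, broken_links)."""
    backlinks = defaultdict(set)
    broken = []
    for source, text in pages.items():
        for target in LINK_RE.findall(text):
            if target in pages:
                backlinks[target].add(source)
            else:
                broken.append((source, target))
    return {page: sorted(sources) for page, sources in backlinks.items()}, broken

pages = {
    "index.md": "See [billing](billing.md) and [auth](auth.md#tokens).",
    "billing.md": "Back to [index](index.md). See [ledger](ledger.md).",
    "auth.md": "No links yet.",
}
backlinks, broken = build_backlinks(pages)
print(backlinks)
print(broken)
```

Pages with no backlinks and links with no target are both good prompts for the human review step. Note that this simple pattern would also flag absolute URLs ending in .md as broken.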

Whatever experiment you run, teams will have to adjust. This change lands differently on each side of the house. Product leaders will need to build fluency with repo-resident artifacts (reading a spec in git, commenting on a PR). For engineering leaders, the change means owning freshness: if agents read the repo, stale local docs are now a production risk, not a technical-writing nuisance.

None of this requires replacing the wiki overnight. Plenty of orgs will keep it as the official face for auditors and executives, and that’s not wrong. Different consumers need different views. The bet worth running in 2026 is smaller: for the teams shipping with agents, stop asking them to treat a parallel documentation universe as true. Put truth beside the implementation, let git govern change, and let agents help maintain the library the same way we finally learned to let CI maintain tests.

Further Reading

Red Teaming Gen AI

February 12, 2026

Tony Karrer

If you’re pushing LLM or RAG features into production, you already know the stakes: the models aren’t just code, they’re evolving systems that interact with unpredictable users and highly variable data. Traditional QA isn’t enough. To ship resilient AI and win confidence from customers and stakeholders, adversarial testing needs to move to the top of your playbook.

Adversarial testing: why it matters for LLM and RAG systems

Adversarial testing, or “red teaming,” is about trying to make your AI fail on purpose, before malicious actors or edge-case users do. For LLMs and RAG, that means probing for prompt injections, jailbreaks, hallucinations, data leakage, and subverted retrieval strategies.

LLM systems are vulnerable to cleverly crafted prompts that skirt safety limits and encourage harmful, biased, or unauthorized outputs.

RAG and hybrid architectures have unique takeover risks: manipulating the retrieval pipeline, poisoning source documents, or confusing context windows so the model behaves unpredictably.

Adversarial testing uncovers real issues that aren’t obvious until your model is live: privacy leaks, bias amplification, data extraction attacks, and unreliable inferences, all the stuff that keeps CTOs and CISOs up at night.

How do tech leaders integrate adversarial testing for LLM/RAG?

  • Simulate attacks with both manual red teaming and automated tools and test vectors like prompt injections, data poisoning, and retrieval manipulation.
  • Chain attacks across model and retrieval layers; don’t assume vulnerabilities stop at the model boundary.
  • Use playbooks like MITRE ATLAS and the OWASP ML Security Top 10, and keep logs for every test; they’re useful for team learning, postmortems, and compliance.
  • Layer in robust monitoring so adversarial scenarios are caught in real time, not just during scheduled security reviews. Real-time monitoring is essential for both security and reliability.
  • Involve domain experts and skeptics. Adversarial ideation is creative work, not just automation. It takes deep product knowledge and a healthy dose of adversarial thinking to imagine how your outputs could be abused.

Reading List

Building Reliable Autonomous Agentic AI

January 12, 2026

Tony Karrer

Over the past few years, CTOs have been building LLM-based systems using a DAG workflow approach. Autonomous agentic systems are a different sport. We’ve had reliability as a key question and it’s even more critical when a model can take actions (call tools, write to systems, trigger workflows). There’s incredible power here, but also big challenges.

A few definitions to start

Autonomous agentic system: an LLM wrapped in a loop that can plan, take actions via tools, observe results, and continue until it reaches a stop condition (or it’s forced to stop).

Tool calling: the agent selecting from a constrained action space (tool names + schemas) and emitting structured calls; your runtime executes them, validates outputs, and feeds results back into the loop.

Orchestration (the “real software” around the model): state management, retries, idempotency, timeouts, tool gating, context assembly/pruning, audit logging, and escalation paths.

Closed-loop evaluation (Plan -> Act -> Judge -> Revise): a repeatable harness where you run realistic tasks, score outcomes (ideally against ground truth and human-calibrated judges), learn what broke, and iterate.

Guardrails + safe stopping: runtime-enforced constraints (policies, budgets, circuit breakers, permissions) that limit what the agent can do and force it to stop or escalate when risk rises or progress stalls.

A small set of practices that pay off fast

Treat your tools like a product surface, not a pile of functions.
The failure mode is “death by a thousand tools”: overlapping capabilities, ambiguous names, and huge schemas that make selection brittle. Keep tools narrow, make them obviously distinct, and hide tools by default unless they’re relevant to the current step. “Just-in-time” instructions and tool visibility are a pragmatic way to scale without drowning the model in choices.
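Hiding tools by default can be as simple as tagging each tool with the stages where it applies. This is a toy sketch under our own assumptions: the tool names, the stage names, and the `visible_tools` helper are hypothetical, and a real runtime would filter full tool schemas rather than names.

```python
# Hypothetical tool registry: each tool declares the stages where it is relevant.
TOOLS = [
    {"name": "search_orders", "stages": {"investigate"}},
    {"name": "get_order", "stages": {"investigate", "resolve"}},
    {"name": "issue_refund", "stages": {"resolve"}},
    {"name": "send_email", "stages": {"notify"}},
]

def visible_tools(stage, tools=TOOLS):
    """Return only the tool names the model should see at this stage."""
    return [tool["name"] for tool in tools if stage in tool["stages"]]

print(visible_tools("investigate"))
print(visible_tools("resolve"))
```

The model never sees `issue_refund` while it is still investigating, which shrinks the action space and removes one class of premature side effect.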

Move reliability into deterministic infrastructure (not prompt magic).
If an agent can trigger side effects (create a ticket, refund an order, email a customer), you need transactional thinking: idempotent tools, checkpointing, “undo stacks,” and clear commit points. Prompts don’t roll back production systems; your runtime does. 
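Idempotency is the easiest of these to picture in code. The sketch below assumes a hypothetical `RefundTool` that records results by idempotency key in memory; a production version would keep that record in a durable store and call a real payment system where this one just builds a dict.

```python
class RefundTool:
    """Hypothetical side-effecting tool guarded by an idempotency key."""

    def __init__(self):
        self._results = {}        # idempotency key -> recorded result
        self.calls_executed = 0   # how many times the side effect really ran

    def refund(self, idempotency_key, order_id, amount_cents):
        # A retry with the same key returns the recorded result
        # instead of issuing a second refund.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.calls_executed += 1
        result = {
            "order_id": order_id,
            "refunded_cents": amount_cents,
            "status": "refunded",
        }
        self._results[idempotency_key] = result
        return result

tool = RefundTool()
first = tool.refund("task-42-refund", "order-1001", 2500)
retry = tool.refund("task-42-refund", "order-1001", 2500)
print(first == retry, tool.calls_executed)
```

The key should come from the runtime (for example, derived from the task and step), not from the model, so that a retried step always reuses the same key.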

Put hard budgets and explicit stop reasons into the main loop.
Most “runaway agents” are simply missing two kinds of guardrail: hard limits on iterations, tool calls, dollars, and wall-clock time, and “no progress” detectors (the same tool call repeating, the same plan restated, the same error class recurring). When the agent hits a threshold, it should stop with a structured summary: what it tried, what it learned, and what it needs from a human.
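Here is roughly what that looks like in a main loop. Treat it as a sketch under assumptions: `step` is a hypothetical callable standing in for one plan/act/observe cycle, the budget defaults are arbitrary, and the report covers only the stop reason and the calls attempted, not everything a human would want.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Budget:
    max_iterations: int = 20
    max_tool_calls: int = 50
    max_dollars: float = 5.0
    max_seconds: float = 300.0
    max_repeats: int = 3  # identical consecutive tool calls that count as "no progress"

@dataclass
class StopReport:
    reason: str
    tried: list = field(default_factory=list)

def run_agent(step, budget=None, clock=time.monotonic):
    """Call step(iteration) until it reports done or a budget trips.

    step returns a dict with 'done' (bool), 'tool_call' (hashable or None),
    and 'cost' (float, dollars spent in this step).
    """
    budget = budget or Budget()
    start = clock()
    tool_calls, dollars, repeats, last_call = 0, 0.0, 0, None
    tried = []
    for iteration in range(budget.max_iterations):
        result = step(iteration)
        dollars += result.get("cost", 0.0)
        call = result.get("tool_call")
        if call is not None:
            tool_calls += 1
            tried.append(call)
            repeats = repeats + 1 if call == last_call else 1
            last_call = call
        if result.get("done"):
            return StopReport("completed", tried)
        if repeats >= budget.max_repeats:
            return StopReport("no_progress", tried)
        if tool_calls >= budget.max_tool_calls:
            return StopReport("tool_call_budget", tried)
        if dollars >= budget.max_dollars:
            return StopReport("dollar_budget", tried)
        if clock() - start >= budget.max_seconds:
            return StopReport("time_budget", tried)
    return StopReport("iteration_budget", tried)

# A stuck agent that keeps issuing the same search.
stuck = lambda i: {"done": False, "tool_call": ("search", "same query"), "cost": 0.01}
print(run_agent(stuck).reason)
```

Every exit path names its reason, so a stop is never ambiguous between “finished” and “gave up,” and the escalation handler can branch on it.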

Design for long-running work with durable state and resumability.
If the agent’s job can outlast a single context window (or a single process), assume it will crash, time out, or be interrupted. Store state externally, make steps replayable, and separate “planning notes” from the minimal context required to proceed. The goal is to resume cleanly without redoing expensive work or compounding earlier mistakes.
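A minimal version of this is a checkpoint file written after every step. The sketch assumes a hypothetical `run_resumable` helper, steps that take and return JSON-serializable dicts, and a local file as the external store; a real system would more likely use a database or a workflow engine.

```python
import json
import os

def run_resumable(steps, state_path):
    """Run (name, fn) steps in order, persisting progress after each one.

    Each fn takes the accumulated data dict and returns an updated,
    JSON-serializable dict. Steps already recorded as completed are skipped.
    """
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"completed": [], "data": {}}

    for name, fn in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run; do not redo the work
        state["data"] = fn(state["data"])
        state["completed"].append(name)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)
    return state["data"]
```

If the process dies during step three, the next run loads the checkpoint, skips steps one and two, and retries only step three. That only works if steps are safe to retry, which is where idempotent tools come back in.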

Make evaluation real: production-like tasks, ground truth, and judges you can trust.
Vibe checks don’t catch regressions. You want a small-but-representative set of real tasks sampled from production distributions, with ground truth where possible, and automated judges that are calibrated against human agreement (so you know what “good” means). Also assume reward hacking and metric gaming will happen. Build detection for it the same way you do for any other adversarial input.
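Calibrating a judge against humans can start with two numbers: raw agreement and Cohen’s kappa, which corrects agreement for chance. Cohen’s kappa is our choice of statistic here, not something the sources above prescribe, and the `judge_agreement` name and sample labels are illustrative.

```python
from collections import Counter

def judge_agreement(human_labels, judge_labels):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts, judge_counts = Counter(human_labels), Counter(judge_labels)
    # Agreement expected by chance, given each rater's label frequencies.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human_counts) | set(judge_counts)
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and this sketch reports 1.0 by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)

human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
print(judge_agreement(human, judge))
```

A judge with high raw agreement but low kappa is mostly agreeing by default, for example by passing nearly everything, which is exactly the kind of metric gaming worth detecting.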

Security guardrails: constrain action space, validate everything, and sandbox execution.
Tool calling expands your attack surface (prompt injection is just one angle). Practical defaults: strict schema validation, allow-lists for tool targets, content sanitization, least-privilege credentials, and sandboxed execution for anything that can run code or touch sensitive systems.
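Two of those defaults, strict schema validation and allow-lists for tool targets, fit in one small validator. Everything named here is an assumption for illustration: the `fetch` tool, its two arguments, and the `api.internal.example.com` host. A real deployment would likely use a schema library rather than hand-rolled type checks.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example.com"}  # hypothetical allow-list

# Hypothetical schema for a fetch tool: argument name -> required type.
FETCH_SCHEMA = {"url": str, "timeout_seconds": int}

def validate_fetch_call(args):
    """Return a list of problems; an empty list means the call may run."""
    problems = []
    for key in args:
        if key not in FETCH_SCHEMA:
            problems.append(f"unexpected argument: {key}")
    for key, expected in FETCH_SCHEMA.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:  # exact type, so True is not an int
            problems.append(f"{key} must be {expected.__name__}")
    if isinstance(args.get("url"), str):
        parsed = urlparse(args["url"])
        if parsed.scheme != "https":
            problems.append("url must use https")
        if parsed.hostname not in ALLOWED_HOSTS:
            problems.append(f"host not allowed: {parsed.hostname}")
    return problems

print(validate_fetch_call({"url": "https://evil.example.net/", "timeout_seconds": 5}))
```

The runtime runs this before executing the call and feeds the problems back to the agent, so a prompt-injected URL fails at the boundary instead of inside the tool.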


More reading

Building production-ready agentic systems: Lessons from Shopify Sidekick (Aug 26, 2025, Shopify)

The most “copyable” part is how they hit tool sprawl in the real world and moved to just-in-time instructions, plus a very concrete evaluation approach (ground-truth sets, human agreement, judge calibration, and the reality of reward hacking).

AI grew up and got a job: Lessons from 2025 on agents and trust (Dec 18, 2025, Google Cloud)

A CTO-level framing of why “agents” change the trust model: autonomy, integration into workflows, atomicity/rollback thinking, and why governance has to be part of the architecture.

Effective harnesses for long-running agents (Nov 26, 2025, Anthropic)

Focuses on the annoying reality: agents that run for hours/days need a harness that’s built for resumability, recoverability, and controlled progress—not just bigger context windows.

What 1,200 Production Deployments Reveal About LLMOps in 2025 (Dec 19, 2025, ZenML)

A dense, case-study-heavy sweep of what shows up across production systems: context engineering, infrastructure guardrails, circuit breakers, and why “software fundamentals” keep winning over clever prompting.

Ground Truth Curation Process for AI Systems (Aug 20, 2025, Microsoft)

If you’re serious about closed-loop improvement, this is the unglamorous foundation: how to build and maintain ground truth sets that support regression testing and meaningful “judge” signals.

Function calling using LLMs (May 6, 2025, Martin Fowler)

A solid mental model for “tools as a constrained action space,” plus practical guardrails (unit tests around tool selection, injection defenses, and how to reduce boilerplate as your toolset grows).

How to build your first agentic AI system (Oct 2, 2025, TechTarget)

A pragmatic implementation-oriented checklist, including explicit loop limits, retry patterns, and when to escalate—useful for teams moving from prototypes to something operational.