Product meets Engineering in the AI Era

March 13, 2026

Tony Karrer

Join us on April 10: Product and Engineering Working Together in the Agentic Coding Era

We’ve assembled four product and engineering leaders to share exactly how they’ve retooled their processes. This virtual mini-conference is designed for CPOs, VPs of Product, CTOs, and Heads of Engineering who want practical, ready-to-apply examples — not theory. Register here

CPOs, VPs of Product, and CTOs are experiencing a common challenge: while agentic coding tools accelerate product development, they also introduce new friction between product and engineering. A product manager (PM) creates a spec that tells engineering what they want built, and then one of two things happens:

  • The engineer appropriately asks the agentic coding tool what questions it has. The agent immediately surfaces 15 questions, 12 of which need input from product. You take a cycle-time hit and more context switching.
  • The engineer doesn’t surface the questions and builds it anyway. Only after PR review and QA does anyone realize the implementation does the wrong thing.

One theme for the first half of 2026: product and engineering leaders need to reduce this new friction.

What changed

A PM’s spec has two audiences.

First, people:

  • Reviewers (customers, leadership, other PMs) who need to confirm the product intent.
  • Engineers who need to reason about tradeoffs, durability, and how it fits the architecture.

Second, agents:

  • The agentic coding tool that will try to execute what you wrote, literally, at speed.

So what do we do?

PMs should use codebase-aware tools before handoff

I would highly recommend that product leaders and product managers try out the new Claude Desktop app, which bundles Claude, Claude Cowork, and Claude Code into a more PM-friendly interface. You can use it for a LOT more product needs than creating specs – see the additional reading below.

To get your PMs on board, consider using the tool to ask:

“What does the product do today in scenario X?”

If you have Claude Desktop connected to your code, it can often answer those types of questions. It will also answer:

“Given this draft spec, what questions do we need to answer before someone starts work?”

This helps PMs clarify ambiguity so you avoid the new friction points.

It’s time to change the default from “PMs don’t have visibility into the repo.” That policy actively works against speed and alignment. Giving the AI tooling access to the codebase gives PMs insight while maintaining the separation of responsibilities with engineering.

Side note: Markdown is quickly becoming the shared format for specs because it’s easy to diff, easy to reuse, and plays nicely with repos and agent workflows. Pick a Markdown editor you like (Obsidian is a good choice) and make it part of the standard toolkit.

PRDs and Tickets => Specs

You may want to start calling PRDs, tickets, and other definitions of what’s to be built “specs” internally, not because PRD is wrong, but because the new name communicates a shift: the output is meant to be fed into an agentic coding tool with more specifics.

The upcoming virtual mini-conference and the additional reading below have lots of help on this front; for example, acceptance criteria and edge cases are critical.

AI supports PMs but does not replace their judgment. Use AI to accelerate drafting, decomposition, and edge case discovery, but the final tradeoffs, priorities, and product decisions still belong to the PM. And we engineers still get to rely on PM judgment to know what to build.

Engineering still has to engineer

A clear spec does not eliminate engineering responsibilities. Strong teams do two things consistently:

  1. Architecture and technical planning: fit the spec into the system in a durable way (constraints, data flows, integration points, performance, security).
  2. Task shaping: break the spec into finer-grained development tasks that are independently testable, so agentic execution stays controlled and reviewable.

A good spec allows the engineers to focus on the work that actually requires engineering judgment.

Reading list

 

Building Reliable Autonomous Agentic AI

January 12, 2026

Tony Karrer

Over the past few years, CTOs have been building LLM-based systems using a DAG workflow approach. Autonomous agentic systems are a different sport. Reliability has always been a key question, and it becomes even more critical when a model can take actions (call tools, write to systems, trigger workflows). There’s incredible power here, but also big challenges.

A few definitions to start

Autonomous agentic system: an LLM wrapped in a loop that can plan, take actions via tools, observe results, and continue until it reaches a stop condition (or it’s forced to stop).

Tool calling: the agent selecting from a constrained action space (tool names + schemas) and emitting structured calls; your runtime executes them, validates outputs, and feeds results back into the loop.

Orchestration (the “real software” around the model): state management, retries, idempotency, timeouts, tool gating, context assembly/pruning, audit logging, and escalation paths.

Closed-loop evaluation (Plan -> Act -> Judge -> Revise): a repeatable harness where you run realistic tasks, score outcomes (ideally against ground truth and human-calibrated judges), learn what broke, and iterate.

Guardrails + safe stopping: runtime-enforced constraints (policies, budgets, circuit breakers, permissions) that limit what the agent can do and force it to stop or escalate when risk rises or progress stalls.
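
To make these definitions concrete, here is a minimal sketch of the loop they describe: plan, act via a constrained set of tools, observe, and stop. All names (`fake_model`, `run_agent`, the `search` tool) are illustrative stand-ins, not any particular framework’s API.

```python
# Minimal sketch of an autonomous agent loop. A real system would call an
# LLM; here `fake_model` stands in for it so the loop structure is visible.

def fake_model(state):
    """Stand-in for an LLM: returns the next action given the loop state."""
    if state["observations"]:
        return {"type": "stop", "reason": "goal_reached"}
    return {"type": "tool", "name": "search", "args": {"query": state["task"]}}

# The constrained action space: tool names mapped to runtime-executed functions.
TOOLS = {
    "search": lambda args: f"results for {args['query']}",
}

def run_agent(task, max_iterations=5):
    state = {"task": task, "observations": []}
    for _ in range(max_iterations):
        action = fake_model(state)
        if action["type"] == "stop":                     # explicit stop condition
            return {"status": "done", "reason": action["reason"], "state": state}
        result = TOOLS[action["name"]](action["args"])   # runtime executes the call
        state["observations"].append(result)             # observe, then continue
    return {"status": "budget_exhausted", "state": state}  # forced stop
```

Everything else in this post (orchestration, evaluation, guardrails) is about making that simple loop safe to run in production.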

A small set of practices that pay off fast

Treat your tools like a product surface, not a pile of functions.
The failure mode is “death by a thousand tools”: overlapping capabilities, ambiguous names, and huge schemas that make selection brittle. Keep tools narrow, make them obviously distinct, and hide tools by default unless they’re relevant to the current step. “Just-in-time” instructions and tool visibility are a pragmatic way to scale without drowning the model in choices.
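
One way to sketch just-in-time visibility is to tag each tool and expose only the tools relevant to the current step. The tool names, tags, and schemas below are hypothetical:

```python
# Sketch: hide tools by default and surface only those relevant to the
# current step. Tool names, tags, and schemas are illustrative.

ALL_TOOLS = {
    "search_orders": {"tags": {"lookup"},   "schema": {"order_id": "str"}},
    "refund_order":  {"tags": {"mutation"}, "schema": {"order_id": "str"}},
    "send_email":    {"tags": {"mutation"}, "schema": {"to": "str", "body": "str"}},
}

def visible_tools(step_tags):
    """Return only the tools whose tags intersect the current step's needs."""
    return {name: spec for name, spec in ALL_TOOLS.items()
            if spec["tags"] & step_tags}
```

A lookup step would then see one tool instead of three, which keeps tool selection from becoming the brittle part of the system.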

Move reliability into deterministic infrastructure (not prompt magic).
If an agent can trigger side effects (create a ticket, refund an order, email a customer), you need transactional thinking: idempotent tools, checkpointing, “undo stacks,” and clear commit points. Prompts don’t roll back production systems; your runtime does. 

Put hard budgets and explicit stop reasons into the main loop.
Most “runaway agents” are simply missing guardrails: limits on iterations, tool calls, dollars, and wall-clock time, plus “no progress” detectors (the same tool call repeating, the same plan restated, the same error class recurring). When the agent hits a threshold, it should stop with a structured summary: what it tried, what it learned, and what it needs from a human.
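
A minimal sketch of such a loop, with made-up budget values: hard limits plus a simple no-progress detector, each producing an explicit, structured stop reason.

```python
# Sketch: hard budgets and explicit stop reasons in the main loop. The
# action source is abstracted as a callable; limits are illustrative.

import time

def run_with_budgets(next_action, max_calls=10, max_seconds=5.0):
    """Drive a loop with hard limits and structured stop reasons."""
    start, calls, last_action = time.monotonic(), 0, None
    while True:
        if calls >= max_calls:                        # hard call budget
            return {"stopped": "call_budget", "calls": calls}
        if time.monotonic() - start > max_seconds:    # wall-clock budget
            return {"stopped": "time_budget", "calls": calls}
        action = next_action()
        if action is None:                            # the model says it is done
            return {"stopped": "done", "calls": calls}
        if action == last_action:                     # "no progress" detector:
            return {"stopped": "no_progress",         # same action repeating
                    "calls": calls, "repeated": action}
        last_action = action
        calls += 1
```

A real version would also track dollar spend and attach the “what it tried / learned / needs” summary to the stop record; the structure is the point.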

Design for long-running work with durable state and resumability.
If the agent’s job can outlast a single context window (or a single process), assume it will crash, time out, or be interrupted. Store state externally, make steps replayable, and separate “planning notes” from the minimal context required to proceed. The goal is to resume cleanly without redoing expensive work or compounding earlier mistakes.

Make evaluation real: production-like tasks, ground truth, and judges you can trust.
Vibe checks don’t catch regressions. You want a small-but-representative set of real tasks sampled from production distributions, with ground truth where possible, and automated judges that are calibrated against human agreement (so you know what “good” means). Also assume reward hacking and metric gaming will happen. Build detection for it the same way you do for any other adversarial input.
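
A toy harness makes the two halves of this concrete: scoring task outcomes against ground truth, and calibrating an automated judge against human labels. The agent, tasks, and labels below are all illustrative.

```python
# Sketch of a tiny closed-loop eval harness: run tasks against ground
# truth, and measure how often the automated judge agrees with humans.

def evaluate(agent, tasks):
    """tasks: list of (input, ground_truth). Returns per-task pass/fail."""
    return [agent(inp) == truth for inp, truth in tasks]

def judge_agreement(judge_labels, human_labels):
    """Fraction of cases where the automated judge matches human judgment.

    Run this on a labeled sample before trusting the judge at scale."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

If judge agreement is low, fix the judge before reading anything into the scores; and treat suspiciously perfect scores as a reward-hacking signal, not a win.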

Security guardrails: constrain action space, validate everything, and sandbox execution.
Tool calling expands your attack surface (prompt injection is just one angle). Practical defaults: strict schema validation, allow-lists for tool targets, content sanitization, least-privilege credentials, and sandboxed execution for anything that can run code or touch sensitive systems.
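
A minimal sketch of the validation layer, with a hypothetical tool and schema: every call is checked against an allow-list and a typed schema before the runtime will execute it.

```python
# Sketch: validate every tool call against an allow-list and a schema
# before execution. The tool name and schema are hypothetical.

ALLOWED_TOOLS = {
    "get_weather": {"city": str},
}

def validate_call(name, args):
    """Reject unknown tools, unexpected arguments, and ill-typed values."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {name}")
    schema = ALLOWED_TOOLS[name]
    if set(args) != set(schema):                   # no extra or missing keys
        raise ValueError(f"unexpected arguments: {sorted(args)}")
    for key, typ in schema.items():
        if not isinstance(args[key], typ):
            raise TypeError(f"bad type for argument: {key}")
    return True
```

In a real system you would layer the rest on top: least-privilege credentials per tool, sanitized content, and a sandbox for anything that runs code.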

Want to learn how TechEmpower can help you or your team with Agentic AI?

More reading

Building production-ready agentic systems: Lessons from Shopify Sidekick (Aug 26, 2025, Shopify)

The most “copyable” part is how they hit tool sprawl in the real world and moved to just-in-time instructions, plus a very concrete evaluation approach (ground-truth sets, human agreement, judge calibration, and the reality of reward hacking).

AI grew up and got a job: Lessons from 2025 on agents and trust (Dec 18, 2025, Google Cloud)

A CTO-level framing of why “agents” change the trust model: autonomy, integration into workflows, atomicity/rollback thinking, and why governance has to be part of the architecture.

Effective harnesses for long-running agents (Nov 26, 2025, Anthropic)

Focuses on the annoying reality: agents that run for hours/days need a harness that’s built for resumability, recoverability, and controlled progress—not just bigger context windows.

What 1,200 Production Deployments Reveal About LLMOps in 2025 (Dec 19, 2025, ZenML)

A dense, case-study-heavy sweep of what shows up across production systems: context engineering, infrastructure guardrails, circuit breakers, and why “software fundamentals” keep winning over clever prompting.

Ground Truth Curation Process for AI Systems (Aug 20, 2025, Microsoft)

If you’re serious about closed-loop improvement, this is the unglamorous foundation: how to build and maintain ground truth sets that support regression testing and meaningful “judge” signals.

Function calling using LLMs (May 6, 2025, Martin Fowler)

A solid mental model for “tools as a constrained action space,” plus practical guardrails (unit tests around tool selection, injection defenses, and how to reduce boilerplate as your toolset grows).

How to build your first agentic AI system (Oct 2, 2025, TechTarget)

A pragmatic implementation-oriented checklist, including explicit loop limits, retry patterns, and when to escalate—useful for teams moving from prototypes to something operational.

Announcing the AI Developer Bootcamp

I’m excited to share something we’ve been working on: the TechEmpower AI Developer Bootcamp. This is a hands-on program for developers who want to build real LLM-powered applications and graduate with a project they can show to employers.

The idea is simple: you learn by building. Over 6–12 weeks, participants ship projects to GitHub, get reviews from senior engineers, and collaborate with peers through Slack and office hours. By the end, you’ll have a working AI agent repo, a story to tell in interviews, and practical experience with the same tools we use in production every day.

Now, some context on why we’re launching this. Over the past year, we’ve noticed that both recent grads and experienced engineers are struggling to break into new roles. The job market is challenging right now, but one area of real growth is software that uses LLMs and retrieval-augmented generation (RAG) as part of production-grade systems. That’s the work we’re doing every day at TechEmpower, and it’s exactly the skill set this Bootcamp is designed to teach.

We’ve already run smaller cohorts, and the results have been encouraging. For some participants, it’s been a bridge from graduation to their first job. For others, it’s been a way to retool mid-career and stay current. In a few cases, it’s even become a pipeline into our own engineering team.

Our next cohort starts October 20. Tuition is $4,000, with discounts and scholarships available. If you know a developer who’s looking to level up with AI, please pass this along.

Learn more and apply here

We’re starting to see a pattern with LLM apps in production: things are humming along… until suddenly they’re not. You start hearing:

  • “Why did our OpenAI bill spike this week?”
  • “Why is this flow taking 4x longer than last week?”
  • “Why didn’t anyone notice this earlier?”

It’s not always obvious what to track when you’re dealing with probabilistic systems like LLMs. But if you don’t set up real-time monitoring and alerting early, especially for cost and latency, you might miss a small issue that quietly escalates into a big cost overrun.

The good news: you don’t need a fancy toolset to get started. You can use OpenTelemetry for basic metrics, or keep it simple with custom request logging. The key is being intentional and catching the high-leverage signals.
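
As a starting point, even a simple logging wrapper catches the two high-leverage signals above. The price constant and alert threshold below are made up, not any provider’s real rates:

```python
# Sketch: a minimal request log for LLM calls that records latency and
# estimated cost, with a simple per-call alert. Prices and thresholds
# are illustrative, not real provider rates.

import time

PRICE_PER_1K_TOKENS = 0.002      # assumed price, adjust to your provider
ALERT_COST_USD = 0.10            # flag unusually expensive single calls

def log_llm_call(log, model, tokens, fn):
    """Run fn(), record latency and estimated cost, flag expensive calls."""
    start = time.monotonic()
    result = fn()                # the actual LLM request would happen here
    entry = {
        "model": model,
        "latency_s": time.monotonic() - start,
        "cost_usd": tokens / 1000 * PRICE_PER_1K_TOKENS,
    }
    entry["alert"] = entry["cost_usd"] > ALERT_COST_USD
    log.append(entry)
    return result
```

From here, aggregating the log by day and by flow answers both “why did the bill spike” and “why is this flow 4x slower” without any new infrastructure; OpenTelemetry metrics are a natural next step once this outgrows a log.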

Here are some top reads that will help you get your arms around it.

Top Articles

AI Coding Assistants Update

September 16, 2025

Tony Karrer

The conversation around AI coding assistants keeps speeding up, and we are hearing the following questions from technology leaders:

  • Which flavor do we bet on: fully agentic tools (Claude Code, Devin) or IDE plug-ins (Cursor, JetBrains AI Assistant, Copilot)?
  • How do we evaluate these tools?
  • How do we effectively roll out these tools?

At the top level, I think about:

  • Agentic engines are happy running end-to-end loops: edit files, run tests, open pull requests. They’re great for plumbing work, bulk migrations, and onboarding new engineers to a massive repo.
  • IDE assistants excel at tight feedback loops: completions, inline explanations, commit-message suggestions. They feel safer because they rarely touch the filesystem.

Here’s a pretty good roundup:

The Best AI Coding Tools, Workflows & LLMs for June 2025.

Most teams I work with end up running a hybrid—agents for the heavy lifting, IDE helpers for day-to-day quick work items.

Whichever path you take, the practices you use matter the most.

Some examples to get you started:

Reading list