Every once in a while, the TechEmpower Framework Benchmarks surface something that makes everyone involved squint at the numbers and say, “Wait… what?”

Round 23 had one of those.

On the face of it, nothing particularly dramatic had changed about the idea of the benchmarks. Same philosophy as always:

Choosing a web application framework involves evaluation of many factors. While comparatively easy to measure, performance is frequently given little consideration. We hope to help change that.

Application performance can be directly mapped to hosting dollars, and for companies both large and small, hosting costs can be a pain point. Weak performance can also cause premature and costly scale pain by requiring earlier optimization efforts and increased architectural complexity. Finally, slow applications yield poor user experience and may suffer penalties levied by search engines.

What if building an application on one framework meant that at the very best your hardware is suitable for one tenth as much load as it would be had you chosen a different framework?

That’s the north star for the project, and it hasn’t really changed.

What did change in R23 was that one set of tests, the spring-mongo implementations, looked much worse than anyone expected. Not “eh, that’s a little low” worse, but “this is probably pointing at a real problem” worse.

And because the internet is sometimes better than we deserve, the lead performance engineer at MongoDB noticed, reached out, and we ended up on a little adventure together.

MongoDB’s Java story in the benchmarks is represented by spring-mongo (a Spring-based implementation) as well as a few Node.js variants. In Round 22, the spring-mongo tests were not the fastest things on the planet, but they were in a reasonable ballpark.

By round 23, some of the spring-mongo results in our continuous benchmark environment (Citrine) had fallen off a cliff.

For one of the core tests (query) the numbers looked roughly like this (requests per second, spring-mongo only):

  • Round 22: ~5.9k
  • Round 23 baseline: ~583

That’s not noise. That’s a smoking gun.

At the same time, other frameworks and other MongoDB-based implementations weren’t seeing that same kind of collapse. That’s usually a hint that something environmental or configuration-related is going wrong, rather than “MongoDB suddenly became slow” or “Spring suddenly can’t talk to the database.”

This is where Ger Hartnett from MongoDB enters the story. Ger is the lead performance engineer at MongoDB, and he did exactly what I wish more vendors would do: he treated the weird benchmark as a bug report.

He emailed me and, very politely, started asking all the questions you’d expect a performance engineer to ask:

  • “What’s the host hardware?” We’re running on a single Xeon 6330 box with 56 hyper-threaded cores.
  • “What OD and kernel versions were used for R22 and R23?” Ubuntu 22.04 for R22 and 24.04 for R23; the host OS had also moved forward over time, and we hadn’t been explicitly encoding that in the results yet.
  • “Can we get logs for spring-mongo?” We recommend against enabling logging for “round runs” (I/O eats CPU), but the benchmark harness and public status pages make it possible to turn them on temporarily (via pull request) and then correlate them with specific continuous runs.

That kicked off a back-and-forth where we coordinated on where and how their team could safely tweak the spring-mongo implementations.

Meanwhile, MongoDB pulled in Intel to help analyze what was happening on their own lab hardware, including runs with and without “performance mode” enabled at the platform level.

So we now had three axes of investigation:

  • The application and its configuration (the spring-mongo code and connection settings)
  • The database process and allocator behavior (MongoDB itself under Docker)
  • The platform-level configuration (power/performance modes, etc., analyzed with tools like PerfSpect)

The team ended up sending two pull requests:

  • #10514 – a workaround for a MongoDB server issue involving TCMalloc per-CPU memory pools when running inside Docker containers.
  • #10565 – changes to the way the spring-mongo client connects to the database, including adjustments to maxPoolSize and related options so the driver can actually make use of the available hardware.

Once both PRs were in and our continuous benchmarking pipeline had cranked through them, Ger sent a summary email with the numbers for spring-mongo:

Test R22 R23 Baseline TCMalloc Fix Only ConnString \+ TCMalloc
query 5.9k 583 837 14.2k
db 104k 68k 192k 205k
fortune 22k 59k 167k 188k
update 2.4k 528 795 10.2k

The query test went from 583 to 14.2k in our environment. That’s roughly a 25x relative to the Round 23 baseline, and significantly better than the Round 22 result. The other tests also improved substantially, especially once the allocator and pool configuration were both in a sane place.

This was not just about our benchmark setup. MongoDB and Intel ran the same workload in their own environments, with and without CPU “performance mode,” and saw similar patterns: the combination fixes make a huge difference.

I like stories where everyone comes out looking better in the end. MongoDB has a much stronger showing in the benchmarks, the benchmark implementations themselves improve as reference examples, and if you’re out there picking a stack for your next project, you get numbers that are closer to what’s actually possible when people who care about performance have had a chance to sharpen the edges.

Product meets Engineering in the AI Era

March 13, 2026

Tony Karrer

Join us on April 10: Product and Engineering Working Together in the Agentic Coding Era

We’ve assembled four product and engineering leaders to share exactly how they’ve retooled their processes. This virtual mini-conference is designed for CPOs, VPs of Product, CTOs, and Heads of Engineering who want practical, ready-to-apply examples — not theory. Register here

CPOs, VPs of Product, and CTOs are experiencing a common challenge: while agentic coding tools accelerate product development, they also introduce new friction between product and engineering. A product manager (PM) creates a spec that tells engineering what they want built, and then one of two things happens:

  • The engineer appropriately asks the agentic coding tool what questions it has. The agent immediately surfaces 15 questions, 12 of which need input from product. You have a cycle time hit and more context switching.
  • The engineer doesn’t surface the questions and builds it anyway. After PR reviews and QA, they realize the implementation does the wrong thing.

One theme for the first half of 2026: product and engineering leaders need to reduce this new friction.

What changed

A PM’s spec has two audiences.

First, people:

  • Reviewers (customers, leadership, other PMs) who need to confirm the product intent.
  • Engineers who need to reason about tradeoffs, durability, and how it fits the architecture.

Second, agents:

  • The agentic coding tool that will try to execute what you wrote, literally, at speed.

So what do we do?

PMs should use codebase-aware tools before handoff

I would highly recommend that product leaders and product managers try out the new Claude Desktop app, which bundles Claude, Claude Cowork, and Claude Code into a more PM-friendly interface. You can use it for a LOT more product needs than creating specs – see the additional reading below.

To get your PMs onboard, consider using the tool to ask:

“What does the product do today in scenario X?”

If you have Claude Desktop connected to your code, it often can answer those types of questions. It also will provide you the answer to:

“Given this draft spec, what questions do we need to answer before someone starts work?”

This helps PMs clarify ambiguity so you avoid the new friction points.

It’s time to change the default from “PMs don’t have visibility into the repo.” That policy actively works against speed and alignment. By giving the AI tooling access to the code base, PMs are empowered with insight while maintaining the separation of responsibilities with engineering.

Side note: Markdown is quickly becoming the shared format for specs because it’s easy to diff, easy to reuse, and plays nicely with repos and agent workflows. Pick a Markdown editor you like (Obsidian is a good choice) and make it part of the standard toolkit.

PRDs and Tickets => Specs

You may want to start calling PRDs / Tickets or other definitions of what’s to be built “specs” internally, not because PRD is wrong, but because it communicates a shift: the output is meant to be fed into an agentic coding tool w/ more specifics.

The upcoming virtual mini-conference and the additional reading has lots of help on this front, for example – acceptance criteria and edge cases are critical.

AI supports PMs but does not replace their judgment; it should enhance decision-making efficiency. Use AI to accelerate drafting, decomposition, and edge case discovery. But the final tradeoffs, priorities, and product decisions still belong to the PM. And us engineers still get to rely on PM judgment to know what to build.

Engineering still has to engineer

A clear spec does not eliminate engineering responsibilities. Strong teams do two things consistently:

  1. Architecture and technical planning: fit the spec into the system in a durable way (constraints, data flows, integration points, performance, security).
  2. Task shaping: break the spec into finer-grained development tasks that are independently testable, so agentic execution stays controlled and reviewable.

A good spec allows the engineers to focus on the work that actually requires engineering judgment.

Reading list

 

Red Teaming Gen AI

February 12, 2026

Tony Karrer

If you’re pushing LLM or RAG features into production, you already know the stakes: the models aren’t just code, they’re evolving systems that interact with unpredictable users and highly variable data. Traditional QA isn’t enough. To ship resilient AI and win confidence from customers and stakeholders, adversarial testing needs to move to the top of your playbook.

Adversarial testing: why it matters for LLM and RAG systems

Adversarial testing or “red teaming” is about trying to make your AI fail on purpose, before malicious actors or edge-case users do. For LLMs and RAG, that means probing for prompt injections, jailbreaks, hallucinations, data leakage, and subverted retrieval strategies.

LLM systems are vulnerable to cleverly crafted prompts that skirt safety limits and encourage harmful, biased, or unauthorized outputs.

RAG and hybrid architectures have unique takeover risks: manipulating the retrieval pipeline, poisoning source documents, or confusing context windows so the model behaves unpredictably.

Adversarial testing uncovers real issues that aren’t obvious until your model is live: privacy leaks, bias amplification, data extraction attacks, and unreliable inferences; all the stuff that keeps CTOs and CISOs up at night.​

How do tech leaders integrate adversarial testing for LLM/RAG?

  • Simulate attacks with both manual red teaming and automated tools and test vectors like prompt injections, data poisoning, and retrieval manipulation.
  • Chain attacks across model and retrieval layers; don’t assume vulnerabilities stop at the model boundary.
  • Use playbooks like MITRE ATLAS, OWASP ML Security Top 10, and keep logs for every test; they’re useful for team learning, postmortems, and compliance.
  • Layer in robust monitoring so adversarial scenarios are caught in real time, not just during scheduled security reviews. Real-time monitoring is essential for both security and reliability.
  • Involve domain experts and skeptics. Adversarial ideation is creative work, not just automation. It takes deep product knowledge and a healthy dose of adversarial thinking to imagine how your outputs could be abused.​
  • Simulate attacks with both manual red teaming and automated tools and test vectors like prompt injections, data poisoning, and retrieval manipulation.
  • Chain attacks across model and retrieval layers; don’t assume vulnerabilities stop at the model boundary.
  • Use playbooks like MITRE ATLAS, OWASP ML Security Top 10, and keep logs for every test; they’re useful for team learning, postmortems, and compliance.
  • Layer in robust monitoring so adversarial scenarios are caught in real time, not just during scheduled security reviews. Real-time monitoring is essential for both security and reliability.
  • Involve domain experts and skeptics. Adversarial ideation is creative work, not just automation. It takes deep product knowledge and a healthy dose of adversarial thinking to imagine how your outputs could be abused.​

Reading List

Building Reliable Autonomous Agentic AI

January 12, 2026

Tony Karrer

Over the past few years, CTOs have been building LLM-based systems using a DAG workflow approach. Autonomous agentic systems are a different sport. We’ve had reliability as a key question and it’s even more critical when a model can take actions (call tools, write to systems, trigger workflows). There’s incredible power here, but also big challenges.

A few definitions to start

Autonomous agentic system: an LLM wrapped in a loop that can plan, take actions via tools, observe results, and continue until it reaches a stop condition (or it’s forced to stop).

Tool calling: the agent selecting from a constrained action space (tool names + schemas) and emitting structured calls; your runtime executes them, validates outputs, and feeds results back into the loop.

Orchestration (the “real software” around the model): state management, retries, idempotency, timeouts, tool gating, context assembly/pruning, audit logging, and escalation paths.

Closed-loop evaluation (Plan -> Act -> Judge -> Revise): a repeatable harness where you run realistic tasks, score outcomes (ideally against ground truth and human-calibrated judges), learn what broke, and iterate.

Guardrails + safe stopping: runtime-enforced constraints (policies, budgets, circuit breakers, permissions) that limit what the agent can do and force it to stop or escalate when risk rises or progress stalls.

A small set of practices that pay off fast

Treat your tools like a product surface, not a pile of functions.
The failure mode is “death by a thousand tools”: overlapping capabilities, ambiguous names, and huge schemas that make selection brittle. Keep tools narrow, make them obviously distinct, and hide tools by default unless they’re relevant to the current step. “Just-in-time” instructions and tool visibility is a pragmatic way to scale without drowning the model in choices. 

Move reliability into deterministic infrastructure (not prompt magic).
If an agent can trigger side effects (create a ticket, refund an order, email a customer), you need transactional thinking: idempotent tools, checkpointing, “undo stacks,” and clear commit points. Prompts don’t roll back production systems; your runtime does. 

Put hard budgets and explicit stop reasons into the main loop.
Most “runaway agents” are simply missing guardrails that set limits on: iterations, tool calls, dollars, and wall-clock time; and “no progress” detectors (same tool call repeating, same plan restated, same error class recurring). When the agent hits a threshold, it should stop with a structured summary: what it tried, learned, and needs from a human.

Design for long-running work with durable state and resumability.
If the agent’s job can outlast a single context window (or a single process), assume it will crash, time out, or be interrupted. Store state externally, make steps replayable, and separate “planning notes” from the minimal context required to proceed. The goal is to resume cleanly without redoing expensive work or compounding earlier mistakes.

Make evaluation real: production-like tasks, ground truth, and judges you can trust.
Vibe checks don’t catch regressions. You want a small-but-representative set of real tasks sampled from production distributions, with ground truth where possible, and automated judges that are calibrated against human agreement (so you know what “good” means). Also assume reward hacking and metric gaming will happen. Build detection for it the same way you do for any other adversarial input.

Security guardrails: constrain action space, validate everything, and sandbox execution.
Tool calling expands your attack surface (prompt injection is just one angle). Practical defaults: strict schema validation, allow-lists for tool targets, content sanitization, least-privilege credentials, and sandboxed execution for anything that can run code or touch sensitive systems.

Want to learn how TechEmpower can help you or your team with Agentic AI?

More reading

Building production-ready agentic systems: Lessons from Shopify Sidekick (Shopify, Aug 26, 2025)

The most “copyable” part is how they hit tool sprawl in the real world and moved to just-in-time instructions, plus a very concrete evaluation approach (ground-truth sets, human agreement, judge calibration, and the reality of reward hacking).

AI grew up and got a job: Lessons from 2025 on agents and trust (Dec 18, 2025, Google Cloud)

A CTO-level framing of why “agents” change the trust model: autonomy, integration into workflows, atomicity/rollback thinking, and why governance has to be part of the architecture.

Effective harnesses for long-running agents (Nov 26, 2025, Anthropic)

Focuses on the annoying reality: agents that run for hours/days need a harness that’s built for resumability, recoverability, and controlled progress—not just bigger context windows.

What 1,200 Production Deployments Reveal About LLMOps in 2025 (Dec 19, 2025, ZenML)

A dense, case-study-heavy sweep of what shows up across production systems: context engineering, infrastructure guardrails, circuit breakers, and why “software fundamentals” keep winning over clever prompting.

Ground Truth Curation Process for AI Systems (Aug 20, 2025, Microsoft).

If you’re serious about closed-loop improvement, this is the unglamorous foundation: how to build and maintain ground truth sets that support regression testing and meaningful “judge” signals.

Function calling using LLMs (May 6, 2025, Martin Fowler).

A solid mental model for “tools as a constrained action space,” plus practical guardrails (unit tests around tool selection, injection defenses, and how to reduce boilerplate as your toolset grows).

How to build your first agentic AI system (Oct 2, 2025, TechTarget).

A pragmatic implementation-oriented checklist, including explicit loop limits, retry patterns, and when to escalate—useful for teams moving from prototypes to something operational.

AI coding tools are transforming how we make software. But measuring the impact of these tools is harder than it looks!

To address this pressing issue, we are excited to announce our upcoming webinar: AI Coding Tool Metrics: DORA and CTOs Deep Dive. This expert-led session aims to provide engineering leaders with the clarity and tools needed to navigate the complexities of measuring the impact of AI coding tools effectively.

For the first time, the LA CTO Forum is opening this session to a broader audience. Join us, along with fellow CTOs, VPEs, heads of engineering, senior product leaders, and IT leaders, to gain a practical and reality-based view of measuring AI coding tools in the real world.

Event Details

  • Date:
  • Time: /

Reserve your spot

During this two-hour mini-conference, attendees can expect:

  • Insights from a DORA researcher on how high-performing teams are adopting AI-assisted development and the key metrics that correlate with better outcomes.
  • Real-world experiences shared by two CTOs on measuring AI tools in their organizations, including utilization, quality, satisfaction metrics, and handling non-code work.
  • A moderated discussion among CTOs and attendees to address key questions and concerns.

Key Takeaways

  • Discover the metrics used by leading organizations to measure the impact of AI coding tools and the tools that can help capture them.
  • Learn how to assess where your team stands on the AI adoption curve and strategies to catch up if needed.
  • Understand the hidden value AI tools provide beyond just increasing code output.

Don’t miss this opportunity to gain valuable insights and strategies to effectively measure the impact of AI coding tools in your organization.

All registrants will receive the slides and a full session recording.

AI Coding Tools Metrics

December 1, 2025

Tony Karrer

If you’re an engineering or product leader, you’re probably already getting the question: “Are AI tools getting us the 30% productivity boost that is happening in other organizations?”

You likely don’t have a good, honest answer to that question. And to get there you need a bit of patience and to face an age old problem for software engineering – how do we measure it?

One caution at the start – let adoption mature. In almost every rollout I’ve seen, the first 3-6 months are a time of rapid improvement:

  • Engineers are learning how best to use the tools, including where they help, how to prompt, and how to sanity-check outputs.
  • Teams are still evolving rules and example prompts, and figuring out what approach to use in different scenarios.
  • Tooling, tests, and repo structures are still tuned for human-only workflows.

AI Tool adoption is the biggest knowledge and skills change for engineers and engineering teams ever in any of our careers. Competence takes time. Early on, your measure should focus on adoption and use to enable coaching, not trying to push too hard on other measures. But that doesn’t get you off the hook from figuring out how to answer the measurement question. Side note: if you haven’t yet incorporated AI coding tools into your SDLC, check out our recent blog post 2-week spike to ramp up on AI Coding Tools.

Want to learn more? We’re hosting a special two-hour deep dive for engineering and product leaders about how to measure the real impact of AI coding tools, what metrics actually matter, and how high-performing teams are handling the transition.

AI Coding Tool Metrics: DORA and CTOs Deep Dive
Friday, January 9, 2026 • 8–10 AM PST / 11 AM–1 PM EST

Reserve Your Spot

Can’t attend live? Register anyway and we’ll send you the full session recording.
All registrants receive the full recording.
This two-hour, high-impact mini-conference includes:
  • A DORA researcher sharing new findings on how high-performing teams are adopting AI-assisted development — what’s changing in their workflows and which metrics actually correlate with better outcomes.
  • Two CTOs breaking down how they measure AI tools inside their organizations: the utilization, quality, and satisfaction metrics they track, what surprised them, and how they manage the non-code work.
  • A moderated discussion among CTOs and attendees to surface real questions and compare approaches.
You’ll learn:
  • What metrics leading organizations are using — and which tools help you capture them.
  • How to find where your team sits on the AI adoption curve, and what to do if you’re behind.
  • Where AI tools create hidden value that doesn’t show up as “more code.”
This is the first time the LA CTO Forum has opened one of its online sessions to a broader audience. Don’t miss this opportunity!

What most teams actually track

Once you’re past the initial rollout, most orgs end up tracking some subset of these:

  • Utilization: AI tool usage (DAU/WAU, sessions or prompts per dev), percentage of committed code that’s AI-generated, and percentage of PRs or tickets that are AI-assisted.
  • Throughput: rates of PRs, Tickets, Story Points, Cycle Time with and without use of AI tools and use of AI tools with Productivity improvement often based on qualitative estimates.
  • Quality: commit acceptance rates, rework rates, and incident/defect trends over time for AI-touched work versus non-AI.
  • Developer satisfaction

That said, you quickly run into the same problem we’ve always had with developer measurement and the AI coding tools just layer complexity on top.

I will also point out that the widely varying studies that you read plays directly into this and the fact that you are likely measuring immature adoption.

High-value AI work that doesn’t result in “more lines of code”

The other trap is that a lot of the best AI use cases don’t include code generation and may not affect “throughput” numbers:

  1. Errors, stack traces, and debugging

    Using an assistant to explain logs, propose hypotheses, and narrow in on fixes is incredibly valuable. The final fix might be three lines of code, but the time saved in root cause analysis is where the win lives.

  2. Understanding existing codebases

    Having an agent walk an engineer through modules, data flows, and edge cases is gold for onboarding and cross-team work, and really day-to-day work as well. The output might be a short design note, a diagram, or just a better mental model, but often not code itself.

  3. Requirements analysis and development strategy

    Turning fuzzy business goals into crisp acceptance criteria, edge cases, migration plans, and trade-off analyses is real engineering work. Good use of AI here usually means more iterating and more thinking up front. This work itself is not yet code.

  4. Code review assistance

    AI can act as a second set of eyes: flagging missing tests, odd edge cases, or inconsistencies with past patterns. It may not change the size of the diff, but it can quietly improve quality and shorten the path from PR to deployment.

If you rely too heavily on Lines of Code produced, you will fall into all the old traps and you will especially undervalue these use cases.

The new friction AI introduces

Even when AI tools are helping, they create some early friction that can make metrics look worse before they look better:

  1. Requirements friction

    Once engineers get good with AI, they tend to ask more – and better – questions about requirements and acceptance criteria. Tickets that used to be “good enough” start getting challenged. That’s healthy, but in the short term it can make cycle times look longer and frustrate product managers who weren’t expecting that level of scrutiny.

  2. Code review overload

    If you think of AI as multiplying your number of junior developers, your ratio just shifted dramatically. You now have far more “entry-level” code being submitted for review review. Without changes to review practices and guardrails, senior and mid-level engineers get swamped in AI-generated diffs and everything slows down.

This is why you can’t just stare at velocity charts and “% AI-generated code” and call it a day. You have to look at the whole system: how long work takes end-to-end, how quality and incidents move, how much time seniors spend reviewing, and whether the non-code work (requirements, debugging, comprehension) is getting easier.

Pragmatic measurement stance for 2026

If you’re getting pressure to “show me the numbers,” a reasonable stance looks like:

  • Acknowledge that you need at least 3–6 months of adoption maturity before any hard conclusions.
  • Track a small set of utilization and quality signals, and compare AI and non-AI work within the same teams over time.
  • Explicitly call out the non-code use cases you care about—debugging, codebase understanding, requirements, code review—and capture their impact with a mix of targeted metrics and narrative examples.
  • Use external studies as framing, not as your baseline; your systems, codebase, and people will be different.

Reading list




AI is transforming how software gets built. Teams that integrate AI into their SDLC the right way are seeing faster delivery cycles, lower costs, and higher ROI.

To help teams make that transition effectively, TechEmpower is hosting a webinar:
Leveraging AI Tooling Across Your Software Development Lifecycle.

The session will be moderated by Tony Karrer, CEO of TechEmpower, with featured guest Brent Laster,
author of The AI-Enabled SDLC (O’Reilly). They’ll share practical strategies for integrating AI tools
across every stage of software development—from planning and coding to testing, documentation, and deployment.

This webinar will help attendees connect the dots and move from ad-hoc AI experiments to real-world, AI-driven workflows that scale.

Event Details

  • Date:
  • Time:
    /
  • Reserve your spot


What You’ll Learn

  • AI use cases across key SDLC phases: where to start and how to scale
  • Real-world examples that work: AI-assisted coding, reviews, testing, documentation, and more
  • Team enablement strategies: roles, prompting approaches, and workflows for adopting AI

All registrants will receive the slides and a full session recording.

2-week spike to ramp up on AI Coding Tools

October 23, 2025

Tony Karrer

We’ve seen many companies stumble when rolling out AI coding assistants. Success depends on building knowledge, skills, and practical habits. We’re helping across all aspects of rolling out AI tools, but we have found one practice that accelerates proficiency:

2-week (10 work-day) AI Coding Tool Ramp-up Spike

Here’s how it works:

  • 2 days of focused training
    • Day 1 (Fundamentals): Core patterns of AI-assisted development – How to write precise prompts, how to review AI results, and how to refine code without creating technical debt. Engineers leave with a systematic workflow rather than just ad-hoc examples.
    • Day 2 (Advanced): Context management, multi-file refactors, breaking down features into AI-manageable chunks, debugging AI outputs, rules, MCP servers/services. Exercises surface common failure modes, ensuring teams build the reflexes to reset context, enforce consistency, and debug AI outputs.
  • 8 days of supported, hands-on ticket work
    • Developers pick up a variety of tickets and use the AI tool as part of getting the work done.
    • Task journaling — Each developer keeps a lightweight daily log of what worked and what didn’t, building a shared playbook.
    • Feedback loops: with AI champions — Daily check-ins with champions and facilitators and asynchronous support to help overcome early friction quickly and build skills quickly.

By the end of the two-week spike, engineers have built a foundation of habits, shared practices, and a clearer sense of where the tools genuinely improve code quality and developer experience. Leaders need to provide support for continued learning beyond this two-week period, but we’ve found this to be a critical first step.

Additional Reading:

Announcing the AI Developer Bootcamp

I’m excited to share something we’ve been working on: the TechEmpower AI Developer Bootcamp. This is a hands-on program for developers who want to build real LLM-powered applications and graduate with a project they can show to employers.

The idea is simple: you learn by building. Over 6–12 weeks, participants ship projects to GitHub, get reviews from senior engineers, and collaborate with peers through Slack and office hours. By the end, you’ll have a working AI agent repo, a story to tell in interviews, and practical experience with the same tools we use in production every day.

Now, some context on why we’re launching this. Over the past year, we’ve noticed that both recent grads and experienced engineers are struggling to break into new roles. The job market is challenging right now, but one area of real growth is software that uses LLMs and retrieval-augmented generation (RAG) as part of production-grade systems. That’s the work we’re doing every day at TechEmpower, and it’s exactly the skill set this Bootcamp is designed to teach.

We’ve already run smaller cohorts, and the results have been encouraging. For some participants, it’s been a bridge from graduation to their first job. For others, it’s been a way to retool mid-career and stay current. In a few cases, it’s even become a pipeline into our own engineering team.

Our next cohort starts October 20. Tuition is $4,000, with discounts and scholarships available. If you know a developer who’s looking to level up with AI, please pass this along.

Learn more and apply here

We’re starting to see a pattern with LLM apps in production: things are humming along… until suddenly they’re not. You start hearing:

  • “Why did our OpenAI bill spike this week?”
  • “Why is this flow taking 4x longer than last week?”
  • “Why didn’t anyone notice this earlier?”

It’s not always obvious what to track when you’re dealing with probabilistic systems like LLMs. But if you don’t set up real-time monitoring and alerting early, especially for cost and latency, you might miss a small issue that quietly escalates into a big cost overrun.

The good news: you don’t need a fancy toolset to get started. You can use OpenTelemetry for basic metrics, or keep it simple with custom request logging. The key is being intentional and catching the high-leverage signals.

Here are some top reads that will help you get your arms around it.

Top Articles