My Experiences with AI Agents in Software Development

Over the last couple of years I’ve built several projects with AI coding agents — production ML systems, serving pipelines, internal tools. Enough builds to get past the demo-day excitement and form a real view of what changes when agents write the code and a human directs them.

This is that view — written for other engineers and engineering leads making the same shift. Not a forecast, not a tutorial. Just an honest report on what works, what breaks, and what I’m still not sure about.

The core shift is this: software work is moving from writing code to specifying and verifying it. That’s real, it already works in a specific zone, and it raises hard questions about skill, trust, and long-term quality that nobody has fully answered yet.

The shift, stated plainly

For most of my career, building software meant a person writing the code by hand. You understood the problem, designed an approach, then spent most of your hours producing the code yourself. That hand-writing of code was the real bottleneck — the slow, careful work of turning a clear idea into correct, complete code.

Agents change this. The thinking still has to happen, and happen well — but it happens up front, as a specification, a prompt, a set of constraints. The code itself is generated by the agent, which writes it faster and often better than a person would. What’s left for the human is the two ends: specifying clearly before, and verifying carefully after. The slow middle — a human writing the code by hand — is largely gone.

There’s a second effect worth naming, because many teams feel it but few say it: not every team has enough strong engineers. Agents raise the floor — they give a team a baseline of competent code regardless of who’s directing them. But you still need strong people to decide what to build and to judge whether the result is right. So the shape changes: fewer strong people, doing higher-value work, instead of a large team of uneven skill.

This isn’t a small change in workflow. It reshapes how time is spent, what skills matter, how teams are built, and what it means to own a piece of software. The rest of this article is the consequences.

What agents are genuinely good at today

Credit where it’s due — the capability is real, and underselling it misleads as much as overselling it.

Agents are exceptional at the happy path. Give one a clear, well-scoped task and it produces working, idiomatic code faster than any human, often with sensible structure and error handling. In one project, an agent scaffolded a complete ML serving pipeline — API layer, model loading, prediction endpoint, logging — in a single session. Days of boilerplate, done in an afternoon, and done well.

They also go deep on one thing far better than they go broad across many. Hand an agent a single component with clear boundaries and it excels: it is not just autocompleting, but making reasonable choices within the boundary you drew. The skill is in drawing the boundary.

This is why agents shine on greenfield work and prototypes: no legacy context to absorb, no decade of old decisions to respect — just a clean problem and a clear target. There, an agent is a genuine force multiplier.

Where they break — the integration points

Real systems are not a single happy path. They’re a collection of components, and the hard part has always been where those components connect — the points where one part’s assumptions meet another part’s reality.

Agents build each component correctly for the inputs they were given. The gaps show up where the pieces connect — especially where real production data meets code tested only against clean, simple examples. I learned this the hard way: an agent built a serving component that passed every test, because the tests used minimal, well-formed inputs. Only when a real production record — wide, messy, full of the irregularities real data carries — flowed through the whole system did the gap appear. The component was right. The connection between components was wrong.

This is the architect’s job, and it doesn’t delegate. The agent builds for the inputs it’s given; the human has to specify what the real inputs look like and make testing against them non-negotiable. An agent won’t tell you to “test with a realistic production record” — it doesn’t know what production looks like. You do. That knowledge is exactly what the human still owns.
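A minimal sketch of that kind of gap, under invented assumptions: the two components, the field names, and both records below are hypothetical stand-ins, not the system from the project. It shows a flow that passes on a clean record and fails on a production-shaped one, even though each component does what it was asked to do.

```python
from typing import Any

# Hypothetical component 1: turns a raw record into a feature vector.
# It was specified and tested against clean inputs only.
def extract_features(record: dict[str, Any]) -> list[float]:
    return [float(record["age"]), float(record["income"])]

# Hypothetical component 2: a stand-in for the model.
def predict(features: list[float]) -> float:
    return 0.01 * features[0] + 0.00001 * features[1]

def run_end_to_end(record: dict[str, Any]) -> tuple[bool, str]:
    """Push one record through the whole flow and report what happened."""
    try:
        score = predict(extract_features(record))
    except (KeyError, TypeError, ValueError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"score={score:.3f}"

# The kind of input the original tests used: minimal and well-formed.
CLEAN_RECORD = {"age": 42, "income": 55000}

# An invented production-shaped record: padded strings, a null, extra fields.
PRODUCTION_SHAPED_RECORD = {
    "age": "42 ",
    "income": None,
    "signup_source": "partner-api",
    "notes": "",
}

for label, record in [("clean", CLEAN_RECORD), ("production", PRODUCTION_SHAPED_RECORD)]:
    ok, detail = run_end_to_end(record)
    print(label, "PASS" if ok else "FAIL", detail)
```

The useful part is the second record, not the code. Only someone who knows what production data looks like can write it, and it has to run through the whole flow rather than through one component at a time.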

The lesson generalizes: agents are excellent at building to a specification and poor at knowing what the specification should have said. The gap between those two is where experience lives.

When the flow looks perfect and isn’t

My most expensive lesson came from a project where everything appeared to go right. I ran a multi-level agent setup — a strategist agent decomposing the work, a working agent executing it — across ten phases. Each phase looked clean: the strategist defined it, the working agent executed and summarized, the strategist reviewed the summary and approved, and we moved on. Ten green phases.

The missing piece was me. There was agent-side validation at every phase, but no real human check — I was trusting the summaries. It looked perfect until I sat down to validate Phase 1 at the end and hit a cascade of problems. Things that would have taken minutes to fix right after Phase 1 were now tangled through nine phases built on top of them. The flow looked perfect precisely because the only thing checking it was the same machinery that produced it.

That rewired how I work. An agent reviewing another agent’s summary is not the same as a human checking the actual output against reality — and the gap is invisible until you look. Which is why you can’t defer the looking.

Where agents quietly drift

There’s a subtler failure too: agents don’t reliably hold the guardrails you set. They agree to a principle, honor it for a while, then drift — quietly enough that you only catch it if you’re watching.

Two examples. One of my firm rules is never patch a symptom with a manual one-off command — find the root cause and fix it properly. An agent respects that early in a session, but several exchanges later, when something breaks, it will happily suggest a manual workaround. Remind it of the rule and it corrects course. The rule didn’t change; its hold on it decayed over the conversation.

The second: I keep my strategist agent (the what and why) separate from my working agent (the how). Over time the strategist starts overreaching — writing sample code that belongs to the working level. A reminder fixes it; again, the boundary eroded on its own.

The common thread: the human has to be the persistent memory of how we agreed to work. Agents are excellent within a turn and leaky across many. Holding the principles steady is a standing job for the person in the loop.

The new working model — from pairs to pods

Most engineers already know one model for working together: pair programming. Two developers working through a problem side by side. It’s a human-to-human model, built for a time when people wrote the code by hand — so its output is bounded by how fast humans can produce code, a few hundred lines on a good day.

The structure I’ve found works in the agent era is different enough to need its own name. Call it pod programming. A small, cross-functional team sits together and builds with agents — but the team is organized around a different kind of work. A representative pod:

Role	Count	Owns
Business Analyst / Product Owner	1	Whether it’s the right thing to build
Technical Architect	1	How the pieces connect, system integrity
Full-stack Engineer	1	Implementation direction across the stack
Quality Engineers	2	Verification and the acceptance gates

This is not “a bigger team doing the same work more slowly.” A pair produces code at human pace; a pod directing agents handles a far larger flow, because code generation is no longer the slow step. Because the agent writes the code, the people are freed for what matters more — each brings their own expertise to specifying and checking the work, all working together through the same build rather than one person driving it alone.

Notice the team has more quality engineers than full-stack engineers — a flip of the traditional ratio, where developers usually outnumber QE. It’s deliberate: when agents generate code faster than any human can, the hard work moves downstream. Checking the work becomes as big a job as producing it — so you staff for it. When code generation was the bottleneck, you staffed for that; when verification is the bottleneck, you staff for verification. The exact numbers are just an example — the point is the direction.

There’s a further shift this only hints at. Since neither role primarily writes code, the line between “full-stack engineer” and “quality engineer” blurs: the shared skill becomes specifying and verifying agent output, whether you call yourself an implementer or a verifier. I expect these to merge into one role over time: engineers who specify, direct, and verify. Today you might staff them separately; the trajectory is toward one discipline.

A note on size: a pod of five or six people today is driven mostly by ownership — you want a person clearly accountable for each domain (the product, the architecture, the build, the verification). It is not that the work needs that many hands. As agents improve and each person can confidently own more, I expect the pod to get smaller over time. The roles matter more than the headcount, and the headcount should fall.

Whatever the exact composition, the principle holds: agents don’t remove the people, they change what the people spend their time on.

Roughly, I’ve found the time splits in half: about 50% working with the agent — prompting, directing, course-correcting — and about 50% understanding and verifying what it produced. That second half is not optional during this transition, and underbudgeting it is the most common mistake I see. The speed of generation seduces you into skipping the comprehension, and the gap compounds. (This is also where the case for staffing verification heavily comes from — if half the work is checking, you resource it like half the work.)

As an illustration of the rhythm — and I offer this as a proposed model, not a measured law — a productive session looks something like:

Phase	Duration	Activity
Specify	~60 min	Prompt design — covering the requirements precisely, folding in lessons from past sessions about where the agent tends to go wrong
Build	~90 min	Building with the agent and the pod, each member directing within their domain
Verify	~60 min	Each pod member reviews the agent’s work in their domain, traces the flow, and identifies gaps for the next session

The exact minutes don’t matter; the shape does. Significant time up front specifying, a concentrated build, and significant time after verifying. The build, the part we used to think of as “the work,” takes less than half the session: roughly 90 of 210 minutes in this illustration.

In practice this runs to about two such cycles a day — roughly seven hours of focused work with breaks — though it can also be one longer session. That’s a high number and I hold it loosely; the right pace depends on the team and will settle with experience. The point isn’t the schedule, it’s the proportion: spend real time specifying up front, keep the build tight, and never let verification be the thing that gets squeezed when time runs short. This is early thinking, meant to evolve.
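For readers who want the arithmetic behind those numbers, here is a small worked sketch. It uses only the illustrative durations from the session table; the function names are made up for the example.

```python
# Illustrative durations in minutes, taken from the session table above.
PHASES = {"Specify": 60, "Build": 90, "Verify": 60}

def cycle_minutes(phases: dict[str, int]) -> int:
    """Total length of one specify/build/verify cycle."""
    return sum(phases.values())

def shares(phases: dict[str, int]) -> dict[str, int]:
    """Each phase as a rounded percentage of the cycle."""
    total = cycle_minutes(phases)
    return {name: round(100 * minutes / total) for name, minutes in phases.items()}

cycle = cycle_minutes(PHASES)   # 210 minutes, i.e. 3.5 hours
daily_hours = 2 * cycle / 60    # two cycles a day -> 7.0 hours
print(cycle, daily_hours, shares(PHASES))
```

With these example figures, specifying and verifying together take 120 of the 210 minutes in a cycle.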

One consequence worth naming: prompting becomes a first-class engineering skill. Not casual chat, but deliberate craft — precise, complete, refined against past failures, and eventually standardized across the team so the pod isn’t reinventing prompts and rediscovering the same agent failure modes individually. Prompt definition, refinement, and standardization start to look like a real part of the engineering practice, the way coding standards and code review did.

The machinery — a multi-level agent model

If the pod is the people, this is the machinery they direct. The setup I’ve converged on uses three levels of agents, each with a distinct job, and — critically — human validation gates between phases.

Figure: three-level agent model. Human notes feed Level 1 (prompt agent), which feeds Level 2 (strategist agent). Level 2 decomposes the work into phases and feeds Level 3 (working agent). A human validation gate sits between each phase and the next.

Level 1, the prompt agent. Its job is to turn my rough notes into a complete, context-rich specification. This matters more than it sounds: humans forget things, drop a constraint they mentioned last week, leave a preference implicit. An agent carrying the conversation history doesn’t — it can hold all the accumulated context and preferences and fold them into a specification far more complete than what I’d hand-write in the moment. Context is the whole game here, and the agent is better at retaining it than I am.

Level 2, the strategist and architect agent. It takes that specification and decomposes the work into logical phases. For each phase it writes a detailed prompt for the level below, monitors what comes back, and surfaces to me what I need to look at. This is the agent doing the architect’s decomposition work — turning a big requirement into a sequence of buildable, verifiable pieces.

Level 3, the working agent. It follows the level-2 instructions, executes each phase, and produces a summary of what it did for validation — by both the strategist agent and the human.

The model is elegant, and it’s also exactly where my hardest lesson lives. Notice that the strategist validating the working agent’s summary is agent validating agent. That’s necessary but not sufficient. The gate that actually protects you is the human one — a person checking the real output against reality before the next phase builds on it. That gate is the easiest one to skip, because everything looks fine without it, right up until it very much doesn’t. (This is the failure I described earlier — ten clean phases, no human gate, and a Phase 1 reckoning at the end.) The machinery is only as good as the gates you actually honor.
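The difference between the two kinds of gate can be sketched in a few lines. This is an illustration under stated assumptions, not a real agent framework: every name here is hypothetical, the agents are stub functions, and in practice the human gate is a person looking at the real output, not a function.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PhaseOutput:
    summary: str   # the working agent's own account of what it did
    artifact: str  # the real output; a string stands in for code or files

@dataclass
class PhaseRecord:
    name: str
    agent_approved: bool  # strategist reviewed the summary
    human_approved: bool  # a person checked the artifact itself

def run_phases(
    phase_names: list[str],
    work: Callable[[str], PhaseOutput],
    agent_review: Callable[[str], bool],
    human_gate: Callable[[str, str], bool],
) -> list[PhaseRecord]:
    """Run phases in order and stop at the first one a gate rejects."""
    records = []
    for name in phase_names:
        output = work(name)
        agent_ok = agent_review(output.summary)  # sees only the summary
        human_ok = agent_ok and human_gate(name, output.artifact)
        records.append(PhaseRecord(name, agent_ok, human_ok))
        if not human_ok:
            break  # nothing gets built on an unvalidated phase
    return records

# Stubs: phase-2 produces a bad artifact but still reports success.
def stub_work(name: str) -> PhaseOutput:
    artifact = "broken" if name == "phase-2" else "ok"
    return PhaseOutput(summary=f"{name}: done, all checks green", artifact=artifact)

def stub_agent_review(summary: str) -> bool:
    return "green" in summary

def stub_human_gate(name: str, artifact: str) -> bool:
    return artifact == "ok"

records = run_phases(
    ["phase-1", "phase-2", "phase-3"], stub_work, stub_agent_review, stub_human_gate
)
for record in records:
    print(record)
```

In this run the agent review approves phase-2 because the summary looks fine. Only the gate that inspects the artifact stops the flow, so phase-3 never starts.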

The human role, redefined

If the agent writes the code, what is the human for? Two durable things.

Domain expertise. Knowing what the system needs to do, what production reality looks like, where the edge cases hide, what “correct” means here. The agent has none of this — it has to be told, and what it builds is capped by the quality of what it’s told.

System ownership. This distinction is load-bearing: reading the code and understanding the system are not the same thing. Today you need both — you read the output because you don’t yet fully trust it, and you understand the system because you’re accountable for it. But they can come apart. You can stop reading every line while still owning the system’s behavior, its contracts, its failure modes, its long-term quality. The human moves up — from author of implementation to owner of the system — not off the ladder entirely.

That’s the role that survives: the person who holds the system in their head, sets direction, specifies reality, and stays accountable for whether the thing is correct, maintainable, and observable — regardless of who or what wrote it.

The questions I don’t have clean answers for

An honest account has to include what’s unresolved — and these genuinely are. I raise them not to answer them, but because anyone adopting agents seriously will run into them.

Skill evaporation. If agents write the code and humans verify, what happens to the skills that only come from writing it? Engineers may lose the hands-on fluency that lets them verify well — and the next generation may never build it. The verification we rely on today is backed by years of authorship; what backs it when authorship fades? I don’t know.

Trust, and how it’s earned. I verify heavily today because I don’t fully trust agent output. Trust should come from demonstrated reliability — but how a team calibrates “how much can I safely not check” is still immature. Trust too little and you lose the gain; trust too much and the gaps between components bite you.

Technical debt changes shape — it doesn’t disappear. My first instinct was optimistic: agents write cleaner, more consistent code than rushed humans, so the classic forms of debt — copy-paste, tangled structure, the shortcut you swore you’d revisit — genuinely shrink. And they do. But debt doesn’t vanish; it migrates. The new form is what I’d call unvalidated-flow debt — work that looks finished and passes the agent’s own checks but was never verified against reality, quietly accumulating until someone finally looks. The ten-phase project I described earlier was exactly this: no messy code anywhere, and yet a large, invisible debt that came due all at once at Phase 1. This new debt is arguably more dangerous than the old kind, precisely because it’s invisible. Sloppy code announces itself; unvalidated flow doesn’t. The defense is the obvious one, applied with discipline: once the agent has done the work, you spend real time verifying it. Done thoroughly, that verification pays the debt down as you go — and if the work then holds up under genuine checking, your remaining risk is low. The danger isn’t the debt itself; it’s deferring the verification that clears it.

The long-term quality attributes. Maintainability, scalability, observability, security, performance: the properties that decide whether a system survives for years, and exactly the ones a working demo never tests. Security deserves a special flag: agents optimize for code that works, not code that is secure, and those are different bars. An agent will happily ship code that runs perfectly while carrying an insecure default, a leaked secret, or a missing authorization check, because nothing in “make it work” forces it to think about the adversary. Whether agent-built systems hold up on these over a multi-year life isn’t known yet; the systems haven’t lived that long. The human owns these attributes, security most of all.

None of these is a reason not to use agents. They’re the reasons to use them with discipline — and to keep watching.

Where this works now, and where it’s headed

My read on the current boundary: agents are ready today for new projects of small-to-medium complexity — greenfield work, prototypes, and internal tools with some tolerance on long-term quality while the practice matures. There the gain is large and the risk is contained.

The frontier — coming, but not here yet — is the hard case: large existing codebases and legacy systems, where the agent must absorb enormous implicit context, respect a decade of decisions, and not break things no one fully remembers. That’s where the value is greatest and the capability still weakest. It’s where the whole field is pushing.

The discipline that makes it work — and where it’s going

If there’s one practice that makes agent-built software trustworthy, it’s this: decompose the work into small, verifiable phases, each with a manual acceptance gate before the next begins. No phase proceeds until a human has validated the last. This is what keeps the speed of generation from outrunning the comprehension — it forces the verification to happen, phase by phase, instead of being deferred until it’s too large to do.

One testing practice fits this moment almost too well. I’ll describe it by what it does, not what it’s called, because the label invites arguments that miss the point: tests are the acceptance gates. Define the tests first — the real ones, against production-shaped data — and you’ve defined both what the agent should build and how you’ll know it worked. The mechanism is what matters: it turns “verify the agent’s work” from a vague obligation into a concrete gate that either passes or doesn’t. Agents and executable gates are a strong pairing, because a gate is exactly the kind of unambiguous target an agent builds toward well.
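As a rough sketch of what "tests are the acceptance gates" can look like, the example below defines the acceptance cases first and then runs two candidate implementations against them. Everything in it is an assumption made for illustration: the `amount` field, the rule that a missing amount counts as 0.0, and both parsers are hypothetical.

```python
from typing import Any, Callable

# Acceptance cases written before any implementation exists.
# The records are invented but production-shaped: formatted strings,
# stray whitespace, nulls, and missing keys.
ACCEPTANCE_CASES: list[tuple[dict[str, Any], float]] = [
    ({"amount": "1,200.50"}, 1200.50),
    ({"amount": " 15 "}, 15.0),
    ({"amount": None}, 0.0),
    ({}, 0.0),
]

def gate(parse_amount: Callable[[dict[str, Any]], float]) -> tuple[bool, list[str]]:
    """Return (passed, failures) for one candidate implementation."""
    failures = []
    for record, expected in ACCEPTANCE_CASES:
        try:
            actual = parse_amount(record)
        except Exception as exc:
            failures.append(f"{record!r}: raised {type(exc).__name__}")
            continue
        if actual != expected:
            failures.append(f"{record!r}: expected {expected}, got {actual}")
    return (not failures, failures)

# A first attempt that only handles clean input.
def naive_parse(record: dict[str, Any]) -> float:
    return float(record["amount"])

# A later attempt built toward the gate.
def robust_parse(record: dict[str, Any]) -> float:
    raw = record.get("amount")
    if raw is None:
        return 0.0
    return float(str(raw).replace(",", "").strip())

print(gate(naive_parse))
print(gate(robust_parse))
```

The gate gives a plain pass or fail and lists the cases that failed, which gives the agent a concrete target to build toward and gives the human a concrete check.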

And the longer arc? I’ll end on a forward view, held with deliberate caution. Given the trajectory of the last couple of years, I expect agents to produce genuinely production-grade code and artifacts within the next couple of years. When they do, not reading the code line-by-line stops being negligence and becomes a rational allocation of attention — you verify behavior and contracts, and let implementation be a detail. Software starts to resemble a black box you specify the inputs and outputs for, trusting the middle the way we already trust the software products we build on.

I hold that as a direction of travel, not a description of today — right now you verify everything, because the reliability isn’t there yet. But the gap between “verify every line” and “trust the box” is exactly what the discipline above — small gated phases, real tests, owning the system rather than reading every line — is built to bridge. And trusting code you haven’t read isn’t new: we already trust compilers and the software products we build on, relying on the contract and verifying at the boundary. Trusting agent output is the same move, one level up. The practices that make the transition safe are the ones that make the destination workable.

What stays constant, in every version of this future I can see, is the human who understands the system — what it must do, what reality it faces, and whether it’s any good. Writing the code was never the point. That part, I’m confident about.

This article reflects my own experience building with AI agents across several projects, written with the help of AI from my notes and direction. I’ll update it as the practice — and my view of it — continues to evolve.