Agent Engineering 101 at GDG London: How to Build Reliable AI Systems

Last week I had the chance to speak at GDG London Build with AI 2026. My session, Agent Engineering 101: How to Build Reliable AI Systems, is now live as a presentation at superagenticai.github.io/Agent-Engineering-101. The official agenda was published on the event schedule pages at buildwithai.gdg.london and iwd.gdg.london/schedule.

The core idea

My central argument was simple: most AI failures in production are not model failures. They are engineering failures. A prompt can get you surprisingly far, and a clever demo can look magical, but once you move into production, reality shows up fast. Retrieval gets noisy, tool calls fail, memory becomes messy, models change, costs and latency start mattering, policy boundaries tighten, and users ask unexpected questions. That is where prompt tinkering starts to break down, and that is where Agent Engineering becomes the right frame. Agent Engineering is the broader systems discipline required to build non-deterministic AI systems that are reliable, testable, and useful in the real world.

An agent is not a prompt

One of the first points I made at GDG London was that an agent is not just a fancy system prompt. An agent is a system. In practice, that system includes a goal, a model, runtime context, tools, state or memory, policies, and a control loop that decides what to do next. Once you add retrieval, tools, memory, workflows, and feedback loops, you are no longer just prompting. You are designing an architecture.
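To make "an agent is a system" concrete, here is a minimal sketch of that control loop in plain Python. The model is a stub that stands in for an LLM deciding between calling a tool and finishing; the names (`stub_model`, `run_agent`, the `lookup` tool) are all illustrative, not from any framework.

```python
# Minimal sketch of an agent as a control loop, not a prompt.
# The model call is stubbed; in a real system it would be an LLM that
# returns either a tool invocation or a final answer.

def stub_model(goal, context, tools):
    """Pretend model: look the goal up with a tool once, then answer."""
    if "lookup_result" not in context:
        return {"action": "call_tool", "tool": "lookup", "args": {"query": goal}}
    return {"action": "finish", "answer": f"Based on {context['lookup_result']}"}

def run_agent(goal, model, tools, max_steps=5):
    context = {}                        # runtime state / memory
    for _ in range(max_steps):          # bounded control loop (a policy)
        decision = model(goal, context, tools)
        if decision["action"] == "finish":
            return decision["answer"]
        tool = tools[decision["tool"]]
        # feed the tool result back into runtime context for the next step
        context[decision["tool"] + "_result"] = tool(**decision["args"])
    return "gave up: step budget exhausted"  # fallback policy

tools = {"lookup": lambda query: f"docs for '{query}'"}
print(run_agent("rate limits", stub_model, tools))  # Based on docs for 'rate limits'
```

Even this toy version shows the parts listed above: a goal, a model, runtime context, tools, state, a policy (the step budget), and a loop that decides what to do next.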

How teams are building agents today

This is the pattern many of us have seen: you start with a prompt, add RAG, add tools or MCP, swap models, add a framework, and keep patching until it kind of works. This is a useful way to learn, but it often creates accidental complexity. The prompt becomes hidden business logic, retrieval becomes a magic box, tool behavior becomes probabilistic, and memory becomes either too weak to help or too sticky to trust. This is why so many AI systems feel impressive in demos and unreliable in practice.

Why prompt tinkering breaks

A phrase I use often is this: models do not read your mind. They only see the context you actually provide, they only use the tools you expose, and they only operate within the boundaries you define. When the system around the model is vague, brittle, or poorly evaluated, the overall behavior becomes unreliable. The most common failure modes are familiar: context changes and assumptions are lost, retrieval quality is uneven, tool outputs are unpredictable, memory can introduce stale or irrelevant state, model upgrades can shift behavior, and prompts silently become undocumented product logic. At that point, the answer is not another round of prompt tweaking. The answer is better engineering.

From prompt engineering to Agent Engineering

One of the reasons I wanted to give this talk is that the AI field is creating many new labels: Prompt Engineering, Context Engineering, Harness Engineering, Eval Engineering, Memory Engineering, Skills Engineering, Guardrail Engineering, Inference Engineering, and Agentic Engineering. It can look fragmented at first, but the opposite is true. These are not separate end states. They are subsystems of one larger production discipline, and that discipline is Agent Engineering.

The main types of Agent Engineering

In the talk I covered the major disciplines now converging in real production systems. Each one addresses a different layer of the agent stack, but none of them alone is sufficient for building a reliable system:

Prompt Engineering

The instruction layer. Prompting patterns, structured outputs, expressing intent clearly.

Context Engineering

The runtime context layer. Retrieval, grounding, context construction, and MCP.

Harness Engineering

The execution layer. Tool wiring, runtime constraints, policies, permissions, sandboxes.

Eval Engineering

The measurement layer. Behavioral tests, eval suites, benchmarks.

Memory Engineering

The state layer. Short-term and long-term memory, retrieval strategies.

Skills Engineering

The capability layer. Tool use, reusable skills, composability.

Guardrail Engineering

The safety layer. Input/output controls, policy boundaries, risk mitigation.

Inference Engineering

The performance layer. Latency, cost, model serving, batching, throughput.

Agentic Engineering

The orchestration layer. Multi-agent workflows, delegation, human-in-the-loop.

All of these categories are part of the broader conversation emerging around Agent Engineering. Individually, each one solves a real problem. Together, they form the production discipline that the field has been converging toward.

Reliability comes from loops, not vibes

A major theme of the session was that reliable AI systems are built through feedback loops: not vibes, not intuition alone, and not one giant prompt. The loop looks like this:

Build → Evaluate → Optimize → Repeat

Once you start thinking this way, you stop asking how to craft the perfect prompt once and start asking how to define expected behavior, measure it, improve it, and keep doing that systematically.
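The loop is easiest to see in code. Below is a deliberately tiny illustration: the "system" is a stand-in function and the eval cases are toy arithmetic, but the shape (define expected behavior, measure it, revise, measure again) is the point.

```python
# Toy illustration of the build -> evaluate -> optimize loop.
# The "system" here is a stand-in function; in practice it would be an
# agent, and the eval cases would come from real user tasks.

eval_cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
    {"input": "10-7", "expected": "3"},
]

def evaluate(system, cases):
    """Score a system as the fraction of cases it handles correctly."""
    passed = sum(1 for c in cases if system(c["input"]) == c["expected"])
    return passed / len(cases)

def system_v1(q):                # build: first attempt, one hard-coded case
    return "4" if q == "2+2" else "?"

def system_v2(q):                # optimize: revised after seeing the score
    answers = {"2+2": "4", "3*3": "9", "10-7": "3"}
    return answers.get(q, "?")

print(evaluate(system_v1, eval_cases))  # ~0.33 -> motivates a revision
print(evaluate(system_v2, eval_cases))  # 1.0 -> ship, but keep the evals running
```

The eval set outlives any single version of the system, which is exactly what makes "Repeat" possible.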

Why DSPy and GEPA matter

This is why I wanted to include DSPy and GEPA in the talk. DSPy is important because it pushes AI development toward a more modular, declarative, software-like model, helping move the work from brittle strings toward structured AI programs. GEPA matters because optimization should not be random. It offers a more systematic route for improving prompts and textual system components using evaluation feedback and reflection. The bigger point is that AI systems need optimization not only at the prompt layer but across context, retrieval, tools, memory, and orchestration.
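To show the shape of eval-driven optimization without any framework, here is a toy loop in that spirit. To be clear, this is not the GEPA algorithm (which uses reflective proposals over textual feedback), just a deterministic greedy sweep over hypothetical instruction fragments, scored by a stand-in metric.

```python
# Toy optimization loop in the spirit of eval-driven optimizers like GEPA
# (NOT the actual GEPA algorithm, just the shape of the feedback loop):
# propose a prompt variant, score it against an eval set, keep the winner.

def score(prompt, required):
    """Stand-in metric: fraction of required behaviors the prompt asks for."""
    return sum(1 for kw in required if kw in prompt) / len(required)

required = ["cite sources", "be concise", "refuse unsafe requests"]
fragments = required + ["use emoji", "be verbose"]  # candidate instructions

best = "Answer the user."
best_score = score(best, required)
for fragment in fragments * 2:              # bounded, deterministic sweep
    candidate = best + " " + fragment       # stand-in for a reflective proposal
    if score(candidate, required) > best_score:     # greedy accept
        best, best_score = candidate, score(candidate, required)

print(best_score)   # climbs from 0.0 to 1.0 on this toy metric
```

Note what the loop rejects as readily as what it accepts: "use emoji" never improves the score, so it never enters the prompt. That is the difference between systematic optimization and random tweaking.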

Google’s ecosystem for production agents

Because this was a GDG London session, I also wanted to connect the topic to Google’s agent ecosystem. On the model side, the Gemini model family continues to improve across reasoning, multimodality, long context, and cost-performance tradeoffs, and for builders who need structured outputs and tool use, it is worth looking at Gemini structured output and Gemini function calling.
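As a small illustration of the tool-use side, here is a function declaration in the OpenAPI-style schema that Gemini function calling documents (field shapes follow the REST API; the `get_weather` tool itself is hypothetical), plus a harness-side guard that is entirely our own code, not part of any SDK.

```python
# A tool declared in the OpenAPI-style schema used by Gemini function
# calling (per the docs; the tool itself is a made-up example).

get_weather = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "OBJECT",
        "properties": {
            "city": {"type": "STRING", "description": "City name"},
        },
        "required": ["city"],
    },
}

def validate_call(declaration, proposed_args):
    """Harness-side guard: reject model tool calls missing required args."""
    required = declaration["parameters"].get("required", [])
    missing = [k for k in required if k not in proposed_args]
    return (len(missing) == 0, missing)

ok, missing = validate_call(get_weather, {"city": "London"})
print(ok)        # True
ok, missing = validate_call(get_weather, {})
print(missing)   # ['city']
```

Declaring the contract explicitly, and checking proposed calls against it before execution, is a small example of tool behavior moving from probabilistic to engineered.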

On the framework side, Google’s Agent Development Kit (ADK) gives developers a modular way to build and orchestrate agents, and the ADK agent documentation is a useful starting point for understanding how Google frames agents as self-contained execution units. For multi-agent communication, the A2A documentation is especially interesting because it shows how agents can communicate with one another in a structured way. This ecosystem is useful not because it proves one vendor wins, but because it shows that the stack is maturing. Models, runtimes, tools, protocols, and orchestration are becoming distinct layers.

Practical rules for resilient agents

I wanted the talk to be practical, so I ended with a few rules I believe matter more than hype:

  • Keep the task narrow
  • Define explicit tool contracts
  • Bound memory carefully
  • Write evals early
  • Add fallbacks and retries
  • Expect models to change
  • Observe behavior in production
  • Optimize systematically

Reliable agents are rarely the ones with the flashiest demos. They are usually the ones built with the clearest constraints, the best feedback loops, and the most thoughtful engineering.

The bigger point

What I wanted people to leave with at GDG London was not just that agents are exciting. I wanted them to leave with a more serious idea:

Agent Engineering is becoming the reliability discipline for AI systems.

As foundation models improve and become more widely available, the competitive edge shifts higher up the stack, to context design, memory strategy, evals, orchestration, guardrails, inference tradeoffs, and optimization. In other words, the future belongs less to people who can write clever prompts once and more to people who can design systems that keep working.

Closing

It was great to share these ideas at GDG London Build with AI 2026 and to see how many developers are now moving past the wow phase of AI into the much harder and more interesting phase of building dependable systems. If you want to browse the talk, the full presentation is here: Agent Engineering 101.

Thanks again to the GDG London team for organizing the event and creating space for practical conversations about what it really takes to build reliable AI systems.