Harness engineering is moving from intuition to observability. Recent work around HALO and Agentic Harness Engineering (AHE) points to an important shift in agent development. The next frontier is not only building stronger coding agents. It is building better systems around those agents. The harness decides what the model can see, which tools it can use, how errors are surfaced, how traces are stored, and how improvements are evaluated. When the harness is weak, even strong models can fail in avoidable ways.
At Superagentic AI, we have now implemented the first practical version of these ideas across our open harness stack:
- RLM Code v0.1.8
- MetaHarness v0.2.2
This is not a clone of HALO or the AHE reference implementation. It is our own implementation of the core ideas inside our existing tools. HALO gives us the trace-analysis loop. AHE gives us the observability structure. RLM Code and MetaHarness now connect both into a working open-source harness optimization workflow.
The Problem: Harnesses Fail Quietly
Coding agents do not fail only because the model is weak. They often fail because the harness around the model is brittle.
Common harness-level failures include:
- Hallucinated tool calls
- Confusing or redundant tool arguments
- Missing feedback after failed commands
- Refusal loops
- Weak retry behavior
- Poor task decomposition
- Bad context compaction
- Unclear system prompts
- Middleware that hides useful state
- Evaluation signals that do not explain why a candidate improved or regressed
These are not only model-training problems. They are engineering problems. The challenge is that they are hard to debug manually. A serious benchmark run can produce thousands or millions of trace tokens. Reading raw logs by hand does not scale. Making random prompt edits is not enough either, because it does not reliably explain which change caused which outcome. This is where HALO and Agentic Harness Engineering become useful.
What HALO Showed
HALO, or Hierarchical Agent Loop Optimizer, showed that execution traces can be used as direct evidence for
harness improvement.
The basic loop is simple:
collect traces
-> analyze repeated failure modes
-> produce a report
-> feed that report to a coding agent
-> update the harness
-> evaluate again
The important part is that the optimizer is not guessing. It is grounded in what the agent actually did. If traces show hallucinated tool calls, the harness can add stricter tool-selection guidance. If traces show redundant tool arguments, the tool schema can be simplified. If traces show refusal loops, the system prompt or middleware can be changed. HALO made the trace-to-harness-improvement loop explicit.
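As a rough sketch, the loop can be written down as a small driver function. The version below is our own illustration, not HALO's actual interface; the callables stand in for whatever implements trace collection, analysis, harness editing, and evaluation.

from typing import Any, Callable

# Illustrative HALO-style loop; names and structure are ours, not HALO's API.
def optimize_harness(
    harness: Any,
    collect_traces: Callable[[Any], list],    # run the benchmark and record traces
    analyze: Callable[[list], str],           # distill repeated failure modes into a report
    propose_edit: Callable[[Any, str], Any],  # a coding agent edits the harness using the report
    evaluate: Callable[[Any], float],         # score a harness on the benchmark
    budget: int = 3,
) -> Any:
    best, best_score = harness, evaluate(harness)
    for _ in range(budget):
        traces = collect_traces(best)
        report = analyze(traces)
        candidate = propose_edit(best, report)
        score = evaluate(candidate)
        if score > best_score:                # keep only changes the evidence supports
            best, best_score = candidate, score
    return best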
What Agentic Harness Engineering Added
The Agentic Harness Engineering paper adds a stronger structure around the same idea. AHE frames harness optimization around three observability layers:
- Component observability: The harness should be decomposed into editable pieces such as prompts, tools, tool descriptions, middleware, skills, memory, sub-agents, evaluators, and orchestration.
- Experience observability: Raw traces should be distilled into layered evidence, including overview reports, per-task or per-trace detail files, and raw traces for drill-down.
- Decision observability: Every harness edit should come with a change manifest that explains what changed, which evidence motivated it, what it is expected to fix, and what might regress.
This is the key shift. Prompt optimization says:
try a better prompt and see if the score improves
AHE-style harness optimization says:
make an evidence-backed change
record the prediction
evaluate the result
attribute the outcome
keep, discard, or redesign based on evidence
That is a more serious engineering loop. It makes harness improvement observable, auditable, and falsifiable.
What Superagentic AI Built
Superagentic AI has implemented these concepts across two open-source repositories:
- RLM Code v0.1.8
- MetaHarness v0.2.2
Together, they now support a practical workflow for trace-grounded, evidence-backed harness optimization.
RLM Code v0.1.8: Experience Observability
RLM Code now handles the trace-analysis side of the loop. We added a HALO- and AHE-inspired trace_analysis environment that works over OpenTelemetry-shaped JSONL traces.
It supports:
- Sidecar trace indexing
- Dataset overview
- Trace querying
- Trace counting
- Trace search
- Full trace viewing for small traces
- Selected span viewing for large traces
- Bounded payload handling
- Layered evidence corpus export
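For orientation, each line in the OpenTelemetry-shaped JSONL files mentioned above is a single span. A simplified record might look like the example below (pretty-printed here; in the JSONL file it sits on one line). The field names are illustrative and depend on your tracing exporter, not on a fixed RLM Code schema.

{
  "trace_id": "trace-error-1",
  "span_id": "span-0042",
  "name": "tool_call:search",
  "start_time": "2025-01-10T12:00:01Z",
  "end_time": "2025-01-10T12:00:03Z",
  "status": "ERROR",
  "attributes": {
    "tool.name": "search",
    "tool.args": "{\"query\": \"auth bug\", \"q\": \"auth bug\"}",
    "error.message": "redundant arguments: query, q"
  }
}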
The new core action is:
export_evidence_corpus
This writes a structured evidence bundle:
trace-evidence/
overview.md
index.json
detail/
trace-error-1.md
trace-error-2.md
raw/
trace-error-1.jsonl
trace-error-2.jsonl
The output is designed for another coding agent or MetaHarness to consume.
overview.md gives the high-level diagnosis. detail/*.md gives per-trace evidence. raw/*.jsonl keeps processed span data available for drill-down. index.json makes the corpus machine-readable. This implements the AHE idea of experience observability. Traces are not dumped into a prompt blindly. They become a layered evidence corpus.
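As an illustration of how a downstream agent might consume this layout, here is a small sketch. It relies only on the directory structure shown above, including the assumption (matching the example tree) that detail and raw files share base names.

from pathlib import Path
import json

# Minimal sketch of walking the exported evidence corpus; paths follow the tree above.
corpus = Path("./trace-evidence")

print((corpus / "overview.md").read_text())        # high-level diagnosis first

for detail in sorted((corpus / "detail").glob("*.md")):
    print(f"== {detail.name} ==")
    print(detail.read_text())                      # per-trace evidence

    # Drill down into the matching raw spans only when deeper context is needed.
    raw = corpus / "raw" / detail.with_suffix(".jsonl").name
    if raw.exists():
        spans = [json.loads(line) for line in raw.read_text().splitlines() if line.strip()]
        print(f"({len(spans)} raw spans available for drill-down)")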
Example workflow:
/rlm run "Find systemic harness failures trace=./traces.jsonl" env=trace_analysis steps=6
Then export the evidence corpus:
{
"action": "export_evidence_corpus",
"output_dir": "./trace-evidence",
"filters": {"has_errors": true},
"limit": 100,
"include_raw": true
}
The resulting trace-evidence/overview.md can then be passed into MetaHarness.
MetaHarness v0.2.2: Decision Observability
MetaHarness now handles the harness evolution side. We added support for trace-grounded candidate generation and AHE-style change manifests. MetaHarness already creates candidate workspaces, runs proposal backends such as Codex or Gemini, validates changes,
evaluates candidates, and stores run artifacts.
Now it also supports:
- A --trace-evidence flag for passing trace evidence into a run
- Candidate evidence injection via .metaharness/evidence/trace_evidence.md
- Proposer-written change manifests at .metaharness/change_manifest.json
- Archived proposal manifests
- Optional task-level change attribution
- Ledger fields for changed components and verdicts
A MetaHarness run can now receive trace evidence like this:
uv run metaharness run ./my-harness \
--backend codex \
--hosted \
--trace-evidence ./trace-evidence/overview.md \
--budget 1
Each candidate workspace receives:
.metaharness/evidence/trace_evidence.md
The proposer prompt explicitly tells the coding agent to use that evidence when making harness changes. Before finishing, the proposer is now instructed to write:
.metaharness/change_manifest.json
A manifest looks like this:
{
"schema_version": "metaharness.change_manifest.v1",
"candidate_id": "c0001",
"parent_candidate_ids": ["c0000"],
"changes": [
{
"id": "change-1",
"component": "tool_description",
"description": "Clarified the tool argument contract.",
"files": ["tools/search.py"],
"failure_pattern": "The agent repeatedly passed redundant arguments.",
"evidence_refs": ["trace_evidence.md#trace-error-1"],
"root_cause": "The tool schema allowed ambiguous argument combinations.",
"targeted_fix": "Make the required argument explicit and remove redundant alternatives.",
"predicted_fixes": ["task-search-001"],
"risk_tasks": ["task-search-legacy"],
"notes": "Should reduce malformed tool calls."
}
]
}
MetaHarness archives this under:
candidates/<id>/proposal/change_manifest.json
If the evaluator returns task-level results, MetaHarness can also generate:
candidates/<id>/proposal/change_attribution.json
That attribution report compares predicted fixes, risk tasks, actual improvements, and regressions. Verdicts include:
- EFFECTIVE
- PARTIALLY_EFFECTIVE
- MIXED
- INEFFECTIVE
- HARMFUL
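To make the attribution step concrete, here is a simplified sketch of how predictions from a manifest could be compared with task-level results. The verdict labels are MetaHarness's; the classification logic below is our own illustration, not the shipped algorithm.

def attribute_change(predicted_fixes, risk_tasks, before, after):
    """Compare a manifest's predictions with per-task results.

    before/after map task id -> passed (bool). Illustrative logic only.
    """
    fixed = [t for t in predicted_fixes if not before.get(t, False) and after.get(t, False)]
    regressed = [t for t in before if before[t] and not after.get(t, False)]
    unexpected = [t for t in regressed if t not in risk_tasks]

    if unexpected and not fixed:
        return "HARMFUL"
    if regressed and fixed:
        return "MIXED"
    if fixed and len(fixed) == len(predicted_fixes):
        return "EFFECTIVE"
    if fixed:
        return "PARTIALLY_EFFECTIVE"
    return "INEFFECTIVE"

# Example: the manifest above predicted task-search-001 would improve and
# flagged task-search-legacy as a risk.
verdict = attribute_change(
    ["task-search-001"], ["task-search-legacy"],
    before={"task-search-001": False, "task-search-legacy": True},
    after={"task-search-001": True, "task-search-legacy": True},
)
print(verdict)  # EFFECTIVE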
This gives MetaHarness a real decision ledger. The point is not just to find the winning candidate. The point is to
understand why a change worked or failed.
The New Superagentic Harness Loop
Together, RLM Code and MetaHarness now support this loop:
agent execution traces
-> RLM Code trace_analysis
-> layered evidence corpus
-> MetaHarness --trace-evidence
-> candidate harness edits
-> change_manifest.json
-> validation and evaluation
-> change_attribution.json
-> candidate ledger
-> keep, discard, or redesign
This is the practical implementation of HALO and AHE concepts in the Superagentic AI stack. HALO-style trace analysis gives us the evidence. AHE-style manifests make the decisions auditable. MetaHarness turns that into an optimization workflow.
Why?
This moves harness optimization away from guesswork. Instead of saying:
The new prompt seems better.
We can say:
This candidate changed the tool description because traces showed malformed tool calls.
It predicted task A would improve and task B might regress.
After evaluation, task A improved and task B stayed stable.
The change is EFFECTIVE.
That is a stronger engineering primitive. It makes harness improvement:
- Observable
- Reproducible
- Auditable
- Benchmarkable
- Falsifiable
This is especially important for coding agents because the harness is now a large part of the product.
The model is only one part of the system. The real agent experience comes from the model plus tools, memory, middleware,
prompts, execution environment, tracing, evaluation, and rollback behavior.
What We Did Not Do
We did not copy the HALO repository. We did not copy the AHE reference implementation. We did not tie our stack to NexAU,
Harbor, E2B, or any one benchmark runtime.
Instead, we implemented the transferable ideas:
- Trace analysis
- Layered evidence
- Evidence injection
- Change manifests
- Task-level attribution
- Candidate ledgers
This keeps the Superagentic AI stack modular.
RLM Code remains the recursive trace-analysis and experimentation layer. MetaHarness remains the optimization and candidate-evaluation layer. The two now work together.
What Comes Next
This is the first practical version of the loop.
Next, we plan to improve the system in several directions:
- Richer task-level grouping in RLM Code
- Automatic MetaHarness ingestion of the full evidence corpus, not only overview.md
- Better component-level rollback suggestions
- Stronger OpenTelemetry compatibility
- Multi-run attribution across candidates
- Benchmark reports showing before and after gains
- Tighter integration between trace spans and candidate manifests
The long-term direction is clear:
self-improving harnesses need observability first
Before an agent can improve its own harness, it needs to see what happened, understand what changed, and verify whether the
change worked. That is what we have started building.
Summary
Superagentic AI has now implemented HALO- and AHE-inspired observability loops across its open harness stack. In RLM Code v0.1.8, we added layered trace evidence export. In MetaHarness v0.2.2, we added trace-grounded candidate proposals, change manifests, task-level attribution, and candidate ledger support.
Together, they create a practical open-source workflow for observable harness optimization:
traces -> evidence -> harness edits -> manifest predictions -> eval -> attribution
This is the beginning of falsifiable harness engineering.
