CodexOpt: Optimize AGENTS.md and SKILL.md for Codex with GEPA-Inspired Feedback

Modern coding agents are getting better fast. But for most teams, one problem remains stubbornly manual: the instructions that shape agent behavior. A repo might have an AGENTS.md. It might have a set of SKILL.md files. Over time, those files grow. They collect repeated rules, contradictory guidance, vague workflows, missing verification steps, and formatting drift. Teams tweak them constantly, but usually without a reliable way to answer basic questions:

  • Did this actually improve agent behavior?
  • Did we reduce prompt bloat or make it worse?
  • Did we accidentally weaken safety, testing, or output quality?
  • Which version of our instructions should we keep?

At first, everything feels manageable. The agent follows instructions reasonably well, the repo stays organized, and small tweaks seem easy enough to make by hand. Then the instruction files start to grow. A rule gets repeated. A workflow gets added in one place and contradicted in another. A skill becomes too generic. The agent starts skipping tests, getting too verbose, or formatting responses inconsistently. You fix it by editing the prompt files again, but now you are guessing at the answers to the same questions above.

That is the problem CodexOpt is built to solve.

CodexOpt is a CLI for benchmarking and optimizing the instruction assets developers already use with Codex style workflows. Instead of treating AGENTS.md and SKILL.md files like static documentation, it treats them like measurable parts of the system.

Why this project exists

Most teams still edit agent instruction files manually. That works for a while, but it breaks down once those files become part of a real engineering workflow. The challenge is not just writing instructions. The challenge is maintaining them over time.

You need a way to answer practical questions:

  • Is this version actually better than the last one?
  • Did we make the instructions clearer or just longer?
  • Did we improve safety, testing guidance, and output quality?
  • Which candidate should we keep?

CodexOpt is designed around those questions. It gives developers a repeatable loop for discovering instruction assets, scoring them, generating improved candidates, reviewing diffs, and applying only the changes that are actually worth keeping.

Why focus only on Codex, AGENTS.md, and SKILL.md?

The scope is intentionally narrow. CodexOpt does not try to optimize every prompt format or every agent framework. It focuses on the files that are closest to how developers actually work in a repository:

  • AGENTS.md as the top level behavioral contract
  • SKILL.md files as reusable task specific workflows

That narrow focus is a strength. These files are version controlled, reviewable, repo local, and already part of the development workflow. They are the right place to start if you want a practical way to improve agent behavior without building an entire prompt platform from scratch.

Codex itself being open source also matters here. It makes repo local instruction assets more important, not less. Teams can shape behavior in a way that stays transparent, inspectable, and close to the codebase.

What inspired CodexOpt

Two ideas influenced the design of CodexOpt. The first is GEPA, which shows how text artifacts can be optimized using reflection and search. The second is the prompt learning idea described by Arize in Prompt Learning: Using English Feedback to Optimize LLM Systems, which argues that natural language feedback can be a stronger optimization signal than a single scalar score. CodexOpt borrows from both ideas, but in a very repo native way.

From GEPA, it takes the idea that prompts and instructions are not fixed text; they are optimizable system components. From prompt learning, it takes the idea that critique matters. Instead of relying only on length checks or numeric scores, CodexOpt tries to capture feedback such as contradiction, redundancy, missing verification guidance, weak trigger clarity, and poor alignment with the actual repo workflow.

That combination is what makes the project interesting. It is not just prompt cleanup. It is an attempt to turn instruction maintenance into an engineering workflow.

How CodexOpt works

CodexOpt gives developers a simple command line flow. You point it at a repository with an AGENTS.md and one or more skills. It can also take optional evidence, such as a tasks.md file or a short list of recurring issues and review themes. Then it runs through a series of steps:

  • scan discovers instruction assets
  • benchmark scores them and generates feedback
  • optimize generates better candidates
  • apply previews or writes changes safely
  • report summarizes the latest runs

This makes the workflow measurable. Instead of endlessly editing a prompt and hoping for the best, you can benchmark the current state, inspect the findings, compare candidates, and keep a record of what changed.
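To make the first step concrete, here is a minimal sketch of what a scan style discovery pass could look like. This is illustrative only: the function name and return shape are assumptions, not CodexOpt's actual implementation.

```python
from pathlib import Path

def scan_instruction_assets(repo_root: str) -> dict:
    """Discover AGENTS.md and SKILL.md files under a repo root.

    A simplified sketch of a discovery step; the real CodexOpt
    scan logic may differ.
    """
    root = Path(repo_root)
    return {
        "agents": sorted(str(p) for p in root.rglob("AGENTS.md")),
        "skills": sorted(str(p) for p in root.rglob("SKILL.md")),
    }
```

The point of a step like this is that everything downstream (benchmarking, optimizing, reporting) operates on a known inventory of files rather than on whatever happens to be open in an editor.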

What it evaluates

CodexOpt is not just checking whether a file exists or whether a skill has frontmatter. It looks for the kinds of problems developers actually run into when instruction files drift.

For AGENTS.md, that includes contradictory guidance, duplicated rules, missing workflow structure, weak verification guidance, missing output expectations, and weak repo grounding.

For SKILL.md, it includes missing frontmatter, vague trigger conditions, weak workflow structure, insufficient verification steps, and prompt bloat.
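Some of these checks are simple enough to sketch directly. The toy checker below flags two of the problems mentioned above for a SKILL.md body; the finding names and the bloat threshold are assumptions for illustration, not CodexOpt's actual rules.

```python
def check_skill(text: str) -> list[str]:
    """Flag a few common SKILL.md problems (illustrative sketch)."""
    findings = []
    # Frontmatter check: a SKILL.md is expected to open with a --- block.
    if not text.lstrip().startswith("---"):
        findings.append("missing frontmatter")
    # Duplicate-line check, excluding the --- frontmatter delimiters,
    # which legitimately appear twice.
    lines = [l.strip() for l in text.splitlines() if l.strip() and l.strip() != "---"]
    if len(lines) != len(set(lines)):
        findings.append("duplicated lines")
    # Arbitrary bloat threshold for the sketch.
    if len(text) > 4000:
        findings.append("prompt bloat")
    return findings
```

Real evaluation has to go further than this (vague triggers and weak verification steps need semantic judgment, not string checks), but the cheap checks catch a surprising amount of drift.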

It can also use optional evidence files to make the evaluation more grounded. A task list can help CodexOpt understand what the repo actually expects from the agent. A short issue or review log can tell it which mistakes keep happening.

That does not mean CodexOpt is executing full agent simulations yet. Today, those evidence files shape scoring and feedback rather than running end to end task execution. But even that shift is valuable because it makes instruction quality more repo specific and less generic.

Heuristic mode and GEPA mode

CodexOpt currently supports two optimization paths.

The default is a fast local heuristic engine. This is the safest option for most teams and the best place to start. It handles things like whitespace cleanup, duplicate line removal, and skill frontmatter repair. It is deterministic, cheap, and easy to understand.
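A heuristic pass of this kind can be sketched in a few lines. This is a minimal illustration of the deterministic-cleanup idea (trailing whitespace, blank-line runs, immediate duplicate lines), not CodexOpt's actual engine, and it deliberately skips the harder piece, frontmatter repair.

```python
def heuristic_clean(text: str) -> str:
    """Deterministic cleanup: strip trailing whitespace, collapse runs
    of blank lines, and drop a line that repeats the previous non-blank
    line. A sketch of the kind of transform a heuristic engine applies."""
    out = []
    prev = None       # last non-blank line kept
    blank_run = 0
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            blank_run += 1
            if blank_run > 1:
                continue  # collapse consecutive blank lines
        else:
            blank_run = 0
            if line == prev:
                continue  # drop immediate duplicate
            prev = line
        out.append(line)
    return "\n".join(out) + "\n"
```

Because the transform is pure and deterministic, running it twice gives the same result, which is exactly the property that makes this mode cheap to trust and easy to review in a diff.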

The second path is optional GEPA backed optimization. This is the more advanced mode. When configured correctly, it can use reflection and search to explore stronger candidates. That makes it promising for deeper instruction optimization over time.

One thing that matters for trust is that CodexOpt reports when a GEPA requested run falls back to heuristic behavior. If a team asks for GEPA, they should know whether they actually got a GEPA backed result or a safe fallback. That visibility is important in production workflows.
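That visibility can be modeled very simply: a run records both what was requested and what actually executed. The field and function names below are hypothetical, sketching the idea rather than CodexOpt's report schema.

```python
from dataclasses import dataclass

@dataclass
class OptimizeRun:
    """Record of one optimization run (illustrative schema)."""
    requested_engine: str   # e.g. "gepa" or "heuristic"
    used_engine: str        # what actually ran

    @property
    def fell_back(self) -> bool:
        return self.requested_engine != self.used_engine

def summarize(run: OptimizeRun) -> str:
    """One-line status suitable for a report."""
    if run.fell_back:
        return f"requested {run.requested_engine}, fell back to {run.used_engine}"
    return f"ran {run.used_engine} as requested"
```

Surfacing the fallback explicitly, instead of silently degrading, is what lets a team treat a GEPA result and a heuristic result differently when deciding which candidate to keep.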

Who this is for

CodexOpt is built for developers who are already maintaining repo local instruction assets and want a better way to improve them.

  • individual developers experimenting with Codex in their own repos
  • teams maintaining shared AGENTS.md guidance
  • platform teams standardizing internal skills
  • open source maintainers who want more predictable coding agent behavior

If you already find yourself editing prompts because the same mistakes keep happening, CodexOpt is aimed at you.

The demo repo

To make the project easy to understand, there is a companion demo repo, codexopt-demo. It includes:

  • a noisy and contradictory AGENTS.md
  • skills with missing frontmatter, duplicated lines, and unnecessary verbosity
  • a tasks.md file that describes repo tasks
  • an issues.md file that simulates recurring feedback themes

It also includes a tiny Python package with bugs aligned to those tasks, so the repo feels like something a developer might actually work on. This matters because it makes the value of CodexOpt concrete. You are not looking at abstract prompts in a vacuum. You are looking at a repo, its instruction files, its recurring problems, and a tool that tries to improve them in a measured way.

What is different about CodexOpt

There are plenty of prompt tools and many agent frameworks. CodexOpt is different because it stays local to the repository and close to how developers already work.

  • It is not trying to become a hosted prompt management platform
  • It is not trying to become a full agent execution framework
  • It is not trying to replace all prompt engineering workflows

It is trying to do one thing well: help teams improve the instruction assets that shape coding agent behavior in source control.

That focus makes it easier to adopt, easier to reason about, and easier to integrate into real development workflows.

Why this matters for open source

As coding agents become part of day to day development, instruction files become part of the real software surface area. Open source teams need better tooling around them.

CodexOpt is useful precisely because it treats those files seriously. It makes them inspectable, benchmarkable, reviewable, and safe to apply. That is a much better foundation than endless manual prompt edits with no measurement.

For open source maintainers, this is a practical way to keep instruction quality from becoming invisible technical debt.

Where the project goes next

The current release is a solid foundation, but there is a clear path forward.

Over time, CodexOpt can grow into richer scenario based evaluation, deeper repo specific scoring, stronger evidence handling, and more capable GEPA backed optimization. But the important part is that the core workflow already exists: benchmark, optimize, review, apply.

That alone is a meaningful step forward for teams maintaining AGENTS.md and SKILL.md by hand.

Try it

If you are already maintaining instruction files for Codex in a real repository, CodexOpt gives you a better way to do it: not prompt guesswork, but an engineering workflow.