CodexOpt Brings Microsoft SkillOpt to Codex: Optimizing Agent Skills with Execution Feedback

Microsoft Research has released the SkillOpt paper, and the work has generated considerable discussion across the AI community. Researchers and engineers highlight its disciplined approach to improving agent capabilities without modifying model weights.

Codex users already depend on AGENTS.md and SKILL.md files as active components of runtime behavior in OpenAI’s Codex harness. SkillOpt offers a structured method to optimize these artifacts through execution feedback, bounded edits, and validation gating. This approach transforms intuitive prompt adjustments into measurable, reproducible gains.

With CodexOpt 0.2.0, we have integrated these concepts into a practical CLI workflow designed specifically for Codex users.

Community Reactions on X

Discussions on social media reflect strong interest in SkillOpt. Many posts emphasize its shift from hand-crafted prompts to systematic, evidence-based skill evolution. Key points frequently mentioned include:

SkillOpt treats skill documents as trainable external state for frozen models, applying optimizer-like discipline (bounded add/delete/replace edits, textual learning rates, and strict validation gates).
Strong empirical results: best or tied for best across all 52 evaluated settings (models × benchmarks × harnesses). Notable gains on GPT-5.5 include +23.5 points in direct chat, +24.8 points in the Codex harness, and +19.1 points in Claude Code.
Practical advantages: zero additional inference cost at runtime, strong transferability across models and harnesses, and skills that remain compact and human-readable.
Comparisons frequently note that SkillOpt outperforms baselines such as human-written skills, TextGrad, GEPA, and EvoSkill.

Some voices raise longer-term questions about reliance on static optimized skills as dynamic reasoning improves, but the prevailing sentiment views this as a valuable step toward more reliable agent engineering today.

What SkillOpt Delivers

SkillOpt frames natural-language skill documents as optimizable external state. An optimizer analyzes rollout trajectories, proposes controlled edits, and accepts changes only when they improve performance on held-out validation tasks.
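The acceptance rule described above can be illustrated with a small sketch. This is not SkillOpt's or CodexOpt's actual implementation: the function names, the `min_gain` parameter, and the toy scorer in the usage note are all hypothetical, and a real scorer would run the agent on each task rather than check for a keyword.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical scorer signature: (skill_text, task) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def mean_score(skill: str, tasks: Sequence[str], score: Scorer) -> float:
    """Average score of one skill document over a set of tasks."""
    return sum(score(skill, task) for task in tasks) / len(tasks)


def gated_update(
    current: str,
    candidates: Sequence[str],
    val_tasks: Sequence[str],
    score: Scorer,
    min_gain: float = 0.0,
) -> Tuple[str, float]:
    """Keep the current skill unless a candidate beats it on held-out tasks.

    A candidate is accepted only if its validation score exceeds the best
    score seen so far by more than `min_gain` (an assumed threshold).
    """
    best, best_score = current, mean_score(current, val_tasks, score)
    for candidate in candidates:
        candidate_score = mean_score(candidate, val_tasks, score)
        if candidate_score > best_score + min_gain:
            best, best_score = candidate, candidate_score
    return best, best_score
```

With a toy scorer such as `lambda skill, task: 1.0 if task in skill else 0.0`, a candidate that covers more validation tasks replaces the current skill, while one that covers fewer is discarded.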

Key strengths:

Rigorous validation that prevents unproven or bloating changes.
Exported skills are plain text files with no runtime overhead.
Demonstrated improvements on benchmarks such as Spreadsheet solving (41.8 percent to 80.7 percent) and Office QA.

Why Codex Is an Ideal Target

OpenAI’s Codex harness incorporates instruction files directly into the agent loop. This produces observable trajectories that provide rich feedback for optimization.

CodexOpt treats Codex runs as rollouts:

Deploy a candidate skill.
Execute tasks through codex exec.
Capture JSON event streams and outcomes.
Score behavior using verifiers, LLM judges, or static analysis.
Generate bounded rewrites.
Validate on held-out tasks before acceptance.
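The capture and scoring steps above might look roughly like the following sketch. It assumes only that a rollout yields one JSON object per line. The `type` field, the `"error"` value, and the penalty weight are hypothetical placeholders, not the real Codex event schema or CodexOpt's scoring rule.

```python
import json
from typing import Any, Dict, List


def parse_events(jsonl_text: str) -> List[Dict[str, Any]]:
    """Parse a JSONL event stream, skipping blank or non-JSON lines."""
    events = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray log output mixed into the stream
        if isinstance(event, dict):
            events.append(event)
    return events


def score_rollout(events: List[Dict[str, Any]], verifier_passed: bool,
                  error_penalty: float = 0.25) -> float:
    """Combine a verifier outcome with a trajectory signal.

    Hypothetical rule: start from 1.0 if the verifier passed (else 0.0),
    then subtract `error_penalty` for each event whose assumed `type`
    field equals "error". The result is floored at 0.0.
    """
    errors = sum(1 for event in events if event.get("type") == "error")
    base = 1.0 if verifier_passed else 0.0
    return max(0.0, base - error_penalty * errors)
```

Keeping parsing separate from scoring means other signals, such as an LLM judge or static analysis, could be added as further functions over the same event list.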

What’s New in CodexOpt 0.2.0

uv run codexopt improve # Safe preview mode
uv run codexopt improve --live # Full optimization with Codex
uv run codexopt improve --live --apply # Apply validated changes

Core capabilities:

Automatic train/validation task splits mined from git history, issues, and skill descriptions.
Bounded edits with textual controls.
Validation-gated acceptance.
Multiple reward signals including verifier outcomes and LLM judge feedback.
Full Codex JSONL trajectory support.
Detailed reports showing accepted diffs and performance changes.
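To make the train/validation split in the list above concrete, here is a minimal sketch of a deterministic split. How CodexOpt actually mines and partitions tasks is not described here, so the hashing scheme, the function name, and the `val_fraction` parameter are assumptions for illustration only.

```python
import hashlib
from typing import List, Sequence, Tuple


def split_tasks(task_ids: Sequence[str],
                val_fraction: float = 0.3) -> Tuple[List[str], List[str]]:
    """Assign each task to train or validation by hashing its identifier.

    Hashing makes the assignment stable across runs and independent of
    task order, so a task never drifts from validation into training.
    """
    train, val = [], []
    for task_id in task_ids:
        digest = hashlib.sha256(task_id.encode("utf-8")).digest()
        bucket = digest[0] / 256  # first byte mapped into [0, 1)
        if bucket < val_fraction:
            val.append(task_id)
        else:
            train.append(task_id)
    return train, val
```

A stable split matters for validation gating: if held-out tasks leaked into the training set between runs, an accepted edit could reflect memorized feedback and not a general improvement.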

SkillOpt Mapping to CodexOpt

Skill artifact: SKILL.md or AGENTS.md
Rollout: codex exec or command verifier
Feedback: Trajectory analysis + multi-signal scoring
Bounded edit: Edit budget + controlled modifications
Validation gate: Held-out task performance
Exported skill: Validated file diff with backups
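The "edit budget" row in the mapping above can be pictured as a simple line-level check. The sketch below uses Python's standard `difflib` module. The budget unit (changed lines) and the default of 5 are assumptions, and CodexOpt may measure edits differently.

```python
import difflib


def count_changed_lines(original: str, candidate: str) -> int:
    """Count lines added or removed between two skill documents."""
    diff = difflib.ndiff(original.splitlines(), candidate.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def within_budget(original: str, candidate: str,
                  max_changed_lines: int = 5) -> bool:
    """Reject rewrites that touch more lines than the assumed budget."""
    return count_changed_lines(original, candidate) <= max_changed_lines
```

For example, replacing one line and appending another counts as three changed lines (one removal, two additions), so that rewrite passes a budget of 3 but fails a budget of 2.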

GEPA Influence and CodexOpt Approach

The SkillOpt paper evaluates against GEPA and similar methods. CodexOpt incorporates useful elements of textual reflection while delivering a streamlined, Codex-native implementation. The previous GEPA engine path has been deprecated in favor of the maintained reflective engine.

Practical Workflow

Preview changes safely with uv run codexopt improve.
Run live optimization with uv run codexopt improve --live.
Review results using uv run codexopt report.
Apply validated edits with uv run codexopt improve --live --apply.

Richer task evidence strengthens the optimization signal. Tasks can be supplied in tasks.md or JSON format and can range from simple descriptions to full Codex rollout tests.

What the Community Is Saying

Community conversations highlight a shared challenge: manual skill editing often results in inconsistency or prompt bloat. SkillOpt and tools like CodexOpt establish a higher standard where skills must demonstrate value through measurable task improvements.

Optimized skills become reliable, transferable artifacts that reduce agent errors and improve workflow consistency.

Install and Get Started

pip install codexopt==0.2.0

PyPI release: https://pypi.org/project/codexopt/0.2.0/
GitHub: https://github.com/SuperagenticAI/CodexOpt
SkillOpt Paper: https://arxiv.org/abs/2605.23904
SkillOpt Project Page: https://microsoft.github.io/SkillOpt/

Closing

Agent skills have evolved from static notes into operational, optimizable components. CodexOpt 0.2.0 makes SkillOpt-style optimization practical for Codex users by combining rigorous validation with direct harness integration.

Evidence-driven improvement provides a clearer path forward than intuition alone. Start optimizing your skills today.

Superagentic AI Blog

Full Stack Agentic AI, Agent Optimization, Agent Engineering and Agent Experience.