Karpathy skills on OpenClaw: agents don't write better code. But they do it more efficiently.
Apr 30, 2026

We added Karpathy-inspired coding rules to the root AGENTS.md and ran three coding agents through 40 OpenClaw PRs. The judge scored code quality as basically unchanged, but the agents got to the same answers with less work: fewer tool calls and file reads, and consistently lower time and cost.

We tested Auggie, Claude Code, and Codex against 40 PRs from OpenClaw. For each PR, we ran each agent twice: once with the existing AGENTS.md, and once with AGENTS.md plus about 2.5K characters of Karpathy-style coding rules prepended to the file.

The Karpathy rules are the kind of thing you'd write at the top of a team style guide. In short: solve the ticket, stay local, don't wander.

The setup:

  • 40 hand-selected PRs from OpenClaw, mid-complexity (100–300 LOC excluding tests)
  • Three runners: Auggie on Opus 4.7, Claude Code on Opus 4.7, Codex on GPT-5.4
  • Two variants per PR: baseline AGENTS.md (~18K chars) vs. AGENTS-karpathy.md (~20.5K chars)
  • 6 runs per config, total 18 repeats per individual PR
  • Scored by an LLM judge on completeness, correctness, best practices, code reuse, and unsolicited documentation
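
The Karpathy variant in this setup is the baseline file with the extra rules placed at the top. A minimal sketch of building such a variant, assuming the rules are kept in a separate text file; the function names and the `karpathy-rules.md` path are hypothetical and not taken from the experiment:

```python
from pathlib import Path


def build_variant(baseline_text: str, rules_text: str) -> str:
    """Return the baseline AGENTS.md text with the extra rules prepended."""
    # Join the two parts with exactly one blank line so the first
    # heading of the baseline file is not glued to the last rule.
    return rules_text.rstrip("\n") + "\n\n" + baseline_text.lstrip("\n")


def write_variant(repo_root: Path) -> Path:
    """Write AGENTS-karpathy.md next to AGENTS.md (hypothetical file layout)."""
    baseline = (repo_root / "AGENTS.md").read_text(encoding="utf-8")
    rules = (repo_root / "karpathy-rules.md").read_text(encoding="utf-8")
    target = repo_root / "AGENTS-karpathy.md"
    target.write_text(build_variant(baseline, rules), encoding="utf-8")
    return target
```

Keeping the rules in their own file makes it easy to switch between the two variants without editing AGENTS.md by hand.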

What we did not test:

  • We didn’t test Karpathy-style guidelines on easy or hard PRs, where the effect might be different. The result also won't necessarily carry over to every AGENTS.md.
  • We did run a test on another repo whose instructions heavily overlap with the Karpathy skill and, unsurprisingly, found no statistically significant difference.
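
The post doesn't say which significance test was used. One option that suits a handful of runs per variant is an exact permutation test on the difference of means. The sketch below is an assumption about how such a check could look, not the method used here, and the function name is hypothetical:

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(baseline: list[float], treated: list[float]) -> float:
    """Two-sided exact permutation test on the difference of group means."""
    n, m = len(baseline), len(treated)
    if n == 0 or m == 0:
        raise ValueError("both groups need at least one run")
    pooled = list(baseline) + list(treated)
    total = sum(pooled)
    observed = abs(mean(treated) - mean(baseline))
    hits = 0
    count = 0
    # Enumerate every way to relabel n of the pooled runs as "baseline".
    for idx in combinations(range(n + m), n):
        sum_a = sum(pooled[i] for i in idx)
        diff = (total - sum_a) / m - sum_a / n
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(diff) >= observed - 1e-12:
            hits += 1
        count += 1
    return hits / count
```

Called with the per-run metric values of the two variants, it returns the share of relabelings whose difference is at least as large as the observed one. With 6 runs per variant that is 924 relabelings, so full enumeration is cheap.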

Cost, time, and tokens dropped

Every runner used fewer tool calls and finished faster. Each dot represents a separate, full 40-PR run. Relative change is calculated as a percentage of the average across the baseline runs (without Karpathy).

Figure: Baseline vs Karpathy, Per-Run Results

Metric       Auggie   Claude Code   Codex
Duration     -3%      -7%           -7%
Tool Calls   -3%      -8%           -6%
Cost         -3%      -10%          -8%
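
The percentages are relative to the mean of the baseline runs. A small sketch of that calculation, assuming one number per run for the metric in question; the function name is hypothetical:

```python
from statistics import mean


def relative_change_pct(baseline_runs: list[float], variant_run: float) -> float:
    """Percentage change of one variant run relative to the mean of the baseline runs."""
    baseline_mean = mean(baseline_runs)
    if baseline_mean == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (variant_run - baseline_mean) / baseline_mean * 100.0
```

For example, if the baseline runs average 100 units of some metric and a Karpathy run comes in at 93, the relative change is about -7%. These numbers are illustrative only, not measurements from the experiment.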

The reduction came mostly from fewer searches and fewer file reads: the agents found what they needed in fewer lookups, though not always without an impact on performance. Tool call failure rates stayed flat at 5–8%, and the direction of change was the same for every runner.

Output tokens fell by similar margins across runners. Per PR, Karpathy was faster and cheaper on about 30 of 40 PRs. The pattern held across all three agents.

A 3–10% efficiency gain from a small prompt change isn't a model breakthrough. If you're running a coding agent at scale, it's still real money, real latency, and real capacity.

Less sidequesting with the same quality (mostly)

Metric          Auggie   Claude Code   Codex
Quality Score   +0.00    -0.07         +0.00

By the judge's score, the code didn't get better.

Open circles are baseline; solid green markers are Karpathy. Each score is in the [-1.0, 1.0] range, and the delta is shown on the chart.

Figure: Results on per-run quality score

Auggie and Codex scored the same as their baselines, but Claude Code dropped by 0.07 (similar to the drop from Opus to Sonnet). It degraded on multiple sub-metrics: correctness fell by 0.07 and completeness by 0.06. A closer look at the data reveals why: with the Karpathy guidelines, Claude Code follows more conservative trajectories, touching ~5% fewer files per task.

Karpathy-style guidelines don’t transfer uniformly across agent harnesses and repositories. While their objectives are meaningful for software engineering tasks, each system has a different baseline system prompt and orchestration. In Codex, the guidelines likely add useful structure, which improves efficiency. In Augment, the baseline prompt already encodes similar constraints, so the marginal impact is smaller. In Claude Code, the system prompt may already be highly constrained, so layering additional constraints on top could reduce exploration and degrade performance.

Metric             Auggie   Claude Code   Codex
Overall score      +0.00    -0.07         +0.00
Completeness       -0.02    -0.06         +0.01
Correctness        -0.01    -0.07         -0.02
Best Practices     -0.01    -0.03         -0.00
Code Reuse         -0.00    -0.05         +0.02
Unsolicited Docs   +0.06    +0.05         +0.06

One sub-metric moved in the same direction for every runner: Unsolicited Docs scored better.

That tracks. The Karpathy rules told the agents not to add features beyond the ask and not to "improve" nearby code, comments, or formatting. The agents listened and did less unrequested work.

Per PR

Averaged across all runners per PR:

  • Cost, Duration: Karpathy is faster and cheaper on 30 of 40 PRs, consistently across all runners
  • Quality: Karpathy is stable, with wins split 20/20 against baseline, apart from Claude Code
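
The per-PR tally above can be sketched as follows: average each variant's cost across runners and runs for a PR, then count the PRs where the Karpathy average is lower. The record layout (`pr`, `variant`, `cost` keys) and the function name are assumptions for illustration, not the experiment's actual data format:

```python
from collections import defaultdict
from statistics import mean


def count_karpathy_wins(rows: list[dict]) -> tuple[int, int]:
    """Return (PRs where Karpathy is cheaper on average, PRs compared).

    Each row is a hypothetical record such as
    {"pr": "a", "runner": "codex", "variant": "karpathy", "cost": 1.2}.
    """
    by_pr = defaultdict(lambda: {"baseline": [], "karpathy": []})
    for row in rows:
        by_pr[row["pr"]][row["variant"]].append(row["cost"])

    wins = 0
    compared = 0
    for costs in by_pr.values():
        # Skip PRs that are missing one of the two variants.
        if not costs["baseline"] or not costs["karpathy"]:
            continue
        compared += 1
        if mean(costs["karpathy"]) < mean(costs["baseline"]):
            wins += 1
    return wins, compared
```

The same function works for duration if that value is stored under the `cost` key, since only the comparison of per-PR means matters.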

The code itself didn't reliably get better, but the path got shorter.

A good AGENTS.md is a cheap control layer and, in a sense, an extension of your system prompt: it provides behavioral and domain constraints tailored to your repository. This experiment shows that such guidelines consistently result in faster, cheaper runs with fewer tool calls. Quality, however, stays the same or drops, depending on the harness.

If you're running coding agents at scale, adding Karpathy-style guidelines could give you a free lunch, but it should be calibrated to your existing setup.