This is the story of Paper #4 in my AutoResearch project: Self-Play in the Age of Foundation Models — A Comprehensive Survey from Game-Theoretic Foundations to Open-Ended Learning. Unlike the first three surveys, this one was not just written by the agent framework — it was argued with, stress-tested, and experimentally validated by it over 16 review rounds, ending at a median peer-review score of 8.6/10 (strong accept).
Disclaimer: This is a personal hobby project. All opinions expressed are my own and do not represent any organization.
The Score Trajectory: Not a Straight Line
What makes this paper interesting is that its score did not climb monotonically. The agent ran a five-persona peer review after each revision, and the median moved up and down honestly:
| Version | Score | What happened |
|---|---|---|
| V4–V10 | 7.0 → 8.4 | Drafting: taxonomy, sections, and three original theorems. Citations grew from 0 to 207. |
| V11 | 8.5 | The 285B GRPO verifier-noise experiment landed in §8 — the paper turned from a pure survey into one with an original large-scale experiment. |
| V12 | 8.2 ↓ | The only decline. An external citation check found three defective references, and the score was lowered to reflect the paper's true state rather than held at 8.5. |
| V13 | 8.4 | A 2000-step long-horizon run (8.3× the original length) tested the KL-buffering hypothesis; data provenance was corrected. |
| V14 | 8.4 | Seed replication + KL ablation turned "buffering" from a hypothesis into evidence. Bottleneck shifted to theoretical rigor. |
| V15 | 8.5 | Held-out KL-endpoint study; parallel work started on four theoretical sub-problems. |
| V16 | 8.6 ✓ | Theory hardening landed — series best. |
The 285B RL Experiment: The Core Differentiator
The key decision was to validate the survey's central claim on a real 285B-parameter model (DeepSeek-V4) rather than a small-model demonstration. The setup: GRPO training, batch 512, N=16, 32K context, on 18,953 math-reasoning problems. The verifier flips its reward with probability ε — and the claim is that ε sets the ceiling on self-play improvement.
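To make the noise model concrete, here is a minimal sketch of a verifier that flips a binary reward with probability ε, followed by the within-group reward standardization commonly used in GRPO. It is an illustration under stated assumptions, not the project's training code: the function names, the toy group of 16 samples, and the binary 0/1 reward are all hypothetical. One fact worth keeping in mind is that under a symmetric flip, the expected reward gap between a correct and an incorrect answer shrinks from 1 to 1 − 2ε.

```python
import random
from statistics import mean, pstdev


def noisy_verifier(is_correct: bool, epsilon: float, rng: random.Random) -> float:
    """Return a 0/1 reward, flipped with probability epsilon (assumed noise model)."""
    reward = 1.0 if is_correct else 0.0
    if rng.random() < epsilon:
        reward = 1.0 - reward
    return reward


def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group (common GRPO formulation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # Every sample got the same reward, so the group carries no signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def expected_reward_gap(epsilon: float) -> float:
    """E[reward | correct] - E[reward | incorrect] = (1 - eps) - eps."""
    return 1.0 - 2.0 * epsilon


rng = random.Random(0)
# Hypothetical group of N=16 samples for one prompt: 4 correct, 12 incorrect.
correct = [True] * 4 + [False] * 12
rewards = [noisy_verifier(c, 0.30, rng) for c in correct]
advantages = group_relative_advantages(rewards)

for eps in (0.0, 0.10, 0.30, 0.45):
    print(f"eps={eps:.2f}  expected reward gap={expected_reward_gap(eps):.2f}")
```

At ε = 0.45 the expected gap is only 0.10, which gives some intuition for why the highest-noise setting leaves so little usable signal.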
Two findings carried the paper:
- Improvement falls monotonically with verifier noise. Across ε ∈ {0, 0.10, 0.30, 0.45}, the relative change in training-distribution accuracy was +4.8 / +0.1 / −4.1 / −6.6 percent: strictly monotone, and matching the ordering predicted by the paper's own theorem.
- A KL anchor buffers against noise. Holding ε=0.30 and sweeping the KL coefficient gave −9.9 / −4.1 / +0.8 percent for KL = 0 / 0.001 / 0.01. The held-out KL-endpoint study then showed the buffer is a genuine dial with a sweet spot, not a one-way knob: there is a real tradeoff between training-distribution gains and held-out performance.
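The two monotonicity claims above can be checked mechanically from the reported numbers. The sketch below only re-uses the figures quoted in this post; the helper names are hypothetical, and it says nothing about statistical significance or run-to-run variance.

```python
def strictly_decreasing(values):
    """True if every value is strictly smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


def strictly_increasing(values):
    """True if every value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# (epsilon, relative change in training-distribution accuracy, in percent)
eps_sweep = [(0.0, 4.8), (0.10, 0.1), (0.30, -4.1), (0.45, -6.6)]
# (KL coefficient, relative change in percent), all at epsilon = 0.30
kl_sweep = [(0.0, -9.9), (0.001, -4.1), (0.01, 0.8)]

noise_monotone = strictly_decreasing([delta for _, delta in eps_sweep])
kl_monotone = strictly_increasing([delta for _, delta in kl_sweep])

print("improvement strictly falls with noise:", noise_monotone)
print("improvement strictly rises with KL coefficient:", kl_monotone)
```

The −4.1 entry appears in both sweeps, which would be consistent with the ε sweep having been run at KL = 0.001, although the post does not state this.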
Twelve GRPO runs in total, about 3,570 GPU-hours of compute. Five separate submission failures along the way were all diagnosed and fixed autonomously — including a recurring type error where a learning rate in scientific notation was parsed as a string. After the fourth recurrence, the framework wrote a pre-submission check script and baked the lesson into its own operating constraints.
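A pre-submission check for the learning-rate bug might look roughly like the sketch below. This is not the framework's actual script: the key names `learning_rate` and `kl_coef` and the function name are hypothetical. The post does not say how the string got there; one common cause is a YAML 1.1 loader such as PyYAML reading `1e-6` (no decimal point) as a string rather than a float.

```python
import math
from numbers import Real

# Hypothetical names of config fields that must be numeric before submission.
NUMERIC_KEYS = ("learning_rate", "kl_coef")


def check_numeric_fields(config: dict, keys=NUMERIC_KEYS) -> dict:
    """Return a copy of config with the given fields coerced to finite floats.

    Raises ValueError or TypeError instead of letting a bad value reach the job.
    """
    fixed = dict(config)
    for key in keys:
        value = fixed[key]
        if isinstance(value, str):
            # e.g. "1e-6" that a config loader left as text
            value = float(value)  # ValueError if the text is not numeric
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(f"{key} must be numeric, got {type(value).__name__}")
        value = float(value)
        if not math.isfinite(value) or value < 0.0:
            raise ValueError(f"{key} must be finite and non-negative, got {value}")
        fixed[key] = value
    return fixed


raw = {"learning_rate": "1e-6", "kl_coef": 0.001, "batch_size": 512}
clean = check_numeric_fields(raw)
print(clean["learning_rate"], type(clean["learning_rate"]).__name__)
```

Coercing and validating in one place means the same check can run in a submit wrapper or a CI step, so the failure surfaces before any GPU time is allocated.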
Five Personas, One Honest Median
Every review round is scored in parallel by five independent reviewer personas, each with a different bias:
| Persona | Role | What they push on |
|---|---|---|
| R1 | Experimentalist | Do the numbers in the paper match the raw experiment logs, line by line? |
| R2 | Theorist | Are the proofs rigorous? This was the binding constraint for most of the paper's life. |
| R3 | Perfectionist | Table consistency, abstract accuracy, citation hygiene. |
| R4 | Synthesizer | Does each addition answer a question the paper itself raised? |
| R5 | Newcomer | Can a non-expert navigate it? (A reading guide and abbreviations table gave the largest single jump.) |
The median — not the mean — is the reported score, so a single enthusiastic reviewer can't inflate it. For most of the paper's life, the theorist (R2) sat at 8.0 and held the median down. That is exactly the signal that mattered.
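A small illustration of why the median is the safer aggregate. The per-persona scores below are hypothetical, chosen only to show the effect; they are not the actual review scores.

```python
from statistics import mean, median

# Hypothetical scores for one review round.
scores = {"R1": 8.6, "R2": 8.0, "R3": 8.5, "R4": 8.7, "R5": 8.4}

# The same round, but with one reviewer giving a perfect score.
inflated = dict(scores, R4=10.0)

median_before = median(scores.values())
median_after = median(inflated.values())
mean_before = mean(scores.values())
mean_after = mean(inflated.values())

print(f"median: {median_before} -> {median_after}")
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
```

With five reviewers the median is the third-ranked score, so one outlier in either direction cannot move it past its neighbors, while the mean shifts with every outlier.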
V16: Theory Hardening Was the Last Mile
By V14 the experiments were solid, but R2 kept the median pinned: the noise-floor argument and the KL-ablation sign weren't reconciled with the theory. The fix was not more experiments — it was new mathematics. Three sub-problems were solved and integrated:
- Noise-floor recurrence. An exact closed-form re-derivation showing the floor is Θ(ηT) under harmonic coverage (which also caught a dropped-factor error in the paper's own draft), with a genuine constant floor η/ρ under an explicit mixing assumption.
- Coupled-floor lemma. A rigorous Gibbs → Jensen → Pinsker chain bounding the KL-anchor displacement.
- Matching lower bound. A persistence dichotomy proving the floor's order is optimal.
That carried the median from 8.5 to 8.6. Notably, this round was done overnight by a serial single-agent loop — two earlier attempts at parallel multi-agent workflows stalled and burned most of the round's tokens before the actual proofs came from the simpler serial approach. That lesson is now part of the framework's memory.
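For intuition about the two regimes in the noise-floor result, here is a toy linear recurrence that reproduces the same orders of growth. It is an assumption-laden illustration, not the paper's recurrence: the update e ← (1 − ρ_t)·e + η, the reading of "harmonic coverage" as ρ_t = 1/(t+1), and the reading of the mixing assumption as a constant ρ are all hypothetical choices made here. Under those choices the harmonic case has the closed form e_T = η(T+1)/2, which is Θ(ηT), and the constant-ρ case converges to η/ρ.

```python
def run_recurrence(eta: float, steps: int, rho_at) -> float:
    """Iterate e <- (1 - rho_at(t)) * e + eta from e = 0 (toy model, assumed form)."""
    e = 0.0
    for t in range(steps):
        e = (1.0 - rho_at(t)) * e + eta
    return e


eta = 0.01   # per-step noise injection (hypothetical value)
T = 1000     # number of steps (hypothetical value)
rho = 0.1    # constant contraction rate standing in for the mixing assumption

# Decaying contraction rate 1/(t+1): the error keeps growing linearly in T.
harmonic = run_recurrence(eta, T, lambda t: 1.0 / (t + 1))

# Constant contraction rate: the error settles at the fixed point eta / rho.
mixing = run_recurrence(eta, T, lambda t: rho)

print(f"harmonic: {harmonic:.4f}  closed form eta*(T+1)/2 = {eta * (T + 1) / 2:.4f}")
print(f"mixing:   {mixing:.4f}  fixed point eta/rho = {eta / rho:.4f}")
```

The contrast is the point: when the contraction rate decays like 1/t, injected noise accumulates without bound, and only a rate bounded away from zero gives a genuine constant floor.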
Read the full paper, the production statistics, and the four-paper comparison.
View AutoResearch Papers →
What I Take Away
The first three surveys showed an agent framework can write a credible paper. This one showed something harder: it can run a real experiment to test the paper's own claim, score itself honestly (including downward), and close a theory gap with new math — over six days, largely unattended. The remaining gap to a 9.0 is a single open problem (formalizing the held-out ↔ exploitability map), left as post-publication work.
Stay tuned.