Happy to share the Deli AutoResearch V2 update — three survey papers produced with LLM agent assistance, covering the full pipeline from literature search to PDF generation. It's still very much an exploratory experiment with plenty of room for improvement, but hopefully offers some reference points for AI-assisted academic writing.
Disclaimer: This is a personal hobby project. All opinions expressed are my own and do not represent any organization.
What's New in This Release
This update brings three major changes:
- New Paper #3: Long-Horizon Sequential Decision-Making. A comprehensive 57-page survey covering 384 papers on hierarchical planning, reactive agents, MCTS/PRM search methods, and RL for agents. Features a horizon scaling experiment across 5 frontier models × 5 horizon lengths, revealing exponential performance decay (R² > 0.93); CoT and hierarchical planning reduce the decay constant by 40-60%.
- Papers #1 & #2 republished with extensive new literature. Both papers received a major citation upgrade as part of a refresh spanning all three papers: 141 new references added in total (34 and 41 of them in Papers #1 and #2), 73 papers deeply woven into the narrative, and 36 venue upgrades in total (arXiv → accepted conference/journal via DBLP verification; 16 and 14 of them in Papers #1 and #2). This round of work consumed ~850K tokens across 10+ subagents in ~200 minutes.
- Paper-writing skill open-sourced. The paper_writing skill, covering LaTeX generation, section structure, and figure/table standards, is now open source and available for reuse. Other skills in the pipeline (search_agent, call_api, peer_review_simulation, etc.) remain internal.
The Pipeline at a Glance
[Figure: pipeline overview. Surviving labels: 387 results, 131 must-cite, 16 deep-discuss, upgrades, 190 pages, pilot studies, 6.0→8.5.]
Paper-by-Paper Comparison
| Metric | Paper #1 Auto-Research Agents | Paper #2 Continual Learning | Paper #3 Long-Horizon |
|---|---|---|---|
| Citations | 228 | 329 | 384 |
| Pages | 63 | 70 | 57 |
| New refs (Jun) | 34 | 41 | 66 |
| Venue upgrades | 16 | 14 | 6 |
| Review score | 8.5/10 | 8.5/10 | 8.5/10 |
| Version | V5 | V5 | V4 |
The New Paper: Long-Horizon Decision-Making
Paper #3 tackles one of the hardest open problems in AI agents: how to maintain coherence over long action sequences. The survey covers hierarchical planning (HRL, options framework), reactive architectures, search-based methods (MCTS, PRM), and RL for agent training.
The highlight is a horizon scaling experiment: 5 frontier models tested across 5 horizon lengths (10, 50, 100, 200, 500 steps) × 3 conditions (vanilla, CoT, hierarchical) × 3 task types. Key finding: performance degrades exponentially with horizon length, but Chain-of-Thought and hierarchical decomposition reduce the decay constant by 40-60%.
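To make "decay constant" concrete: one way to read the finding is as a fit of success rate against horizon length, s(h) ≈ s0 · exp(−λh), where λ is the decay constant. Below is a minimal sketch under that assumed functional form. The function name `fit_exponential_decay` is hypothetical, and the success rates are synthetic values generated from assumed constants, not the paper's data.

```python
import math

def fit_exponential_decay(horizons, success_rates):
    """Least-squares fit of log(s) = log(s0) - lam * h.

    Returns (s0, lam, r_squared); r_squared is computed in log space.
    """
    xs = list(horizons)
    ys = [math.log(s) for s in success_rates]  # requires every rate > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(intercept), -slope, r_squared

# Horizon lengths from the experiment; the success rates are SYNTHETIC,
# generated from assumed decay constants purely for illustration.
HORIZONS = [10, 50, 100, 200, 500]
vanilla = [0.9 * math.exp(-0.004 * h) for h in HORIZONS]
with_cot = [0.9 * math.exp(-0.002 * h) for h in HORIZONS]

_, lam_vanilla, r2_vanilla = fit_exponential_decay(HORIZONS, vanilla)
_, lam_cot, _ = fit_exponential_decay(HORIZONS, with_cot)
reduction = 1.0 - lam_cot / lam_vanilla  # fractional drop in the decay constant
print(f"vanilla lambda={lam_vanilla:.4f}, CoT lambda={lam_cot:.4f}, reduction={reduction:.0%}")
```

Fitting in log space keeps the sketch dependency-free; with noisy real measurements a nonlinear fit on the raw success rates might be the better choice.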
Iteration History
Each paper went through multiple autonomous iterations, driven by agent-simulated peer review that surfaces weaknesses and triggers targeted fixes:
| Version | Paper #1 (Auto-Research) | Paper #2 (Continual Learning) | Paper #3 (Long-Horizon) |
|---|---|---|---|
| V1 | Draft + 80 refs (6.0) | Draft + basic taxonomy (6.0) | Draft + 134 refs (7.0) |
| V2 | Literature + logic chains (6.5) | Dual-axis taxonomy redesign (6.5) | Adversarial review exposed missing experiments (3.0*) |
| V3 | Pilot experiment + figures (7.5) | CL×SI interaction study (7.0) | Horizon scaling experiment (8.0) |
| V4 | Deep analysis + formal conjectures (8.0) | Self-verification Pareto study (8.0) | Cross-validation + meta-analysis (8.5 ✓) |
| V5 | Literature refresh + polish (8.5 ✓) | Literature refresh + polish (8.5 ✓) | — |
* Paper #3 V2 was scored by an adversarial reviewer with strict experimental standards; V3 addressed all concerns with a redesigned horizon scaling experiment.
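The review-and-fix cycle behind this table can be pictured as a simple loop. This is a minimal sketch, not the actual Deli_AutoResearch code: `iterate_until_accepted`, the `review` and `revise` callables, the 8.5 acceptance threshold (inferred from the ✓ marks), and the version budget are all assumptions, and the toy reviewer's scores are made up for illustration.

```python
from typing import Callable, List, Tuple

def iterate_until_accepted(
    draft: str,
    review: Callable[[str], Tuple[float, List[str]]],
    revise: Callable[[str, List[str]], str],
    target_score: float = 8.5,   # assumed acceptance threshold
    max_versions: int = 5,       # assumed budget of reviewed versions
) -> Tuple[str, List[float]]:
    """Review a draft, apply targeted fixes, and repeat until accepted."""
    history: List[float] = []
    for version in range(1, max_versions + 1):
        score, weaknesses = review(draft)
        history.append(score)
        # Stop once accepted, or when the budget is spent (no unreviewed revision).
        if score >= target_score or version == max_versions:
            break
        draft = revise(draft, weaknesses)
    return draft, history

# Toy stand-ins for the reviewer and fixer agents (hypothetical).
def toy_review(draft: str) -> Tuple[float, List[str]]:
    score = min(6.0 + 1.0 * draft.count("[fixed]"), 10.0)
    return score, ["missing experiments"]

def toy_revise(draft: str, weaknesses: List[str]) -> str:
    return draft + " [fixed]"

final_draft, score_history = iterate_until_accepted("draft", toy_review, toy_revise)
print(score_history)
```

Keeping the reviewer and the fixer as separate callables mirrors the split described above, where review surfaces weaknesses and a separate step applies the fixes.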
Literature Upgrade: What V5 Did
The final round isn't just a re-upload. The autonomous literature agent:
- Re-ran 50+ search queries against updated arXiv indices
- Added 141 new references across the three papers
- Deeply wove 73 papers into existing narrative (not just citation dumps)
- Upgraded 36 entries from arXiv preprints to their accepted venue via DBLP cross-referencing
- Maintained freshness: 35-60% of all citations are from the past year
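The venue-upgrade and freshness steps above can be sketched roughly as follows. This is an offline illustration under stated assumptions: it does not call DBLP, `published` stands in for records already fetched from it, matching is by normalized title only, and "past year" is taken to mean the current or previous calendar year. All names (`BibEntry`, `upgrade_venues`, `freshness`) and all entries are hypothetical placeholders.

```python
import re
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class BibEntry:
    title: str
    venue: str
    year: int

def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation so minor formatting differences still match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def upgrade_venues(
    entries: List[BibEntry], published: Dict[str, str]
) -> Tuple[List[BibEntry], int]:
    """Swap 'arXiv' for the accepted venue when a title match exists.

    `published` maps a normalized title to its accepted venue.
    """
    upgraded, count = [], 0
    for entry in entries:
        venue = published.get(normalize_title(entry.title))
        if entry.venue == "arXiv" and venue is not None:
            upgraded.append(replace(entry, venue=venue))
            count += 1
        else:
            upgraded.append(entry)
    return upgraded, count

def freshness(entries: List[BibEntry], current_year: int) -> float:
    """Share of entries dated in the current or previous calendar year."""
    recent = sum(1 for e in entries if e.year >= current_year - 1)
    return recent / len(entries)

# Placeholder bibliography; none of these are real papers or venues.
bib = [
    BibEntry("Example Paper A: A Study", "arXiv", 2024),
    BibEntry("Example Paper B", "arXiv", 2025),
    BibEntry("Example Paper C", "ExampleJournal", 2021),
    BibEntry("Example Paper D", "arXiv", 2023),
]
published = {normalize_title("Example paper A - a study"): "ExampleConf"}

new_bib, n_upgraded = upgrade_venues(bib, published)
fresh_share = freshness(new_bib, current_year=2025)
print(n_upgraded, fresh_share)
```

Title-only matching is the weakest part of this sketch; a real cross-check would likely also compare authors and year before trusting a match.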
Open Source Skill
The paper_writing skill is now open source:
- paper_writing OPEN SOURCE — LaTeX generation, section structure, figure/table standards
Other skills used in the pipeline remain internal:
- search_agent — arXiv search, citation verification, DBLP cross-check
- call_api — Multi-model peer review (3-5 reviewers × 5 rounds)
- skill-router — Dynamic skill matching for sub-tasks
- peer_review_simulation — Multi-persona scoring with iterative fix cycles
- experiment_design — Study design and execution planning
Total: 49+ skill invocations across 7 distinct skills, coordinated by the Deli_AutoResearch master framework.
Read the papers, download the PDFs, and explore the production statistics.
View AutoResearch Papers →
What's Next
The pipeline is approaching a level where the primary bottleneck is no longer writing quality — it's research taste: which questions to ask, which angles to pursue, and when to stop. The next iteration will focus on hypothesis generation and novelty detection, moving from survey synthesis toward original contributions.
Stay tuned.