AutoResearch V2: Three Papers, 941 Citations


An experiment in LLM-assisted academic survey writing


Deli Chen, June 4, 2026

Happy to share the Deli AutoResearch V2 update: three survey papers produced with LLM agent assistance, covering the full pipeline from literature search to PDF generation. This is still an exploratory experiment with plenty of room for improvement, but I hope it offers some reference points for AI-assisted academic writing.

Disclaimer: This is a personal hobby project. All opinions expressed are my own and do not represent any organization.

What's New in This Release

This update brings three major changes:

New Paper #3: Long-Horizon Sequential Decision-Making. A 57-page survey covering 384 papers on hierarchical planning, reactive agents, MCTS/PRM search methods, and RL for agents. It includes a horizon scaling experiment across 5 frontier models × 5 horizon lengths × 3 conditions × 3 task types, which shows exponential performance decay (R² > 0.93). CoT and hierarchical decomposition reduce the decay constant by 40-60%.
Papers #1 & #2 republished with extensive new literature. Both papers received a major citation upgrade. Across all three papers, this round added 141 new references (34 and 41 of them in Papers #1 and #2), wove 73 papers into the narrative, and upgraded 36 entries from arXiv preprints to their accepted conference or journal venue via DBLP verification. The work consumed ~850K tokens across 10+ subagents in ~200 minutes.
Paper-writing skill open-sourced. The paper_writing skill — covering LaTeX generation, section structure, and figure/table standards — is now open source and available for reuse. Other skills in the pipeline (search_agent, call_api, peer_review_simulation, etc.) remain internal.

The Pipeline at a Glance

Step	Stage	Key numbers
1	Recall	50+ queries, 387 results
2	Score (LQS)	228 scored, 131 must-cite
3	Classify	A/B/C/D depth, 16 deep-discuss
4	Upgrade	36 venue upgrades
5	Write	~1.95M tokens, 190 pages
6	Experiment	3300 API calls, pilot studies
7	Review	14 rounds, score 6.0 → 8.5

Paper-by-Paper Comparison

Metric	Paper #1 Auto-Research Agents	Paper #2 Continual Learning	Paper #3 Long-Horizon
Citations	228	329	384
Pages	63	70	57
New refs (Jun)	34	41	66
Venue upgrades	16	14	6
Review score	8.5/10	8.5/10	8.5/10
Version	V5	V5	V4
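
The headline totals (941 citations, 190 pages, 141 new references, 36 venue upgrades) follow directly from the rows above. A quick check in Python, with the numbers copied from the table; the variable names are mine and are not part of the pipeline:

```python
# Per-paper figures copied from the comparison table (Paper #1, #2, #3).
citations = [228, 329, 384]
pages = [63, 70, 57]
new_refs = [34, 41, 66]
venue_upgrades = [16, 14, 6]

totals = {
    "citations": sum(citations),
    "pages": sum(pages),
    "new_refs": sum(new_refs),
    "venue_upgrades": sum(venue_upgrades),
}
print(totals)
# → {'citations': 941, 'pages': 190, 'new_refs': 141, 'venue_upgrades': 36}
```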

The New Paper: Long-Horizon Decision-Making

Paper #3 tackles one of the hardest open problems in AI agents: how to maintain coherence over long action sequences. The survey covers hierarchical planning (HRL, options framework), reactive architectures, search-based methods (MCTS, PRM), and RL for agent training.

The highlight is a horizon scaling experiment: 5 frontier models tested across 5 horizon lengths (10, 50, 100, 200, and 500 steps), 3 conditions (vanilla, CoT, hierarchical), and 3 task types. Key finding: performance degrades exponentially with horizon length, but Chain-of-Thought and hierarchical decomposition reduce the decay constant by 40-60%, which gives concrete engineering guidance for designing long-horizon agents.
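
To make "decay constant" and the R² figure concrete, here is a minimal sketch of how such a fit could be done. It assumes the model p(h) = p0 · exp(−λh), fitted by least squares on log(p); the post does not say which fitting procedure the paper uses. The success rates below are synthetic, generated from made-up decay constants, and are not the survey's measurements. Function and variable names are hypothetical.

```python
import numpy as np

HORIZONS = np.array([10, 50, 100, 200, 500], dtype=float)  # step counts from the post


def fit_exponential_decay(horizons, success_rates):
    """Fit p(h) = p0 * exp(-lam * h) by least squares on log(p).

    Returns (p0, lam, r_squared), with R^2 measured in log space.
    """
    log_p = np.log(success_rates)
    slope, intercept = np.polyfit(horizons, log_p, 1)
    predicted = slope * horizons + intercept
    ss_res = np.sum((log_p - predicted) ** 2)
    ss_tot = np.sum((log_p - log_p.mean()) ** 2)
    return float(np.exp(intercept)), float(-slope), float(1.0 - ss_res / ss_tot)


def synthetic_curve(p0, lam, rng, noise=0.03):
    """Made-up success rates that follow the assumed model, with multiplicative noise."""
    return p0 * np.exp(-lam * HORIZONS) * np.exp(rng.normal(0.0, noise, HORIZONS.size))


rng = np.random.default_rng(0)
# Illustrative decay constants only; these are NOT results from the paper.
vanilla = synthetic_curve(0.95, 0.0040, rng)
with_cot = synthetic_curve(0.95, 0.0020, rng)

_, lam_vanilla, r2_vanilla = fit_exponential_decay(HORIZONS, vanilla)
_, lam_cot, r2_cot = fit_exponential_decay(HORIZONS, with_cot)
reduction = 1.0 - lam_cot / lam_vanilla

print(f"vanilla: lambda={lam_vanilla:.4f}, R^2={r2_vanilla:.3f}")
print(f"CoT:     lambda={lam_cot:.4f}, R^2={r2_cot:.3f}")
print(f"decay constant reduced by {reduction:.0%}")
```

Fitting in log space keeps the example to a single linear regression. A nonlinear fit on the raw success rates would weight the short horizons more heavily and could give a somewhat different λ.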

Iteration History

Each paper went through multiple autonomous iterations, driven by agent-simulated peer review that surfaces weaknesses and triggers targeted fixes:

Version	Paper #1 (Auto-Research)	Paper #2 (Continual Learning)	Paper #3 (Long-Horizon)
V1	Draft + 80 refs (6.0)	Draft + basic taxonomy (6.0)	Draft + 134 refs (7.0)
V2	Literature + logic chains (6.5)	Dual-axis taxonomy redesign (6.5)	Adversarial review exposed missing experiments (3.0*)
V3	Pilot experiment + figures (7.5)	CL×SI interaction study (7.0)	Horizon scaling experiment (8.0)
V4	Deep analysis + formal conjectures (8.0)	Self-verification Pareto study (8.0)	Cross-validation + meta-analysis (8.5 ✓)
V5	Literature refresh + polish (8.5 ✓)	Literature refresh + polish (8.5 ✓)	—

* Paper #3 V2 was scored by an adversarial reviewer with strict experimental standards; V3 addressed all concerns with a redesigned horizon scaling experiment.
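
The control flow behind the table can be sketched as a loop that keeps revising until the review score reaches a target. This is an illustration only: the 8.5 threshold is my reading of the ✓ marks, the `Revision` class and `iterate_until_accepted` function are hypothetical names, and the real pipeline obtains each score from simulated reviewers rather than from a fixed list.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    version: str
    change: str
    score: float


def iterate_until_accepted(revisions, target=8.5):
    """Walk through revisions in order and stop at the first one meeting the target.

    The list stands in for the draft -> review -> fix cycle; here the scores
    are replayed from the iteration table instead of being produced by reviewers.
    """
    history = []
    for rev in revisions:
        history.append(rev)
        if rev.score >= target:
            break
    return history


# Paper #3 rows, copied from the iteration table.
paper3 = [
    Revision("V1", "Draft + 134 refs", 7.0),
    Revision("V2", "Adversarial review exposed missing experiments", 3.0),
    Revision("V3", "Horizon scaling experiment", 8.0),
    Revision("V4", "Cross-validation + meta-analysis", 8.5),
]

history = iterate_until_accepted(paper3)
for rev in history:
    print(f"{rev.version}: {rev.score:.1f}  {rev.change}")
```

Note that the score is not monotonic: a stricter reviewer at V2 dropped it from 7.0 to 3.0, so a stopping rule based on "score stopped improving" would behave differently from the threshold rule assumed here.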

Literature Upgrade: What V5 Did

The final round isn't just a re-upload. The autonomous literature agent:

Re-ran 50+ search queries against updated arXiv indices
Added 141 new references across the three papers
Deeply wove 73 papers into existing narrative (not just citation dumps)
Upgraded 36 entries from arXiv preprints to their accepted venue via DBLP cross-referencing
Maintained freshness: 35-60% of all citations are from the past year
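
The venue-upgrade step can be sketched as a title match against DBLP records. The search_agent skill is internal, so this is not its implementation. The URL below is DBLP's public publication-search endpoint; the record shape (string `title`, `venue`, `year` fields, with arXiv preprints listed under the venue "CoRR") reflects my understanding of its JSON output and should be checked against a live response. The sample records are placeholders, not real papers.

```python
import re
from urllib.parse import urlencode


def dblp_query_url(title):
    """Build a DBLP publication-search URL that asks for JSON."""
    return "https://dblp.org/search/publ/api?" + urlencode({"q": title, "format": "json"})


def normalize(title):
    """Lowercase and drop punctuation so minor formatting differences do not block a match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_venue_upgrade(title, records):
    """Return the first record with the same title and a non-preprint venue, else None.

    Each record is a dict with "title", "venue" and "year" keys (assumed shape).
    """
    wanted = normalize(title)
    for rec in records:
        is_preprint = rec.get("venue", "").lower() in {"corr", "arxiv"}
        if normalize(rec.get("title", "")) == wanted and not is_preprint:
            return rec
    return None


# Placeholder records for illustration, not real papers.
records = [
    {"title": "An Example Survey of Agents.", "venue": "CoRR", "year": "2025"},
    {"title": "An Example Survey of Agents.", "venue": "ExampleConf", "year": "2026"},
]
upgrade = find_venue_upgrade("An Example Survey of Agents", records)
print(upgrade)
print(dblp_query_url("An Example Survey of Agents"))
```

Exact matching on normalized titles is deliberately strict. It misses papers that were retitled on acceptance, but it avoids silently attaching the wrong venue to a citation.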

Open Source Skill

The paper_writing skill is now open source:

paper_writing OPEN SOURCE — LaTeX generation, section structure, figure/table standards

Other skills used in the pipeline remain internal:

search_agent — arXiv search, citation verification, DBLP cross-check
call_api — Multi-model peer review (3-5 reviewers × 5 rounds)
skill-router — Dynamic skill matching for sub-tasks
peer_review_simulation — Multi-persona scoring with iterative fix cycles
experiment_design — Study design and execution planning

Total: 49+ skill invocations across 7 distinct skills, coordinated by the Deli_AutoResearch master framework.

Read the papers, download the PDFs, and explore the production statistics.


What's Next

The pipeline is approaching a level where the primary bottleneck is no longer writing quality — it's research taste: which questions to ask, which angles to pursue, and when to stop. The next iteration will focus on hypothesis generation and novelty detection, moving from survey synthesis toward original contributions.

Stay tuned.
