
AutoResearch V2: Three Papers, 941 Citations


An experiment in LLM-assisted academic survey writing


Deli Chen · June 4, 2026

Happy to share the Deli AutoResearch V2 update: three survey papers produced with LLM agent assistance, covering the full pipeline from literature search to PDF generation. It is still very much an exploratory experiment with plenty of room for improvement, but I hope it offers some reference points for AI-assisted academic writing.

Disclaimer: This is a personal hobby project. All opinions expressed are my own and do not represent any organization.

Papers: 3 · Citations: 941 · Pages: 190 · Avg review: 8.5 · Wall clock: ~38h

What's New in This Release

This update brings three major changes:

  1. New Paper #3: Long-Horizon Sequential Decision-Making. A comprehensive 57-page survey covering 384 papers on hierarchical planning, reactive agents, MCTS/PRM search methods, and RL for agents. It features a horizon scaling experiment across 5 frontier models × 5 horizon lengths, revealing exponential performance decay (R² > 0.93) whose decay constant CoT and hierarchical planning reduce by 40-60%.
  2. Papers #1 & #2 republished with extensive new literature. Both papers received a major citation upgrade: 141 new references added, 73 papers deeply woven into the narrative, and 36 venue upgrades (arXiv → accepted conference/journal via DBLP verification). This round of work consumed ~850K tokens across 10+ subagents in ~200 minutes.
  3. Paper-writing skill open-sourced. The paper_writing skill — covering LaTeX generation, section structure, and figure/table standards — is now open source and available for reuse. Other skills in the pipeline (search_agent, call_api, peer_review_simulation, etc.) remain internal.

The Pipeline at a Glance

  1. 🔍 Recall: 50+ queries, 387 results
  2. Score (LQS): 228 scored, 131 must-cite
  3. 🎯 Classify: A/B/C/D depth, 16 deep-discuss
  4. ⬆️ Upgrade: 36 venue upgrades
  5. ✍️ Write: ~1.95M tokens, 190 pages
  6. 🧪 Experiment: 3300 API calls, pilot studies
  7. 📝 Review: 14 rounds, 6.0→8.5

Paper-by-Paper Comparison

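| Metric | Paper #1: Auto-Research Agents | Paper #2: Continual Learning | Paper #3: Long-Horizon |
|---|---|---|---|
| Citations | 228 | 329 | 384 |
| Pages | 63 | 70 | 57 |
| New refs (Jun) | 34 | 41 | 66 |
| Venue upgrades | 16 | 14 | 6 |
| Review score | 8.5/10 | 8.5/10 | 8.5/10 |
| Version | V5 | V5 | V4 |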

The New Paper: Long-Horizon Decision-Making

Paper #3 tackles one of the hardest open problems in AI agents: how to maintain coherence over long action sequences. The survey covers hierarchical planning (HRL, options framework), reactive architectures, search-based methods (MCTS, PRM), and RL for agent training.

The highlight is a horizon scaling experiment: 5 frontier models tested across 5 horizon lengths (10, 50, 100, 200, 500 steps) × 3 conditions (vanilla, CoT, hierarchical) × 3 task types. Key finding: performance degrades exponentially with horizon length (R² > 0.93), but Chain-of-Thought and hierarchical decomposition reduce the decay constant by 40-60%, which gives concrete engineering guidance for designing long-horizon agents.
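To make "decay constant" concrete, here is a minimal sketch of how such a fit could be done, assuming performance follows score(h) ≈ a · exp(−k · h), where h is the horizon length and k is the decay constant. The function name, the log-linear least-squares method, and all numbers below are assumptions for illustration. The data is synthetic and noise-free, not taken from the paper, and the paper's own fitting procedure may differ.

```python
import math

def fit_exponential_decay(horizons, scores):
    """Fit log(score) = log(a) - k * horizon by ordinary least squares.

    Returns (a, k, r2), where r2 is computed in log space.
    """
    xs = list(horizons)
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(intercept), -slope, r2

# Horizon lengths named in the post.
HORIZONS = [10, 50, 100, 200, 500]

# Synthetic illustration only: hypothetical decay constants, NOT paper results.
vanilla = [0.9 * math.exp(-0.004 * h) for h in HORIZONS]
with_cot = [0.9 * math.exp(-0.002 * h) for h in HORIZONS]

a_v, k_v, r2_v = fit_exponential_decay(HORIZONS, vanilla)
a_c, k_c, r2_c = fit_exponential_decay(HORIZONS, with_cot)

# Relative reduction of the decay constant under the second condition.
reduction = 1.0 - k_c / k_v
print(f"vanilla k={k_v:.4f}, CoT k={k_c:.4f}, reduction={reduction:.0%}")
```

Under this parameterization, halving k doubles the horizon at which performance falls to any given fraction of its starting value, which is why a 40-60% reduction matters in practice.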

Iteration History

Each paper went through multiple autonomous iterations, driven by agent-simulated peer review that surfaces weaknesses and triggers targeted fixes:

| Version | Paper #1 (Auto-Research) | Paper #2 (Continual Learning) | Paper #3 (Long-Horizon) |
|---|---|---|---|
| V1 | Draft + 80 refs (6.0) | Draft + basic taxonomy (6.0) | Draft + 134 refs (7.0) |
| V2 | Literature + logic chains (6.5) | Dual-axis taxonomy redesign (6.5) | Adversarial review exposed missing experiments (3.0*) |
| V3 | Pilot experiment + figures (7.5) | CL×SI interaction study (7.0) | Horizon scaling experiment (8.0) |
| V4 | Deep analysis + formal conjectures (8.0) | Self-verification Pareto study (8.0) | Cross-validation + meta-analysis (8.5 ✓) |
| V5 | Literature refresh + polish (8.5 ✓) | Literature refresh + polish (8.5 ✓) | |

* Paper #3 V2 was scored by an adversarial reviewer with strict experimental standards; V3 addressed all concerns with a redesigned horizon scaling experiment.
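The review-driven loop described in this section can be sketched roughly as follows. This is a toy illustration, not the actual Deli_AutoResearch code: the `Draft` class, the `review` and `revise` callables, the backlog of weaknesses, and the scoring rule are all hypothetical stand-ins for the agent-simulated reviewer and the revision agents.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Draft:
    version: int
    fixes_applied: List[str] = field(default_factory=list)

Review = Tuple[float, List[str]]  # (score out of 10, list of weaknesses)

def iterate_until_accepted(
    draft: Draft,
    review: Callable[[Draft], Review],
    revise: Callable[[Draft, List[str]], Draft],
    target: float = 8.5,
    max_versions: int = 5,
) -> Tuple[Draft, List[Tuple[int, float]]]:
    """Review, then revise, until the score reaches target or versions run out."""
    history = []
    while True:
        score, weaknesses = review(draft)
        history.append((draft.version, score))
        if score >= target or draft.version >= max_versions:
            return draft, history
        draft = revise(draft, weaknesses)

# Hypothetical stand-ins for the simulated reviewer and the revision step.
BACKLOG = ["thin literature", "missing experiment", "shallow analysis"]

def toy_review(draft: Draft) -> Review:
    remaining = [w for w in BACKLOG if w not in draft.fixes_applied]
    return 8.5 - 0.5 * len(remaining), remaining

def toy_revise(draft: Draft, weaknesses: List[str]) -> Draft:
    # Address one weakness per iteration, producing the next version.
    return Draft(draft.version + 1, draft.fixes_applied + [weaknesses[0]])

final, history = iterate_until_accepted(Draft(version=1), toy_review, toy_revise)
print(final.version, history)
```

The stopping rule here combines a score threshold with a version cap, so a draft that never satisfies the reviewer still terminates. Whether the real pipeline uses the same rule is not stated in the post.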

Literature Upgrade: What V5 Did

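The final round isn't just a re-upload. In this round the autonomous literature agent added 141 new references, wove 73 papers deeply into the narrative, and made 36 venue upgrades (arXiv → accepted conference/journal via DBLP verification).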
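One step in this round, the venue upgrade from an arXiv preprint to an accepted conference or journal version, can be sketched as a title-matching pass over candidate bibliographic records. This is only a sketch under assumptions: the record fields, the function names, and the example titles and venue are hypothetical, and the candidates are assumed to have already been fetched from a bibliographic source such as DBLP (which lists arXiv preprints under the venue "CoRR"). The post does not describe the actual matching logic.

```python
import re
from typing import Dict, List, Optional

# Venue labels treated as preprints rather than accepted publications.
PREPRINT_VENUES = {"arxiv", "corr"}

def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def find_venue_upgrade(
    reference: Dict[str, str], candidates: List[Dict[str, str]]
) -> Optional[Dict[str, str]]:
    """Return an upgraded copy of reference if a published version is found, else None."""
    if reference["venue"].lower() not in PREPRINT_VENUES:
        return None  # already cites an accepted venue
    wanted = normalize_title(reference["title"])
    for record in candidates:
        is_published = record["venue"].lower() not in PREPRINT_VENUES
        if is_published and normalize_title(record["title"]) == wanted:
            return {**reference, "venue": record["venue"], "year": record["year"]}
    return None

# Hypothetical example records, not real papers.
reference = {"title": "A Hypothetical Survey of LLM Agents", "venue": "arXiv", "year": "2024"}
candidates = [
    {"title": "A hypothetical survey of LLM agents.", "venue": "CoRR", "year": "2024"},
    {"title": "A Hypothetical Survey of LLM Agents.", "venue": "ExampleConf", "year": "2025"},
]

upgraded = find_venue_upgrade(reference, candidates)
print(upgraded)
```

Exact matching on normalized titles is deliberately conservative: it can miss papers retitled at publication, but it avoids attaching the wrong venue to a reference.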

Open Source Skill

The paper_writing skill is now open source.

Other skills used in the pipeline remain internal.

Total: 49+ skill invocations across 7 distinct skills, coordinated by the Deli_AutoResearch master framework.

Read the papers, download the PDFs, and explore the production statistics.


What's Next

The pipeline is approaching a level where the primary bottleneck is no longer writing quality — it's research taste: which questions to ask, which angles to pursue, and when to stop. The next iteration will focus on hypothesis generation and novelty detection, moving from survey synthesis toward original contributions.

Stay tuned.
