Deli AutoResearch Papers

Paper #1

From Copilots to Colleagues: A Survey of Autonomous Research Agents in the Age of Foundation Models

This survey proposes a five-level autonomy taxonomy (L1–L5), identifies four dominant architectural patterns, and systematically compares 17 major systems across a six-dimensional feature matrix. Includes an illustrative pilot study comparing 5 frontier models across 3 research tasks and 3 agent architectures, with a formal Architecture-Capability Trade-off Conjecture.

228 citations

63 pages

60.5% of citations within 1yr

29.8% of cited papers accepted at a venue

8.5/10 review

Download (628 KB) V5 · 2026-06-04

@article{chendeli_202606_auto_research_survey,
  title={From Copilots to Colleagues: A Survey of Autonomous Research Agents in the Age of Foundation Models},
  author={Chen, Deli},
  journal={arXiv preprint},
  year={2026},
  note={Generated by Deli AutoResearch framework. 228 citations, 63 pages, peer review 8.5/10}
}

Paper #2

Never Stop Learning: A Unified Survey of Continual Learning and Self-Improvement in Large Language Models

Unifies continual learning and self-improvement under a three-axis taxonomy (Strategy × Scope × Objective). Formalizes CL×SI interaction via bilevel optimization with impossibility conjectures. Includes two pilot experiments: a CL×SI interaction study revealing GPT-5.2’s deterministic SI collapse, and a knowledge retention-acquisition trade-off study identifying Self-Verification as Pareto-optimal across 5 domains.

329 citations

70 pages

54.3% of citations within 1yr

30.1% of cited papers accepted at a venue

8.5/10 review

Download (777 KB) V5 · 2026-06-04

@article{chendeli_2026_continue_learning_survey,
  title={Never Stop Learning: A Unified Survey of Continual Learning and Self-Improvement in Large Language Models},
  author={Chen, Deli},
  journal={arXiv preprint},
  year={2026},
  note={Generated by Deli AutoResearch framework. 326 citations, 70 pages, peer review 8.5/10}
}

Paper #3

Navigating the Long Horizon: A Comprehensive Survey of Agent Architectures and Reinforcement Learning for Extended Sequential Decision-Making

Surveys 384 papers on long-horizon sequential decision-making, covering hierarchical planning, reactive agents, search-based methods (MCTS, PRM), and RL for agents. Features a rigorous horizon scaling experiment across 5 frontier models × 5 horizon lengths × 3 conditions × 3 task types, with exponential decay fitting (R² > 0.93). Chain-of-thought and hierarchical planning significantly reduce horizon degradation.

384 citations

57 pages

35.4% of citations within 1yr

49.2% of cited papers accepted at a venue

8.5/10 review

Download (762 KB) V4 · 2026-06-04

@article{chendeli_202606_long_horizon_survey,
  title={Navigating the Long Horizon: A Comprehensive Survey of Agent Architectures and Reinforcement Learning for Extended Sequential Decision-Making},
  author={Chen, Deli},
  journal={arXiv preprint},
  year={2026},
  note={Generated by Deli AutoResearch framework. 384 citations, 57 pages, peer review 8.5/10}
}
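
The horizon scaling experiment in Paper #3 is described as a 5 × 5 × 3 × 3 grid with exponential decay fitting. The sketch below shows one way such a grid and fit could be set up. It is an illustration only: the model names, horizon values, condition and task labels are placeholders, the decay form s(h) = a · exp(−b · h) is an assumed parameterization, and the data points are synthetic and noise-free, not results from the paper.

```python
import math
from itertools import product

# Placeholder labels: the page gives only the counts (5 x 5 x 3 x 3).
models = [f"model_{i}" for i in range(1, 6)]
horizon_lengths = [1, 2, 4, 8, 16]
conditions = ["condition_a", "condition_b", "condition_c"]
task_types = ["task_a", "task_b", "task_c"]
grid = list(product(models, horizon_lengths, conditions, task_types))


def fit_exponential_decay(horizons, success_rates):
    """Fit s(h) = a * exp(-b * h) by least squares on log(s) versus h."""
    xs = list(horizons)
    ys = [math.log(s) for s in success_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def r_squared(horizons, success_rates, a, b):
    """Coefficient of determination of the fitted curve in the original scale."""
    predicted = [a * math.exp(-b * h) for h in horizons]
    mean_s = sum(success_rates) / len(success_rates)
    ss_res = sum((s - p) ** 2 for s, p in zip(success_rates, predicted))
    ss_tot = sum((s - mean_s) ** 2 for s in success_rates)
    return 1.0 - ss_res / ss_tot


# Synthetic points generated from known parameters (a = 0.9, b = 0.1).
synthetic = [0.9 * math.exp(-0.1 * h) for h in horizon_lengths]
a, b = fit_exponential_decay(horizon_lengths, synthetic)
r2 = r_squared(horizon_lengths, synthetic, a, b)
print(len(grid), round(a, 3), round(b, 3), round(r2, 3))
```

In a real run, each grid cell would contribute a measured success rate, and a fit like this would be repeated per model and condition before comparing decay rates.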

Paper #4

Self-Play in the Age of Foundation Models: A Comprehensive Survey from Game-Theoretic Foundations to Open-Ended Learning

Unifies self-play from TD-Gammon and AlphaZero through PSRO and league training to LLM self-improvement (SPIN, self-rewarding, DeepSeek-R1) under a three-axis taxonomy. Validated by an original 285B-parameter GRPO experiment: four calibrated verifier-noise conditions with seed replication show training-distribution improvement falling monotonically as verifier noise rises, plus a KL-endpoint held-out study revealing a train-dist vs held-out tradeoff. The V16 theory hardening adds an exact closed-form noise-floor recurrence, a constant floor under uniform mixing, a KL-anchor displacement lemma, and a matching lower bound.

217 citations

75 pages

285B RL experiment

6 theorems & lemmas

8.6/10 review

Download (962 KB) V16 · 2026-06-17

@article{chendeli_202606_self_play_survey,
  title={Self-Play in the Age of Foundation Models: A Comprehensive Survey from Game-Theoretic Foundations to Open-Ended Learning},
  author={Chen, Deli},
  journal={arXiv preprint},
  year={2026},
  note={Generated by Deli AutoResearch framework. 217 citations, 75 pages, peer review 8.6/10, includes a 285B-parameter RL experiment}
}

Production Statistics

Metric	Paper #1	Paper #2	Paper #3	Paper #4	Total
BibTeX entries	228	326	384	217	1155
PDF pages	63	70	57	75	265
Figures	5+	8+	13	6+	32+
Tables	14+	15+	30+	12+	71+
Peer review (final)	8.5/10	8.5/10	8.5/10	8.6/10	8.5+ avg
Review iterations	V1→V5	V1→V5	V1→V4	V1→V16	30 rounds
Compute Consumption
Total iterations (agent turns)	~60	~80	~70	~80	~290
Output tokens	~550K	~720K	~680K	~600K	~2.55M
Tool invocations	~380	~470	~520	~600	~1,970
Subagents spawned	12+	18+	18+	15+	63+
Wall clock (total)	~10h	~12h	~16h	~6h	~44h
Citation Quality
Venue upgrades	16	14	6	11	47
New refs added (June)	34	41	66	47	188
Papers woven in	15	25	33	18	91
1yr citation ratio	60.5%	54.3%	35.4%	—	—
Accepted ratio	29.8%	30.1%	49.2%	—	—

Subagent Consumption (Literature + Experiment + Review Cycle)

Phase	Subagents	Tokens	Tool Uses	Wall Clock
Literature collection (3 papers)	3	386,359	332	58 min
Text weaving (3 papers)	3	203,204	117	44 min
Experiment design + execution	2	111,115	100	46 min
Experiment integration + Review V3	1	64,460	45	27 min
Weakness fix + Review V4	1	87,498	58	26 min
Total	10+	852,636	652	~201 min
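
The Total row can be re-derived from the five phase rows. A minimal sketch, with the figures copied from the table above and the tuple layout chosen here for illustration:

```python
# (phase, subagents, tokens, tool uses, wall-clock minutes), copied from the table.
phases = [
    ("Literature collection (3 papers)", 3, 386_359, 332, 58),
    ("Text weaving (3 papers)", 3, 203_204, 117, 44),
    ("Experiment design + execution", 2, 111_115, 100, 46),
    ("Experiment integration + Review V3", 1, 64_460, 45, 27),
    ("Weakness fix + Review V4", 1, 87_498, 58, 26),
]

# Sum each numeric column across phases.
subagents, tokens, tool_uses, minutes = (
    sum(column) for column in zip(*(row[1:] for row in phases))
)
print(subagents, tokens, tool_uses, minutes)  # 10 852636 652 201
```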

Review Score Trajectory

Paper	V1	V2	V3	V4	V5 (Final)
Paper #1 (Auto-Research)	6.0	6.5	7.5	8.0	8.5 ✓
Paper #2 (Continual Learning)	6.0	6.5	7.0	8.0	8.5 ✓
Paper #3 (Long-Horizon)	7.0	3.0*	8.0	8.5 ✓	—
Paper #4 (Self-Play)	7.0	7.5	8.0	8.5	8.6 ✓ (V16)

* Paper #3 V2 was scored by an adversarial reviewer with strict experimental standards; V3 addressed all concerns with a redesigned horizon scaling experiment. V5 improvements focus on analytical depth, structural cohesion, and cross-benchmark validation. Paper #4 ran the longest cycle (V1→V16): a 285B-parameter GRPO experiment, a KL-endpoint held-out study, and a round of theory hardening (closed-form noise-floor recurrence + matching lower bound) raised its score from 8.5 to 8.6.

Literature Funnel (4-Stage Pipeline)

Each paper goes through a systematic 4-stage literature review pipeline: Recall (broad keyword search via site:arxiv.org) → Score (LQS multi-dimensional quality scoring) → Classify (A/B/C/D citation depth assignment) → Upgrade (arXiv preprint → accepted venue via DBLP).

Stage	Paper #1	Paper #2	Paper #3	Total
Stage 1: Recall (keyword queries × site:arxiv.org)	20 queries, 170 results	10 queries, 83 results	20+ queries, 134 results	50+ queries, 387 results
Stage 2: Score (LQS = Recency 30% + Citation 25% + Venue 20% + Institution 10% + Acceptance 15%)	50 scored: 14 must-cite, 36 conditional, 0 dropped	45 scored: 45 must-cite, 0 conditional, 0 dropped	133 scored: 72 must-cite, 51 conditional, 10 dropped	228 scored: 131 must-cite, 87 conditional, 10 dropped
Stage 3: Classify (A = deep discussion, B = detailed cite, C = brief cite, D = drop)	A: 5, B: 10, C: 35, D: 0	A: 4, B: 12, C: 29, D: 0	A: 7, B: 13, C: 103, D: 10	A: 16, B: 35, C: 167, D: 10
Stage 4: Upgrade (arXiv → @inproceedings via DBLP/OpenReview)	16 upgraded	14 upgraded	6 upgraded	36 upgraded
Final BibTeX	228 entries	329 entries	384 entries	941 entries

LQS thresholds: ≥7.0 = must-cite (high quality + high relevance), 5.0–7.0 = conditional (fills taxonomy gap), <5.0 = dropped.
Citation depth: A-level papers get 1–3 paragraphs of discussion; B-level get 2–5 sentences; C-level get a single citation in context; D-level are excluded from the paper.
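
The Stage 2 weights and the LQS thresholds can be combined into a small scoring sketch. The weights and cut-offs are taken from the table and the note above. The 0–10 sub-score scale, the function names, and the treatment of exactly 7.0 as must-cite are assumptions made for illustration, not the framework's actual code.

```python
# Stage 2 weights as listed in the funnel table (they sum to 1.0).
LQS_WEIGHTS = {
    "recency": 0.30,
    "citation": 0.25,
    "venue": 0.20,
    "institution": 0.10,
    "acceptance": 0.15,
}

MUST_CITE_MIN = 7.0    # LQS >= 7.0 -> must-cite
CONDITIONAL_MIN = 5.0  # 5.0 <= LQS < 7.0 -> conditional, below 5.0 -> dropped


def lqs(subscores):
    """Weighted sum of per-dimension sub-scores (assumed to be on a 0-10 scale)."""
    return sum(weight * subscores[name] for name, weight in LQS_WEIGHTS.items())


def bucket(score):
    """Map an LQS value to the must-cite / conditional / dropped bucket."""
    if score >= MUST_CITE_MIN:
        return "must-cite"
    if score >= CONDITIONAL_MIN:
        return "conditional"
    return "dropped"


# Hypothetical sub-scores for a single candidate paper.
example = {"recency": 9, "citation": 6, "venue": 8, "institution": 5, "acceptance": 10}
score = lqs(example)
print(round(score, 2), bucket(score))
```

For the hypothetical paper above, the weighted sum is 2.7 + 1.5 + 1.6 + 0.5 + 1.5 = 7.8, which falls in the must-cite bucket.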

Skill Hub Usage

Skills invoked from an internal skill registry during the research pipeline.

Skill	ID	Invocations	Phase	Purpose
search_agent	#5	12+	Literature & Verify	arXiv search, citation verification, DBLP cross-check, acceptance status lookup
call_api	#2	8+	Review & Experiment	Multi-model peer review (3–5 reviewers × 5 rounds), horizon scaling experiment (3300 API calls)
static_file_service	#6	4	Deploy	PDF hosting, index.html generation, service restart
skill-router	#57	3	Orchestration	Dynamic skill matching for sub-tasks (literature, experiment, deployment)
Deli_AutoResearch	—	3	Orchestration	Master framework: anti-loop, heartbeat, state management, multi-track coordination
paper_writing (open source)	—	3	Writing	LaTeX generation, section structure, figure/table standards, compilation
peer_review_simulation	—	14	Review	Multi-persona scoring (5 reviewer types), iterative fix cycles (V1→V5)
experiment_design	—	2	Experiment	Horizon scaling study design, CL×SI interaction experiment design

Total skill invocations	—	49+	—	Across 3 papers × 5+ review rounds × multi-stage pipeline

Skill registry — an internal catalog of reusable agent skills; 7 of them were used in this project.