Personal research project by Deli Chen. All opinions are my own and do not represent any organization.
| Metric | Paper #1 | Paper #2 | Paper #3 | Total |
|---|---|---|---|---|
| BibTeX entries | 228 | 326 | 384 | 938 |
| PDF pages | 63 | 70 | 57 | 190 |
| Figures | 5+ | 8+ | 13 | 26+ |
| Tables | 14+ | 15+ | 30+ | 59+ |
| Peer review (final) | 8.5/10 | 8.5/10 | 8.5/10 | 8.5 avg |
| Review iterations | V1→V5 | V1→V5 | V1→V4 | 14 rounds |
| Compute Consumption | ||||
| Total iterations (agent turns) | ~60 | ~80 | ~70 | ~210 |
| Output tokens | ~550K | ~720K | ~680K | ~1.95M |
| Tool invocations | ~380 | ~470 | ~520 | ~1,370 |
| Subagents spawned | 12+ | 18+ | 18+ | 48+ |
| Wall clock (total) | ~10h | ~12h | ~16h | ~38h |
| Citation Quality | ||||
| Venue upgrades | 16 | 14 | 6 | 36 |
| New refs added (June) | 34 | 41 | 66 | 141 |
| Papers woven in | 15 | 25 | 33 | 73 |
| 1yr citation ratio | 60.5% | 54.3% | 35.4% | — |
| Accepted ratio | 29.8% | 30.1% | 49.2% | — |

| Phase | Subagents | Tokens | Tool Uses | Wall Clock |
|---|---|---|---|---|
| Literature collection (3 papers) | 3 | 386,359 | 332 | 58 min |
| Text weaving (3 papers) | 3 | 203,204 | 117 | 44 min |
| Experiment design + execution | 2 | 111,115 | 100 | 46 min |
| Experiment integration + Review V3 | 1 | 64,460 | 45 | 27 min |
| Weakness fix + Review V4 | 1 | 87,498 | 58 | 26 min |
| Total | 10+ | ~852,636 | 652 | ~201 min |

| Paper | V1 | V2 | V3 | V4 | V5 (Final) |
|---|---|---|---|---|---|
| Paper #1 (Auto-Research) | 6.0 | 6.5 | 7.5 | 8.0 | 8.5 ✓ |
| Paper #2 (Continual Learning) | 6.0 | 6.5 | 7.0 | 8.0 | 8.5 ✓ |
| Paper #3 (Long-Horizon) | 7.0 | 3.0* | 8.0 | 8.5 ✓ | — |
* Paper #3 V2 was scored by an adversarial reviewer applying strict experimental standards; V3 addressed all of the concerns raised with a redesigned horizon-scaling experiment. V5 improvements focus on analytical depth, structural cohesion, and cross-benchmark validation.

Each paper goes through a systematic four-stage literature review pipeline: Recall (broad keyword search restricted to site:arxiv.org) → Score (multi-dimensional LQS quality scoring) → Classify (A/B/C/D citation-depth assignment) → Upgrade (arXiv preprint → accepted-venue entry via DBLP).
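As a rough illustration of the first and last stages, the Python sketch below builds arXiv-restricted recall queries and promotes a preprint entry to `@inproceedings` once an accepted venue is known. The `BibEntry` class, the function names, and the example keywords and venue are all hypothetical; the DBLP lookup itself is not shown and is represented only by the `accepted_venue` argument.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class BibEntry:
    """Hypothetical minimal BibTeX record used only for this sketch."""
    key: str
    title: str
    entry_type: str = "misc"      # assumption: preprints start as @misc
    venue: Optional[str] = None   # filled in once an accepted venue is found


def recall_queries(keywords: list[str]) -> list[str]:
    """Stage 1 (Recall): restrict each keyword query to arXiv."""
    return [f"{kw} site:arxiv.org" for kw in keywords]


def upgrade(entry: BibEntry, accepted_venue: Optional[str]) -> BibEntry:
    """Stage 4 (Upgrade): promote a preprint when an accepted venue is known.

    `accepted_venue` stands in for the result of a DBLP lookup (not shown).
    """
    if accepted_venue is None:
        return entry  # no accepted version found; keep the preprint entry
    return replace(entry, entry_type="inproceedings", venue=accepted_venue)


# Illustrative inputs only; these are not keywords or venues from the project.
queries = recall_queries(["long-horizon agents", "continual learning"])
preprint = BibEntry(key="example2025", title="A Hypothetical Preprint")
published = upgrade(preprint, accepted_venue="Hypothetical Conference 2025")

print(queries[0])            # long-horizon agents site:arxiv.org
print(published.entry_type)  # inproceedings
```

Keeping `upgrade` a pure function that returns a new record means an entry with no accepted version passes through unchanged, which matches the table's distinction between upgraded and non-upgraded entries.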
| Stage | Paper #1 | Paper #2 | Paper #3 | Total |
|---|---|---|---|---|
| Stage 1: Recall (keyword queries × site:arxiv.org) | 20 queries, 170 results | 10 queries, 83 results | 20+ queries, 134 results | 50+ queries, 387 results |
| Stage 2: Score (LQS = Recency 30% + Citation 25% + Venue 20% + Institution 10% + Acceptance 15%) | 50 scored: 14 must-cite, 36 conditional, 0 dropped | 45 scored: 45 must-cite, 0 conditional, 0 dropped | 133 scored: 72 must-cite, 51 conditional, 10 dropped | 228 scored: 131 must-cite, 87 conditional, 10 dropped |
| Stage 3: Classify (A = deep discussion, B = detailed cite, C = brief cite, D = drop) | A: 5, B: 10, C: 35, D: 0 | A: 4, B: 12, C: 29, D: 0 | A: 7, B: 13, C: 103, D: 10 | A: 16, B: 35, C: 167, D: 10 |
| Stage 4: Upgrade (arXiv → @inproceedings via DBLP/OpenReview) | 16 upgraded | 14 upgraded | 6 upgraded | 36 upgraded |
| Final BibTeX | 228 entries | 329 entries | 384 entries | 941 entries |

LQS thresholds: ≥7.0 = must-cite (high quality and high relevance); 5.0 to <7.0 = conditional (fills a taxonomy gap); <5.0 = dropped.

Citation depth: A-level papers get 1–3 paragraphs of discussion; B-level papers get 2–5 sentences; C-level papers get a single in-context citation; D-level papers are excluded from the paper.
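The scoring and threshold rules can be sketched in Python as a weighted sum followed by a cutoff check. The weights, thresholds, and depth descriptions come from the table and notes; the 0 to 10 scale for each sub-score, the function names, and the example sub-scores are assumptions made for illustration, and the actual LQS implementation may differ.

```python
# Stage 2 weights as listed in the pipeline table; they sum to 1.0.
LQS_WEIGHTS = {
    "recency": 0.30,
    "citation": 0.25,
    "venue": 0.20,
    "institution": 0.10,
    "acceptance": 0.15,
}

# Discussion budget per citation depth, as described in the notes.
DEPTH_BUDGET = {
    "A": "1-3 paragraphs",
    "B": "2-5 sentences",
    "C": "single in-context citation",
    "D": "excluded",
}


def lqs(subscores: dict[str, float]) -> float:
    """Weighted sum of sub-scores (assumption: each sub-score lies in [0, 10])."""
    return sum(LQS_WEIGHTS[name] * subscores[name] for name in LQS_WEIGHTS)


def decision(score: float) -> str:
    """Map an LQS value to the must-cite / conditional / dropped buckets."""
    if score >= 7.0:
        return "must-cite"
    if score >= 5.0:
        return "conditional"
    return "dropped"


# Hypothetical sub-scores for a single paper.
example = {"recency": 9, "citation": 6, "venue": 8, "institution": 5, "acceptance": 10}
score = lqs(example)
print(round(score, 2), decision(score))  # 7.8 must-cite
```

Treating 7.0 as belonging to the must-cite bucket follows the "≥7.0" wording; how a paper's A/B/C depth is chosen is not specified in the source, so the sketch only records the discussion budget for each level.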
Skills invoked during the research pipeline.
| Skill | ID | Invocations | Phase | Purpose |
|---|---|---|---|---|
| paper_writing (open source) | — | 3 | Writing | Parent skill group: LaTeX generation, section structure, figure/table standards, compilation |
| — literature_survey (open source) | — | 12+ | Literature | Keyword generation, LQS scoring, citation depth classification, venue upgrade |
| — paper_structure (open source) | — | 6+ | Writing | Section outline, paragraph flow, cross-reference consistency, taxonomy design |
| — experiment_design (open source) | — | 2 | Experiment | Horizon scaling study design, CL×SI interaction experiment design |
| — figures_tables (open source) | — | 8+ | Writing | Figure layout, table formatting, caption generation, visualization standards |
| — peer_review_simulation (open source) | — | 14 | Review | Multi-persona scoring (5 reviewer types), iterative fix cycles (V1→V5) |
| Internal Skills (not publicly available) | ||||
| search_agent | #5 | 12+ | Literature & Verify | arXiv search, citation verification, DBLP cross-check, acceptance status lookup |
| call_api | #2 | 8+ | Review & Experiment | Multi-model peer review (3–5 reviewers × 5 rounds), horizon scaling experiment (3,300 API calls) |
| static_file_service | #6 | 4 | Deploy | PDF hosting, index.html generation, service restart |
| skill-router | #57 | 3 | Orchestration | Dynamic skill matching for sub-tasks (literature, experiment, deployment) |
| Deli_AutoResearch* | — | 3 | Orchestration | Master framework: anti-loop, heartbeat, state management, multi-track coordination |
| Total skill invocations | — | 68+ | — | Across 3 papers × 5+ review rounds × multi-stage pipeline |
paper_writing is the open-source skill group containing 5 sub-skills. Skills with IDs (#2, #5, #6, #57) depend on internal infrastructure and are not publicly available.
* Deli_AutoResearch is still under active development and does not yet have a stable public release.
The paper_writing skill is open source. Other skills in the pipeline are internal.