01
Literature Survey
4-stage pipeline: Recall → Score (LQS) → Classify (A/B/C/D) → Upgrade (arXiv→accepted).
IN: topic + taxonomy keywords
OUT: references.bib + citation_plan.jsonl
Stage 1: High-Recall Retrieval
- 20-30 keyword queries via search.py -o "site:arxiv.org ..."
- Each taxonomy cell: 3+ query variants (core terms, synonyms, method names)
- Snowball: seed paper citation networks
- Target: 200-500 raw candidates
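The per-cell query variants can be sketched as follows. This is a minimal illustration, not the pipeline's actual query builder: `build_queries`, the cell contents, and the use of `OR` between terms are all assumptions, and the code only builds query strings without calling search.py.

```python
def build_queries(cell: dict[str, list[str]], site: str = "site:arxiv.org") -> list[str]:
    """Return one query per variant group (core terms, synonyms, method names)."""
    queries = []
    for group in ("core", "synonyms", "methods"):
        terms = cell.get(group, [])
        if terms:
            # Quote multi-word terms so they are searched as phrases.
            quoted = [f'"{t}"' if " " in t else t for t in terms]
            # Joining with OR is an assumption about the search backend.
            queries.append(f"{site} " + " OR ".join(quoted))
    return queries


# Hypothetical taxonomy cell, for illustration only.
cell = {
    "core": ["tool use", "agents"],
    "synonyms": ["function calling"],
    "methods": ["ReAct", "Toolformer"],
}
queries = build_queries(cell)
```

With three variants per cell, roughly 7-10 taxonomy cells would reach the 20-30 query target.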
Stage 2: LQS Multi-Dimensional Scoring
| Dimension | Weight | Scoring |
| Recency | 30% | ≤6mo=10, ≤1yr=8, ≤2yr=5, ≤3yr=3 |
| Citation Impact | 25% | cites/mo: ≥50=10, ≥10=8, ≥3=6 |
| Venue | 20% | Top-tier=10, Strong=7, Workshop=4 |
| Institution | 10% | Top lab=10, Top uni=9 |
| Acceptance | 15% | Accepted=10, Under review=5, None=3 |
Thresholds: LQS ≥ 7.0 must-cite; 5.0 ≤ LQS < 7.0 conditional; LQS < 5.0 drop
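A worked example may make the scoring concrete. The sketch below assumes the LQS is the weighted sum of the five dimension scores (each on a 0-10 scale) and that a score of exactly 7.0 counts as must-cite; the function names and the sample paper are illustrative.

```python
# Weights from the scoring table; they sum to 1.0.
WEIGHTS = {
    "recency": 0.30,
    "citation_impact": 0.25,
    "venue": 0.20,
    "institution": 0.10,
    "acceptance": 0.15,
}


def lqs(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (assumed aggregation rule)."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def decision(score: float) -> str:
    """Map an LQS value to the citation decision."""
    if score >= 7.0:
        return "must-cite"
    if score >= 5.0:
        return "conditional"
    return "drop"


# Illustrative paper: under 6 months old (10), >=10 cites/month (8),
# top-tier venue (10), top university (9), accepted (10).
paper = {"recency": 10, "citation_impact": 8, "venue": 10,
         "institution": 9, "acceptance": 10}
score = lqs(paper)  # 3.0 + 2.0 + 2.0 + 0.9 + 1.5, about 9.4
```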
Stage 3: Citation Depth Classification
- A-level (1-3 paragraphs): section protagonist, 3-5 per chapter
- B-level (2-5 sentences): important insight, 5-10 per chapter
- C-level (1 sentence): supporting evidence
- D-level: dropped, not cited
Stage 4: Venue Upgrade
- Cross-check DBLP + OpenReview for acceptance status
- arXiv with "Accepted at X" → @inproceedings
- Target: arXiv-only ratio ≤ 60%
Verification
- Every 20 citations: check title match, authors, year, and venue
- Target: verification rate ≥80%, hallucinated = 0
- Distribution targets: ≥40% of citations published within the past year, ≥30% accepted at a venue
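The numeric targets from Stage 4 and Verification can be checked in one pass. A minimal sketch, assuming each rate is computed over all entries in references.bib; `BibStats`, `failed_targets`, and the counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BibStats:
    total: int          # all entries in references.bib
    verified: int       # entries whose title/author/year/venue were checked
    hallucinated: int   # entries that match no real paper
    within_1yr: int     # entries published within the past year
    accepted: int       # entries with a confirmed venue acceptance
    arxiv_only: int     # entries available only on arXiv


def failed_targets(s: BibStats) -> list[str]:
    """Return the names of all targets the bibliography misses."""
    checks = {
        "verification rate >= 80%": s.verified / s.total >= 0.80,
        "hallucinated = 0": s.hallucinated == 0,
        "within-1yr >= 40%": s.within_1yr / s.total >= 0.40,
        "accepted >= 30%": s.accepted / s.total >= 0.30,
        "arXiv-only <= 60%": s.arxiv_only / s.total <= 0.60,
    }
    return [name for name, ok in checks.items() if not ok]


# Placeholder counts, for illustration only.
stats = BibStats(total=200, verified=170, hallucinated=0,
                 within_1yr=90, accepted=70, arxiv_only=130)
problems = failed_targets(stats)  # only the arXiv-only ratio (65%) fails
```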
02
Paper Structure and Logic
Chapter architecture, paragraph logic chains, taxonomy design, formal claims, hedge language, abstract-conclusion alignment.
IN: bib + experiment findings
OUT: sections/*.tex (full manuscript)
Chapter Architecture (Survey Standard)
- §1 Introduction: Hook → Gap → Contributions → Roadmap
- §2 Background: formal definitions, taxonomy overview
- §3-6 Core: one method family per chapter, with critical assessment
- §7 Benchmarks + Experiments
- §8 Future: specific open problems (Barrier + Attack vector)
- §9 Conclusion: numbered key findings (not a repeat of the abstract)
Paragraph Logic Patterns
| Pattern | Structure | Use Case |
| Claim-Evidence-Implication | Assert → Data → So what | Main body |
| Compare-Contrast | A → B → Difference → Trade-off | Method comparison |
| Concession-Rebuttal | Admit strength → But limitation | Critical analysis |
| Funnel | Broad → Narrow → This paper | Introduction |
Taxonomy Design
- Multi-axis matrix (not flat list)
- MECE: mutually exclusive, collectively exhaustive
- Must have empty cells → gap analysis material
- Spanning methods show taxonomy tension (good)
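Empty cells and spanning methods fall out naturally once the taxonomy is stored as a matrix. The sketch below uses two hypothetical axes and placeholder method names; it is one possible representation, not a prescribed one.

```python
from itertools import product

# Hypothetical axes and method placements, for illustration only.
axes = {
    "learning_signal": ["supervised", "reinforcement"],
    "scope": ["single-agent", "multi-agent"],
}
methods = {
    "Method A": [("supervised", "single-agent")],
    # Method B spans two cells, which signals taxonomy tension.
    "Method B": [("reinforcement", "single-agent"),
                 ("reinforcement", "multi-agent")],
}

# Build every cell of the matrix, then place each method in its cells.
cells = {cell: [] for cell in product(*axes.values())}
for name, placements in methods.items():
    for cell in placements:
        cells[cell].append(name)

empty_cells = [cell for cell, names in cells.items() if not names]
spanning = [name for name, placements in methods.items() if len(placements) > 1]
```

Here `empty_cells` holds the one unoccupied combination, which becomes gap-analysis material, and `spanning` lists the methods that sit in more than one cell.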
Formal Claims
- Default: Conjecture + Remark (not Theorem)
- Hedge ladder: demonstrates > suggests > may > hypothesize
- Rule: claim strength ≤ evidence strength
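The rule "claim strength ≤ evidence strength" can be made mechanical if evidence is graded on the same four-step scale as the hedge ladder. That grading (3 = strongest, 0 = weakest) is an assumption made for this sketch, as are the function names.

```python
# Ladder from strongest to weakest wording, as listed above.
HEDGE_LADDER = ["demonstrates", "suggests", "may", "hypothesize"]


def strongest_allowed(evidence_level: int) -> str:
    """Strongest verb permitted for an evidence level (hypothetical 0-3 scale)."""
    level = max(0, min(evidence_level, 3))  # clamp to the ladder's range
    return HEDGE_LADDER[3 - level]


def claim_ok(verb: str, evidence_level: int) -> bool:
    """True when the verb is no stronger than the evidence allows."""
    allowed = HEDGE_LADDER.index(strongest_allowed(evidence_level))
    return HEDGE_LADDER.index(verb) >= allowed
```

For example, `claim_ok("demonstrates", 1)` is false because level-1 evidence supports "may" at most, while `claim_ok("suggests", 3)` is true.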
Related Work Differentiation
- Mandatory comparison table with existing surveys
- "We're more recent" is NOT sufficient differentiation
- Need structural novelty: new taxonomy, new angle, new experiment
03
Experiment Design
4-stage loop: Design (hypothesis) → Execute (API/GPU) → Iterate (adjust) → Report (structured JSON).
IN: conjecture or gap
OUT: results.json + experiment_summary.md
Stage 1: Design (Most Important)
- Must answer: "which paper claim does this support?"
- Experiment spec: hypothesis, independent/dependent vars, control vars, expected results
- Statistical plan decided BEFORE running (no HARKing)
- Principles: falsifiable, minimal first, pre-registered, has control
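One way to enforce the design checklist is to record the spec as an immutable object before any run starts. This is a sketch under that assumption; `ExperimentSpec`, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)  # frozen: the plan cannot be edited after creation
class ExperimentSpec:
    supports_claim: str        # which paper claim this experiment supports
    hypothesis: str
    independent_vars: list[str]
    dependent_vars: list[str]
    control_vars: list[str]
    expected_result: str
    statistical_plan: str      # fixed before running, to avoid HARKing

    def missing_fields(self) -> list[str]:
        """Names of fields left empty; a complete spec returns []."""
        return [name for name, value in vars(self).items() if not value]


# Placeholder content, for illustration only.
spec = ExperimentSpec(
    supports_claim="Conjecture 1",
    hypothesis="Condition B improves task success over condition A",
    independent_vars=["condition"],
    dependent_vars=["task success rate"],
    control_vars=["model", "task set"],
    expected_result="B > A",
    statistical_plan="paired test across tasks, alpha = 0.05",
)
incomplete = replace(spec, statistical_plan="")
```

`spec.missing_fields()` is empty, while `incomplete.missing_fields()` flags the absent statistical plan.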
Stage 2: Execute
| Path | Scale | Use Case |
| Path A: API | Hours, lightweight | Multi-model comparison, prompt ablation |
| Path B: GPU RL | Days, heavyweight | Agent training, reward shaping |
- API: 3-5 frontier models × 2-3 conditions × 15-25 tasks × 3 trials
- GPU: via cluster job submission + auto-monitoring loop
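The API path's grid size follows directly from the ranges above; the helper name is illustrative.

```python
def total_runs(models: int, conditions: int, tasks: int, trials: int = 3) -> int:
    """Number of API runs in a full models x conditions x tasks x trials grid."""
    return models * conditions * tasks * trials


low = total_runs(models=3, conditions=2, tasks=15)   # 3 * 2 * 15 * 3 = 270
high = total_runs(models=5, conditions=3, tasks=25)  # 5 * 3 * 25 * 3 = 1125
```

So a Path A experiment spans roughly 270 to 1,125 runs, which is consistent with the "hours, lightweight" scale.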
Stage 3: Iterate
- Ceiling effect → increase difficulty
- Floor effect → decrease difficulty or check for bugs
- Not significant → increase trials or change hypothesis
- Surprise finding → design follow-up
- Max 5 iterations, then accept best result
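The iteration rules can be read as a small decision function. In this sketch the ceiling, floor, and significance thresholds are assumed values, not part of the spec, and surprise findings are left to human judgment because they cannot be detected from two numbers.

```python
def next_action(mean_score: float, p_value: float, iteration: int,
                ceiling: float = 0.95, floor: float = 0.05,
                alpha: float = 0.05) -> str:
    """Pick the next step after an experiment round (thresholds are assumptions)."""
    if iteration >= 5:
        return "stop: accept best result"
    if mean_score >= ceiling:
        return "increase difficulty"
    if mean_score <= floor:
        return "decrease difficulty or check for bugs"
    if p_value >= alpha:
        return "increase trials or change hypothesis"
    return "proceed to report"
```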
Stage 4: Report (Data Only)
- Output: results.json (schema: config + results + statistics + findings)
- Output: experiment_summary.md (purpose, results, limitations)
- Does NOT produce LaTeX tables or figures; that is the job of the figures skill (#4)
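A possible shape for results.json, based only on the four top-level keys named above. Everything below those keys is an assumption, and the values are left as `None` or empty rather than filled with invented data.

```python
import json

REQUIRED_KEYS = ("config", "results", "statistics", "findings")

# Skeleton payload; nested field names are hypothetical.
results = {
    "config": {"models": [], "conditions": [], "tasks": None, "trials": 3},
    "results": [
        {"model": None, "condition": None, "mean": None, "std": None},
    ],
    "statistics": {"test": None, "p_value": None},
    "findings": [],
}


def missing_keys(payload: dict) -> list[str]:
    """Top-level schema keys absent from a payload."""
    return [key for key in REQUIRED_KEYS if key not in payload]


text = json.dumps(results, indent=2)  # what would be written to results.json
```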
04
Academic Figure and Table Design
High information-density tables and vector figures. Presentation layer for all data in the paper.
IN: results.json + section placeholders
OUT: figures/*.pdf + tables/*.tex
Table Types
| Type | Use | Info Density |
| Comparison Matrix | Methods × features | Very high |
| Benchmark Table | Models × metrics | High |
| Ablation Table | Conditions × results | High |
| Taxonomy Table | Classification visualization | Medium |
| Meta-analysis | Aggregated cross-paper data | Very high |
Table Rules
- No vertical lines — booktabs three-line style only
- Alternating row color: \rowcolor{gray!6}
- Bold best results in each column
- All experimental data: mean ± std
- Caption must contain key finding, not just description
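The table rules can be applied by a small generator that turns (mean, std) pairs into booktabs rows. This is a sketch with hypothetical helper names and placeholder numbers, not real results; the emitted LaTeX assumes the `booktabs` package and `xcolor` loaded with the `table` option.

```python
def format_column(cells: dict[str, tuple[float, float]],
                  higher_is_better: bool = True) -> dict[str, str]:
    """Format one metric column as 'mean ± std' cells, bolding the best mean."""
    pick = max if higher_is_better else min
    best = pick(mean for mean, _ in cells.values())
    out = {}
    for row, (mean, std) in cells.items():
        text = f"{mean:.1f} $\\pm$ {std:.1f}"
        out[row] = f"\\textbf{{{text}}}" if mean == best else text
    return out


def render_table(rows: dict[str, str], header: tuple[str, str]) -> str:
    """Three-line booktabs table, no vertical rules, every second row shaded."""
    lines = ["\\begin{tabular}{lc}", "\\toprule",
             f"{header[0]} & {header[1]} \\\\", "\\midrule"]
    for i, (name, cell) in enumerate(rows.items()):
        prefix = "\\rowcolor{gray!6} " if i % 2 == 1 else ""
        lines.append(f"{prefix}{name} & {cell} \\\\")
    lines += ["\\bottomrule", "\\end{tabular}"]
    return "\n".join(lines)


# Placeholder numbers, for illustration only.
column = format_column({"Method A": (71.2, 1.3), "Method B": (74.8, 0.9)})
table = render_table(column, header=("Method", "Accuracy"))
```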
Figure Types & Tools
- Data-driven (curves, bars, heatmaps): matplotlib → PDF
- Architecture/flow diagrams: TikZ or SVG→PDF
- Simple schematics: PIL → PNG (acceptable per reviewer feedback)
- Priority: TikZ > matplotlib PDF > SVG→PDF > PIL PNG
Quality Checklist
- Vector format (PDF) preferred, PNG ≥ 300 DPI
- Font size ≥ 10pt after scaling
- Academic palette: blue #2196F3, red #F44336, green #4CAF50, orange #FF9800
- All axes labeled, all lines have legend
- Light grid (alpha=0.3) for readability
- Self-contained: understandable without reading main text
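Several checklist items map onto matplotlib's rcParams keys, and the "after scaling" font rule is simple arithmetic. The sketch below only defines the settings and the scaling helper, so it runs without matplotlib installed; `effective_font_pt` and the example widths are assumptions.

```python
# Palette from the checklist.
PALETTE = {"blue": "#2196F3", "red": "#F44336",
           "green": "#4CAF50", "orange": "#FF9800"}

# Settings keyed by matplotlib rcParams names.
RC_PARAMS = {
    "font.size": 10,
    "axes.grid": True,
    "grid.alpha": 0.3,        # light grid
    "savefig.format": "pdf",  # vector output by default
    "savefig.dpi": 300,       # applies when a raster format is used
}


def effective_font_pt(font_pt: float, figure_width_in: float,
                      printed_width_in: float) -> float:
    """Font size after the figure is scaled to its printed width."""
    return font_pt * printed_width_in / figure_width_in


# A 10 pt label on an 8 in wide figure printed at 6 in shrinks to 7.5 pt,
# so the source font would need to be about 13.3 pt to stay at 10 pt.
shrunk = effective_font_pt(10, figure_width_in=8.0, printed_width_in=6.0)
```

With matplotlib available, `matplotlib.rcParams.update(RC_PARAMS)` would apply these settings to subsequent figures.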
Quantity Targets
- Full survey (50+ pages): ≥10 tables, ≥6 figures
- Short survey (30 pages): ≥5 tables, ≥3 figures
05
Peer Review Simulation
Multi-persona scoring that drives the iteration loop by routing weaknesses back to sub-skills #1-4.
IN: compiled PDF
OUT: score + weakness list → routed to corresponding sub-skill
Reviewer Personas (3-5 per round)
| Persona | Focus | Scoring Weight |
| R1 Experimentalist | Statistical rigor, baselines, replication | Experimental 30% |
| R2 Theorist | Formal definitions, proofs, MECE taxonomy | Technical depth 35% |
| R3 Perfectionist | Writing quality, figures, formatting | Clarity 30% |
| R4 Synthesizer | Cross-cutting analysis, gap identification | Novelty 25% |
| R5 Newcomer | Accessibility, definitions, examples | Clarity 35% |
Scoring Protocol
- Each reviewer scores independently (no anchoring)
- Final score = median of all reviewers
- Dimensions: Novelty, Comprehensiveness, Clarity, Technical Depth, Experimental Validation
- Calibration: 6.0=workshop, 7.0=main conference, 8.0=Strong Accept (top 20%), 9.0=Oral
Anti-Inflation Rules
- First round score capped at 7.0 (every paper has room to improve)
- Max +1.5 per round
- At least 1 "unresolved" weakness must remain
- A different LLM for at least 1 reviewer per round (diversity)
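The median rule and the two numeric caps combine into one scoring step per round. A minimal sketch with placeholder reviewer scores; it assumes the per-round limit is applied relative to the previous round's final score.

```python
from statistics import median
from typing import Optional

FIRST_ROUND_CAP = 7.0
MAX_GAIN_PER_ROUND = 1.5


def round_score(reviewer_scores: list[float],
                previous: Optional[float] = None) -> float:
    """Median of independent reviewer scores, limited by the anti-inflation caps."""
    raw = median(reviewer_scores)
    if previous is None:
        # First round: cap at 7.0.
        return min(raw, FIRST_ROUND_CAP)
    # Later rounds: at most +1.5 over the previous final score.
    return min(raw, previous + MAX_GAIN_PER_ROUND)


# Placeholder scores, for illustration only.
r1 = round_score([7.5, 8.0, 6.5])               # median 7.5, capped to 7.0
r2 = round_score([9.0, 8.0, 9.5], previous=r1)  # median 9.0, limited to 8.5
```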
Output Format
- Overall score + per-dimension scores
- 3-5 Strengths, 3-5 Weaknesses (prioritized Major/Minor)
- Concrete suggestions (actionable)
- Recommendation: Accept / Weak Accept / Borderline / Reject
- Regression check: are previously-fixed weaknesses still fixed?
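Routing weaknesses back to sub-skills #1-4 could look like the sketch below. The area labels and their mapping to sub-skill numbers are hypothetical, chosen to mirror the four sections above, and the sample weaknesses are placeholders.

```python
from dataclasses import dataclass

# Hypothetical mapping from weakness area to sub-skill number.
ROUTES = {"literature": 1, "structure": 2, "experiments": 3, "figures": 4}


@dataclass
class Weakness:
    text: str
    severity: str  # "Major" or "Minor"
    area: str      # key into ROUTES


def route(weaknesses: list[Weakness]) -> dict[int, list[str]]:
    """Group weaknesses by target sub-skill, with Major items first."""
    # sorted() is stable, so items keep their order within each severity.
    ordered = sorted(weaknesses, key=lambda w: w.severity != "Major")
    routed: dict[int, list[str]] = {}
    for w in ordered:
        routed.setdefault(ROUTES[w.area], []).append(w.text)
    return routed


# Placeholder weaknesses, for illustration only.
review = [
    Weakness("Figure 3 axes unlabeled", "Minor", "figures"),
    Weakness("No baseline comparison", "Major", "experiments"),
    Weakness("Table 2 lacks std", "Major", "figures"),
]
routed = route(review)
```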