# Selective Internalization Model

**Repository Path**: knifecms/sim

## Basic Information

- **Project Name**: Selective Internalization Model
- **Description**: A self-evolving agent based on "Selective Internalization"
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-02
- **Last Updated**: 2026-05-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Selective Internalization Overlay

A model-agnostic framework for capability evolution in closed-model LLM agents. The system externalizes adaptation into structured capability carriers (knowledge, skill, tool) and reuses them through runtime routing, without requiring access to model weights. Promotion governance uses overlap-based impact analysis and layered policy resolution to prevent harmful duplication.

**Key result**: On a 15-case benchmark, the governed overlay achieves a 0.80 after-success rate with a 0.00 harmful-promotion rate, compared to 1.00 and 0.20, respectively, under ungated auto-promotion.

## Architecture

```
Task → Router → Registry → Context Compiler → Executor (model backend)
                   ↑
          Experience Buffer ← Evaluation Suite
          Gap Detector → Synthesizer → Candidate
                                          ↓
                               Promotion Governance
          (overlap scoring → impact verdict → policy resolution)
```

The framework has six stages:

1. **Capability Registry**: stores active knowledge/skill/tool carriers with typed metadata (domain, type, triggers, content)
2. **Capability Router**: selects relevant carriers for a given task using trigger matching and context constraints
3. **Context Compiler**: assembles carrier content into the model prompt
4. **Executor**: calls the backend (an OpenAI-compatible API, or an echo backend for benchmarking)
5. **Learning Loop**: evaluates outcomes, detects capability gaps, and synthesizes candidate carriers
6. **Promotion Governance**: evaluates candidates against active carriers via overlap scoring and layered policy

## Installation

```bash
pip install -e .
```

Or install dependencies directly:

```bash
pip install openai pydantic
```

## Quick Start

```bash
# Run the agent with overlay enabled
python -m agent run --task "reconcile monthly sales figures" --provider openai

# Run a single benchmark case
python -m agent run --case docs/benchmark-spec.json --mode governed

# Run experiment suite
python -m agent experiment run --benchmark docs/benchmark-spec.json --modes prompt_only naive_auto_promotion overlay_without_governance overlay_with_governance
```

## Project Structure

```
sim/
├── agent/
│   ├── capability/              # Carrier definitions, registry
│   │   ├── manifest.py          # PackManifest dataclass
│   │   ├── registry.py          # PackRegistry (active carriers)
│   │   └── packs/               # Default shipped carriers (JSON)
│   ├── config/
│   │   └── approval_policy.json # Governance policy config
│   ├── experiments/             # Benchmark runner, aggregator, analysis
│   │   ├── benchmark.py         # 15-case benchmark definition
│   │   ├── runner.py            # Experiment execution
│   │   ├── aggregator.py        # Mode-level aggregation
│   │   └── analysis.py          # Per-case and family-level analysis
│   ├── learning/                # Gap detection, synthesis, governance
│   │   ├── gap_detector.py      # Weighted signal gap scoring
│   │   ├── synthesizer.py       # Candidate carrier creation
│   │   ├── approval_policy.py   # Layered policy resolver
│   │   └── promoter.py          # Promotion/block decision
│   ├── runtime/                 # Routing, context, execution
│   │   ├── router.py            # CapabilityRouter
│   │   ├── context_compiler.py
│   │   ├── executor.py          # OverlayExecutor + backends
│   │   └── trace.py             # ExecutionTrace for audit
│   ├── cli.py                   # CLI entry point
│   └── service.py               # AgentService orchestration
├── docs/
│   ├── adr/                     # Architecture Decision Records
│   ├── paper/                   # PeerJ manuscript, figures, supplementary
│   ├── plans/                   # Planning documents
│   └── results/                 # Timestamped experiment outputs
├── tests/                       # Unit and integration tests
└── .gitignore
```

## Benchmark

The benchmark (`docs/benchmark-spec.json`) contains 15 cases across 5 families:

| Family | Description | Cases |
|--------|-------------|-------|
| `knowledge_internalization` | Repeated explanation failures → reusable knowledge | 3 |
| `tool_internalization` | Repeated manual workflows → deterministic tool | 3 |
| `skill_internalization` | Repeated output-structure failures → reusable skill | 3 |
| `governance_stress` | Duplicate/superseding candidates that pressure policy | 3 |
| `provider_transfer` | Cross-provider cases testing model-agnostic reuse | 3 |

The benchmark backend is deterministic, so no API calls are required.

## Key Design Decisions

**Gap detection** uses a weighted combination of five signals (repeated_failures 0.35, evaluation_failed 0.25, novelty 0.20, uncertainty 0.10, user_corrections 0.10) with a threshold of 0.65, so that noise from a single failure does not trigger synthesis.

**Overlap scoring** combines domain match (0.4), type match (0.2), trigger Jaccard (0.2), and content-token Jaccard (0.2) into a single score, which is mapped to one of four verdict levels: duplicate_risk, likely_supersedes, moderate_overlap, low_risk.

**Layered policy resolution** applies specificity ordering: impact verdict rules override carrier-type rules, which override provider rules, which override domain rules, which override defaults. The `operations` domain and `tool` carrier type receive stricter default thresholds.

## Reproducibility

Experiment runs are archived with full result logs:

```bash
python -m agent experiment run --benchmark docs/benchmark-spec.json
# Output: docs/results/YYYY-MM-DD/HHMMSSZ/report.json + results.csv
```

## License

CC BY 4.0, the same license as the PeerJ publication.
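
## Scoring Sketch

The gap-detection and overlap-scoring weights listed under Key Design Decisions can be illustrated with a minimal sketch. This is not the code in `gap_detector.py` or `promoter.py`. It assumes each gap signal is normalized to [0, 1], that the 0.65 threshold is inclusive, that carriers are plain dicts with the `domain`, `type`, `triggers`, and `content` fields named above, and that content is tokenized by lowercase alphanumeric runs. The function names are hypothetical, and the score-to-verdict cutoffs are omitted because the README does not state them.

```python
import re

# Weights and threshold as stated in "Key Design Decisions".
GAP_WEIGHTS = {
    "repeated_failures": 0.35,
    "evaluation_failed": 0.25,
    "novelty": 0.20,
    "uncertainty": 0.10,
    "user_corrections": 0.10,
}
GAP_THRESHOLD = 0.65


def gap_score(signals):
    """Weighted sum of signals; assumes each signal lies in [0, 1] (clamped here)."""
    return sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in GAP_WEIGHTS.items()
    )


def should_synthesize(signals):
    """Assumption: the threshold comparison is inclusive (>=)."""
    return gap_score(signals) >= GAP_THRESHOLD


def jaccard(a, b):
    """Jaccard similarity of two collections; two empty sets score 0.0 by assumption."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tokens(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(candidate, active):
    """Combine the four components with the stated weights (0.4, 0.2, 0.2, 0.2)."""
    domain_match = 1.0 if candidate["domain"] == active["domain"] else 0.0
    type_match = 1.0 if candidate["type"] == active["type"] else 0.0
    trigger_sim = jaccard(candidate["triggers"], active["triggers"])
    content_sim = jaccard(tokens(candidate["content"]), tokens(active["content"]))
    return 0.4 * domain_match + 0.2 * type_match + 0.2 * trigger_sim + 0.2 * content_sim


# Illustrative carriers (made-up data, not shipped packs).
active = {
    "domain": "operations",
    "type": "tool",
    "triggers": ["reconcile", "invoice"],
    "content": "reconcile monthly invoice totals",
}
candidate = {
    "domain": "operations",
    "type": "tool",
    "triggers": ["reconcile", "sales"],
    "content": "reconcile monthly sales figures",
}

if __name__ == "__main__":
    # A single failed evaluation alone stays below the threshold.
    print(should_synthesize({"evaluation_failed": 1.0}))
    print(round(overlap_score(candidate, active), 3))
```

Under these assumptions, a single failed evaluation contributes only 0.25 and cannot trigger synthesis on its own. Even repeated failures plus a failed evaluation sum to 0.60, so at least one further signal is needed to reach 0.65.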