# Selective Internalization Model

**Repository Path**: knifecms/sim

## Basic Information

- **Project Name**: Selective Internalization Model
- **Description**: A self-evolving agent based on "Selective Internalization"
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-02
- **Last Updated**: 2026-05-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Selective Internalization Overlay

A model-agnostic framework for capability evolution in closed-model LLM agents. The system externalizes adaptation into structured capability carriers (knowledge, skill, tool) and reuses them through runtime routing, without requiring access to model weights. Promotion governance uses overlap-based impact analysis and layered policy resolution to prevent harmful duplication.

**Key result**: On a 15-case benchmark, the governed overlay achieves an after-success rate of 0.80 with a harmful-promotion rate of 0.00, compared to 1.00 and 0.20 respectively under ungated auto-promotion.

## Architecture

```
Task → Router → Registry → Context Compiler → Executor (model backend)
                                              ↑
                          Experience Buffer ← Evaluation Suite
                          Gap Detector → Synthesizer → Candidate
                          ↓
                      Promotion Governance
                      (overlap scoring → impact verdict → policy resolution)
```

The framework has six stages:

1. **Capability Registry** — stores active knowledge/skill/tool carriers with typed metadata (domain, type, triggers, content)
2. **Capability Router** — selects relevant carriers for a given task using trigger matching and context constraints
3. **Context Compiler** — assembles carrier content into the model prompt
4. **Executor** — calls the backend (OpenAI-compatible API or echo for benchmarking)
5. **Learning Loop** — evaluates outcomes, detects capability gaps, synthesizes candidate carriers
6. **Promotion Governance** — evaluates candidates against active carriers via overlap scoring and layered policy
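
The first three stages can be illustrated with a minimal sketch. It assumes that trigger matching is a case-insensitive substring test and that the only context constraint is an optional domain filter. `Carrier`, `route`, and `compile_context` are hypothetical names for illustration, not the repository's actual `PackManifest`, `CapabilityRouter`, or context compiler APIs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Carrier:
    """Hypothetical carrier record; fields mirror the metadata listed above."""
    name: str
    domain: str
    type: str  # "knowledge", "skill", or "tool"
    triggers: tuple[str, ...]
    content: str


def route(task: str, carriers: list[Carrier], domain: str | None = None) -> list[Carrier]:
    """Select carriers with at least one trigger in the task text.

    Assumption: a trigger matches when it occurs as a case-insensitive substring.
    """
    text = task.lower()
    selected = []
    for carrier in carriers:
        if domain is not None and carrier.domain != domain:
            continue  # context constraint: skip carriers outside the requested domain
        if any(trigger.lower() in text for trigger in carrier.triggers):
            selected.append(carrier)
    return selected


def compile_context(task: str, carriers: list[Carrier]) -> str:
    """Assemble selected carrier content ahead of the task in a single prompt."""
    sections = [f"[{c.type}:{c.name}]\n{c.content}" for c in carriers]
    return "\n\n".join(sections + [f"Task: {task}"])
```

Because the carriers live outside the model, the same registry contents can be routed and compiled for any backend that accepts a text prompt.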

## Installation

```bash
pip install -e .
```

Or install dependencies directly:

```bash
pip install openai pydantic
```

## Quick Start

```bash
# Run the agent with overlay enabled
python -m agent run --task "reconcile monthly sales figures" --provider openai

# Run a single benchmark case
python -m agent run --case docs/benchmark-spec.json --mode governed

# Run experiment suite
python -m agent experiment run --benchmark docs/benchmark-spec.json --modes prompt_only naive_auto_promotion overlay_without_governance overlay_with_governance
```

## Project Structure

```
sim/
├── agent/
│   ├── capability/         # Carrier definitions, registry
│   │   ├── manifest.py     # PackManifest dataclass
│   │   ├── registry.py     # PackRegistry (active carriers)
│   │   └── packs/          # Default shipped carriers (JSON)
│   ├── config/
│   │   └── approval_policy.json  # Governance policy config
│   ├── experiments/        # Benchmark runner, aggregator, analysis
│   │   ├── benchmark.py    # 15-case benchmark definition
│   │   ├── runner.py       # Experiment execution
│   │   ├── aggregator.py   # Mode-level aggregation
│   │   └── analysis.py     # Per-case and family-level analysis
│   ├── learning/           # Gap detection, synthesis, governance
│   │   ├── gap_detector.py # Weighted signal gap scoring
│   │   ├── synthesizer.py  # Candidate carrier creation
│   │   ├── approval_policy.py  # Layered policy resolver
│   │   └── promoter.py     # Promotion/block decision
│   ├── runtime/            # Routing, context, execution
│   │   ├── router.py       # CapabilityRouter
│   │   ├── context_compiler.py
│   │   ├── executor.py     # OverlayExecutor + backends
│   │   └── trace.py        # ExecutionTrace for audit
│   ├── cli.py              # CLI entry point
│   └── service.py          # AgentService orchestration
├── docs/
│   ├── adr/                # Architecture Decision Records
│   ├── paper/              # PeerJ manuscript, figures, supplementary
│   ├── plans/              # Planning documents
│   └── results/            # Timestamped experiment outputs
├── tests/                  # Unit and integration tests
└── .gitignore
```

## Benchmark

The benchmark (`docs/benchmark-spec.json`) contains 15 cases across 5 families:

| Family | Description | Cases |
|--------|-------------|-------|
| `knowledge_internalization` | Repeated explanation failures → reusable knowledge | 3 |
| `tool_internalization` | Repeated manual workflows → deterministic tool | 3 |
| `skill_internalization` | Repeated output-structure failures → reusable skill | 3 |
| `governance_stress` | Duplicate/superseding candidates that pressure policy | 3 |
| `provider_transfer` | Cross-provider cases testing model-agnostic reuse | 3 |

The benchmark backend is deterministic, so no API calls are required.

## Key Design Decisions

**Gap detection** uses a weighted combination of five signals (repeated_failures 0.35, evaluation_failed 0.25, novelty 0.20, uncertainty 0.10, user_corrections 0.10) with a threshold of 0.65 to prevent single-failure noise from triggering synthesis.
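
As a rough illustration of the weighted combination, the sketch below uses the weights and threshold stated above. It assumes each signal is normalized to the range [0, 1], that missing signals count as 0, and that the threshold comparison is inclusive. None of these details are confirmed by this README, and the function names are hypothetical.

```python
# Weights and threshold as stated in this README.
GAP_WEIGHTS = {
    "repeated_failures": 0.35,
    "evaluation_failed": 0.25,
    "novelty": 0.20,
    "uncertainty": 0.10,
    "user_corrections": 0.10,
}
GAP_THRESHOLD = 0.65


def gap_score(signals: dict) -> float:
    """Weighted sum of signals; each value is clamped to [0, 1] (assumption)."""
    return sum(
        weight * min(max(float(signals.get(name, 0.0)), 0.0), 1.0)
        for name, weight in GAP_WEIGHTS.items()
    )


def should_synthesize(signals: dict) -> bool:
    """Trigger synthesis only when the combined score reaches the threshold."""
    return gap_score(signals) >= GAP_THRESHOLD
```

Under these assumptions, a lone failed evaluation scores 0.25 and stays below the threshold, while repeated failures plus a failed evaluation and moderate novelty (0.5) score about 0.70 and cross it.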

**Overlap scoring** combines domain match (0.4), type match (0.2), trigger Jaccard (0.2), and content-token Jaccard (0.2) into a single score that feeds into four verdict levels: duplicate_risk, likely_supersedes, moderate_overlap, low_risk.
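
The score itself can be sketched as follows, using the four weights stated above. The tokenizer (lowercased alphanumeric runs), the treatment of two empty sets as zero similarity, and the dictionary-based carrier shape are assumptions made for illustration. The mapping from score to verdict level is left out because this README does not give the cutoffs.

```python
import re


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets give 0.0 (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(candidate: dict, active: dict) -> float:
    """Combine domain, type, trigger, and content similarity with the stated weights."""
    domain_match = 1.0 if candidate["domain"] == active["domain"] else 0.0
    type_match = 1.0 if candidate["type"] == active["type"] else 0.0
    trigger_sim = jaccard(set(candidate["triggers"]), set(active["triggers"]))
    content_sim = jaccard(tokens(candidate["content"]), tokens(active["content"]))
    return 0.4 * domain_match + 0.2 * type_match + 0.2 * trigger_sim + 0.2 * content_sim
```

Since domain match carries twice the weight of any other component, two carriers in the same domain start at 0.4 before type, triggers, or content are compared.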

**Layered policy resolution** applies specificity ordering: impact verdict rules override carrier-type rules, which override provider rules, which override domain rules, which override defaults. The `operations` domain and `tool` carrier type receive stricter default thresholds.
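
One way to picture the specificity ordering is to merge rule layers from least to most specific, so that later layers overwrite earlier ones. The sketch below is an assumption-laden illustration: the layer names, the nested dictionary layout, and the `min_score` and `require_review` settings are hypothetical and do not reflect the actual schema of `approval_policy.json`.

```python
# Least specific first; each later layer overrides the ones before it.
LAYER_ORDER = ("domain", "provider", "carrier_type", "impact_verdict")


def resolve_policy(policy: dict, *, domain: str, provider: str,
                   carrier_type: str, impact_verdict: str) -> dict:
    """Merge defaults with matching rules, most specific layer applied last."""
    keys = {
        "domain": domain,
        "provider": provider,
        "carrier_type": carrier_type,
        "impact_verdict": impact_verdict,
    }
    resolved = dict(policy.get("defaults", {}))
    for layer in LAYER_ORDER:
        # A layer contributes only if it has a rule for this candidate's key.
        resolved.update(policy.get(layer, {}).get(keys[layer], {}))
    return resolved


# Hypothetical policy: stricter settings for the operations domain and tool carriers.
EXAMPLE_POLICY = {
    "defaults": {"min_score": 0.5, "require_review": False},
    "domain": {"operations": {"min_score": 0.7}},
    "carrier_type": {"tool": {"min_score": 0.8}},
    "impact_verdict": {"duplicate_risk": {"require_review": True}},
}
```

With this example policy, a `tool` candidate in the `operations` domain resolves to the carrier-type threshold rather than the domain threshold, and a `duplicate_risk` verdict additionally switches on review regardless of the lower layers.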

## Reproducibility

Experiment runs are archived with full result logs:

```bash
python -m agent experiment run --benchmark docs/benchmark-spec.json
# Output: docs/results/YYYY-MM-DD/HHMMSSZ/report.json + results.csv
```

## License

CC BY 4.0 — same license as the PeerJ publication.