# memorybench **Repository Path**: joo2019/memorybench ## Basic Information - **Project Name**: memorybench - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-05-06 - **Last Updated**: 2026-05-06 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # MemoryBench A pluggable benchmarking framework for evaluating memory and context systems. original ## Features - πŸ”Œ Interoperable: mix and match any provider with any benchmark - 🧩 Bring your own benchmarks: plug in custom datasets and tasks - ♻️ Checkpointed runs: resume from any pipeline stage (ingest β†’ index β†’ search β†’ answer β†’ evaluate) - πŸ†š Multi‑provider comparison: run the same benchmark across providers side‑by‑side - πŸ§ͺ Judge‑agnostic: swap GPT‑4o, Claude, Gemini, etc. without code changes - πŸ“Š Structured reports: export run status, failures, and metrics for analysis - πŸ–₯️ Web UI: inspect runs, questions, and failures interactively, in real-time! ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Benchmarks β”‚ β”‚ Providers β”‚ β”‚ Judges β”‚ β”‚ (LoCoMo, β”‚ β”‚ (Supermem, β”‚ β”‚ (GPT-4o, β”‚ β”‚ LongMem..) β”‚ β”‚ Mem0, Zep) β”‚ β”‚ Claude..) β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ MemoryBench β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Ingest β”‚ Indexingβ”‚ Search β”‚ Answer β”‚Evaluateβ”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ## Quick Start ```bash bun install cp .env.example .env.local # Add your API keys bun run src/index.ts run -p supermemory -b locomo ``` ## Configuration ```bash # Providers (at least one) SUPERMEMORY_API_KEY= MEM0_API_KEY= ZEP_API_KEY= # Judges (at least one) OPENAI_API_KEY= ANTHROPIC_API_KEY= GOOGLE_API_KEY= ``` ## Commands | Command | Description | |---------|-------------| | `run` | Full pipeline: ingest β†’ index β†’ search β†’ answer β†’ evaluate β†’ report | | `compare` | Run benchmark across multiple providers simultaneously | | `ingest` | Ingest benchmark data into provider | | `search` | Run search phase only | | `test` | Test single question | | `status` | Check run progress | | `list-questions` | Browse benchmark questions | | `show-failures` | Debug failed questions | | `serve` | Start web UI | | `help` | Show help (`help providers`, `help models`, `help benchmarks`) | ## Options ``` -p, --provider Memory provider (supermemory, mem0, zep) -b, --benchmark Benchmark (locomo, longmemeval, convomem) -j, --judge Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.) -r, --run-id Run identifier (auto-generated if omitted) -m, --answering-model Model for answer generation (default: gpt-4o) -l, --limit Limit number of questions -q, --question-id Specific question (for test command) --force Clear checkpoint and restart ``` ## Examples ```bash # Full run bun run src/index.ts run -p mem0 -b locomo # With custom run ID bun run src/index.ts run -p mem0 -b locomo -r my-test # Resume existing run bun run src/index.ts run -r my-test # Limited questions bun run src/index.ts run -p supermemory -b locomo -l 10 # Different models bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash # Compare multiple providers bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5 # Test single question bun run src/index.ts test -r my-test -q question_42 # Debug bun run src/index.ts status -r my-test bun run src/index.ts show-failures -r my-test ``` ## Pipeline ``` 1. INGEST Load benchmark sessions β†’ Push to provider 2. INDEX Wait for provider indexing 3. SEARCH Query provider β†’ Retrieve context 4. ANSWER Build prompt β†’ Generate answer via LLM 5. EVALUATE Compare to ground truth β†’ Score via judge 6. REPORT Aggregate scores β†’ Output accuracy + latency ``` Each phase checkpoints independently. Failed runs resume from last successful point. ## MemScore MemScore is a composite metric that captures three dimensions of provider performance in a single line: ``` accuracy% / latencyMs / contextTokens ``` | Component | What it measures | |-----------|-----------------| | **Quality** | Answer accuracy β€” `(correct / total) * 100` from judge evaluations | | **Latency** | Average search response time in milliseconds | | **Tokens** | Average context tokens sent to the answering model (counted client-side) | After a run completes, MemScore appears in the CLI summary: ``` Summary: Total Questions: 50 Correct: 43 Accuracy: 86.00% MemScore: 86% / 145ms / 1823tok ``` MemScore is intentionally a triple, not a single number β€” collapsing quality, latency, and cost into one score hides important tradeoffs. Use it to compare providers side-by-side on the same benchmark: ```bash bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j gpt-4o ``` The `report.json` includes both a display string and structured `memscoreComponents` for programmatic use. > **[Full MemScore documentation β†’](https://supermemory.ai/docs/memorybench/memscore)** ## Checkpointing Runs persist to `data/runs/{runId}/`: - `checkpoint.json` - Run state and progress - `results/` - Search results per question - `report.json` - Final report Re-running same ID resumes. Use `--force` to restart. ## Extending | Component | Guide | |-----------|-------| | Add Provider | [src/providers/README.md](src/providers/README.md) | | Add Benchmark | [src/benchmarks/README.md](src/benchmarks/README.md) | | Add Judge | [src/judges/README.md](src/judges/README.md) | | Project Structure | [src/README.md](src/README.md) | ## License MIT