MemPABench Harness: 3-Phase Implementation Plan

Context

We’re building a benchmark harness to evaluate memory systems for personal assistants on interaction preference learning. We keep Nanobot’s full PA stack intact (AgentLoop, MessageBus, Channel architecture, tools, hooks) because the benchmark must test real PA systems, not stripped-down LLM wrappers. We modify Nanobot in targeted ways: make memory pluggable, add checkpointing, and build a harness layer on top that drives experiments via process_direct() (with a BenchmarkChannel upgrade path for full channel fidelity).

Key principle: The PA under test should be as close to a real deployed PA as possible. Channel architecture, tool usage, session management, memory consolidation — all preserved.

Target Architecture

Three independent roles (PA, Simulator, Judge) + shared infrastructure (memory backends, harness orchestration, data).

MemPA/                                  # Project root
│
├── pa/                                 # 🤖 Personal Assistant (Nanobot fork)
│   ├── agent/
│   │   ├── loop.py                     #   FROM nanobot (KEPT — core PA engine, ~584 lines)
│   │   ├── runner.py                   #   FROM nanobot (as-is — LLM + tool loop, ~232 lines)
│   │   ├── hook.py                     #   FROM nanobot (as-is — lifecycle hooks, ~50 lines)
│   │   ├── context.py                  #   FROM nanobot (MODIFIED — add benchmark prompt loading)
│   │   ├── memory.py                   #   FROM nanobot (MODIFIED — delegate to pluggable MemoryBackend)
│   │   ├── skills.py                   #   FROM nanobot (as-is)
│   │   └── subagent.py                 #   FROM nanobot (as-is)
│   ├── providers/
│   │   ├── base.py                     #   FROM nanobot (as-is — LLMProvider ABC, ~369 lines)
│   │   ├── registry.py                 #   FROM nanobot (trimmed — ~5 providers)
│   │   ├── anthropic_provider.py       #   FROM nanobot (as-is)
│   │   └── openai_compat_provider.py   #   FROM nanobot (as-is)
│   ├── session/
│   │   └── manager.py                  #   FROM nanobot (MODIFIED — add checkpoint/restore)
│   ├── tools/
│   │   ├── base.py                     #   FROM nanobot (as-is — Tool ABC)
│   │   ├── registry.py                 #   FROM nanobot (as-is — ToolRegistry)
│   │   ├── filesystem.py               #   FROM nanobot (as-is — ReadFile, WriteFile, EditFile, ListDir)
│   │   ├── web.py                      #   FROM nanobot (MODIFIED — MockWebSearchTool replaces WebSearchTool)
│   │   ├── shell.py                    #   FROM nanobot (as-is — ExecTool, only for coding scenarios)
│   │   ├── email.py                    #   NEW (~60 lines — EmailTool mock)
│   │   ├── calendar.py                 #   NEW (~50 lines — CalendarTool mock)
│   │   └── contacts.py                 #   NEW (~30 lines — ContactsTool mock)
│   ├── bus/
│   │   ├── events.py                   #   FROM nanobot (as-is — InboundMessage, OutboundMessage)
│   │   └── queue.py                    #   FROM nanobot (as-is — MessageBus)
│   ├── channels/
│   │   ├── base.py                     #   FROM nanobot (as-is — BaseChannel ABC)
│   │   └── benchmark.py               #   NEW (~60 lines — BenchmarkChannel)
│   ├── command/                        #   FROM nanobot (as-is — slash command routing)
│   ├── config/
│   │   └── schema.py                   #   FROM nanobot (MODIFIED — add benchmark fields)
│   └── utils/
│       └── helpers.py                  #   FROM nanobot (trimmed)
│
├── simulator/                          # 🎭 User Simulator (independent)
│   ├── __init__.py
│   ├── core.py                         #   UserSimulator class
│   ├── persona.py                      #   PersonaConfig, preference evolution logic
│   └── prompts/                        #   Simulator-specific prompt templates
│       └── persona_instructions.md
│
├── judge/                              # ⚖️ Evaluator (independent)
│   ├── __init__.py
│   ├── evaluator.py                    #   BenchmarkEvaluator (Judge + Simulator self-eval)
│   ├── metrics.py                      #   CSS, ETS, IQΔ computation
│   ├── schemas.py                      #   EvaluationResult, score dataclasses
│   └── prompts/                        #   Judge-specific prompt templates
│       ├── scoring_rubric.md
│       └── output_format.md
│
├── memory/                             # 🧠 Memory backends (shared by PA + harness)
│   ├── __init__.py
│   ├── base.py                         #   MemoryBackend ABC
│   ├── file/                           #   File-based baseline
│   │   ├── __init__.py
│   │   ├── backend.py                  #   FileMemoryBackend (from Nanobot MemoryStore)
│   │   └── README.md                   #   Setup notes
│   ├── mem0/                           #   Mem0 adapter
│   │   ├── __init__.py
│   │   ├── backend.py                  #   Mem0Backend
│   │   ├── config.yaml                 #   Mem0-specific config (vector DB, graph settings)
│   │   ├── docker-compose.yaml         #   Qdrant / other deps if needed
│   │   └── README.md                   #   Setup & teardown instructions
│   ├── memos/                          #   MemOS adapter
│   │   ├── __init__.py
│   │   ├── backend.py                  #   MemOSBackend
│   │   ├── config.yaml                 #   MemOS-specific config
│   │   ├── docker-compose.yaml         #   MemOS deps
│   │   └── README.md
│   └── zep/                            #   Zep/Graphiti adapter
│       ├── __init__.py
│       ├── backend.py                  #   ZepBackend
│       ├── config.yaml                 #   Zep config (Neo4j connection etc.)
│       ├── docker-compose.yaml         #   Neo4j + Zep services
│       └── README.md
│
├── harness/                            # 🔧 Experiment orchestration (wires PA + Simulator + Judge)
│   ├── __init__.py
│   ├── orchestrator.py                 #   Accumulation phase: simulator ↔ PA conversation loop
│   ├── checkpoint.py                   #   Memory/session snapshot save & restore
│   ├── experiment_runner.py            #   Experiment matrix runner (personas × backends × levels)
│   ├── results.py                      #   Result collection + analysis
│   └── config.py                       #   BenchmarkConfig dataclass
│
├── data/                               # 📦 All static data (personas, prompts, test interactions, sandbox)
│   ├── personas/                       #   Persona definitions (YAML)
│   │   ├── sheldon_cooper.yaml
│   │   ├── leonard_hofstadter.yaml
│   │   ├── michael_scott.yaml
│   │   └── ...                         #   5-6 personas total
│   ├── sandbox_templates/              #   Per-persona workspace templates
│   │   ├── user_a/                     #   Anonymized persona ID
│   │   │   ├── documents/              #   Pre-seeded persona files
│   │   │   ├── projects/               #   Code/work files (if applicable)
│   │   │   ├── email/                  #   EmailTool sandbox
│   │   │   │   ├── inbox/              #   Pre-seeded incoming emails (JSON)
│   │   │   │   ├── sent/               #   Empty at init, populated by PA
│   │   │   │   └── drafts/             #   Optional pre-seeded drafts
│   │   │   ├── calendar.json           #   CalendarTool sandbox (pre-seeded events)
│   │   │   ├── contacts.json           #   ContactsTool sandbox (persona-appropriate contacts)
│   │   │   └── web_cache/              #   MockWebSearch predetermined results (per scenario)
│   │   ├── user_b/
│   │   └── ...                         #   One template per persona (10 total)
│   ├── test_interactions/              #   Evaluation probes per persona
│   │   ├── sheldon_cooper/
│   │   │   ├── final_state.yaml        #   30 final-state probes (all preference cells)
│   │   │   ├── pre_event_work_1.yaml   #   Pre-event snapshot for evolving pref 1 (Work)
│   │   │   ├── pre_event_work_2.yaml
│   │   │   ├── pre_event_work_3.yaml
│   │   │   ├── pre_event_personal_1.yaml
│   │   │   ├── pre_event_personal_2.yaml
│   │   │   └── pre_event_personal_3.yaml
│   │   └── michael_scott/
│   │       └── ...
│   └── prompts/                        #   PA prompt templates (loaded by pa/agent/context.py)
│       ├── assistant/
│       │   ├── base_identity.md        #   Base PA identity
│       │   ├── ipas_L1.md              #   L1: context-blind preference
│       │   ├── ipas_L2.md              #   L2: context-aware static
│       │   ├── ipas_L3.md              #   L3: context-aware + evolution
│       │   └── ipas_L4.md              #   L4: full IPaS
│       └── memory/
│           ├── consolidation_L1.md     #   Memory consolidation prompt per level
│           ├── consolidation_L2.md
│           ├── consolidation_L3.md
│           └── consolidation_L4.md
│
├── workspace/                          # 🗂️ Runtime output (gitignored)
│   ├── experiments/                    #   Per-experiment isolated workspaces
│   │   └── {persona}__{backend}__{condition}/
│   │       ├── sessions/              #   Session JSONL files
│   │       ├── memory/                #   MEMORY.md, HISTORY.md (file backend)
│   │       ├── email/                 #   EmailTool sandbox (copied from template)
│   │       ├── calendar.json          #   CalendarTool sandbox
│   │       ├── contacts.json          #   ContactsTool sandbox
│   │       ├── web_cache/             #   MockWebSearch predetermined results
│   │       ├── documents/             #   Pre-seeded files (copied from template)
│   │       ├── projects/              #   Code/work files (if applicable)
│   │       └── checkpoints/
│   │           └── s10/
│   │               ├── memory_snapshot.json
│   │               ├── session.jsonl
│   │               ├── workspace_snapshot/ #   Full workspace state at checkpoint
│   │               └── metadata.json
│
├── results/                            # 📊 Experiment results (can be git-tracked)
│   ├── raw/                            #   Per-experiment JSON
│   ├── summary/                        #   Aggregated tables
│   └── plots/                          #   Visualizations
│
├── tests/                              # 🧪 Tests (mirrors top-level structure)
│   ├── test_pa/
│   ├── test_simulator/
│   ├── test_judge/
│   ├── test_memory/
│   └── test_integration/
│
├── configs/                            # ⚙️ Experiment configurations
│   ├── mini_test.yaml                  #   Minimal (1 persona, file backend, 10 sessions)
│   ├── pilot.yaml                      #   First run: 2 personas × 2 backends × 1 IPaS level
│   ├── full_benchmark.yaml             #   Full matrix (4 personas × conditions)
│   └── dev.yaml                        #   Debug config
│
├── IMPLEMENTATION_PLAN.md
├── pyproject.toml
├── README.md
└── .gitignore

Design principles

Three independent roles:

pa/ — the PA under test. Can run standalone (it’s a Nanobot fork). Depends on memory/ for pluggable backends.
simulator/ — user simulator. Only depends on pa/providers/base.py for LLM calls. No PA infrastructure.
judge/ — evaluator. Only depends on pa/providers/base.py for LLM calls. No PA infrastructure.

Shared components:

memory/ — backend implementations shared between PA (runtime use) and harness (checkpoint/restore). Each backend gets its own folder with config, docker-compose, setup docs.
harness/ — the only module that knows about all three roles. Wires them together for experiments.

Data separation:

data/ — static inputs: persona definitions, test interactions, prompt templates. Version-controlled.
workspace/ — runtime outputs: sessions, memory files, checkpoints. Gitignored.
results/ — experiment outputs: scores, summaries, plots. Can be version-controlled.
configs/ — experiment configurations. Version-controlled.

Deleted from Nanobot: specific channel implementations (telegram.py, discord.py, whatsapp.py, slack.py, etc.), cron/, security/, heartbeat/, cli/. These are platform-specific, not PA-structural.

Kept from Nanobot: AgentLoop, MessageBus, BaseChannel, tools (filesystem, web, shell), hooks, session manager, command router, skills, subagent manager. These define what a PA is.

Two-stage agent integration approach

Stage 1 (get it working): Harness calls agent_loop.process_direct() — simplest path, skips channel/bus, but still goes through the full _process_message() pipeline (session, memory, tools, consolidation).
Stage 2 (full fidelity): Add BenchmarkChannel so messages flow through the complete MessageBus → Channel → AgentLoop → Channel path. Migration is trivial since simulator/evaluator only care about (input, output) pairs.

This is a low-risk approach: process_direct() and BenchmarkChannel both call _process_message(), so all downstream behavior is identical.

Phase 0: Preference Extraction Pipeline (Data Preparation)

Current role of this section: high-level integration summary only. Detailed extractor design, pilot procedure, and synthesis logic live in MemPABench_Extractor_MVP_Plan, Transcript_to_Preference_Workflow, Pattern_to_Preference_Rubrics, and MemPABench_MBTI_Projection_Table. Do not duplicate those details here.

Goal: Produce canonical persona preference YAMLs as 2 × 15 matrices: two contexts from PA_Interaction_Preference × fifteen IPaS attributes. These matrices are benchmark input, not runtime memory.

2026-05-12 note: the old one-off 4-context → 2-context migration script has been archived at data/personas/archive/merge_contexts.py. It should not be run in the current pipeline; active persona files are already 2-context.

Current status:

TBBT transcripts are parsed.
Two-pass extraction has been run for Sheldon S01-S03.
data/extraction/sheldon_s01-s03_pass2.json feeds the HITL queue.
data/personas/user_a_s01-s03_preferences.yaml is the anonymized canonical runtime matrix for the simulator MVP.
Multi-character and broader multi-season extraction remain future work.

Pipeline summary:

Pass 1 filters transcript scenes, classifies each scene into one of the 2 contexts, tags relevant IPaS attributes, and extracts direct evidence.
Pass 2 groups evidence by (context, attribute), applies the rubric, and outputs setting, confidence, evidence summary, and asymmetry notes.
High-confidence cells may flow directly into persona YAML. Non-High and empty cells go through HITL / MBTI projection / seed authoring according to the confidence gate.

Confidence gate: only High-confidence settings are auto-accepted. Anything below High must be reviewed before committing to canonical persona files.

Phase 1: Fork Nanobot -> Benchmark-ready PA

Current status: mostly implemented, with two important architectural changes from the original plan:

The PA package is pa/, not mempabench/.
Runtime tools are exposed through a run-local MCP state server instead of native mock tool classes.

Implemented:

pa/ contains the benchmark-ready Nanobot fork: agent loop, runner, session manager, context builder, providers, command router, channels/bus scaffolding, and MCP tool integration.
AgentLoop.process_direct() is the active harness entry point. It still runs through _process_message(), context assembly, session persistence, tool loop, memory retrieval/write hooks, and response metadata.
ContextBuilder.set_benchmark_identity() supports benchmark-specific identity prompts while preserving bootstrap files, skill loading, and file-memory injection.
Adapter-retrieved memory can be injected into the system prompt under # Retrieved Memory.
SessionManager supports checkpoint/restore through checkpoint(), restore_checkpoint(), and _load_from_path().
harness.config.BenchmarkConfig exists with workspace paths, assistant/simulator/judge models, personas, memory providers, IPaS levels, backbone models, session settings, checkpoint sessions, and execution controls.
harness.checkpoint.CheckpointManager implements the original three-layer checkpoint shape: session JSONL, memory snapshot JSON, and sandbox/state directory copy.
harness.pa_factory.build_openrouter_pa() builds a PA instance using OpenRouter, MCP state tools, optional memory adapter, and workspace restriction.

Implemented as a replacement for native mock tools:

harness.state_runtime.initialize_state_runtime() copies data/state_fixtures/{user_id} into a run-local mutable state directory and returns MCP config for the PA.
state_server.server exposes deterministic benchmark tools through MCP: document read, email draft/send/search/read, route options, store item checks, cinema showtimes, nearby places, contacts lookup, and planning-note append.
Tool availability is session-scoped through StateRuntime.enabled_tools.
This replaces the original plan for native EmailTool, ContactsTool, MockWebSearchTool, and similar PA-native mock tools. Calendar is not currently represented as a separate tool surface.

Memory implementation status:

The original memory.base.MemoryBackend + memory.file.backend.FileMemoryBackend still exist and support file-memory snapshot/restore.
The active PA integration uses the newer async memory.adapter.MemoryAdapter protocol instead: setup_scope, record_event, retrieve, format_context, reset_scope, and health.
AgentLoop retrieves adapter memory before each turn when run/persona metadata is present, injects it into the prompt, then records the visible user/assistant turn back to the adapter.
Existing Nanobot file memory consolidation remains in pa.agent.memory.MemoryConsolidator; it has not been fully replaced by MemoryBackend delegation.

Not implemented from the original Phase 1 plan:

No BenchmarkChannel yet.
No PA-native mock tool classes for email/calendar/contacts/web; MCP state tools are the current replacement.
No full config-driven assembly of benchmark identity/IPaS-level prompts into experiment conditions yet.
CheckpointManager still assumes the older synchronous MemoryBackend.snapshot()/restore() interface and is not fully unified with the newer async MemoryAdapter contract.

Phase 1 current deliverable

pa = build_openrouter_pa(
    workspace=workspace,
    openrouter_key=openrouter_key,
    model=pa_model,
    mcp_servers=state_runtime.mcp_server_config(),
    memory_adapter=memory_adapter,
)
 
response = await pa.process_direct(
    content=user_message,
    session_key="sim:user_a:session_01",
    metadata={"run_id": run_id, "persona_id": "user_a", "session_id": "session_01"},
)

Phase 2: Simulator + Memory Backends + Session Harness

Current status: partially implemented. The single-session simulator-to-PA harness works, but the original multi-session AccumulationOrchestrator has not been implemented yet.

Implemented: scripted simulator loop

The current largest working loop is simulator.loop.SimulatorLoop.run_session():

Emit session_start.
For each scripted beat:
- resolve active interaction-preference skills,
- generate the simulator turn,
- log simulator-visible and eval-only fields,
- forward only the PA-visible <message> to pa_loop.process_direct(),
- record PA response, tool events, IX selections, and missing IX tool calls.
Emit session_end.

This is a replacement for the original free-form UserSimulator.generate_message()/should_end_session() loop. The current simulator is script/beat driven rather than natural end-condition driven.

Implemented simulator components:

simulator.schemas defines persona identity, preference matrix, session scripts, beats, life context, and related structures.
simulator.loop assembles simulator prompts, loads actor skill specs, calls the simulator LLM, parses structured simulator output, and drives a multi-beat session.
simulator.output parses PA-visible <message> plus eval-only blocks.
harness.transcript writes append-only transcript.jsonl, renders transcript.md, writes meta.yaml, and extracts PA tool calls for audit/judge use.
Access boundary is enforced by construction and tested: eval-only blocks are logged but not sent to PA.

Implemented: PA/session wiring

SimulatorLoop is PA-agnostic and only requires an object with process_direct(content, session_key, metadata=...).
harness.pa_factory.RecordingPA wraps PA instances for boundary tests and records exactly what the PA received.
tests/test_simulator/test_session_with_pa.py is the current end-to-end driver for one scripted session with simulator + real PA + MCP state tools + transcript output.

Implemented: memory adapter backends and conditions

The memory work has moved from the original MemoryBackend ABC design to an async adapter registry:

memory.adapter.MemoryAdapter defines the active runtime contract.
memory.registry.default_registry() registers:
- no_memory
- nanobot_file_memory
- mem0
- simple_rag
- memos
- graphiti
- honcho
Backend adapters have focused tests for record/retrieve/format/reset/health behavior.
nanobot_file_memory wraps the existing file-memory baseline as a controlled condition.

Important limitation: backend-native full snapshot/restore is not implemented for these adapters in the original Phase 2 sense. Reset/retrieve/write contracts exist; full checkpoint export/import remains future work.

Implemented: run-local state instead of sandbox templates

The original plan used data/sandbox_templates/{persona} copied into workspace/experiments/.... Current implementation uses:

static fixtures in data/state_fixtures/{user_id};
run-local mutable copies under workspace/runs or the test output directory;
MCP tools operating only on the run-local state namespace;
transcript-side state/tool audit output.

This serves the same benchmark need as the original sandbox plan for current scripted sessions, but it is not yet wrapped in a full experiment workspace layout.

Implemented: current real PA orchestrator wiring

harness/orchestrator.py is the benchmark-level outer loop for frozen RunPlan steps: memory mode, staged state, step execution, transcript artifacts, ledger, and resume behavior.
scripts/run_orchestrator.py is the real-run CLI/wiring layer. It loads or builds the run plan, creates the PA workspace and memory adapter, builds the OpenRouter PA, copies simulator/profiles/{persona_id} into the current run’s simulator_workspace/{persona_id}, loads that run-local profile for the simulator actor, and passes it into the nanobot simulator backend.
Simulator runtime memory is separate from PA memory: simulator profile memory writes to run-local simulator_workspace/{persona_id}/memory/MEMORY.md and HISTORY.md; PA file memory writes to the run-local pa_workspace/memory/.
The real-run loop drains simulator background memory-consolidation tasks before closing each session, so profile memory writes are completed before the step is marked done.
StateServer is now wired in manifest mode for real orchestrator steps. StageRuntime builds a StateManifest from data/scripts/outlines/{persona_id}/outline.yaml and each step’s acc_num, materializes required fixtures into stage_work/{step_id}, writes state_manifest.yaml, passes MEMPA_STATE_MANIFEST to the MCP subprocess, and enables both static StateServer tools and manifest dynamic tools. This makes scenario-specific tools such as mcp_state_read_anyons_review_paper available alongside generic tools like mcp_state_documents_read and mcp_state_email_search.

Not implemented from the original Phase 2 plan

The original AccumulationOrchestrator.run_accumulation(persona, backend) API was replaced by frozen RunPlan execution through harness/orchestrator.py.
Full experiment-matrix orchestration is not implemented yet.
No automatic scheduled checkpointing during accumulation.
No restore-and-evaluate flow from checkpoints.
No BenchmarkChannel; the active path remains process_direct().
No complete implementation of the 114 accumulation sessions + 6 pre-event probes + 30 final probes as one runnable benchmark flow.
No simulator advance_to_session(), should_end_session(), or self_evaluate() API matching the original plan.

Phase 2 current deliverable

with TranscriptWriter(run_dir) as tw:
    sim_loop = SimulatorLoop(
        provider=sim_provider,
        identity=identity,
        matrix=matrix,
        script=script,
        pa_loop=pa,
        transcript_writer=tw,
        session_key="sim:user_a:session_01",
        run_id=run_id,
        persona_id="user_a",
        agent_id="nanobot",
    )
    events = await sim_loop.run_session()

This supports one scripted session end-to-end. The next design task is the Phase 2 orchestrator: how to compose many scripted sessions, memory conditions, checkpoints, state resets/restores, and later evaluation probes into a benchmark-level run.

Phase 3: Evaluation Pipeline + Experiment Runner

Goal: Full end-to-end benchmark — accumulate, evaluate at checkpoints, iterate experiment matrix, collect results.

Step 3.1: Evaluator (dual evaluation)

harness/evaluator.py (~300 lines)
BenchmarkEvaluator.evaluate_checkpoint(checkpoint_id, agent, test_interactions):
1. Run test interactions against checkpointed agent (read-only memory)
2. Judge evaluation: LLM scores responses against ground-truth preferences (structured JSON output, temperature=0)
3. Simulator self-evaluation: persona reviews interactions from user-experience perspective
4. Returns EvaluationResult(judge_scores, simulator_scores, test_results)
Judge dimensions: Context Sensitivity, Preference Evolution Tracking, Interaction Quality
Verify: run on known checkpoint, both judge + simulator produce valid structured scores

Step 3.2: Test interaction design

personas/<name>/test_interactions.yaml per persona
Probe specific preferences at each checkpoint level
Include expected behavior and preference key for scoring
Pre-designed (not generated) for reproducibility

Step 3.3: Experiment runner

harness/experiment_runner.py (~200 lines)
Iterates: personas × backends × IPaS levels
For each: accumulate → evaluate at checkpoints
Support: resumability (skip completed), parallelism (independent personas), cost tracking (token usage)
Verify: 1 persona × 1 backend × 1 IPaS level end-to-end

Step 3.4: Results collection + analysis

harness/results.py (~150 lines)
Save per-experiment JSON, generate summary DataFrame, evolution plots
Aggregate across experiment matrix

Step 3.5: CLI entry point

__main__.py (~60 lines)
Commands: run (full benchmark), run-single (one experiment), analyze (generate reports)

Phase 3 deliverable

python -m mempabench run --config benchmark_config.yaml
# Full experiment matrix, results in results/

Data Design Checklist

This section covers all static data that must be authored before running the benchmark. Code (Phases 1-3) does not depend on final data — you can develop with the mini test dataset and swap in real data later.

1. Persona definitions (`data/personas/{name}.yaml`)

4–6 personas drawn from Big Bang Theory, Silicon Valley, and The Office US (4 in the core suite, optionally 5–6 in the extended suite). Anonymized as User A, User B, … during benchmark execution. Each implemented persona file needs:

# Required fields
id: str                            # Anonymized ID (e.g. "user_a")
name: str                          # Character name (used only during annotation, never seen by PA)
show: str                          # Source show
background: str                    # Role, expertise, personality summary (2-3 sentences)
communication_style: str           # How they talk — tone, verbosity, formality (1-2 sentences)
difficulty: str                    # Easy / Medium / Hard / Trap
core_challenge: str                # What makes this persona hard to serve well (1 sentence)
 
# Contexts: 2 total (work / personal)
# Primary (all personas): email, scheduling, document editing, info lookup, learning
# Exploratory (by persona subset): coding, lab, sales, management, solo hobby projects
contexts:
  - name: str                      # Short label (e.g. "work", "personal")
    domain: "work" | "personal"
    description: str               # What the user is doing in this context
    typical_tasks: list[str]       # Example tasks from the 30 task categories
 
# Initial interaction preferences (maps to PA Interaction Preference Taxonomy)
# 4 dimensions, 15 PA attributes. Can be context-dependent.
# Values should be concrete settings, not ranges.
initial_preferences:
  # Per-context overrides: if absent, use default. If present, overrides for that context.
  default:
    # Dim 1: Expression Style (4 attributes)
    tone_formality: "casual" | "consultative" | "formal"
    verbosity: "terse" | "moderate" | "detailed"
    emotional_engagement: "task-focused" | "balanced" | "relationship-focused"
    guidance_level: "assumed" | "calibrated" | "guided"
 
    # Dim 2: Disclosure (4 attributes)
    reasoning_visibility: "show" | "summarize" | "hide"
    uncertainty_expression: "express" | "moderate" | "hide"
    process_visibility: "silent" | "bookend" | "full_narration"
    memory_privacy: "minimal_transparent" | "domain_scoped" | "full"
 
    # Dim 3: Initiative & Autonomy (5 attributes)
    autonomy_level: "reactive" | "suggest" | "self_directed" | "autonomous"
    proactive_outreach: "low" | "medium" | "high"
    task_expansion: "low" | "medium" | "high"
    solution_breadth: "low" | "medium" | "high"
    capability_boundary: "suggest_alternatives" | "find_and_hand_off"
 
    # Dim 4: Information Flow (2 attributes)
    information_elicitation: "infer" | "structured" | "iterative"
    topic_management: "follow_user" | "organize" | "one_at_a_time"
 
  # Context-specific overrides (only list attributes that differ from default)
  work:
    tone_formality: "formal"
    autonomy_level: "suggest"     # More cautious for professional actions
  personal:
    solution_breadth: "high"      # Personal contexts welcome option exploration
    emotional_engagement: "balanced"
 
# Preference evolution events
# Primary event: functionally equivalent across all personas (Stimulus Template A: Irreversible Operation Error)
# Exploratory events: persona-specific (1-2 per persona)
preference_events:
  - type: "primary" | "exploratory"
    session: int                   # When the shift occurs (or range for gradual)
    trigger: str                   # Narrative event (e.g. "Accidentally deleted experiment scripts")
    functional_structure: str      # Abstract structure (e.g. "Irreversible loss × work × external")
    changes:                       # Which preferences change
      attribute_name: new_value
    context: str | null            # If null, applies globally; if set, context-specific
    expected_detection_turn: int   # First turn where behavioral evidence is observable
    rationale: str                 # Why this shift is natural for this character

Persona roster (from proposal):

ID	Source	Core Challenge	Difficulty
User A	BBT	Extreme context-dependency: opposite styles per context. Strong correction signals.	Easy
User B	BBT	Hidden preferences, rarely corrects, silent satisfaction decay. Weak signals.	Hard
User C	BBT	Ultra-terse default. Career change triggers dramatic multi-attribute shift.	Medium
User D	BBT	Emotion-driven preferences (not task-driven). Flowery when comfortable, silent when anxious.	Hard
User E	SV	Confident in tech, hedging in management. Career trauma inverts preferences.	Medium
User F	SV	Control group: nearly invariant. Over-learning gets penalized (false positive trap).	Trap
User G	SV	Overly accommodating surface. Gradual boundary assertion = weak signal.	Hard
User H	Office	Most unstable: erratic in authority role, over-sharing in personal context. Matures through life event.	Medium
User I	Office	Power-hierarchy-driven preferences. Deferential up, authoritarian down.	Easy
User J	Office	Normally terse + disengaged. Suddenly proactive about things they care about.	Hard

Design considerations:

Taxonomy coverage: Collectively, the selected personas (4–6) should exercise as many of the 15 PA attributes as feasible. Not every persona needs to touch every attribute; aim for each attribute to be tested by at least 2 personas where possible.
Shift diversity: Include different shift types: trust-building (relaxing), trust-breaking (tightening), context-specific vs global, gradual drift vs sudden switch, explicit correction vs implicit behavioral drift, temporary vs permanent.
Character fidelity: Preferences should be derivable from show canon and documented behavioral evidence.
Cross-context contrast: Each persona should have at least one attribute that differs across contexts. This is what makes Context Sensitivity measurable.
Personalization necessity check: For each scenario × attribute, GT values must diverge across personas. If all personas share the same GT, that item is excluded from CS scoring.
Core suite (4 personas): 1 easy + 1 medium + 1 hard + 1 trap/control for main analysis. Extended suite (optional) adds 1–2 more, total 5–6.

2. Preference evolution timeline

~~Each persona has 3-4 shift events spread across ~40-50 sessions.~~ Each persona has exactly 6 evolving preferences out of 30 total (15 attributes × 2 contexts): 3 in Work context, 3 in Personal context, each a distinct attribute. Events are staggered — no two events in adjacent sessions.

Session structure per persona (150 total):

Accumulation (114 sessions):
  24 stable preferences × 3 reps            = 72 sessions
  6 evolving × (3 pre + 1 event + 3 post)   = 42 sessions
  Total accumulation                         = 114 sessions

Test (36 sessions, run on-policy):
  30 preference cells × 1 final-state probe  = 30 sessions  (after session 114)
  6 evolving prefs × 1 pre-event snapshot    =  6 sessions  (each before its event)
  Total test                                 = 36 sessions

Event placement per evolving preference:

Each event is placed after that preference’s 3 pre-event accumulation sessions. The pre-event test session runs immediately before the event. Events are staggered to avoid clustering. Post-event accumulation (3 sessions) follows each event.

Shift diversity guidelines (applied across 6 events, 2 contexts):

At least 2 shifts context-specific (change in Work or Personal only, not both)
At least 1 shift is a regression (preference tightens or reverts after relaxing)
At least 1 shift reflects trust-building (preference relaxes over time)
Include both explicit correction and implicit behavioral drift across the 6 events

~~Checkpoint placement~~ → replaced by fixed test structure: 6 pre-event snapshots + 30 final-state probes. No arbitrary checkpoint schedule needed.

3. Test sessions (`data/scripts/sessions/{name}/tests/`)

~~Pre-designed probes that run at each checkpoint.~~ Pre-designed test sessions in two categories: 6 pre-event snapshot files (one per evolving preference, run immediately before each event) and 30 final-state probe files (covering all preference cells, run after all 114 accumulation sessions). These are the actual evaluation inputs.

Test sessions are read-only preference-alignment probes. Their purpose is to test whether the PA can select the active interaction preference from prior memory / learned preference state. They must not teach, correct, punish, or reveal the target preference during the test.

Naming:

test_pre_01.yaml … test_pre_06.yaml
test_final_001.yaml … test_final_030.yaml

id: test_final_001
session_type: test_final
timeline:
  placement: after_all_active_sessions
test_target:
  context: "work_internal"
  primary_attribute: "confirmation"
  expected_ix_setting: "confirm_before_action"
  background_preferences: []
task:
  user_request: "Schedule a meeting with Leonard for Tuesday at 3pm."
  neutral_wording_checks:
    no_direct_preference_request: true
    no_target_setting_synonyms: true
    no_user_correction_or_teaching: true
  task_success_independent_of_setting: true
judge_observables:
  expected_behavior:
    - "PA asks for confirmation before modifying the schedule."
  wrong_but_task_successful_settings:
    - setting: "act_without_confirmation"
      observable_behavior: "PA schedules directly and reports completion."
      why_task_still_succeeds: "The meeting is still scheduled; the error is preference alignment, not task success."
  scoring_basis:
    - pa_response
    - ix_tool_selection
memory_policy:
  write_preference_evidence: false
  ordinary_task_memory_allowed: false

Design considerations:

Single-target probes: Each test session probes exactly one primary interaction attribute. If a task naturally touches other preferences, mark them as background only and exclude them from scoring.
Before/after shift pairs: Each evolving preference has a pre-event snapshot probe (same query as its final-state probe) so PA behavior before and after the event can be directly compared. The 6 pre-event snapshot files each contain exactly 1 probe targeting the evolving attribute.
Preference-neutral wording: User requests state task needs only. They cannot contain direct preference instructions or target-setting synonyms such as “brief”, “detailed”, “ask first”, “don’t ask”, “take initiative”, or “wait for me” when those are the tested attribute.
No teaching feedback: If the PA picks the wrong interaction style, the user does not correct, explain, scold, or reveal the preference. The user may continue the task or accept a usable but suboptimal result.
Task success decoupling: Wrong settings must still allow the task to complete. The judge scores how the PA completed the task, not whether the task was possible.
Judge observables: Scoring must rely on PA behavior only: response shape, IX tool selection, tool path, question pattern, initiative, disclosure style, or similar visible actions. Do not use user reaction as evidence.
Memory pollution guard: Test sessions default to write_preference_evidence: false. If task outcome memory is required by the harness, label it ordinary task memory, not preference evidence.
Context discrimination: Include the same query in different contexts to test whether the PA correctly applies context-specific preferences.
Sandbox-requiring probes: At least 30-40% of probes should require sandbox interaction (file reads, script runs, file writes) — otherwise we’re only testing conversational adaptation.
Rubric design: Each probe needs a 1-5 rubric that the Judge LLM will use. Rubrics must be specific to the attribute and scenario, not generic. The rubric is what makes evaluation reproducible.

4. Sandbox templates (`data/sandbox_templates/{name}/`)

Already detailed in the Sandbox appendix. Key additional considerations:

Evolving content: Some sandbox files should be written to expect modification over time. A schedule.md with dates, a todo.md with items to check off, a draft.tex to revise.
Cross-context files: Some files are relevant to multiple contexts (e.g., contacts.json used in both scheduling and email contexts).
Test-interaction alignment: Every requires_sandbox: true test interaction must have corresponding files in the sandbox. Audit this mapping.
Character-authentic content: File content should use character-appropriate language and domain knowledge. Sheldon’s simulation code should look like physics code, not generic Python.

5. Prompt templates (`data/prompts/`)

PA identity prompts (`data/prompts/assistant/`)

base_identity.md — Core PA identity shared across all IPaS levels. Defines the PA’s role, capabilities, and general behavioral guidelines. Does NOT include preference information.
ipas_L1.md — Context-blind: injects user preferences as a flat list, no context awareness. The PA should apply preferences uniformly.
ipas_L2.md — Context-aware static: injects preferences keyed by context. The PA should apply different preferences in different contexts.
ipas_L3.md — Context-aware + evolution: same as L2, plus instructs the PA to update its understanding of preferences based on observed conversation patterns.
ipas_L4.md — Full IPaS: same as L3, plus composability — the PA should reason about preference interactions (e.g., high initiative + high confirmation = proactively suggest but always confirm).

Key design question: How does each IPaS level interact with memory? At L1, the memory backend stores preferences but the prompt doesn’t differentiate by context. At L4, the prompt actively instructs the PA to use memory for preference evolution tracking. This interaction between prompt template and memory backend is what the benchmark measures.

Memory consolidation prompts (`data/prompts/memory/`)

consolidation_L1.md — Consolidation focuses on factual recall (what happened), not preference patterns.
consolidation_L2.md — Adds context-tagged preference observations to consolidation output.
consolidation_L3.md — Adds temporal preference change detection.
consolidation_L4.md — Adds cross-preference interaction notes and confidence levels.

These prompts determine what the MemoryConsolidator extracts. The same conversation may produce different memory content depending on the consolidation prompt — this is the IPaS ablation mechanism.

6. Persona selection criteria

Persona archetypes are catalogued in section 1 above. Core suite uses 4 personas (1 easy + 1 medium + 1 hard + 1 trap); extended suite (optional) adds 1–2 more, total 5–6.

Criterion	Why it matters	How to verify
Show diversity	Avoid over-fitting to one show’s dynamics	3 shows: BBT (4), SV (3), Office (3)
Role diversity	Different professional contexts	Mix of technical, managerial, creative roles
Communication diversity	Different baseline interaction styles	Terse (User C, F, J) to verbose (User D, H)
Preference spread	Taxonomy coverage	Each of the 4 dimensions / 15 attributes has ≥2 personas exercising it
Shift diversity	Different evolution patterns	Gradual drift (G), sudden switch (C, E), explicit correction (A), implicit behavioral (B, J), temporary (A exploratory)
Difficulty spread	Calibrate benchmark	2 easy, 3 medium, 4 hard, 1 trap
Control group	Detect false positives	User F: nearly invariant, over-learning penalized

7. Data authoring order

Recommended sequence:

Select 5-6 personas — finalize the roster based on criteria above
Write initial preferences for each persona — map to taxonomy attributes
Design preference evolution — 3-4 shifts per persona with triggers and rationale
Determine checkpoint schedule — based on shift timing
Write test interactions — per persona per checkpoint, with rubrics
Create sandbox templates — per persona, aligned with test interactions
Write prompt templates — IPaS L1-L4 + consolidation prompts (shared across personas)
Audit coverage — verify all 15 PA attributes are tested, all sandbox references exist, all rubrics are specific

Steps 1-4 are the creative/research work. Steps 5-7 are more mechanical once preferences and shifts are defined. Step 8 is a sanity check before running experiments.

Key Nanobot Files Reference

File	Lines	Action	Notes
`agent/loop.py`	584	KEEP	Core PA engine, driven via `process_direct()`
`agent/runner.py`	232	COPY	Pure LLM + tool loop, no changes
`agent/hook.py`	50	COPY	Lifecycle hooks for runner — used for benchmark instrumentation
`agent/memory.py`	366	MODIFY	Refactor MemoryStore to delegate to pluggable MemoryBackend; keep MemoryConsolidator
`agent/context.py`	200	MODIFY	Add benchmark identity override; keep existing structure
`agent/skills.py`	~229	COPY	PA feature, kept
`agent/subagent.py`	—	COPY	PA feature, kept
`session/manager.py`	269	MODIFY	Add checkpoint/restore; delete legacy migration
`providers/base.py`	369	COPY	Core LLM interface
`providers/registry.py`	341	TRIM	Keep ~5 providers
`providers/anthropic_provider.py`	442	COPY	As-is
`providers/openai_compat_provider.py`	590	COPY	As-is
`tools/base.py`	~200	COPY	Tool ABC
`tools/registry.py`	~70	COPY	Tool registration
`tools/filesystem.py`	—	COPY	ReadFile, WriteFile, EditFile, ListDir
`tools/web.py`	—	COPY	WebSearch, WebFetch
`tools/shell.py`	—	COPY	ExecTool
`bus/events.py`	—	COPY	InboundMessage, OutboundMessage
`bus/queue.py`	—	COPY	MessageBus
`channels/base.py`	178	COPY	BaseChannel ABC
`channels/benchmark.py`	—	NEW	BenchmarkChannel for experiment use
`command/`	—	COPY	Slash command routing, PA feature
`config/schema.py`	263	MODIFY	Add benchmark fields, keep provider/tool config
`utils/helpers.py`	303	TRIM	Remove unused helpers
`channels/telegram.py` etc.	—	DELETE	Platform-specific, not PA-structural
`cron/`, `security/`, `heartbeat/`, `cli/`	—	DELETE	Not needed for benchmark
`tools/message.py`, `tools/spawn.py`, `tools/cron.py`	—	DELETE	Channel/cron-specific tools

Appendix: Minimal Test Dataset

For testing the pipeline before the full dataset is ready, use a single-persona mini dataset:

Mini Persona: “Sheldon Cooper” (1 persona, simplified)

# personas/mini_sheldon.yaml
name: "Sheldon Cooper"
show: "The Big Bang Theory"
background: "Theoretical physicist, rigid and routine-driven"
communication_style: "Verbose, literal, demands precision"
 
contexts:
  - name: "coding"
    description: "Working on physics simulations"
  - name: "scheduling"
    description: "Planning meetings and daily routine"
 
initial_preferences:
  confirmation: "each"           # wants confirmation for every action
  initiative: "reactive"         # don't volunteer suggestions
  presentation: "layered"        # detailed, structured responses
  tool_transparency: "high"      # show all tool usage and reasoning
 
preference_events:
  - session: 5
    description: "After trusting the PA, relaxes confirmation for routine tasks"
    changes:
      confirmation: "silent"     # for routine tasks only
    context: "scheduling"        # only in scheduling context, not coding

Mini Test Parameters

1 persona (Sheldon)
2 contexts (coding, scheduling)
1 memory backend (file — no external DB needed)
1 IPaS level (L1 — context-blind, simplest)
10 sessions (not 40)
Checkpoints at [5, 10] (before and after preference event)
~3-5 turns per session

Mini Test Interactions

# personas/mini_sheldon/test_interactions.yaml
- checkpoint: 5
  interactions:
    - context: "scheduling"
      query: "Schedule a meeting with Leonard for Tuesday"
      probe: "confirmation"
      expected: "Should ask for confirmation (preference = each)"
    - context: "coding"
      query: "Run my simulation script"
      probe: "tool_transparency"
      expected: "Should explain which tool it's using and why (preference = high)"
 
- checkpoint: 10
  interactions:
    - context: "scheduling"
      query: "Schedule a meeting with Leonard for Thursday"
      probe: "confirmation"
      expected: "Should NOT ask for confirmation (preference changed to silent at session 5 for scheduling)"
    - context: "coding"
      query: "Run my simulation script"
      probe: "confirmation"
      expected: "Should STILL ask for confirmation (change was scheduling-only)"

Why this is sufficient for pipeline testing

Tests the full flow: accumulation → checkpoint → evaluation
Tests context sensitivity (coding vs scheduling)
Tests preference evolution (session 5 change)
Tests context-dependency of evolution (changed in scheduling, not coding)
No external DB needed (file backend only)
Runs in minutes, not hours
Can expand to full dataset by adding more personas and backends

Appendix: Database & Infrastructure Architecture

The scale problem

Full experiment matrix: 6 personas × 3 backends × 4 IPaS levels × 3 backbone models = 216 experiments

Each experiment needs isolated memory state. We cannot spin up 360 separate DB instances on a 24GB M4 Pro. Solution: shared DB instances + logical namespace isolation.

Shared infrastructure (one instance each)

# memory/docker-compose.yaml
services:
  qdrant:                        # Used by: Mem0 experiments
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes: [qdrant_data:/qdrant/storage]
 
  neo4j:                         # Used by: Zep/Graphiti experiments
    image: neo4j:5
    ports: ["7474:7474", "7687:7687"]
    environment:
      NEO4J_AUTH: neo4j/benchmark
    volumes: [neo4j_data:/data]
 
  # MemOS: TBD based on their deployment docs
 
volumes:
  qdrant_data:
  neo4j_data:

Only start the services needed for the current batch of experiments.

Namespace isolation per experiment

Every experiment gets a unique experiment_id: {persona}__{backend}__{ipas}__{model}

Backend	Isolation mechanism	experiment_id used as
File	Separate workspace directory	Directory name under `workspace/experiments/`
Mem0	`user_id` + `collection_name`	Qdrant collection name
MemOS	MemCube namespace	Namespace identifier
Zep	`user_id` + `session_id`	Neo4j user node + session identifier

# Example: Mem0 backend initialization
class Mem0Backend(MemoryBackend):
    def __init__(self, experiment_id: str, config: dict):
        self.experiment_id = experiment_id
        self.memory = Memory.from_config({
            **config,
            "user_id": experiment_id,
            "collection_name": experiment_id,
        })

Workspace layout per experiment

workspace/
  experiments/
    sheldon__mem0__L1__claude/
    │   ├── sessions/              # Session JSONL files
    │   ├── memory/                # File backend data (MEMORY.md, HISTORY.md)
    │   ├── checkpoints/
    │   │   ├── s10/
    │   │   │   ├── memory_snapshot.json   # Full DB state export
    │   │   │   ├── session.jsonl
    │   │   │   └── metadata.json
    │   │   └── s30/
    │   │       └── ...
    │   └── sandbox/               # Tool sandbox filesystem
    │
    sheldon__mem0__L1__gpt4o/
    │   └── ...
    sheldon__zep__L2__claude/
    │   └── ...
    └── ...

Checkpoint strategy for external DBs

Save checkpoint (three layers):

Memory: backend.snapshot() → exports all memories for this experiment_id as JSON
- Mem0: memory.get_all(user_id=experiment_id) → JSON
- Zep: export all facts + graph edges for user/session → JSON
- MemOS: export MemCubes in namespace → JSON
Session: Copy session JSONL to checkpoint dir
Sandbox: shutil.copytree(sandbox, checkpoint_dir/sandbox) — full filesystem snapshot

Save to: checkpoints/{checkpoint_id}/ → memory_snapshot.json + session.jsonl + sandbox/ + metadata.json

Restore checkpoint (for evaluation):

Create temporary namespace: {experiment_id}__eval__{checkpoint_id}
backend.restore(snapshot) → imports JSON into temp namespace
Restore sandbox: copy checkpoint sandbox to eval workspace
Run evaluation queries (read-only memory, but PA can interact with sandbox)
Cleanup: delete temp namespace + eval sandbox after evaluation

This ensures evaluation never pollutes accumulation state.

Execution order (resource-aware)

Do NOT run all 360 experiments in parallel. Batch by backend to limit infra:

Batch 1: File backend (all personas × IPaS × models)
          → No external DB needed. Can parallelize freely.
          → ~120 experiments

Batch 2: Mem0 experiments (start Qdrant)
          → docker compose up qdrant
          → Run all Mem0 experiments (low concurrency: 2-3 parallel)
          → docker compose down qdrant
          → ~120 experiments

Batch 3: Zep experiments (start Neo4j)
          → docker compose up neo4j
          → Run all Zep experiments (low concurrency: 2-3 parallel)
          → docker compose down neo4j
          → ~120 experiments

Batch 4: MemOS experiments (start MemOS services)
          → Similar pattern
          → ~120 experiments

Each batch only needs one DB type running. 24GB RAM is sufficient for: PA process + 1 DB service + simulator LLM calls + judge LLM calls (all LLMs are API-based, no local GPU needed).

Backbone model handling

Multiple backbone models (Claude, GPT-4o, Llama etc.) do NOT add DB complexity — they all use the same memory backends. The experiment_id includes the model name for workspace isolation, but the DB infrastructure is identical.

Provider setup per experiment:
  assistant_provider = create_provider(config, model="anthropic/claude-sonnet-4-20250514")
  # OR
  assistant_provider = create_provider(config, model="openai/gpt-4o")
  # OR
  assistant_provider = create_provider(config, model="deepseek/deepseek-chat")

  # Same memory backend either way:
  memory = Mem0Backend(experiment_id="sheldon__mem0__L2__claude", config=mem0_config)

The pa/providers/ abstraction handles model switching. No architectural change needed.

Appendix: Evaluation Pipeline Detail

Dual evaluation at key checkpoints

Evaluation happens in Phase 2 (accumulation) at checkpoint sessions, NOT as a separate batch.

Session 1   → accumulate (stable or pre-event evolving)
...
Session N-1 → accumulate
Session N   → PRE-EVENT TEST (1 probe for evolving pref i)
Session N+1 → EVENT session (evolving pref i shifts)
Session N+2 → accumulate (post-event evolving pref i)
...  (×6 for each evolving preference, staggered)
...
Session 114 → final accumulation session
             → FINAL-STATE TEST (30 probes, all preference cells)

Evaluation flow at each checkpoint

┌──────────────────────────────────────────────────────────┐
│ EVALUATE at checkpoint s30                                │
│                                                          │
│ 1. Save checkpoint (three layers)                        │
│    → memory_snapshot.json + session.jsonl + sandbox/      │
│                                                          │
│ 2. Create eval-only PA instance                          │
│    → fresh AgentLoop with restored memory (read-only)    │
│    → restored sandbox (PA can read files, run scripts)   │
│    → temp namespace: "sheldon__mem0__L2__claude__eval_s30"│
│                                                          │
│ 3. Run test interactions (from data/test_interactions/)   │
│    for each test in checkpoint_30.yaml:                   │
│      → set context (coding / scheduling / etc.)          │
│      → agent_loop.process_direct(test.query)             │
│      → PA may use tools on sandbox (read schedule, etc.) │
│      → collect response                                  │
│                                                          │
│ 4. Judge evaluation (external)                           │
│    → Judge LLM scores each (query, response) pair        │
│    → Against ground-truth preference for this checkpoint  │
│    → Structured JSON output with per-dimension scores     │
│    → Dimensions: CSS, PET, IQI (from proposal §5.2)      │
│                                                          │
│ 5. Simulator self-evaluation (internal)                  │
│    → Simulator LLM reviews as the persona                │
│    → "As Sheldon, was I satisfied with this interaction?" │
│    → Scores: comfort, understood, adapted, natural        │
│                                                          │
│ 6. Collect results                                       │
│    → EvaluationResult {                                  │
│        checkpoint_id, experiment_id,                     │
│        judge_scores: {css, pet, iqi, per_interaction},   │
│        simulator_scores: {comfort, understood, ...},     │
│        test_results: [{query, response, expected, ...}], │
│        metadata: {tokens_used, latency, ...}             │
│      }                                                   │
│                                                          │
│ 7. Cleanup eval namespace                                │
│    → delete temp DB namespace                            │
│                                                          │
│ 8. Resume accumulation from session 31...                │
└──────────────────────────────────────────────────────────┘

Judge scoring dimensions (from proposal §5.2)

Dimension	What it measures	Score range
Context Sensitivity (CSS)	Does the PA adapt style across contexts for the same user?	0-1
Preference Evolution Tracking (ETS)	Did the PA detect and apply preference changes?	0-1, with lag penalty
Interaction Quality Impact (IQΔ)	Overall interaction quality improvement vs no-memory baseline	delta score

Additional per-interaction scores from PrefIx’s 7-dimension framework:

Confirmation behavior (Each/Silent/Batch — did it match?)
Initiative level (Proactive/Reactive — did it match?)
Presentation style (Compact/Layered — did it match?)
Tool transparency (High/Medium/Low — did it match?)
… (remaining PrefIx dimensions)

Simulator self-eval dimensions

Dimension	What it captures	Example
Felt understood	Did the PA seem to “get” me?	”It remembered I hate small talk”
Comfort	Was the interaction style comfortable?	”Too verbose for a quick scheduling task”
Adaptation noticed	Did I notice the PA adapting to me?	”It stopped asking for confirmation — good”
Naturalness	Did the adaptation feel natural or robotic?	”It felt forced, like reciting a rule”
Disagreement with Judge	Where simulator and judge diverge	Research signal — not a score

Results aggregation

results/
  raw/
    sheldon__mem0__L1__claude.json       # Full experiment result
    sheldon__mem0__L2__claude.json
    ...
  summary/
    by_backend.csv                        # Avg scores grouped by memory backend
    by_ipas_level.csv                     # Avg scores grouped by IPaS level
    by_model.csv                          # Avg scores grouped by backbone model
    by_persona.csv                        # Avg scores grouped by persona
    evolution_tracking.csv                # ETS across checkpoints (learning curves)
    judge_vs_simulator.csv                # Agreement/disagreement analysis
  plots/
    css_by_backend.png                    # Context sensitivity comparison
    ets_learning_curves.png               # Preference evolution tracking over time
    iqi_ablation.png                      # IPaS level ablation (L1→L4)
    judge_simulator_correlation.png       # Dual eval agreement

Appendix: Sandbox Environment Design

What the sandbox is

The sandbox is a per-persona simulated workspace — a directory of files that represents the user’s digital environment. The PA interacts with it through existing tools (ReadFile, WriteFile, ExecTool, ListDir). No VMs, no containers, no simulation engine — just a filesystem.

Why the sandbox matters for the benchmark

Without sandbox, we can only test conversational preferences (verbosity, tone, confirmation in dialogue). With sandbox, we can test behavioral preferences that require real tool interaction:

Preference dimension	Without sandbox	With sandbox
Confirmation (Each/Silent)	“Should I do X?” in dialogue	PA asks before modifying schedule.md
Tool Transparency (High/Low)	PA describes hypothetical actions	PA shows actual tool calls and results
Initiative (Proactive/Reactive)	PA offers verbal suggestions	PA proactively reads schedule and warns about conflicts
Presentation (Compact/Layered)	Different text format	Different file output format (summary vs. detailed report)

Sandbox template design per persona

Each persona’s sandbox should reflect their role, domain, and typical tasks:

Sheldon Cooper (theoretical physicist):

sandbox/
  documents/
    schedule.md                    # Weekly routine (very structured)
    meeting_notes/                 # Group meetings, seminar notes
  projects/
    physics_simulation/
      simulate.py                  # Python script he runs regularly
      data/results_v3.csv          # Simulation output
      README.md
    string_theory_paper/
      draft.tex                    # Paper in progress
      references.bib
  contacts.json                    # Leonard, Penny, Amy, Raj, Howard

Michael Scott (regional manager):

sandbox/
  documents/
    schedule.md                    # Meetings, client visits
    team_roster.xlsx               # Direct reports
    email_drafts/
      quarterly_review.md          # Draft in progress
      party_planning.md
  dundermifflin/
    sales_report_q1.csv            # Numbers he should know
    client_list.json               # Key accounts
    policies/
      hr_handbook.md
  contacts.json                    # Dwight, Jim, Pam, etc.

Richard Hendricks (startup CEO):

sandbox/
  projects/
    pied_piper/
      src/compression.py           # Core algorithm
      tests/test_compression.py    # Tests
      requirements.txt
      README.md
  documents/
    pitch_deck_notes.md            # Investor prep
    investor_meeting.md            # Meeting notes
    hiring_pipeline.md             # Candidates
  contacts.json                    # Gilfoyle, Jared, Dinesh, Monica

Sandbox design principles

Persona-authentic: Files should reflect what this character would actually have. Use show canon for naming, content, relationships.
Task-enabling: Include files that enable the test interactions. If a test asks “schedule a meeting,” there must be a schedule.md to update.
Preference-testable: Include scenarios that trigger preference-sensitive behavior:
- Files that PA might modify (tests confirmation preference)
- Scripts that PA might run (tests tool transparency)
- Information the PA could proactively surface (tests initiative)
Minimal but sufficient: Don’t create 100 files. 10-20 files per persona is enough. Each file should serve at least one test interaction.
Deterministic: No random content. Same template produces same starting state every time.

Sandbox state changes during accumulation

The PA modifies the sandbox during conversations:

Session 3: PA helps Sheldon update schedule.md → file changed
Session 7: PA runs simulate.py → new output file created
Session 12: PA creates a new meeting note file

These changes are part of the experiment state and must be checkpointed. At checkpoint s10, the sandbox reflects all changes from sessions 1-10.

Implementation: workspace isolation (no Docker)

Why no Docker: The PA’s “dangerous surface” is small. Email/Calendar/Contacts are mock tools reading/writing local JSON. WebSearch returns predetermined results. File tools are already workspace-scoped via restrict_to_workspace=True. The only real risk is ExecTool (shell), which only applies to coding scenarios and is already timeout-guarded. Docker adds complexity with no benefit at this stage — if needed for large-scale runs on remote machines, it can be added later without code changes.

Workspace isolation strategy: Each experiment run (persona × memory_backend × condition) gets its own workspace directory, copied from a template at init. This is native to Nanobot’s architecture.

results/
├── userA__mem0__raw/
│   └── workspace/          # This experiment's complete sandbox
│       ├── memory/         # MEMORY.md, HISTORY.md
│       ├── email/          # Simulated inbox/sent (EmailTool reads/writes here)
│       ├── calendar.json   # Simulated calendar (CalendarTool reads/writes here)
│       ├── contacts.json   # Simulated contacts (ContactsTool reads here)
│       ├── documents/      # Pre-seeded persona files
│       ├── projects/       # Pre-seeded code/work files
│       └── sessions/       # Session history
├── userA__zep__raw/
│   └── workspace/
└── userB__mem0__raw/
    └── workspace/

Requirement	How it’s met
PA reads files in sandbox	`ReadFileTool` — already exists in Nanobot
PA writes files in sandbox	`WriteFileTool` — already exists
PA runs scripts in sandbox	`ExecTool` — already exists (disabled for non-coding scenarios)
PA lists directory contents	`ListDirTool` — already exists
PA is restricted to sandbox	`restrict_to_workspace=True` — already exists in Nanobot config
PA sends/reads email	`EmailTool` — NEW mock tool (~60 lines), reads/writes workspace/email/
PA manages calendar	`CalendarTool` — NEW mock tool (~50 lines), reads/writes workspace/calendar.json
PA looks up contacts	`ContactsTool` — NEW mock tool (~30 lines), reads workspace/contacts.json
PA searches the web	`MockWebSearchTool` — replaces real WebSearch with predetermined results
Sandbox init from template	`shutil.copytree()` — one line
Sandbox checkpoint	`shutil.copytree()` — one line
Sandbox restore	`shutil.rmtree()` + `shutil.copytree()` — two lines

Mock tools design

Core principle: The benchmark evaluates interaction style, not tool execution quality. Mock tools provide deterministic, reproducible tool results so that the PA’s interaction behavior (confirmation frequency, verbosity, proactivity, etc.) is the only free variable.

EmailTool (~60 lines)

class EmailTool(Tool):
    """Simulated email — reads/writes JSON files in workspace/email/."""
    actions: send, read_inbox, search, draft
    # send → writes to workspace/email/sent/{timestamp}.json, returns "Email sent successfully."
    # read_inbox → reads workspace/email/inbox/*.json, returns list
    # search → filters inbox by keyword
    # draft → writes to workspace/email/drafts/{subject}.json

Covers: Client communication, Team coordination, Stakeholder reporting, Recruitment correspondence, Vendor negotiation, Professional networking, Friend & family messaging, Landlord/Medical/Government correspondence.

CalendarTool (~50 lines)

class CalendarTool(Tool):
    """Simulated calendar — reads/writes workspace/calendar.json."""
    actions: create_event, list_events, update_event, delete_event, check_conflicts
    # All operations on a simple JSON list of {title, start, end, attendees, notes}

Covers: Project planning & scheduling, Meeting preparation, Social event coordination, Service bookings, Medical appointments.

ContactsTool (~30 lines)

class ContactsTool(Tool):
    """Read-only contact directory — reads workspace/contacts.json."""
    actions: lookup, list_all
    # Pre-seeded with persona-appropriate contacts (colleagues, friends, services)

Covers: All External-facing scenarios that need recipient context.

MockWebSearchTool (replaces WebSearchTool)

class MockWebSearchTool(Tool):
    """Returns predetermined search results from workspace/web_cache/."""
    # Query → hash → lookup in web_cache/{hash}.json
    # If no cache hit, returns generic "No relevant results found."
    # Cache is pre-seeded per persona per scenario during data authoring.

Covers: Research & literature review, Learning new tools, Shopping research, Travel planning. Results are deterministic across experiment runs.

What does NOT need a dedicated tool

Task category	Looks like it needs	Actually handled by
Financial management & budgeting	Accounting tool	`EditFileTool` on `budget.csv`
E-commerce purchasing	Shopping tool	`MockWebSearchTool` + file record
Social media posting	Social media API	`WriteFileTool` to `social_posts/`
Subscription management	Subscription tool	`EditFileTool` on `subscriptions.json`
Service bookings & reservations	Booking system	`CalendarTool` + `EmailTool`
Presentation & public speaking	PPT tool	`WriteFileTool` for markdown outline

These don’t need dedicated tools because we’re testing interaction style, not tool functionality. The PA using WriteFileTool to write an email draft vs EmailTool to “send” an email triggers the same interaction preference signals (confirmation, verbosity, proactivity).

MemPA Wiki

Explorer

MemPABench_Implementation_Plan