MemPABench: The Preference Game
Benchmarking Memory Systems for Personal Assistants on Interaction Preference
1. Problem Statement
LLM-based personal assistants (PAs) like OpenClaw and Letta are moving from task-completion tools toward persistent, user-adapted companions. What separates them is memory — not just storing facts, but understanding a user as a whole person whose interaction needs vary across contexts and evolve over time.
An agent remembers what happened in a task; a PA must remember who you are. And who you are is not a static profile, but a dynamic, context-dependent identity. The same person wants terse responses when debugging, detailed explanations when learning something new, formal tone when drafting emails to their boss, and casual banter during brainstorming.
We distinguish two types of preference:
- Content preferences (what existing benchmarks test): factual attributes like “I prefer Python,” “end emails with Best regards.” LOCOMO [1] and LongMemEval [2] already evaluate these well.
- Interaction preferences (what we test): how the PA communicates — verbosity, confirmation frequency, proactivity, explanation depth, tool transparency — corresponding to PrefIx’s [5] 14 interaction attributes, but context-dependent and evolving over time.
We further distinguish interaction preferences along three axes:
- Context-dependency: The same user has different preferences in different scenarios.
- Temporal evolution: Preferences change over time as the user’s role, expertise, or circumstances change.
- Source: Static/pre-configured (PrefIx’s IaaT assumes this) vs. interaction-derived (learned implicitly from behavioral cues, corrections, satisfaction signals — our target).
To our knowledge, no existing benchmark evaluates whether memory systems enable PAs to learn and adapt to context-dependent, evolving, interaction-derived preferences:
- LOCOMO [1], LongMemEval [2], MemoryAgentBench [3]: factual recall, not interaction style.
- MemoryArena [4]: decision-relevant memory, but for task decisions, not interaction adaptation.
- PrefIx [5]: interaction quality evaluation, but treats preferences as static, explicitly configured tool calls.
- BenchPreS [11]: preference selectivity, but only in narrow formal-communication contexts.
- PrefEval [18]: preference following, but found accuracy below 10% after 10 turns; memory’s role unexamined.
2. Research Gap
Does memory help a PA interact better with a specific person, in a way that adapts to context and evolves over time?
Both content and interaction preferences can collide with base-model priors induced during training (e.g., safety-disclaimer reflexes, balanced-recommendation habits, default hedging). The key asymmetry between them is activation coverage: content priors fire domain-specifically — only when a relevant topic arises (Python vs. Perl only when coding comes up; health norms only when food is discussed). Interaction priors (verbosity, confirmation, hedging, tone, autonomy) fire across every generation, regardless of topic.
Consequently, a user’s interaction preference must override training-induced priors on every turn, while a content preference faces override pressure only sporadically. Memory architectures built around query-triggered retrieval match the sporadic-activation profile of content preferences, but have no natural interface for the continuous conditioning that interaction preferences require. This is a structural mismatch — not one that additional interaction-preference data in an existing benchmark would resolve. A benchmark that isolates interaction preference as its evaluation target, across contexts and over time, is needed to surface whether current memory architectures bridge this gap.
Put differently: content preference memory is usually a lookup problem. The agent can retrieve a fact like “prefers Python” when a coding topic appears, use it to change the task content, and then stop. Interaction preference memory is a control-policy problem. The agent must keep the user’s preferred style active as a standing constraint on generation itself: how much to explain, whether to hedge, when to confirm, how proactive to be, what process detail to reveal. These preferences are also more context-bound and co-active: the agent often needs the right combination of tone + verbosity + autonomy + disclosure at once, before any single sentence is produced. That is why interaction preference is not just “another user fact” in memory. It requires memory that can continuously condition behavior, gate default model style, and switch policies when context changes.
To our knowledge, no existing benchmark formalizes or evaluates this. We provide the first benchmark that does.
3. Literature Review
3.1 Memory Benchmarks for LLM Agents
LOCOMO [1] established multi-session conversational memory evaluation for factual recall; LongMemEval [2] extended to temporal reasoning and knowledge updates; MemoryAgentBench [3] took a cognitive science angle with four competencies; MemoryArena [4] showed passive recall ≠ decision-relevant memory. Also: MemBench [12], Locomo-Plus [23], AMemGym [24].
Common point: all evaluate whether the agent remembers correctly. None ask whether remembering changes how the agent interacts.
3.2 Personalization and Preference Evaluation
PrefEval [18]: accuracy fell below 10% at 10 turns. BenchPreS [11]: GPT-5.2 misapplies preferences 40.95% of the time. Also: PersonalLLM [30], RealPref [29], LifeSim [25].
These focus on content preferences, not interaction preferences. BenchPreS comes closest but only tests formal communication contexts.
3.3 Interaction Quality Evaluation
We build on PrefIx [5]. PrefIx’s IaaT treats interaction preferences as static, explicitly configured tool calls. It does not address how preferences are learned from interaction history, how they vary by context, or how they evolve.
3.4 Agent Memory Systems
Following Hu et al. [15], we treat agent memory as a persistent, self-evolving cognitive state that accumulates across sessions. We focus on token-level forms (transparent, editable, evaluable) and target the experiential → skill-based subcategory. Frameworks under evaluation: Mem0 [16] (graph + vector) and MemOS [33] (tree + memcube) in v1; Zep/Graphiti [34, 36] (temporal KG) deferred to v2 (see § 4.7). None natively supports context-conditioned preference storage or retrieval, which is why we evaluate them.
4. Benchmark Design
4.1 Design Principles
MemPABench evaluates observable outcomes: did the PA interact correctly with this person in this context at this time point? It is fully method-agnostic — any system that achieves good interaction adaptation will score well.
Separation of free interaction segments and forced event nodes:
| | Free interaction segments | Forced event nodes |
|---|---|---|
| Task content | Fixed (scripted) | Fixed (scripted event) |
| Interaction style | Free — memory-driven differences manifest here | Fixed — event and user reaction are controlled |
Task content is always fixed across conditions. The only free axis is interaction style. Forced event nodes are controlled within-subject manipulations: every memory system experiences the same preference-shifting stimulus. Narrative plausibility provides diegetic motivation, but the function is experimental control.
Scenario layering (Primary + Exploratory):
Primary scenarios (universal: email, scheduling, document editing, info lookup, learning) cover all personas for cross-persona comparison. Exploratory scenarios (domain-specific: coding, sales, management) are assigned by persona subset for additional within-persona data, reported as supplementary analysis.
Personalization necessity check:
For each scenario × attribute combination, we verify that GT values diverge across personas. If all personas share the same GT for a given scenario × attribute, that item tests a task norm rather than personalization — it is excluded from CS scoring. This ensures the benchmark only credits memory systems for genuinely user-specific adaptation.
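The check above can be sketched as a simple filter over ground-truth records. This is a minimal illustration, not the benchmark's implementation: the record layout, the `personalized_items` helper, and the label values are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical ground-truth (GT) records: (persona, scenario, attribute, value).
# The values are illustrative labels, not the benchmark's real annotations.
GT = [
    ("User A", "email", "verbosity", "terse"),
    ("User C", "email", "verbosity", "detailed"),
    ("User F", "email", "verbosity", "terse"),
    ("User A", "email", "confirmation", "each"),
    ("User C", "email", "confirmation", "each"),
    ("User F", "email", "confirmation", "each"),
]


def personalized_items(gt_records):
    """Return the (scenario, attribute) pairs whose GT differs across personas."""
    values = defaultdict(set)
    for _persona, scenario, attribute, value in gt_records:
        values[(scenario, attribute)].add(value)
    # One distinct value means every persona agrees: a task norm, not personalization.
    return {item for item, seen in values.items() if len(seen) > 1}


kept = personalized_items(GT)
print(sorted(kept))  # [('email', 'verbosity')]
```

In this toy data, email × confirmation is dropped from CS scoring because all three personas share the same GT value, while email × verbosity is kept.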
4.2 Three Evaluation Dimensions
These three dimensions are operationalized into 8 user-centric metrics — each tied to a concrete question users have when interacting with a PA (“Does it get me?”, “Did I have to fight to get good service?”, etc.). See MemPABench_Evaluation_Metrics for the full breakdown (CS gets 2 metrics, PET gets 3, IQI gets 3) and the self-report vs. Judge evaluation split.
Dimension 1: Context Sensitivity (CS)
Does the system adapt interaction style when the same user switches contexts?
The PA interacts with the user across multiple contexts (email, debugging, learning, etc.). At test sessions, a judge evaluates whether the PA’s interaction style matches the user’s annotated preference for that context, using a PrefIx-adapted attribute framework with 1–5 scoring per attribute.
Sub-tests: within-user variance (should be high), between-user variance, novel context generalization.
Only scenario × attribute combinations where GT diverges across personas are counted (personalization necessity check).
Dimension 2: Preference Evolution Tracking (PET)
When a user’s interaction preferences change over time, does the system track the change?
Forced event nodes trigger preference shifts at known points. We test at multiple time points (before, during, after):
- Does it maintain the old preference before the change?
- Does it detect the change within a reasonable window?
- Does it apply the new preference consistently after detection?
- Does it correctly distinguish temporary mood from permanent change?
Lag measurement: The detection window is measured from the first observable behavioral evidence, i.e., the first session in which the simulated user’s behavior reflects the new preference, not from the hidden event timing. If the user simulator changes an internal label without providing conversational evidence, failure to detect is not counted as a miss.
Sub-tests: gradual drift vs. sudden switch, explicit correction vs. implicit behavioral drift, naturalistic trigger (agent as witness).
Metric: TBD (accuracy of preference state at queried time points, with lag penalty for delayed detection — exact formula to be defined).
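The PET formula itself is still to be defined, so the sketch below illustrates only the lag-measurement rule stated above. The function name, the `LagResult` type, and the session-index inputs are hypothetical; counting same-session adaptation as lag 0 is also an assumption.

```python
from typing import NamedTuple, Optional, Sequence


class LagResult(NamedTuple):
    status: str          # "no_evidence", "missed", or "detected"
    lag: Optional[int]   # sessions from first evidence to first aligned behavior


def detection_lag(evidence: Sequence[int], aligned: Sequence[int]) -> LagResult:
    """Measure lag from the first observable evidence, not from the hidden event.

    evidence: sessions where the simulated user's behavior shows the new preference.
    aligned:  sessions where the PA's behavior matched the new preference.
    """
    if not evidence:
        # Internal label changed without conversational evidence: not a miss.
        return LagResult("no_evidence", None)
    first_evidence = min(evidence)
    # Assumption: adapting within the evidence session itself counts as lag 0.
    after = [s for s in aligned if s >= first_evidence]
    if not after:
        return LagResult("missed", None)
    return LagResult("detected", min(after) - first_evidence)


# The hidden event session is never passed in; only observable evidence matters.
result = detection_lag(evidence=[42, 43, 44], aligned=[10, 45, 46])
print(result)  # LagResult(status='detected', lag=3)
```

Session 10 is ignored in the example because behavior that happened to match the new preference before any evidence appeared is not detection.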
Dimension 3: Interaction Quality Impact (IQI)
End-to-end: does memory make the PA interact better with this specific person?
Compare across conditions: no memory, memory framework X (raw), oracle, static profile prompt. PrefIx-adapted judge scores for each condition.
Sub-tests: progressive improvement (session 10 vs 25 vs 50), personalization gap, degradation resistance.
Metric: TBD (improvement in judge score relative to no-memory baseline — exact formula to be defined).
4.3 Evaluation Protocol
E2E Track Only
All evaluation is end-to-end: the PA actually executes tasks (including tool use, multi-step reasoning, natural language generation) during test sessions. Evaluation is performed by multiple LLM judges (to report cross-judge consistency), each scoring per PrefIx-adapted attribute on a 1–5 scale. The specific attribute taxonomy will be adapted from PrefIx for our PA evaluation context.
Only test sessions are sent to judges — not the full interaction history. This keeps judge costs manageable while testing at the moments that matter.
Judge-human validation: On a representative subset of test sessions, we collect blinded human ratings using the same 1–5 scale. We report judge-human correlation (Spearman/Pearson) per attribute to validate the automated evaluation.
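As an illustration of the per-attribute correlation report, the following standard-library sketch computes Spearman's rho as Pearson's r on tie-averaged ranks. The rating lists are made-up placeholders, and a real analysis would more likely use an established statistics package.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r; raises ZeroDivisionError if either list is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 1-5 ratings for one attribute on six test sessions.
judge_scores = [2, 4, 4, 5, 1, 3]
human_scores = [1, 4, 5, 5, 2, 3]
print(round(spearman(judge_scores, human_scores), 3))
```

Because 1–5 ratings produce many ties, the rank step averages tied positions rather than breaking ties arbitrarily.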
Fixed Test Sessions (36 per persona)
Test sessions are fixed by design, not hook-triggered:
| Test Type | Timing | Count | Purpose |
|---|---|---|---|
| Pre-event snapshot | Immediately before each event session | 6 (one per evolving preference) | Capture pre-event state for PET comparison |
| Final-state probe | After session 114 (all accumulation complete) | 30 (one per preference cell) | Test learned state across all 30 preferences |
Each test session probes exactly 1 preference attribute in the relevant context. No stepwise or off-policy evaluation conditions — on-policy only.
Test sessions are read-only probes, not additional preference-learning sessions. The user request must be preference-neutral: it states the task need but does not ask for the target interaction style, use synonyms of the target setting, or correct the PA when the style is wrong. The task must remain completable under plausible wrong settings, so the metric is preference alignment rather than task success. Judges score only PA-observable behavior such as response form, IX tool choice, tool path, clarification pattern, proactivity, or disclosure style; user reaction is not scoring evidence. Test-session interactions default to no preference-memory writes. If a harness records task outcome, it is ordinary task memory rather than preference evidence.
Pipeline
Unlike PrefIx, we do not pre-fill interaction history into the context window. The memory system must earn its knowledge through live interaction, because memory formation and storage are part of what we are evaluating.
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: ACCUMULATION │
│ │
│ Simulator (User) ←──live conversation──→ Agent + Memory │
│ temp=0 │
│ │
│ Sessions 1–114 (accumulation) + 36 test sessions │
│ 6 evolving prefs: pre-event snapshot + post-event probe │
│ │
│ ── Forced event node (naturalistic trigger) ── │
│ Dual-path branching ensures event occurs regardless │
│ of agent behavior │
│ │
│ Test sessions: memory read-only, agent responses │
│ sent to multiple LLM judges │
│ │
└─────────────────────────────────────────────────────────────┘
4.4 Personas: TV Show Characters (Anonymized)
Design Rationale
Fully synthetic personas lack ecological validity. Real user data is impractical. We ground personas in well-documented TV show characters from modern American sitcoms: The Big Bang Theory, Silicon Valley, and The Office US. The benchmark uses 4–6 personas — 4 for the core suite (1 Easy + 1 Medium + 1 Hard + 1 Trap), optionally extended to 5–6 for supplementary analysis.
Anonymization: During benchmark execution, all character-identifying information is stripped. Personas are referred to by anonymized labels (User A, User B, …). The TV character grounding is used only during the annotation phase (for sourcing behavioral evidence) and disclosed in the paper for reproducibility. The agent under evaluation never sees character names or identifying details. See MemPABench_Anonymization_Protocol for the operational protocol (field-by-field strip/keep rules + per-persona validation via probe.py).
Leakage baseline: To quantify pretraining contamination, we run a zero-memory condition with both anonymized and named personas. If named personas score significantly higher than anonymized ones in the no-memory condition, the gap estimates the leakage floor, and the anonymized version becomes the authoritative benchmark.
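One way to make "significantly higher" concrete is a paired test over the same test sessions run in both conditions. The sketch below uses an exact one-sided sign-flip permutation test. The choice of test and the scores are assumptions for illustration, since the document does not fix the procedure for this comparison.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_p(named, anonymized):
    """Exact one-sided sign-flip test for mean(named - anonymized) > 0.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [n - a for n, a in zip(named, anonymized)]
    observed = mean(diffs)
    count = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if mean([s * d for s, d in zip(signs, diffs)]) >= observed:
            count += 1
    return observed, count / total


# Hypothetical no-memory judge scores (1-5) on six matched test sessions.
named_scores = [3, 4, 3, 4, 3, 4]
anon_scores = [2, 3, 3, 3, 2, 3]
gap, p_value = paired_sign_flip_p(named_scores, anon_scores)
print(gap, p_value)
```

Here the observed gap is the leakage estimate, and the p-value is the fraction of sign assignments that produce a gap at least as large.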
Persona Archetypes (design reference)
The table below catalogs persona archetypes considered during design. The core suite (4 personas) covers all four difficulty levels (Easy, Medium, Hard, Trap); the extended suite (optional) adds 1–2 more for richer cross-persona analysis. Archetypes not implemented remain documented for design transparency.
| Persona | Source | Core Challenge | Difficulty |
|---|---|---|---|
| User A | BBT | Extreme context-dependency: different contexts demand opposite styles. Corrects immediately (strong signal). | Easy |
| User B | BBT | Hidden preferences, rarely corrects, satisfaction decays silently. Must infer from weak signals. | Hard |
| User C | BBT | Ultra-terse default. Career change triggers dramatic multi-attribute shift. Largest evolution amplitude. | Medium |
| User D | BBT | Preferences driven by emotional state, not task type. Flowery when comfortable, silent when anxious. | Hard |
| User E | SV | Confident in technical context, hedging in management. Career trauma inverts preferences. | Medium |
| User F | SV | Control group: nearly invariant. Minimal interaction everywhere. Over-learning gets penalized. | Trap |
| User G | SV | Overly accommodating surface. Gradual boundary assertion = weak signal. | Hard |
| User H | Office | Most unstable: erratic in authority role, over-sharing in personal context. Matures through life event. | Medium |
| User I | Office | Preferences determined by power hierarchy. Deferential up, authoritarian down. | Easy |
| User J | Office | Normally terse + disengaged. Suddenly proactive about things they care about. | Hard |
Annotation Pipeline
- Source selection: Well-documented episodes/seasons with clear behavioral evidence
- Trait extraction: From episode guides, fan wikis, published analyses → personality traits, communication patterns, contextual variations
- PrefIx-adapted mapping: Map traits to interaction attributes, conditioned on the 2-context IPaS grid (work, personal)
- Evolution design: 1 primary event (from stimulus template) + 1–2 exploratory events (from character arcs)
- Independent annotation: 2–3 annotators independently label ground-truth preference profiles
- Inter-annotator agreement: Report Cohen’s kappa (pairwise) and Fleiss’ kappa (all annotators) per attribute. Items with low agreement are flagged and either resolved through discussion or excluded.
Ground truth labels represent annotator consensus on the preferred interaction style for each persona × context × attribute combination — not claims about objective truth.
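For the pairwise agreement figure in the pipeline above, a minimal Cohen's kappa computation looks like the following. The labels are invented for the example, and the function is a plain textbook implementation rather than the project's annotation tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators independently pick the same label.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verbosity labels from two annotators on four persona x context cells.
annotator_1 = ["terse", "terse", "detailed", "detailed"]
annotator_2 = ["terse", "detailed", "detailed", "detailed"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

The annotators agree on 3 of 4 items (0.75) against a chance agreement of 0.5, giving kappa = 0.5. Fleiss' kappa for all annotators follows the same observed-versus-chance logic but is not shown here.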
Interaction History
150 sessions per persona (114 accumulation + 36 test), spanning 2 contexts (Work, Personal; Internal/External merged into each domain). Session count is the benchmark interaction history length, not the source of ground-truth annotation: persona GT comes from the pre-authored 2 × 15 matrix, while sessions provide repeated opportunities for the PA to learn, apply, and update active preferences.
Accumulation (114 sessions):
- 30 preference cells (15 attributes × 2 contexts) × 3 reps = 90 base sessions
- 6 evolving preferences × (1 event session + 3 post-event sessions) = 24 additional sessions
- Each session targets exactly 1 preference attribute; 3 reps per cell ensures reliable signal before testing
- Preferences are expressed through behavioral signals and direct corrections — explicit pushback is acceptable (not required to be purely implicit)
6 evolving preferences: 3 in Work context, 3 in Personal context, each a distinct attribute. Each has its own event session placed after its 3 pre-event accumulation sessions; events are staggered through the timeline. In forced event sessions, the simulator follows a dual-path branching script (see §4.5).
Test (36 sessions, on-policy only):
- 30 preference cells × 1 final-state probe = 30 sessions (run after all 114 accumulation sessions)
- 6 evolving preferences × 1 pre-event snapshot probe = 6 sessions (each run immediately before its event)
- No off-policy or stepwise evaluation conditions.
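The session budget above can be checked with a few lines of arithmetic. All quantities come from the counts stated in this section; only the variable names are ours.

```python
ATTRIBUTES = 15
CONTEXTS = 2
REPS_PER_CELL = 3
EVOLVING_PREFS = 6
POST_EVENT_SESSIONS = 3

cells = ATTRIBUTES * CONTEXTS                                  # 30 preference cells
base_sessions = cells * REPS_PER_CELL                          # 90 base sessions
event_sessions = EVOLVING_PREFS * (1 + POST_EVENT_SESSIONS)    # 24 additional sessions
accumulation = base_sessions + event_sessions                  # 114 accumulation sessions

final_probes = cells                                           # 30 final-state probes
pre_event_probes = EVOLVING_PREFS                              # 6 pre-event snapshots
test_sessions = final_probes + pre_event_probes                # 36 test sessions

total_per_persona = accumulation + test_sessions               # 150 sessions
print(accumulation, test_sessions, total_per_persona)          # 114 36 150
```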
4.5 Scenario and Event Design
Scenario Layering
| Layer | Scenarios | Persona Coverage | Analysis Status |
|---|---|---|---|
| Primary | Email, scheduling, document editing, info lookup, learning new topic | All personas | Main analysis: all cross-persona comparisons |
| Exploratory | Coding, lab, sales, management, hobby projects | By persona subset | Supplementary: reported if insightful |
Forced Event Nodes: Functional Equivalence
All personas experience functionally equivalent events — same abstract structure, different narrative:
Primary Event Template (all personas, main analysis):
Stimulus Template A: Irreversible Operation Error
Type: User's own action causes irreversible data/content loss
Context: Work
Visibility: External (colleagues/clients will know)
Expected: Confirmation preference shifts (e.g., Silent → Each)
Tests: PET on Confirmation attribute
| Persona | Narrative | Functional Structure |
|---|---|---|
| User A | Accidentally deleted experiment analysis scripts | Irreversible loss × work × external |
| User C | Sent wrong pricing to major client | Irreversible loss × work × external |
| User E | Dropped a production database table | Irreversible loss × work × external |
| User F | Same incident occurs, but preference doesn’t change | Control: same stimulus, no GT change → tests false positives |
| … | (each persona has domain-appropriate version) | Same functional structure |
Exploratory Events (persona-specific, supplementary):
| Persona | Event | Preference Change |
|---|---|---|
| User A | Life event: partner convinces them to try flexibility | Temporary change → revert (temp vs. permanent test) |
| User C | Career change: service industry → corporate | Multi-attribute permanent shift |
| User E | Ousted from leadership position | Role-specific preferences invert |
Dual-Path Branching (Naturalistic Trigger)
Events occur regardless of agent behavior. The simulated user’s script branches:
User issues ambiguous/risky instruction
→ Path A: Agent confirms → User overrides ("Yes, do it")
→ Path B: Agent executes → User expresses regret
→ Both paths: event occurs, preference shifts
The agent is a witness, not a gatekeeper. The perturbation is not contingent on the agent’s memory or decision-making.
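The branching script can be sketched as a function whose outcome does not depend on which path the agent takes. The type and function names are hypothetical, and the regret line on Path B is a placeholder; only "Yes, do it" and the control-persona behavior come from the design above.

```python
from dataclasses import dataclass


@dataclass
class EventOutcome:
    path: str                # "A" (agent confirmed) or "B" (agent executed)
    user_line: str           # scripted user reaction
    event_occurred: bool     # always True: the event is forced
    preference_shifts: bool  # False only for a control persona such as User F


def forced_event(agent_asked_confirmation: bool, gt_changes: bool = True) -> EventOutcome:
    """Resolve a forced event node; the event happens on both paths."""
    if agent_asked_confirmation:
        # Path A: the user overrides the confirmation, so the risky action still runs.
        return EventOutcome("A", "Yes, do it", True, gt_changes)
    # Path B: the agent executed directly and the user regrets the result.
    return EventOutcome("B", "I wish that had not happened", True, gt_changes)


for asked in (True, False):
    print(forced_event(asked))
```

Passing `gt_changes=False` models the control case, where the same stimulus occurs but the ground-truth preference stays put.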
4.6 Experimental Conditions
The benchmark evaluates 10 conditions organized into four tiers. v1 implements 9 of them; Graphiti is deferred to v2 (rationale in § 4.7).
Floor / no-memory
| # | Condition | What it tests |
|---|---|---|
| 1 | No memory (anonymized) | Lower bound: fresh context each session; answers “does memory help at all?” |
| 2 | No memory (named) | Leakage baseline: pretraining contamination from named TV characters (H4) |
Middle baselines
| # | Condition | What it tests |
|---|---|---|
| 3 | Rolling summary | Simple LLM-summarized memory — cheap baseline; how much does sophisticated framework structure add over a one-paragraph rolling summary? |
| 4 | Full-history long-context | Raw transcripts of all past sessions packed into context — tests whether modern memory frameworks beat brute-force 1M-context (H2) |
| 5 | Simple RAG (raw chunks) | Mem0 with infer=False: vector RAG without LLM extraction — extraction-bottleneck ablation against #8 |
Ceilings
| # | Condition | What it tests |
|---|---|---|
| 6 | Static profile prompt | Natural-language persona description in system prompt; tests whether unstructured profile knowledge is sufficient |
| 7 | Oracle (IPaS matrix) | Structured 2 × 15 IPaS ground-truth matrix injected into prompt; perfect structured preference knowledge ceiling (H5) |
Frameworks under test
| # | Condition | Status |
|---|---|---|
| 8 | Mem0 (infer=True) | v1 |
| 9 | MemOS | v1 |
| 10 | Zep/Graphiti | v2 — deferred (see § 4.7) |
Stories this matrix tells
- Memory at all: 1 vs 8/9
- Anonymization works: 1 vs 2 (H4)
- Modern frameworks vs. 1M-context: 4 vs 8/9 (H2)
- LLM extraction helps or hurts: 5 vs 8 (the cleanest ablation in the matrix; both are Mem0 with one flag flipped)
- Architecture diversity: 8 vs 9 in v1 (vector vs cube-tree); 8 vs 9 vs 10 in v2 (vector / tree / temporal-graph)
- Structured preference > unstructured profile: 6 vs 7
All 10 conditions share a unified PA interface (the agent-shaped BasePA ABC). Baselines #1, #3, #4, #5, #6, #7 are implemented as “fake backends” satisfying the same ABC as the real frameworks, so the harness has no per-condition special cases. See MemPABench_Implementation_Plan.md for the implementation breakdown.
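To illustrate the "fake backend" idea, here is a minimal sketch of what a shared abstract base class and a no-memory baseline could look like. The method names, signatures, and the `write_memory` flag are assumptions for illustration only; the actual interface is specified in the implementation plan.

```python
from abc import ABC, abstractmethod


class BasePA(ABC):
    """Hypothetical shape of the shared interface; method names are assumptions."""

    @abstractmethod
    def start_session(self, user_id: str) -> None: ...

    @abstractmethod
    def respond(self, user_message: str) -> str: ...

    @abstractmethod
    def end_session(self, write_memory: bool = True) -> None: ...


class NoMemoryPA(BasePA):
    """Condition 1 as a fake backend: every session starts from empty context."""

    def __init__(self):
        self.context = []

    def start_session(self, user_id):
        self.context = []  # nothing is carried over from earlier sessions

    def respond(self, user_message):
        self.context.append(user_message)
        return f"(reply using {len(self.context)} in-session message(s))"

    def end_session(self, write_memory=True):
        self.context = []  # nothing is persisted, whatever the flag says


def run_session(pa: BasePA, user_id, messages, write_memory=True):
    """The harness sees only the abstract interface, never a concrete backend."""
    pa.start_session(user_id)
    replies = [pa.respond(m) for m in messages]
    pa.end_session(write_memory=write_memory)
    return replies


pa = NoMemoryPA()
first = run_session(pa, "User A", ["hi", "again"])
second = run_session(pa, "User A", ["hello"], write_memory=False)  # read-only probe
print(first, second)
```

Because `run_session` depends only on the base class, a real framework adapter and a baseline can be swapped without branching in the harness.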
4.7 Practical Setup
PA Runtime
NanoClaw [32]: minimal OpenClaw-compatible framework. Simple architecture (Channels → SQLite → Agent SDK → Response) makes it easy to swap memory backends. Default memory = plain CLAUDE.md file (no structured memory = clean baseline).
Memory Frameworks Under Evaluation
| Framework | Structure | Why Selected | Status |
|---|---|---|---|
| Mem0 [16] | Graph + vector | Industry standard. ADD/UPDATE/DELETE logic can track changes. No context-conditioning. | v1 |
| MemOS [33] | Tree + memcube | Most comprehensive eval coverage. Tree structure supports hierarchy. No context-dependent activation. | v1 |
| Zep/Graphiti [34, 36] | Temporal KG | Bi-temporal validity windows for evolution tracking. No context-conditioned retrieval. | v2 — deferred |
V1 vs V2 Framework Selection
V1 evaluates Mem0 + MemOS only. Zep/Graphiti is deferred to v2 for three reasons:
- Async/sync architectural friction. Graphiti is async-only; Mem0 and MemOS are sync. Integrating all three into a single sync-first benchmark harness adds wrapping overhead that is best amortized once the v1 framework selection has been validated.
- The sharpest C5 (context-conditioning) contrast already exists in v1. Mem0’s emulated metadata filter vs. MemOS’s search(), which does not accept any metadata filter at all, is the clearest expression of the structural mismatch H1 predicts. Adding Graphiti’s native group_ids filter narrows the contrast (all three would have some form of filter) without strengthening H1.
- Graphiti’s distinctive bi-temporal model serves the PET axis specifically. That axis is best evaluated alongside temporal-aware scoring infrastructure that v1 does not yet provide. v2 integrates Graphiti together with the matching evaluation upgrades.
Why these three?
- Architectural diversity: vector-based vs. tree-based vs. graph-based
- All support both factual and experiential memory
- All actively maintained open-source projects
- All report scores on existing benchmarks for contextualization
4.8 Core Experiment
Core Suite (main claims)
4 personas (1 easy + 1 medium + 1 hard + 1 trap/control)
× 9 v1 conditions (see § 4.6: conditions 1-9; condition 10 Graphiti added in v2)
× primary scenarios only
× E2E track
× 3 evaluation dimensions
Extended Suite (supplementary, optional)
Up to 1–2 additional personas (total 5–6), exploratory scenarios, additional LLMs as assistant models.
Cost Estimate
TBD — will include: estimated sessions, turns per session, tokens per turn, judge calls, total API cost per full core suite run.
4.9 Statistical Strategy
Primary analysis on primary scenarios, core suite:
- Mixed-effects model: memory system = fixed effect, persona = random effect → controls persona difficulty
- Bootstrap confidence intervals for all metrics
- Paired significance tests (framework X vs. no-memory baseline)
- Cross-judge consistency: report inter-judge agreement across multiple LLM judges
Supplementary analysis on exploratory scenarios and extended suite: reported separately, clearly labeled.
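As a sketch of the bootstrap step in the primary analysis, the following computes a percentile confidence interval for the mean paired difference between a framework and the no-memory baseline. The scores are hypothetical, and resampling individual sessions is a simplification: it ignores the persona-level grouping that the mixed-effects model accounts for.

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical judge scores (1-5) on ten matched test sessions.
framework_scores = [4, 3, 5, 4, 4, 3, 4, 5, 4, 4]
baseline_scores = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
diffs = [f - b for f, b in zip(framework_scores, baseline_scores)]

lo, hi = bootstrap_ci(diffs)
print(mean(diffs), (lo, hi))
```

An interval that excludes zero would support the paired comparison against the baseline; a persona-clustered bootstrap would resample personas first and sessions within them.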
Validation checks:
- Judge-human correlation on representative subset
- Leakage baseline (named vs. anonymized, no-memory condition)
- Personalization necessity check (exclude items where all personas share same GT)
- Inter-annotator agreement per attribute
- Sensitivity across judge models
4.10 Game Layer (Presentation — supplementary material / project website)
The benchmark results are presented through a game-like interface for accessibility and engagement. This is a presentation layer that does not affect the underlying evaluation — all scientific claims are based on the metrics defined above.
Elements (detailed in supplementary / website):
- Four-act narrative structure: First Meeting → Getting to Know → The Turn → Long-Term
- Ending system: S/A/B/C/D endings per persona based on CS/PET/IQI thresholds, with character-voice quotes
- Achievement system: Binary indicators encoding qualitative capabilities (e.g., “Chameleon” = correct style switch within session, “Steady Hand” = no false updates during mood swings)
- Character select screen and result visualization
5. Expected Findings
- H1: CS will be low across all evaluated framework conditions. Existing systems treat the same user uniformly across contexts.
- H2: Full-history long-context baseline will outperform simple memory frameworks on IQI, revealing that current memory systems lose information compared to raw transcripts.
- H3 [v2 hypothesis]: Zep/Graphiti’s bi-temporal model may show advantages on PET due to native temporal validity tracking. Tested in v2 once Graphiti is integrated alongside temporal-aware PET scoring; v1 PET evaluation uses Mem0/MemOS metadata-based timestamps as a weaker proxy.
- H4: The leakage baseline (named vs. anonymized) will show measurable but bounded contamination, validating the anonymization approach.
- H5: Even oracle condition will not achieve perfect scores, indicating that translating known preferences into appropriate interaction behavior remains challenging for current LLMs.
- H6: User F (control/trap persona) will reveal false-positive rates — memory systems incorrectly detecting preference changes that didn’t occur.
6. Ethics and Limitations
Copyright and Fair Use
TV show characters are used as annotation sources only — behavioral evidence is extracted from publicly available episode guides, fan wikis, and published character analyses. The benchmark itself uses anonymized personas (User A, User B, …). No copyrighted dialogue, images, or video clips are included in the benchmark dataset.
Stereotype Risk
Grounding personas in fictional characters risks encoding stereotypes (e.g., gender-linked communication styles from Penny, cultural stereotypes from Raj). Mitigation:
- Annotators are instructed to base labels on documented behavioral evidence, not character stereotypes
- The anonymized benchmark strips character identity, preventing models from activating stereotypical associations
- We report per-persona performance breakdowns so that systematic biases can be identified
Simulation Fidelity
Simulated users are approximations. We do not claim that our personas represent real human behavior — they represent consistent, reproducible behavioral profiles for controlled evaluation. The benchmark tests whether memory systems can track and adapt to any consistent preference profile, not whether these profiles match real humans.
Limitations
- Interaction preferences are annotated by researchers, not derived from real user data. Inter-annotator agreement provides a reliability bound but not ecological validity.
- E2E evaluation depends on LLM judges, which may have systematic biases. Judge-human correlation provides a partial check.
- 4 personas in core suite may not capture the full diversity of interaction preference patterns.
- Single PA runtime (NanoClaw) may not generalize to all PA architectures.
References
| # | Title | Authors | Venue | Year |
|---|---|---|---|---|
| 1 | Evaluating Very Long-Term Conversational Memory of LLM Agents (LOCOMO) | Maharana et al. | ACL | 2024 |
| 2 | LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | Wu et al. | ICLR | 2025 |
| 3 | Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions | — | arXiv | 2025 |
| 4 | MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Tasks | He et al. | arXiv | 2026 |
| 5 | PrefIx: Understand and Adapt to User Preference in Human-Agent Interaction | Li et al. | arXiv | 2026 |
| 11 | BenchPreS: Context-Aware Personalized Preference Selectivity | Yoon et al. | arXiv | 2026 |
| 12 | TraceMem: Weaving Narrative Memory Schemata | Shu et al. | arXiv | 2026 |
| 15 | Memory in the Age of AI Agents: A Survey | Hu et al. | arXiv | 2026 |
| 16 | Mem0 Research | Mem0 Team | — | 2025 |
| 18 | PrefEval: Do LLMs Recognize Your Preferences? | Zhao et al. | ICLR | 2025 |
| 23 | Locomo-Plus: Beyond-Factual Cognitive Memory | Li et al. | arXiv | 2026 |
| 24 | AMemGym: Interactive Memory Benchmarking | Cheng et al. | arXiv | 2026 |
| 25 | LifeSim: Long-Horizon User Life Simulator | Duan et al. | arXiv | 2026 |
| 29 | RealPref: Realistic Personalization | Guo et al. | arXiv | 2026 |
| 30 | PersonalLLM: Tailoring LLMs to Individual Preferences | — | ICLR | 2025 |
| 32 | NanoClaw: Lightweight OpenClaw-compatible PA runtime | qwibitai | GitHub | 2026 |
| 33 | MemOS: A Memory OS for AI System | MemTensor | arXiv | 2025 |
| 34 | Zep: Long-Term Memory for AI Assistants | Zep AI | GitHub | 2025 |
| 35 | Agent Skills for Large Language Models | Xu & Yan | arXiv | 2026 |
| 36 | Zep: A Temporal Knowledge Graph Architecture for Agent Memory | Rasmussen et al. | arXiv | 2025 |