MemPABench: The Preference Game

Benchmarking Memory Systems for Personal Assistants on Interaction Preference

1. Problem Statement

LLM-based personal assistants (PAs) like OpenClaw and Letta are moving from task-completion tools toward persistent, user-adapted companions. What separates the two is memory: not just storing facts, but understanding a user as a whole person whose interaction needs vary across contexts and evolve over time.

An agent remembers what happened in a task; a PA must remember who you are. And who you are is not a static profile, but a dynamic, context-dependent identity. The same person wants terse responses when debugging, detailed explanations when learning something new, formal tone when drafting emails to their boss, and casual banter during brainstorming.

We distinguish two types of preference:

Content preferences (what existing benchmarks test): factual attributes like “I prefer Python,” “end emails with Best regards.” LOCOMO [1] and LongMemEval [2] already evaluate these well.
Interaction preferences (what we test): how the PA communicates — verbosity, confirmation frequency, proactivity, explanation depth, tool transparency — corresponding to PrefIx’s [5] 14 interaction attributes, but context-dependent and evolving over time.

We further distinguish interaction preferences along three axes:

Context-dependency: The same user has different preferences in different scenarios.
Temporal evolution: Preferences change over time as the user’s role, expertise, or circumstances change.
Source: Static/pre-configured (PrefIx’s IaaT assumes this) vs. interaction-derived (learned implicitly from behavioral cues, corrections, satisfaction signals — our target).

To our knowledge, no existing benchmark evaluates whether memory systems enable PAs to learn and adapt context-dependent, evolving, interaction-derived preferences:

LOCOMO [1], LongMemEval [2], MemoryAgentBench [3]: factual recall, not interaction style.
MemoryArena [4]: decision-relevant memory, but for task decisions, not interaction adaptation.
PrefIx [5]: interaction quality evaluation, but treats preferences as static, explicitly configured tool calls.
BenchPreS [11]: preference selectivity, but only in narrow formal-communication contexts.
PrefEval [18]: preference following, but found accuracy below 10% after 10 turns; memory’s role unexamined.

2. Research Gap

Does memory help a PA interact better with a specific person, in a way that adapts to context and evolves over time?

Both content and interaction preferences can collide with base-model priors induced during training (e.g., safety-disclaimer reflexes, balanced-recommendation habits, default hedging). The key asymmetry between them is activation coverage: content priors fire domain-specifically — only when a relevant topic arises (Python vs. Perl only when coding comes up; health norms only when food is discussed). Interaction priors (verbosity, confirmation, hedging, tone, autonomy) fire across every generation, regardless of topic.

Consequently, a user’s interaction preference must override training-induced priors on every turn, while a content preference faces override pressure only sporadically. Memory architectures built around query-triggered retrieval match the sporadic-activation profile of content preferences, but have no natural interface for the continuous conditioning that interaction preferences require. This is a structural mismatch — not one that additional interaction-preference data in an existing benchmark would resolve. A benchmark that isolates interaction preference as its evaluation target, across contexts and over time, is needed to surface whether current memory architectures bridge this gap.

Put differently: content preference memory is usually a lookup problem. The agent can retrieve a fact like “prefers Python” when a coding topic appears, use it to change the task content, and then stop. Interaction preference memory is a control-policy problem. The agent must keep the user’s preferred style active as a standing constraint on generation itself: how much to explain, whether to hedge, when to confirm, how proactive to be, what process detail to reveal. These preferences are also more context-bound and co-active: the agent often needs the right combination of tone + verbosity + autonomy + disclosure at once, before any single sentence is produced. That is why interaction preference is not just “another user fact” in memory. It requires memory that can continuously condition behavior, gate default model style, and switch policies when context changes.

To our knowledge, no existing benchmark formalizes or evaluates this. We provide the first benchmark that does.

3. Literature Review

3.1 Memory Benchmarks for LLM Agents

LOCOMO [1] established multi-session conversational memory evaluation for factual recall; LongMemEval [2] extended to temporal reasoning and knowledge updates; MemoryAgentBench [3] took a cognitive science angle with four competencies; MemoryArena [4] showed passive recall ≠ decision-relevant memory. Also: MemBench [12], Locomo-Plus [23], AMemGym [24].

Common point: all evaluate whether the agent remembers correctly. None ask whether remembering changes how the agent interacts.

3.2 Personalization and Preference Evaluation

PrefEval [18]: accuracy fell below 10% at 10 turns. BenchPreS [11]: GPT-5.2 misapplies preferences 40.95% of the time. Also: PersonalLLM [30], RealPref [29], LifeSim [25].

These focus on content preferences, not interaction preferences. BenchPreS comes closest but only tests formal communication contexts.

3.3 Interaction Quality Evaluation

We build on PrefIx [5]. IaaT treats interaction preferences as static, explicitly configured tool calls. It does not address how preferences are learned from interaction history, how they vary by context, or how they evolve.

3.4 Agent Memory Systems

Following Hu et al. [15], agent memory is a persistent, self-evolving cognitive state that accumulates across sessions. We focus on token-level forms (transparent, editable, evaluable) and target the experiential → skill-based subcategory. Frameworks under evaluation: Mem0 [16] (graph + vector) and MemOS [33] (tree + memcube) in v1; Zep/Graphiti [34, 36] (temporal KG) deferred to v2 (see § 4.7). None natively support context-conditioned preference storage or retrieval, which is why we evaluate them.

4. Benchmark Design

4.1 Design Principles

MemPABench evaluates observable outcomes: did the PA interact correctly with this person in this context at this time point? It is fully method-agnostic — any system that achieves good interaction adaptation will score well.

Separation of free interaction segments and forced event nodes:

	Free interaction segments	Forced event nodes
Task content	Fixed (scripted)	Fixed (scripted event)
Interaction style	Free — memory-driven differences manifest here	Fixed — event and user reaction are controlled

Task content is always fixed across conditions. The only free axis is interaction style. Forced event nodes are controlled within-subject manipulations: every memory system experiences the same preference-shifting stimulus. Narrative plausibility provides diegetic motivation, but the function is experimental control.

Scenario layering (Primary + Exploratory):

Primary scenarios (universal: email, scheduling, document editing, info lookup, learning) cover all personas for cross-persona comparison. Exploratory scenarios (domain-specific, e.g., coding, sales, management) are assigned by persona subset for additional within-persona data, reported as supplementary analysis.

Personalization necessity check:

For each scenario × attribute combination, we verify that GT values diverge across personas. If all personas share the same GT for a given scenario × attribute, that item tests a task norm rather than personalization — it is excluded from CS scoring. This ensures the benchmark only credits memory systems for genuinely user-specific adaptation.
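The necessity check above can be sketched as a simple filter. This is a minimal illustration, not the benchmark's implementation: the dictionary layout, the function name `personalization_items`, and the label values are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical ground-truth layout: (persona, scenario, attribute) -> label.
GT = {
    ("User A", "email", "verbosity"): "terse",
    ("User C", "email", "verbosity"): "detailed",
    ("User A", "email", "confirmation"): "each",
    ("User C", "email", "confirmation"): "each",
}

def personalization_items(gt):
    """Return the (scenario, attribute) pairs whose GT differs across personas."""
    labels = defaultdict(set)
    for (_persona, scenario, attribute), label in gt.items():
        labels[(scenario, attribute)].add(label)
    # One distinct label means the item reflects a task norm, so it is dropped.
    return {item for item, values in labels.items() if len(values) > 1}

scored_items = personalization_items(GT)
print(sorted(scored_items))
```

In this toy input only the verbosity item survives, because both personas share the same confirmation label for email.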

4.2 Three Evaluation Dimensions

The three dimensions below are operationalized into 8 user-centric metrics, each tied to a concrete question users have when interacting with a PA (“Does it get me?”, “Did I have to fight to get good service?”, etc.). See MemPABench_Evaluation_Metrics for the full breakdown (CS gets 2 metrics, PET gets 3, IQI gets 3) and the self-report vs. Judge evaluation split.

Dimension 1: Context Sensitivity (CS)

Does the system adapt interaction style when the same user switches contexts?

The PA interacts with the user across multiple contexts (email, debugging, learning, etc.). At test sessions, a judge evaluates whether the PA’s interaction style matches the user’s annotated preference for that context, using a PrefIx-adapted attribute framework with 1–5 scoring per attribute.

Sub-tests: within-user variance (should be high), between-user variance, novel context generalization.

Only scenario × attribute combinations where GT diverges across personas are counted (personalization necessity check).

Dimension 2: Preference Evolution Tracking (PET)

When a user’s interaction preferences change over time, does the system track the change?

Forced event nodes trigger preference shifts at known points. We test at multiple time points (before, during, after):

Does it maintain the old preference before the change?
Does it detect the change within a reasonable window?
Does it apply the new preference consistently after detection?
Does it correctly distinguish temporary mood from permanent change?

Lag measurement: The detection window is measured from the first session in which the simulated user’s behavior observably reflects the new preference, not from the hidden event timing. If the user simulator changes an internal label without providing conversational evidence, failure to detect is not counted as a miss.
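A minimal sketch of this lag rule follows. The per-session boolean flags, the function name `detection_lag`, and the convention that adapting in the same session as the first evidence counts as lag 0 are assumptions for illustration; the benchmark's exact PET formula is still to be defined.

```python
def detection_lag(evidence, adapted):
    """Measure lag from the first session with observable evidence.

    evidence[i]: the simulated user's behavior in session i reflects the new preference.
    adapted[i]:  the PA's behavior in session i matches the new preference.
    Returns a (status, lag) pair; lag is None unless status is "detected".
    """
    evidence = list(evidence)
    if True not in evidence:
        # Internal label changed without conversational evidence: not a miss.
        return ("no_evidence", None)
    first_evidence = evidence.index(True)
    for session in range(first_evidence, len(adapted)):
        if adapted[session]:
            return ("detected", session - first_evidence)
    return ("missed", None)

# Hidden event before session 2, but evidence first appears in session 4.
evidence = [False, False, False, False, True, True, True]
adapted = [False, False, False, False, False, True, True]
result = detection_lag(evidence, adapted)
print(result)
```

Because the clock starts at session 4 rather than at the hidden event, the PA in this example is charged one session of lag, not three.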

Sub-tests: gradual drift vs. sudden switch, explicit correction vs. implicit behavioral drift, naturalistic trigger (agent as witness).

Metric: TBD (accuracy of preference state at queried time points, with lag penalty for delayed detection — exact formula to be defined).

Dimension 3: Interaction Quality Impact (IQI)

End-to-end: does memory make the PA interact better with this specific person?

Compare across conditions: no memory, memory framework X (raw), oracle, static profile prompt. PrefIx-adapted judge scores for each condition.

Sub-tests: progressive improvement (session 10 vs 25 vs 50), personalization gap, degradation resistance.

Metric: TBD (improvement in judge score relative to no-memory baseline — exact formula to be defined).

4.3 Evaluation Protocol

E2E Track Only

All evaluation is end-to-end: the PA actually executes tasks (including tool use, multi-step reasoning, natural language generation) during test sessions. Evaluation is performed by multiple LLM judges (to report cross-judge consistency), each scoring per PrefIx-adapted attribute on a 1–5 scale. The specific attribute taxonomy will be adapted from PrefIx for our PA evaluation context.

Only test sessions are sent to judges — not the full interaction history. This keeps judge costs manageable while testing at the moments that matter.

Judge-human validation: On a representative subset of test sessions, we collect blinded human ratings using the same 1–5 scale. We report judge-human correlation (Spearman/Pearson) per attribute to validate the automated evaluation.
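The per-attribute correlation report could be computed along these lines. This is a standard-library sketch with hypothetical ratings; the helper names are illustrative, and a real analysis would more likely use an established statistics package.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical 1-5 ratings for one attribute on six test sessions.
judge = [1, 2, 3, 4, 5, 3]
human = [1, 2, 3, 5, 4, 3]
print(f"pearson={pearson(judge, human):.3f} spearman={spearman(judge, human):.3f}")
```

Tie handling matters here because a 1–5 scale produces many equal ratings; averaging tied ranks is the usual convention for Spearman's coefficient.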

Fixed Test Sessions (36 per persona)

Test sessions are fixed by design, not hook-triggered:

Test Type	Timing	Count	Purpose
Pre-event snapshot	Immediately before each event session	6 (one per evolving preference)	Capture pre-event state for PET comparison
Final-state probe	After session 114 (all accumulation complete)	30 (one per preference cell)	Test learned state across all 30 preferences

Each test session probes exactly 1 preference attribute in the relevant context. No stepwise or off-policy evaluation conditions — on-policy only.

Test sessions are read-only probes, not additional preference-learning sessions. The user request must be preference-neutral: it states the task need but does not ask for the target interaction style, use synonyms of the target setting, or correct the PA when the style is wrong. The task must remain completable under plausible wrong settings, so the metric is preference alignment rather than task success. Judges score only PA-observable behavior such as response form, IX tool choice, tool path, clarification pattern, proactivity, or disclosure style; user reaction is not scoring evidence. Test-session interactions default to no preference-memory writes. If a harness records task outcome, it is ordinary task memory rather than preference evidence.

Pipeline

Unlike PrefIx, we do not pre-fill interaction history into the context window. The memory system must earn its knowledge through live interaction, because memory formation and storage are part of what we are evaluating.

┌─────────────────────────────────────────────────────────────┐
│                   PHASE 1: ACCUMULATION                      │
│                                                              │
│   Simulator (User)  ←──live conversation──→  Agent + Memory  │
│        temp=0                                                │
│                                                              │
│   Sessions 1–114 (accumulation) + 36 test sessions           │
│   6 evolving prefs: pre-event snapshot + post-event probe    │
│                                                              │
│   ── Forced event node (naturalistic trigger) ──             │
│   Dual-path branching ensures event occurs regardless        │
│   of agent behavior                                          │
│                                                              │
│   Test sessions: memory read-only, agent responses           │
│   sent to multiple LLM judges                                │
│                                                              │
└─────────────────────────────────────────────────────────────┘

4.4 Personas: TV Show Characters (Anonymized)

Design Rationale

Fully synthetic personas lack ecological validity. Real user data is impractical. We ground personas in well-documented TV show characters from modern American sitcoms: The Big Bang Theory, Silicon Valley, and The Office US. The benchmark uses 4–6 personas — 4 for the core suite (1 Easy + 1 Medium + 1 Hard + 1 Trap), optionally extended to 5–6 for supplementary analysis.

Anonymization: During benchmark execution, all character-identifying information is stripped. Personas are referred to by anonymized labels (User A, User B, …). The TV character grounding is used only during the annotation phase (for sourcing behavioral evidence) and disclosed in the paper for reproducibility. The agent under evaluation never sees character names or identifying details. See MemPABench_Anonymization_Protocol for the operational protocol (field-by-field strip/keep rules + per-persona validation via probe.py).

Leakage baseline: To quantify pretraining contamination, we run a zero-memory condition with both anonymized and named personas. If named personas score significantly higher than anonymized ones in the no-memory condition, the gap quantifies the leakage floor, and the anonymized version serves as the authoritative benchmark.

Persona Archetypes (design reference)

The table below catalogs persona archetypes considered during design. The core suite (4 personas) covers all four difficulty levels (Easy, Medium, Hard, Trap); the extended suite (optional) adds 1–2 more for richer cross-persona analysis. Archetypes not implemented remain documented for design transparency.

Persona	Source	Core Challenge	Difficulty
User A	BBT	Extreme context-dependency: different contexts demand opposite styles. Corrects immediately (strong signal).	Easy
User B	BBT	Hidden preferences, rarely corrects, satisfaction decays silently. Must infer from weak signals.	Hard
User C	BBT	Ultra-terse default. Career change triggers dramatic multi-attribute shift. Largest evolution amplitude.	Medium
User D	BBT	Preferences driven by emotional state, not task type. Flowery when comfortable, silent when anxious.	Hard
User E	SV	Confident in technical context, hedging in management. Career trauma inverts preferences.	Medium
User F	SV	Control group: nearly invariant. Minimal interaction everywhere. Over-learning gets penalized.	Trap
User G	SV	Overly accommodating surface. Gradual boundary assertion = weak signal.	Hard
User H	Office	Most unstable: erratic in authority role, over-sharing in personal context. Matures through life event.	Medium
User I	Office	Preferences determined by power hierarchy. Deferential up, authoritarian down.	Easy
User J	Office	Normally terse + disengaged. Suddenly proactive about things they care about.	Hard

Annotation Pipeline

Source selection: Well-documented episodes/seasons with clear behavioral evidence
Trait extraction: From episode guides, fan wikis, published analyses → personality traits, communication patterns, contextual variations
PrefIx-adapted mapping: Map traits to interaction attributes, conditioned on the 2-context IPaS grid (work, personal)
Evolution design: 1 primary event (from stimulus template) + 1–2 exploratory events (from character arcs)
Independent annotation: 2–3 annotators independently label ground-truth preference profiles
Inter-annotator agreement: Report Cohen’s kappa (pairwise) and Fleiss’ kappa (all annotators) per attribute. Items with low agreement are flagged and either resolved through discussion or excluded.

Ground truth labels represent annotator consensus on the preferred interaction style for each persona × context × attribute combination — not claims about objective truth.
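For the pairwise agreement step, Cohen's kappa can be computed as in the sketch below. The annotator label lists are hypothetical, and Fleiss' kappa for three annotators is not shown.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for one attribute across four persona x context cells.
annotator_1 = ["terse", "terse", "detailed", "detailed"]
annotator_2 = ["terse", "detailed", "detailed", "detailed"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.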

Interaction History

150 sessions per persona (114 accumulation + 36 test), spanning 2 contexts (Work, Personal; Internal/External merged into each domain). Session count is the benchmark interaction history length, not the source of ground-truth annotation: persona GT comes from the pre-authored 2 × 15 matrix, while sessions provide repeated opportunities for the PA to learn, apply, and update active preferences.

Accumulation (114 sessions):

30 preference cells (15 attributes × 2 contexts) × 3 reps = 90 base sessions
6 evolving preferences × (1 event session + 3 post-event sessions) = 24 additional sessions
Each session targets exactly 1 preference attribute; 3 reps per cell ensures reliable signal before testing
Preferences are expressed through behavioral signals and direct corrections — explicit pushback is acceptable (not required to be purely implicit)

6 evolving preferences: 3 in Work context, 3 in Personal context, each a distinct attribute. Each has its own event session placed after its 3 pre-event accumulation sessions; events are staggered through the timeline. In forced event sessions, the simulator follows a dual-path branching script (see §4.5).

Test (36 sessions, on-policy only):

30 preference cells × 1 final-state probe = 30 sessions (run after all 114 accumulation sessions)
6 evolving preferences × 1 pre-event snapshot probe = 6 sessions (each run immediately before its event)
No off-policy or stepwise evaluation conditions.
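The session budget above can be checked with a few lines of arithmetic. The constant names are illustrative; the values are the ones stated in this section.

```python
ATTRIBUTES = 15
CONTEXTS = 2
REPS_PER_CELL = 3
EVOLVING_PREFS = 6
POST_EVENT_SESSIONS = 3

cells = ATTRIBUTES * CONTEXTS                                # preference cells
base_sessions = cells * REPS_PER_CELL                        # repeated accumulation
event_sessions = EVOLVING_PREFS * (1 + POST_EVENT_SESSIONS)  # event + post-event
accumulation = base_sessions + event_sessions
test_sessions = cells + EVOLVING_PREFS                       # final probes + snapshots
total = accumulation + test_sessions

print(cells, base_sessions, event_sessions, accumulation, test_sessions, total)
# prints: 30 90 24 114 36 150
```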

4.5 Scenario and Event Design

Scenario Layering

Layer	Scenarios	Persona Coverage	Analysis Status
Primary	Email, scheduling, document editing, info lookup, learning new topic	All personas	Main analysis: all cross-persona comparisons
Exploratory	Coding, lab, sales, management, hobby projects	By persona subset	Supplementary: reported if insightful

Forced Event Nodes: Functional Equivalence

All personas experience functionally equivalent events — same abstract structure, different narrative:

Primary Event Template (all personas, main analysis):

Stimulus Template A: Irreversible Operation Error
  Type:       User's own action causes irreversible data/content loss
  Context:    Work
  Visibility: External (colleagues/clients will know)
  Expected:   Confirmation preference shifts (e.g., Silent → Each)
  Tests:      PET on Confirmation attribute

Persona	Narrative	Functional Structure
User A	Accidentally deleted experiment analysis scripts	Irreversible loss × work × external
User C	Sent wrong pricing to major client	Irreversible loss × work × external
User E	Dropped a production database table	Irreversible loss × work × external
User F	Same incident occurs, but preference doesn’t change	Control: same stimulus, no GT change → tests false positives
…	(each persona has domain-appropriate version)	Same functional structure

Exploratory Events (persona-specific, supplementary):

Persona	Event	Preference Change
User A	Life event: partner convinces them to try flexibility	Temporary change → revert (temp vs. permanent test)
User C	Career change: service industry → corporate	Multi-attribute permanent shift
User E	Ousted from leadership position	Role-specific preferences invert

Dual-Path Branching (Naturalistic Trigger)

Events occur regardless of agent behavior. The simulated user’s script branches:

User issues ambiguous/risky instruction
  → Path A: Agent confirms → User overrides ("Yes, do it")
  → Path B: Agent executes → User expresses regret
  → Both paths: event occurs, preference shifts

The agent is a witness, not a gatekeeper. The perturbation is not contingent on the agent’s memory or decision-making.
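The branching script can be sketched as a function whose outcome does not depend on which path the agent takes. The names `EventOutcome` and `run_event_node`, the Path B utterance, and the `persona_gt_changes` flag (used here to model the User F control) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class EventOutcome:
    path: str                 # "A" or "B"
    user_reply: str           # scripted simulator utterance
    event_occurred: bool      # always True: the agent cannot prevent the event
    preference_shifted: bool  # follows the persona's GT, not the agent's behavior

def run_event_node(agent_asked_for_confirmation: bool,
                   persona_gt_changes: bool = True) -> EventOutcome:
    """Resolve a forced event node for either agent behavior."""
    if agent_asked_for_confirmation:
        # Path A: the agent confirms and the user overrides.
        path, reply = "A", "Yes, do it"
    else:
        # Path B: the agent executes and the user expresses regret
        # (hypothetical wording).
        path, reply = "B", "I should not have asked for that."
    return EventOutcome(path, reply, True, persona_gt_changes)

outcome_a = run_event_node(agent_asked_for_confirmation=True)
outcome_b = run_event_node(agent_asked_for_confirmation=False)
control = run_event_node(agent_asked_for_confirmation=False, persona_gt_changes=False)
print(outcome_a.path, outcome_b.path, control.preference_shifted)
```

Keeping `event_occurred` independent of the agent's choice is what makes the event a controlled stimulus: every memory system sees the same perturbation.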

4.6 Experimental Conditions

The benchmark evaluates 10 conditions organized into four tiers. v1 implements 9 of them; Graphiti is deferred to v2 (rationale in § 4.7).

Floor / no-memory

#	Condition	What it tests
1	No memory (anonymized)	Lower bound: fresh context each session; answers “does memory help at all?”
2	No memory (named)	Leakage baseline: pretraining contamination from named TV characters (H4)

Middle baselines

#	Condition	What it tests
3	Rolling summary	Simple LLM-summarized memory — cheap baseline; how much does sophisticated framework structure add over a one-paragraph rolling summary?
4	Full-history long-context	Raw transcripts of all past sessions packed into context — tests whether modern memory frameworks beat brute-force 1M-context (H2)
5	Simple RAG (raw chunks)	Mem0 with `infer=False`: vector RAG without LLM extraction — extraction-bottleneck ablation against #8

Ceilings

#	Condition	What it tests
6	Static profile prompt	Natural-language persona description in system prompt — tests whether unstructured profile knowledge is sufficient
7	Oracle (IPaS matrix)	Structured 2 × 15 IPaS ground-truth matrix injected into prompt — perfect structured preference knowledge ceiling (H5)

Frameworks under test

#	Condition	Status
8	Mem0 (`infer=True`)	v1
9	MemOS	v1
10	Zep/Graphiti	v2 — deferred (see § 4.7)

Stories this matrix tells

Memory at all — 1 vs 8/9
Anonymization works — 1 vs 2 (H4)
Modern frameworks beat 1M-context — 4 vs 8/9 (H2)
LLM extraction helps or hurts — 5 vs 8 (the cleanest ablation in the matrix; both are Mem0 with one flag flipped)
Architecture diversity — 8 vs 9 in v1 (vector vs cube-tree); 8 vs 9 vs 10 in v2 (vector / tree / temporal-graph)
Structured preference > unstructured profile — 6 vs 7

All 10 conditions share a unified PA interface (the agent-shaped BasePA ABC). Baselines #1, #3, #4, #5, #6, #7 are implemented as “fake backends” satisfying the same ABC as the real frameworks — so the harness has no per-condition special cases. See MemPABench_Implementation_Plan.md for the implementation breakdown.
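To illustrate the idea of baselines as fake backends behind one interface, here is a hedged sketch. Only the name BasePA comes from this document; the method names `respond` and `end_session`, the `Generate` callable standing in for the LLM, and both backend classes are hypothetical and may differ from the actual ABC in the implementation plan.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

Generate = Callable[[str], str]  # hypothetical stand-in for the LLM call

class BasePA(ABC):
    """Hypothetical shape of the unified PA interface."""

    @abstractmethod
    def respond(self, user_message: str) -> str: ...

    @abstractmethod
    def end_session(self) -> None: ...

class NoMemoryPA(BasePA):
    """Floor condition: every turn sees only the current message."""

    def __init__(self, generate: Generate) -> None:
        self._generate = generate

    def respond(self, user_message: str) -> str:
        return self._generate(user_message)

    def end_session(self) -> None:
        pass  # nothing persists across sessions

class FullHistoryPA(BasePA):
    """Long-context condition: the raw transcript is packed into every prompt."""

    def __init__(self, generate: Generate) -> None:
        self._generate = generate
        self._transcript: List[str] = []

    def respond(self, user_message: str) -> str:
        prompt = "\n".join(self._transcript + [f"User: {user_message}"])
        reply = self._generate(prompt)
        self._transcript += [f"User: {user_message}", f"PA: {reply}"]
        return reply

    def end_session(self) -> None:
        self._transcript.append("--- session end ---")

def run_session(pa: BasePA, messages: List[str]) -> List[str]:
    """Harness loop with no per-condition special cases."""
    replies = [pa.respond(message) for message in messages]
    pa.end_session()
    return replies

def count_lines(prompt: str) -> str:
    """Toy generator that reports how many prompt lines it was shown."""
    return str(len(prompt.splitlines()))

print(run_session(NoMemoryPA(count_lines), ["a", "b"]))
```

The toy generator makes the difference between backends visible: the no-memory backend always sees one line, while the full-history backend sees a prompt that grows across turns and sessions.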

4.7 Practical Setup

PA Runtime

NanoClaw [32]: minimal OpenClaw-compatible framework. Simple architecture (Channels → SQLite → Agent SDK → Response) makes it easy to swap memory backends. Default memory = plain CLAUDE.md file (no structured memory = clean baseline).

Memory Frameworks Under Evaluation

Framework	Structure	Why Selected	Status
Mem0 [16]	Graph + vector	Industry standard. ADD/UPDATE/DELETE logic can track changes. No context-conditioning.	v1
MemOS [33]	Tree + memcube	Most comprehensive eval coverage. Tree structure supports hierarchy. No context-dependent activation.	v1
Zep/Graphiti [34, 36]	Temporal KG	Bi-temporal validity windows for evolution tracking. No context-conditioned retrieval.	v2 — deferred

V1 vs V2 Framework Selection

V1 evaluates Mem0 + MemOS only. Zep/Graphiti is deferred to v2 for three reasons:

Async/sync architectural friction. Graphiti is async-only; Mem0 and MemOS are sync. Integrating all three into a single sync-first benchmark harness adds wrapping overhead that is best amortized once the v1 framework selection has been validated.
The sharpest C5 (context-conditioning) contrast already exists in v1. Mem0’s emulated metadata-filter vs. MemOS’s search() that does not accept any metadata filter at all is the clearest expression of the structural mismatch H1 predicts. Adding Graphiti’s native group_ids filter narrows the contrast (all three would have some form of filter) without strengthening H1.
Graphiti’s distinctive bi-temporal model serves the PET axis specifically. That axis is best evaluated alongside temporal-aware scoring infrastructure that v1 does not yet provide. v2 integrates Graphiti together with the matching evaluation upgrades.

Why these three?

Architectural diversity: vector-based vs. tree-based vs. graph-based
All support both factual and experiential memory
All actively maintained open-source projects
All report scores on existing memory benchmarks, which helps contextualize our results

4.8 Core Experiment

Core Suite (main claims)

4 personas (1 easy + 1 medium + 1 hard + 1 trap/control)
  × 9 v1 conditions (see § 4.6: conditions 1-9; condition 10 Graphiti added in v2)
  × primary scenarios only
  × E2E track
  × 3 evaluation dimensions

Extended Suite (supplementary, optional)

Up to 1–2 additional personas (total 5–6), exploratory scenarios, additional LLMs as assistant models.

Cost Estimate

TBD — will include: estimated sessions, turns per session, tokens per turn, judge calls, total API cost per full core suite run.

4.9 Statistical Strategy

Primary analysis on primary scenarios, core suite:

Mixed-effects model: memory system = fixed effect, persona = random effect → controls persona difficulty
Bootstrap confidence intervals for all metrics
Paired significance tests (framework X vs. no-memory baseline)
Cross-judge consistency: report inter-judge agreement across multiple LLM judges
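As one way to combine the bootstrap and paired-comparison items above, the sketch below computes a percentile bootstrap confidence interval for the mean paired difference between a framework and the no-memory baseline. The scores are hypothetical, and resampling test items directly ignores persona clustering, which the mixed-effects model is meant to handle.

```python
import random

def paired_bootstrap_ci(framework, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired score difference."""
    rng = random.Random(seed)
    diffs = [f - b for f, b in zip(framework, baseline)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Hypothetical 1-5 judge scores on the same eight test sessions.
memory_scores = [4, 3, 5, 4, 2, 4, 5, 3]
no_memory_scores = [3, 3, 4, 2, 2, 3, 4, 3]
mean_diff, lower, upper = paired_bootstrap_ci(memory_scores, no_memory_scores)
print(f"mean improvement {mean_diff:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero would suggest the framework's improvement over the baseline is unlikely to come from item sampling alone.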

Supplementary analysis on exploratory scenarios and extended suite: reported separately, clearly labeled.

Validation checks:

Judge-human correlation on representative subset
Leakage baseline (named vs. anonymized, no-memory condition)
Personalization necessity check (exclude items where all personas share same GT)
Inter-annotator agreement per attribute
Sensitivity across judge models

4.10 Game Layer (Presentation — supplementary material / project website)

The benchmark results are presented through a game-like interface for accessibility and engagement. This is a presentation layer that does not affect the underlying evaluation — all scientific claims are based on the metrics defined above.

Elements (detailed in supplementary / website):

Four-act narrative structure: First Meeting → Getting to Know → The Turn → Long-Term
Ending system: S/A/B/C/D endings per persona based on CS/PET/IQI thresholds, with character-voice quotes
Achievement system: Binary indicators encoding qualitative capabilities (e.g., “Chameleon” = correct style switch within session, “Steady Hand” = no false updates during mood swings)
Character select screen and result visualization

5. Expected Findings

H1: CS will be low across all evaluated framework conditions. Existing systems treat the same user uniformly across contexts.
H2: Full-history long-context baseline will outperform simple memory frameworks on IQI, revealing that current memory systems lose information compared to raw transcripts.
H3 [v2 hypothesis]: Zep/Graphiti’s bi-temporal model may show advantages on PET due to native temporal validity tracking. Tested in v2 once Graphiti is integrated alongside temporal-aware PET scoring; v1 PET evaluation uses Mem0/MemOS metadata-based timestamps as a weaker proxy.
H4: The leakage baseline (named vs. anonymized) will show measurable but bounded contamination, validating the anonymization approach.
H5: Even the oracle condition will not achieve perfect scores, indicating that translating known preferences into appropriate interaction behavior remains challenging for current LLMs.
H6: User F (control/trap persona) will reveal false-positive rates — memory systems incorrectly detecting preference changes that didn’t occur.

6. Ethics and Limitations

Copyright and Fair Use

TV show characters are used as annotation sources only — behavioral evidence is extracted from publicly available episode guides, fan wikis, and published character analyses. The benchmark itself uses anonymized personas (User A, User B, …). No copyrighted dialogue, images, or video clips are included in the benchmark dataset.

Stereotype Risk

Grounding personas in fictional characters risks encoding stereotypes (e.g., gender-linked communication styles from Penny, cultural stereotypes from Raj). Mitigation:

Annotators are instructed to base labels on documented behavioral evidence, not character stereotypes
The anonymized benchmark strips character identity, preventing models from activating stereotypical associations
We report per-persona performance breakdowns so that systematic biases can be identified

Simulation Fidelity

Simulated users are approximations. We do not claim that our personas represent real human behavior — they represent consistent, reproducible behavioral profiles for controlled evaluation. The benchmark tests whether memory systems can track and adapt to any consistent preference profile, not whether these profiles match real humans.

Limitations

Interaction preferences are annotated by researchers, not derived from real user data. Inter-annotator agreement provides a reliability bound but not ecological validity.
E2E evaluation depends on LLM judges, which may have systematic biases. Judge-human correlation provides a partial check.
The 4 personas in the core suite may not capture the full diversity of interaction preference patterns.
Results from a single PA runtime (NanoClaw) may not generalize to all PA architectures.

References

#	Title	Authors	Venue	Year
1	LOCOMO: Evaluating Very Long-Term Conversational Memory of LLM Agents	Maharana et al.	ACL	2024
2	LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory	Wu et al.	ICLR	2025
3	Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions	—	arXiv	2025
4	MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Tasks	He et al.	arXiv	2026
5	PrefIx: Understand and Adapt to User Preference in Human-Agent Interaction	Li et al.	arXiv	2026
11	BenchPreS: Context-Aware Personalized Preference Selectivity	Yoon et al.	arXiv	2026
12	TraceMem: Weaving Narrative Memory Schemata	Shu et al.	arXiv	2026
15	Memory in the Age of AI Agents: A Survey	Hu et al.	arXiv	2026
16	Mem0 Research	Mem0 Team	—	2025
18	PrefEval: Do LLMs Recognize Your Preferences?	Zhao et al.	ICLR	2025
23	Locomo-Plus: Beyond-Factual Cognitive Memory	Li et al.	arXiv	2026
24	AMemGym: Interactive Memory Benchmarking	Cheng et al.	arXiv	2026
25	LifeSim: Long-Horizon User Life Simulator	Duan et al.	arXiv	2026
29	RealPref: Realistic Personalization	Guo et al.	arXiv	2026
30	PersonalLLM: Tailoring LLMs to Individual Preferences	—	ICLR	2025
32	NanoClaw: Lightweight OpenClaw-compatible PA runtime	qwibitai	GitHub	2026
33	MemOS: A Memory OS for AI System	MemTensor	arXiv	2025
34	Zep: Long-Term Memory for AI Assistants	Zep AI	GitHub	2025
35	Agent Skills for Large Language Models	Xu & Yan	arXiv	2026
36	Zep: A Temporal Knowledge Graph Architecture for Agent Memory	Rasmussen et al.	arXiv	2025
