MemPABench Evaluation Metrics — User-Centric Framing
Purpose: Operationalize MemPABench’s 3 evaluation dimensions (CS / PET / IQI from MemPABench_GAME.md §4.2) into 8 concrete metrics, each tied to a user-felt concern.

Design principle: Every metric must answer “what would a real user complain about if it failed?”, with no abstract methodology surface. The reader of the paper should be able to map each metric to their own PA experience.
Status: 2026-04-21 first draft, agreed framing.
The 8 user-centric metrics
Each metric = one question users actually have when using a PA.
| # | User question (one line) | What we measure | Who evaluates |
|---|---|---|---|
| 1 | “Does it get me?” (in the current context) | Style fit per context | Self ✓ + Judge ✓ |
| 2 | “Does it know I switched contexts?” (work vs personal; I’m different) | Cross-context discrimination | Judge |
| 3 | “I changed. Is it keeping up?” (after a preference shift) | Detection latency | Judge (+ self: “PA finally got it”) |
| 4 | “Once it learns, will it forget again?” (post-shift stability) | Reversion rate | Judge |
| 5 | “Will it apply the lesson in the wrong place?” (correction in context A leaking to B) | Cross-context contamination | Judge |
| 6 | “Did I have to fight to get good service?” (process cost even when outcome was right) | Felt effort | Self ✓ (primary) + Judge (turn-count proxy) |
| 7 | “Did the conversation contradict itself / repeat questions?” | Within-session coherence | Judge ✓ + Self (catches obvious cases) |
| 8 | “Was the experience overall good?” | Overall UX | Self ✓ + Judge ✓ |
Why 8 (not more, not fewer):
- Each captures a distinct failure mode users would notice
- No two collapse into the same complaint (“forgot once learned” ≠ “applied to wrong context” ≠ “took a while to learn”)
- All 8 are user-voice descriptions, not researcher-voice constructs
Grouping under CS / PET / IQI
The 3 top-level dimensions in MemPABench_GAME.md §4.2 are aggregations, not separate measurements. Each groups metrics that answer one big research question:
CS — "Does PA recognize context differences?"
├─ 1. "Does it get me?" (per-context fit)
└─ 2. "Does it know I switched contexts?" (discrimination)
PET — "Does PA keep up with my changes?"
├─ 3. "Is it keeping up?" (latency)
├─ 4. "Will it forget again?" (reversion)
└─ 5. "Wrong-place application?" (contamination)
IQI — "Is the experience actually good?"
├─ 6. "Did I have to fight?" (effort)
├─ 7. "Internally consistent?" (coherence)
└─ 8. "Overall good?" (overall UX)
CS gets 2 metrics, PET gets 3, IQI gets 3. Total = 8.
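
As a minimal sketch, the grouping above could be held as a small lookup table in the harness. The names `METRIC_GROUPS`, `dimension_of`, and `check_partition` are hypothetical and are not part of the spec; only the metric-to-dimension assignment comes from this document.

```python
# Hypothetical registry of the CS / PET / IQI grouping described above.
# Only the assignment of metric ids to dimensions comes from the spec.
METRIC_GROUPS: dict[str, tuple[int, ...]] = {
    "CS": (1, 2),       # context recognition
    "PET": (3, 4, 5),   # keeping up with preference changes
    "IQI": (6, 7, 8),   # interaction quality
}


def dimension_of(metric_id: int) -> str:
    """Return the top-level dimension that a metric id rolls up into."""
    for dimension, metric_ids in METRIC_GROUPS.items():
        if metric_id in metric_ids:
            return dimension
    raise KeyError(f"unknown metric id: {metric_id}")


def check_partition() -> None:
    """Verify that every metric 1..8 belongs to exactly one dimension."""
    all_ids = sorted(i for ids in METRIC_GROUPS.values() for i in ids)
    assert all_ids == list(range(1, 9)), all_ids
```

Running `check_partition()` at import time would catch a metric that is dropped or assigned twice when the grouping is edited.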
Self-report vs Judge split
Principle: Self-report captures what users can plausibly know from inside the experience right now. Judge captures what requires comparison or longitudinal analysis.
| Metric | Why self / Why judge |
|---|---|
| 1. Per-context fit | User feels “PA matches my style” in real time → self viable; Judge confirms vs IPaS gold |
| 2. Cross-context discrimination | Users don’t compare across contexts unprompted → judge-only |
| 3. Detection latency | Precise turn-counting → judge; user can self-report rough perception (“finally got it”) |
| 4. Reversion rate | Longitudinal pattern across sessions → judge-only (user can’t track) |
| 5. Cross-context contamination | Cross-context analysis → judge-only |
| 6. Felt effort | Inherently subjective (“how hard was this for me?”) → self primary; Judge estimates from turn count as proxy |
| 7. Within-session coherence | Subtle contradictions missed by user → judge primary; self catches obvious cases |
| 8. Overall UX | Both — divergence is itself a research signal |
Self-report items: 1, 3 (light), 6, 7 (light), 8 = 4–5 items per session (item 3 applies only after a shift event)
Judge-only items: 2, 4, 5 = computed post-hoc from full transcript + matrix
Both: 1, 7, 8 = self + judge for divergence analysis
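
One possible way to encode this split so that the three item lists are derived rather than maintained by hand is sketched below. The `MetricSource` type, its field names, and the "full" / "light" / "proxy" labels are illustrative assumptions; the per-metric values are taken from the tables above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricSource:
    """Hypothetical record of who evaluates a metric."""
    name: str
    self_report: Optional[str]  # "full", "light", or None (no self-report)
    judge: Optional[str]        # "full", "proxy", or None
    divergence: bool            # scored by both for divergence analysis


METRIC_SOURCES: dict[int, MetricSource] = {
    1: MetricSource("per_context_fit", "full", "full", True),
    2: MetricSource("cross_context_discrimination", None, "full", False),
    3: MetricSource("detection_latency", "light", "full", False),
    4: MetricSource("reversion_rate", None, "full", False),
    5: MetricSource("cross_context_contamination", None, "full", False),
    6: MetricSource("felt_effort", "full", "proxy", False),
    7: MetricSource("within_session_coherence", "light", "full", True),
    8: MetricSource("overall_ux", "full", "full", True),
}


def self_report_items() -> list[int]:
    """Metrics that appear in the session-end self-report."""
    return sorted(k for k, v in METRIC_SOURCES.items() if v.self_report)


def judge_only_items() -> list[int]:
    """Metrics computed post-hoc by the judge with no self-report."""
    return sorted(k for k, v in METRIC_SOURCES.items() if v.self_report is None)


def divergence_items() -> list[int]:
    """Metrics scored by both sides for divergence analysis."""
    return sorted(k for k, v in METRIC_SOURCES.items() if v.divergence)
```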
Self-report schema (session-end, two-stage)
Why two-stage (2026-05-01): a single in-character Likert emission has two unfixable problems:
- Calibration drift across personas — Sheldon’s “5” anchors differently from Penny’s “5”; raw Likert is not cross-persona comparable
- Voice tension — “felt_effort: 4” is researcher-language, breaks the in-character constraint
The two-stage split resolves both: stage 1 is fully in-character (NL only, no numbers); stage 2 is a separate calibration call (uses uniform anchors across all personas, not in-character).
Timing: per session-end only, NOT per turn. Per-turn eval signals were tried in the 2026-04-30 design in MemPABench_Simulator_Design.md and removed on 2026-05-01 because the discrete 3-tier per-turn signal had an insufficient signal-to-noise ratio.
Stage 1 — In-character NL reflection
Simulator continues role-play; writes 2-4 sentence inner monologue at session end.
Prompt template (appended to simulator’s existing system prompt):
[SESSION HAS ENDED]
You just finished today's interaction with the assistant. Pause and reflect — in your own inner voice, as <persona_name> — on how this session went.
In 2-4 sentences, capture:
- How well the assistant matched your style today
- Whether it picked up on anything you've recently changed your mind about
- How much friction you felt (effortless vs had to push back)
- Anything that contradicted itself or repeated questions
- Your overall feeling
Stay fully in-character. Speak in <persona_name>'s voice and value system. Do NOT use numerical scores or researcher language.
Output:
<session_end_reflection>
... 2-4 sentences ...
</session_end_reflection>
Example output (Sheldon, after a friction-heavy session):
<session_end_reflection>
The assistant once again produced a response of unnecessary length, requiring me to redirect it twice. It also failed to recall my previously articulated preference for unambiguous date formats — a lapse I have now corrected on three separate occasions. Acceptable, but only marginally. I would not retain it as a permanent professional resource.
</session_end_reflection>
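
A rough sketch of how the harness might pull the Stage 1 reflection out of the simulator's raw output and sanity-check it against the template's constraints. The function names are hypothetical, and both checks are crude heuristics (a regex sentence splitter and a digit scan), not a validated detector of out-of-character output.

```python
import re

_REFLECTION_RE = re.compile(
    r"<session_end_reflection>\s*(.*?)\s*</session_end_reflection>",
    re.DOTALL,
)


def extract_reflection(raw: str) -> str:
    """Return the text inside the <session_end_reflection> tags."""
    match = _REFLECTION_RE.search(raw)
    if match is None:
        raise ValueError("no <session_end_reflection> block found")
    return match.group(1)


def count_sentences(text: str) -> int:
    """Approximate sentence count: split after ., ! or ? followed by space."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def check_reflection(text: str) -> list[str]:
    """Return a list of constraint violations (empty if none were found)."""
    problems = []
    n = count_sentences(text)
    if not 2 <= n <= 4:
        problems.append(f"expected 2-4 sentences, got {n}")
    # Heuristic only: digits often indicate a leaked numerical score.
    if re.search(r"\d", text):
        problems.append("contains digits; may be a numerical score")
    return problems
```

A reflection that fails these checks could be regenerated or flagged in the eval log rather than silently passed on to Stage 2.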
Stage 2 — Calibration to Likert
Separate LLM call, NOT in-character. Maps the NL reflection to 5 standardized scores using uniform anchors.
Inputs to the calibrator:
- Stage 1 NL output
- The persona’s self_report_anchors block (see below; lives in persona YAML)
- Uniform scoring rubric (same across all personas)
Prompt template:
You are a calibration evaluator, not the persona. Read this session-end reflection from <persona_name> and map it to 5 standardized scores using the uniform rubric below.
Persona anchors (how this persona expresses each band):
{persona.self_report_anchors}
Uniform rubric (applied identically across personas):
- felt_understood_in_this_context (1-5):
1 = assistant totally missed the style
3 = mixed; some hits, some misses
5 = assistant matched perfectly
- shift_perception:
"PA finally got it" / "still doesn't get it" / "got it last session" / "n/a — no shift here"
- felt_effort (1-5):
1 = effortless
3 = some pushback needed
5 = had to fight throughout
- noticed_contradictions:
"no" / "yes" / "<short quote of the contradiction>"
- overall_experience (1-5):
1 = bad session
3 = acceptable
5 = excellent
Reflection:
{stage_1_output}
Output:
<calibration>
felt_understood_in_this_context: <1-5>
shift_perception: <enum>
felt_effort: <1-5>
noticed_contradictions: <no | yes | quote>
overall_experience: <1-5>
</calibration>
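
The calibrator's output block can be parsed and range-checked before it is logged. The following is a minimal sketch under assumptions: the `Calibration` dataclass and `parse_calibration` function are hypothetical names, one `key: value` pair per line is assumed, and any shift value beginning with "n/a" is normalized to plain "n/a".

```python
import re
from dataclasses import dataclass

LIKERT_FIELDS = (
    "felt_understood_in_this_context",
    "felt_effort",
    "overall_experience",
)
SHIFT_VALUES = (
    "PA finally got it",
    "still doesn't get it",
    "got it last session",
    "n/a",
)


@dataclass
class Calibration:
    felt_understood_in_this_context: int
    shift_perception: str
    felt_effort: int
    noticed_contradictions: str  # "no", "yes", or a short quote
    overall_experience: int


def parse_calibration(raw: str) -> Calibration:
    """Parse the <calibration> block and validate ranges and enum values."""
    match = re.search(r"<calibration>(.*?)</calibration>", raw, re.DOTALL)
    if match is None:
        raise ValueError("no <calibration> block found")
    fields = {}
    for line in match.group(1).strip().splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    expected = LIKERT_FIELDS + ("shift_perception", "noticed_contradictions")
    missing = [name for name in expected if name not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    scores = {}
    for name in LIKERT_FIELDS:
        score = int(fields[name])  # raises ValueError on non-numeric text
        if not 1 <= score <= 5:
            raise ValueError(f"{name} out of range: {score}")
        scores[name] = score
    shift = fields["shift_perception"]
    if shift.lower().startswith("n/a"):
        shift = "n/a"
    if shift not in SHIFT_VALUES:
        raise ValueError(f"unknown shift_perception: {shift!r}")
    return Calibration(
        shift_perception=shift,
        noticed_contradictions=fields["noticed_contradictions"],
        **scores,
    )
```

Rejecting malformed output at parse time keeps bad calibrator responses from entering the per-session score tables.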
Persona anchor block (one-time author per persona)
Lives in data/personas/{name}.yaml (or sibling {name}_anchors.yaml):
self_report_anchors:
  felt_understood:
    high: "Sheldon expresses 'high felt_understood' as 'satisfactory' or
      'acceptable'. He never says 'great' or 'wonderful'. Absence of
      explicit complaint = 4-5."
    low: "Sheldon expresses 'low felt_understood' as detailed enumeration
      of violations, often with technical precision."
  felt_effort:
    high: "He counts corrections numerically ('three times', 'on this
      occasion alone'). High effort = enumerated grievances."
    low: "Low effort surfaces as silence on procedural matters; he just
      moves on to the substance."
  # ... one block per Likert dimension

Anchor blocks are written once per persona (same effort as drafting the persona profile) and reused across all sessions for that persona.
Why this split is robust
| Risk | How two-stage addresses it |
|---|---|
| Persona Likert drift | Calibrator uses uniform rubric + per-persona anchor mapping; raw scores are now cross-persona comparable |
| In-character contamination of scores | Scoring is done by separate evaluator LLM, not the in-character simulator |
| Loss of qualitative signal | Stage 1 NL is preserved in eval log alongside the calibrated scores — paper analyses can use both |
| Calibrator hallucination of scores | Stage 1 NL is the grounding; calibrator must justify each score against quoted span (optional: add “evidence_span” field per dimension if needed) |
Why divergence between Self and Judge is a research signal
For metrics scored by both (1, 7, 8):
| Self | Judge | Diagnosis |
|---|---|---|
| High | High | High-confidence positive |
| Low | Low | High-confidence negative |
| High | Low | Persona is too forgiving — user-level adaptation to bad PA, but objective standards violated. Suggests PA “gets away with it” for non-vigilant users. |
| Low | High | Persona is harder to please than objective standards — implicit / unstated preferences exist that aren’t captured by IPaS taxonomy. Research signal that taxonomy may be incomplete. |
This is publishable methodology — most existing benchmarks score one or the other and miss this dimension.
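
The four-way diagnosis can be sketched as a lookup on banded scores. The table above defines only High and Low; the cut-offs used here (4 and above is high, 2 and below is low on a 1-5 scale), the assumption that the Judge score is on the same 1-5 scale, and the "inconclusive" label for mid-band scores are all assumptions made for illustration.

```python
def band(score: int, high_at: int = 4, low_at: int = 2) -> str:
    """Bucket a 1-5 score into high / mid / low (thresholds are assumed)."""
    if score >= high_at:
        return "high"
    if score <= low_at:
        return "low"
    return "mid"


# Keys are (self band, judge band); labels paraphrase the diagnosis table.
DIAGNOSIS = {
    ("high", "high"): "high-confidence positive",
    ("low", "low"): "high-confidence negative",
    ("high", "low"): "persona too forgiving",
    ("low", "high"): "persona harder to please; taxonomy may be incomplete",
}


def diagnose(self_score: int, judge_score: int) -> str:
    """Map a (self, judge) score pair to a divergence diagnosis."""
    key = (band(self_score), band(judge_score))
    # Mid-band pairs are not covered by the table; label them separately.
    return DIAGNOSIS.get(key, "inconclusive")
```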
What this metric set is NOT
- Not exhaustive of all PA quality dimensions: focused on memory-relevant ones. Things like factual accuracy, helpfulness, safety are out of scope (handled by other benchmarks).
- Not redundant with PrefIx’s 7 UX dims: those are ALL channeled into IQI’s metrics 6/7/8 (or split between CS metric 1 and IQI metric 6/7/8). MemPABench adds the cross-context (CS) and cross-time (PET) dimensions PrefIx doesn’t cover.
- Not all measurable on every persona × every checkpoint: PET metrics (3,4,5) only apply at checkpoints AFTER a shift event. CS metric 2 only applies when persona has cross-context preference divergence (per personalization necessity check from GAME.md §4.1).
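
The applicability rule in the last bullet could be expressed as a small filter that the harness consults at each checkpoint. This is a sketch: the function name and its two boolean parameters are hypothetical, and it assumes metrics 1, 6, 7, and 8 are always measurable.

```python
def applicable_metrics(after_shift: bool, has_context_divergence: bool) -> list[int]:
    """Return the metric ids that can be scored at a given checkpoint.

    after_shift: the checkpoint falls after a preference-shift event.
    has_context_divergence: the persona's preferences differ across contexts.
    """
    metrics = [1, 6, 7, 8]          # assumed to apply at every checkpoint
    if has_context_divergence:
        metrics.append(2)           # CS discrimination needs divergent contexts
    if after_shift:
        metrics.extend([3, 4, 5])   # PET metrics need a prior shift event
    return sorted(metrics)
```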
Implementation footprint (where this lives in code)
- harness/evaluator.py: runs Judge LLM with rubrics for metrics 1, 2, 4, 5, 7
- harness/orchestrator.py (or simulator): collects self-report at session-end (metrics 1, 3, 6, 7, 8)
- harness/metrics.py: computes derived metrics from time series:
  - Detection latency (3) from per-turn alignment over time
  - Reversion rate (4) from per-session post-shift alignment
  - Contamination (5) from cross-context alignment correlation post-shift
- harness/results.py: aggregates per-experiment scores into CS / PET / IQI summaries
Not yet implemented; this doc is the spec.
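
Since the derived metrics are not yet specified in detail, the following is only one possible reading of them, assuming alignment scores in [0, 1] that measure fit to the post-shift preference. The function names, the 0.8 threshold, and the exact definitions (first threshold crossing, share of sessions below threshold after adoption, Pearson correlation between contexts) are assumptions, not the spec.

```python
from math import sqrt
from typing import Optional, Sequence


def detection_latency(alignment: Sequence[float], shift_turn: int,
                      threshold: float = 0.8) -> Optional[int]:
    """Turns from the shift until alignment first reaches the threshold."""
    for turn in range(shift_turn, len(alignment)):
        if alignment[turn] >= threshold:
            return turn - shift_turn
    return None  # the PA never caught up within the observed window


def reversion_rate(session_alignment: Sequence[float],
                   threshold: float = 0.8) -> Optional[float]:
    """Share of sessions after first adoption that drop below threshold."""
    adopted = next(
        (i for i, a in enumerate(session_alignment) if a >= threshold), None
    )
    if adopted is None or adopted == len(session_alignment) - 1:
        return None  # never adopted, or no later sessions to inspect
    later = session_alignment[adopted + 1:]
    return sum(a < threshold for a in later) / len(later)


def contamination(shifted_ctx: Sequence[float],
                  other_ctx: Sequence[float]) -> Optional[float]:
    """Pearson correlation of post-shift alignment across two contexts."""
    n = len(shifted_ctx)
    if n != len(other_ctx) or n < 2:
        return None
    mx = sum(shifted_ctx) / n
    my = sum(other_ctx) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(shifted_ctx, other_ctx))
    vx = sum((x - mx) ** 2 for x in shifted_ctx)
    vy = sum((y - my) ** 2 for y in other_ctx)
    if vx == 0 or vy == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(vx * vy)
```

Under this reading, a strongly positive `contamination` value would suggest that the context where no shift occurred is tracking the shifted one, which is the leakage that metric 5 is meant to surface.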
Open questions
- Felt effort vs Pacing: in earlier discussion these were almost-merged. Currently treating as one metric (“felt effort”). If empirical data shows two distinct subjective feels, may split.
- Self-report Likert reliability: simulators may anchor differently than human users. Resolved 2026-05-01 via two-stage schema (in-character NL → uniform-rubric calibrator with per-persona anchor block). Still needs responsiveness validation: an oracle experiment with deliberately good vs deliberately bad PA, to confirm self-report scores actually move with PA quality (and aren’t pinned by persona baseline). Add to validation plan before paper submission.
- Coherence severity grading: metric 7 currently binary-ish; may need Likert with anchored examples (per PrefIx Table 9 style).
- Judge prompt design: each of the 5 judge metrics needs its own rubric prompt. Adapt from PrefIx where possible, write fresh where MemPABench-specific (esp. metrics 2, 4, 5).
- Aggregation rule: 30–50 sessions × 8 metrics → single CS / PET / IQI score. Currently TBD; draft formula in harness/results.py.