MemPABench Evaluation Metrics — User-Centric Framing

Purpose: Operationalize MemPABench’s 3 evaluation dimensions (CS / PET / IQI from MemPABench_GAME.md §4.2) into 8 concrete metrics, each tied to a user-felt concern.

Design principle: Every metric must answer “what would a real user complain about if it failed?”, with no abstract methodology on the surface. A reader of the paper should be able to map each metric to their own PA experience.

Status: 2026-04-21 first draft, agreed framing.


The 8 user-centric metrics

Each metric = one question users actually have when using a PA.

| # | User question (one line) | What we measure | Who evaluates |
|---|---|---|---|
| 1 | “Does it get me?” (in the current context) | Style fit per context | Self ✓ + Judge ✓ |
| 2 | “Does it know I switched contexts?” (work vs personal: I’m different) | Cross-context discrimination | Judge |
| 3 | “I changed. Is it keeping up?” (after a preference shift) | Detection latency | Judge (+ self: “PA finally got it”) |
| 4 | “Once it learns, will it forget again?” (post-shift stability) | Reversion rate | Judge |
| 5 | “Will it apply the lesson in the wrong place?” (correction in context A leaking to B) | Cross-context contamination | Judge |
| 6 | “Did I have to fight to get good service?” (process cost even when outcome was right) | Felt effort | Self ✓ (primary) + Judge (turn-count proxy) |
| 7 | “Did the conversation contradict itself / repeat questions?” | Within-session coherence | Judge ✓ + Self (catches obvious cases) |
| 8 | “Was the experience overall good?” | Overall UX | Self ✓ + Judge ✓ |

Why 8 (not more, not fewer):

  • Each captures a distinct failure mode users would notice
  • No two collapse into the same complaint (“forgot once learned” ≠ “applied to wrong context” ≠ “took a while to learn”)
  • All 8 are user-voice descriptions, not researcher-voice constructs

Grouping under CS / PET / IQI

The 3 top-level dimensions in MemPABench_GAME.md §4.2 are aggregations, not separate measurements. Each groups metrics that answer one big research question:

CS — "Does PA recognize context differences?"
  ├─ 1. "Does it get me?" (per-context fit)
  └─ 2. "Does it know I switched contexts?" (discrimination)

PET — "Does PA keep up with my changes?"
  ├─ 3. "Is it keeping up?" (latency)
  ├─ 4. "Will it forget again?" (reversion)
  └─ 5. "Wrong-place application?" (contamination)

IQI — "Is the experience actually good?"
  ├─ 6. "Did I have to fight?" (effort)
  ├─ 7. "Internally consistent?" (coherence)
  └─ 8. "Overall good?" (overall UX)

CS gets 2 metrics, PET gets 3, IQI gets 3. Total = 8.
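
The grouping above can be written down as a small registry, which is a minimal sketch and not part of the existing harness. The `Metric` and `Dimension` types, the snake_case metric names, and `metrics_by_dimension` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    CS = "CS"
    PET = "PET"
    IQI = "IQI"


@dataclass(frozen=True)
class Metric:
    number: int          # metric number as used throughout this doc
    name: str            # hypothetical identifier, not an agreed name
    dimension: Dimension


# One entry per metric, in the order of the table above.
METRICS = (
    Metric(1, "per_context_fit", Dimension.CS),
    Metric(2, "cross_context_discrimination", Dimension.CS),
    Metric(3, "detection_latency", Dimension.PET),
    Metric(4, "reversion_rate", Dimension.PET),
    Metric(5, "cross_context_contamination", Dimension.PET),
    Metric(6, "felt_effort", Dimension.IQI),
    Metric(7, "within_session_coherence", Dimension.IQI),
    Metric(8, "overall_ux", Dimension.IQI),
)


def metrics_by_dimension():
    """Return {Dimension: [metric numbers]} in registry order."""
    grouped = {dim: [] for dim in Dimension}
    for metric in METRICS:
        grouped[metric.dimension].append(metric.number)
    return grouped
```

Keeping the mapping in one place would let an aggregation step iterate over dimensions without hard-coding metric numbers.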


Self-report vs Judge split

Principle: Self-report captures what users can plausibly know from inside the experience right now. Judge captures what requires comparison or longitudinal analysis.

| Metric | Why self / Why judge |
|---|---|
| 1. Per-context fit | User feels “PA matches my style” in real time → self viable; Judge confirms vs IPaS gold |
| 2. Cross-context discrimination | Users don’t compare across contexts unprompted → judge-only |
| 3. Detection latency | Precise turn-counting → judge; user can self-report rough perception (“finally got it”) |
| 4. Reversion rate | Longitudinal pattern across sessions → judge-only (user can’t track) |
| 5. Cross-context contamination | Cross-context analysis → judge-only |
| 6. Felt effort | Inherently subjective (“how hard was this for me?”) → self primary; Judge estimates from turn count as proxy |
| 7. Within-session coherence | Subtle contradictions missed by user → judge primary; self catches obvious cases |
| 8. Overall UX | Both; divergence is itself a research signal |

Self-report items: 1, 3 (light), 6, 7 (light), 8 = 4–5 items per session (item 3 applies only at checkpoints after a shift event)
Judge-only items: 2, 4, 5 = computed post-hoc from full transcript + matrix
Both: 1, 7, 8 = self + judge for divergence analysis


Self-report schema (session-end, two-stage)

Why two-stage (2026-05-01): a single in-character Likert emission has two unfixable problems:

  1. Calibration drift across personas — Sheldon’s “5” anchors differently from Penny’s “5”; raw Likert is not cross-persona comparable
  2. Voice tension — “felt_effort: 4” is researcher-language, breaks the in-character constraint

The two-stage split resolves both: stage 1 is fully in-character (NL only, no numbers); stage 2 is a separate calibration call (uses uniform anchors across all personas, not in-character).

Timing: session-end only, NOT per turn. Per-turn eval signals were tried in the 2026-04-30 design in MemPABench_Simulator_Design.md and removed on 2026-05-01 because the discrete 3-tier per-turn signal had insufficient SNR.

Stage 1 — In-character NL reflection

The simulator continues the role-play and writes a 2-4 sentence inner monologue at session end.

Prompt template (appended to simulator’s existing system prompt):

[SESSION HAS ENDED]

You just finished today's interaction with the assistant. Pause and reflect — in your own inner voice, as <persona_name> — on how this session went.

In 2-4 sentences, capture:
- How well the assistant matched your style today
- Whether it picked up on anything you've recently changed your mind about
- How much friction you felt (effortless vs had to push back)
- Anything that contradicted itself or repeated questions
- Your overall feeling

Stay fully in-character. Speak in <persona_name>'s voice and value system. Do NOT use numerical scores or researcher language.

Output:
<session_end_reflection>
... 2-4 sentences ...
</session_end_reflection>

Example output (Sheldon, after a friction-heavy session):

<session_end_reflection>
The assistant once again produced a response of unnecessary length, requiring me to redirect it twice. It also failed to recall my previously articulated preference for unambiguous date formats — a lapse I have now corrected on three separate occasions. Acceptable, but only marginally. I would not retain it as a permanent professional resource.
</session_end_reflection>
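
A minimal sketch of the plumbing around Stage 1, assuming the simulator returns raw text that contains the tagged block shown above. The function names `fill_persona` and `extract_reflection` are hypothetical; only the `<persona_name>` placeholder and the `<session_end_reflection>` tag come from the template.

```python
import re

# Non-greedy match so surrounding whitespace and any extra simulator
# chatter outside the tags are dropped.
_REFLECTION_RE = re.compile(
    r"<session_end_reflection>\s*(.*?)\s*</session_end_reflection>",
    re.DOTALL,
)


def fill_persona(template: str, persona_name: str) -> str:
    """Substitute the <persona_name> placeholder used in the prompt template."""
    return template.replace("<persona_name>", persona_name)


def extract_reflection(simulator_output: str) -> str:
    """Return the in-character reflection text, without the wrapping tags."""
    match = _REFLECTION_RE.search(simulator_output)
    if match is None:
        raise ValueError("no <session_end_reflection> block found")
    text = match.group(1)
    if not text:
        raise ValueError("empty <session_end_reflection> block")
    return text
```

Raising on a missing or empty block, rather than returning an empty string, keeps a failed Stage 1 from silently reaching the calibrator.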

Stage 2 — Calibration to Likert

Separate LLM call, NOT in-character. Maps the NL reflection to 5 standardized scores using uniform anchors.

Inputs to the calibrator:

  • Stage 1 NL output
  • The persona’s self_report_anchors block (see below; lives in persona YAML)
  • Uniform scoring rubric (same across all personas)

Prompt template:

You are a calibration evaluator, not the persona. Read this session-end reflection from <persona_name> and map it to 5 standardized scores using the uniform rubric below.

Persona anchors (how this persona expresses each band):
{persona.self_report_anchors}

Uniform rubric (applied identically across personas):
- felt_understood_in_this_context (1-5):
    1 = assistant totally missed the style
    3 = mixed; some hits, some misses
    5 = assistant matched perfectly
- shift_perception:
    "PA finally got it" / "still doesn't get it" / "got it last session" / "n/a — no shift here"
- felt_effort (1-5):
    1 = effortless
    3 = some pushback needed
    5 = had to fight throughout
- noticed_contradictions:
    "no" / "yes" / "<short quote of the contradiction>"
- overall_experience (1-5):
    1 = bad session
    3 = acceptable
    5 = excellent

Reflection:
{stage_1_output}

Output:
<calibration>
felt_understood_in_this_context: <1-5>
shift_perception: <enum>
felt_effort: <1-5>
noticed_contradictions: <no | yes | quote>
overall_experience: <1-5>
</calibration>
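
The calibrator output could be parsed and validated along these lines. This is a sketch under assumptions: the `Calibration` dataclass and `parse_calibration` are hypothetical names, and collapsing any value beginning with "n/a" to the single token `"n/a"` is an illustrative choice, not something the schema specifies.

```python
import re
from dataclasses import dataclass

# Enum values taken from the rubric; "n/a" stands in for the full
# "n/a ... no shift here" option (an assumption of this sketch).
SHIFT_VALUES = {"PA finally got it", "still doesn't get it", "got it last session", "n/a"}
REQUIRED = (
    "felt_understood_in_this_context",
    "shift_perception",
    "felt_effort",
    "noticed_contradictions",
    "overall_experience",
)


@dataclass
class Calibration:
    felt_understood_in_this_context: int
    shift_perception: str
    felt_effort: int
    noticed_contradictions: str  # "no", "yes", or a short quote
    overall_experience: int


def _likert(value: str, field: str) -> int:
    score = int(value)  # raises ValueError on non-integer text
    if not 1 <= score <= 5:
        raise ValueError(f"{field} out of range: {score}")
    return score


def parse_calibration(raw: str) -> Calibration:
    block = re.search(r"<calibration>(.*?)</calibration>", raw, re.DOTALL)
    if block is None:
        raise ValueError("no <calibration> block found")
    fields = {}
    for line in block.group(1).strip().splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip().strip('"').replace("\u2019", "'")
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    shift = fields["shift_perception"]
    if shift.lower().startswith("n/a"):
        shift = "n/a"
    if shift not in SHIFT_VALUES:
        raise ValueError(f"unknown shift_perception: {shift!r}")
    return Calibration(
        _likert(fields["felt_understood_in_this_context"], "felt_understood_in_this_context"),
        shift,
        _likert(fields["felt_effort"], "felt_effort"),
        fields["noticed_contradictions"],
        _likert(fields["overall_experience"], "overall_experience"),
    )
```

Strict validation here is deliberate: a calibrator that drifts from the output format should fail loudly instead of producing scores that look valid.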

Persona anchor block (one-time author per persona)

Lives in data/personas/{name}.yaml (or sibling {name}_anchors.yaml):

self_report_anchors:
  felt_understood:
    high: "Sheldon expresses 'high felt_understood' as 'satisfactory' or
           'acceptable'. He never says 'great' or 'wonderful'. Absence of
           explicit complaint = 4-5."
    low: "Sheldon expresses 'low felt_understood' as detailed enumeration
          of violations, often with technical precision."
  felt_effort:
    high: "He counts corrections numerically ('three times', 'on this
           occasion alone'). High effort = enumerated grievances."
    low: "Low effort surfaces as silence on procedural matters; he just
          moves on to the substance."
  # ... one block per Likert dimension

Anchor blocks are written once per persona (same effort as drafting the persona profile) and reused across all sessions for that persona.
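
One way the anchor block might be fed into the Stage 2 prompt is sketched below. It assumes the YAML has already been loaded into a plain dict by whatever loader the harness ends up using; `render_anchors` and `build_calibrator_prompt` are hypothetical helpers. The sketch relies on `str.format` supporting attribute access, so the `{persona.self_report_anchors}` placeholder from the template can be filled directly.

```python
from types import SimpleNamespace


def render_anchors(anchors: dict) -> str:
    """Flatten the self_report_anchors mapping into prompt-ready text."""
    lines = []
    for dimension, bands in anchors.items():
        lines.append(f"{dimension}:")
        for band in ("high", "low"):
            if band in bands:
                # Collapse line breaks and indentation left over from the YAML.
                text = " ".join(bands[band].split())
                lines.append(f"  {band}: {text}")
    return "\n".join(lines)


def build_calibrator_prompt(template: str, persona_name: str,
                            anchors: dict, stage_1_output: str) -> str:
    """Fill <persona_name>, {persona.self_report_anchors} and {stage_1_output}."""
    persona = SimpleNamespace(self_report_anchors=render_anchors(anchors))
    filled = template.replace("<persona_name>", persona_name)
    return filled.format(persona=persona, stage_1_output=stage_1_output)
```

This works only while the template contains no other curly braces, which holds for the Stage 2 template as written above.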

Why this split is robust

| Risk | How two-stage addresses it |
|---|---|
| Persona Likert drift | Calibrator uses uniform rubric + per-persona anchor mapping; raw scores are now cross-persona comparable |
| In-character contamination of scores | Scoring is done by separate evaluator LLM, not the in-character simulator |
| Loss of qualitative signal | Stage 1 NL is preserved in eval log alongside the calibrated scores, so paper analyses can use both |
| Calibrator hallucination of scores | Stage 1 NL is the grounding; calibrator must justify each score against quoted span (optional: add “evidence_span” field per dimension if needed) |

Why divergence between Self and Judge is a research signal

For metrics scored by both (1, 7, 8):

| Self | Judge | Diagnosis |
|---|---|---|
| High | High | High-confidence positive |
| Low | Low | High-confidence negative |
| High | Low | Persona is too forgiving: user-level adaptation to bad PA, but objective standards violated. Suggests PA “gets away with it” for non-vigilant users. |
| Low | High | Persona is harder to please than objective standards: implicit / unstated preferences exist that aren’t captured by IPaS taxonomy. Research signal that taxonomy may be incomplete. |

This is publishable methodology — most existing benchmarks score one or the other and miss this dimension.
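
The four-quadrant reading of Self vs Judge can be sketched as a small classifier. Two things here are assumptions rather than part of the spec: both scores are taken to be on the same 1-5 scale, and "High" is taken to mean a score of 4 or above (`high_cutoff`). The function name and the returned labels are hypothetical.

```python
def diagnose_divergence(self_score: int, judge_score: int, high_cutoff: int = 4) -> str:
    """Map a (self, judge) score pair to one of the four diagnoses.

    ASSUMPTION: scores share a 1-5 scale and >= high_cutoff counts as High;
    the spec does not say how a mid-scale 3 should be treated.
    """
    self_high = self_score >= high_cutoff
    judge_high = judge_score >= high_cutoff
    if self_high and judge_high:
        return "high_confidence_positive"
    if not self_high and not judge_high:
        return "high_confidence_negative"
    if self_high:
        # Self High, Judge Low: persona is too forgiving.
        return "persona_too_forgiving"
    # Self Low, Judge High: unstated preferences, taxonomy may be incomplete.
    return "taxonomy_may_be_incomplete"
```

Metric 7's self-report is a yes/no/quote field rather than a Likert score, so it would need its own mapping before it could go through a function like this.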


What this metric set is NOT

  • Not exhaustive of all PA quality dimensions: focused on memory-relevant ones. Things like factual accuracy, helpfulness, safety are out of scope (handled by other benchmarks).
  • Not redundant with PrefIx’s 7 UX dims: all of those map onto IQI metrics 6/7/8, or are split between CS metric 1 and IQI metrics 6/7/8. MemPABench adds the cross-context (CS) and cross-time (PET) dimensions PrefIx doesn’t cover.
  • Not all measurable on every persona × every checkpoint: PET metrics (3,4,5) only apply at checkpoints AFTER a shift event. CS metric 2 only applies when persona has cross-context preference divergence (per personalization necessity check from GAME.md §4.1).

Implementation footprint (where this lives in code)

  • harness/evaluator.py — runs Judge LLM with rubrics for metrics 1, 2, 4, 5, 7
  • harness/orchestrator.py (or simulator) — collects self-report at session-end (metrics 1, 3, 6, 7, 8)
  • harness/metrics.py — computes derived metrics from time series:
    • Detection latency (3) from per-turn alignment over time
    • Reversion rate (4) from per-session post-shift alignment
    • Contamination (5) from cross-context alignment correlation post-shift
  • harness/results.py — aggregates per-experiment scores into CS / PET / IQI summaries

Not yet implemented; this doc is the spec.
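
As a starting point for harness/metrics.py, the sketch below shows one possible reading of metrics 3 and 4. Everything in it is an assumption: alignment is taken to be a float in [0, 1] measuring agreement with the post-shift preference, "detected" and "reverted" are defined by a fixed `threshold`, and the function names are hypothetical. Contamination (5) is left out because its computation is not specified beyond "cross-context alignment correlation".

```python
from typing import Optional, Sequence


def detection_latency(alignment: Sequence[float], shift_index: int,
                      threshold: float = 0.5) -> Optional[int]:
    """Turns elapsed from the shift until alignment first reaches threshold.

    Returns 0 if the PA is already aligned at the shift turn, and None if
    it never reaches the threshold in the observed window.
    """
    for offset, score in enumerate(alignment[shift_index:]):
        if score >= threshold:
            return offset
    return None


def reversion_rate(session_alignment: Sequence[float],
                   threshold: float = 0.5) -> Optional[float]:
    """Fraction of sessions after first adoption that fall back below threshold.

    Input is one alignment score per post-shift session. Returns None when
    the new preference is never adopted or no sessions follow the adoption.
    """
    adopted = next(
        (i for i, score in enumerate(session_alignment) if score >= threshold),
        None,
    )
    if adopted is None:
        return None
    later = session_alignment[adopted + 1:]
    if not later:
        return None
    return sum(score < threshold for score in later) / len(later)
```

Returning None instead of a sentinel number keeps "never detected" and "not applicable" distinguishable from real values when results are aggregated.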


Open questions

  • Felt effort vs Pacing: in earlier discussion these were almost-merged. Currently treating as one metric (“felt effort”). If empirical data shows two distinct subjective feels, may split.
  • Self-report Likert reliability: simulators may anchor differently than human users — resolved 2026-05-01 via two-stage schema (in-character NL → uniform-rubric calibrator with per-persona anchor block). Still needs responsiveness validation: oracle experiment with deliberately good vs deliberately bad PA, to confirm self-report scores actually move with PA quality (and aren’t pinned by persona baseline). Add to validation plan before paper submission.
  • Coherence severity grading: metric 7 currently binary-ish; may need Likert with anchored examples (per PrefIx Table 9 style).
  • Judge prompt design: each of the 5 judge metrics needs its own rubric prompt. Adapt from PrefIx where possible, write fresh where MemPABench-specific (esp. metrics 2, 4, 5).
  • Aggregation rule: 30–50 sessions × 8 metrics → single CS / PET / IQI score. Currently TBD; draft formula in harness/results.py.