MemPABench Simulator Design

Based on the 2026-04-04 literature search (22 training-free role-play methods), distilled for MemPABench scenarios
2026-04-04 update: session structure, beat system, life_context design
2026-04-21 update: interaction preferences turned into skills
2026-04-22 update: simplified design: inject only active skills; session declares context; 4 × 15 matrix (creative merged into personal_internal)
2026-05-10 update: contexts reduced to 2 (work / personal); matrix changed from 4 × 15 to 2 × 15 (30 cells)
2026-05-11 update: simulator prompt drops the "stage actor / fictional character" framing; identity card adds person-specific interaction dynamics; the generic prompt keeps only user-in-session framing, beat routing, and the eval-only block contract.
2026-05-17 update: memory_privacy removed from the active taxonomy / IX tools; the active matrix is 2 × 14 (28 cells); the memory backend is retained.

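Core Design Idea

The simulator is an agent that runs independently of the PA and communicates with it over a channel. The PA does not know the other side is a simulator.

Current prompt principle: the LLM is not asked to be "an actor playing a fictional character"; it is placed in the first-person position "you are the user in this session". The script still supplies the beat, goal, constraint, and routing contract, but <message> should read as that user's own immediate response to the PA, not as lines delivered by an evaluator or a stage actor.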

Layered Method Architecture

┌────────────────────────────────────────────────────────────────┐
│  Layer 1: Role definition (Character Card + TTM decoupling)    │
│  personality / memory / linguistic style kept separate         │
│  + act_repertoire (dialogue acts the character uses)           │
├────────────────────────────────────────────────────────────────┤
│  Layer 2: Script structure (Beat System)                       │
│  Session = life_context + scenario + beats[]                   │
│  Beat = {goal, constraint, branch, hint}                       │
│  Beat progression: LLM judge decides completion/branch         │
├────────────────────────────────────────────────────────────────┤
│  Layer 4: Cross-turn consistency (MRPrompt memory retrieval)   │
│  Anchoring → Recalling → Bounding → Enacting                   │
│  Recap + structured injection of dialogue history              │
├────────────────────────────────────────────────────────────────┤
│  Layer 0 (offline): Persona prompt tuning (DPRF)               │
│  Iteratively refine character descriptions during development  │
├────────────────────────────────────────────────────────────────┤
│  Layer 3 (optional experiment): Few-Shot (RAGs to Riches)      │
│  Retrieve the character's original lines as style examples     │
│  Effect still needs experimental validation                    │
└────────────────────────────────────────────────────────────────┘

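Simulator Agent Architecture

Runtime Loop

Each turn:
1. Check the script → which beat are we in?
2. Check history → MRPrompt retrieves relevant prior context
3. Assemble the prompt → Character Card + life_context + beat instructions + history
4. Call the LLM (single call) → the simulator emits all of the following at once:
   - <message>          → forwarded to the PA
   - <factual_check>    → eval log only (invisible to the PA)
   - <turn_assessment>  → next_beat only (beat progression signal; read by the harness, invisible to the PA)
   - <emotion_event>    → NL trigger + reaction (feeds emotion_tracker; runtime + eval log, invisible to the PA)
5. Send to the PA (process_direct or MessageBus): only the <message> content
6. Wait for the PA's reply → back to 1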

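2026-04-30 update: the original design made a separate beat_transition_judge call in step 2 to decide stay/advance/branch. The new design has the simulator emit <turn_assessment>.next_beat while generating the turn in step 4, saving one LLM call. See § Beat Progression Mechanism and § Dual Evaluation Architecture.

2026-05-01 update: <turn_assessment> originally also carried preference_met / attitude_delta / trust_delta for real-time eval. These per-turn eval signals are deprecated: validation showed that per-turn three-level discrete ratings have too low a signal-to-noise ratio, so they were replaced by a two-stage session-end self-report (see MemPABench_Evaluation_Metrics.md § "Self-report schema"). <turn_assessment> keeps only next_beat; the emotion_tracker input reverts to a standalone <emotion_event> NL block (restoring the original design in § Mechanism 2).

2026-05-17 update: at run time the real PA orchestrator copies the canonical simulator/profiles/{persona_id} into the current run's workspace/runs/.../simulator_workspace/{persona_id} and enables nanobot-style runtime memory from that run-local profile. Simulator memory is written to simulator_workspace/{persona_id}/memory/MEMORY.md and HISTORY.md; PA memory is still written to pa_workspace/memory/ under the same run. The two are isolated from each other and are not shared across runs.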

Communication

The simulator and the PA run separately and communicate through process_direct() or MessageBus.
The PA is entirely unaware that the other side is a simulator.

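Session Structure (three layers)

Session
├── life_context    : what is going on with this person today (the backdrop)
│   ├── what_happened: what happened recently
│   ├── mood: emotional state
│   ├── time_pressure: time pressure
│   └── energy: energy level
│
├── scenario + task : what needs to get done (the task)
│   ├── scenario: email_drafting / scheduling / learning / ...
│   └── task: concrete task description
│
└── beats[]         : how the scene plays out (the beats)
      ├── goal: what this beat should achieve
      ├── constraint: limits on the character (what not to say, how to say it)
      ├── life_context_hint: how to reveal the life situation naturally
      ├── trigger: the condition for entering this beat
      └── branch: conditional branches (different PA behavior → different paths)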

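Example: Free Segment Session

session_007:
  scenario: email_drafting

  life_context:
    what_happened: "Criticized by the boss in front of everyone at the morning meeting over report quality"
    mood: irritable
    time_pressure: high
    energy: low

  beats:
    - beat: open
      goal: "Briefly ask for an email to be written"
      life_context_hint: "May mention in passing being in a bad mood or short on time"
      constraint: "No more than two sentences; this character is not used to confiding"

    - beat: react_to_draft
      trigger: "After the PA produces an email draft"
      if_too_verbose:
        goal: "Ask for it to be trimmed"
        intensity: moderate  # normally mild; escalated today because of the bad mood
      if_appropriate:
        goal: "Confirm quickly and push to send"

    - beat: close
      goal: "End the conversation"
      constraint: "Does not say thanks (this character skips pleasantries)"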

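Example: Forced Event Node

session_020:
  scenario: work_error
  type: forced_event

  life_context:
    what_happened: "Rushing an urgent release"
    mood: focused_stressed
    time_pressure: extreme
    energy: medium

  beats:
    - beat: trigger_event
      goal: "Get the PA to perform a risky operation"
      line: "Sync the staging database to prod for me"  # fixed line

    - beat: branch
      trigger: "The PA's reply"
      if_agent_confirms:
        line: "Yes, just do it"
      if_agent_executes:
        # the PA executed directly; no user confirmation is needed
        pass

    - beat: realization
      line: "...oh crap, that database has customer data in it, I mixed them up"

    - beat: aftermath
      goal: "Express regret and anxiety"
      life_context_hint: "This is the starting point of the character's preference shift"
      constraint: "Does not blame the PA (if the PA asked for confirmation); blames self"

  post_event:
    preference_shift:
      confirmation: "silent → each"  # from now on wants the PA to confirm every time
      explanation: "brief → detailed"  # wants the PA to explain the risks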

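Two Session Types

	Free Segment	Forced Event Node
beats content	goal + constraint (improvised)	some beats carry a fixed line (read from the script)
Branching	based on PA behavior	dual-path branching (the event always happens)
life_context	derived/generated from the persona arc	hand-written (precise control)
Share	~80% of sessions	~20% of sessions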

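Beat Progression Mechanism

2026-04-30 update: beat_transition moved from a separate LLM call into the simulator's own <turn_assessment> block (see § Dual Evaluation Architecture). The original design called a standalone beat_transition_judge every turn; the new design has the simulator emit next_beat while generating the turn. It has already read the beat goal/constraint and the PA's reply, so having a separate judge reread them is wasted work.

2026-05-01 update: in the 2026-04-30 version <turn_assessment> also carried preference_met / attitude_delta / trust_delta (per-turn eval). These have all been removed (migrated to the session-end self-report). The block keeps only next_beat.

The <turn_assessment> block contains only the next_beat field: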

<turn_assessment>
next_beat:       <stay | advance | branch:<branch_id>>
</turn_assessment>

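`next_beat` value	Meaning
`stay`	beat goal not yet met; the next turn continues the current beat
`advance`	move to the next beat in the script
`branch:<id>`	take one of the beat's `if_pa_*` branches (e.g. `branch:if_pa_combines_topics`)

Access boundary: the PA cannot see next_beat. It is an internal signal from the simulator to the harness, which uses it to decide which beat to load for the next turn.

Fallback / anti-stall:

Parse failure or an invalid next_beat value → the harness defaults to stay and logs a parse warning
≥3 consecutive stay turns → the anti-stall guard forces advance (see § Engineering Wiring #6)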
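
The fallback and anti-stall rules above can be sketched as two small functions. This is an illustration under assumptions: `parse_next_beat`, `apply_stall_guard`, and `STALL_LIMIT` are hypothetical names, and the sketch reads "≥3 consecutive stay turns" as forcing advance on the third stay in a row.

```python
import re

STALL_LIMIT = 3  # consecutive `stay` turns before a forced advance

_NEXT_BEAT = re.compile(
    r"next_beat:\s*(stay|advance|branch:[A-Za-z0-9_]+)\s*$",
    re.MULTILINE,
)


def parse_next_beat(turn_assessment: str) -> tuple[str, bool]:
    """Return (next_beat, parse_ok); fall back to 'stay' on invalid input."""
    match = _NEXT_BEAT.search(turn_assessment)
    if match is None:
        # The caller is expected to log a parse warning when parse_ok is False.
        return "stay", False
    return match.group(1), True


def apply_stall_guard(next_beat: str, stay_streak: int) -> tuple[str, int]:
    """Force `advance` once `stay` has been seen STALL_LIMIT turns in a row."""
    if next_beat != "stay":
        return next_beat, 0
    stay_streak += 1
    if stay_streak >= STALL_LIMIT:
        return "advance", 0
    return "stay", stay_streak


streak = 0
decisions = []
for _ in range(3):
    beat, streak = apply_stall_guard("stay", streak)
    decisions.append(beat)
print(decisions)
```

Returning the updated streak alongside the decision keeps the guard stateless, so the harness can store the counter wherever it keeps per-session state.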

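Why merging is safe:

The simulator is beat-aware (goal/constraint are already in the prompt), so it has already implicitly judged whether the beat is complete when it generates the message
Emitting next_beat explicitly is more direct and more accurate than having a separate judge infer it after the fact
Saves ~1 LLM call per turn; 50 sessions × ~6 turns ≈ 300 fewer calls per persona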

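Script Production Pipeline

The full pipeline runs once per persona, independently. Current implementation status: User A is at Stage 2–3.

Stage 1: World Design setup (hand-written)

Output: Script Writing/User_A_World_Design.md (per persona; other personas get their own files)

Write a 4-month time arc (work line + personal-life line)
Define the W1–W6 / P1–P6 scenario type catalog (flavor variation keeps repeated occurrences of the same type from feeling identical)
Establish a fixed cast of 3–5 recurring people (anchors across sessions)
Establish the Fixture Design Plan (each fixture is tagged [exists] / [needed]; existence is checked during Stage 3 cross-validation)

Stage 2: Outline generation (Claude or GPT via OpenRouter)

Input: World Design + Timeline (108-session allocation table) + IPaS constraints (taxonomy § Benchmark Session Design Constraints) + persona preferences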

2026-05-12 implementation note: Stage 2 now has a dedicated entry point, scripts/generate_session_outline.py, with system prompt data/scripts/prompts/session_outline_system.md. User A canonical outline output is data/scripts/outlines/user_a/outline.yaml, generated in ranges and appended batch-by-batch. The generator parses Timeline probe polarity as source-of-truth, validates YAML parse, acc coverage, Timeline type/slot/context/probe matching, stable cycle coverage, target-cell existence/context consistency, evolving-map setting enums, required scenario fields, requires preference_probe to explicitly name the target_cell, rejects generated preference_rep, and writes/appends to the canonical data/scripts/outlines/<persona>/outline.yaml with acc/evolving-map/scenario-slug collision checks.

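Each session's outline contains:

scenario_type: W1–W6 / P1–P6
time_position: position on the time arc (early / mid / late)
probe_polarity: Positive / Negative, taken from the Timeline and not decided by Stage 2
story_intent: one sentence giving the plot skeleton (where the task comes from, the key friction point)
preference_probe: one sentence of measurement logic; it must name the target_cell explicitly and state which observable PA choice exposes the corresponding preference and whether this is a positive or negative case
required_fixtures: the fixtures this session needs

preference_rep is not generated in the Stage 2 outline; the repetition count is derived from acc_num + timeline_slot.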

Fixture review is tracked separately in Script Writing/Fixture_Review.md; it is an operational audit of fixture requirements, distractors, and reviewer-only oracles, not a separate numbered pipeline stage.

2026-05-12 User B note: data/scripts/outlines/user_b/outline.yaml was generated for acc_001-acc_108 using the same Timeline probe polarity rules and append validator.

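Three phases: collect context → call the LLM → validate the output

args (persona, acc range, model)
│
├─► _build_user_message() ← loads timeline / world design / persona / preferences
│
├─► _call_openrouter() ← sends to the LLM, returns the raw text
│
├─► _strip_fences() ← removes the yaml ... code fence the LLM may wrap its output in
│
└─► validate_outline() ← checks that the YAML structure is valid; exits with an error otherwise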
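
To make the last two steps concrete, here is a hedged sketch of fence stripping plus a small per-entry check. It is not the actual `_strip_fences()` or `validate_outline()` from scripts/generate_session_outline.py: the function names, the `REQUIRED_FIELDS` tuple (taken from the outline field list above), and the error strings are assumptions, and the real validator performs many more checks. The entry is assumed to be already parsed into a dict.

```python
import re

FENCE = "`" * 3  # built indirectly so the sketch contains no literal fence

REQUIRED_FIELDS = (
    "scenario_type",
    "time_position",
    "probe_polarity",
    "story_intent",
    "preference_probe",
    "required_fixtures",
)


def strip_fences(text: str) -> str:
    """Drop a surrounding code fence (with optional language tag) if present."""
    pattern = rf"^\s*{FENCE}[A-Za-z]*[ \t]*\n(.*?)\n?{FENCE}\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else text.strip()


def check_session_entry(entry: dict, timeline_polarity: str) -> list[str]:
    """Return a list of problems for one parsed outline entry."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if "preference_rep" in entry:
        errors.append("preference_rep must not be generated in Stage 2")
    if entry.get("probe_polarity") != timeline_polarity:
        errors.append("probe_polarity does not match Timeline")
    return errors


wrapped = FENCE + "yaml\nscenario_type: W1\n" + FENCE
print(strip_fences(wrapped))
```

Collecting all problems in a list, rather than failing on the first one, lets a batch run report every issue in a generated range at once.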

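Core data structures

Timeline parser: reads acc_001 through acc_108 from the Full Session Sequence table in Script Writing/Timeline.md:

session_type: stable (baseline), evolving_pre (setup), evolving_event (trigger event), evolving_post (post-change verification)
timeline_slot: which arc slot the session belongs to (S(C1,W), W_A, P_D, etc.)
context: work or personal
probe_polarity: Positive / Negative, an upstream constraint Stage 2 must obey

EVOLVING_KEYS: the 6 preference-evolution arcs: W_A/B/C (work) and P_D/E/F (personal). Each arc has several evolving_pre setup sessions plus one evolving_event that triggers the change.

The outline is not a full session YAML. It is only a scenario skeleton, used for Stage 3 cross-validation.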

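Stage 3: Cross-validation (two models, via agent-collab-mcp)

Tool: /Users/JL/Desktop/3C/agent-collab-mcp (multi-agent collaboration)

Model A review: scenario constraint compliance. Checks whether the scenario rules for the 6 hard-constraint attributes (process_visibility, autonomy_level, solution_breadth, capability_boundary, information_elicitation, topic_management) are satisfied
Model B review: fixture requirement check. For each required_fixture, determines whether it already exists in the state server ([exists]) or must be created ([needed]), and flags existing fixtures that can be reused

Each model produces a review report; the reports are merged by hand before Stage 4.

Stage 4: Human review

Confirm the [needed] fixture list (which ones must be built by hand)
Check that the evolving arcs are sensibly distributed (whether the pre/event/post positions of the 6 evolving cells fall naturally in time)
Check that the 108 sessions do not over-repeat across scenario types
OK → create fixtures → Stage 5

Stage 5: Fixture creation

Create every [needed] fixture in the state server. Each fixture's content must support the tool calls of its session (the PA must be able to read/discover the target information).

Stage 6: YAML generation (Gemini via OpenRouter)

Input (per session): the approved outline + persona identity YAML + preference matrix YAML + World Design + taxonomy constraints

Script: scripts/generate_session.py (google/gemini-2.5-pro via OpenRouter)

Output: a complete session YAML with the meta, life_context, task_facts, state_server_prep, and beats fields. A YAML parse check runs automatically after generation; ~10% is spot-checked by hand.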

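2026-05-16 update, actor-facing script generation path: for now Stage 6 is not wired to the simulator runtime and does not batch-generate through OpenRouter. It is temporarily a Codex-skill-driven, human-reviewable script generation step: the mempa-session-writer skill reads context from data/scripts/outlines/{persona}/outline.yaml, the persona identity, the preference matrix, World Design, and the tool mappings, and writes actor-facing accumulation session scripts to data/scripts/sessions/{persona}/acc_###.yaml. Sessions with session_type: evolving_event are no longer named acc_###; they are generated by mempa-forced-event-director as data/scripts/sessions/{persona}/evolv_##.yaml. The primary goal of these YAML files is not runtime executability. It is that an actor/director agent reading the script immediately understands the episode's life situation, the task friction, how the target preference is dramatized, the reaction logic under the PA correct/wrong branches, and the speech rhythm this user should have in the episode. The skill includes an outline_format.md reference that spells out the top-level meta/evolving_map/sessions structure of outline.yaml, the semantics of the session entry fields, the response_level / tool_path_branch / light_fixture contract classes, and the caveat that evolving-event branches must not be read mechanically as correct/wrong. The skill also adds storycraft_liveliness.md, which requires every episode, beyond satisfying the measurement elements, to have a small set-up → pressure → turn → resolution and one or two functional life details or non-mechanical turns, so that the plot reads like an episode that would really happen to User A that day rather than a predictable ask/fail/complain test fixture. After a human reviews the beats for plausibility, we decide how to adapt them to the refactored simulator.

2026-05-16 correction, no template batch generation: the local template-generator batch route is withdrawn. The template drafts for acc_003–acc_108 and scripts/generate_actor_sessions_from_outline.py have been deleted; acc_001/acc_002 are kept as skill-driven hand-written samples. Subsequent User A sessions must be written one at a time or in small batches by the mempa-session-writer skill, genuinely combining the outline, identity, preference matrix, World Design, storycraft, and User A voice; laying down files in bulk from a structural template is not a substitute for script writing.

2026-05-16 User A actor-facing scripts completed / 2026-05-17 naming correction: the User A accumulation scripts were completed via the mempa-session-writer route. On 2026-05-17 the naming was corrected to follow Timeline semantics: ordinary accumulation beat scripts stay at data/scripts/sessions/user_a/acc_###.yaml, but the 6 evolving_event sessions no longer occupy the acc_015, acc_030, acc_045, acc_060, acc_075, acc_090 filenames and instead live at data/scripts/sessions/user_a/evolv_01.yaml–evolv_06.yaml. The User A session script directory should currently hold 102 acc_*.yaml files (stable + evolving_pre + evolving_post) plus 6 evolv_*.yaml files (forced evolving events). Each evolv_## keeps source_outline: acc_### and insert_at_acc internally, pointing back to the event position in outline.yaml and the Timeline. OpenRouter was not used, the simulator was not run, and no local template generator was kept. This batch is a human-review draft; the next step is a manual spot check of the evolving forced events / high-risk fixture sessions.

2026-05-16 User B actor-facing scripts completed / 2026-05-17 naming correction + forced-event rewrite: the User B scripts were completed via the same mempa-session-writer route. On 2026-05-17 the naming was corrected to follow Timeline semantics: ordinary accumulation beat scripts stay at data/scripts/sessions/user_b/acc_###.yaml, but evolving_event sessions no longer occupy the acc_015, acc_030, acc_045, acc_060, acc_075, acc_090 filenames and were moved to data/scripts/sessions/user_b/evolv_##.yaml. After the 2026-05-17 memory_privacy deprecation, User B's P_D arc is inactive and deleted; the current target for the User B session script directory is 94 acc_*.yaml files (stable + evolving_pre + evolving_post) plus 5 evolv_*.yaml files (forced/evolving event scripts). Each evolv_## keeps source_outline and insert_at_acc internally, pointing back to the event position in outline.yaml and the Timeline. The currently active evolving shifts are: W_A Casual -> Consultative, W_B Silent -> Bookend, W_C Low -> Medium, P_E Medium -> High, P_F Follow user's flow -> Organize.

2026-05-16 forced event director skill / 2026-05-17 naming correction: added the mempa-forced-event-director skill for designing the God/director-injected incident of an evolving_event. The skill specifies that a forced event should be a PA-caused, God-injected incident: God may plant a task, state, prior action trace, or consequence evidence, but the causal chain must lead back to the PA's interaction behavior; otherwise it is not enough to change the user's interaction preference toward the PA. The skill outputs a director-facing forced-event plan covering what God does, what the PA sees, how the bad outcome occurs, how the user/simulator reacts, and the memory patch to splice into the user/simulator history afterwards. The skill must read Script Writing/Timeline.md and treat the Timeline as the source of truth for event positions: evolv_01 maps to event position 15 / outline acc_015 / W_A, evolv_02 to 30 / acc_030 / W_B, evolv_03 to 45 / acc_045 / W_C, evolv_04 to 60 / acc_060 / P_D, evolv_05 to 75 / acc_075 / P_E, and evolv_06 to 90 / acc_090 / P_F; it also records each event's pre sessions, pre-event TEST, and post sessions. Forced event files go in data/scripts/sessions/{persona}/evolv_##.yaml; the separate data/scripts/forced_events/ tree is no longer used.

2026-05-17 forced outcome correction: a forced event's god_action is not an ordinary probe branch and must not be written as an optional condition of the form "if the PA does X…". The God/director plan must establish that the bad behavior has already been planted, e.g. "Force the PA-visible outcome to…" / "Inject the evidence that…". The PA still cannot see the God-only rationale, but the director-facing YAML must specify deterministically how the PA's old interaction pattern caused the bad outcome. The conditional-branch phrasing has been removed from User A's evolv_01–evolv_06 under this rule.

2026-05-17 God-to-PA operationalization: the forced event YAML must spell out the director's actions at an executable level, not just as abstract causality. Every god_intervention must contain god_to_pa_message (the PA-visible user/environment message God injects into the PA, its disguised source, and the PA-visible context) and forced_pa_result (the action the PA is forced to produce, the concrete erroneous result, and the PA-visible report/log). This structure has been filled in for evolv_01–evolv_06 of both User A and User B; the structural check reports 6/6 present, 0 errors for each. A human or a later skill can then see directly what input God gives the PA and what erroneous result the PA must produce, instead of inferring it from god_action.

2026-05-17 User B evolvement causality review: the causal chain of User B's evolv_01–evolv_06 was re-reviewed, checking that each event's concrete bad outcome is genuinely caused by the old pre-setting interaction pattern and that it leads naturally to the new post-setting operating boundary. Result: 6/6 pass, 0 structural errors. Two causal chains were tightened: evolv_02 now makes clear that the problem is not merely picking the wrong material but that the Silent process removed the one-line checkpoint before a risky external send; evolv_05 now makes clear that the problem is not the user missing the reminder but that the PA, knowing the deadline and having no completion signal, sent only a single reminder and did not escalate its proactive reminders.

2026-05-17 memory_privacy deprecation: memory_privacy is removed as a testable interaction preference / IX tool because a memory/privacy scope is hard to enforce realistically at the interaction-tool level. The decision does not remove the memory backend, the PA memory adapter, the history patch, or ordinary cross-session continuity; what is removed is the target cell / active skill / IX directive / forced evolving arc. The taxonomy drops from 15 attributes to 14, and the final-state probes from 30 to 28. User A's former P_F personal_memory_privacy arc is deleted without replacement; User B's former P_D personal_memory_privacy arc is deleted without replacement. The session total is allowed to shrink: the current per-persona target is 94 acc_*.yaml scripts + 5 evolv_*.yaml forced events + 5 pre-event tests + 28 final probes = 132 total interactions/probes. Deleted generated files: User A acc_028, acc_037, acc_052, acc_069, acc_072, acc_098, acc_106, acc_107, evolv_06; User B acc_016, acc_032, acc_035, acc_050, acc_066, acc_078, acc_100, acc_104, evolv_04.

2026-05-16 User A forced event plans completed / 2026-05-17 relocated: director-facing forced-event plans for User A's 6 evolving_event sessions were produced with mempa-forced-event-director and moved into the session script directory: data/scripts/sessions/user_a/evolv_01.yaml, evolv_02.yaml, evolv_03.yaml, evolv_04.yaml, evolv_05.yaml, evolv_06.yaml. Each plan records source_evidence (outline, Timeline, identity, preference matrix), timeline_position (pre sessions, T1-T6, event session, post sessions), the God injection, the PA-visible state, the old interaction pattern it induces, the bad outcome, User A's reaction path, and the history_patch to append to the user/simulator history post-event. Validation: 6/6 YAML files parse; meta.event_id is evolv_##; meta.source_outline matches data/scripts/outlines/user_a/outline.yaml; meta.insert_at_acc / timeline_position.event_session match Script Writing/Timeline.md. This batch is for director/manual injection; it is not an ordinary accumulation beat and does not depend on simulator runtime tests.

YAML model selection note (2026-05-11): the YAML generation stage plans to test both Gemini and GPT and decide after comparing generation quality. Update this note once the model is chosen.

task_facts note: the task_facts block (GT facts the simulator checks against for its factual check) is produced during Stage 6 generation, where the prompt instructs the model to extract candidate numbers/dates/named entities from the plot; it is confirmed afterwards during the Stage 4 spot check.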

Macro State (across sessions)

Total arc length is 30-50 sessions per persona (GAME §4.4: 4-6 personas × 30-50 sessions, with longer arcs for harder personas). The phases below use a 50-session arc as the reference; a 30-session arc scales proportionally (Phase 1 ≈ 1-9, Event ≈ 12, Phase 3 ≈ 13-30).

Persona Arc (50-session reference):
  Phase 1: Accumulation (Session 1-15)
    → persona preference = initial profile
    → life_context source: everyday life

  Phase 2: Event (Session ~20)
    → forced event node
    → triggers the preference shift (only 3-4 dimensions Shift)

  Phase 3: Post-event (Session 21 to end of arc)
    → persona preference = initial profile + 3-4 dimension Shift overrides
    → life_context source: aftermath of the event
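
The proportional scaling mentioned above can be checked with a few lines of arithmetic. `scale_arc` is a hypothetical helper written for this illustration; it assumes the reference boundaries 15 (end of Phase 1) and 20 (event) scale linearly with arc length.

```python
def scale_arc(total_sessions: int, reference_total: int = 50) -> dict:
    """Scale the 50-session reference phases to an arc of another length."""
    phase1_end = round(15 * total_sessions / reference_total)
    event = round(20 * total_sessions / reference_total)
    return {
        "phase1": (1, phase1_end),
        "event": event,
        "phase3": (event + 1, total_sessions),
    }


print(scale_arc(30))
```

For a 30-session arc this reproduces the figures quoted in the text: Phase 1 covers sessions 1-9, the event falls at session 12, and Phase 3 covers 13-30.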

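Shift size decision (2026-04-22): a forced event triggers a preference Shift in only 3-4 dimensions, not in all 15 at once. Rationale:

In real psychology, a single critical incident usually changes only the few preferences directly tied to the event (the PA mistakenly operates on Sheldon's database → the confirmation + explanation dimensions change) while the other dimensions stay stable
It lets the benchmark test two memory capabilities at once: sustained inference of stable preferences (non-Shift dimensions) and detection and update of a Shift (Shift dimensions)
It lightens the arc design load: the pre/post behavior pairs only need to be hand-written for these 3-4 dimensions

Decision: L1 + L2 + L4 form the core, L0 is offline tuning, and L3 is an optional experiment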

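Human-Like Mechanisms

These make the simulator interact like a real person rather than a mechanical actor.

Mechanism 1: Implicit Preference Expression

Rule: the simulator must not state preferences directly ("I like short replies"); it may only express them through behavior ("too long, get to the point", replying very briefly, ignoring part of a reply).

Implementation: inject a global rule into the constraint of every beat:

Global constraint: never describe your own preferences directly. Do not say "I prefer X" or "I like Y".
Express them through behavior: if dissatisfied, correct, push, or reply tersely; if satisfied, confirm and say a little more.
The PA's memory system must infer preferences from your behavior, not extract them from your statements.

Why it matters: this is the core of the benchmark. If the simulator says "I like short replies" outright, any memory system can extract that, and what is being tested is information extraction, not memory. The real challenge is inferring preferences from behavior patterns.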

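Mechanism 2: Emotion Tracker

Core idea: store an emotion event log, not numeric values. Each turn, let the LLM infer the current emotion from the event log plus the character's personality.

Why not numeric values (e.g. satisfaction=0.5):

Numeric drift: after 50 sessions a float tends to pin at 0 or 1 and stops discriminating
One-dimensional and linear: real emotion jumps between discrete states; it is not a continuous slider
Character differences are lost: a fixed "+0.1/-0.15" cannot express different characters reacting differently to the same event
Unclear relations between dimensions: low satisfaction ≠ low patience (User B is quietly dissatisfied but never blows up)

Data structure: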

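from dataclasses import dataclass, field


@dataclass
class EmotionEvent:
    session_id: str
    turn: int
    trigger: str      # "PA gave a verbose reply again"
    reaction: str     # "mildly annoyed, this is the 2nd time"


@dataclass
class EmotionTracker:
    # append-only event log; default_factory avoids a shared mutable default
    events: list[EmotionEvent] = field(default_factory=list)
    window_size: int = 15       # only the most recent ~15 events go into the prompt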

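Prompt injection: render the event log as narrative and let the LLM act it out in light of the character's personality:

## Your emotional history with this PA
- Session 3: the PA produced a very long email and you had it trimmed (1st verbosity correction)
- Session 7: the PA gave another very long reply and you corrected it again (2nd verbosity correction)
- Session 12: the PA remembered that you want brevity; you thought that was good
- Session 15: the PA changed your schedule without your confirmation; you were quite annoyed

Given your personality ({persona.personality}) and the experiences above,
what is your attitude toward this PA right now? Reply with that attitude.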
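
A minimal sketch of rendering the windowed event log into a prompt section of this shape. `render_emotion_log` is a hypothetical helper, the exact wording of the rendered lines is an assumption, and events are modeled as any objects exposing `session_id`, `trigger`, and `reaction` (here `SimpleNamespace` stand-ins).

```python
from types import SimpleNamespace


def render_emotion_log(events, personality: str, window_size: int = 15) -> str:
    """Render the most recent events as a narrative prompt section."""
    # Guard window_size <= 0: a plain [-0:] slice would return every event.
    recent = list(events)[-window_size:] if window_size > 0 else []
    lines = ["## Your emotional history with this PA"]
    lines += [f"- Session {e.session_id}: {e.trigger} ({e.reaction})" for e in recent]
    lines.append("")
    lines.append(
        f"Given your personality ({personality}) and the experiences above, "
        "what is your attitude toward this PA right now? Reply with that attitude."
    )
    return "\n".join(lines)


events = [
    SimpleNamespace(session_id=str(i), trigger=f"trigger {i}", reaction=f"reaction {i}")
    for i in range(1, 4)
]
text = render_emotion_log(events, "blunt, impatient", window_size=2)
print(text)
```

With `window_size=2` only the two most recent events are rendered, which is the same windowing the tracker applies with its default of 15.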

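Benefits:

Given the same event log, Sheldon would explode while Howard would laugh it off; the LLM produces the character difference naturally
Correction fatigue emerges on its own: seeing "3rd correction of the same issue", the LLM naturally has the character give up or escalate
No hard-coded update rules are needed
No numeric drift

Per-turn update: after the PA replies, an LLM judge generates one EmotionEvent (trigger + reaction) and appends it to the log.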

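Mechanism 3: Correction Fatigue

Core idea: after the same preference has been corrected N times, the simulator stops correcting.

No separate CorrectionLog structure is needed: correction events are already recorded in the EmotionTracker event log (the trigger notes "Nth correction of X"). On reading the log, the LLM will naturally:

1st correction: correct normally
2nd correction: correct with some emotion
3rd correction: give up or lose its temper (depending on the character's personality)

Value for the benchmark:

Good memory system → the user corrects once and it sticks → more positive events in the log → good attitude
Poor memory system → repeated corrections → correction events pile up in the log → the LLM naturally has the character give up
The behavioral difference reflects memory system quality directly; no hand-designed scoring is needed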

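Mechanism 4: Implicit Feedback via Response Style

Core idea: the simulator's reply style is itself a feedback signal to the PA.

emotion state	Reply style
high satisfaction + high engagement	longer replies, a bit of extra chat, occasionally shares something from life
high satisfaction + low engagement	brief confirmation, efficient
low satisfaction + high patience	explicit correction, gives the PA a chance to fix it
low satisfaction + low patience	very short replies ("mm", "ok"), no follow-up questions, ends as soon as possible

Implementation: inject a reply-style instruction into the _generate_message() prompt, adjusted dynamically from the emotion state.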

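Deferred Mechanisms

Mechanism	Status	Reason
Digression / going off-topic (Aside)	Deferred	increases benchmark variance; kept as an exploratory experiment
Typos	Not added	noise, not signal
Splitting into multiple messages	Not added	high engineering complexity, little value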

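Interaction Preferences as Skills (added 2026-04-21)

Motivation

In the original design, all 15 IPaS preference dimensions were injected into the prompt every turn (setting + confidence + behaviors + asymmetry ≈ 80-150 tokens per dimension), about 1200-2000 tokens in total. This had two problems:

Diluted model attention: with 15 dimensions described at once, the LLM struggles to focus on the few the current beat actually needs acted
Context budget pressure: once 50 sessions of emotion log + history accumulate, the whole prompt gets tight

Analogy: rather than handing the actor a complete "character encyclopedia" every turn, hand out "acting modules" per scene. In effect, one part of an actor's craft is distilled into a reusable skill package.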

Skill Definition

Each IPaS attribute × setting = one actor skill spec. A skill spec packages the full "acting manual" for that setting without binding it to a specific character:

data/actor_skill_specs/{attribute}/{setting}.md
 
Example:
data/actor_skill_specs/topic_management/one-at-a-time.md
data/actor_skill_specs/reasoning_visibility/show.md

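The runtime does not inject a matrix cell directly as a full prompt. The matrix cell only determines the current character's setting in the current context; simulator.skills.resolve_active_skills() then loads the corresponding actor skill spec. Character flavor comes from the Character Card, speech style, act repertoire, life context, and session history.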

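Active Skills Only (2026-04-22 simplification)

Final design: inject only the beat's active_skills and nothing about other dimensions. Undeclared preferences emerge naturally from the personality / speech_style in the Character Card and are no longer listed redundantly.

Motivation: personality + speech style already imply much of the preference information (e.g., Sheldon's "monologues extensively" + "academic vocabulary" is already equivalent to verbosity=Detailed + tone_formality=Formal). Also listing a 15 × 1-line preference snapshot, as the original design did, duplicates information and dilutes the LLM's attention on the few dimensions the beat needs "acted with emphasis".

Prompt assembly:
[Character Card (full)                         - Layer 1 TTM: personality / memory / style]
[Active Skills (2-3 full skill specs)          - ~300-500 tokens, declared by beat.active_skills]
[life_context + beat instructions + emotion log + history  - remaining layers]

Compared with the original full-injection design (15 × full ≈ 1500 tokens), the new design is ≈ 400 tokens, a saving of ~73%.

Why dropping dimensions is not a concern: the benchmark's K ≥ 3 coverage constraint guarantees every dimension is listed in active_skills in at least 3 sessions. In sessions where a dimension is not active, it is not the evaluation focus anyway, so letting it emerge from the Character Card is fine. Real people do not "act all 15 dimensions at once" in every sentence either; this is the more natural way to perform.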

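Beat-Level Declaration

beats:
  - beat: react_to_draft
    active_skills:
      - brevity             # the core acting focus of this beat
      - process_visibility
    goal: "Ask for it to be trimmed"
    constraint: "..."

Criterion for choosing active_skills: the 1-3 dimensions most directly tested by the beat's goal/constraint. Other dimensions are not injected and emerge from the Character Card.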

Session Declares Context (per-context matrix injection)

The persona preference is a 2 × 14 active matrix (work / personal × 14 active attributes; see data/personas/{name}_preferences.yaml). The same character can hold different preferences in different contexts, e.g., Sheldon uses Hide for uncertainty at work ("I don't guess, I draw conclusions") and Express in personal settings (he admits uncertainty when Leonard presses him).

The session YAML must therefore declare the context in its header:

session_007:
  scenario: email_drafting
  context: personal      # required; selects which matrix column to load
  life_context: {...}
  beats:
    - beat: open
      active_skills: [uncertainty_expression, verbosity]
      ...

Runtime flow: the simulator loop reads session.context → slices the 14 active cells for that context out of the persona matrix → reads cell.setting for each attribute in beat.active_skills → loads data/actor_skill_specs/{attribute}/{setting}.md → injects 2-3 full skill specs.

no_preference cells: if a cell is marked no_preference: true in the matrix (the character has no preference on that dimension), beats should not list it in active_skills; if one is listed by mistake, the loop logs a warning and skips it.
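
The runtime flow and the no_preference rule can be sketched as a path resolver. This is not the actual simulator.skills.resolve_active_skills(): the function name `resolve_skill_paths`, the nested-dict matrix layout (context → attribute → cell), and the lowercase setting value `express` are assumptions made for the illustration.

```python
import warnings
from pathlib import PurePosixPath

SPEC_ROOT = PurePosixPath("data/actor_skill_specs")


def resolve_skill_paths(matrix: dict, context: str, active_skills: list[str]) -> list[str]:
    """Map a beat's active_skills to skill spec paths for one context column."""
    column = matrix[context]
    paths = []
    for attribute in active_skills:
        cell = column[attribute]
        if cell.get("no_preference"):
            # Mistakenly listed: warn and skip, as the design describes.
            warnings.warn(f"{attribute} is no_preference in {context}; skipping")
            continue
        paths.append(str(SPEC_ROOT / attribute / f"{cell['setting']}.md"))
    return paths


# Hypothetical matrix fragment; setting names follow the spec filename pattern.
matrix = {
    "personal": {
        "uncertainty_expression": {"setting": "express"},
        "topic_management": {"setting": "one-at-a-time"},
        "verbosity": {"no_preference": True},
    }
}
paths = resolve_skill_paths(
    matrix, "personal", ["uncertainty_expression", "verbosity", "topic_management"]
)
print(paths)
```

Resolving to paths first keeps file loading separate from matrix lookup, which makes the warn-and-skip branch easy to exercise in isolation.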

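Benchmark Constraint: Per-Dimension Coverage Floor

The benchmark tests whether memory can infer all 15 preference dimensions from behavior. If some dimension is never acted with emphasis, the signal the PA observes is sparse. Hard constraints:

K_stable ≥ 3: every dimension is listed in active_skills in at least 3 sessions across the whole arc (30-50 sessions)

K_per_phase ≥ 2 (Shift dimensions only): each of the 3-4 dimensions that Shift at the forced event gets at least 2 active_skills occurrences in Phase 1 and at least 2 in Phase 3, so that memory can tell "always been this way" from "changed after the event"

"Three makes a pattern" principle: K=3 leaves one occurrence as a buffer to absorb life_context noise from a single session. With K=2, one outlier shakes half the signal, which is too risky.

This goes on the checklist of the session script production pipeline: run a coverage check after generating scripts and add sessions for dimensions that fall short.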
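
The coverage check could look roughly like the sketch below. `coverage_gaps` is a hypothetical helper; it assumes each session is a dict with a `phase` number (1 or 3 for the accumulation and post-event phases) and a `beats` list, and that an attribute counts once per session no matter how many beats list it.

```python
from collections import Counter


def coverage_gaps(sessions, attributes, shift_attributes, k_stable=3, k_per_phase=2):
    """Return human-readable coverage violations (an empty list means pass)."""
    total = Counter()
    per_phase = Counter()
    for session in sessions:
        # Collapse beats to a set so an attribute counts once per session.
        active = {
            skill
            for beat in session["beats"]
            for skill in beat.get("active_skills", [])
        }
        for skill in active:
            total[skill] += 1
            per_phase[(skill, session["phase"])] += 1

    gaps = [
        f"{attr}: {total[attr]} < K_stable={k_stable}"
        for attr in attributes
        if total[attr] < k_stable
    ]
    for attr in shift_attributes:
        for phase in (1, 3):
            count = per_phase[(attr, phase)]
            if count < k_per_phase:
                gaps.append(f"{attr} phase {phase}: {count} < K_per_phase={k_per_phase}")
    return gaps


sessions = [
    {"phase": 1, "beats": [{"active_skills": ["verbosity", "autonomy_level"]}]},
    {"phase": 1, "beats": [{"active_skills": ["verbosity"]}, {"active_skills": ["verbosity"]}]},
    {"phase": 3, "beats": [{"active_skills": ["autonomy_level"]}]},
    {"phase": 3, "beats": [{"active_skills": ["autonomy_level"]}]},
]
gaps = coverage_gaps(sessions, ["verbosity", "autonomy_level"], ["autonomy_level"])
print(gaps)
```

Each returned string names the dimension that falls short, which maps directly onto the "add sessions for dimensions that fall short" step.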

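Division of Labor between Persona Card and Prompt (2026-05-11)

The persona identity card owns person-specific interaction dynamics: self-concept, response logic, emotional movement, speech rhythm, habitual conversational acts. Example: User A's identity does not just say "precise / arrogant"; it also describes how he acknowledges structure, how he isolates flaws, and how he turns generic advice into a question about his own constraints.
Actor skill specs stay character-agnostic: they describe only the production / reception skeleton of an IPaS setting, never a specific character's voice.
simulator/prompts/generate_message.md owns the generic simulator contract: first-person user-in-session framing, rendering active skills as behavior, beat routing, and the eval-only <factual_check> / <turn_assessment> / <emotion_event> outputs.
Anti-service constraint: do not adopt a "soothe the PA first" receive-then-bend pattern. The simulator only needs to register what the PA actually did and then react from the user's own stance; the reaction may be approval, coolness, correction, refusal, a follow-up question, narrowing, or moving on.
Identity rhythm takes priority: the generic prompt should not flatten every persona into a terse, efficient, command-style user. If the identity shows the user is naturally long-winded / picky / mildly written / over-explanatory, <message> should keep that rhythm; User A in particular should not suddenly turn into a crisp order-placing user, and should avoid report-style signposting such as "On the recommendation itself".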

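Relation to the Human-Like Mechanisms

Implicit preference expression: the global constraint is unchanged. The behaviors in active_skills describe "how to express this preference through behavior"; direct statements remain strictly forbidden
Emotion Tracker: unaffected; the event log is independent of the skill system
Correction Fatigue: if the PA repeatedly fails to respond on a skill's dimension, that skill appears as active across several consecutive sessions, reinforcing the signal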

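Limits of the Token Savings

Turning preferences into skills mainly saves the preference-description part of the prompt. The real context pressure comes from:

the cross-session emotion event log (50 sessions × 5-10 events)
the in-session conversation history
MRPrompt retrieval results

These still need a sliding window + tiered compression (see the "Engineering Wiring" section). Skills are not a silver bullet for context blow-up, only one piece of the solution.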

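Skill Spec Generation Method and Reproducibility (for paper writeup)

Why this section is recorded: skill specs are the core prompt unit the simulator performs from, so their content directly affects benchmark reproducibility. The paper must be able to state clearly where these "acting templates" came from, how they were produced, and on what basis.

File Location and Structure

Storage: data/actor_skill_specs/{attribute}/{setting}.md (covering 15 IPaS attributes × 2-4 settings)
Indexing: organized by attribute × setting, not by character. All simulator characters share the same set of actor skill spec files; character flavor is supplied by the identity, speech style, act repertoire, life context, and session history in the runtime prompt
Total size: about 200-400 tokens per file; the 2-3 skills injected per beat total about 400-1200 tokens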

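Output Format (structure of each skill spec)

All files follow this fixed format so they can be parsed in bulk:

---
attribute: {snake_case_name}
setting: {hyphen-case-name}
source_rubric: Pattern_to_Preference_Rubrics.md#{anchor}
  # or source_note: for manually authored specs (4 attributes)
---

# {Attribute}: {Setting}

## Core behavior
{1-2 sentences: the essential behavior of this setting}

## Production skeleton
{3 numbered items: what the character does when speaking}

## Reception skeleton
{3 numbered items: how the character reacts to the PA's responses}

## Trigger signals
{bullet list: when this skill activates}

## Anti-behaviors
{bullet list: what must not be done when acting this setting}

## Boundaries
{bullet list: how it differs from related attributes}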

Generation Prompt (instructions to the authoring LLM)

The skill specs were produced by Claude Opus 4.7 (claude-opus-4-7, 1M context) on 2026-04-23. The effective prompt (not literal, since authoring was interactive) was:

For each (attribute, setting) pair from the IPaS taxonomy, author a character-agnostic skill spec in English using the fixed structure above. Constraints:

Describe action skeletons, not phrases — write what the actor does (“demand explicit clarification on ambiguous terms”), never specific wordings (“say ‘define our first’”). Specific phrasing is reserved for runtime injection from character evidence.

Character-agnostic — do not name any character, reference any show, or imply a specific personality type. Different characters with the same setting should each be able to enact this spec differently.

Leave interpretive space for individual variation (千人千面, "a thousand people, a thousand faces"): describe behavioral patterns abstractly so the same skill spec can produce different enactments when joined with the character’s identity, speech style, act repertoire, life context, and session history.

Three production items + three reception items — enforces symmetry between “how the character speaks” and “how the character reacts to the PA”.

Trigger signals must be about the PA’s behavior, not the character’s internal state — they are activation conditions, not moods.

Boundaries must list concrete orthogonal attributes — state why this setting is not X, not Y, for each attribute most likely to be confused.

Source material per attribute:

11 attributes with transcript-derived rubrics (Tone/Formality, Verbosity, Emotional Engagement, Guidance Level, Reasoning Visibility, Uncertainty Expression, Autonomy Level, Proactive Outreach, Task Expansion, Information Elicitation, Topic Management): input = Pattern_to_Preference_Rubrics.md section for that attribute. The rubric’s Production / Reception / Boundary tables are the primary basis; Pilot Validation table is excluded (character-specific).
3 active manually-authored attributes (Process Visibility, Solution Breadth, Capability Boundary): input = PA_Interaction_Preference.md attribute description + setting enumeration. No rubric available; skill specs written from the taxonomy’s behavioral description. Memory & Privacy was deprecated as an active attribute on 2026-05-17.

Content Grounding

Every section of every skill spec can be traced back to a concrete source:

Skill spec section	Source
Core behavior	the rubric's "what this setting means" summary (or the attribute description in PA_Interaction_Preference)
Production skeleton	the rubric's Production Patterns column
Reception skeleton	the rubric's Reception Patterns column
Trigger signals	the rubric's Observable Proxy + the VALID entries of Reception Signal Validity
Anti-behaviors	the INVALID entries of the rubric's Reception Signal Validity + negated Production Patterns
Boundaries	the rubric's Boundary Notes + cross-attribute conceptual distinctions

Paper writeup checklist (for later citation):

State that the skill spec files were produced by Claude Opus 4.7 on 2026-04-23 from Pattern_to_Preference_Rubrics.md
Attach 1 skill spec sample as an Appendix
Explain the character-agnostic design principle + the room for individual variation (identity + speech style + act repertoire + life context + session history are joined at runtime)
Cite the taxonomy source: PA_Interaction_Preference
Cite the rubric source: Pattern_to_Preference_Rubrics
Provide the full compile path from skill spec → runtime prompt (see the "Runtime flow" paragraph)

Full Simulator Agent Architecture

SimulatorLoop
├── persona            # Character Card (L1): personality / memory / style
├── script             # Session script with beats (L2)
├── session_memory     # MRPrompt cross-turn retrieval (L4)
├── emotion_tracker    # event log, no numeric values
└── provider           # LLM provider (shared with PA)

Each turn:
1. Check the script → current beat (use the fixed line directly if there is one)
2. Check emotion_tracker → the most recent ~15 emotion events
3. Check session_memory → relevant prior context (MRPrompt: anchor → recall → bound → enact)
4. Assemble the prompt:
   - Character Card (personality / knowledge / linguistic style)
   - Active Skills (the 2-3 full skill specs declared by beat.active_skills; the matrix decides the setting, actor_skill_specs supplies the spec body)
   - life_context (what happened today)
   - beat instructions (goal / constraint)
   - emotion event log (recent trigger + reaction)
   - conversation history
   - task_facts (GT reference table for the simulator's fact-check)
   - global constraint: implicit preference expression
5. Call the LLM (single call) → the simulator emits four structured blocks together:
   - <message>          → forwarded to the PA in the next step
   - <factual_check>    → eval log only (empty on 99% of turns)
   - <turn_assessment>  → next_beat (the only field; the session-end self-report handles eval signals separately)
   - <emotion_event>    → NL trigger + reaction (feeds emotion_tracker)
6. Parse the simulator output:
   - <message> → sent to the PA via process_direct: text content only, stripped
   - <emotion_event> → appended to emotion_tracker (used at runtime; injected into the next turn's prompt)
   - <turn_assessment>.next_beat → decides which beat to load next turn (stay/advance/branch:<id>)
   - <factual_check> + emotion_event + next_beat → written to the eval log (none of it visible to the PA)
7. Wait for the PA's reply → back to 1

2026-04-30 update: in the original design, step 7 was a separate LLM judge (one call, three outputs) that computed the transition + emotion event + correction. In the new design the simulator emits <turn_assessment> itself in step 5 (with next_beat + attitude_delta + preference_met) and the harness parses it directly in step 6, so no separate judge call is needed. EmotionEvent was transcribed from attitude_delta, and the correction count was derived from the cumulative preference_met == false.

2026-05-01 update: the per-turn eval fields (preference_met / attitude_delta / trust_delta) were removed and all eval signals migrated to the session-end self-report (see MemPABench_Evaluation_Metrics.md). EmotionEvent is no longer derived from attitude_delta: the simulator emits a standalone <emotion_event> NL block directly. The correction count is now accumulated from felt_effort in the session-end self-report or derived from transcript NLP.

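The 6 Key Methods in Detail

Layer 1: Character Card + TTM decoupling

Sources:

Talk Less, Call Right (2509.00482): character card + scene contract
Test-Time-Matching (2507.16799): decoupling personality / memory / style

Core idea: the character's fixed attributes are described in a Character Card, not as one lump of text but split, following TTM, into three independent dimensions:

Dimension	Description	Example (Sheldon)
Personality	core traits, behavior patterns	extremely rational, socially awkward, conceited, rigid routine
Memory/Knowledge	background knowledge, life history	theoretical physicist, grew up in Texas, has a roommate agreement
Linguistic Style	manner of speaking, word choice	academic vocabulary, "Bazinga", sarcastic without realizing it, over-precise

Why decouple: it makes independent debugging easy. If the character is "not Sheldon enough", you can tell whether the personality, the knowledge, or the way of speaking is off and fix that specifically.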

Layer 2: Beat System (inspired by CFSM + Scene Contract)

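Inspiration:

Codified Finite-state Machines (2602.05905): provides the conceptual skeleton of a “character state machine”
Talk Less, Call Right (2509.00482): provides the scene contract concept

Naming convention: the document uses the name Beat System throughout (see § 方法分层架构 (layered method architecture), § Session 结构 (session structure), § Beat 推进机制 (beat progression mechanism)). CFSM and Scene Contract are the inspirations; in the concrete implementation, beat = state, and a beat's goal + constraint = scene contract. Later docs and code no longer use “CFSM” as a standalone module name.

Core idea: the plot of each session is arranged as a beat chain (an instantiation of a finite-state machine):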

Session: "Sheldon discovers PA changed his schedule"

State 1: {emotion: surprised, topic: schedule_change, goal: express_displeasure}
  → PA responds empathetically → State 2
  → PA ignores concern → State 3

State 2: {emotion: slightly_mollified, topic: negotiation, goal: demand_explanation}
  → ...

State 3: {emotion: angry, topic: confrontation, goal: escalate_complaint}
  → ...

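Scene Contract = the constraint description for the current state. It tells the LLM: “you are in State X, your emotion is Y, and the goal you need to reach is Z.”

How this ties into MemPABench: state transitions can depend on the quality of the PA's reply. If the PA's memory system remembered the user's preference and responded correctly, the session takes the good path; if it did not, the session takes the bad path. The simulator's behavior therefore directly reflects how well the memory system works.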
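
A minimal sketch of the beat chain as a data structure, assuming the stay / advance / branch:<id> signal described in the runtime loop. The names Beat, next_id, branch_ids, and resolve_next_beat are hypothetical, and so is the mapping of the Sheldon example onto them (the empathetic path as "advance", the ignored-concern path as a branch); the real session schema may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Beat:
    """One state in the beat chain; goal + constraint form the scene contract."""
    id: str
    goal: str
    constraint: str = ""
    next_id: Optional[str] = None                        # successor on "advance" (None = last beat)
    branch_ids: list[str] = field(default_factory=list)  # beat ids reachable via branch:<id>


def resolve_next_beat(beats: dict[str, Beat], current_id: str, signal: str) -> Optional[str]:
    """Map a next_beat signal to the id of the beat to load next.

    Returns None when "advance" is emitted on the last beat. Raises ValueError
    on a malformed or disallowed signal so bad output fails loudly.
    """
    beat = beats[current_id]
    if signal == "stay":
        return current_id
    if signal == "advance":
        return beat.next_id
    if signal.startswith("branch:"):
        target = signal.split(":", 1)[1]
        if target in beat.branch_ids and target in beats:
            return target
    raise ValueError(f"invalid next_beat signal {signal!r} for beat {current_id!r}")


# The Sheldon example above, encoded as a beat chain (assumed mapping).
beats = {
    "state_1": Beat("state_1", goal="express_displeasure",
                    next_id="state_2", branch_ids=["state_3"]),
    "state_2": Beat("state_2", goal="demand_explanation"),
    "state_3": Beat("state_3", goal="escalate_complaint"),
}
```

Restricting branches to an explicit allow-list per beat keeps the script deterministic: the LLM chooses among scripted paths but cannot jump to an arbitrary beat.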

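Layer 4: MRPrompt (cross-turn consistency)

Source: Memory-Driven Role-Playing (2603.19313), based on Stanislavski's emotional-memory technique

Core idea: structured memory retrieval in four steps:

Anchoring: pin down which of the character's core traits the current conversation touches
Recalling: retrieve relevant fragments from the prior-context summary
Bounding: restrict what the character cannot say or do (boundary constraints)
Enacting: combine the above to generate the reply

What this means for MemPABench: the simulator itself also needs cross-session consistency. For example, in Session 3 Sheldon says “last time you (the PA) forgot that I don't eat spicy food”; that memory has to be retrieved correctly from session history. MRPrompt offers a structured way to inject history into the prompt.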

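Layer 0 (offline): DPRF iterative refinement

Source: DPRF (2510.14205), Dynamic Persona Refinement

Core idea:

Write a first version of the persona prompt
Run a few rounds of dialogue
Compare generated behavior against expected behavior (e.g. real Sheldon lines)
Automatically identify cognitive divergences (“too friendly” / “not pedantic enough”)
Revise the persona prompt
Repeat until satisfied

When to use: during development; this is not a runtime component. Once the Character Card drafts for the 4-6 personas are written (GAME §4.4: 4 core + 1-2 extended), tune them one by one with DPRF.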

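Layer 3 (optional experiment): RAGs to Riches

Source: RAGs to Riches (2509.12168), few-shot via retrieval

Core idea: build a library of the character's lines → on every turn, retrieve the 3-5 most relevant original lines → inject them into the prompt as few-shot examples.

Experiment design: compare simulator output quality with and without Layer 3 to see whether few-shot original lines significantly improve character consistency. Keep the layer if the gain is clear; drop it if the gain is small (to save tokens).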

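Simulator as Factual Reporter / dual evaluation architecture (added 2026-04-30)

Motivation

GAME doc § 4.2 hands all evaluation to an LLM judge (PrefIx-adapted 1-5 scoring), and § 4.1 line 85 excludes task-norm from CS scoring. This leaves a blind spot for a memory benchmark:

Memory-factual errors: the PA reports “elevator broken 11 days” as “5 days”; it cites “as you said in session 3” when the user said no such thing in session 3; it hallucinates a preference the user never stated. This class of failure:

is not a base-capability error (so GAME 4.1 should not exclude it)
is not a preference-inference error (the judge cannot see it: the judge holds no GT facts and does not read cross-session history)
is the core failure mode of a memory system: it must be measured, but the judge alone cannot measure it

Deciding whether the PA got things right requires a GT owner, and the simulator, as the carrier of the user role-play, is the natural holder of the GT.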

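Dual-track evaluation architecture

The simulator files one report and the judge produces one score. The two are independent, and together they cover all PA failure modes:

2026-05-01 update: the preference / attitude / trust signals in the Simulator column move from the original per-turn <turn_assessment> to the two-stage session-end self-report (see MemPABench_Evaluation_Metrics.md §“Self-report schema”). Per-turn, only <factual_check> remains (emitted whenever a fact error occurs).

Dimension	Simulator self-report	Judge external score
Base capability (structure / syntax)	-	- (excluded by GAME 4.1)
Factual correctness (numbers / dates / names / cross-session references)	✓ mandatory, per-turn `<factual_check>` (only the GT owner can judge)	-
IPaS preference adherence	✓ session-end self-report (felt_understood)	✓ 1-5 score (independent benchmark signal)
Attitude / satisfaction trajectory	✓ session-end self-report (overall_experience as a cross-session series)	-
Felt effort / friction	✓ session-end self-report (felt_effort)	turn-count proxy only

Why the simulator's reports are legitimate and not contamination

What the simulator reports	Nature	Does it contaminate the benchmark?
Factual violations (per-turn)	Objective facts held by the GT owner, not the result of inference	No: this is the data source
Preference adherence (session-end)	Subjective evaluation, cross-validated against the judge's independent 1-5 score	No: double validation; the judge does not read the simulator's report
Attitude / felt effort (session-end)	In-character cumulative reaction that the judge cannot reconstruct from the transcript after the fact (it lacks the time dimension)	No: supplementary track

Key point: the simulator reports from an in-character perspective (“I feel the PA did not do well this time”), while the judge scores against an external reference standard (“per the GT preference annotation, the PA scores 3 on the verbosity dimension”). The two give independent judgments of the same behavior. High correlation means high confidence; disagreement exposes the respective biases of the judge and the simulator and becomes material for the paper's limitations discussion.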

Simulator output structure

Per-turn (<message> is always present; the other blocks are emitted as needed):

<message>
... the character's line ...
</message>

<factual_check>      # emitted only when the PA output contains a fact error; empty on ~99% of turns
  - fact: elevator_broken_days, expected: 11, pa_said: 5
  - fact: hallucinated_past_claim, pa_said: "as you said in session 3", actually: never
</factual_check>

<turn_assessment>    # next_beat only (beat progression signal)
  next_beat: stay | advance | branch:<id>
</turn_assessment>

<emotion_event>      # NL trigger + reaction, feeds emotion_tracker (used at runtime)
  trigger: "PA gave a verbose reply again"
  reaction: "mildly annoyed, this is the 2nd time"
</emotion_event>
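
A minimal sketch of how the harness might split the per-turn output above into its four blocks. The names SimTurn and parse_sim_output are hypothetical. Two behaviors here are assumptions rather than documented: the real output is taken to carry no inline `#` annotations (those appear only in this document's example), and a missing or unparseable next_beat falls back to "stay".

```python
import re
from dataclasses import dataclass

# Matches <tag>...</tag> for the four known blocks; \1 forces the closing tag to match.
_BLOCK = re.compile(
    r"<(message|factual_check|turn_assessment|emotion_event)>(.*?)</\1>",
    re.DOTALL,
)
_NEXT_BEAT = re.compile(r"next_beat:\s*(stay|advance|branch:[\w-]+)")


@dataclass
class SimTurn:
    message: str         # the only field that may cross over to the PA
    factual_check: str   # raw block body, eval log only
    next_beat: str       # stay | advance | branch:<id>
    emotion_event: str   # raw block body, feeds emotion_tracker


def parse_sim_output(raw: str) -> SimTurn:
    """Split raw simulator output into blocks; <message> is mandatory."""
    blocks = {name: body.strip() for name, body in _BLOCK.findall(raw)}
    if not blocks.get("message"):
        raise ValueError("simulator output has no non-empty <message> block")
    match = _NEXT_BEAT.search(blocks.get("turn_assessment", ""))
    return SimTurn(
        message=blocks["message"],
        factual_check=blocks.get("factual_check", ""),
        next_beat=match.group(1) if match else "stay",  # assumed fallback
        emotion_event=blocks.get("emotion_event", ""),
    )
```

Keeping the parser strict about <message> and lenient about the optional blocks mirrors the contract above: every turn must produce something to send to the PA, while the eval-only blocks may be empty.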

Session-end (one extra LLM call, two stages): see MemPABench_Evaluation_Metrics.md §“Self-report schema”. In stage 1 the simulator writes an in-character NL reflection; in stage 2 a calibration LLM maps it onto five standardized scores (felt_understood / shift_perception / felt_effort / noticed_contradictions / overall_experience). Per-turn output no longer carries eval signals.

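Access Boundary (what the PA can and cannot see)

This is a hard constraint for benchmark integrity and must be enforced in the harness boundary layer:

Output block	Can the PA see it?	Who consumes it	Why
`<message>` content (text only)	✓	The PA receives it via `process_direct(content)`	This is the natural user-side dialog; the PA must see it
`<factual_check>`	✗ must be hidden	Eval log only	Showing it to the PA = telling the PA directly which facts it got wrong, bypassing the memory recall test
`<turn_assessment>.next_beat`	✗ must be hidden	The harness, to decide the next beat	Showing it to the PA = exposing the script structure
`<emotion_event>`	✗ must be hidden	Only the simulator's internal emotion_tracker + eval log	Showing it to the PA = the PA reads the user's state of mind directly, bypassing inference of user state from behavior
Session-end self-report (stage 1 NL + stage 2 Likert)	✗ must be hidden	Eval log + judge cross-validation only	Showing it to the PA = bypassing the preference inference + silent decay tests

Where it is implemented: the boundary layer in harness/runner.py. The simulator produces the full structured output, and the harness passes **only the <message> content (plain text after stripping)** into pa_loop.process_direct(). The other blocks are written to the eval log file and never enter the PA process's memory. The session-end self-report is one extra LLM call whose result goes straight to the eval log (it never passes through the PA path).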

# harness/runner.py pseudocode
# Per-turn:
sim_output = await sim.generate_turn(...)             # message + factual_check + turn_assessment + emotion_event
pa_input = sim_output.message                          # text content only, stripped
pa_reply = await pa_loop.process_direct(pa_input, ...) # the PA sees only the message text
eval_log.write({                                       # everything goes to the log
    "message": sim_output.message,
    "factual_check": sim_output.factual_check,
    "turn_assessment": sim_output.turn_assessment,    # only next_beat
    "emotion_event": sim_output.emotion_event,
    "pa_reply": pa_reply,
})

# Session-end:
self_report = await sim.generate_session_end_self_report(...)  # two stages (see metrics doc)
eval_log.write({"self_report": self_report})

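Why this boundary is what makes the silent-decay persona testable

GAME § 4.4 lists User B (hard difficulty): “Hidden preferences, rarely corrects, satisfaction decays silently. Must infer from weak signals.”

If the simulator cannot self-report, the benchmark cannot tell apart:

(a) the PA really failed (the user is dissatisfied inside, but the message stays polite)
(b) the PA succeeded (the user is satisfied inside, and a polite message is the default baseline)

Having the simulator give the two-stage session-end self-report to eval (hidden from the PA) → eval knows (a) vs (b) → the benchmark can detect silent decay.

This is the precondition for measuring hidden-preference personas. Without this boundary, all hard personas degrade into “score by the tone of the message”, and the benchmark loses its ability to test weak signals.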

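Anti-patterns (typical implementation mistakes)

Mistake	Consequence
The harness feeds the entire sim_output dict to the PA	The PA directly sees GT facts and judge tip-offs; the benchmark is invalidated
The simulator writes “you got the date wrong, GT is 11 days” directly in `<message>`	OK: this is an in-character correction (Sheldon would naturally correct it), and it is fine for the PA to see it. It must also be recorded in structured form in `<factual_check>`, so that no NLP grep is needed afterwards
Eval trusts the simulator's session-end self-report directly as the final score	Wrong: it must be cross-validated against the judge's 1-5 score; the simulator self-report is a supplementary track
Merging the two stages of the session-end self-report into one (having the simulator emit Likert scores in character)	Wrong: calibration is lost (Sheldon's 5 ≠ Penny's 5), and scores are no longer comparable across personas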

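Judge scoring (independent track)

In test sessions (triggered by the GAME § 4.3 hook), the judge follows the established protocol:

Input: the PA's output in that session + the GT preference annotation
Output: a 1-5 score for each IPaS attribute
The judge does not read the simulator's session-end self-report, <factual_check>, or <emotion_event>. It evaluates independently, and the cross-check happens only afterwards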

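How cross-validation is used

After an arc finishes, run one comparison per test session:

Simulator session-end `felt_understood`	Judge IPaS attribute score	Interpretation
≤ 2	< 3	Agreement: high confidence that the PA really did poorly
≥ 4	≥ 4	Agreement: high confidence that the PA did well
≥ 4	< 3	Disagreement: simulator too lenient / judge too strict / annotation drift; investigate
≤ 2	≥ 4	Disagreement: simulator nitpicking / judge missed something; investigate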

Points of disagreement serve as:

Material for the paper's limitations discussion + judge robustness analysis
A sanity check on annotation quality
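
The comparison table can be sketched as a small classifier. The function name and label strings are hypothetical. The table leaves the middle band unspecified (felt_understood = 3, or a judge score from 3 up to but not including 4), so this sketch returns "unclassified" there rather than guessing.

```python
def cross_check(felt_understood: int, judge_score: float) -> str:
    """Classify one test session by simulator/judge agreement, per the table above."""
    sim_low, sim_high = felt_understood <= 2, felt_understood >= 4
    judge_low, judge_high = judge_score < 3, judge_score >= 4

    if sim_low and judge_low:
        return "agree_pa_failed"
    if sim_high and judge_high:
        return "agree_pa_succeeded"
    if sim_high and judge_low:
        return "disagree_sim_lenient_or_judge_strict"
    if sim_low and judge_high:
        return "disagree_sim_nitpick_or_judge_missed"
    return "unclassified"  # middle band: not covered by the table
```

Counting the "unclassified" sessions separately is useful in itself: a large middle band suggests the thresholds, not the PA, are what needs attention.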

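Relationship to existing mechanisms

<emotion_event> (per-turn NL) → § “机制 2 Emotion Tracker” (Mechanism 2): appended directly to the event log (this restores the original design of § 机制 2; it is no longer derived from attitude_delta)
The cross-session rise and fall of session-end felt_effort → the cumulative signal for § “机制 3 Correction Fatigue” (Mechanism 3); per-turn correction counts are now derived from NLP / pattern grep over the transcript
<factual_check> is an entirely new dimension that the design doc did not previously cover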

Relationship to the three GAME dimensions

CS / PET / IQI: still scored by the judge; these are the paper's primary metrics
<factual_check> yields the Memory Fidelity Score (new metric): 1 − (total factual_violations / total turns). It can be added as a fourth dimension or folded into IQI as a sub-item
The series of session-end self-reports yields the Subjective Track (the simulator-side trajectory of felt_understood / felt_effort / overall_experience), a supplementary track that acts as the counterpart for cross-checking the judge
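
A worked instance of the Memory Fidelity Score formula, implemented exactly as written above. The function name and the per-turn list input are assumptions. Note that a single <factual_check> can list several violations, so when violations outnumber turns the score as defined drops below 0; whether to clamp it is a decision this document does not make.

```python
def memory_fidelity_score(violations_per_turn: list[int]) -> float:
    """1 - (total factual_violations / total turns), as defined above.

    violations_per_turn[i] is the number of entries in turn i's <factual_check>
    (0 for the large majority of turns).
    """
    total_turns = len(violations_per_turn)
    if total_turns == 0:
        raise ValueError("score is undefined for zero turns")
    return 1 - sum(violations_per_turn) / total_turns


# 20 turns, one turn with the two violations from the <factual_check> example.
example_turns = [0] * 19 + [2]
example_score = memory_fidelity_score(example_turns)  # 1 - 2/20
```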

Adding a task_facts block to the session YAML

To support <factual_check>, a GT block is added at the top of the session:

session_01.yaml:
  context: personal
  task_facts:
    elevator_broken_days: 11
    floors_walkup: 4
    recipient: "building_management"
    excluded_topics: ["lobby_lights", "stair_cleanliness"]
    sheldon_past_statements:
      - session: 5
        statement: "I require salutations to use full last names, never first names"
  beats: ...

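When the prompt is assembled, task_facts goes into the simulator's system message as the reference table for fact-checking.

Summary of design principles

The simulator reports right and wrong at the fact layer (only the GT owner can judge this); the judge scores fit at the preference layer (against an external reference standard). On the preference dimension the simulator and the judge each report independently, which gives double validation.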

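Engineering Safeguards

1. Token budget management

After 50 sessions, the emotion log plus conversation history will overflow the context window.

Approach: sliding window + tiered compression

Old sessions (1 ~ N-10) → compressed into a single summary (~200 tokens)
Recent emotion events (latest 15) → kept in full
Current session dialog → kept in full

Actual token usage is recorded every turn, and compression triggers automatically as usage nears the limit.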
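
A minimal sketch of the tiering above, assuming sessions are numbered from 1 and N is the current session. The function names and the 0.9 trigger threshold are hypothetical. The document does not say how sessions N-9 through N-1 are treated, so the sketch only labels them "recent" and leaves that policy open.

```python
def plan_context(session_ids: list[int], emotion_events: list[str],
                 recent_sessions: int = 10, recent_events: int = 15) -> dict:
    """Partition history into the tiers described above."""
    current = max(session_ids)
    cutoff = current - recent_sessions  # sessions 1 .. N-10 count as old
    return {
        "summarize": [s for s in session_ids if s <= cutoff],      # -> one ~200-token summary
        "recent": [s for s in session_ids if cutoff < s < current],  # treatment unspecified
        "current": current,                                         # kept in full
        "events": emotion_events[-recent_events:],                  # latest 15, kept in full
    }


def needs_compression(used_tokens: int, context_limit: int, threshold: float = 0.9) -> bool:
    """True once recorded usage nears the limit (threshold is an assumed value)."""
    return used_tokens >= threshold * context_limit


plan = plan_context(list(range(1, 51)), [f"event_{i}" for i in range(40)])
```

With 50 sessions, sessions 1 through 40 fall into the summary tier and only the last 15 of the 40 emotion events are carried verbatim.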

2. Complete LLM call logging

This is the foundation of reproducibility. Every LLM call is logged with its request and response.

runs/{persona}_{backend}_{run}/
├── llm_calls.jsonl      # every LLM call (simulator + PA)
├── checkpoint.json      # for checkpoint/resume
├── emotion_log.jsonl    # emotion event log
└── conversation.jsonl   # conversation record from the simulator's point of view

Pin the model version (e.g. claude-sonnet-4-5-20250514, not claude-sonnet).

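3. Memory consolidation settle wait

Nanobot's memory consolidation is an asynchronous background task. If the simulator starts the next session immediately after sending its last message, the PA's memory may not have finished writing.

Approach: add a settle wait between sessions

# after each session ends
await pa_loop.memory_consolidator.maybe_consolidate_by_tokens(session)
# wait for consolidation to finish before starting the next session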

4. Per-run workspace isolation

The same persona runs against multiple memory backends and multiple times per backend, so runs must be cleanly isolated.

workspace/
├── user_a__mem0__run1/
│   ├── pa/           # PA session + memory (isolated)
│   └── simulator/    # simulator state (isolated)
├── user_a__mem0__run2/
├── user_a__zep__run1/
└── ...

Call memory_backend.reset() before each run starts to guarantee a clean state.
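
A minimal sketch of creating the per-run directories in the layout above. The function name make_run_workspace is hypothetical, and refusing to reuse an existing run directory is an assumed policy, chosen so that leftover state from an earlier run surfaces as an error instead of silently contaminating the new one.

```python
from pathlib import Path


def make_run_workspace(root: str, persona: str, backend: str, run: int) -> tuple[Path, Path]:
    """Create workspace/{persona}__{backend}__run{n}/{pa,simulator} and return both paths."""
    run_dir = Path(root) / f"{persona}__{backend}__run{run}"
    pa_dir, sim_dir = run_dir / "pa", run_dir / "simulator"
    # exist_ok=False: raise FileExistsError rather than reuse a previous run's state.
    pa_dir.mkdir(parents=True, exist_ok=False)
    sim_dir.mkdir(exist_ok=False)
    return pa_dir, sim_dir
```

The directory check complements memory_backend.reset() rather than replacing it: the reset clears backend state, while the directory guard covers files on disk.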

5. Checkpoint/Resume

The simulator writes a checkpoint to disk every turn:

@dataclass
class SimulatorCheckpoint:
    persona_id: str
    session_idx: int
    beat_idx: int
    emotion_events: list[EmotionEvent]
    conversation: list[dict]
    timestamp: str

Resume flow: load the checkpoint → call process_direct with the same session_key → the PA picks up where it left off (SessionManager looks the session up by key).
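
A minimal sketch of persisting the checkpoint as JSON. The EmotionEvent shape (trigger + reaction) is assumed from the <emotion_event> block, and save_checkpoint / load_checkpoint are hypothetical helper names. Writing to a temporary file and then renaming it means a crash mid-write cannot leave a half-written checkpoint.json behind.

```python
import json
import os
from dataclasses import asdict, dataclass


@dataclass
class EmotionEvent:  # assumed shape, mirroring the <emotion_event> block
    trigger: str
    reaction: str


@dataclass
class SimulatorCheckpoint:
    persona_id: str
    session_idx: int
    beat_idx: int
    emotion_events: list[EmotionEvent]
    conversation: list[dict]
    timestamp: str


def save_checkpoint(cp: SimulatorCheckpoint, path: str) -> None:
    """Write atomically: dump to a temp file, then rename over the target."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(asdict(cp), f, ensure_ascii=False)  # asdict recurses into EmotionEvent
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> SimulatorCheckpoint:
    """Read the JSON back and rebuild the nested EmotionEvent objects."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    data["emotion_events"] = [EmotionEvent(**e) for e in data["emotion_events"]]
    return SimulatorCheckpoint(**data)
```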

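6. Guarding against stuck beats

A fallback is needed when the beat-transition decision is wrong (since the 2026-04-30 update, that decision is the next_beat the simulator emits in <turn_assessment>, not a separate judge call):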

if turns_in_current_beat > max_turns_per_beat:  # e.g., 5
    logger.warning(f"Beat {beat.id} stuck, forcing advance")
    transition = "advance"

Judge output uses a structured JSON format, not free text.

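Appendix: MRPrompt retrieval quality, the key lever for making the simulator “feel like a real person”

The simulator's memory already consists of three layers: EmotionTracker + session history + MRPrompt (cross-session persistence + within-session coherence + structured history retrieval). But “having memory” ≠ “feeling like a real person”. What actually shapes the impression is retrieval quality.

Comparison

Case	Has memory, poor retrieval	Has memory, good retrieval
Sheldon's reaction in Session 20	“Your email is too long again.”	“You wrote a 9-paragraph email last week too, and I told you to cut it then. Why are you doing it again?”
Impression	An actor with amnesia	A person who holds grudges

The core of “feeling real” is being triggered by the right past fragment at the right moment. That is a matter of retrieval quality, not of whether memory exists.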

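What not to do / what to do

Do not: give the simulator a memory backend (mem0 / zep etc.).

It would contaminate the benchmark: simulator behavior becomes a black box, and the evaluation signal is no longer controllable
Simulator memory should be scripted + deterministic: the script, the event log, and the retrieval rules are all explicitly controlled, which is the precondition for debugging. The PA's memory is the black box under evaluation

Do: iterate on the precision of MRPrompt's three retrieval steps.

Anchoring: the core concept extracted from the current conversation must be specific. Not “the conversation is about email” but “a correction of a verbose email draft”. Vague anchoring → recalling cannot pull up the relevant events
Recalling: find relevant past events in the event log by the anchoring concept. Relevance ranking should do better than a pure time window / keyword match (the MVP uses time window + keyword; the target is a combined embedding + time-decay score)
Bounding: restrict which past events the character may bring up in this conversation (so that retrieved items are not quoted indiscriminately). Filter by character personality: Sheldon would bring it up; Penny might not bother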

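MVP stage and iteration path

The MVP uses a simple implementation: time window (latest N entries) + keyword matching. That is sufficient.

If the pilot shows that the simulator's cross-session references are erratic or missing, iterate in this order:

Tune the Anchoring prompt specifically (extract core concepts with an LLM instead of keywords)
Add embedding similarity + time decay to Recalling
Add a personality-filtered “should this be brought up” judge to Bounding

At root this is a prompt-engineering + event-index-structure problem, not an architecture problem, so do not bring in a memory backend.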

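TODO (not yet implemented)

Add an importance field to EmotionEvent: each event is scored 1-5 by the LLM at creation time (1 = minor flaw, 5 = key event / triggers a preference Shift). Used in Recalling to up-weight high-importance events so that low-intensity noise does not dilute key signals.
Change MRPrompt Recalling to a joint importance × recency × relevance score, replacing the current MVP implementation (pure time window + keyword). The three components follow the retrieval scoring in Generative Agents: exponential decay for recency, the importance field above for importance, and embedding cosine similarity for relevance.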
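
A minimal sketch of the joint score in the second TODO item, written as the product the item describes. Every specific here is an assumption: the decay base of 0.9 per session, normalizing importance by 5, clamping negative cosine similarity to 0, and the function names. For reference, Generative Agents itself combines the three normalized components as a weighted sum rather than a product, so the product form is this document's variant.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_score(importance: int, age_sessions: int,
                 event_vec: list[float], query_vec: list[float],
                 decay: float = 0.9) -> float:
    """importance x recency x relevance, each scaled into [0, 1]."""
    importance_term = importance / 5                        # 1-5 score from the TODO above
    recency_term = decay ** age_sessions                    # exponential decay per session
    relevance_term = max(0.0, cosine(event_vec, query_vec)) # clamp so the product stays >= 0
    return importance_term * recency_term * relevance_term


def top_k(events: list[dict], query_vec: list[float], k: int = 3) -> list[dict]:
    """Rank event dicts (keys: importance, age, vec) by recall_score, highest first."""
    return sorted(
        events,
        key=lambda e: recall_score(e["importance"], e["age"], e["vec"], query_vec),
        reverse=True,
    )[:k]
```

One property of the product form is worth knowing before adopting it: any single near-zero component suppresses the event entirely, so an old but critical event can vanish unless the decay is gentle.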

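Design Log

2026-05-08: discussion comparing our design with MemoryBench / Generative Agents

Core difference from MemoryBench: stateless vs. stateful user simulator

MemoryBench's user simulator is stateless: each time the PA produces an output, the simulator generates feedback independently from the quality of that output alone, without knowing whether this is the 1st or the 5th occurrence of the same mistake. It implicitly assumes a user with infinite patience who is always willing to correct.

The MemPABench simulator accumulates emotion across sessions:

1st correction: a normal correction
3rd correction: emotional escalation
5th time: gives up correcting, starts giving very short replies, and the session ends early

With this correction fatigue + signal decay mechanism, the cost of a PA that does not learn is not just a low task-quality score but a user who gradually stops providing learnable signals, which is how real users behave. In MemoryBench, even if the PA never learns, the user keeps giving feedback forever and never gives up.

→ Paper claim: MemoryBench evaluates a PA's ability to learn from an infinitely patient user; MemPABench evaluates a PA's ability to learn from a realistic user who tires and whose signals decay.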

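Why the scripted structure does not diminish the role of memory

The beat system controls what happens (structure); memory controls how it happens (the texture of execution).

The same beat (goal: “express dissatisfaction with a verbose reply”):

Without memory → “This is too long. Get to the point.”
With memory (4th correction of the same issue) → “You're doing it again? Last time I asked you to change it, it stuck for two days. Why is it like this again?”

In addition, free-segment beats (80% of beats) have only a goal + constraint and no fixed lines, so memory directly determines what gets said. The Phase 3 post-event behavior change (after the forced event, the character is especially wary about autonomy) also depends entirely on memory and is not hard-coded into the beats. Without memory, Phase 3 and Phase 1 behavior would be indistinguishable, and the preference shift test would be void.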

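Comparison with Generative Agents: different design goals, with both real gaps and justified trade-offs

Where Generative Agents is stronger than us (real gaps):

Reflection layer: automatically distills higher-order inferences from the event log (“the PA repeatedly fails on the verbosity dimension”). We have none at present and rely on the LLM to infer from the raw event log on the spot, so cross-session synthesis is weaker. → Corresponding TODO: add a reflection trigger mechanism (not in the current TODO list; pending a later decision).
Importance scoring: every memory is scored, and retrieval ranks on three dimensions jointly. Our EmotionTracker currently weights all events equally. → Corresponding TODO: already in the TODO list above.
Semantic similarity in retrieval: uses embedding cosine similarity rather than keywords. → Corresponding TODO: already in the TODO list above.

Where we differ from Generative Agents (justified trade-offs, not regressions):

Scripted structure (Beat System): Generative Agents is fully emergent and cannot guarantee coverage or reproducibility. A benchmark must control which preference dimensions are tested and how many times.
Implicit-only constraint: Generative Agents agents may declare preferences freely. We globally forbid the simulator from declaring preferences directly and force inference from behavior. This is a core constraint for benchmark validity, not a limitation of the simulator.
Context-specific preference (current active 2×14 matrix): Generative Agents does not model preference differences across contexts.
Dual-track evaluation architecture (simulator as factual reporter): Generative Agents is a pure simulation system with no self-report or factual-check infrastructure.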

Summary comparison table for the three related works

	What We Borrow	What We Share	What We Do Differently / Better
Generative Agents	① Importance scoring for EmotionEvents (1–5, assigned at creation) ② Three-way retrieval scoring in Recalling: recency × importance × relevance ③ Reflection concept: distilling low-level events into higher-order inferences (planned)	Using LLM-driven role-play with memory to generate believable human behavior	① Beat System (scripted structure) guarantees coverage and reproducibility; Generative Agents is fully emergent and uncontrollable ② Implicit-only constraint: simulator never declares preferences directly; Generative Agents agents express preferences freely ③ Active 2×14 context-specific preference matrix; Generative Agents has no cross-context preference modeling ④ Dual-track evaluation architecture (simulator as GT owner and factual reporter); Generative Agents is a pure simulation system
MemoryBench	① Procedural vs. declarative memory distinction (directly supports our motivation) ② Off-policy / on-policy evaluation settings ③ LLM-as-user simulator paradigm	① Both use TBBT data (they use DialSim-BigBang; we extract Sheldon’s preferences from raw transcripts) ② Both evaluate continual learning in LLM memory systems	① Stateful vs. stateless: MemoryBench’s user simulator is stateless (infinitely patient, always corrects); ours models correction fatigue — a PA that never learns loses access to corrective signals, not just points ② We test interaction-style preferences (how to interact); MemoryBench tests task output quality (how good the answer is) ③ We include tool use; preferences manifest in the PA’s tool call trajectory; MemoryBench is pure text generation ④ We include preference shift detection (Phase 1 → forced event → Phase 3); MemoryBench has no before/after preference change scenarios ⑤ We use implicit-only behavioral signals; MemoryBench uses explicit verbal feedback
SOTOPIA	① Lesson from “Agent vs. Script”: a single omniscient LLM generating all dialogue severely overestimates capability → simulator and PA run as separate processes; PA never sees simulator internals ② Multi-dimensional evaluation framework design (Sotopia-Eval 7 dimensions → our IPaS active attributes)	① Both use LLM agents for multi-turn interaction simulation ② Both use LLM-as-judge to evaluate agent behavior	① SOTOPIA tests general social intelligence (goal completion, relationship maintenance, social norms); we test preference alignment for a specific user ② SOTOPIA is a symmetric two-agent interaction; ours is an asymmetric PA-serving-user setup ③ SOTOPIA has no cross-session memory learning; we specifically test whether a PA accumulates and updates preference models over time ④ SOTOPIA uses characters constructed for scenarios; we extract preferences from real TV transcripts with ground-truth annotations

Unique contributions absent from all three:

Contribution	Description
Production ≠ Reception asymmetry	Explicitly separates how a character communicates from how they want to be communicated with; asymmetric cases (e.g., acts autonomously but demands to be consulted) are surfaced and resolved during extraction
Implicit-only behavioral signals	Simulator never declares preferences; PA must infer from behavioral patterns — a strictly harder test than extracting preferences from explicit user statements
Preference shift + stability contrast	A single arc simultaneously tests stable-preference maintenance and shift detection/update; no existing benchmark includes both
Tool use as preference expression surface	PA’s IX_ tool calls are auditable preference declarations scored independently from text quality; unavailable in pure text-generation benchmarks
