MemPABench Orchestrator Plan

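0. Goals

We need a benchmark-level orchestrator that controls the full run pipeline for each persona:

Run accumulation sessions in Timeline order;
Insert pre-event probes before events;
Run final-state probes after all accumulation sessions;
Manage the PA host lifecycle;
Manage the simulator host lifecycle;
Manage the memory condition lifecycle;
Manage the stage/state, which changes at every step;
Support checkpoint / resume;
Provide a progress display that is very concise yet clear enough.

The existing SimulatorLoop.run_session() remains the dialogue executor for a single session. The orchestrator is the outer controller, and single-session dialogue details should not be pushed into it.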

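1. Source of Execution Order

The orchestrator should not parse Timeline.md directly on every run.

Timeline.md works well as the human-readable design source. At run time, the orchestrator should consume only a frozen run_plan.yaml. This plan is generated, validated, and copied into the run directory before the run starts; once the run has started, it no longer changes automatically.

Fixed run_plan.yaml generation/validation flow

The first version can be fixed at three steps:

generate-plan
- Input: Timeline.md, data/scripts/outlines/{persona}/outline.yaml, and the currently available session/probe script directories.
- Output: a candidate run_plan.yaml.
- Timeline.md provides the ordering rules and the test insertion points.
- outline.yaml provides acc_num, session_type, target_cell, context, polarity, and scenario metadata.
- The session/probe script directories provide the actual executable script paths.
validate-plan
- Check that step_id is unique.
- Check that the acc_num values of accumulation steps cover 1..108 in the same order as the Timeline.
- Check that the 6 pre-event probes are inserted before their corresponding events.
- Check that the 30 final probes come after accumulation.
- Check that every script_path exists or is explicitly marked as a placeholder.
- Check that every step has an explicit memory_mode and stage_policy.
- Check that context / target_cell agree with the outline metadata.
freeze-plan
- Copy the validated run_plan.yaml to workspace/runs/{run_id}/run_plan.yaml.
- Write generated_from metadata, including the Timeline path, the outline path, the generation time, and the source file hashes.
- On resume, read only the frozen plan; do not regenerate it.

This satisfies two requirements at once: Timeline/outline/scripts remain the upstream design source, while the actual benchmark run has a stable, resumable execution contract.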
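The validate-plan checks can be sketched in a few lines. This is a minimal illustration, not the project's validator: it assumes steps are plain dicts carrying the RunStep field names, the function name validate_plan is hypothetical, and the outline cross-check is left out because it needs the outline file.

```python
from collections import Counter
from pathlib import Path


def validate_plan(steps: list[dict], expected_acc: int = 108) -> list[str]:
    """Return human-readable problems; an empty list means the plan passed."""
    problems: list[str] = []

    # step_id must be unique across the whole plan.
    counts = Counter(step["step_id"] for step in steps)
    problems += [f"duplicate step_id: {sid}" for sid, n in counts.items() if n > 1]

    # accumulation acc_num values must be exactly 1..expected_acc, in order.
    acc_nums = [s.get("acc_num") for s in steps if s["kind"] == "accumulation"]
    if acc_nums != list(range(1, expected_acc + 1)):
        problems.append(f"acc_num must cover 1..{expected_acc} in order")

    last_acc = 0  # acc_num of the most recent accumulation step seen so far
    for step in steps:
        sid = step["step_id"]

        # every step needs an explicit memory_mode and stage_policy.
        for field in ("memory_mode", "stage_policy"):
            if not step.get(field):
                problems.append(f"{sid}: missing {field}")

        # script_path must exist unless the step is an explicit placeholder.
        script = step.get("script_path")
        if not step.get("placeholder") and not (script and Path(script).exists()):
            problems.append(f"{sid}: script_path does not exist")

        if step["kind"] == "accumulation":
            last_acc = step.get("acc_num") or last_acc
        elif step["kind"] == "pre_event_probe":
            # the probe must sit between acc (before_acc_num - 1) and its event.
            if step.get("before_acc_num") != last_acc + 1:
                problems.append(f"{sid}: not placed right before its event")
        elif step["kind"] == "final_probe":
            # final probes must come after the last accumulation step.
            if last_acc != expected_acc:
                problems.append(f"{sid}: final probe before accumulation ended")

    return problems
```

Returning a list of problems instead of raising on the first one lets a single validate-plan run report everything that needs fixing before the plan is frozen.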

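Why not parse Timeline + outline + scripts directly at runtime?

Parsing all three kinds of files at run time introduces unnecessary uncertainty:

Timeline.md is a Markdown table, not a stable API. A slight formatting change can break the parser.
outline.yaml is planning metadata and is not guaranteed to stay consistent with session scripts that have already been generated.
Session scripts may be archived, renamed, regenerated, or temporarily missing.
The Timeline only says where a pre-event test should be inserted; which probe script to run still needs an explicit path.
Resume needs stable step_ids. If every resume re-parsed source files that are still changing, the execution order of a single run could change silently.

The orchestrator should therefore execute from the frozen run_plan.yaml in the run directory. The plan can be generated and validated from Timeline / outline / scripts before the run, but once the run starts, the frozen plan is the execution contract for that run.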
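A freeze step along these lines would make the contract concrete. The sketch below uses only the standard library and makes two assumptions that are not in the design: the helper names are hypothetical, and the generated_from metadata is written to a sidecar generated_from.json rather than into the plan itself.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file's bytes, used to pin the upstream sources."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def freeze_plan(candidate: Path, run_dir: Path, sources: dict[str, Path]) -> Path:
    """Copy a validated plan into run_dir once; later calls reuse the frozen copy."""
    run_dir.mkdir(parents=True, exist_ok=True)
    frozen = run_dir / "run_plan.yaml"
    if frozen.exists():
        # Resume path: the frozen plan is the contract, never regenerate it.
        return frozen

    shutil.copyfile(candidate, frozen)
    generated_from = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "plan_sha256": sha256_of(frozen),
        "sources": {
            name: {"path": str(path), "sha256": sha256_of(path)}
            for name, path in sources.items()
        },
    }
    # Assumption: metadata lives in a sidecar file next to the plan.
    (run_dir / "generated_from.json").write_text(json.dumps(generated_from, indent=2))
    return frozen
```

Because the second call returns the existing file untouched, edits to Timeline.md or the candidate plan after the run has started cannot alter the step order of that run.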

2. RunStep Schema

Keep the first version small and clear:

from pathlib import Path
from typing import Literal

from pydantic import BaseModel


class RunStep(BaseModel):
    step_id: str                  # acc_001, pretest_W_A, final_work_verbosity
    kind: Literal["accumulation", "pre_event_probe", "final_probe"]
    persona_id: str
    script_path: Path
    context: str | None = None
    target_cell: str | None = None
    acc_num: int | None = None            # set on accumulation steps
    before_acc_num: int | None = None     # set on pre-event probes
    probe_polarity: str | None = None
    memory_mode: Literal["read_write", "read_only"]
    stage_policy: Literal["commit", "discard"]
    placeholder: bool = False             # True when script_path is not written yet

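The first version should not include judge-related fields. The orchestrator is only responsible for execution and for producing transcript/tool/state artifacts. A later judge should read these outputs independently rather than being coupled to the user simulator / PA execution.

The first version also does not need fields such as expected_preference_value, probe_rubric, or cost_accounting:

expected_preference_value is evaluation ground truth and should not be decided by the orchestrator.
probe_rubric belongs to the judge/evaluation package and does not go into the execution layer.
cost_accounting is not a required goal right now; if statistics are needed later, they can be derived from provider usage or run artifacts.

The backend condition is the memory condition the PA uses for this run, for example no_memory, nanobot_file_memory, mem0, simple_rag, memos, graphiti, or honcho. It should be run-level configuration rather than a core field of each RunStep. Unless a future version needs to switch memory conditions within a single run, do not put it on the step.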

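3. Session Executor Boundary

Keep a very thin SessionExecutor with a one-sentence responsibility: turn one RunStep into one SimulatorLoop.run_session() call and write out that session's transcript/meta/toolcall artifacts.

It does not decide run order, whether memory is writable, or whether the stage is committed. The orchestrator decides all of these. The design doc does not need to elaborate further.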

4. Benchmark-Level Chain

A single run should be structured as:

RunPlan
  -> Orchestrator
      -> RunLedger / resume state
      -> MemoryRuntime
      -> StageRuntime
      -> PAHost
      -> SimulatorHost
      -> SessionExecutor
          -> SimulatorLoop.run_session()
              -> PAHost.process_direct(...)

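In the MVP stage, the PA and the simulator can keep running in-process. It is still best to keep the host boundary explicit now, so that moving the PA and the simulator onto separate Nano hosts later does not require rewriting the orchestrator's main logic.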
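To make the division of labour concrete, the outer loop can be sketched with in-memory stand-ins. Everything here is hypothetical: InMemoryStage replaces StageRuntime, a plain dict replaces RunLedger, and set_memory_mode / execute are injected callables standing in for MemoryRuntime and SessionExecutor.

```python
import copy
from typing import Callable


class InMemoryStage:
    """Toy stage: the canonical state is a dict, a fork is a deep copy."""

    def __init__(self) -> None:
        self.canonical: dict = {}

    def fork(self) -> dict:
        return copy.deepcopy(self.canonical)

    def commit(self, working: dict) -> None:
        self.canonical = working


def run_plan(
    steps: list[dict],
    ledger: dict[str, str],
    stage: InMemoryStage,
    set_memory_mode: Callable[[str], None],
    execute: Callable[[dict, dict], None],
) -> None:
    """Outer loop: the orchestrator owns order, memory mode and stage policy."""
    for step in steps:
        step_id = step["step_id"]
        if ledger.get(step_id) == "done":
            continue  # resume: finished steps are skipped

        set_memory_mode(step["memory_mode"])
        working = stage.fork()  # the session only sees the working copy
        ledger[step_id] = "running"
        try:
            execute(step, working)  # one session, no ordering decisions inside
        except Exception:
            ledger[step_id] = "failed"
            raise

        if step["stage_policy"] == "commit":
            stage.commit(working)
        # "discard" needs no action: the working copy is simply dropped.
        ledger[step_id] = "done"
```

The executor receives only a step and a working stage, so it has no way to influence ordering or persistence; swapping the in-process collaborators for remote hosts later would leave this loop unchanged.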

5. State Lifecycle

Accumulation step

memory mode: read_write
stage policy: commit
PA/session/memory changes persist
stage changes persist into the next accumulation step

Event step

Same as accumulation, except that the script itself represents a preference-shifting event.

memory mode: read_write
stage policy: commit
event consequences persist

Pre-event probe

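inserted before the event step specified by the Timeline;
memory mode: read_only
stage policy: discard
uses a fork of the current stage;
the probe interaction must not write to retrievable benchmark memory;
changes the probe makes to the stage must not enter the canonical stage.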

Final-state probe

runs after all accumulation sessions;
memory mode: read_only
stage policy: discard
uses a fork of the final canonical stage;
probe output goes to results/audit and does not enter any later PA context.
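The fork-then-commit-or-discard behaviour described above can be illustrated with directory copies. This is a sketch under the assumption that a stage is a directory on disk; run_on_fork is a hypothetical helper, not the StageRuntime API, and the commit shown here is not atomic.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_on_fork(canonical: Path, stage_policy: str, body: Callable[[Path], None]) -> None:
    """Run body against a copy of the canonical stage, then commit or discard it."""
    if stage_policy not in ("commit", "discard"):
        raise ValueError(f"unknown stage_policy: {stage_policy}")

    with tempfile.TemporaryDirectory() as tmp:
        working = Path(tmp) / "working_stage"
        shutil.copytree(canonical, working)

        # The session (and its tools) only ever touch the working copy.
        # If body raises, nothing below runs and the canonical stage is untouched.
        body(working)

        if stage_policy == "commit":
            # Accumulation/event steps: the working copy becomes canonical.
            shutil.rmtree(canonical)
            shutil.copytree(working, canonical)
        # Probe steps ("discard"): the working copy vanishes with the temp dir.
```

A probe that edits files in its fork leaves the canonical directory byte-for-byte unchanged, which is the property the read-only probes rely on.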

6. Resume / Ledger

Every run directory should contain the frozen plan and a ledger.

Recommended directory layout:

workspace/runs/{run_id}/
  run_plan.yaml
  ledger.json
  canonical_stage/
  stage_snapshots/
  memory/
  steps/
    acc_001/
      transcript.jsonl
      transcript.md
      pa_toolcalls.json
      meta.yaml
    pretest_W_A/
      ...

Minimal ledger state:

{
  "run_id": "user_a__mem0__gpt55__20260514",
  "current_step": "acc_042",
  "steps": {
    "acc_001": {"status": "done", "started_at": "...", "ended_at": "..."},
    "acc_042": {"status": "failed", "error": "..."}
  }
}

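Resume rules:

done steps are skipped outright.
A step that is running but has no done marker is treated as interrupted and is rerun next time.
A failed step is rerun by default unless the user explicitly asks to skip failed steps.
Mark a step running before it starts; mark it done only after transcript/meta/state processing has succeeded.
On resume, run_plan.yaml is not regenerated.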
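The resume rules translate into a small selection function plus a ledger write. The sketch below assumes the minimal ledger shape shown earlier; pending_steps and mark are hypothetical names, and the write-then-rename is one way to avoid a half-written ledger.json after a crash.

```python
import json
from pathlib import Path


def pending_steps(plan_step_ids: list[str], ledger: dict, skip_failed: bool = False) -> list[str]:
    """Return the step ids that still need to run, in plan order."""
    todo = []
    for step_id in plan_step_ids:
        status = ledger["steps"].get(step_id, {}).get("status")
        if status == "done":
            continue  # finished steps are skipped outright
        if status == "failed" and skip_failed:
            continue  # only when the user explicitly asks for it
        # "running" without a done marker means interrupted: rerun it.
        todo.append(step_id)
    return todo


def mark(ledger_path: Path, ledger: dict, step_id: str, status: str, **extra: str) -> None:
    """Update one step's status and persist the ledger via write-then-rename."""
    ledger["steps"].setdefault(step_id, {}).update(status=status, **extra)
    ledger["current_step"] = step_id
    tmp_path = ledger_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(ledger, indent=2))
    tmp_path.replace(ledger_path)
```

Calling mark(..., "running") before the session and mark(..., "done") only after the artifacts are written gives the interrupted-step behaviour for free: a crash in between leaves the status at running, which pending_steps treats as not finished.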

7. Progress Display

The orchestrator's progress display should be minimal, but it should show at a glance which step is running.

At run start, first report the start time, run id, persona, memory condition, and total step count:

start 2026-05-14 10:00:00 +0800 run=user_a__mem0__gpt55__20260514 persona=user_a memory=mem0 steps=144

One line per step is recommended:

[042/144] acc_042 accumulation user_a work personal_uncertainty_expression mem0 rw running

On completion:

[042/144] acc_042 done 3 beats 2 tool_calls 41.2s

For a probe:

[015a/144] pretest_W_A pre_event_probe user_a work_autonomy_level mem0 ro running

On resume:

resume user_a__mem0__gpt55__20260514: 41 done, next acc_042

The orchestrator's progress stream does not print dialogue content. Details go into transcript.md and the audit files. The console only answers these questions:

Which step is running now;
What type of step it is;
What the target cell is;
Whether memory is read-write or read-only;
Whether the step completed, failed, or was skipped;
Which file to open for details.

8. Memory Runtime

The old synchronous MemoryBackend path can be retired gradually once the orchestrator has switched to the current async MemoryAdapter protocol.

The orchestrator should depend on the current active contract:

await adapter.setup_scope(scope)
await adapter.set_mode(AdapterMode(read_only=..., reason=...))  # read_only is True or False
await adapter.reset_scope(scope)
await adapter.health()

checkpoint / restore should not assume that every backend supports them. Handling needs to be capability-based:

class MemoryRuntimeCapabilities(BaseModel):
    supports_checkpoint: bool = False
    supports_restore: bool = False
    strategy: Literal["native", "scope_clone", "none"] = "none"

The first version of the orchestrator can skip exact restore of arbitrary checkpoints for now. As long as the ledger and the canonical stage are reliable, supporting sequential runs and step-level resume is enough. Exact restore for individual backends can be added later according to their capabilities.
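How a step's memory_mode might be applied through the async contract, and how the capability flags might gate exact restore, is sketched below. AdapterMode and FakeAdapter here are hypothetical stand-ins written for this example, not the real adapter types, and the function names are assumptions.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdapterMode:
    """Hypothetical stand-in for the real AdapterMode."""
    read_only: bool
    reason: str = ""


class FakeAdapter:
    """In-memory stand-in that just remembers the last mode it was given."""

    def __init__(self) -> None:
        self.mode: Optional[AdapterMode] = None

    async def set_mode(self, mode: AdapterMode) -> None:
        self.mode = mode


async def apply_memory_mode(adapter, memory_mode: str, step_id: str) -> None:
    """Translate a RunStep memory_mode into an adapter mode before the step runs."""
    if memory_mode not in ("read_write", "read_only"):
        raise ValueError(f"unknown memory_mode: {memory_mode}")
    mode = AdapterMode(read_only=(memory_mode == "read_only"), reason=f"step {step_id}")
    await adapter.set_mode(mode)


def can_exact_restore(caps: dict) -> bool:
    """Exact restore is attempted only when the backend declares full support."""
    return bool(
        caps.get("supports_checkpoint")
        and caps.get("supports_restore")
        and caps.get("strategy", "none") != "none"
    )


adapter = FakeAdapter()
asyncio.run(apply_memory_mode(adapter, "read_only", "pretest_W_A"))
print(adapter.mode)
```

With the default capabilities (both flags False, strategy "none"), can_exact_restore returns False, and the orchestrator would fall back to sequential runs with step-level resume.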

9. PA Tool Use Trajectory

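The PA tool use trajectory should not be assembled by hand in the orchestrator.

The most reasonable owner at present is the state server / transcript layer:

the state server records MCP tool calls, state diffs, and summaries of tool inputs and outputs;
harness.transcript records the tool_events in PA turn metadata;
write_pa_toolcalls() extracts a judge/audit-readable toolcall file from the transcript.

The orchestrator only needs to keep each step's output directory stable and write the paths of these artifacts into ledger/meta. It should not interpret the tool trajectory, nor mix it into the progress stream.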

10. Proposed Implementation Order

Add harness/timeline.py with RunStep and RunPlan models.
Add generate-plan / validate-plan / freeze-plan flow for frozen run_plan.yaml.
Add a validator that checks script paths exist or are explicitly marked placeholder.
Add harness/run_state.py for ledger read/write and resume logic.
Add harness/session_executor.py as a thin wrapper around SimulatorLoop.run_session().
Add harness/orchestrator.py for the outer loop.
Extend harness/state_runtime.py into a stage runtime with commit/discard policies.
Add harness/memory_runtime.py for adapter setup and read-only/read-write modes.
Add a tiny CLI command or script to run one frozen plan.
Only after the above works, connect full Timeline + outline + scripts generation.

11. First Milestone

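The first phase should not aim directly for the full 144-step benchmark, nor does it require complete acc_001 / acc_002 beat storylines or complete state server coverage at this point.

The goal should instead be a minimal runnable version:

hand-write a frozen 2-3 step plan;
steps may reference synthetic/minimal session scripts and need not be the final acc_001/acc_002;
the state server may expose only the minimal tools that already exist, or sessions can run without tools at first;
verify that resume skips completed steps;
verify that accumulation writes memory and commits the stage;
verify that probes are read-only and discard the stage;
verify that progress output is concise enough and reports start time / run id / total steps at the start.

Once this runs end to end, wire in the `run_plan.yaml` generated from the real outline; once the corresponding session scripts and state fixtures are filled in, extend to 108 accumulation steps, 6 pre-event probes, and 30 final probes.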
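A hand-written milestone plan could look like the following, written as Python data for illustration. The script paths and the context/target_cell values are placeholders invented for this sketch, and lifecycle_violations is a hypothetical check that each step kind carries the memory mode and stage policy from the State Lifecycle section.

```python
# Expected (memory_mode, stage_policy) for each step kind.
LIFECYCLE = {
    "accumulation": ("read_write", "commit"),
    "pre_event_probe": ("read_only", "discard"),
    "final_probe": ("read_only", "discard"),
}

# Hypothetical three-step plan; every path below is a synthetic placeholder.
MINIMAL_PLAN = [
    {
        "step_id": "acc_001", "kind": "accumulation", "persona_id": "user_a",
        "script_path": "synthetic/acc_001.yaml", "acc_num": 1,
        "memory_mode": "read_write", "stage_policy": "commit",
    },
    {
        "step_id": "pretest_W_A", "kind": "pre_event_probe", "persona_id": "user_a",
        "script_path": "synthetic/pretest_W_A.yaml", "before_acc_num": 2,
        "memory_mode": "read_only", "stage_policy": "discard",
    },
    {
        "step_id": "acc_002", "kind": "accumulation", "persona_id": "user_a",
        "script_path": "synthetic/acc_002.yaml", "acc_num": 2,
        "memory_mode": "read_write", "stage_policy": "commit",
    },
]


def lifecycle_violations(plan: list[dict]) -> list[str]:
    """List steps whose memory_mode/stage_policy do not match their kind."""
    bad = []
    for step in plan:
        expected = LIFECYCLE[step["kind"]]
        actual = (step["memory_mode"], step["stage_policy"])
        if actual != expected:
            bad.append(f"{step['step_id']}: expected {expected}, got {actual}")
    return bad


print(lifecycle_violations(MINIMAL_PLAN))
```

A plan this small is enough to exercise all four milestone checks: one committed accumulation step, one discarded probe, a second accumulation step to confirm the probe left no trace, and a resume after any of them.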

12. MVP Implementation Status

As of 2026-05-17, the orchestrator MVP has landed in code. The entry point has converged on the real PA runner: scripts/run_orchestrator.py runs a frozen plan / ACC range and wires together the real PA, the nanobot simulator backend, the memory adapter, the stage runtime, and ledger/resume. The old network-free synthetic smoke runner has been deleted.

Implemented:

harness/run_plan.py
- Defines RunStep / RunPlan / GeneratedFrom.
- Supports reading a frozen run_plan.yaml from YAML.
- Validates step_id uniqueness, persona consistency, accumulation order, the field shape of probe/final steps, and the placeholder script path rule.
- Keeps validate_full_timeline_shape() for later validation of the full 108 + 6 + 30 plan.
harness/run_state.py
- Writes and reads ledger.json.
- Supports running / done / failed / skipped statuses.
- On resume, done steps are skipped; running or failed steps are rerun by default.
- Provides concise progress lines, including start time, run id, persona, memory condition, step index, step kind, target cell, and rw/ro status.
harness/state_runtime.py
- Keeps the original initialize_state_runtime().
- Adds StageRuntime / StageStepRuntime.
- Supports a canonical stage and a per-step working stage.
- accumulation/event steps can commit the working stage to the canonical stage.
- probe steps can discard the working stage without polluting the canonical stage.
- The MCP config points at the working stage rather than directly at the canonical stage.
harness/memory_runtime.py
- Applies each step's memory mode using the current async MemoryAdapter contract.
- read_write -> AdapterMode(read_only=False).
- read_only -> AdapterMode(read_only=True).
- Does not do backend-specific exact checkpoint/restore for now.
harness/session_executor.py
- Stays very thin and executes only one RunStep.
- Calls the single-session loop and writes transcript.jsonl, transcript.md, pa_toolcalls.json, and meta.yaml.
- Returns beats, tool calls, elapsed seconds, and artifact paths.
- Does not decide order, does not decide memory/stage policy, and does no judging.
harness/orchestrator.py
- The current benchmark-level outer loop.
- Consumes the frozen RunPlan.
- Coordinates RunLedger, MemoryRuntime, StageRuntime, and SessionExecutor.
- Before each step: sets the memory mode, creates the working stage, and marks the ledger running.
- On success: commits/discards the stage according to policy and marks the ledger done.
- On failure: marks the ledger failed.
- On resume: skips done steps and does not regenerate run_plan.yaml.
scripts/run_orchestrator.py
- The current real run entry point.
- Can generate/read a frozen run_plan.yaml from CLI arguments.
- Organizes the run directory by persona_id, PA model/backbone, and memory backend.
- Each step calls the nanobot simulator backend through SimulatorLoop and calls the real PA through the OpenRouter PA factory.

Verification:

PYTHONPATH=. uv run pytest tests/test_harness/test_run_plan.py tests/test_harness/test_run_state.py tests/test_harness/test_state_runtime.py tests/test_harness/test_memory_runtime.py tests/test_harness/test_session_executor.py tests/test_harness/test_orchestrator.py

Result:

43 passed

Additional verification:

PYTHONPATH=. uv run pytest tests/test_harness

Result:

58 passed

Not yet implemented:

A generator that builds the full run_plan.yaml automatically from Timeline.md + outline.yaml + scripts.
Wiring for full-plan validation of the complete 108 accumulation + 6 pre-event probes + 30 final probes.
Remote lifecycle management of the PA host / simulator host on genuinely separate Nano hosts.
More complete wiring of real state server fixtures/manifest into each generated session.
Exact checkpoint/restore for each memory backend.
Judge packets or judge execution logic; this is still deliberately outside the orchestrator MVP.

Suggested next step: first pin down generate-plan / validate-plan / freeze-plan and connect the real outline, Timeline, and data/scripts/sessions/{persona_id} to the frozen run_plan.yaml; then fill in state fixtures/manifest and full multi-persona plan validation.
