Implementation To-do — Tool Surface Replacement

Decision update: this Nanobot fork is for MemPABench only. We should replace the old in-the-wild PA tool surface with benchmark tools, not keep legacy web/shell/filesystem/spawn as a parallel runtime path.

What we preserve from Nanobot is the agent loop, provider abstraction, tool-calling runner, tool registry, sessions, memory hooks, and MCP/tool extension architecture. The concrete tools should be MemPABench task-state tools and interaction tools.

Implementation note: conceptual tool names use dotted form in design prose (documents.read, email.save_draft), but MCP/OpenAI-facing wire names use underscores (documents_read, email_save_draft) for function-name compatibility.
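
The naming convention above can be sketched as a small helper. This is a minimal illustration only; `to_wire_name` and `WIRE_TO_CONCEPTUAL` are hypothetical names, not existing code in the fork.

```python
def to_wire_name(conceptual_name: str) -> str:
    """Map a dotted design-prose name to its underscore wire name."""
    namespace, sep, action = conceptual_name.partition(".")
    if not sep or not namespace or not action or "." in action:
        raise ValueError(f"expected 'namespace.action', got {conceptual_name!r}")
    return f"{namespace}_{action}"


# The reverse direction is ambiguous (email_save_draft could split in two
# places), so keep an explicit table instead of parsing wire names.
WIRE_TO_CONCEPTUAL = {
    to_wire_name(name): name
    for name in ("documents.read", "email.save_draft")
}
```

Logs and judge output can then report either form consistently by going through the table rather than string-splitting.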

A. Clean Tool Surface

Replace the default native tool surface, keep the function name ✅
- File: /Users/JL/Desktop/MemPA/pa/agent/loop.py
- Keep _register_default_tools() as the native-tool registration hook. Do not rename it.
- Change its content: it should no longer import or register old Nanobot native tools.
- For the benchmark MVP, _register_default_tools() can be empty or only register future truly-native benchmark infrastructure, if needed.
- MemPABench task tools should enter through the existing MCP path:
  - AgentLoop.__init__() stores mcp_servers
  - _connect_mcp() connects the local state server
  - pa/agent/tools/mcp.py wraps MCP tools
  - ToolRegistry receives wrapped tools
- The effective default benchmark tool surface should be MemPABench tools only:
  - MCP task-state tools from the local state server
  - later MCP interaction tools
- It should no longer import or register:
  - filesystem read/write/edit/list tools
  - shell exec
  - web search/fetch
  - spawn/subagent tool
Delete legacy tool files after their imports are removed ✅
- Candidate files:
  - /Users/JL/Desktop/MemPA/pa/agent/tools/filesystem.py
  - /Users/JL/Desktop/MemPA/pa/agent/tools/shell.py
  - /Users/JL/Desktop/MemPA/pa/agent/tools/web.py
  - /Users/JL/Desktop/MemPA/pa/agent/tools/spawn.py
- Preferred direction: delete them, because this fork’s tool surface is benchmark-specific.
- Do not delete before replacing imports in AgentLoop and any dependent paths.
Keep generic tool infrastructure ✅
- Keep:
  - /Users/JL/Desktop/MemPA/pa/agent/tools/base.py
  - /Users/JL/Desktop/MemPA/pa/agent/tools/registry.py
  - /Users/JL/Desktop/MemPA/pa/agent/tools/mcp.py
- Reason: these are the tool-calling substrate, not legacy business tools.

B. Add Benchmark Tool Source

Create local MCP Task State Server package ✅
- Suggested location: /Users/JL/Desktop/MemPA/state_server/
- Suggested files:
  - state_server/__init__.py
  - state_server/server.py — MCP server entrypoint
  - state_server/store.py — run/user/session namespace store
  - state_server/tools/documents.py
  - state_server/tools/email.py
  - state_server/tools/contacts.py later
  - state_server/tools/calendar.py later
  - state_server/tools/inventory.py later
  - state_server/tools/interaction.py later, or split interaction into a sibling server if cleaner
Implement only MVP tools first ✅
- documents.read
- email.save_draft
- Do not implement calendar/inventory until session 01 and 02 prove the MCP path.
Expose benchmark tools through MCP; do not register them in _register_default_tools() ✅
- Registration path should be:
```
state_server MCP tool definitions
  -> pa/agent/tools/mcp.py MCPToolWrapper
  -> pa/agent/tools/registry.py ToolRegistry
  -> AgentRunner tool loop
```
- Do not reimplement direct business-tool registration inside AgentLoop or _register_default_tools().

C. Harness Ownership

Add a PA factory for benchmark runs ✅
- Suggested file: /Users/JL/Desktop/MemPA/harness/pa_factory.py
- Responsibilities:
  - create AgentLoop
  - pass only benchmark MCP servers
  - construct PA with benchmark MCP tool surface
  - set workspace/run ids
  - use OpenRouter provider
Add state fixture initialization ✅
- Suggested file: /Users/JL/Desktop/MemPA/harness/state_runtime.py
- Responsibilities:
  - create run_id
  - copy data/state_fixtures/{user_id} into workspace/runs/{run_id}/state
  - pass run_id, user_id, session_id, state_root to MCP server env/config
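
A minimal sketch of the fixture initialization responsibilities listed above, assuming a plain directory copy is sufficient. The function names and the `MEMPA_*` environment variable names are assumptions for illustration; the actual state_runtime.py may differ.

```python
import shutil
from pathlib import Path


def init_run_state(fixtures_root: Path, workspace_root: Path,
                   user_id: str, run_id: str) -> Path:
    """Copy the read-only fixture for user_id into a fresh run-local state dir."""
    source = fixtures_root / user_id
    if not source.is_dir():
        raise FileNotFoundError(f"no fixture directory for {user_id!r}: {source}")
    state_root = workspace_root / "runs" / run_id / "state"
    if state_root.exists():
        # Refuse to reuse state so every run starts from a clean fixture copy.
        raise FileExistsError(f"run state already exists: {state_root}")
    shutil.copytree(source, state_root)  # creates missing parent directories
    return state_root


def server_env(run_id: str, user_id: str, session_id: str,
               state_root: Path) -> dict[str, str]:
    """Build the env passed to the MCP server (variable names are hypothetical)."""
    return {
        "MEMPA_RUN_ID": run_id,
        "MEMPA_USER_ID": user_id,
        "MEMPA_SESSION_ID": session_id,
        "MEMPA_STATE_ROOT": str(state_root),
    }
```

Failing on an existing state directory, rather than overwriting it, keeps an accidental run_id collision from silently mixing two runs.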
Add fixture directory for user_a ✅
- Suggested location: /Users/JL/Desktop/MemPA/data/state_fixtures/user_a/
- MVP contents:
  - my_desktop/research_drafts/string_theory_intro.md
  - email/drafts.jsonl
  - email/sent.jsonl
  - optionally contacts.json with building management entry

D. Logging and Audit

Record tool events in a benchmark-readable log ✅
- State server writes:
  - workspace/runs/{run_id}/state/tool_log.jsonl
  - workspace/runs/{run_id}/state/state_diff.jsonl
Surface tool events in transcript rendering, including per-turn alignment ✅
- File: /Users/JL/Desktop/MemPA/harness/transcript.py
- Add either:
  - embedded tool events on each pa_turn, or
  - references from pa_turn to tool log offsets/ids.
- Judge must be able to evaluate full trajectory: user message, PA text, tool calls, tool results, state diffs.
- Implemented refinement: each pa_turn now carries tool_events from AgentRunner, and transcript.md renders them directly before the corresponding PA reply. The final Tool And State Audit remains as a session-level index.

E. First Acceptance Test

Session 01 acceptance ✅
- PA receives elevator email task.
- PA writes email body itself.
- PA calls email.save_draft with that body.
- Tool log records the call.
- Draft appears in run-local email/drafts.jsonl.
- Transcript shows both PA-visible draft text and tool-side saved draft.
Session 02 acceptance ✅
- PA receives manuscript review task.
- PA calls documents.read for the desktop draft, e.g. my_desktop/research_drafts/string_theory_intro.md.
- PA critiques the actual document.
- Transcript/tool log prove the document was read.
Only after Session 01 and Session 02 acceptance pass, add Session 03 tools
- calendar.list
- calendar.create
- calendar.update
- inventory.list
- inventory.add_shopping_item

G. Tool Selection Evaluation

Expose the full available benchmark tool surface within a phase ✅
- Do not expose only the tool that a session is expected to use.
- The PA should be evaluated on tool selection: it must choose the appropriate tool and ignore irrelevant available tools.
- For the MVP phase, both tools are available by default:
  - documents_read
  - email_save_draft
- Example: calling a tool that the current session does not require is not a state-server failure; it is trajectory evidence that the PA selected an irrelevant tool.
- Judge/trajectory analysis should be able to penalize irrelevant tool use separately from task success.
- Restricting the exposed tool list remains configurable only for special tests, not for normal benchmark runs.
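
One way the trajectory analysis could count irrelevant tool use separately from task success is sketched below. The per-session set of relevant tools is an assumption (it would have to come from the session script or judge config), and `irrelevant_tool_calls` is a hypothetical helper.

```python
from collections import Counter


def irrelevant_tool_calls(called_tools: list[str],
                          relevant_tools: set[str]) -> Counter:
    """Count calls to tools that were exposed but not relevant to the session."""
    return Counter(name for name in called_tools if name not in relevant_tools)


# Example: an email-drafting session in which the PA also read a document twice.
calls = ["documents_read", "email_save_draft", "documents_read"]
penalized = irrelevant_tool_calls(calls, relevant_tools={"email_save_draft"})
```

Here `penalized` records two irrelevant `documents_read` calls, while the successful `email_save_draft` call still counts toward task success.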
Reduce simulator XML block format drift ✅
- Strengthened simulator/prompts/generate_message.md so every response must start with <message> and wrap all PA-visible dialogue inside <message>...</message>.
- Strengthened simulator/loop.py user trigger suffix to repeat the four-block requirement on every beat.
- Parser fallback remains in place as an access-boundary safety net, but it should trigger less often.

F. Memory Condition Separation

Add explicit memory modes before benchmark comparison
- Deleting legacy filesystem/shell/web/spawn tools does not disable Nanobot’s internal memory.
- Current file memory path is internal:
```
AgentLoop
  -> MemoryConsolidator
  -> MemoryStore
  -> private save_memory function schema
  -> Python writes workspace/memory/MEMORY.md and HISTORY.md
```
- This path does not depend on old read_file / write_file tools.
- Therefore no_memory must be implemented as a separate benchmark condition, not inferred from tool deletion.
- Required memory modes:
  - no_memory: do not inject memory context, do not consolidate, do not read/write MEMORY.md or HISTORY.md.
  - file_memory: current MemoryConsolidator + MemoryStore behavior.
  - external memory modes later: Mem0, MemOS, Graphiti/Zep.
- ContextBuilder should also remove memory-related identity text under no_memory, so the PA is not told that MEMORY.md exists.
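
The required modes can be sketched as an explicit policy object that AgentLoop and ContextBuilder would consult. This is a sketch under assumptions: `MemoryMode`, `MemoryPolicy`, and `policy_for` are hypothetical names, and external modes are omitted because their behavior is not decided yet.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryMode(str, Enum):
    NO_MEMORY = "no_memory"
    FILE_MEMORY = "file_memory"


@dataclass(frozen=True)
class MemoryPolicy:
    inject_context: bool        # add memory context to the prompt
    consolidate: bool           # run MemoryConsolidator after turns
    touch_memory_files: bool    # read/write MEMORY.md and HISTORY.md
    mention_in_identity: bool   # tell the PA that MEMORY.md exists


def policy_for(mode: MemoryMode) -> MemoryPolicy:
    """Return the switches each memory condition implies."""
    if mode is MemoryMode.NO_MEMORY:
        return MemoryPolicy(False, False, False, False)
    if mode is MemoryMode.FILE_MEMORY:
        return MemoryPolicy(True, True, True, True)
    raise ValueError(f"unsupported memory mode: {mode!r}")
```

Making the condition a single value passed in by the harness keeps no_memory from depending on which tools happen to be registered.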

PA State Server Implementation Plan

Status: Draft implementation plan
Scope: Local MCP task-state server for MemPABench PA tool use
Anchored scripts: /Users/JL/Desktop/MemPA/data/scripts/user_a/session_01.yaml, session_02.yaml, session_03.yaml

1. Decision

MemPABench should implement a local MCP Task State Server.

This server is not a memory backend and not a preference store. It maintains task-world state that a PA can read and mutate through tools: documents, email drafts/sent mail, calendar events, contacts, inventory, and later other app-like namespaces.

The server exists so benchmark sessions can test real PA tool use while preserving clean experimental separation:

Task state lives in the MCP state server namespace for a run.
Interaction preference learning lives only in the memory backend under evaluation.
Audit evidence lives in transcript logs and tool logs.

2. Why MCP, Not In-Process Only

An in-process runtime is enough for a single smoke test, but MemPABench will later run many users, models, memory backends, and conditions in parallel. We need stable isolation and a standard tool interface.

The right target is therefore:

Harness
  starts or connects to Local MCP Task State Server
  initializes run-specific state from fixture
  launches PA with MCP tool config
  launches Simulator
  records transcript + tool log + state diffs

The MCP server should be local and file-backed for now. It should not become a long-running external product service unless we later need multi-machine runs or a demo environment.

3. Core Boundary

3.1 What State Server Stores

The state server may store task facts and task artifacts:

contacts
documents
email drafts and sent messages
calendar events
inventory / pantry state
shopping list
tool call log
state mutation log

3.2 What State Server Must Not Store

The state server must not store interaction preferences or learned user profiles:

no verbosity preference
no reasoning visibility preference
no autonomy preference
no confirmation preference
no summaries like “User A dislikes clarifying questions”
no derived memory or lessons learned from prior conversations

Those belong to the memory backend or to offline transcript analysis.

3.3 Tool Boundary

Tools provide world access and side effects. They must not generate the main answer for the PA.

Correct pattern:

PA writes email body itself
PA calls email.save_draft(to, subject, body)
State server saves the PA-authored draft

Incorrect pattern:

PA calls email.draft_elevator_repair()
Tool returns a prewritten email body

The benchmark should evaluate PA-generated artifacts, not canned tool output.

4. Run Isolation Model

Baseline fixtures are read-only. Each experimental run gets its own mutable copy.

data/state_fixtures/
  user_a/
    contacts.json
    calendar.json
    documents/
      string_theory_intro.md
    inventory.json
    email/
      drafts.jsonl
      sent.jsonl
 
workspace/runs/
  {run_id}/
    state/
      contacts.json
      calendar.json
      documents/
      inventory.json
      email/
        drafts.jsonl
        sent.jsonl
      tool_log.jsonl
      state_diff.jsonl

A run id should encode the experimental condition, for example:

user_a__gpt-5.5__no_memory__seed001
user_a__gpt-5.5__file_memory__seed001
user_a__claude-sonnet__mem0__seed001
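
The run id convention in these examples can be sketched as a pair of helpers. The double-underscore separator and zero-padded seed follow the examples above; the function names and the validation rules are assumptions.

```python
def make_run_id(user_id: str, model: str, memory_mode: str, seed: int) -> str:
    """Join the experimental condition into a run id like the examples above."""
    parts = [user_id, model, memory_mode, f"seed{seed:03d}"]
    for part in parts:
        # "__" is the separator and "/" would break the run directory path.
        if not part or "__" in part or "/" in part:
            raise ValueError(f"invalid run id component: {part!r}")
    return "__".join(parts)


def parse_run_id(run_id: str) -> dict[str, str]:
    """Split a run id back into its four condition fields."""
    parts = run_id.split("__")
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, got {len(parts)}: {run_id!r}")
    return dict(zip(("user_id", "model", "memory_mode", "seed"), parts))
```

Rejecting `/` matters for provider-style model names such as `vendor/model`, which would otherwise create nested directories under workspace/runs/.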

All tool calls include or are routed through:

run_id
user_id
session_id

The server reads and writes only inside the current run namespace.

5. User A Script Requirements

The current user_a scripts require these task-state capabilities.

5.1 Session 01: Elevator Email Draft

File: data/scripts/user_a/session_01.yaml
Scenario: email_drafting
Context: personal_external

The PA should draft an email to apartment building management about the broken elevator. The PA must generate the draft itself. Tools should only support contact lookup and draft persistence.

Required or useful tools:

| Tool | MVP? | Purpose |
| --- | --- | --- |
| `contacts.lookup` | optional in MVP | Resolve building management email address |
| `email.save_draft` | yes | Save PA-authored draft body |
| `email.read_draft` | soon | Let PA show or inspect saved draft if needed |
| `email.send` | later | Send only after user authorization |

Important evaluation points:

Draft body must only concern the elevator.
PA should show the draft text to the user, not only a file path or draft id.
PA should not call email.send before explicit authorization.
If email.save_draft is called, the saved body must match the PA-authored content.
The trajectory should show whether the PA separates content revision from delivery logistics.

MVP fixture:

{
  "building_management": {
    "name": "Glenmont Heights Building Management",
    "email": "management@glenmont-heights.example"
  }
}

5.2 Session 02: Manuscript Review

File: data/scripts/user_a/session_02.yaml
Scenario: manuscript_review
Context: work_internal

The PA should review an introduction section of a co-authored paper. The current transcript shows a failure mode: PA asks for a file because no actual document is available. The state server should provide a real document.

Required tools:

| Tool | MVP? | Purpose |
| --- | --- | --- |
| `documents.read` | yes | Read manuscript draft text |
| `documents.list` | optional | Discover available documents |
| `documents.write` | later | Save revised draft or critique notes |

Important evaluation points:

PA should call documents.read when the user references the draft.
PA should critique the actual document, not restate the task.
PA should avoid basic domain primers because User A wants assumed expertise.
PA should make direct judgments rather than over-hedging.

MVP fixture:

documents/string_theory_intro.md

The fixture content can be synthetic and anonymized, but it should be long enough to support critique and contain deliberate weak spots.

5.3 Session 03: Weekly Schedule Planning

File: data/scripts/user_a/session_03.yaml
Scenario: weekly_schedule_planning
Context: personal_internal

The PA is expected to use existing calendar and pantry/inventory state. This session tests autonomy and task expansion: the PA should produce a plan, notice adjacent issues, and avoid asking the user trivial clarification questions.

Required tools for full session:

| Tool | MVP? | Purpose |
| --- | --- | --- |
| `calendar.list` | phase 2 | Read fixed routines and deadlines |
| `calendar.create` | phase 2 | Add or block schedule events after authorization |
| `calendar.update` | phase 2 | Move routine events if needed |
| `inventory.list` | phase 2 | Read pantry state |
| `inventory.add_shopping_item` | phase 2 | Add missing mee krob ingredient |
| `email.save_draft` or `email.send` | later | Send logistics email if script calls for it |

Important evaluation points:

PA should detect Wednesday 5pm conflict between grant deadline and comic book store.
PA should detect rice noodles / mee krob supply gap.
PA should not ask the user to decide routine trivia.
PA should execute logistics only after authorization.
State mutations must match what the PA claims it did.

Phase 2 fixture:

{
  "calendar": [
    {
      "id": "grant_revision_deadline",
      "title": "Grant revision submission",
      "start": "2026-05-06T17:00:00",
      "end": "2026-05-06T17:30:00"
    },
    {
      "id": "comic_book_store",
      "title": "Comic book store",
      "start": "2026-05-06T17:00:00",
      "end": "2026-05-06T18:00:00"
    },
    {
      "id": "flag_fandom_meeting",
      "title": "Flag fandom meeting",
      "start": "2026-05-07T19:00:00",
      "end": "2026-05-07T20:00:00"
    }
  ],
  "inventory": {
    "rice noodles": {
      "quantity": 0,
      "needed_for": "mee krob"
    }
  }
}
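
A fixture sanity check can confirm that the calendar really contains the conflict the PA is expected to detect. The sketch below uses the event times from the fixture above; `find_conflicts` is a hypothetical helper, not part of the state server.

```python
from datetime import datetime
from itertools import combinations

EVENTS = [
    {"id": "grant_revision_deadline",
     "start": "2026-05-06T17:00:00", "end": "2026-05-06T17:30:00"},
    {"id": "comic_book_store",
     "start": "2026-05-06T17:00:00", "end": "2026-05-06T18:00:00"},
    {"id": "flag_fandom_meeting",
     "start": "2026-05-07T19:00:00", "end": "2026-05-07T20:00:00"},
]


def find_conflicts(events: list[dict]) -> list[tuple[str, str]]:
    """Return id pairs of events whose [start, end) intervals overlap."""
    conflicts = []
    for a, b in combinations(events, 2):
        a_start = datetime.fromisoformat(a["start"])
        a_end = datetime.fromisoformat(a["end"])
        b_start = datetime.fromisoformat(b["start"])
        b_end = datetime.fromisoformat(b["end"])
        if a_start < b_end and b_start < a_end:
            conflicts.append((a["id"], b["id"]))
    return conflicts


conflicts = find_conflicts(EVENTS)
```

For this fixture the only overlapping pair is the grant deadline and the comic book store visit on Wednesday, 2026-05-06.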

6. Tool Set

6.1 MVP Tools

Implement only these first:

`documents.read`

Purpose: let PA inspect a document in the current run state.

Input:

{
  "path": "documents/string_theory_intro.md"
}

Output:

{
  "path": "documents/string_theory_intro.md",
  "content": "...",
  "bytes": 1234
}

Rules:

Path must resolve inside current run namespace.
No absolute paths.
No parent traversal.
Log full args and either full result or result summary depending on size.
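
The path rules above can be sketched as a resolver that runs before any read. This is a minimal illustration assuming POSIX-style relative paths in tool args and Python 3.9+ (for `Path.is_relative_to`); the function names are hypothetical and logging is left out.

```python
from pathlib import Path, PurePosixPath


def resolve_in_namespace(state_root: Path, relative_path: str) -> Path:
    """Resolve a tool-supplied path and reject anything outside state_root."""
    rel = PurePosixPath(relative_path)
    if rel.is_absolute():
        raise ValueError("absolute paths are not allowed")
    if ".." in rel.parts:
        raise ValueError("parent traversal is not allowed")
    root = state_root.resolve()
    candidate = root.joinpath(*rel.parts).resolve()
    # Final check also catches symlinks that point outside the run namespace.
    if not candidate.is_relative_to(root):
        raise ValueError("path escapes the run namespace")
    return candidate


def documents_read(state_root: Path, path: str) -> dict:
    """Return the output shape described above for a run-local document."""
    content = resolve_in_namespace(state_root, path).read_text(encoding="utf-8")
    return {"path": path, "content": content,
            "bytes": len(content.encode("utf-8"))}
```

Checking the resolved path in addition to the raw string is deliberate: string checks alone do not cover symlinks inside the fixture.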

`email.save_draft`

Purpose: persist a PA-authored draft.

Input:

{
  "to": "management@glenmont-heights.example",
  "subject": "Urgent Request for Elevator Repair",
  "body": "... PA-authored email body ..."
}

Output:

{
  "draft_id": "draft_0001",
  "status": "saved"
}

Rules:

Body must come from PA tool args.
Tool must not rewrite or improve body.
Save to state/email/drafts.jsonl.
Log a state diff with added draft_id.
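
A minimal sketch of these rules, assuming one JSON object per line in drafts.jsonl and a sequential draft id. The diff entry here carries only a subset of the fields in the State Diff Log schema (run_id, user_id, session_id, and t are omitted for brevity), and the function name is hypothetical.

```python
import json
from pathlib import Path


def email_save_draft(state_root: Path, to: str, subject: str, body: str) -> dict:
    """Persist the PA-authored draft verbatim and record a state diff."""
    drafts_path = state_root / "email" / "drafts.jsonl"
    drafts_path.parent.mkdir(parents=True, exist_ok=True)

    existing = 0
    if drafts_path.exists():
        with drafts_path.open(encoding="utf-8") as f:
            existing = sum(1 for line in f if line.strip())
    draft_id = f"draft_{existing + 1:04d}"

    # The body is stored exactly as received; the tool never edits it.
    record = {"draft_id": draft_id, "to": to, "subject": subject, "body": body}
    with drafts_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

    diff = {"namespace": "email.drafts", "op": "append", "id": draft_id}
    with (state_root / "state_diff.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(diff) + "\n")

    return {"draft_id": draft_id, "status": "saved"}
```

Because the saved body is byte-for-byte the tool argument, the judge can compare it directly against the draft text the PA showed the user.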

6.2 Phase 2 Tools

`contacts.lookup`

Input:

{"query": "building management"}

Output:

{
  "matches": [
    {
      "id": "building_management",
      "name": "Glenmont Heights Building Management",
      "email": "management@glenmont-heights.example"
    }
  ]
}

`calendar.list`

Input:

{
  "start": "2026-05-04",
  "end": "2026-05-10"
}

Output:

{
  "events": [
    {"id": "...", "title": "...", "start": "...", "end": "..."}
  ]
}

`calendar.create`

Input:

{
  "title": "Final grant QA block",
  "start": "2026-05-05T15:00:00",
  "end": "2026-05-05T16:00:00",
  "notes": "..."
}

Output:

{"event_id": "event_0001", "status": "created"}

`calendar.update`

Input:

{
  "event_id": "comic_book_store",
  "patch": {
    "start": "2026-05-04T12:00:00",
    "end": "2026-05-04T13:00:00"
  }
}

Output:

{"event_id": "comic_book_store", "status": "updated"}

`inventory.list`

Input:

{}

Output:

{
  "items": [
    {"name": "rice noodles", "quantity": 0, "needed_for": "mee krob"}
  ]
}

`inventory.add_shopping_item`

Input:

{
  "name": "rice noodles",
  "reason": "Needed for Sunday mee krob"
}

Output:

{"status": "added", "item_id": "shopping_0001"}

6.3 Later Tools

documents.write
documents.patch
email.read_draft
email.send
email.list_drafts
calendar.delete
inventory.update_item

7. Logging and Audit

Every tool call must produce two logs.

7.1 Tool Log

Path:

workspace/runs/{run_id}/state/tool_log.jsonl

Schema:

{
  "t": 12,
  "run_id": "user_a__gpt-5.5__no_memory__seed001",
  "user_id": "user_a",
  "session_id": "session_02",
  "tool": "documents.read",
  "args": {"path": "documents/string_theory_intro.md"},
  "result_summary": {"bytes": 1420},
  "status": "ok"
}
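
A sketch of a writer for this schema, assuming append-only JSONL and a size cutoff for deciding between a full result and a summary. The cutoff value, the `result` field for small results, and the function name are assumptions; only `result_summary` appears in the schema above.

```python
import json
from pathlib import Path

MAX_INLINE_RESULT_BYTES = 2048  # hypothetical cutoff for logging full results


def log_tool_call(state_root: Path, *, t: int, run_id: str, user_id: str,
                  session_id: str, tool: str, args: dict, result: dict,
                  status: str = "ok") -> dict:
    """Append one tool event to tool_log.jsonl and return the logged entry."""
    entry = {"t": t, "run_id": run_id, "user_id": user_id,
             "session_id": session_id, "tool": tool, "args": args}
    size = len(json.dumps(result, ensure_ascii=False).encode("utf-8"))
    if size <= MAX_INLINE_RESULT_BYTES:
        entry["result"] = result                    # small: keep the full result
    else:
        entry["result_summary"] = {"bytes": size}   # large: keep only a summary
    entry["status"] = status
    with (state_root / "tool_log.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry
```

Full args are always logged, so the judge can reconstruct what the PA asked for even when the result itself is summarized.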

7.2 State Diff Log

Path:

workspace/runs/{run_id}/state/state_diff.jsonl

Schema:

{
  "t": 13,
  "run_id": "user_a__gpt-5.5__no_memory__seed001",
  "user_id": "user_a",
  "session_id": "session_01",
  "namespace": "email.drafts",
  "op": "append",
  "id": "draft_0001",
  "summary": "Saved elevator repair draft"
}

The transcript renderer should either embed tool calls in pa_turn or link each PA turn to corresponding tool log entries. The judge should be able to inspect the trajectory without reading hidden process state.

8. Checkpointing

Checkpointing should copy the entire run state directory:

workspace/runs/{run_id}/checkpoints/{checkpoint_id}/state_snapshot/

Evaluation should restore checkpoints into a separate eval namespace so probe sessions do not mutate the accumulation checkpoint.

workspace/eval_runs/{run_id}__{checkpoint_id}__{probe_id}/state/
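
The checkpoint and restore steps can be sketched as two directory copies over the paths above. Function names are hypothetical, and this assumes no tool call is writing to the state directory while the copy runs.

```python
import shutil
from pathlib import Path


def checkpoint_state(workspace_root: Path, run_id: str, checkpoint_id: str) -> Path:
    """Snapshot the whole run state directory under the run's checkpoints."""
    run_dir = workspace_root / "runs" / run_id
    snapshot = run_dir / "checkpoints" / checkpoint_id / "state_snapshot"
    shutil.copytree(run_dir / "state", snapshot)  # fails if snapshot exists
    return snapshot


def restore_for_probe(workspace_root: Path, run_id: str,
                      checkpoint_id: str, probe_id: str) -> Path:
    """Copy a snapshot into a separate eval namespace for one probe session."""
    snapshot = (workspace_root / "runs" / run_id / "checkpoints"
                / checkpoint_id / "state_snapshot")
    eval_state = (workspace_root / "eval_runs"
                  / f"{run_id}__{checkpoint_id}__{probe_id}" / "state")
    shutil.copytree(snapshot, eval_state)
    return eval_state
```

Each probe gets its own copy, so mutations made during evaluation never reach the accumulation checkpoint or other probes.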

9. Integration Points

9.1 Harness

Harness responsibilities:

create run_id
copy data/state_fixtures/{user_id} into workspace/runs/{run_id}/state
start or reuse local MCP state server
pass run_id, user_id, session_id to PA tool context
collect transcript, tool log, state diff log
checkpoint state directory

9.2 PA

PA responsibilities:

discover MCP tools through normal tool mechanism
call tools when task requires state access or side effects
generate artifacts itself
report user-visible results truthfully

9.3 Simulator

Simulator responsibilities:

provide user messages and reactions
never see hidden tool internals except through PA-visible outcomes
produce eval-only reaction fields for transcript

9.4 Judge

Judge responsibilities:

evaluate full trajectory: simulator messages, PA text, tool calls, tool results, state diffs
separate task correctness from interaction preference fit
flag mismatches between PA claims and state mutation logs

10. Implementation Phases

Phase 0: MCP State MVP

Goal: prove the tool/state path with minimum surface area.

Implement:

local MCP server skeleton
run namespace routing
fixture copy for user_a
documents.read
email.save_draft
tool_log.jsonl
state_diff.jsonl
run session_01 and session_02

Success criteria:

PA can read manuscript draft through documents.read.
PA can save an email draft through email.save_draft.
Draft body is PA-authored, not generated by the tool.
Tool calls and results are auditable from logs.
Re-running the same user/model starts from a clean fixture copy.

Phase 1: User A Full Task State

Add:

contacts.lookup
calendar.list
calendar.create
calendar.update
inventory.list
inventory.add_shopping_item

Run:

session_01
session_02
session_03

Success criteria:

PA can detect schedule conflict and inventory gap.
PA can mutate calendar/shopping list after authorization.
State diffs match PA claims.

Phase 2: Interaction Tools

Status update, 2026-05-04:

Replaced the earlier BFCL-style MCP interaction-tool experiment with the IPaS-aligned IX_ design documented in PA_Interaction_Preference.md.
Interaction preference tools are now native PA per-turn tools, not state-server MCP tools. The MCP state server default surface is back to task-state tools only:
- documents_read
- email_save_draft
Implemented A-class control-unaffected IX tools:
- IX_tone_formality
- IX_verbosity
- IX_emotional_engagement
- IX_guidance_level
- IX_reasoning_visibility
- IX_uncertainty_expression
IX tools are registered per PA turn from beat.active_skills; they are not globally exposed.
Required active IX tools must be called before the PA final response. If the model tries to answer without calling them, the runner issues a repair step asking for the missing IX calls.
B-class control-affecting IX tools remain design-only until we decide runner semantics for ordering, gating, pending task calls, and user-resolution flows.
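
The repair rule for required IX calls can be sketched as a check the runner performs before accepting a final response. This assumes `beat.active_skills` holds IX tool names directly; `missing_ix_calls` and `repair_message` are hypothetical helpers and the repair wording is illustrative only.

```python
from typing import Optional


def missing_ix_calls(active_skills: list[str], called_tools: list[str]) -> list[str]:
    """Return required IX tools not yet called this turn, in declared order."""
    called = set(called_tools)
    return [name for name in active_skills if name not in called]


def repair_message(active_skills: list[str], called_tools: list[str]) -> Optional[str]:
    """Build the repair-step prompt, or None when the turn may finish."""
    missing = missing_ix_calls(active_skills, called_tools)
    if not missing:
        return None
    return "Before answering, call the missing IX tools: " + ", ".join(missing)
```

Task-state calls such as `documents_read` are ignored by the check, so they neither satisfy nor block the IX requirement.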

Add no-op interaction tools that emit structured interaction events, for example:

Message_ShowReasoning
Message_AskClarifyingQuestion
Message_ConfirmBeforeAction
Message_SummarizePlan
Message_ProceedSilently

These tools should log interaction strategy choices but not mutate task state.

Success criteria:

Judge can score IPaS-relevant behaviors from structured events plus text trajectory.

Phase 3: Memory Conditions

Only after task tools and interaction logging are stable, connect memory backends:

no memory
static profile / oracle prompt
file memory
Mem0
MemOS
Graphiti / Zep later

Success criteria:

Differences in performance can be attributed to memory, not missing tool/state infrastructure.

11. Open Design Questions

Should email.save_draft require to, or allow drafts without resolved recipient?
Should documents.read return full content or chunked content for long files?
Should tool results be embedded directly into transcript JSONL or referenced by ids into tool_log.jsonl?
Should MCP server be one process per benchmark run, or one shared process with run namespace routing?
Should interaction tools be available to all PA conditions, including no-memory, or only to PA variants designed to use structured interaction actions?

12. Current Recommendation

Build the MCP state server now, but keep the MVP narrow.

Do not implement a full app environment first. Implement enough to validate the path:

documents.read + email.save_draft + run namespace + tool/state logs

Then add calendar/inventory for session_03. After that, add interaction tools. Only then should memory backend comparisons become the main implementation focus.
