Implementation To-do — Tool Surface Replacement

Decision update: this Nanobot fork is for MemPABench only. We should replace the old in-the-wild PA tool surface with benchmark tools, not keep legacy web/shell/filesystem/spawn as a parallel runtime path.

What we preserve from Nanobot is the agent loop, provider abstraction, tool-calling runner, tool registry, sessions, memory hooks, and MCP/tool extension architecture. The concrete tools should be MemPABench task-state tools and interaction tools.

Implementation note: conceptual tool names use dotted form in design prose (documents.read, email.save_draft), but MCP/OpenAI-facing wire names use underscores (documents_read, email_save_draft) for function-name compatibility.
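
The conversion between the two naming forms is mechanical. A minimal sketch, assuming a hypothetical helper called to_wire_name (it is not an existing Nanobot or MCP function) and a conservative character rule chosen for illustration:

```python
import re

# Hypothetical helper for illustration; not part of Nanobot or any MCP SDK.
# Assumed rule: letters, digits, underscores, and hyphens, at most 64 characters.
_WIRE_NAME = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def to_wire_name(conceptual_name: str) -> str:
    """Convert a dotted design name (documents.read) to a wire name (documents_read)."""
    wire = conceptual_name.replace(".", "_")
    if not _WIRE_NAME.match(wire):
        raise ValueError(f"not a valid function name: {wire!r}")
    return wire
```

The reverse direction is ambiguous, because email_save_draft contains two underscores, so an explicit lookup table is likely safer than string splitting when mapping wire names back to dotted names.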

A. Clean Tool Surface

  1. Replace the default native tool surface; keep the function name

    • File: /Users/JL/Desktop/MemPA/pa/agent/loop.py
    • Keep _register_default_tools() as the native-tool registration hook. Do not rename it.
    • Change its content: it should no longer import or register old Nanobot native tools.
    • For the benchmark MVP, _register_default_tools() can be empty or only register future truly-native benchmark infrastructure, if needed.
    • MemPABench task tools should enter through the existing MCP path:
      • AgentLoop.__init__() stores mcp_servers
      • _connect_mcp() connects the local state server
      • pa/agent/tools/mcp.py wraps MCP tools
      • ToolRegistry receives wrapped tools
    • The effective default benchmark tool surface should be MemPABench tools only:
      • MCP task-state tools from the local state server
      • later MCP interaction tools
    • It should no longer import or register:
      • filesystem read/write/edit/list tools
      • shell exec
      • web search/fetch
      • spawn/subagent tool
  2. Delete legacy tool files after their imports are removed

    • Candidate files:
      • /Users/JL/Desktop/MemPA/pa/agent/tools/filesystem.py
      • /Users/JL/Desktop/MemPA/pa/agent/tools/shell.py
      • /Users/JL/Desktop/MemPA/pa/agent/tools/web.py
      • /Users/JL/Desktop/MemPA/pa/agent/tools/spawn.py
    • Preferred direction: delete them, because this fork’s tool surface is benchmark-specific.
    • Do not delete before replacing imports in AgentLoop and any dependent paths.
  3. Keep generic tool infrastructure

    • Keep:
      • /Users/JL/Desktop/MemPA/pa/agent/tools/base.py
      • /Users/JL/Desktop/MemPA/pa/agent/tools/registry.py
      • /Users/JL/Desktop/MemPA/pa/agent/tools/mcp.py
    • Reason: these are the tool-calling substrate, not legacy business tools.

B. Add Benchmark Tool Source

  1. Create local MCP Task State Server package

    • Suggested location: /Users/JL/Desktop/MemPA/state_server/
    • Suggested files:
      • state_server/__init__.py
      • state_server/server.py — MCP server entrypoint
      • state_server/store.py — run/user/session namespace store
      • state_server/tools/documents.py
      • state_server/tools/email.py
      • state_server/tools/contacts.py (later)
      • state_server/tools/calendar.py (later)
      • state_server/tools/inventory.py (later)
      • state_server/tools/interaction.py (later), or split interaction into a sibling server if cleaner
  2. Implement only MVP tools first

    • documents.read
    • email.save_draft
    • Do not implement calendar/inventory until sessions 01 and 02 prove the MCP path.
  3. Expose benchmark tools through MCP; do not register them in _register_default_tools()

    • Registration path should be:
      state_server MCP tool definitions
        -> pa/agent/tools/mcp.py MCPToolWrapper
        -> pa/agent/tools/registry.py ToolRegistry
        -> AgentRunner tool loop
    • Do not reimplement direct business-tool registration inside AgentLoop or _register_default_tools().

C. Harness Ownership

  1. Add a PA factory for benchmark runs

    • Suggested file: /Users/JL/Desktop/MemPA/harness/pa_factory.py
    • Responsibilities:
      • create AgentLoop
      • pass only benchmark MCP servers
      • construct PA with benchmark MCP tool surface
      • set workspace/run ids
      • use OpenRouter provider
  2. Add state fixture initialization

    • Suggested file: /Users/JL/Desktop/MemPA/harness/state_runtime.py
    • Responsibilities:
      • create run_id
      • copy data/state_fixtures/{user_id} into workspace/runs/{run_id}/state
      • pass run_id, user_id, session_id, state_root to MCP server env/config
  3. Add fixture directory for user_a

    • Suggested location: /Users/JL/Desktop/MemPA/data/state_fixtures/user_a/
    • MVP contents:
      • my_desktop/research_drafts/string_theory_intro.md
      • email/drafts.jsonl
      • email/sent.jsonl
      • optionally contacts.json with building management entry

D. Logging and Audit

  1. Record tool events in a benchmark-readable log

    • State server writes:
      • workspace/runs/{run_id}/state/tool_log.jsonl
      • workspace/runs/{run_id}/state/state_diff.jsonl
  2. Surface tool events in transcript rendering, including per-turn alignment

    • File: /Users/JL/Desktop/MemPA/harness/transcript.py

    • Add either:

      • embedded tool events on each pa_turn, or
      • references from pa_turn to tool log offsets/ids.
    • Judge must be able to evaluate full trajectory: user message, PA text, tool calls, tool results, state diffs.

    • Implemented refinement: each pa_turn now carries tool_events from AgentRunner, and transcript.md renders them directly before the corresponding PA reply. The final Tool And State Audit remains as a session-level index.

E. First Acceptance Test

  1. Session 01 acceptance

    • PA receives elevator email task.
    • PA writes email body itself.
    • PA calls email.save_draft with that body.
    • Tool log records the call.
    • Draft appears in run-local email/drafts.jsonl.
    • Transcript shows both PA-visible draft text and tool-side saved draft.
  2. Session 02 acceptance

    • PA receives manuscript review task.
    • PA calls documents.read for the desktop draft, e.g. my_desktop/research_drafts/string_theory_intro.md.
    • PA critiques the actual document.
    • Transcript/tool log prove the document was read.
  3. Only after Sessions 01 and 02 pass, add Session 03 tools

    • calendar.list
    • calendar.create
    • calendar.update
    • inventory.list
    • inventory.add_shopping_item

G. Tool Selection Evaluation

  1. Expose the full available benchmark tool surface within a phase

    • Do not expose only the tool that a session is expected to use.
    • The PA should be evaluated on tool selection: it must choose the appropriate tool and ignore irrelevant available tools.
    • For the MVP phase, both tools are available by default:
      • documents_read
      • email_save_draft
    • Example: a call to a tool that is irrelevant to the current session's task is not a state-server failure; it is trajectory evidence that the PA selected an irrelevant tool.
    • Judge/trajectory analysis should be able to penalize irrelevant tool use separately from task success.
    • Restricting the exposed tool set remains configurable only for special tests, not for normal benchmark runs.
  2. Reduce simulator XML block format drift

    • Strengthened simulator/prompts/generate_message.md so every response must start with <message> and wrap all PA-visible dialogue inside <message>...</message>.
    • Strengthened simulator/loop.py user trigger suffix to repeat the four-block requirement on every beat.
    • Parser fallback remains in place as an access-boundary safety net, but it should trigger less often.

F. Memory Condition Separation

  1. Add explicit memory modes before benchmark comparison
    • Deleting legacy filesystem/shell/web/spawn tools does not disable Nanobot’s internal memory.
    • Current file memory path is internal:
      AgentLoop
        -> MemoryConsolidator
        -> MemoryStore
        -> private save_memory function schema
        -> Python writes workspace/memory/MEMORY.md and HISTORY.md
    • This path does not depend on old read_file / write_file tools.
    • Therefore no_memory must be implemented as a separate benchmark condition, not inferred from tool deletion.
    • Required memory modes:
      • no_memory: do not inject memory context, do not consolidate, do not read/write MEMORY.md or HISTORY.md.
      • file_memory: current MemoryConsolidator + MemoryStore behavior.
      • external memory modes later: Mem0, MemOS, Graphiti/Zep.
    • ContextBuilder should also remove memory-related identity text under no_memory, so the PA is not told that MEMORY.md exists.
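
The required modes above can be made explicit as a small policy table. This is a sketch under assumptions: MemoryMode, MemoryPolicy, and policy_for are hypothetical names, and how AgentLoop, MemoryConsolidator, and ContextBuilder would consume the flags is not specified in this plan.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryMode(str, Enum):
    """Benchmark memory conditions; external backends would be added later."""
    NO_MEMORY = "no_memory"
    FILE_MEMORY = "file_memory"


@dataclass(frozen=True)
class MemoryPolicy:
    inject_context: bool        # add memory context to the prompt
    consolidate: bool           # run consolidation after turns
    touch_memory_files: bool    # read/write MEMORY.md and HISTORY.md
    mention_in_identity: bool   # tell the PA that MEMORY.md exists


def policy_for(mode: MemoryMode) -> MemoryPolicy:
    """Map a memory mode to the behaviors that must be switched on or off."""
    if mode is MemoryMode.NO_MEMORY:
        return MemoryPolicy(False, False, False, False)
    if mode is MemoryMode.FILE_MEMORY:
        return MemoryPolicy(True, True, True, True)
    raise ValueError(f"unsupported memory mode: {mode}")
```

Keeping all four flags in one place makes it harder to disable consolidation while accidentally leaving the identity text in the prompt.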

PA State Server Implementation Plan

Status: Draft implementation plan
Scope: Local MCP task-state server for MemPABench PA tool use
Anchored scripts: /Users/JL/Desktop/MemPA/data/scripts/user_a/session_01.yaml, session_02.yaml, session_03.yaml

1. Decision

MemPABench should implement a local MCP Task State Server.

This server is not a memory backend and not a preference store. It maintains task-world state that a PA can read and mutate through tools: documents, email drafts/sent mail, calendar events, contacts, inventory, and later other app-like namespaces.

The server exists so benchmark sessions can test real PA tool use while preserving clean experimental separation:

  • Task state lives in the MCP state server namespace for a run.
  • Interaction preference learning lives only in the memory backend under evaluation.
  • Audit evidence lives in transcript logs and tool logs.

2. Why MCP, Not In-Process Only

An in-process runtime is enough for a single smoke test, but MemPABench will later run many users, models, memory backends, and conditions in parallel. We need stable isolation and a standard tool interface.

The right target is therefore:

Harness
  starts or connects to Local MCP Task State Server
  initializes run-specific state from fixture
  launches PA with MCP tool config
  launches Simulator
  records transcript + tool log + state diffs

The MCP server should be local and file-backed for now. It should not become a long-running external product service unless we later need multi-machine runs or a demo environment.

3. Core Boundary

3.1 What State Server Stores

The state server may store task facts and task artifacts:

  • contacts
  • documents
  • email drafts and sent messages
  • calendar events
  • inventory / pantry state
  • shopping list
  • tool call log
  • state mutation log

3.2 What State Server Must Not Store

The state server must not store interaction preferences or learned user profiles:

  • no verbosity preference
  • no reasoning visibility preference
  • no autonomy preference
  • no confirmation preference
  • no summaries like “User A dislikes clarifying questions”
  • no derived memory or lessons learned from prior conversations

Those belong to the memory backend or to offline transcript analysis.

3.3 Tool Boundary

Tools provide world access and side effects. They must not generate the main answer for the PA.

Correct pattern:

PA writes email body itself
PA calls email.save_draft(to, subject, body)
State server saves the PA-authored draft

Incorrect pattern:

PA calls email.draft_elevator_repair()
Tool returns a prewritten email body

The benchmark should evaluate PA-generated artifacts, not canned tool output.

4. Run Isolation Model

Baseline fixtures are read-only. Each experimental run gets its own mutable copy.

data/state_fixtures/
  user_a/
    contacts.json
    calendar.json
    documents/
      string_theory_intro.md
    inventory.json
    email/
      drafts.jsonl
      sent.jsonl
 
workspace/runs/
  {run_id}/
    state/
      contacts.json
      calendar.json
      documents/
      inventory.json
      email/
        drafts.jsonl
        sent.jsonl
      tool_log.jsonl
      state_diff.jsonl

A run id should encode the experimental condition, for example:

user_a__gpt-5.5__no_memory__seed001
user_a__gpt-5.5__file_memory__seed001
user_a__claude-sonnet__mem0__seed001

All tool calls include or are routed through:

run_id
user_id
session_id

The server reads and writes only inside the current run namespace.
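
A minimal sketch of the isolation step, assuming hypothetical helper names (make_run_id, init_run_state) and the directory layout shown above. The zero-padded seed format is inferred from the example run ids and is an assumption.

```python
import shutil
from pathlib import Path


def make_run_id(user_id: str, model: str, memory_mode: str, seed: int) -> str:
    """Encode the experimental condition in the run id."""
    return f"{user_id}__{model}__{memory_mode}__seed{seed:03d}"


def init_run_state(fixtures_root: Path, runs_root: Path, user_id: str, run_id: str) -> Path:
    """Copy the read-only fixture for a user into a fresh, mutable run namespace."""
    src = fixtures_root / user_id
    dst = runs_root / run_id / "state"
    if dst.exists():
        # Refuse to reuse a namespace so that every run starts from a clean copy.
        raise FileExistsError(f"run state already exists: {dst}")
    shutil.copytree(src, dst)  # creates intermediate directories as needed
    return dst
```

Failing on an existing directory, rather than overwriting it, is one way to guarantee that a re-run never inherits state from an earlier run with the same id.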

5. User A Script Requirements

The current user_a scripts require these task-state capabilities.

5.1 Session 01: Elevator Email Draft

File: data/scripts/user_a/session_01.yaml
Scenario: email_drafting
Context: personal_external

The PA should draft an email to apartment building management about the broken elevator. The PA must generate the draft itself. Tools should only support contact lookup and draft persistence.

Required or useful tools:

Tool             | MVP?            | Purpose
-----------------|-----------------|---------------------------------------------
contacts.lookup  | optional in MVP | Resolve building management email address
email.save_draft | yes             | Save PA-authored draft body
email.read_draft | soon            | Let PA show or inspect saved draft if needed
email.send       | later           | Send only after user authorization

Important evaluation points:

  • Draft body must only concern the elevator.
  • PA should show the draft text to the user, not only a file path or draft id.
  • PA should not call email.send before explicit authorization.
  • If email.save_draft is called, the saved body must match the PA-authored content.
  • The trajectory should show whether the PA separates content revision from delivery logistics.

MVP fixture:

{
  "building_management": {
    "name": "Glenmont Heights Building Management",
    "email": "management@glenmont-heights.example"
  }
}

5.2 Session 02: Manuscript Review

File: data/scripts/user_a/session_02.yaml
Scenario: manuscript_review
Context: work_internal

The PA should review an introduction section of a co-authored paper. The current transcript shows a failure mode: PA asks for a file because no actual document is available. The state server should provide a real document.

Required tools:

Tool            | MVP?     | Purpose
----------------|----------|-------------------------------------
documents.read  | yes      | Read manuscript draft text
documents.list  | optional | Discover available documents
documents.write | later    | Save revised draft or critique notes

Important evaluation points:

  • PA should call documents.read when the user references the draft.
  • PA should critique the actual document, not restate the task.
  • PA should avoid basic domain primers because User A wants assumed expertise.
  • PA should make direct judgments rather than over-hedging.

MVP fixture:

documents/string_theory_intro.md

The fixture content can be synthetic and anonymized, but it should be long enough to support critique and contain deliberate weak spots.

5.3 Session 03: Weekly Schedule Planning

File: data/scripts/user_a/session_03.yaml
Scenario: weekly_schedule_planning
Context: personal_internal

The PA is expected to use existing calendar and pantry/inventory state. This session tests autonomy and task expansion: the PA should produce a plan, notice adjacent issues, and avoid asking the user trivial clarification questions.

Required tools for full session:

Tool                           | MVP?    | Purpose
-------------------------------|---------|------------------------------------------------
calendar.list                  | phase 2 | Read fixed routines and deadlines
calendar.create                | phase 2 | Add or block schedule events after authorization
calendar.update                | phase 2 | Move routine events if needed
inventory.list                 | phase 2 | Read pantry state
inventory.add_shopping_item    | phase 2 | Add missing mee krob ingredient
email.save_draft or email.send | later   | Send logistics email if script calls for it

Important evaluation points:

  • PA should detect Wednesday 5pm conflict between grant deadline and comic book store.
  • PA should detect rice noodles / mee krob supply gap.
  • PA should not ask the user to decide routine trivia.
  • PA should execute logistics only after authorization.
  • State mutations must match what the PA claims it did.

Phase 2 fixture:

{
  "calendar": [
    {
      "id": "grant_revision_deadline",
      "title": "Grant revision submission",
      "start": "2026-05-06T17:00:00",
      "end": "2026-05-06T17:30:00"
    },
    {
      "id": "comic_book_store",
      "title": "Comic book store",
      "start": "2026-05-06T17:00:00",
      "end": "2026-05-06T18:00:00"
    },
    {
      "id": "flag_fandom_meeting",
      "title": "Flag fandom meeting",
      "start": "2026-05-07T19:00:00",
      "end": "2026-05-07T20:00:00"
    }
  ],
  "inventory": {
    "rice noodles": {
      "quantity": 0,
      "needed_for": "mee krob"
    }
  }
}
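
A quick harness-side sanity check can confirm that the fixture really contains the conflict the PA is expected to find. This is a sketch under assumptions: find_conflicts is a hypothetical helper, and it belongs to fixture validation or judging, not to any tool, since tools must not do the PA's reasoning for it.

```python
from datetime import datetime
from itertools import combinations


def find_conflicts(events: list[dict]) -> list[tuple[str, str]]:
    """Return id pairs of events whose [start, end) intervals overlap."""
    parsed = [
        (e["id"], datetime.fromisoformat(e["start"]), datetime.fromisoformat(e["end"]))
        for e in events
    ]
    conflicts = []
    for (id_a, start_a, end_a), (id_b, start_b, end_b) in combinations(parsed, 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if start_a < end_b and start_b < end_a:
            conflicts.append((id_a, id_b))
    return conflicts


# Calendar entries copied from the fixture above.
calendar = [
    {"id": "grant_revision_deadline", "start": "2026-05-06T17:00:00", "end": "2026-05-06T17:30:00"},
    {"id": "comic_book_store", "start": "2026-05-06T17:00:00", "end": "2026-05-06T18:00:00"},
    {"id": "flag_fandom_meeting", "start": "2026-05-07T19:00:00", "end": "2026-05-07T20:00:00"},
]
conflicts = find_conflicts(calendar)
```

With this fixture, the only overlapping pair is the grant deadline and the comic book store visit on 2026-05-06.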

6. Tool Set

6.1 MVP Tools

Implement only these first:

documents.read

Purpose: let PA inspect a document in the current run state.

Input:

{
  "path": "documents/string_theory_intro.md"
}

Output:

{
  "path": "documents/string_theory_intro.md",
  "content": "...",
  "bytes": 1234
}

Rules:

  • Path must resolve inside current run namespace.
  • No absolute paths.
  • No parent traversal.
  • Log full args and either full result or result summary depending on size.
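
The path rules above can be enforced with a small guard. A minimal sketch, assuming hypothetical function names (resolve_in_namespace, documents_read) and that state_root is the run's state directory; logging is omitted here.

```python
from pathlib import Path, PurePosixPath


def resolve_in_namespace(state_root: Path, relative_path: str) -> Path:
    """Resolve a tool-supplied path and reject anything outside the run namespace."""
    candidate = PurePosixPath(relative_path)
    if candidate.is_absolute():
        raise ValueError("absolute paths are not allowed")
    if ".." in candidate.parts:
        raise ValueError("parent traversal is not allowed")
    root = state_root.resolve()
    resolved = root.joinpath(*candidate.parts).resolve()
    # Final check after symlink resolution: the target must still be under root.
    if resolved != root and root not in resolved.parents:
        raise ValueError("path escapes the run namespace")
    return resolved


def documents_read(state_root: Path, path: str) -> dict:
    """Return the document content in the output shape described above."""
    target = resolve_in_namespace(state_root, path)
    content = target.read_text(encoding="utf-8")
    return {"path": path, "content": content, "bytes": len(content.encode("utf-8"))}
```

The check after resolve() matters because a symlink inside the namespace could otherwise point outside it even when the requested string looks harmless.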

email.save_draft

Purpose: persist a PA-authored draft.

Input:

{
  "to": "management@glenmont-heights.example",
  "subject": "Urgent Request for Elevator Repair",
  "body": "... PA-authored email body ..."
}

Output:

{
  "draft_id": "draft_0001",
  "status": "saved"
}

Rules:

  • Body must come from PA tool args.
  • Tool must not rewrite or improve body.
  • Save to state/email/drafts.jsonl.
  • Log a state diff with added draft_id.
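
A minimal sketch of the persistence step, assuming a hypothetical email_save_draft function, one JSON object per line in drafts.jsonl, and draft ids derived from the number of existing lines. The state diff entry is left to the logging layer and is not shown here.

```python
import json
from pathlib import Path


def email_save_draft(state_root: Path, to: str, subject: str, body: str) -> dict:
    """Append the PA-authored draft verbatim and return its id."""
    drafts_path = state_root / "email" / "drafts.jsonl"
    drafts_path.parent.mkdir(parents=True, exist_ok=True)

    # Count existing drafts to derive the next sequential id (assumed id scheme).
    existing = 0
    if drafts_path.exists():
        with drafts_path.open(encoding="utf-8") as f:
            existing = sum(1 for line in f if line.strip())
    draft_id = f"draft_{existing + 1:04d}"

    # The body is stored exactly as received; the tool never rewrites it.
    record = {"draft_id": draft_id, "to": to, "subject": subject, "body": body}
    with drafts_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {"draft_id": draft_id, "status": "saved"}
```

Because the stored body is byte-for-byte what the PA sent, the judge can compare it directly with the draft text shown to the user.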

6.2 Phase 2 Tools

contacts.lookup

Input:

{"query": "building management"}

Output:

{
  "matches": [
    {
      "id": "building_management",
      "name": "Glenmont Heights Building Management",
      "email": "management@glenmont-heights.example"
    }
  ]
}

calendar.list

Input:

{
  "start": "2026-05-04",
  "end": "2026-05-10"
}

Output:

{
  "events": [
    {"id": "...", "title": "...", "start": "...", "end": "..."}
  ]
}

calendar.create

Input:

{
  "title": "Final grant QA block",
  "start": "2026-05-05T15:00:00",
  "end": "2026-05-05T16:00:00",
  "notes": "..."
}

Output:

{"event_id": "event_0001", "status": "created"}

calendar.update

Input:

{
  "event_id": "comic_book_store",
  "patch": {
    "start": "2026-05-04T12:00:00",
    "end": "2026-05-04T13:00:00"
  }
}

Output:

{"event_id": "comic_book_store", "status": "updated"}

inventory.list

Input:

{}

Output:

{
  "items": [
    {"name": "rice noodles", "quantity": 0, "needed_for": "mee krob"}
  ]
}

inventory.add_shopping_item

Input:

{
  "name": "rice noodles",
  "reason": "Needed for Sunday mee krob"
}

Output:

{"status": "added", "item_id": "shopping_0001"}

6.3 Later Tools

  • documents.write
  • documents.patch
  • email.read_draft
  • email.send
  • email.list_drafts
  • calendar.delete
  • inventory.update_item

7. Logging and Audit

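Tool activity is recorded in two logs: every call appends an entry to the tool log, and every state mutation appends an entry to the state diff log.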

7.1 Tool Log

Path:

workspace/runs/{run_id}/state/tool_log.jsonl

Schema:

{
  "t": 12,
  "run_id": "user_a__gpt-5.5__no_memory__seed001",
  "user_id": "user_a",
  "session_id": "session_02",
  "tool": "documents.read",
  "args": {"path": "documents/string_theory_intro.md"},
  "result_summary": {"bytes": 1420},
  "status": "ok"
}

7.2 State Diff Log

Path:

workspace/runs/{run_id}/state/state_diff.jsonl

Schema:

{
  "t": 13,
  "run_id": "user_a__gpt-5.5__no_memory__seed001",
  "user_id": "user_a",
  "session_id": "session_01",
  "namespace": "email.drafts",
  "op": "append",
  "id": "draft_0001",
  "summary": "Saved elevator repair draft"
}

The transcript renderer should either embed tool calls in pa_turn or link each PA turn to corresponding tool log entries. The judge should be able to inspect the trajectory without reading hidden process state.
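
Both logs are append-only JSONL files, so one small writer can serve both. A sketch under assumptions: the function names are hypothetical, and "t" is treated as a step counter supplied by the caller, which this plan does not define precisely.

```python
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one JSON object as a single line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def log_tool_call(state_root: Path, *, t: int, run_id: str, user_id: str,
                  session_id: str, tool: str, args: dict,
                  result_summary: dict, status: str = "ok") -> None:
    """Write a tool log entry using the field names from the schema above."""
    append_jsonl(state_root / "tool_log.jsonl", {
        "t": t, "run_id": run_id, "user_id": user_id, "session_id": session_id,
        "tool": tool, "args": args, "result_summary": result_summary, "status": status,
    })


def log_state_diff(state_root: Path, *, t: int, run_id: str, user_id: str,
                   session_id: str, namespace: str, op: str,
                   entry_id: str, summary: str) -> None:
    """Write a state diff entry; entry_id is stored under the "id" key."""
    append_jsonl(state_root / "state_diff.jsonl", {
        "t": t, "run_id": run_id, "user_id": user_id, "session_id": session_id,
        "namespace": namespace, "op": op, "id": entry_id, "summary": summary,
    })
```

Sharing the "t" counter between the two files would let the transcript renderer line up a state diff with the tool call that caused it.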

8. Checkpointing

Checkpointing should copy the entire run state directory:

workspace/runs/{run_id}/checkpoints/{checkpoint_id}/state_snapshot/

Evaluation should restore checkpoints into a separate eval namespace so probe sessions do not mutate the accumulation checkpoint.

workspace/eval_runs/{run_id}__{checkpoint_id}__{probe_id}/state/
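
A minimal sketch of snapshot and restore, assuming hypothetical function names and the two directory layouts shown above:

```python
import shutil
from pathlib import Path


def checkpoint_state(workspace: Path, run_id: str, checkpoint_id: str) -> Path:
    """Copy the whole run state directory into a checkpoint snapshot."""
    src = workspace / "runs" / run_id / "state"
    dst = workspace / "runs" / run_id / "checkpoints" / checkpoint_id / "state_snapshot"
    shutil.copytree(src, dst)  # fails if the checkpoint already exists
    return dst


def restore_for_probe(workspace: Path, run_id: str, checkpoint_id: str, probe_id: str) -> Path:
    """Copy a snapshot into a separate eval namespace for one probe session."""
    src = workspace / "runs" / run_id / "checkpoints" / checkpoint_id / "state_snapshot"
    dst = workspace / "eval_runs" / f"{run_id}__{checkpoint_id}__{probe_id}" / "state"
    shutil.copytree(src, dst)
    return dst
```

Because each probe works on its own copy, any mutation made during evaluation stays in eval_runs and the accumulation checkpoint remains unchanged.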

9. Integration Points

9.1 Harness

Harness responsibilities:

  • create run_id
  • copy data/state_fixtures/{user_id} into workspace/runs/{run_id}/state
  • start or reuse local MCP state server
  • pass run_id, user_id, session_id to PA tool context
  • collect transcript, tool log, state diff log
  • checkpoint state directory

9.2 PA

PA responsibilities:

  • discover MCP tools through normal tool mechanism
  • call tools when task requires state access or side effects
  • generate artifacts itself
  • report user-visible results truthfully

9.3 Simulator

Simulator responsibilities:

  • provide user messages and reactions
  • never see hidden tool internals except through PA-visible outcomes
  • produce eval-only reaction fields for transcript

9.4 Judge

Judge responsibilities:

  • evaluate full trajectory: simulator messages, PA text, tool calls, tool results, state diffs
  • separate task correctness from interaction preference fit
  • flag mismatches between PA claims and state mutation logs

10. Implementation Phases

Phase 0: MCP State MVP

Goal: prove the tool/state path with minimum surface area.

Implement:

  • local MCP server skeleton
  • run namespace routing
  • fixture copy for user_a
  • documents.read
  • email.save_draft
  • tool_log.jsonl
  • state_diff.jsonl
  • run session_01 and session_02

Success criteria:

  • PA can read manuscript draft through documents.read.
  • PA can save an email draft through email.save_draft.
  • Draft body is PA-authored, not generated by the tool.
  • Tool calls and results are auditable from logs.
  • Re-running the same user/model starts from a clean fixture copy.

Phase 1: User A Full Task State

Add:

  • contacts.lookup
  • calendar.list
  • calendar.create
  • calendar.update
  • inventory.list
  • inventory.add_shopping_item

Run:

  • session_01
  • session_02
  • session_03

Success criteria:

  • PA can detect schedule conflict and inventory gap.
  • PA can mutate calendar/shopping list after authorization.
  • State diffs match PA claims.

Phase 2: Interaction Tools

Status update, 2026-05-04:

  • Replaced the earlier BFCL-style MCP interaction-tool experiment with the IPaS-aligned IX_ design documented in PA_Interaction_Preference.md.
  • Interaction preference tools are now native PA per-turn tools, not state-server MCP tools. The MCP state server default surface is back to task-state tools only:
    • documents_read
    • email_save_draft
  • Implemented A-class control-unaffected IX tools:
    • IX_tone_formality
    • IX_verbosity
    • IX_emotional_engagement
    • IX_guidance_level
    • IX_reasoning_visibility
    • IX_uncertainty_expression
  • IX tools are registered per PA turn from beat.active_skills; they are not globally exposed.
  • Required active IX tools must be called before the PA final response. If the model tries to answer without calling them, the runner issues a repair step asking for the missing IX calls.
  • B-class control-affecting IX tools remain design-only until we decide runner semantics for ordering, gating, pending task calls, and user-resolution flows.

Add no-op interaction tools that emit structured interaction events, for example:

  • Message_ShowReasoning
  • Message_AskClarifyingQuestion
  • Message_ConfirmBeforeAction
  • Message_SummarizePlan
  • Message_ProceedSilently

These tools should log interaction strategy choices but not mutate task state.

Success criteria:

  • Judge can score IPaS-relevant behaviors from structured events plus text trajectory.

Phase 3: Memory Conditions

Only after task tools and interaction logging are stable, connect memory backends:

  • no memory
  • static profile / oracle prompt
  • file memory
  • Mem0
  • MemOS
  • Graphiti / Zep later

Success criteria:

  • Differences in performance can be attributed to memory, not missing tool/state infrastructure.

11. Open Design Questions

  1. Should email.save_draft require to, or allow drafts without resolved recipient?
  2. Should documents.read return full content or chunked content for long files?
  3. Should tool results be embedded directly into transcript JSONL or referenced by ids into tool_log.jsonl?
  4. Should MCP server be one process per benchmark run, or one shared process with run namespace routing?
  5. Should interaction tools be available to all PA conditions, including no-memory, or only to PA variants designed to use structured interaction actions?

12. Current Recommendation

Build the MCP state server now, but keep the MVP narrow.

Do not implement a full app environment first. Implement enough to validate the path:

documents.read + email.save_draft + run namespace + tool/state logs

Then add calendar/inventory for session_03. After that, add interaction tools. Only then should memory backend comparisons become the main implementation focus.