Implementation To-do — Tool Surface Replacement
Decision update: this Nanobot fork is for MemPABench only. We should replace the old in-the-wild PA tool surface with benchmark tools, not keep legacy web/shell/filesystem/spawn as a parallel runtime path.
What we preserve from Nanobot is the agent loop, provider abstraction, tool-calling runner, tool registry, sessions, memory hooks, and MCP/tool extension architecture. The concrete tools should be MemPABench task-state tools and interaction tools.
Implementation note: conceptual tool names use dotted form in design prose (`documents.read`, `email.save_draft`), but MCP/OpenAI-facing wire names use underscores (`documents_read`, `email_save_draft`) for function-name compatibility.
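The mapping between the two naming forms can be sketched as below. This is a minimal illustration, assuming a plain dot-to-underscore substitution; `to_wire_name` and `build_wire_index` are hypothetical helper names, not existing Nanobot or MCP functions. The reverse direction uses an explicit index because an underscore in a wire name is ambiguous (`email_save_draft` could be split in more than one place).

```python
def to_wire_name(design_name: str) -> str:
    """Convert a dotted design name into an underscore wire name."""
    return design_name.replace(".", "_")


def build_wire_index(design_names: list[str]) -> dict[str, str]:
    """Map wire names back to design names, rejecting collisions."""
    index: dict[str, str] = {}
    for name in design_names:
        wire = to_wire_name(name)
        if wire in index:
            # Two design names collapsing to one wire name would be ambiguous.
            raise ValueError(f"wire name collision: {wire}")
        index[wire] = name
    return index


if __name__ == "__main__":
    index = build_wire_index(["documents.read", "email.save_draft"])
    print(index["email_save_draft"])
```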
A. Clean Tool Surface
- Replace the default native tool surface, keep the function name ✅
  - File: `/Users/JL/Desktop/MemPA/pa/agent/loop.py`
  - Keep `_register_default_tools()` as the native-tool registration hook. Do not rename it.
  - Change its content: it should no longer import or register old Nanobot native tools.
  - For the benchmark MVP, `_register_default_tools()` can be empty or only register future truly-native benchmark infrastructure, if needed.
  - MemPABench task tools should enter through the existing MCP path:
    - `AgentLoop.__init__()` stores `mcp_servers`
    - `_connect_mcp()` connects the local state server
    - `pa/agent/tools/mcp.py` wraps MCP tools
    - `ToolRegistry` receives wrapped tools
  - The effective default benchmark tool surface should be MemPABench tools only:
    - MCP task-state tools from the local state server
    - later MCP interaction tools
  - It should no longer import or register:
    - filesystem read/write/edit/list tools
    - shell exec
    - web search/fetch
    - spawn/subagent tool
- Delete legacy tool files after their imports are removed ✅
  - Candidate files:
    - `/Users/JL/Desktop/MemPA/pa/agent/tools/filesystem.py`
    - `/Users/JL/Desktop/MemPA/pa/agent/tools/shell.py`
    - `/Users/JL/Desktop/MemPA/pa/agent/tools/web.py`
    - `/Users/JL/Desktop/MemPA/pa/agent/tools/spawn.py`
  - Preferred direction: delete them, because this fork’s tool surface is benchmark-specific.
  - Do not delete before replacing imports in `AgentLoop` and any dependent paths.
- Keep generic tool infrastructure ✅
  - Keep:
    - `/Users/JL/Desktop/MemPA/pa/agent/tools/base.py`
    - `/Users/JL/Desktop/MemPA/pa/agent/tools/registry.py`
    - `/Users/JL/Desktop/MemPA/pa/agent/tools/mcp.py`
  - Reason: these are the tool-calling substrate, not legacy business tools.
B. Add Benchmark Tool Source
- Create local MCP Task State Server package ✅
  - Suggested location: `/Users/JL/Desktop/MemPA/state_server/`
  - Suggested files:
    - `state_server/__init__.py`
    - `state_server/server.py`: MCP server entrypoint
    - `state_server/store.py`: run/user/session namespace store
    - `state_server/tools/documents.py`
    - `state_server/tools/email.py`
    - `state_server/tools/contacts.py` later
    - `state_server/tools/calendar.py` later
    - `state_server/tools/inventory.py` later
    - `state_server/tools/interaction.py` later, or split interaction into a sibling server if cleaner
- Implement only MVP tools first ✅
  - `documents.read`
  - `email.save_draft`
  - Do not implement calendar/inventory until session 01 and 02 prove the MCP path.
- Expose benchmark tools through MCP; do not register them in `_register_default_tools()` ✅
  - Registration path should be:
    - state_server MCP tool definitions -> `pa/agent/tools/mcp.py` `MCPToolWrapper` -> `pa/agent/tools/registry.py` `ToolRegistry` -> `AgentRunner` tool loop
  - Do not reimplement direct business-tool registration inside `AgentLoop` or `_register_default_tools()`.
C. Harness Ownership
- Add a PA factory for benchmark runs ✅
  - Suggested file: `/Users/JL/Desktop/MemPA/harness/pa_factory.py`
  - Responsibilities:
    - create `AgentLoop`
    - pass only benchmark MCP servers
    - construct PA with benchmark MCP tool surface
    - set workspace/run ids
    - use OpenRouter provider
- Add state fixture initialization ✅
  - Suggested file: `/Users/JL/Desktop/MemPA/harness/state_runtime.py`
  - Responsibilities:
    - create `run_id`
    - copy `data/state_fixtures/{user_id}` into `workspace/runs/{run_id}/state`
    - pass `run_id`, `user_id`, `session_id`, `state_root` to MCP server env/config
- Add fixture directory for `user_a` ✅
  - Suggested location: `/Users/JL/Desktop/MemPA/data/state_fixtures/user_a/`
  - MVP contents:
    - `my_desktop/research_drafts/string_theory_intro.md`
    - `email/drafts.jsonl`
    - `email/sent.jsonl`
    - optionally `contacts.json` with building management entry
D. Logging and Audit
- Record tool events in a benchmark-readable log ✅
  - State server writes:
    - `workspace/runs/{run_id}/state/tool_log.jsonl`
    - `workspace/runs/{run_id}/state/state_diff.jsonl`
- Surface tool events in transcript rendering, including per-turn alignment ✅
  - File: `/Users/JL/Desktop/MemPA/harness/transcript.py`
  - Add either:
    - embedded tool events on each `pa_turn`, or
    - references from `pa_turn` to tool log offsets/ids.
  - Judge must be able to evaluate full trajectory: user message, PA text, tool calls, tool results, state diffs.
  - Implemented refinement: each `pa_turn` now carries `tool_events` from `AgentRunner`, and `transcript.md` renders them directly before the corresponding PA reply. The final `Tool And State Audit` remains as a session-level index.
E. First Acceptance Test
- Session 01 acceptance ✅
  - PA receives elevator email task.
  - PA writes email body itself.
  - PA calls `email.save_draft` with that body.
  - Tool log records the call.
  - Draft appears in run-local `email/drafts.jsonl`.
  - Transcript shows both PA-visible draft text and tool-side saved draft.
- Session 02 acceptance ✅
  - PA receives manuscript review task.
  - PA calls `documents.read` for the desktop draft, e.g. `my_desktop/research_drafts/string_theory_intro.md`.
  - PA critiques the actual document.
  - Transcript/tool log prove the document was read.
- Only after items 12-13 (the Session 01 and Session 02 acceptance checks above) pass, add Session 03 tools
  - `calendar.list`
  - `calendar.create`
  - `calendar.update`
  - `inventory.list`
  - `inventory.add_shopping_item`
G. Tool Selection Evaluation
- Expose the full available benchmark tool surface within a phase ✅
  - Do not expose only the tool that a session is expected to use.
  - The PA should be evaluated on tool selection: it must choose the appropriate tool and ignore irrelevant available tools.
- For the MVP phase, both tools are available by default:
- Example: in , calling is not a state-server failure; it is trajectory evidence that the PA selected an irrelevant tool.
- Judge/trajectory analysis should be able to penalize irrelevant tool use separately from task success.
- remains configurable only for special tests, not for normal benchmark runs.
- Reduce simulator XML block format drift ✅
  - Strengthened `simulator/prompts/generate_message.md` so every response must start with `<message>` and wrap all PA-visible dialogue inside `<message>...</message>`.
  - Strengthened `simulator/loop.py` user trigger suffix to repeat the four-block requirement on every beat.
  - Parser fallback remains in place as an access-boundary safety net, but it should trigger less often.
F. Memory Condition Separation
- Add explicit memory modes before benchmark comparison
  - Deleting legacy filesystem/shell/web/spawn tools does not disable Nanobot’s internal memory.
  - Current file memory path is internal:
    - `AgentLoop` -> `MemoryConsolidator` -> `MemoryStore` -> private `save_memory` function schema -> Python writes `workspace/memory/MEMORY.md` and `HISTORY.md`
  - This path does not depend on old `read_file`/`write_file` tools.
  - Therefore `no_memory` must be implemented as a separate benchmark condition, not inferred from tool deletion.
  - Required memory modes:
    - `no_memory`: do not inject memory context, do not consolidate, do not read/write `MEMORY.md` or `HISTORY.md`.
    - `file_memory`: current `MemoryConsolidator` + `MemoryStore` behavior.
    - external memory modes later: Mem0, MemOS, Graphiti/Zep.
  - `ContextBuilder` should also remove memory-related identity text under `no_memory`, so the PA is not told that `MEMORY.md` exists.
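One way to make the memory modes explicit is a small policy object that the loop and context builder consult. The sketch below is illustrative only: `MemoryMode`, `MemoryPolicy`, and `policy_for` are hypothetical names, and the four flags are an assumed decomposition of the requirements listed above, not existing Nanobot configuration.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryMode(Enum):
    NO_MEMORY = "no_memory"
    FILE_MEMORY = "file_memory"


@dataclass(frozen=True)
class MemoryPolicy:
    inject_context: bool        # add memory context to the prompt
    consolidate: bool           # run consolidation after turns
    touch_memory_files: bool    # read/write MEMORY.md and HISTORY.md
    mention_in_identity: bool   # tell the PA that MEMORY.md exists


def policy_for(mode: MemoryMode) -> MemoryPolicy:
    """Return the flags a run should honor for the given memory mode."""
    if mode is MemoryMode.NO_MEMORY:
        return MemoryPolicy(False, False, False, False)
    if mode is MemoryMode.FILE_MEMORY:
        return MemoryPolicy(True, True, True, True)
    raise ValueError(f"unsupported memory mode: {mode}")


if __name__ == "__main__":
    print(policy_for(MemoryMode("no_memory")))
```

Keeping the flags separate makes it easy to assert in tests that a `no_memory` run never touches the memory files, independent of which tools are registered.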
PA State Server Implementation Plan
Status: Draft implementation plan
Scope: Local MCP task-state server for MemPABench PA tool use
Anchored scripts: `/Users/JL/Desktop/MemPA/data/scripts/user_a/session_01.yaml`, `session_02.yaml`, `session_03.yaml`
1. Decision
MemPABench should implement a local MCP Task State Server.
This server is not a memory backend and not a preference store. It maintains task-world state that a PA can read and mutate through tools: documents, email drafts/sent mail, calendar events, contacts, inventory, and later other app-like namespaces.
The server exists so benchmark sessions can test real PA tool use while preserving clean experimental separation:
- Task state lives in the MCP state server namespace for a run.
- Interaction preference learning lives only in the memory backend under evaluation.
- Audit evidence lives in transcript logs and tool logs.
2. Why MCP, Not In-Process Only
An in-process runtime is enough for a single smoke test, but MemPABench will later run many users, models, memory backends, and conditions in parallel. We need stable isolation and a standard tool interface.
The right target is therefore:
Harness
  starts or connects to Local MCP Task State Server
  initializes run-specific state from fixture
  launches PA with MCP tool config
  launches Simulator
  records transcript + tool log + state diffs

The MCP server should be local and file-backed for now. It should not become a long-running external product service unless we later need multi-machine runs or a demo environment.
3. Core Boundary
3.1 What State Server Stores
The state server may store task facts and task artifacts:
- contacts
- documents
- email drafts and sent messages
- calendar events
- inventory / pantry state
- shopping list
- tool call log
- state mutation log
3.2 What State Server Must Not Store
The state server must not store interaction preferences or learned user profiles:
- no verbosity preference
- no reasoning visibility preference
- no autonomy preference
- no confirmation preference
- no summaries like “User A dislikes clarifying questions”
- no derived memory or lessons learned from prior conversations
Those belong to the memory backend or to offline transcript analysis.
3.3 Tool Boundary
Tools provide world access and side effects. They must not generate the main answer for the PA.
Correct pattern:
PA writes email body itself
PA calls email.save_draft(to, subject, body)
State server saves the PA-authored draft

Incorrect pattern:
PA calls email.draft_elevator_repair()
Tool returns a prewritten email body

The benchmark should evaluate PA-generated artifacts, not canned tool output.
4. Run Isolation Model
Baseline fixtures are read-only. Each experimental run gets its own mutable copy.
data/state_fixtures/
  user_a/
    contacts.json
    calendar.json
    documents/
      string_theory_intro.md
    inventory.json
    email/
      drafts.jsonl
      sent.jsonl
workspace/runs/
  {run_id}/
    state/
      contacts.json
      calendar.json
      documents/
      inventory.json
      email/
        drafts.jsonl
        sent.jsonl
      tool_log.jsonl
      state_diff.jsonl

A run id should encode the experimental condition, for example:
user_a__gpt-5.5__no_memory__seed001
user_a__gpt-5.5__file_memory__seed001
user_a__claude-sonnet__mem0__seed001

All tool calls include or are routed through:
run_id
user_id
session_id

The server reads and writes only inside the current run namespace.
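The isolation model above can be sketched with the standard library. This is a minimal illustration under stated assumptions: `make_run_id` and `init_run_state` are hypothetical helpers, the three-digit seed padding is inferred from the example ids, and refusing to reuse an existing run directory is one possible policy, not a decided one.

```python
import shutil
from pathlib import Path


def make_run_id(user_id: str, model: str, memory_mode: str, seed: int) -> str:
    """Encode the experimental condition in the run id."""
    return f"{user_id}__{model}__{memory_mode}__seed{seed:03d}"


def init_run_state(fixtures_root: Path, runs_root: Path,
                   user_id: str, run_id: str) -> Path:
    """Copy the read-only fixture into a fresh, mutable run namespace."""
    src = fixtures_root / user_id
    dst = runs_root / run_id / "state"
    if dst.exists():
        # Assumption: a run must start from a clean copy, never a reused one.
        raise FileExistsError(f"run state already exists: {dst}")
    shutil.copytree(src, dst)  # creates missing parent directories
    return dst


if __name__ == "__main__":
    print(make_run_id("user_a", "gpt-5.5", "no_memory", 1))
```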
5. User A Script Requirements
The current user_a scripts require these task-state capabilities.
5.1 Session 01: Elevator Email Draft
File: data/scripts/user_a/session_01.yaml
Scenario: email_drafting
Context: personal_external
The PA should draft an email to apartment building management about the broken elevator. The PA must generate the draft itself. Tools should only support contact lookup and draft persistence.
Required or useful tools:
| Tool | MVP? | Purpose |
|---|---|---|
| `contacts.lookup` | optional in MVP | Resolve building management email address |
| `email.save_draft` | yes | Save PA-authored draft body |
| `email.read_draft` | soon | Let PA show or inspect saved draft if needed |
| `email.send` | later | Send only after user authorization |
Important evaluation points:
- Draft body must only concern the elevator.
- PA should show the draft text to the user, not only a file path or draft id.
- PA should not call `email.send` before explicit authorization.
- If `email.save_draft` is called, the saved body must match the PA-authored content.
- The trajectory should show whether the PA separates content revision from delivery logistics.
MVP fixture:
{
"building_management": {
"name": "Glenmont Heights Building Management",
"email": "management@glenmont-heights.example"
}
}

5.2 Session 02: Manuscript Review
File: data/scripts/user_a/session_02.yaml
Scenario: manuscript_review
Context: work_internal
The PA should review an introduction section of a co-authored paper. The current transcript shows a failure mode: PA asks for a file because no actual document is available. The state server should provide a real document.
Required tools:
| Tool | MVP? | Purpose |
|---|---|---|
| `documents.read` | yes | Read manuscript draft text |
| `documents.list` | optional | Discover available documents |
| `documents.write` | later | Save revised draft or critique notes |
Important evaluation points:
- PA should call `documents.read` when the user references the draft.
- PA should critique the actual document, not restate the task.
- PA should avoid basic domain primers because User A wants assumed expertise.
- PA should make direct judgments rather than over-hedging.
MVP fixture:
documents/string_theory_intro.md

The fixture content can be synthetic and anonymized, but it should be long enough to support critique and contain deliberate weak spots.
5.3 Session 03: Weekly Schedule Planning
File: data/scripts/user_a/session_03.yaml
Scenario: weekly_schedule_planning
Context: personal_internal
The PA is expected to use existing calendar and pantry/inventory state. This session tests autonomy and task expansion: the PA should produce a plan, notice adjacent issues, and avoid asking the user trivial clarification questions.
Required tools for full session:
| Tool | MVP? | Purpose |
|---|---|---|
| `calendar.list` | phase 2 | Read fixed routines and deadlines |
| `calendar.create` | phase 2 | Add or block schedule events after authorization |
| `calendar.update` | phase 2 | Move routine events if needed |
| `inventory.list` | phase 2 | Read pantry state |
| `inventory.add_shopping_item` | phase 2 | Add missing mee krob ingredient |
| `email.save_draft` or `email.send` | later | Send logistics email if script calls for it |
Important evaluation points:
- PA should detect Wednesday 5pm conflict between grant deadline and comic book store.
- PA should detect rice noodles / mee krob supply gap.
- PA should not ask the user to decide routine trivia.
- PA should execute logistics only after authorization.
- State mutations must match what the PA claims it did.
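The Wednesday 5pm conflict is something a fixture validator or judge-side check could confirm mechanically. The sketch below is only an illustration: `find_conflicts` is a hypothetical helper, the event times are copied from the Phase 2 fixture in this section, and treating events as half-open intervals `[start, end)` is an assumption.

```python
from datetime import datetime
from itertools import combinations

# Event times copied from the Phase 2 calendar fixture.
EVENTS = [
    {"id": "grant_revision_deadline",
     "start": "2026-05-06T17:00:00", "end": "2026-05-06T17:30:00"},
    {"id": "comic_book_store",
     "start": "2026-05-06T17:00:00", "end": "2026-05-06T18:00:00"},
    {"id": "flag_fandom_meeting",
     "start": "2026-05-07T19:00:00", "end": "2026-05-07T20:00:00"},
]


def find_conflicts(events: list[dict]) -> list[tuple[str, str]]:
    """Return pairs of event ids whose [start, end) intervals overlap."""
    parsed = [
        (e["id"],
         datetime.fromisoformat(e["start"]),
         datetime.fromisoformat(e["end"]))
        for e in events
    ]
    conflicts = []
    for (id_a, s_a, e_a), (id_b, s_b, e_b) in combinations(parsed, 2):
        # Two intervals overlap when each starts before the other ends.
        if s_a < e_b and s_b < e_a:
            conflicts.append((id_a, id_b))
    return conflicts


if __name__ == "__main__":
    print(find_conflicts(EVENTS))
```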
Phase 2 fixture:
{
"calendar": [
{
"id": "grant_revision_deadline",
"title": "Grant revision submission",
"start": "2026-05-06T17:00:00",
"end": "2026-05-06T17:30:00"
},
{
"id": "comic_book_store",
"title": "Comic book store",
"start": "2026-05-06T17:00:00",
"end": "2026-05-06T18:00:00"
},
{
"id": "flag_fandom_meeting",
"title": "Flag fandom meeting",
"start": "2026-05-07T19:00:00",
"end": "2026-05-07T20:00:00"
}
],
"inventory": {
"rice noodles": {
"quantity": 0,
"needed_for": "mee krob"
}
}
}

6. Tool Set
6.1 MVP Tools
Implement only these first:
documents.read
Purpose: let PA inspect a document in the current run state.
Input:
{
  "path": "documents/string_theory_intro.md"
}

Output:
{
  "path": "documents/string_theory_intro.md",
  "content": "...",
  "bytes": 1234
}

Rules:
- Path must resolve inside current run namespace.
- No absolute paths.
- No parent traversal.
- Log full args and either full result or result summary depending on size.
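The path rules above can be enforced before any file is opened. The following is a minimal sketch, assuming the state server receives the run's state directory as `state_root`; `resolve_in_namespace` and `documents_read` are hypothetical function names, and logging is omitted here.

```python
from pathlib import Path


def resolve_in_namespace(state_root: Path, relative_path: str) -> Path:
    """Resolve a tool-supplied path, rejecting anything outside the run state."""
    candidate = Path(relative_path)
    if candidate.is_absolute():
        raise ValueError("absolute paths are not allowed")
    if ".." in candidate.parts:
        raise ValueError("parent traversal is not allowed")
    root = state_root.resolve()
    resolved = (root / candidate).resolve()
    # resolve() follows symlinks, so this also catches links pointing outside.
    if not resolved.is_relative_to(root):  # requires Python 3.9+
        raise ValueError("path escapes the run namespace")
    return resolved


def documents_read(state_root: Path, path: str) -> dict:
    """Return document content in the output shape shown above."""
    target = resolve_in_namespace(state_root, path)
    content = target.read_text(encoding="utf-8")
    return {"path": path, "content": content,
            "bytes": len(content.encode("utf-8"))}
```

Checking both the raw path parts and the resolved location is deliberately redundant: the first gives a clear error for obvious misuse, the second guards against symlinks inside the fixture.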
email.save_draft
Purpose: persist a PA-authored draft.
Input:
{
  "to": "management@glenmont-heights.example",
  "subject": "Urgent Request for Elevator Repair",
  "body": "... PA-authored email body ..."
}

Output:
{
  "draft_id": "draft_0001",
  "status": "saved"
}

Rules:
- Body must come from PA tool args.
- Tool must not rewrite or improve body.
- Save to `state/email/drafts.jsonl`.
- Log a state diff with added `draft_id`.
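A file-backed `email.save_draft` that follows these rules could look like the sketch below. It is an assumption-laden illustration: `save_draft` is a hypothetical function, draft ids are derived by counting existing lines (which is not safe under concurrent writers), and the state diff record is reduced to three fields.

```python
import json
from pathlib import Path


def save_draft(state_root: Path, to: str, subject: str, body: str) -> dict:
    """Append a PA-authored draft verbatim and record a state diff."""
    drafts_path = state_root / "email" / "drafts.jsonl"
    drafts_path.parent.mkdir(parents=True, exist_ok=True)

    # Assumption: ids are sequential, derived from the current line count.
    existing = 0
    if drafts_path.exists():
        with drafts_path.open(encoding="utf-8") as f:
            existing = sum(1 for line in f if line.strip())
    draft_id = f"draft_{existing + 1:04d}"

    # The body is stored exactly as received; the tool never edits it.
    record = {"draft_id": draft_id, "to": to, "subject": subject, "body": body}
    with drafts_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

    diff = {"namespace": "email.drafts", "op": "append", "id": draft_id}
    with (state_root / "state_diff.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(diff) + "\n")

    return {"draft_id": draft_id, "status": "saved"}
```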
6.2 Phase 2 Tools
contacts.lookup
Input:
{"query": "building management"}

Output:
{
  "matches": [
    {
      "id": "building_management",
      "name": "Glenmont Heights Building Management",
      "email": "management@glenmont-heights.example"
    }
  ]
}

calendar.list
Input:
{
  "start": "2026-05-04",
  "end": "2026-05-10"
}

Output:
{
  "events": [
    {"id": "...", "title": "...", "start": "...", "end": "..."}
  ]
}

calendar.create
Input:
{
  "title": "Final grant QA block",
  "start": "2026-05-05T15:00:00",
  "end": "2026-05-05T16:00:00",
  "notes": "..."
}

Output:
{"event_id": "event_0001", "status": "created"}

calendar.update
Input:
{
  "event_id": "comic_book_store",
  "patch": {
    "start": "2026-05-04T12:00:00",
    "end": "2026-05-04T13:00:00"
  }
}

Output:
{"event_id": "comic_book_store", "status": "updated"}

inventory.list
Input:
{}

Output:
{
  "items": [
    {"name": "rice noodles", "quantity": 0, "needed_for": "mee krob"}
  ]
}

inventory.add_shopping_item
Input:
{
  "name": "rice noodles",
  "reason": "Needed for Sunday mee krob"
}

Output:
{"status": "added", "item_id": "shopping_0001"}

6.3 Later Tools
- `documents.write`
- `documents.patch`
- `email.read_draft`
- `email.send`
- `email.list_drafts`
- `calendar.delete`
- `inventory.update_item`
7. Logging and Audit
Every tool call must produce two logs.
7.1 Tool Log
Path:
workspace/runs/{run_id}/state/tool_log.jsonl

Schema:
{
  "t": 12,
  "run_id": "user_a__gpt-5.5__no_memory__seed001",
  "user_id": "user_a",
  "session_id": "session_02",
  "tool": "documents.read",
  "args": {"path": "documents/string_theory_intro.md"},
  "result_summary": {"bytes": 1420},
  "status": "ok"
}

7.2 State Diff Log
Path:
workspace/runs/{run_id}/state/state_diff.jsonl

Schema:
{
  "t": 13,
  "run_id": "user_a__gpt-5.5__no_memory__seed001",
  "user_id": "user_a",
  "session_id": "session_01",
  "namespace": "email.drafts",
  "op": "append",
  "id": "draft_0001",
  "summary": "Saved elevator repair draft"
}

The transcript renderer should either embed tool calls in `pa_turn` or link each PA turn to corresponding tool log entries. The judge should be able to inspect the trajectory without reading hidden process state.
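Writing a tool log entry in the schema above can be sketched as an append to a JSONL file. `log_tool_call` and `summarize_result` are hypothetical helpers, and the `MAX_INLINE_BYTES` threshold and the `truncated` marker are assumptions standing in for the "full result or result summary depending on size" rule.

```python
import json
from pathlib import Path

MAX_INLINE_BYTES = 2048  # assumed cutoff for logging a result in full


def summarize_result(result: dict) -> dict:
    """Keep small results whole; replace large ones with a size summary."""
    encoded = json.dumps(result, ensure_ascii=False).encode("utf-8")
    if len(encoded) <= MAX_INLINE_BYTES:
        return result
    return {"bytes": len(encoded), "truncated": True}


def log_tool_call(state_root: Path, *, t: int, run_id: str, user_id: str,
                  session_id: str, tool: str, args: dict, result: dict,
                  status: str = "ok") -> dict:
    """Append one tool-call record to tool_log.jsonl and return it."""
    entry = {
        "t": t,
        "run_id": run_id,
        "user_id": user_id,
        "session_id": session_id,
        "tool": tool,
        "args": args,
        "result_summary": summarize_result(result),
        "status": status,
    }
    with (state_root / "tool_log.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry
```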
8. Checkpointing
Checkpointing should copy the entire run state directory:
workspace/runs/{run_id}/checkpoints/{checkpoint_id}/state_snapshot/

Evaluation should restore checkpoints into a separate eval namespace so probe sessions do not mutate the accumulation checkpoint.
workspace/eval_runs/{run_id}__{checkpoint_id}__{probe_id}/state/

9. Integration Points
9.1 Harness
Harness responsibilities:
- create `run_id`
- copy `data/state_fixtures/{user_id}` into `workspace/runs/{run_id}/state`
- start or reuse local MCP state server
- pass `run_id`, `user_id`, `session_id` to PA tool context
- collect transcript, tool log, state diff log
- checkpoint state directory
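The checkpoint step, together with the restore-into-eval-namespace rule from section 8, can be sketched as two directory copies. `checkpoint_state` and `restore_for_probe` are hypothetical harness helpers; the directory layout follows the paths given in section 8.

```python
import shutil
from pathlib import Path


def checkpoint_state(workspace: Path, run_id: str, checkpoint_id: str) -> Path:
    """Snapshot the whole run state directory under the run's checkpoints."""
    src = workspace / "runs" / run_id / "state"
    dst = (workspace / "runs" / run_id / "checkpoints"
           / checkpoint_id / "state_snapshot")
    shutil.copytree(src, dst)
    return dst


def restore_for_probe(workspace: Path, run_id: str,
                      checkpoint_id: str, probe_id: str) -> Path:
    """Copy a snapshot into a separate eval namespace for one probe session."""
    src = (workspace / "runs" / run_id / "checkpoints"
           / checkpoint_id / "state_snapshot")
    dst = (workspace / "eval_runs"
           / f"{run_id}__{checkpoint_id}__{probe_id}" / "state")
    shutil.copytree(src, dst)
    return dst
```

Because the probe works on its own copy, any mutation it makes leaves both the snapshot and the live run state untouched.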
9.2 PA
PA responsibilities:
- discover MCP tools through normal tool mechanism
- call tools when task requires state access or side effects
- generate artifacts itself
- report user-visible results truthfully
9.3 Simulator
Simulator responsibilities:
- provide user messages and reactions
- never see hidden tool internals except through PA-visible outcomes
- produce eval-only reaction fields for transcript
9.4 Judge
Judge responsibilities:
- evaluate full trajectory: simulator messages, PA text, tool calls, tool results, state diffs
- separate task correctness from interaction preference fit
- flag mismatches between PA claims and state mutation logs
10. Implementation Phases
Phase 0: MCP State MVP
Goal: prove the tool/state path with minimum surface area.
Implement:
- local MCP server skeleton
- run namespace routing
- fixture copy for `user_a`
- `documents.read`
- `email.save_draft`
- `tool_log.jsonl`
- `state_diff.jsonl`
- run `session_01` and `session_02`
Success criteria:
- PA can read manuscript draft through `documents.read`.
- PA can save an email draft through `email.save_draft`.
- Draft body is PA-authored, not generated by the tool.
- Tool calls and results are auditable from logs.
- Re-running the same user/model starts from a clean fixture copy.
Phase 1: User A Full Task State
Add:
- `contacts.lookup`
- `calendar.list`
- `calendar.create`
- `calendar.update`
- `inventory.list`
- `inventory.add_shopping_item`

Run:
- `session_01`
- `session_02`
- `session_03`
Success criteria:
- PA can detect schedule conflict and inventory gap.
- PA can mutate calendar/shopping list after authorization.
- State diffs match PA claims.
Phase 2: Interaction Tools
Status update, 2026-05-04:
- Replaced the earlier BFCL-style MCP interaction-tool experiment with the IPaS-aligned `IX_` design documented in `PA_Interaction_Preference.md`.
- Interaction preference tools are now native PA per-turn tools, not state-server MCP tools. The MCP state server default surface is back to task-state tools only:
  - `documents_read`
  - `email_save_draft`
- Implemented A-class `control-unaffected` IX tools:
  - `IX_tone_formality`
  - `IX_verbosity`
  - `IX_emotional_engagement`
  - `IX_guidance_level`
  - `IX_reasoning_visibility`
  - `IX_uncertainty_expression`
- IX tools are registered per PA turn from `beat.active_skills`; they are not globally exposed.
- Required active IX tools must be called before the PA final response. If the model tries to answer without calling them, the runner issues a repair step asking for the missing IX calls.
- B-class `control-affecting` IX tools remain design-only until we decide runner semantics for ordering, gating, pending task calls, and user-resolution flows.
Add no-op interaction tools that emit structured interaction events, for example:
- `Message_ShowReasoning`
- `Message_AskClarifyingQuestion`
- `Message_ConfirmBeforeAction`
- `Message_SummarizePlan`
- `Message_ProceedSilently`
These tools should log interaction strategy choices but not mutate task state.
Success criteria:
- Judge can score IPaS-relevant behaviors from structured events plus text trajectory.
Phase 3: Memory Conditions
Only after task tools and interaction logging are stable, connect memory backends:
- no memory
- static profile / oracle prompt
- file memory
- Mem0
- MemOS
- Graphiti / Zep later
Success criteria:
- Differences in performance can be attributed to memory, not missing tool/state infrastructure.
11. Open Design Questions
- Should `email.save_draft` require `to`, or allow drafts without resolved recipient?
- Should `documents.read` return full content or chunked content for long files?
- Should tool results be embedded directly into transcript JSONL or referenced by ids into `tool_log.jsonl`?
- Should MCP server be one process per benchmark run, or one shared process with run namespace routing?
- Should interaction tools be available to all PA conditions, including no-memory, or only to PA variants designed to use structured interaction actions?
12. Current Recommendation
Build the MCP state server now, but keep the MVP narrow.
Do not implement a full app environment first. Implement enough to validate the path:
documents.read + email.save_draft + run namespace + tool/state logs

Then add calendar/inventory for `session_03`. After that, add interaction tools. Only then should memory backend comparisons become the main implementation focus.