IX Interaction Tool Development Log
Status Checklist
Levels: B1 = passive selection (logged for judge, no runner control). B2 = runner control behavior (directives, blocking, event emission). B2a = judge-readable approximation; B2b = full user-visible staged interaction.
Control boundary
Control-unaffected IX tools do not change the runner’s task-tool execution flow. They do not block tools, reorder tools, cache pending calls, require user confirmation, control memory access, or trigger extra tool calls. They only shape the PA’s response strategy: wording, amount of detail, reasoning/uncertainty display, or completion behavior.
Control-affecting IX tools change runner execution behavior. They may block task tools, require clarification or confirmation before execution, affect tool-call sequencing, control memory/privacy behavior, or emit structured execution events.
Within control-affecting tools, blocking tools gate or block task-tool execution; response-directive tools inject response-shaping instructions without blocking tool calls. IX_task_expansion and IX_proactive_outreach are both response-directive: task expansion controls adjacent sub-tasks inside the current request; proactive outreach controls whether to open future next steps after the current task is complete.
Control-affecting IX tools — blocking
- IX_process_visibility — B2a done; B2b (user-visible streaming) deferred
  - Silent
  - Bookend
  - Full narration (session_04)
- IX_autonomy_level — B2a done; cross-turn pending-call replay deferred
  - Reactive (session_13)
  - Suggest (session_09)
  - Self-directed (session_14)
  - Autonomous (session_12)
- IX_information_elicitation — B2a done; all settings live-validated
  - Infer (session_10)
  - Structured (session_07)
  - Iterative (session_11)
Control-affecting IX tools — response-directive
- IX_topic_management — B2a done; all settings live-validated
  - Follow user’s flow (session_08)
  - Organize (session_05)
  - One-at-a-time (session_06)
- IX_solution_breadth — B2a done; all settings live-validated
  - Low (session_15)
  - Medium (session_17)
  - High (session_16)
- IX_proactive_outreach — B2a response directive done
  - Low (session_18)
  - Medium (session_19)
  - High (session_20)
- IX_task_expansion — B2a done; all settings live-validated
  - Low (session_23)
  - Medium (session_24)
  - High (session_25)
- IX_capability_boundary — B2a done; all settings live-validated
  - Suggest alternatives (session_21)
  - Find and hand off (session_22)
- IX_memory_privacy — B2a done; full backend enforcement deferred (B2b)
  - Minimal + transparent (session_26)
  - Domain-scoped (session_27)
  - Full (session_28)
A-class control-unaffected IX tools
- IX_tone_formality
- IX_verbosity
- IX_emotional_engagement
- IX_guidance_level
- IX_reasoning_visibility
- IX_uncertainty_expression
Not yet registered (no IX tool in pa/agent/ix/specs.py)
None.
IX_task_expansion
Status: B2a runner directive implemented and live-validated for Low, Medium, and High.
Taxonomy definition: IX_task_expansion controls whether the PA stays inside the user’s explicit current request or expands the current task to related sub-tasks. It is distinct from IX_proactive_outreach: task expansion happens inside the current task boundary, while proactive outreach opens future next-step possibilities after completion. It is distinct from IX_solution_breadth: solution breadth controls how many options are considered for the same problem, while task expansion controls whether adjacent work becomes part of the task.
Settings and B2a behavior:
- Low
  - Complete only the user’s explicit current request.
  - Do not add related sub-tasks, optional adjacent work, or extra tool calls beyond what is necessary for the stated task.
  - Policy: explicit_scope_only.
- Medium
  - Keep the current task bounded.
  - Include obvious, low-risk, strongly related sub-steps needed to complete the task well.
  - Do not expand into optional future tasks.
  - Policy: necessary_adjacent_steps.
- High
  - Actively expand the current task to adjacent support tasks that clearly serve the user’s stated goal.
  - Respect autonomy and external-action policies before executing.
  - Policy: adjacent_support_tasks.
B2a fixtures, 2026-05-07:
- session_23: Low, create only the requested train-exhibition planning note.
- session_24: Medium, include obvious logistics details needed to make the note usable.
- session_25: High, include adjacent support tasks for visit preparation without treating them as post-completion outreach.
Live check, 2026-05-08:
- session_23: selected Low, emitted explicit_scope_only, and eventually created only the requested logistics note. Not a clean validation because the first beat repeatedly guessed incorrect document paths and recovered only after the simulator asked it to list documents.
- session_24: selected Medium, emitted necessary_adjacent_steps, appended the obvious necessary logistics details, and did not expand into a full itinerary.
- session_25: selected High, emitted adjacent_support_tasks, created an expanded visit-prep note with ticket status, transport checks, bring/not-bring constraints, pre-departure checklist, and unresolved gaps. It did not execute external actions.
- Across all three sessions, only the opening beat exposed/called IX_task_expansion; later beats inherited the session directive without repeated IX calls.
- Follow-up fixture repair: revised session_23 to give the actual source path (my_desktop/glenmont_train_exhibition_logistics.md) in the opening cue. The rerun cleanly selected Low, emitted explicit_scope_only, read the correct document, saved one concise logistics note, and did not expand into adjacent planning.
Implementation boundary:
- Registered as IX_task_expansion with settings Low / Medium / High.
- Implemented through the existing IX directive/final-response path via task_scope_policy and task_scope_instruction.
- It does not currently add a separate pending-task queue. The B2a signal is judge-readable: IX selection, directive event, task-tool pattern, and final behavior.
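The setting-to-policy mapping above can be sketched as a small lookup. This is a minimal illustration, not the project's implementation: the policy names and the task_scope_policy / task_scope_instruction field names come from this log, while TaskScopeDirective, build_task_scope_directive, and the instruction wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScopeDirective:
    """Hypothetical container for the response directive produced by an IX selection."""
    setting: str
    task_scope_policy: str
    task_scope_instruction: str


# Policy names are taken from the settings list; instruction text is illustrative only.
_TASK_SCOPE = {
    "Low": (
        "explicit_scope_only",
        "Complete only the user's explicit current request.",
    ),
    "Medium": (
        "necessary_adjacent_steps",
        "Include obvious, low-risk sub-steps; do not add optional future tasks.",
    ),
    "High": (
        "adjacent_support_tasks",
        "Expand to adjacent support tasks; respect autonomy and external-action policies.",
    ),
}


def build_task_scope_directive(setting: str) -> TaskScopeDirective:
    """Map a selected IX_task_expansion setting to its directive fields."""
    if setting not in _TASK_SCOPE:
        raise ValueError(f"unknown IX_task_expansion setting: {setting!r}")
    policy, instruction = _TASK_SCOPE[setting]
    return TaskScopeDirective(setting, policy, instruction)
```

Because the directive only carries text and a policy label, nothing here blocks or reorders tool calls, which matches the response-directive classification.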
IX_proactive_outreach
Status: response-directive completion-policy tool (non-blocking); B2a response directive implemented.
Definition: IX_proactive_outreach controls whether the PA opens future cognitive load after finishing the current task. It is not task expansion. task_expansion asks how wide the current task should become; proactive_outreach asks whether the PA should stop, lightly offer a future next step, or actively surface follow-up options after the current task is complete.
Settings:
- Low
  - Complete the requested task and stop.
  - Do not add future next-step suggestions, optional follow-up menus, or “I can also…” prompts unless the user asks.
- Medium
  - Complete the requested task.
  - Offer at most one clearly relevant future next step if it is worth the interruption.
  - Do not list a menu of options or execute the follow-up.
- High
  - Complete the requested task.
  - Actively surface concrete follow-up options or a compact next-step plan.
  - Do not execute extra tools or take external actions unless the user explicitly asks or another interaction policy permits it.
Implementation boundary:
- Registered as IX_proactive_outreach with settings Low / Medium / High.
- Mapped to response policies: complete_and_stop, context_gated_offer, and active_follow_up_options.
- Implemented through the existing final-response directive path. It does not block task tools, reorder tools, cache pending calls, or trigger extra tool execution.
- Added coverage fixtures:
  - session_18: Low, archive the train exhibition note and stop.
  - session_19: Medium, archive the note and offer at most one obvious follow-up.
  - session_20: High, archive the note and actively surface concrete follow-up options.
IX_process_visibility
Status: B2 minimal runner-control slice implemented for Full narration.
IX_process_visibility selects how much execution-process information the PA should make visible during a turn. It is distinct from IX_reasoning_visibility: process visibility is about what the user sees while work is being carried out; reasoning visibility is about why the PA makes judgments or choices.
Settings:
- Silent
  - Do not narrate the execution process before or between tool calls.
  - Give the user the result at the end.
  - Necessary failure or final-status reporting is still allowed.
- Bookend
  - Give a brief start/intent note before the work.
  - Do not narrate every intermediate tool step.
  - Give a completion summary or final result after the work.
- Full narration
  - Make each step in a multi-step sequence visible.
  - For two consecutive tool uses, the preferred interaction shape is: narrate step 1, execute tool 1, report step 1 result / next step, execute tool 2, report step 2 result, then synthesize.
  - This setting may later require runner-assisted sequencing, but B1 does not implement that control behavior.
Implementation boundary:
- B1: expose and require IX_process_visibility when process_visibility is active; log the selected setting for judge evaluation.
- B2 minimal slice: required IX selection gates task-tool execution. If a model tries to call task tools before selecting the required IX setting, the runner issues a repair prompt and does not execute those task tools.
- When required IX calls and task calls appear in the same model response, the runner executes required IX calls first, converts selections into an IXDirective, then executes task tools and records process events.
- IX directives persist across runner iterations, so a repaired IX-only selection can control task tools called in a later model response.
- Current B2 event semantics: Full narration emits start, per-tool progress, and completion; Bookend emits start and completion; Silent emits no process events.
- The provider message history still preserves valid tool-result ordering for any original assistant tool calls that are actually executed; judge-facing tool_events are IX-first.
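The gating and IX-first ordering described above might look roughly like the following. This is a sketch under stated assumptions rather than the runner's actual code: ToolCall, plan_iteration, the "IX_" name-prefix check, the "setting" argument key, and the repair-prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical stand-in for one tool call in a model response."""
    name: str
    args: dict = field(default_factory=dict)


def plan_iteration(calls, required_ix, directives):
    """Decide what one runner iteration executes.

    calls: tool calls from a single model response, in emitted order.
    required_ix: names of IX tools that must be selected before task tools run.
    directives: IX tool name -> setting, persisted from earlier iterations.
    """
    ix_calls = [c for c in calls if c.name.startswith("IX_")]
    task_calls = [c for c in calls if not c.name.startswith("IX_")]

    # IX calls run first, so selections made in this same response count.
    updated = dict(directives)
    for call in ix_calls:
        updated[call.name] = call.args["setting"]

    missing = sorted(set(required_ix) - set(updated))
    if missing:
        # Task tools are not executed; the model receives a repair prompt instead.
        return {
            "directives": updated,
            "execute": ix_calls,
            "blocked": task_calls,
            "repair_prompt": "Select " + ", ".join(missing) + " before calling task tools.",
        }
    return {
        "directives": updated,
        "execute": ix_calls + task_calls,
        "blocked": [],
        "repair_prompt": None,
    }
```

Passing the returned directives into the next call is what lets a repaired IX-only response unlock task tools in a later iteration.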
B2 test fixture, 2026-05-06:
- Added data/scripts/user_a/session_04.yaml as the first dedicated IX_process_visibility = Full narration scenario.
- Scenario shape: User A asks the PA to search email for "train", recover the train exhibition venue, then plan a route from home to the comic book store and then to the exhibition.
- Added scripted state tools email_search, email_read, and maps_route_options; session_04 exposes them by default alongside existing state tools.
- email_search returns search matches/snippets; email_read(email_id) returns the full body. This keeps the tool surface closer to a real mailbox and avoids forcing repeated search calls when the PA needs details such as the civic address.
- This fixture creates a real multi-tool sequence for B2 validation: IX selection -> email search -> email read -> route lookup -> final itinerary.
- B2 controller/event rendering has a minimal implementation: process events render separately in transcript markdown and are exported under pa_toolcalls.json beat-level process_events.
- 2026-05-06 live-model check exposed a missing case: the PA may call task tools first and only select IX after runner repair. Fixed by blocking task-tool execution until required IX is selected and by persisting directives across iterations. Local regression tests pass; rerun live session to validate transcript behavior.
B2a vs B2b process visibility
IX_process_visibility = Full narration should ultimately mean execution-stage visibility, not merely a final answer that summarizes steps after all tools have already run.
Current implementation: B2a / judge-readable approximation
- The runner enforces IX selection before task tools.
- It records structured process_events such as start, per-tool progress, and completion.
- Transcript and pa_toolcalls.json expose these events for judge inspection.
- This is sufficient for the current simulator setup because the simulator reads completed PA turns; it does not interrupt while tools are executing.
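As a rough illustration of the B2a event stream, the per-setting semantics noted in the implementation boundary (Full narration: start, per-tool progress, completion; Bookend: start and completion; Silent: none) can be written as a small function. The event type names come from this log; the function name and the dictionary shape of each event are assumptions.

```python
def build_process_events(setting, executed_tools):
    """Return the judge-readable process events for one turn.

    setting: selected IX_process_visibility setting.
    executed_tools: names of task tools that actually ran, in order.
    """
    if setting == "Silent":
        return []
    if setting not in ("Bookend", "Full narration"):
        raise ValueError(f"unknown IX_process_visibility setting: {setting!r}")

    events = [{"type": "start"}]
    if setting == "Full narration":
        # One progress event per executed task tool.
        events.extend({"type": "progress", "tool": name} for name in executed_tools)
    events.append({"type": "completion"})
    return events


if __name__ == "__main__":
    tools = ["email_search", "email_read", "maps_route_options"]
    for event in build_process_events("Full narration", tools):
        print(event)
```

These events are recorded after the fact for the judge; they do not by themselves deliver staged, user-visible updates between tool calls.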
Future target: B2b / user-visible staged interaction
- The PA should surface process updates between execution stages, not only in the final response.
- Preferred shape for Full narration:
    PA: I found the relevant email; the venue is X. I am using that venue for route lookup now.
    tool call
    PA: I found three route options. I am comparing time and price now.
    tool call or synthesis
    PA: Final recommendation...
- This stronger version matters for real-time human interaction, streaming UX, and any future simulator capable of reacting mid-execution.
- It is intentionally not implemented yet. B2a should not be treated as the final semantic target; it is the current benchmark-readable scaffold.
Scenario cue-writing lesson
The first live version of session_04 over-cued Full narration: the simulator used explicit numbered steps and direct instructions such as making each step visible. That tested obedience to a benchmark instruction, not preference recognition.
Future active-preference scenes should follow this pattern:
- Cue the preference through a realistic task friction, not the taxonomy label or setting definition.
- User dialogue should state the task and a natural risk point; it should not read like a rubric.
- For process_visibility, the natural risk is premise verification: the route must be based on the correct email and venue, so the PA should surface what it found before acting on it.
- Stronger preference reactions belong in follow-up beats after the PA makes an error; the opening request should remain plausible as ordinary user speech.
- Active skills are benchmark metadata, not user-facing instructions.
Applied to session_04: revised the open beat to remove the explicit "step one/two/three", "narrate each step", and "final answer without intermediate process" language. The new cue is: do not guess the venue and do not return an itinerary that appears from nowhere, because User A needs to verify the email/venue premise before trusting the route.
IX_information_elicitation
Status: B2a unified elicitation protocol implemented.
Session-level IX selection rule:
- IX tools are selection tools, not per-turn restatement tools.
- Once an IX attribute has been selected in a session, the same attribute should not be exposed as a required IX tool again in later beats.
- The selected IXSelection is cached by session and inherited by the runner as an IXDirective for later turns.
- This rule applies to all IX tools, not only IX_information_elicitation. For example, IX_process_visibility=Full narration should continue to emit process events in later turns without asking the PA to call IX_process_visibility again.
- Transcript semantics: active_ix_tools means tools currently exposed/required in this beat; session_ix_tools means IX tools whose preferences are active for the session, including ones already selected and inherited.
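The session-level rule can be pictured as a small cache keyed by IX tool name. This is a minimal sketch of the idea, assuming a per-session object; SessionIXState, record, and beat_view are hypothetical names, and only active_ix_tools and session_ix_tools are taken from the transcript semantics above.

```python
class SessionIXState:
    """Hypothetical per-session cache of IX selections."""

    def __init__(self):
        self._selections = {}  # IX tool name -> selected setting

    def record(self, tool, setting):
        """Store the selection so later beats inherit it."""
        self._selections[tool] = setting

    def beat_view(self, cued_tools):
        """Compute what a beat exposes.

        cued_tools: IX tools cued so far in the session (cumulative, in order).
        """
        return {
            # Only tools without a cached selection are exposed/required again.
            "active_ix_tools": [t for t in cued_tools if t not in self._selections],
            # Everything cued remains in force for the session.
            "session_ix_tools": list(cued_tools),
            # Inherited directives for the runner.
            "directives": dict(self._selections),
        }
```

A usage sketch: before any selection, beat_view(["IX_process_visibility"]) lists the tool under active_ix_tools; after record("IX_process_visibility", "Full narration"), the same call returns an empty active_ix_tools list while the directive stays available to the runner.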
Taxonomy definition: IX_information_elicitation controls how the PA handles missing or ambiguous user input. It is about whether the PA asks or infers when the user’s request is underspecified. It is distinct from IX_uncertainty_expression (the PA’s own uncertainty), IX_autonomy_level (permission to act), and IX_process_visibility (progress narration).
Settings and B2a behavior:
- Infer
  - The PA may proceed with task tools even when required slots are missing.
  - Runner records an elicitation event with stage=allowed_with_missing_slots.
  - Judge should evaluate whether assumptions were reasonable for the persona/context.
- Structured
  - The PA must collect the minimum necessary missing information before task-tool execution.
  - If task tools are attempted while required slots remain missing, runner blocks those tool calls with synthetic status=blocked tool results and prompts the PA to ask upfront clarification questions covering the missing slots.
  - This matches User A’s current ground truth: ask the needed details together, then let the PA proceed; do not inflate the clarification pass into an exhaustive checklist.
- Iterative
  - The PA may collect information incrementally across turns.
  - If no required slot has been clarified yet, runner blocks task tools and prompts for at least one relevant clarification question.
  - If some slots are filled but others remain, runner allows task tools and records stage=allowed_incremental_with_remaining_slots; judge can evaluate whether the PA checked in at meaningful decision points.
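The three branches above reduce to one gate decision per task-tool attempt. The sketch below is an assumption-laden illustration: elicitation_gate and its return shape are hypothetical, the stage names allowed_with_missing_slots and allowed_incremental_with_remaining_slots come from this log, and the remaining stage labels are invented for readability. The require_all_slots parameter mirrors the elicitation.task_tools_require_all_slots option recorded under the session_11 runner repair.

```python
def elicitation_gate(setting, required_slots, filled_slots, require_all_slots=False):
    """Decide whether task tools may run given the declared slot state."""
    missing = [s for s in required_slots if s not in filled_slots]
    any_filled = len(missing) < len(required_slots)

    def decision(allow, stage):
        return {"allow_task_tools": allow, "stage": stage, "missing_slots": missing}

    if not missing:
        return decision(True, "allowed_all_slots_filled")  # hypothetical label
    if setting == "Infer":
        return decision(True, "allowed_with_missing_slots")
    if setting == "Structured":
        # Block and ask for the missing details together, upfront.
        return decision(False, "blocked_missing_slots")  # hypothetical label
    if setting == "Iterative":
        if not any_filled or require_all_slots:
            # Nothing clarified yet, or final artifact tools need every slot.
            return decision(False, "blocked_missing_slots")  # hypothetical label
        return decision(True, "allowed_incremental_with_remaining_slots")
    raise ValueError(f"unknown IX_information_elicitation setting: {setting!r}")
```

Slot state is passed in explicitly because, as noted in the implementation boundary, slot truth comes from beat metadata rather than from parsing the user's answers.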
Implementation boundary:
- required_slots are artifact/content requirements unless a scenario explicitly says otherwise. They describe facts needed to produce a useful task artifact, such as an email body, plan, or recommendation; they are not necessarily task-tool schema parameters.
- Use generic task tools where possible. Do not create low-reuse scripted tools solely to expose every missing slot as a formal parameter.
- Slot truth is provided by session/beat metadata under beat.elicitation; the runner does not parse natural-language user answers into slots.
- simulator.loop forwards beat.elicitation to the PA as ix_context.elicitation.
- AgentRunSpec.ix_context carries this context into the runner.
- Transcript markdown renders [Information elicitation]; pa_toolcalls.json exports beat-level elicitation_events.
- This is B2a, not a full dialogue-state slot-filling system. Future B2b may add explicit slot state across turns or simulator-side slot-fill events.
Expected session pattern:
User starts underspecified.
PA selects IX_information_elicitation.
Infer: proceeds with assumptions.
Structured: asks all missing questions upfront; simulator answers all at once.
Iterative: asks one/few questions, simulator answers only those, then the task advances across more turns.
Session fixture:
- Added data/scripts/user_a/session_07.yaml for IX_information_elicitation = Structured.
- Scenario: User A asks for an external inquiry email about the train exhibition using underspecified language (usual constraints).
- The missing slots are email-body/content requirements, not parameters of a special exhibition inquiry tool. The PA should notice that a useful external email cannot be drafted around undefined content.
- We tightened the open beat so the cue stays natural and does not tell the PA to enumerate a checklist or state how many questions to ask.
- Open beat exposes only IX_information_elicitation and has elicitation.required_slots for exact_visit_date, party_size, constraints_to_check, and draft_only_or_send, with filled_slots: []. The intended Structured behavior is still a concise four-question clarification pass, not an encyclopedic intake checklist.
- answer_clarifications beat fills all slots at once, matching the Structured rhythm.
- session_08.yaml remains the separate topic_management=Follow user's flow coverage fixture; session numbering gap is now filled.
- Added data/scripts/user_a/session_10.yaml for IX_information_elicitation = Infer.
- Scenario: User A wants the same kind of external inquiry email, but explicitly wants a workable draft now and permits reasonable assumptions instead of a clarification loop.
- Open beat again exposes IX_information_elicitation and the same artifact/content required_slots, but the intended behavior is to proceed with assumptions rather than stop to ask for all missing details.
- This fixture gives us a live-check target for the Infer branch while keeping session_07 as the Structured coverage fixture.
- Live run outcome: session_10 successfully exercised the Infer branch. The PA selected Infer, saved a draft immediately, and made reasonable assumptions for the phrasing. The fixture is intentionally soft: the open cue already supplies most of the key facts, so this validates the control path more than a harsh missing-slot scenario.
- Added data/scripts/user_a/session_11.yaml for IX_information_elicitation = Iterative.
- Scenario: User A is willing to answer clarifying questions one at a time and wants the PA to keep the exchange moving without collapsing everything into a single intake.
- The session is structured across multiple beats so the PA can ask a small question, receive one answer, and continue incrementally before drafting.
- This fixture gives us a live-check target for the Iterative branch and the turn-rhythm difference between Infer, Structured, and Iterative.
- Live run outcome: session_11 selected Iterative in the early clarification turns and used one-question-at-a-time pacing, but the PA’s questions were somewhat broad rather than slot-targeted. After the user supplied the remaining details, the PA switched to Infer and drafted without more questions, which is acceptable once the missing content is available.
- Fixture repair: answer_constraints now marks draft_only_or_send as filled because the simulator line explicitly says “Draft only — do not send.”
- Fixture repair: slot-specific beats (answer_date, answer_party_size, answer_constraints) were replaced with neutral clarification beats (answer_first_clarification, answer_second_clarification, answer_remaining_details). The simulator should answer the slot most relevant to the PA’s actual question, instead of forcing a fixed slot order that can mismatch the PA’s wording.
- Fixture repair: the opening request now provides the recipient address and signature name so iterative turns are spent on the target content slots instead of peripheral email metadata.
- Live run outcome: the neutral-beat version improved the first clarification alignment, but the simulator’s first answer said “proceed with drafting”, which caused the PA to switch from Iterative to Infer and save a draft before date/party/draft-only were filled.
- Fixture repair: answer_first_clarification now explicitly forbids authorizing drafting. It should answer one slot and let the PA ask the next question.
- Runner repair: Iterative now supports elicitation.task_tools_require_all_slots: true. When enabled, the PA may still collect information one question at a time, but final task/artifact tools remain blocked until all declared required slots are filled.
- session_11 enables task_tools_require_all_slots because email_save_draft creates the final artifact and should not run while date, party size, or draft/send status remain missing.
- Live run outcome: session_11 now shows the intended behavior. IX_information_elicitation=Iterative is selected once, inherited in later beats, early email_save_draft is blocked while required slots remain missing, and the real draft is saved only after the remaining details are supplied.
- Transcript/toolcall repair: duplicate elicitation control events are deduplicated in rendered/exported outputs so judge-facing artifacts do not show the same block event twice.
IX_topic_management
B2 design update, 2026-05-06:
- Taxonomy clarified: Topic Management is turn-level handling of multiple user-raised topics within one assistant turn.
  - Follow user's flow: preserve the user’s order and associative flow without forcing structure or turn splitting.
  - Organize: address multiple topics in one assistant turn, but group, order, and label them clearly.
  - One-at-a-time: address only one topic in the current assistant turn and explicitly defer the remaining topics until the user replies.
- Implementation uses the existing IXDirective/policy.py path; no second directive framework.
- B2 minimal slice injects a response-composition directive after IX_topic_management selection. It does not yet implement automatic topic segmentation or a pending-topic queue.
- Added data/scripts/user_a/session_05.yaml as the first dedicated topic_management=Organize fixture in work_external. The scene uses a deliberately messy external train-exhibition logistics email request so the PA must organize main topics, constraints, and optional side topics in one response.
- One-at-a-time B2 interaction control now means: address one topic in the current assistant response, keep remaining topics deferred, and ask for lightweight user confirmation before moving to the next topic. It is not automatic PA multi-message continuation.
- Added data/scripts/user_a/session_06.yaml as the first dedicated topic_management=One-at-a-time fixture in personal_internal. The scene uses comic book pickup -> movie showtime -> dinner planning across beats, with user confirmations between topics.
- Added generic scripted tools for two session_06 beats: check_store_item for store/item availability and check_cinema_showtimes for cinema/movie/date lookup. These are scenario-reusable state tools, not train- or comic-specific IX tools.
- Added find_nearby_places(location, category, limit) as a generic scripted local-place lookup for the dinner beat. This keeps dinner behavior tool-supported without hard-coding a scenario-specific restaurant tool.
- Do not hard-code “do not re-plan completed topics” into the IX directive; simulator/user reactions should handle that conversational correction.
- 2026-05-07 live session_06 check: One-at-a-time B2 minimal behavior validated. Each active beat exposed only IX_topic_management, selected One-at-a-time, emitted single_topic_per_turn, used the matching task tool for the current topic (check_store_item, check_cinema_showtimes, find_nearby_places), and waited for lightweight confirmation before moving on.
- Added data/scripts/user_a/session_08.yaml for Follow user's flow coverage. User A’s current matrix has no Follow user's flow topic-management cell, so this is marked phase: tool_coverage and uses beat-level active_skill_settings override for the simulator actor skill only. This tests preserve_user_flow runner/directive behavior without changing the persona matrix.
- 2026-05-07 live session_08 check: Follow user's flow B2 minimal behavior validated for IX selection/directive/tool flow. The PA selected Follow user's flow, emitted preserve_user_flow, used the three task tools in the user’s natural order, and did not impose one-at-a-time confirmations. Residual issue: the dinner fixture/task facts mention Glenmont Cinema’s own dining, but find_nearby_places does not return an on-site dining option; this is a state-fixture/data issue, not a topic-management runner issue.
IX_autonomy_level
B2a design update, 2026-05-07:
- IX_autonomy_level controls whether the PA must return control to the user before acting. It is separate from IX_information_elicitation (missing information) and exploration attributes such as task_expansion or solution_breadth (how far beyond the request to go).
- Runner maps selected settings to an autonomy directive:
  - Reactive -> confirm_every_step: task tools are blocked until the PA asks for confirmation.
  - Suggest -> confirm_key_actions: read/preparation/draft tools may run, but state-changing or external-impact tools are blocked for confirmation.
  - Self-directed -> execute_within_scope: routine in-scope work may run, but external-impact tools are blocked for confirmation.
  - Autonomous -> execute_delegated_task: task tools may run unless an exception, failure, or capability boundary arises.
- Tool action type is intentionally simple in B2a: read, draft, internal_write, or external_action. Runner has a small default mapping for current state tools and accepts ix_context.tool_action_types overrides in tests/scenarios.
- Blocking is synthetic and judge-readable: the runner returns blocked tool results, emits autonomy events, and adds a repair prompt asking for confirmation. It does not yet maintain a cross-turn pending-call replay queue.
- Added generic state tool email_send as a reusable external-action tool. session_09 exposes documents_read, email_save_draft, and email_send, so Suggest can be tested against draft-vs-send behavior.
- Added data/scripts/user_a/session_09.yaml as the first autonomy_level=Suggest coverage fixture. The scenario asks the PA to prepare external coordination logistics while preserving the boundary that nothing should be sent externally without confirmation.
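The policy-by-action-type gating above can be sketched as two lookups. This is an illustrative reading, not the runner's code: the setting, policy, action-type, and tool names come from this log, but autonomy_gate, the table layout, the classification of documents_read as read and email_save_draft as draft, and the choice to treat internal_write as state-changing under Suggest are assumptions.

```python
SETTING_TO_POLICY = {
    "Reactive": "confirm_every_step",
    "Suggest": "confirm_key_actions",
    "Self-directed": "execute_within_scope",
    "Autonomous": "execute_delegated_task",
}

# Action types blocked for confirmation under each policy (assumed reading).
BLOCKED_ACTION_TYPES = {
    "confirm_every_step": {"read", "draft", "internal_write", "external_action"},
    "confirm_key_actions": {"internal_write", "external_action"},
    "execute_within_scope": {"external_action"},
    "execute_delegated_task": set(),
}

# Small default mapping for state tools named in this log.
DEFAULT_ACTION_TYPES = {
    "documents_read": "read",
    "email_save_draft": "draft",
    "planning_note_append": "internal_write",
    "email_send": "external_action",
}


def autonomy_gate(setting, tool_name, overrides=None):
    """Return the synthetic allow/block decision for one task-tool call.

    overrides: optional tool name -> action type, standing in for
    ix_context.tool_action_types.
    """
    policy = SETTING_TO_POLICY[setting]
    action_types = {**DEFAULT_ACTION_TYPES, **(overrides or {})}
    # Accept both bare and mcp_state_-prefixed tool names.
    base_name = tool_name.removeprefix("mcp_state_")
    action_type = action_types.get(tool_name) or action_types.get(base_name)
    if action_type is None:
        raise ValueError(f"no action type registered for tool: {tool_name!r}")
    blocked = action_type in BLOCKED_ACTION_TYPES[policy]
    return {
        "policy": policy,
        "action_type": action_type,
        "status": "blocked" if blocked else "allowed",
    }
```

The sketch is stateless on purpose: like the B2a runner described here, it reports a block but keeps no queue of pending calls to replay after confirmation. Note that str.removeprefix requires Python 3.9 or later.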
Session-level active IX correction, 2026-05-07:
- Control-affecting IX tools such as IX_autonomy_level must persist across a session once cued. A later beat with active_skills: [] no longer means the control boundary disappears; it means no new preference is cued in the simulator prompt.
- SimulatorLoop now maintains session_active_skills and passes the cumulative active IX set to the PA each beat. Transcript beat_enter records both the beat-local cue (active_skills) and the currently exposed session-level tools (active_ix_tools).
- session_09 now includes a concrete logistics note at data/state_fixtures/user_a/my_desktop/glenmont_train_exhibition_logistics.md, so the PA can use a successful read tool instead of probing nonexistent desktop paths.
- Transcript markdown compresses repeated autonomy events into a readable summary: autonomy setting, policy, allowed tools, and blocked tools. pa_toolcalls.json still preserves the full beat-level autonomy_events for judge/debug use.
Autonomous coverage fixture, 2026-05-07:
- Added data/scripts/user_a/session_12.yaml for IX_autonomy_level = Autonomous. session_10 and session_11 are already used by information elicitation, so Autonomous coverage starts at session_12.
- The fixture reuses data/state_fixtures/user_a/my_desktop/glenmont_train_exhibition_logistics.md and exposes email_send, so the PA can read the note and complete a bounded external coordination send.
- Expected runner/judge signal: IX_autonomy_level selected as Autonomous, autonomy policy execute_delegated_task, mcp_state_documents_read and mcp_state_email_send allowed, no blocked tools, and a sent-email state record.
- Live check caveat: the simulator should not infer that the PA skipped a prerequisite tool merely because the final completion message is concise. For session_12, the tool audit is the source of truth for read-before-send ordering; the reaction beat now avoids reopening approval unless the PA explicitly says it skipped the note or fails to send.
Reactive coverage fixture, 2026-05-07:
- Added data/scripts/user_a/session_13.yaml for IX_autonomy_level = Reactive.
- The fixture tests confirmation-before-action behavior using the existing train-exhibition logistics note. User A names the note but does not authorize reading it yet.
- Expected runner/judge signal: IX_autonomy_level selected as Reactive, autonomy policy confirm_every_step, and either no task tool call before confirmation or a blocked mcp_state_documents_read attempt. Cross-turn pending-call replay remains out of scope for B2a.
- Live check caveat: the simulator must distinguish proposing the first action from executing it. In session_13, “my first action would be to read the note” is correct Reactive behavior if no read/open/query tool is called.
Self-directed coverage fixture, 2026-05-07:
- Added generic state tool planning_note_append as a reusable internal-write tool. It writes run-local planning notes only; it is not memory, not a user preference record, and not an external action.
- Runner maps planning_note_append / mcp_state_planning_note_append to action type internal_write.
- Added data/scripts/user_a/session_14.yaml for IX_autonomy_level = Self-directed.
- The fixture asks the PA to read the train-exhibition logistics note and create an internal planning note without asking about each routine step, but to stop before contacting Marcus externally.
- Expected runner/judge signal: IX_autonomy_level selected as Self-directed, autonomy policy execute_within_scope, mcp_state_documents_read and mcp_state_planning_note_append allowed, and mcp_state_email_send blocked or avoided until confirmation.
IX_solution_breadth
Status: B2a runner directive implemented and live-validated for Low, Medium, and High.
Taxonomy definition: IX_solution_breadth controls how widely the PA explores the solution space before answering. It is distinct from IX_task_expansion: solution breadth varies the number and comparison depth of options inside the requested task, while task expansion decides whether to add adjacent tasks.
Settings and B2a behavior:
- Low
  - Policy: one_best_answer.
  - Present the single best option or recommendation.
  - Mention alternatives only if needed to justify the recommendation; do not provide a broad comparison.
- Medium
  - Policy: shortlist.
  - Present a small shortlist of the most relevant options, then identify the recommended choice.
- High
  - Policy: broad_alternatives.
  - Explore a broader set of viable alternatives and compare tradeoffs before narrowing.
B2a test fixture, 2026-05-07:
- Added data/scripts/user_a/session_15.yaml for IX_solution_breadth = Low.
- The fixture asks the PA to choose a route from User A’s apartment to Stuart’s Comic Center and then to Glenmont Civic Exhibition Hall, while cueing that the user wants one best route, not a menu of alternatives.
- Expected runner/judge signal: IX_solution_breadth selected as Low, response policy one_best_answer, mcp_state_maps_route_options used, and final answer centered on one recommended route with time and price.
- Added data/scripts/user_a/session_17.yaml for IX_solution_breadth = Medium.
- The fixture uses the same route option surface, but cues that the user wants a small practical shortlist plus a recommendation, not one unexplained answer and not a full option-space analysis.
- Expected runner/judge signal: IX_solution_breadth selected as Medium, response policy shortlist, mcp_state_maps_route_options used, and final answer gives a small shortlist, briefly compares time/cost, then recommends one route without expanding to adjacent tasks.
- Added data/scripts/user_a/session_16.yaml for IX_solution_breadth = High.
- The fixture uses the same route option surface, but cues that the user is undecided and wants viable choices plus tradeoffs before narrowing.
- Expected runner/judge signal: IX_solution_breadth selected as High, response policy broad_alternatives, mcp_state_maps_route_options used, and final answer compares multiple route options inside the route task without expanding to adjacent tasks such as tickets, dinner, or calendar changes.
Verification, 2026-05-07:
- Re-ran python -m pytest tests/test_simulator/test_schemas.py tests/test_harness/test_state_runtime.py tests/test_pa/test_runner_ix.py -q: 49 passed.
- Checked data/runs/step2-test/user_a/session_16/transcript.md and pa_toolcalls.json: opening beat required and called only IX_solution_breadth; selected High; emitted directive broad_alternatives; called mcp_state_maps_route_options with the apartment -> Stuart’s Comic Center -> Glenmont Civic Exhibition Hall route.
- Final PA answer compared three route options (rideshare, public transit, walk + transit) across time, cost, and reliability/tradeoffs. No adjacent task expansion to tickets, dinner, calendar, or other planning tasks appeared in the session_16 transcript.
Session 17 run check, 2026-05-07:
- Checked data/runs/step2-test/user_a/session_17/transcript.md and pa_toolcalls.json: the fixture cued a two-option shortlist, but the PA selected IX_solution_breadth = Low and emitted one_best_answer, not the expected Medium / shortlist.
- The PA did call mcp_state_maps_route_options and stayed within the route task, but the opening answer was structured as a single recommendation with comparative justification rather than a clean two-option shortlist.
- Simulator correction pushed the answer into the intended structure, but the judge-visible opening IX selection is a failure for Medium coverage. Session 17 should be rerun after strengthening the cue or tool-selection guidance so “two strongest candidates plus recommendation” maps to Medium, not Low.
- 2026-05-07 rerun after model-facing setting descriptions: session_17 now passes Medium coverage. The opening beat required and called only IX_solution_breadth, selected Medium, emitted shortlist, called mcp_state_maps_route_options, and produced exactly two practical options plus a recommendation. The simulator accepted the shortlist format; its remaining pushback concerned the recommendation’s implicit time-value assumption, not solution-breadth behavior.
Model-Facing Setting Definitions
The session_17 failure exposed a general IX spec issue: the model-facing IXPreferenceTool description previously listed only enum labels, such as Low / Medium / High, without defining the operational meaning of each setting. This made boundary cases easy to misclassify; for example, “two practical options plus a recommendation” was pulled toward Low because the word “recommendation” was present.
Resolved design:
- IXToolSpec now requires setting_descriptions for every registered IX tool.
- IXPreferenceTool.description exposes each setting and its definition to the PA model.
- These descriptions are model-facing selection guidance, not eval-only metadata.
- All currently registered IX tools have setting descriptions; future IX tools must add them when registered.
- The session_17 cue was also tightened to ask explicitly for a compact two-option shortlist, while saying not to collapse into a single answer and not to expand into a full route-options analysis.
For IX_solution_breadth, the intended model-facing boundary is:
- Low: focused answer; converge to one best option or recommendation.
- Medium: shortlist; present 2-3 practical options, briefly compare them, then recommend one.
- High: open option space; keep multiple viable alternatives on the table and compare tradeoffs before narrowing.
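One way to picture the resolved design is a spec that refuses to register without a definition for every setting, plus a renderer that turns those definitions into the model-facing description. This is a minimal sketch under stated assumptions: ToolSpecSketch, its field layout, and render_description are hypothetical stand-ins, not the actual IXToolSpec or IXPreferenceTool code. Only the setting names and definitions are taken from this log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpecSketch:
    """Hypothetical stand-in for IXToolSpec; field names are assumptions."""

    name: str
    settings: tuple[str, ...]
    setting_descriptions: dict[str, str]

    def __post_init__(self) -> None:
        # Refuse to build a spec whose settings lack a non-empty definition.
        missing = [s for s in self.settings if not self.setting_descriptions.get(s)]
        if missing:
            raise ValueError(f"{self.name}: missing setting descriptions for {missing}")


def render_description(spec: ToolSpecSketch) -> str:
    """Build model-facing text: one line per setting with its definition."""
    lines = [f"Select one setting for {spec.name}:"]
    lines += [f"- {s}: {spec.setting_descriptions[s]}" for s in spec.settings]
    return "\n".join(lines)


SOLUTION_BREADTH = ToolSpecSketch(
    name="IX_solution_breadth",
    settings=("Low", "Medium", "High"),
    setting_descriptions={
        "Low": "focused answer; converge to one best option or recommendation.",
        "Medium": "shortlist; present 2-3 practical options, briefly compare them, "
        "then recommend one.",
        "High": "open option space; keep multiple viable alternatives on the table "
        "and compare tradeoffs before narrowing.",
    },
)
```

Failing at construction time, rather than at selection time, would surface a missing definition before any live session runs.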
IX_capability_boundary
Status: B2a runner directive implemented and live-validated for both settings.
Taxonomy definition: IX_capability_boundary controls how the PA responds when the original user request exceeds current PA capability. It is distinct from IX_solution_breadth because the requested task is not directly completable; it is distinct from IX_task_expansion because the issue is fallback/handoff after a boundary, not optional expansion beyond a doable task.
Settings and B2a behavior:
- Suggest alternatives
  - Policy: suggest_in_pa_workarounds.
  - Acknowledge the capability limit, avoid overclaiming confirmation, and suggest feasible in-PA workarounds or next-best alternatives.
  - Do not initiate external handoff unless the user asks.
- Find and hand off
  - Policy: find_handoff_path.
  - Acknowledge the capability limit, find the appropriate external contact/channel when possible, and prepare a handoff artifact such as an email draft or call script.
B2a fixtures, 2026-05-07:
- Added data/scripts/user_a/session_21.yaml for IX_capability_boundary = Suggest alternatives.
  - Scenario: User A asks whether Glenmont Civic Exhibition Hall offers quiet-entry / low-crowd accommodation for the train exhibition, but explicitly says not to hand the task off yet if the PA cannot directly confirm.
  - Expected runner/judge signal: IX_capability_boundary selected as Suggest alternatives, response policy suggest_in_pa_workarounds, PA acknowledges it cannot directly confirm, may inspect local notes, and suggests safe in-PA alternatives without drafting or routing externally.
- Added data/scripts/user_a/session_22.yaml for IX_capability_boundary = Find and hand off.
  - Scenario: same accommodation question, but User A asks the PA to find the right handoff path and prepare the message rather than giving generic alternatives.
  - Expected runner/judge signal: IX_capability_boundary selected as Find and hand off, response policy find_handoff_path, mcp_state_contacts_lookup used to find the venue accessibility channel, and mcp_state_email_save_draft used to prepare a handoff draft.
Implementation notes:
- Added model-facing setting descriptions for IX_capability_boundary.
- Added B2a response directives for both settings.
- Added scripted contacts_lookup state tool backed by contacts.json.
- session_21 and session_22 expose documents_read, email_save_draft, and contacts_lookup; the expected behavior decides whether contact lookup/drafting is appropriate.
- Local validation: python -m pytest tests/test_pa/test_scaffold.py tests/test_pa/test_runner_ix.py tests/test_simulator/test_schemas.py tests/test_harness/test_state_runtime.py tests/test_state_server.py tests/test_state_server_mcp.py -> 82 passed.
Live validation, 2026-05-07:
- session_21 passed Suggest alternatives: opening beat exposed and required only IX_capability_boundary, selected Suggest alternatives, emitted suggest_in_pa_workarounds, acknowledged it could not directly confirm the accommodation, did not fabricate confirmation, did not contact or draft to an external channel, and offered in-PA preparation alternatives.
- session_22 passed Find and hand off: opening beat exposed and required only IX_capability_boundary, selected Find and hand off, emitted find_handoff_path, used mcp_state_contacts_lookup, identified the Glenmont Civic Exhibition Hall Accessibility Desk, and saved draft_0001 to accessibility@glenmontcivic.example.
- Caveat: session_22 included a noisy documents_read(path='.') error and several failed contact queries before finding the right contact. This does not invalidate the IX behavior, but it suggests future scripted lookup tools should either support broader token matching or expose clearer query guidance.
- Follow-up fix: contacts_lookup now uses token-overlap matching so long natural queries can match relevant contacts, and documents_read(path='.') now returns a directory file listing instead of raising an error. Local state-server tests pass.
- Rerun after fix: session_22 is cleaner. documents_read returned a directory listing, contacts_lookup matched the accessibility contact on the first query, no tool errors appeared, and the PA still selected Find and hand off, identified the accessibility desk, and saved a handoff draft.
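For reference, token-overlap matching of the kind described in the follow-up fix can be sketched as below. This is not the contacts_lookup implementation: the lookup function, the name/role record fields, the min_overlap threshold, and the box-office record are assumptions for illustration. The accessibility desk name and address are the ones recorded above.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lookup(query: str, contacts: list[dict], min_overlap: int = 1) -> list[dict]:
    """Rank contacts by how many query tokens appear in their name or role."""
    query_tokens = tokens(query)
    scored = []
    for contact in contacts:
        haystack = tokens(f"{contact.get('name', '')} {contact.get('role', '')}")
        overlap = len(query_tokens & haystack)
        if overlap >= min_overlap:
            scored.append((overlap, contact))
    # Highest overlap first; sorted() is stable, so ties keep file order.
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Illustrative records; the first entry is invented for contrast.
CONTACTS = [
    {"name": "Example Box Office", "role": "ticket sales",
     "email": "tickets@example.invalid"},
    {"name": "Glenmont Civic Exhibition Hall Accessibility Desk",
     "role": "accessibility accommodation",
     "email": "accessibility@glenmontcivic.example"},
]

hits = lookup(
    "who handles quiet-entry accessibility accommodation at "
    "Glenmont Civic Exhibition Hall",
    CONTACTS,
)
```

Because scoring counts shared tokens instead of requiring an exact or substring match, a long natural-language query still reaches the right record. The trade-off is that common words can produce weak matches, which a higher min_overlap or a stop-word list would reduce.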
B1 Transcript / Judge-Readiness Audit
Status: implemented for B1 visualization.
Goal: make each session transcript sufficient to check whether active IX tools were exposed, whether required IX tools were called, what setting was selected, and whether ordinary task tools still ran.
Transcript event fields:
- beat_enter.active_skills: active IPaS attributes cued by the beat.
- beat_enter.active_ix_tools: IX tools registered for those active attributes.
- pa_turn.ix_required: IX tools required before the PA’s final answer.
- pa_turn.ix_called: IX tools actually called by the PA.
- pa_turn.ix_missing: required IX tools not called.
- pa_turn.tool_events[].ix: marks an IX selection tool event.
- pa_turn.tool_events[].attribute, setting, evidence, application: judge-readable IX selection payload.
Markdown rendering:
- Each beat heading shows [Active IX tools].
- Each PA turn shows [IX audit] with required/called/missing.
- IX selections are rendered separately from ordinary PA task/tool calls.
- Ordinary task tools remain visible under [PA task/tool calls], so B1 tests can verify that task tool use still works after IX selection.
For IX_topic_management B1 testing, this transcript is enough to inspect:
- whether only IX_topic_management was exposed on topic-management beats;
- whether the PA called the required IX tool before answering;
- whether the selected setting matches the persona ground truth;
- whether the final behavior follows, organizes, or isolates topics consistently;
- whether task tools still executed when needed.
Machine-readable toolcall parse:
- Each completed session also writes pa_toolcalls.json next to transcript.jsonl and transcript.md.
- pa_toolcalls.json is derived from pa_turn.tool_events and contains only PA tool calls, including both IX selection tools and ordinary task tools.
- Schema is grouped by beat for readability and judge alignment:
  - top-level meta: session_id, persona, context, scenario, model;
  - beats[]: beat, active_skills, active_ix_tools, ix_required, ix_called, ix_missing, calls[];
  - calls[]: call_index, name, type (ix or task), status, args, detail;
  - IX calls additionally include attribute, setting, evidence, and application.
- call_index preserves the recorded tool-call order within a PA turn, so judge code can separately check whether IX tools were called before task tools.
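A judge-side ordering check over one beats[] entry might look like the following. It is a sketch, not existing judge code: audit_beat and its return keys are hypothetical, the example values are made up, and it assumes a beat holds a single PA turn so that call_index is comparable across its calls.

```python
def audit_beat(beat: dict) -> dict:
    """Check one beats[] entry: required IX tools called, and called before task tools."""
    calls = sorted(beat["calls"], key=lambda c: c["call_index"])
    ix_called = [c["name"] for c in calls if c["type"] == "ix"]
    missing = [name for name in beat["ix_required"] if name not in ix_called]

    # Position of the first task call, or len(calls) if there are none.
    first_task = next(
        (i for i, c in enumerate(calls) if c["type"] == "task"), len(calls)
    )
    # IX calls that appear only after a task tool has already run.
    late_ix = [c["name"] for c in calls[first_task:] if c["type"] == "ix"]

    return {"ix_missing": missing, "ix_before_task": not late_ix, "late_ix": late_ix}


# Illustrative beat shaped like the schema in this section (values are made up).
example_beat = {
    "beat": 1,
    "ix_required": ["IX_solution_breadth"],
    "calls": [
        {"call_index": 0, "name": "IX_solution_breadth", "type": "ix",
         "setting": "High"},
        {"call_index": 1, "name": "mcp_state_maps_route_options", "type": "task"},
    ],
}

result = audit_beat(example_beat)
```

Recomputing ix_missing from calls[], instead of trusting the stored ix_missing field, gives the judge an independent cross-check on the transcript writer.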
Refactor Guide for Control-Affecting IX Tools
Goal: keep future B-class IX behavior decoupled from the core PA runner.
Core principle:
IX tool = preference-setting selection
IX controller/policy = interprets selected setting
Runner = applies generic control directives
Judge/log = evaluates selection and behavior
Do not put IPaS-specific logic directly into AgentRunner branches such as:
if tool_name == "IX_process_visibility" and setting == "Full narration":
    ...
Instead, route selected settings through a policy/controller layer that emits generic directives the runner can apply.
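A minimal sketch of that routing, assuming a registry of per-tool controllers, is shown here. Directive, blocks_task_tools, CONTROLLERS, and directive_for are hypothetical names chosen for illustration and do not describe the current pa/agent/ix code. The IX_solution_breadth policy names are the ones recorded in this log.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Directive:
    """Generic instruction for the runner; it carries no IPaS-specific branching."""

    policy: str
    instruction: str
    blocks_task_tools: bool = False  # assumed flag for blocking tools


def solution_breadth_controller(setting: str) -> Directive:
    """Interpret an IX_solution_breadth setting (policy names from this log)."""
    policies = {
        "Low": ("one_best_answer",
                "Present the single best option or recommendation."),
        "Medium": ("shortlist",
                   "Present a small shortlist, then recommend one option."),
        "High": ("broad_alternatives",
                 "Compare a broader set of alternatives before narrowing."),
    }
    policy, instruction = policies[setting]
    return Directive(policy=policy, instruction=instruction)


# Registry keyed by IX tool name: adding a tool adds an entry, not a runner branch.
CONTROLLERS: dict[str, Callable[[str], Directive]] = {
    "IX_solution_breadth": solution_breadth_controller,
}


def directive_for(tool_name: str, setting: str) -> Optional[Directive]:
    """Route a selection to its controller; unknown tools yield no directive."""
    controller = CONTROLLERS.get(tool_name)
    return controller(setting) if controller else None
```

With this shape the runner only ever inspects Directive fields, so supporting a new IX tool does not require touching AgentRunner.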
Suggested Package Layout
pa/agent/ix/
  __init__.py
  specs.py      # taxonomy-aligned IX specs and settings
  tools.py      # model-facing IXPreferenceTool
  registry.py   # per-turn IX tool registration
  state.py      # IXTurnState / future IXSessionState
  policy.py     # IXPolicyEngine
  events.py     # structured IXSelection / IXDirective events
  controllers/
    base.py
    process_visibility.py
    autonomy_level.py
    information_elicitation.py
    topic_management.py
    memory_privacy.py
B1 refactor status: implemented as a behavior-preserving split. Current code lives in pa/agent/ix/specs.py, pa/agent/ix/tools.py, and pa/agent/ix/registry.py; pa/agent/tools/ix.py remains as a compatibility re-export so existing imports keep working. state.py, policy.py, events.py, and controllers are deferred until B2 behavior introduces real IX state/directives.
Benchmark runner boundary: future extraction from tests/ should move orchestration into harness/, but PA behavior must remain in pa/agent/*. Harness may configure, run, and observe the PA; it must not bypass or reimplement assistant behavior.
Data Flow
beat.active_skills
-> per-turn IX tool registration
-> model calls IX_* tool
-> IXSelection recorded
-> controller interprets selection
-> IXDirective emitted
-> runner applies directive
The runner should understand only generic directives, not individual IPaS attributes.
Key Data Objects
IXSelection:
from dataclasses import dataclass

# One IX preference selection recorded from an IX_* tool call.
@dataclass
class IXSelection:
    attribute: str
    tool_name: str
    setting: str
    evidence: str = ""
    application: str = ""
IXDirective is a frozen dataclass in pa/agent/ix/events.py. It carries one setting field, one <attr>_policy key, and one <attr>_instruction string per attribute (e.g., autonomy_level, autonomy_level_policy, autonomy_level_instruction). Blocking attributes drive runner execution via _autonomy_event and _elicitation_event; response-directive attributes drive the runner via _apply_response_directives. information_elicitation carries only an elicitation_policy key; its blocking message is generated dynamically from run-time slot state.
Cross-turn IX selection state is held in AgentLoop._ix_selection_cache (in-memory, keyed by session). The runner receives inherited selections via AgentRunSpec.inherited_ix_selections and converts them to an initial IXDirective before the first iteration.
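The conversion from inherited selections to an initial directive could be sketched as follows. The real IXDirective has more attributes and is not reproduced here: DirectiveSketch, AUTONOMY_POLICIES, initial_directive, and the attribute-to-setting dict used for inherited selections are all assumptions. Only the field naming pattern, the has_control_directive name, and the two setting/policy pairs come from this log.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DirectiveSketch:
    """Hypothetical, trimmed-down shape covering a single attribute."""

    autonomy_level: str = ""
    autonomy_level_policy: str = ""
    autonomy_level_instruction: str = ""

    @property
    def has_control_directive(self) -> bool:
        # A default-constructed directive carries no control behavior.
        return bool(self.autonomy_level_policy)


# Setting -> policy pairs named in this log; other settings are omitted.
AUTONOMY_POLICIES = {
    "Reactive": "confirm_every_step",
    "Self-directed": "execute_within_scope",
}


def initial_directive(inherited: dict[str, str]) -> DirectiveSketch:
    """Fold selections inherited from earlier turns into a starting directive."""
    directive = DirectiveSketch()
    setting = inherited.get("autonomy_level")
    if setting in AUTONOMY_POLICIES:
        # Frozen dataclasses are updated by building a modified copy.
        directive = replace(
            directive,
            autonomy_level=setting,
            autonomy_level_policy=AUTONOMY_POLICIES[setting],
        )
    return directive
```

Keeping the directive frozen means a later IX selection in the same turn has to produce a new object, which makes it harder for runner code to mutate control state by accident.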
Implementation Order — achieved state
All B2a behavior for all 15 IX tools is implemented. Remaining deferred items:
- B2b process visibility: user-visible step narration between tool calls (requires streaming/interactive runner).
- B2b autonomy: cross-turn pending-call replay queue.
- B2b memory privacy: backend retrieval scope enforcement (requires memory backend boundary).
- pa/agent/tools/ix.py shim still present; callers can migrate to: from pa.agent.ix import ...
IX_memory_privacy B2a
Status: implemented for all three settings as B2a behavior.
Scope:
- B2a adds model-facing runner directives and transcript/judge-visible evidence.
- B2a does not enforce memory retrieval filtering, memory backend permissions, or pre-IX prompt-scope filtering.
- Full memory backend enforcement remains B2b because current PA memory is injected into the system prompt before IX selection.
Settings:
- Minimal + transparent -> minimal_disclosed_memory
  - Use only the smallest amount of remembered or personal information needed for the turn.
  - If memory is used, briefly disclose the specific remembered fact or preference relied on.
  - Avoid broad cross-domain personalization.
- Domain-scoped -> domain_scoped_memory
  - Use remembered or personal information only from the current task domain or context boundary.
  - Do not import unrelated work, health, social, or other cross-domain details.
- Full -> broad_personalization
  - Use available remembered preferences and personal context broadly when they help the task.
  - Relevant cross-domain context is allowed, but irrelevant private detail is still inappropriate.
Implementation:
- Added memory_privacy, memory_policy, and memory_instruction to IXDirective.
- Added _memory_privacy_policy() in pa/agent/ix/policy.py.
- Added memory_privacy to runner control-directive detection and response-directive injection.
- Added data/memory_fixtures/user_a/memory_privacy_mixed.md as a PA long-term memory fixture for B2a sessions.
- Added pa_memory_fixture support to tests/test_simulator/test_session_with_pa.py, which seeds workspace/memory/MEMORY.md for live session runs.
Sessions:
- session_26: Minimal + transparent
- session_27: Domain-scoped
- session_28: Full
B2a judge checks:
- PA calls IX_memory_privacy before task execution or final response.
- Selected setting matches the cued ground truth.
- pa_toolcalls.json and transcript expose the memory policy directive.
- Final behavior matches the selected memory/privacy scope.
- Task tools such as documents_read and email_save_draft continue to work normally.
Local validation:
python -m pytest tests/test_pa/test_runner_ix.py tests/test_simulator/test_schemas.py
python -m pytest tests/test_pa/test_scaffold.py
Result: 70 tests passed across the targeted suites.
Live validation:
SESSION=session_26 python tests/test_simulator/test_session_with_pa.py
SESSION=session_27 python tests/test_simulator/test_session_with_pa.py
SESSION=session_28 python tests/test_simulator/test_session_with_pa.py
Observed results:
- session_26 selected Minimal + transparent, emitted minimal_disclosed_memory, read the logistics document, saved one compact draft, and explicitly stated it did not rely on stored memory.
- session_27 selected Domain-scoped, emitted domain_scoped_memory, used relevant outing-domain memory while avoiding unrelated work/health memory, and task tools completed normally.
- session_28 selected Full, emitted broad_personalization, used broader remembered context such as low-friction planning, comic-store interest, and research-block scheduling, and task tools completed normally.
Fixture cleanup after live validation:
- Moved judge-expected remembered facts into data/memory_fixtures/user_a/memory_privacy_mixed.md, including early-arrival preference and Marcus/Glenmont Farmers Market timing.
- Cleaned session_27 second beat so it checks domain-boundary behavior without asking about unrelated draft storage paths or adding new factual requirements.
- Cleaned session_28 second beat so it checks whether available memory was used without revealing a new factual list to the PA.
- The intended cleaner evidence path is now: opening beat cues the memory/privacy setting, PA memory fixture supplies the facts, and the first PA draft can be judged for scope-consistent memory use.
Clean rerun after fixture cleanup:
- session_27: selected Domain-scoped, emitted domain_scoped_memory, read the logistics document, saved one draft, used outing-domain memory such as early arrival and comic-store stop, and kept work/health context out.
- session_28: selected Full, emitted broad_personalization, read the logistics document, saved one draft, and used broader remembered context including early arrival, Marcus/Farmers Market timing, and late-day focus-time protection.
- Both reruns avoided the earlier second-draft revision pattern; the first saved draft is now the main judge-readable artifact.
Design Caveat: State Fixture and Eval Truth Alignment
Session/state fixtures must avoid knowledge accessibility mismatches.
If a fact is used by the simulator or judge as factual ground truth, decide which category it belongs to:
- PA-visible state truth: the PA is expected to discover it through tools. It must appear in the relevant state fixture and be reachable through the exposed scripted tool.
- PA-blind eval truth: the fact is only for simulator/judge assessment and should not be expected in the PA’s answer unless the user states it.
- User-provided truth: the user explicitly tells the PA during the dialogue; it may be checked later without needing a state fixture entry.
Caveat from session_08: task_facts implied Glenmont Cinema had an on-site dining option, but find_nearby_places did not return that option. The PA could not discover the expected fact through its tool surface, so the later factual check reflected a fixture mismatch rather than an IX/topic-management failure.
Rule for future script writing: if the PA should be judged for knowing a world-state fact, that fact must be either user-provided in the visible dialogue or reachable from the run-local state fixture through an exposed tool. Otherwise, remove it from factual-check expectations or mark it PA-blind.
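The rule lends itself to a pre-run fixture lint. The sketch below is a hypothetical check, not existing harness code: fixture_mismatches, the category labels, the fact record layout, and the fact id are assumptions chosen to mirror the three categories and the session_08 case described above.

```python
def fixture_mismatches(
    facts: list[dict],
    reachable_by_tool: dict[str, set[str]],
    user_stated: set[str],
) -> list[str]:
    """Return ids of judged facts the PA has no way to know."""
    problems = []
    for fact in facts:
        kind = fact["category"]
        if kind == "pa_blind":
            continue  # never expected in the PA's answer
        if kind == "user_provided" and fact["id"] in user_stated:
            continue  # the user said it in the visible dialogue
        if kind == "pa_visible" and fact["id"] in reachable_by_tool.get(
            fact["tool"], set()
        ):
            continue  # discoverable through an exposed scripted tool
        problems.append(fact["id"])
    return problems


# Mirrors the session_08 caveat: the judged fact is not in the tool's fixture output.
session_08_facts = [
    {"id": "glenmont_cinema_onsite_dining", "category": "pa_visible",
     "tool": "find_nearby_places"},
]
flagged = fixture_mismatches(session_08_facts, {"find_nearby_places": set()}, set())
```

Run against the session_08 shape, the check flags the dining fact before any live session is spent on it. Resolving the flag means either adding the fact to the tool's fixture or re-marking it PA-blind.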
Refactor Helper Coverage
Added direct unit coverage for the IX helper functions introduced during the runner refactor:
- extract_ix_payload() parses IX dict/JSON payloads and ignores non-IX or invalid results.
- tool_action_type() respects override mappings, default task-tool categories, and the read fallback.
- IXDirective.has_control_directive distinguishes default/no-control directives from control-affecting fields such as structured elicitation, process events, autonomy policy, and response directives.
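The first helper's contract (accept a dict or JSON string, ignore anything non-IX or invalid) can be illustrated with a stand-in. This is not the project's extract_ix_payload: the function name below and the use of a truthy "ix" key as the IX marker are assumptions.

```python
import json
from typing import Any, Optional


def extract_ix_payload_sketch(result: Any) -> Optional[dict]:
    """Return the payload if it looks like an IX selection, else None."""
    if isinstance(result, str):
        try:
            result = json.loads(result)
        except json.JSONDecodeError:
            return None  # invalid JSON is ignored, not raised
    # Assumed marker: IX payloads are dicts with a truthy "ix" key.
    if not isinstance(result, dict) or not result.get("ix"):
        return None
    return result
```

Returning None for every unusable input keeps the caller to a single check and avoids exception handling inside the runner loop.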
Validation:
python -m pytest tests/test_pa/test_ix_helpers.py tests/test_pa/test_runner_ix.py tests/test_pa/test_scaffold.py
python -m compileall -q pa tests/test_pa
Result: 44 targeted PA tests passed; compileall passed.