IX Interaction Tool Development Log

Status Checklist

Levels: B1 = passive selection (logged for judge, no runner control). B2 = runner control behavior (directives, blocking, event emission). B2a = judge-readable approximation; B2b = full user-visible staged interaction.

Control boundary

Control-unaffected IX tools do not change the runner’s task-tool execution flow. They do not block tools, reorder tools, cache pending calls, require user confirmation, control memory access, or trigger extra tool calls. They only shape the PA’s response strategy: wording, amount of detail, reasoning/uncertainty display, or completion behavior.

Control-affecting IX tools change runner execution behavior. They may block task tools, require clarification or confirmation before execution, affect tool-call sequencing, control memory/privacy behavior, or emit structured execution events.

Within control-affecting tools, blocking tools gate or block task-tool execution; response-directive tools inject response-shaping instructions without blocking tool calls. IX_task_expansion and IX_proactive_outreach are both response-directive: task expansion controls adjacent sub-tasks inside the current request; proactive outreach controls whether to open future next steps after the current task is complete.

Control-affecting IX tools — blocking

  • IX_process_visibility — B2a done; B2b (user-visible streaming) deferred
    • Silent
    • Bookend
    • Full narration (session_04)
  • IX_autonomy_level — B2a done; cross-turn pending-call replay deferred
    • Reactive (session_13)
    • Suggest (session_09)
    • Self-directed (session_14)
    • Autonomous (session_12)
  • IX_information_elicitation — B2a done; all settings live-validated
    • Infer (session_10)
    • Structured (session_07)
    • Iterative (session_11)

Control-affecting IX tools — response-directive

  • IX_topic_management — B2a done; all settings live-validated
    • Follow user’s flow (session_08)
    • Organize (session_05)
    • One-at-a-time (session_06)
  • IX_solution_breadth — B2a done; all settings live-validated
    • Low (session_15)
    • Medium (session_17)
    • High (session_16)
  • IX_proactive_outreach — B2a response directive done
    • Low (session_18)
    • Medium (session_19)
    • High (session_20)
  • IX_task_expansion — B2a done; all settings live-validated
    • Low (session_23)
    • Medium (session_24)
    • High (session_25)
  • IX_capability_boundary — B2a done; all settings live-validated
    • Suggest alternatives (session_21)
    • Find and hand off (session_22)
  • IX_memory_privacy — B2a done; full backend enforcement deferred (B2b)
    • Minimal + transparent (session_26)
    • Domain-scoped (session_27)
    • Full (session_28)

A-class control-unaffected IX tools

  • IX_tone_formality
  • IX_verbosity
  • IX_emotional_engagement
  • IX_guidance_level
  • IX_reasoning_visibility
  • IX_uncertainty_expression

Not yet registered (no IX tool in pa/agent/ix/specs.py)

None.


IX_task_expansion

Status: B2a runner directive implemented and live-validated for Low, Medium, and High.

Taxonomy definition: IX_task_expansion controls whether the PA stays inside the user’s explicit current request or expands the current task to related sub-tasks. It is distinct from IX_proactive_outreach: task expansion happens inside the current task boundary, while proactive outreach opens future next-step possibilities after completion. It is distinct from IX_solution_breadth: solution breadth controls how many options are considered for the same problem, while task expansion controls whether adjacent work becomes part of the task.

Settings and B2a behavior:

  • Low

    • Complete only the user’s explicit current request.
    • Do not add related sub-tasks, optional adjacent work, or extra tool calls beyond what is necessary for the stated task.
    • Policy: explicit_scope_only.
  • Medium

    • Keep the current task bounded.
    • Include obvious, low-risk, strongly related sub-steps needed to complete the task well.
    • Do not expand into optional future tasks.
    • Policy: necessary_adjacent_steps.
  • High

    • Actively expand the current task to adjacent support tasks that clearly serve the user’s stated goal.
    • Respect autonomy and external-action policies before executing.
    • Policy: adjacent_support_tasks.

B2a fixtures, 2026-05-07:

  • session_23: Low, create only the requested train-exhibition planning note.
  • session_24: Medium, include obvious logistics details needed to make the note usable.
  • session_25: High, include adjacent support tasks for visit preparation without treating them as post-completion outreach.

Live check, 2026-05-08:

  • session_23: selected Low, emitted explicit_scope_only, and eventually created only the requested logistics note. Not a clean validation because the first beat repeatedly guessed incorrect document paths and recovered only after the simulator asked it to list documents.
  • session_24: selected Medium, emitted necessary_adjacent_steps, appended the obvious necessary logistics details, and did not expand into a full itinerary.
  • session_25: selected High, emitted adjacent_support_tasks, created an expanded visit-prep note with ticket status, transport checks, bring/not-bring constraints, pre-departure checklist, and unresolved gaps. It did not execute external actions.
  • Across all three sessions, only the opening beat exposed/called IX_task_expansion; later beats inherited the session directive without repeated IX calls.
  • Follow-up fixture repair: revised session_23 to give the actual source path (my_desktop/glenmont_train_exhibition_logistics.md) in the opening cue. Rerun cleanly selected Low, emitted explicit_scope_only, read the correct document, saved one concise logistics note, and did not expand into adjacent planning.

Implementation boundary:

  • Registered as IX_task_expansion with settings Low / Medium / High.
  • Implemented through the existing IX directive/final-response path via task_scope_policy and task_scope_instruction.
  • It does not currently add a separate pending-task queue. The B2a signal is judge-readable: IX selection, directive event, task-tool pattern, and final behavior.
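
As an illustration of the mapping above, here is a minimal sketch of how a selected setting could become a directive. The policy names and the task_scope_policy / task_scope_instruction fields come from this log; the TaskScopeDirective type, the build_task_scope_directive helper, and the instruction wording are assumptions made for the example, not the code in policy.py.

```python
from dataclasses import dataclass

# Policy names are taken from the log entries above.
TASK_SCOPE_POLICIES = {
    "Low": "explicit_scope_only",
    "Medium": "necessary_adjacent_steps",
    "High": "adjacent_support_tasks",
}

# Instruction wording is an assumption, paraphrasing the setting descriptions.
TASK_SCOPE_INSTRUCTIONS = {
    "Low": "Complete only the user's explicit current request.",
    "Medium": "Keep the task bounded; include obvious, low-risk sub-steps.",
    "High": "Expand to adjacent support tasks that serve the stated goal; "
            "respect autonomy and external-action policies before executing.",
}


@dataclass(frozen=True)
class TaskScopeDirective:  # hypothetical container for the two fields
    task_scope_policy: str
    task_scope_instruction: str


def build_task_scope_directive(setting: str) -> TaskScopeDirective:
    """Translate an IX_task_expansion setting into a response directive."""
    if setting not in TASK_SCOPE_POLICIES:
        raise ValueError(f"unknown IX_task_expansion setting: {setting!r}")
    return TaskScopeDirective(
        task_scope_policy=TASK_SCOPE_POLICIES[setting],
        task_scope_instruction=TASK_SCOPE_INSTRUCTIONS[setting],
    )
```

The sketch carries no queue or tool-blocking state, which matches the note that B2a adds no separate pending-task queue.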

IX_proactive_outreach

Status: response-directive completion-policy tool implemented (B1 selection logging plus B2a response directive).

Definition: IX_proactive_outreach controls whether the PA adds future cognitive load for the user after finishing the current task. It is not task expansion: IX_task_expansion asks how wide the current task should become, while IX_proactive_outreach asks whether the PA should stop, lightly offer a future next step, or actively surface follow-up options after the current task is complete.

Settings:

  • Low

    • Complete the requested task and stop.
    • Do not add future next-step suggestions, optional follow-up menus, or “I can also…” prompts unless the user asks.
  • Medium

    • Complete the requested task.
    • Offer at most one clearly relevant future next step if it is worth the interruption.
    • Do not list a menu of options or execute the follow-up.
  • High

    • Complete the requested task.
    • Actively surface concrete follow-up options or a compact next-step plan.
    • Do not execute extra tools or take external actions unless the user explicitly asks or another interaction policy permits it.

Implementation boundary:

  • Registered as IX_proactive_outreach with settings Low / Medium / High.
  • Mapped to response policies: complete_and_stop, context_gated_offer, and active_follow_up_options.
  • Implemented through the existing final-response directive path. It does not block task tools, reorder tools, cache pending calls, or trigger extra tool execution.
  • Added coverage fixtures:
    • session_18: Low, archive the train exhibition note and stop.
    • session_19: Medium, archive the note and offer at most one obvious follow-up.
    • session_20: High, archive the note and actively surface concrete follow-up options.

IX_process_visibility

Status: B2 minimal runner-control slice implemented for Full narration.

IX_process_visibility selects how much execution-process information the PA should make visible during a turn. It is distinct from IX_reasoning_visibility: process visibility is about what the user sees while work is being carried out; reasoning visibility is about why the PA makes judgments or choices.

Settings:

  • Silent

    • Do not narrate the execution process before or between tool calls.
    • Give the user the result at the end.
    • Necessary failure or final-status reporting is still allowed.
  • Bookend

    • Give a brief start/intent note before the work.
    • Do not narrate every intermediate tool step.
    • Give a completion summary or final result after the work.
  • Full narration

    • Make each step in a multi-step sequence visible.
    • For two consecutive tool uses, the preferred interaction shape is: narrate step 1, execute tool 1, report step 1 result / next step, execute tool 2, report step 2 result, then synthesize.
    • This setting may later require runner-assisted sequencing, but B1 does not implement that control behavior.

Implementation boundary:

  • B1: expose and require IX_process_visibility when process_visibility is active; log the selected setting for judge evaluation.
  • B2 minimal slice: required IX selection gates task-tool execution. If a model tries to call task tools before selecting the required IX setting, the runner issues a repair prompt and does not execute those task tools.
  • When required IX calls and task calls appear in the same model response, the runner executes required IX calls first, converts selections into an IXDirective, then executes task tools and records process events.
  • IX directives persist across runner iterations, so a repaired IX-only selection can control task tools called in a later model response.
  • Current B2 event semantics: Full narration emits start, per-tool progress, and completion; Bookend emits start and completion; Silent emits no process events.
  • The provider message history still preserves valid tool-result ordering for any original assistant tool calls that are actually executed; judge-facing tool_events are IX-first.
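
The gating and ordering rules above can be sketched as follows. This is a simplified illustration, not the runner's actual code: ToolCall, GateResult, gate_tool_calls, the "setting" argument name, and the repair prompt wording are all hypothetical, and the directive is modeled as a plain dict that the caller keeps across iterations.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:  # hypothetical stand-in for a model tool call
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class GateResult:
    ix_calls: list       # IX selections, always ordered first
    task_calls: list     # task calls cleared to execute
    blocked_calls: list  # task calls withheld, not executed
    repair_prompt: Optional[str] = None


def gate_tool_calls(calls, required_ix, directive):
    """Order one model response IX-first and gate task tools.

    required_ix: names of IX tools that must be selected.
    directive:   dict of IX tool name -> setting; the caller keeps it
                 across runner iterations, and it is updated in place.
    """
    ix_calls = [c for c in calls if c.name.startswith("IX_")]
    task_calls = [c for c in calls if not c.name.startswith("IX_")]

    # IX selections are applied before any task tool is considered.
    for call in ix_calls:
        directive[call.name] = call.args.get("setting")

    missing = [name for name in sorted(required_ix) if name not in directive]
    if missing and task_calls:
        prompt = "Select " + ", ".join(missing) + " before calling task tools."
        return GateResult(ix_calls, [], task_calls, prompt)
    return GateResult(ix_calls, task_calls, [])
```

Because the directive dict outlives a single call, an IX-only repair response in one iteration clears task tools in the next, which is the persistence behavior described above.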

B2 test fixture, 2026-05-06:

  • Added data/scripts/user_a/session_04.yaml as the first dedicated IX_process_visibility = Full narration scenario.
  • Scenario shape: User A asks the PA to search email for "train", recover the train exhibition venue, then plan a route from home to the comic book store and then to the exhibition.
  • Added scripted state tools email_search, email_read, and maps_route_options; session_04 exposes them by default alongside existing state tools.
  • email_search returns search matches/snippets; email_read(email_id) returns the full body. This keeps the tool surface closer to a real mailbox and avoids forcing repeated search calls when the PA needs details such as the civic address.
  • This fixture creates a real multi-tool sequence for B2 validation: IX selection -> email search -> email read -> route lookup -> final itinerary.
  • B2 controller/event rendering has a minimal implementation: process events render separately in transcript markdown and are exported under pa_toolcalls.json beat-level process_events.
  • 2026-05-06 live-model check exposed a missing case: the PA may call task tools first and only select IX after runner repair. Fixed by blocking task-tool execution until required IX is selected and by persisting directives across iterations. Local regression tests pass; rerun live session to validate transcript behavior.

B2a vs B2b process visibility

IX_process_visibility = Full narration should ultimately mean execution-stage visibility, not merely a final answer that summarizes steps after all tools have already run.

Current implementation: B2a / judge-readable approximation

  • The runner enforces IX selection before task tools.
  • It records structured process_events such as start, per-tool progress, and completion.
  • Transcript and pa_toolcalls.json expose these events for judge inspection.
  • This is sufficient for the current simulator setup because the simulator reads completed PA turns; it does not interrupt while tools are executing.
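
For reference, the B2a event semantics (Full narration emits start, per-tool progress, and completion; Bookend emits start and completion; Silent emits nothing) can be sketched as below. The event dictionary shape and the build_process_events name are assumptions for illustration; the real process_events schema in pa_toolcalls.json may differ.

```python
def build_process_events(setting, tool_names):
    """Return the judge-readable process events for one turn.

    setting:    "Silent", "Bookend", or "Full narration".
    tool_names: task tools executed in this turn, in order.
    """
    if setting == "Silent":
        return []
    if setting not in ("Bookend", "Full narration"):
        raise ValueError(f"unknown IX_process_visibility setting: {setting!r}")

    events = [{"type": "start"}]
    if setting == "Full narration":
        # One progress event per executed task tool.
        events.extend({"type": "progress", "tool": name} for name in tool_names)
    events.append({"type": "completion"})
    return events
```

These events are recorded after the fact for the judge. They do not interleave user-visible messages between tool calls, which is the B2b target.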

Future target: B2b / user-visible staged interaction

  • The PA should surface process updates between execution stages, not only in the final response.
  • Preferred shape for Full narration:
      PA: I found the relevant email; the venue is X. I am using that venue for route lookup now.
      tool call
      PA: I found three route options. I am comparing time and price now.
      tool call or synthesis
      PA: Final recommendation...
  • This stronger version matters for real-time human interaction, streaming UX, and any future simulator capable of reacting mid-execution.
  • It is intentionally not implemented yet. B2a should not be treated as the final semantic target; it is the current benchmark-readable scaffold.

Scenario cue-writing lesson

The first live version of session_04 over-cued Full narration: the simulator used explicit numbered steps and direct instructions such as making each step visible. That tested obedience to a benchmark instruction, not preference recognition.

Future active-preference scenes should follow this pattern:

  • Cue the preference through a realistic task friction, not the taxonomy label or setting definition.
  • User dialogue should state the task and a natural risk point; it should not read like a rubric.
  • For process_visibility, the natural risk is premise verification: the route must be based on the correct email and venue, so the PA should surface what it found before acting on it.
  • Stronger preference reactions belong in follow-up beats after the PA makes an error; the opening request should remain plausible as ordinary user speech.
  • Active skills are benchmark metadata, not user-facing instructions.

Applied to session_04: revised the open beat to remove the explicit cue phrases "step one/two/three", "narrate each step", and "final answer without intermediate process language". The new cue is: do not guess the venue and do not return an itinerary that appears from nowhere, because User A needs to verify the email/venue premise before trusting the route.

IX_information_elicitation

Status: B2a unified elicitation protocol implemented.

Session-level IX selection rule:

  • IX tools are selection tools, not per-turn restatement tools.
  • Once an IX attribute has been selected in a session, the same attribute should not be exposed as a required IX tool again in later beats.
  • The selected IXSelection is cached by session and inherited by the runner as an IXDirective for later turns.
  • This rule applies to all IX tools, not only IX_information_elicitation. For example, IX_process_visibility=Full narration should continue to emit process events in later turns without asking the PA to call IX_process_visibility again.
  • Transcript semantics: active_ix_tools means tools currently exposed/required in this beat; session_ix_tools means IX tools whose preferences are active for the session, including ones already selected and inherited.
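
A minimal sketch of the session-level rule, assuming a simple per-session cache. SessionIXState and its method names are hypothetical; only the active_ix_tools and session_ix_tools field names come from this log. Keeping the first selection when a tool is selected twice is also an assumption.

```python
class SessionIXState:
    """Caches IX selections for one session (illustrative only)."""

    def __init__(self):
        self._selections = {}  # IX tool name -> selected setting

    def record_selection(self, tool, setting):
        # Assumption: the first selection is kept for the session.
        self._selections.setdefault(tool, setting)

    def inherited_directives(self):
        """Selections the runner should keep applying in later turns."""
        return dict(self._selections)

    def tools_for_beat(self, cued_tools):
        """Split IX tools into per-beat exposure and session-wide scope."""
        # Already-selected tools are not exposed as required again.
        active = [t for t in cued_tools if t not in self._selections]
        session = sorted(set(cued_tools) | set(self._selections))
        return {"active_ix_tools": active, "session_ix_tools": session}
```

After IX_process_visibility is selected in the opening beat, a later beat with no new cue reports it under session_ix_tools only, while the cached setting keeps driving process events.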

Taxonomy definition: IX_information_elicitation controls how the PA handles missing or ambiguous user input. It is about whether the PA asks or infers when the user’s request is underspecified. It is distinct from IX_uncertainty_expression (the PA’s own uncertainty), IX_autonomy_level (permission to act), and IX_process_visibility (progress narration).

Settings and B2a behavior:

  • Infer

    • The PA may proceed with task tools even when required slots are missing.
    • Runner records an elicitation event with stage=allowed_with_missing_slots.
    • Judge should evaluate whether assumptions were reasonable for the persona/context.
  • Structured

    • The PA must collect the minimum necessary missing information before task-tool execution.
    • If task tools are attempted while required slots remain missing, runner blocks those tool calls with synthetic status=blocked tool results and prompts the PA to ask upfront clarification questions covering the missing slots.
    • This matches User A’s current ground truth: ask the needed details together, then let the PA proceed; do not inflate the clarification pass into an exhaustive checklist.
  • Iterative

    • The PA may collect information incrementally across turns.
    • If no required slot has been clarified yet, runner blocks task tools and prompts for at least one relevant clarification question.
    • If some slots are filled but others remain, runner allows task tools and records stage=allowed_incremental_with_remaining_slots; judge can evaluate whether the PA checked in at meaningful decision points.
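
The three settings differ only in when task tools are allowed, so the decision can be sketched as one function. This is an approximation under stated assumptions: elicitation_gate and the return shape are hypothetical, and the stage names other than allowed_with_missing_slots and allowed_incremental_with_remaining_slots are invented for the example. The optional require_all_slots argument mirrors the elicitation.task_tools_require_all_slots flag recorded in the session fixture notes.

```python
def elicitation_gate(setting, required_slots, filled_slots, require_all_slots=False):
    """Decide whether task tools may run given slot state.

    Slot truth comes from beat metadata; nothing is parsed from user text.
    """
    missing = [s for s in required_slots if s not in filled_slots]
    any_filled = len(missing) < len(required_slots)

    def result(allow, stage):
        return {"allow_task_tools": allow, "stage": stage, "missing": missing}

    if not missing:
        return result(True, "all_slots_filled")  # hypothetical stage name
    if setting == "Infer":
        return result(True, "allowed_with_missing_slots")
    if setting == "Structured":
        # Block until every missing slot is asked about upfront.
        return result(False, "blocked_ask_all_missing")  # hypothetical
    if setting == "Iterative":
        if any_filled and not require_all_slots:
            return result(True, "allowed_incremental_with_remaining_slots")
        # Nothing clarified yet, or final-artifact tools need every slot.
        return result(False, "blocked_ask_next_question")  # hypothetical
    raise ValueError(f"unknown IX_information_elicitation setting: {setting!r}")
```

In the real runner a blocked outcome also produces synthetic status=blocked tool results and a repair prompt; the sketch only covers the allow/block decision.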

Implementation boundary:

  • required_slots are artifact/content requirements unless a scenario explicitly says otherwise. They describe facts needed to produce a useful task artifact, such as an email body, plan, or recommendation; they are not necessarily task-tool schema parameters.
  • Use generic task tools where possible. Do not create low-reuse scripted tools solely to expose every missing slot as a formal parameter.
  • Slot truth is provided by session/beat metadata under beat.elicitation; the runner does not parse natural-language user answers into slots.
  • simulator.loop forwards beat.elicitation to the PA as ix_context.elicitation.
  • AgentRunSpec.ix_context carries this context into the runner.
  • Transcript markdown renders [Information elicitation]; pa_toolcalls.json exports beat-level elicitation_events.
  • This is B2a, not a full dialogue-state slot-filling system. Future B2b may add explicit slot state across turns or simulator-side slot-fill events.

Expected session pattern:

  • User starts underspecified.
  • PA selects IX_information_elicitation.
    • Infer: proceeds with assumptions.
    • Structured: asks all missing questions upfront; simulator answers all at once.
    • Iterative: asks one/few questions, simulator answers only those, then the task advances across more turns.

Session fixture:

  • Added data/scripts/user_a/session_07.yaml for IX_information_elicitation = Structured.
  • Scenario: User A asks for an external inquiry email about the train exhibition using underspecified language ("usual constraints").
  • The missing slots are email-body/content requirements, not parameters of a special exhibition inquiry tool. The PA should notice that a useful external email cannot be drafted around undefined content.
  • We tightened the open beat so the cue stays natural and does not tell the PA to enumerate a checklist or state how many questions to ask.
  • Open beat exposes only IX_information_elicitation and has elicitation.required_slots for exact_visit_date, party_size, constraints_to_check, and draft_only_or_send, with filled_slots: []. The intended Structured behavior is still a concise four-question clarification pass, not an encyclopedic intake checklist.
  • answer_clarifications beat fills all slots at once, matching the Structured rhythm.
  • session_08.yaml remains the separate topic_management=Follow user's flow coverage fixture; session numbering gap is now filled.
  • Added data/scripts/user_a/session_10.yaml for IX_information_elicitation = Infer.
  • Scenario: User A wants the same kind of external inquiry email, but explicitly wants a workable draft now and permits reasonable assumptions instead of a clarification loop.
  • Open beat again exposes IX_information_elicitation and the same artifact/content required_slots, but the intended behavior is to proceed with assumptions rather than stop to ask for all missing details.
  • This fixture gives us a live-check target for the Infer branch while keeping session_07 as the Structured coverage fixture.
  • Live run outcome: session_10 successfully exercised the Infer branch. The PA selected Infer, saved a draft immediately, and made reasonable assumptions for the phrasing. The fixture is intentionally soft: the open cue already supplies most of the key facts, so this validates the control path more than a harsh missing-slot scenario.
  • Added data/scripts/user_a/session_11.yaml for IX_information_elicitation = Iterative.
  • Scenario: User A is willing to answer clarifying questions one at a time and wants the PA to keep the exchange moving without collapsing everything into a single intake.
  • The session is structured across multiple beats so the PA can ask a small question, receive one answer, and continue incrementally before drafting.
  • This fixture gives us a live-check target for the Iterative branch and the turn-rhythm difference between Infer, Structured, and Iterative.
  • Live run outcome: session_11 selected Iterative in the early clarification turns and used one-question-at-a-time pacing, but the PA’s questions were somewhat broad rather than slot-targeted. After the user supplied the remaining details, the PA switched to Infer and drafted without more questions, which is acceptable once the missing content is available.
  • Fixture repair: answer_constraints now marks draft_only_or_send as filled because the simulator line explicitly says “Draft only — do not send.”
  • Fixture repair: slot-specific beats (answer_date, answer_party_size, answer_constraints) were replaced with neutral clarification beats (answer_first_clarification, answer_second_clarification, answer_remaining_details). The simulator should answer the slot most relevant to the PA’s actual question, instead of forcing a fixed slot order that can mismatch the PA’s wording.
  • Fixture repair: the opening request now provides the recipient address and signature name so iterative turns are spent on the target content slots instead of peripheral email metadata.
  • Live run outcome: the neutral-beat version improved the first clarification alignment, but the simulator’s first answer said “proceed with drafting”, which caused the PA to switch from Iterative to Infer and save a draft before date/party/draft-only were filled.
  • Fixture repair: answer_first_clarification now explicitly forbids authorizing drafting. It should answer one slot and let the PA ask the next question.
  • Runner repair: Iterative now supports elicitation.task_tools_require_all_slots: true. When enabled, the PA may still collect information one question at a time, but final task/artifact tools remain blocked until all declared required slots are filled.
  • session_11 enables task_tools_require_all_slots because email_save_draft creates the final artifact and should not run while date, party size, or draft/send status remain missing.
  • Live run outcome: session_11 now shows the intended behavior. IX_information_elicitation=Iterative is selected once, inherited in later beats, early email_save_draft is blocked while required slots remain missing, and the real draft is saved only after the remaining details are supplied.
  • Transcript/toolcall repair: duplicate elicitation control events are deduplicated in rendered/exported outputs so judge-facing artifacts do not show the same block event twice.
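
The deduplication in the last repair can be sketched as an order-preserving filter. How the real renderer decides that two events are "the same" is not recorded here, so comparing the full event content is an assumption, as are the dedupe_events name and the example event fields.

```python
import json


def dedupe_events(events):
    """Drop repeated control events while keeping first-seen order."""
    seen = set()
    unique = []
    for event in events:
        # Canonical JSON gives a hashable key for dict-shaped events.
        key = json.dumps(event, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        unique.append(event)
    return unique
```

Applying this only at render/export time leaves the raw event log intact for debugging while judge-facing artifacts show each block event once.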

IX_topic_management

B2 design update, 2026-05-06:

  • Taxonomy clarified: Topic Management is turn-level handling of multiple user-raised topics within one assistant turn.
  • Follow user's flow: preserve the user’s order and associative flow without forcing structure or turn splitting.
  • Organize: address multiple topics in one assistant turn, but group, order, and label them clearly.
  • One-at-a-time: address only one topic in the current assistant turn and explicitly defer the remaining topics until the user replies.
  • Implementation uses the existing IXDirective/policy.py path; no second directive framework.
  • B2 minimal slice injects a response-composition directive after IX_topic_management selection. It does not yet implement automatic topic segmentation or a pending-topic queue.
  • Added data/scripts/user_a/session_05.yaml as the first dedicated topic_management=Organize fixture in work_external. The scene uses a deliberately messy external train-exhibition logistics email request so the PA must organize main topics, constraints, and optional side topics in one response.
  • One-at-a-time B2 interaction control now means: address one topic in the current assistant response, keep remaining topics deferred, and ask for lightweight user confirmation before moving to the next topic. It is not automatic PA multi-message continuation.
  • Added data/scripts/user_a/session_06.yaml as the first dedicated topic_management=One-at-a-time fixture in personal_internal. The scene uses comic book pickup -> movie showtime -> dinner planning across beats, with user confirmations between topics.
  • Added generic scripted tools for two session_06 beats: check_store_item for store/item availability and check_cinema_showtimes for cinema/movie/date lookup. These are scenario-reusable state tools, not train- or comic-specific IX tools.
  • Added find_nearby_places(location, category, limit) as a generic scripted local-place lookup for the dinner beat. This keeps dinner behavior tool-supported without hard-coding a scenario-specific restaurant tool.
  • Do not hard-code “do not re-plan completed topics” into the IX directive; simulator/user reactions should handle that conversational correction.
  • 2026-05-07 live session_06 check: One-at-a-time B2 minimal behavior validated. Each active beat exposed only IX_topic_management, selected One-at-a-time, emitted single_topic_per_turn, used the matching task tool for the current topic (check_store_item, check_cinema_showtimes, find_nearby_places), and waited for lightweight confirmation before moving on.
  • Added data/scripts/user_a/session_08.yaml for Follow user's flow coverage. User A’s current matrix has no Follow user's flow topic-management cell, so this is marked phase: tool_coverage and uses beat-level active_skill_settings override for the simulator actor skill only. This tests preserve_user_flow runner/directive behavior without changing the persona matrix.
  • 2026-05-07 live session_08 check: Follow user's flow B2 minimal behavior validated for IX selection/directive/tool flow. The PA selected Follow user's flow, emitted preserve_user_flow, used the three task tools in the user’s natural order, and did not impose one-at-a-time confirmations. Residual issue: the dinner fixture/task facts mention Glenmont Cinema’s own dining, but find_nearby_places does not return an on-site dining option; this is a state-fixture/data issue, not a topic-management runner issue.

IX_autonomy_level

B2a design update, 2026-05-07:

  • IX_autonomy_level controls whether the PA must return control to the user before acting. It is separate from IX_information_elicitation (missing information) and exploration attributes such as task_expansion or solution_breadth (how far beyond the request to go).
  • Runner maps selected settings to an autonomy directive:
    • Reactive -> confirm_every_step: task tools are blocked until the PA asks for confirmation.
    • Suggest -> confirm_key_actions: read/preparation/draft tools may run, but state-changing or external-impact tools are blocked for confirmation.
    • Self-directed -> execute_within_scope: routine in-scope work may run, but external-impact tools are blocked for confirmation.
    • Autonomous -> execute_delegated_task: task tools may run unless an exception, failure, or capability boundary arises.
  • Tool action type is intentionally simple in B2a: read, draft, internal_write, or external_action. Runner has a small default mapping for current state tools and accepts ix_context.tool_action_types overrides in tests/scenarios.
  • Blocking is synthetic and judge-readable: the runner returns blocked tool results, emits autonomy events, and adds a repair prompt asking for confirmation. It does not yet maintain a cross-turn pending-call replay queue.
  • Added generic state tool email_send as a reusable external-action tool. session_09 exposes documents_read, email_save_draft, and email_send, so Suggest can be tested against draft-vs-send behavior.
  • Added data/scripts/user_a/session_09.yaml as the first autonomy_level=Suggest coverage fixture. The scenario asks the PA to prepare external coordination logistics while preserving the boundary that nothing should be sent externally without confirmation.
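
The setting-to-policy mapping and the action-type check can be sketched together. Policy names and the four action types come from this log; gate_autonomy, the table names, and the return shape are hypothetical. Three choices are assumptions: the default action types for documents_read and email_save_draft, treating internal_write as a "state-changing" action under Suggest, and treating unknown tools as external_action.

```python
AUTONOMY_POLICIES = {
    "Reactive": "confirm_every_step",
    "Suggest": "confirm_key_actions",
    "Self-directed": "execute_within_scope",
    "Autonomous": "execute_delegated_task",
}

# Action types that require confirmation under each policy.
CONFIRM_REQUIRED = {
    "confirm_every_step": {"read", "draft", "internal_write", "external_action"},
    "confirm_key_actions": {"internal_write", "external_action"},  # assumption
    "execute_within_scope": {"external_action"},
    "execute_delegated_task": set(),
}

# Small default mapping; the read and draft entries are assumed.
DEFAULT_TOOL_ACTION_TYPES = {
    "documents_read": "read",
    "email_save_draft": "draft",
    "planning_note_append": "internal_write",
    "email_send": "external_action",
}


def gate_autonomy(setting, tool_names, overrides=None):
    """Split attempted task tools into allowed and blocked lists.

    overrides plays the role of ix_context.tool_action_types.
    """
    policy = AUTONOMY_POLICIES[setting]
    action_types = {**DEFAULT_TOOL_ACTION_TYPES, **(overrides or {})}
    allowed, blocked = [], []
    for name in tool_names:
        # Unknown tools are treated conservatively (assumption).
        action = action_types.get(name, "external_action")
        (blocked if action in CONFIRM_REQUIRED[policy] else allowed).append(name)
    return {"policy": policy, "allowed": allowed, "blocked": blocked}
```

The sketch has no pending-call queue: a blocked call is simply reported, consistent with cross-turn replay being out of scope for B2a.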

Session-level active IX correction, 2026-05-07:

  • Control-affecting IX tools such as IX_autonomy_level must persist across a session once cued. A later beat with active_skills: [] no longer means the control boundary disappears; it means no new preference is cued in the simulator prompt.
  • SimulatorLoop now maintains session_active_skills and passes the cumulative active IX set to the PA each beat. Transcript beat_enter records both the beat-local cue (active_skills) and the currently exposed session-level tools (active_ix_tools).
  • session_09 now includes a concrete logistics note at data/state_fixtures/user_a/my_desktop/glenmont_train_exhibition_logistics.md, so the PA can use a successful read tool instead of probing nonexistent desktop paths.
  • Transcript markdown compresses repeated autonomy events into a readable summary: autonomy setting, policy, allowed tools, and blocked tools. pa_toolcalls.json still preserves the full beat-level autonomy_events for judge/debug use.

Autonomous coverage fixture, 2026-05-07:

  • Added data/scripts/user_a/session_12.yaml for IX_autonomy_level = Autonomous. session_10 and session_11 are already used by information elicitation, so Autonomous coverage starts at session_12.
  • The fixture reuses data/state_fixtures/user_a/my_desktop/glenmont_train_exhibition_logistics.md and exposes email_send, so the PA can read the note and complete a bounded external coordination send.
  • Expected runner/judge signal: IX_autonomy_level selected as Autonomous, autonomy policy execute_delegated_task, mcp_state_documents_read and mcp_state_email_send allowed, no blocked tools, and a sent-email state record.
  • Live check caveat: the simulator should not infer that the PA skipped a prerequisite tool merely because the final completion message is concise. For session_12, the tool audit is the source of truth for read-before-send ordering; the reaction beat now avoids reopening approval unless the PA explicitly says it skipped the note or fails to send.

Reactive coverage fixture, 2026-05-07:

  • Added data/scripts/user_a/session_13.yaml for IX_autonomy_level = Reactive.
  • The fixture tests confirmation-before-action behavior using the existing train-exhibition logistics note. User A names the note but does not authorize reading it yet.
  • Expected runner/judge signal: IX_autonomy_level selected as Reactive, autonomy policy confirm_every_step, and either no task tool call before confirmation or a blocked mcp_state_documents_read attempt. Cross-turn pending-call replay remains out of scope for B2a.
  • Live check caveat: the simulator must distinguish proposing the first action from executing it. In session_13, “my first action would be to read the note” is correct Reactive behavior if no read/open/query tool is called.

Self-directed coverage fixture, 2026-05-07:

  • Added generic state tool planning_note_append as a reusable internal-write tool. It writes run-local planning notes only; it is not memory, not a user preference record, and not an external action.
  • Runner maps planning_note_append / mcp_state_planning_note_append to action type internal_write.
  • Added data/scripts/user_a/session_14.yaml for IX_autonomy_level = Self-directed.
  • The fixture asks the PA to read the train-exhibition logistics note and create an internal planning note without asking about each routine step, but to stop before contacting Marcus externally.
  • Expected runner/judge signal: IX_autonomy_level selected as Self-directed, autonomy policy execute_within_scope, mcp_state_documents_read and mcp_state_planning_note_append allowed, and mcp_state_email_send blocked or avoided until confirmation.

IX_solution_breadth

Status: B2a runner directive implemented and live-validated for Low, Medium, and High.

Taxonomy definition: IX_solution_breadth controls how widely the PA explores the solution space before answering. It is distinct from IX_task_expansion: solution breadth varies the number and comparison depth of options inside the requested task, while task expansion decides whether to add adjacent tasks.

Settings and B2a behavior:

  • Low
    • Policy: one_best_answer.
    • Present the single best option or recommendation.
    • Mention alternatives only if needed to justify the recommendation; do not provide a broad comparison.
  • Medium
    • Policy: shortlist.
    • Present a small shortlist of the most relevant options, then identify the recommended choice.
  • High
    • Policy: broad_alternatives.
    • Explore a broader set of viable alternatives and compare tradeoffs before narrowing.

B2a test fixture, 2026-05-07:

  • Added data/scripts/user_a/session_15.yaml for IX_solution_breadth = Low.
  • The fixture asks the PA to choose a route from User A’s apartment to Stuart’s Comic Center and then to Glenmont Civic Exhibition Hall, while cueing that the user wants one best route, not a menu of alternatives.
  • Expected runner/judge signal: IX_solution_breadth selected as Low, response policy one_best_answer, mcp_state_maps_route_options used, and final answer centered on one recommended route with time and price.
  • Added data/scripts/user_a/session_17.yaml for IX_solution_breadth = Medium.
  • The fixture uses the same route option surface, but cues that the user wants a small practical shortlist plus a recommendation, not one unexplained answer and not a full option-space analysis.
  • Expected runner/judge signal: IX_solution_breadth selected as Medium, response policy shortlist, mcp_state_maps_route_options used, and final answer gives a small shortlist, briefly compares time/cost, then recommends one route without expanding to adjacent tasks.
  • Added data/scripts/user_a/session_16.yaml for IX_solution_breadth = High.
  • The fixture uses the same route option surface, but cues that the user is undecided and wants viable choices plus tradeoffs before narrowing.
  • Expected runner/judge signal: IX_solution_breadth selected as High, response policy broad_alternatives, mcp_state_maps_route_options used, and final answer compares multiple route options inside the route task without expanding to adjacent tasks such as tickets, dinner, or calendar changes.

Verification, 2026-05-07:

  • Re-ran python -m pytest tests/test_simulator/test_schemas.py tests/test_harness/test_state_runtime.py tests/test_pa/test_runner_ix.py -q: 49 passed.
  • Checked data/runs/step2-test/user_a/session_16/transcript.md and pa_toolcalls.json: opening beat required and called only IX_solution_breadth; selected High; emitted directive broad_alternatives; called mcp_state_maps_route_options with the apartment -> Stuart’s Comic Center -> Glenmont Civic Exhibition Hall route.
  • Final PA answer compared three route options (rideshare, public transit, walk + transit) across time, cost, and reliability/tradeoffs. No adjacent task expansion to tickets, dinner, calendar, or other planning tasks appeared in the session_16 transcript.

Session 17 run check, 2026-05-07:

  • Checked data/runs/step2-test/user_a/session_17/transcript.md and pa_toolcalls.json: the fixture cued a two-option shortlist, but the PA selected IX_solution_breadth = Low and emitted one_best_answer, not the expected Medium / shortlist.
  • The PA did call mcp_state_maps_route_options and stayed within the route task, but the opening answer was structured as a single recommendation with comparative justification rather than a clean two-option shortlist.
  • Simulator correction pushed the answer into the intended structure, but the judge-visible opening IX selection is a failure for Medium coverage. Session 17 should be rerun after strengthening the cue or tool-selection guidance so “two strongest candidates plus recommendation” maps to Medium, not Low.
  • 2026-05-07 rerun after model-facing setting descriptions: session_17 now passes Medium coverage. The opening beat required and called only IX_solution_breadth, selected Medium, emitted shortlist, called mcp_state_maps_route_options, and produced exactly two practical options plus a recommendation. The simulator accepted the shortlist format; its remaining pushback concerned the recommendation’s implicit time-value assumption, not solution-breadth behavior.

Model-Facing Setting Definitions

The session_17 failure exposed a general IX spec issue: the model-facing IXPreferenceTool description previously listed only enum labels, such as Low / Medium / High, without defining the operational meaning of each setting. This made boundary cases easy to misclassify; for example, “two practical options plus a recommendation” was pulled toward Low because the word “recommendation” was present.

Resolved design:

  • IXToolSpec now requires setting_descriptions for every registered IX tool.
  • IXPreferenceTool.description exposes each setting and its definition to the PA model.
  • These descriptions are model-facing selection guidance, not eval-only metadata.
  • All currently registered IX tools have setting descriptions; future IX tools must add them when registered.
  • session_17 cue was also tightened to explicitly ask for a compact two-option shortlist, while saying not to collapse into a single answer and not to expand into a full route-options analysis.

For IX_solution_breadth, the intended model-facing boundary is:

  • Low: focused answer; converge to one best option or recommendation.
  • Medium: shortlist; present 2-3 practical options, briefly compare them, then recommend one.
  • High: open option space; keep multiple viable alternatives on the table and compare tradeoffs before narrowing.
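
One way to enforce "every registered IX tool has setting descriptions" and to render them for the model is sketched below. This is an illustration under stated assumptions: IXToolSpecSketch and render_description are hypothetical stand-ins, and the real IXToolSpec and IXPreferenceTool.description may be shaped differently. Only the setting labels and their definitions are taken from this log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IXToolSpecSketch:
    """Illustrative stand-in for IXToolSpec; field names are assumptions."""

    name: str
    settings: tuple[str, ...]
    setting_descriptions: dict[str, str]

    def __post_init__(self) -> None:
        # Reject specs that leave any setting undefined or empty.
        missing = [s for s in self.settings if not self.setting_descriptions.get(s)]
        if missing:
            raise ValueError(f"{self.name}: missing setting descriptions for {missing}")


def render_description(spec: IXToolSpecSketch) -> str:
    """Build model-facing text that lists each setting with its definition."""
    lines = [f"{spec.name}: choose one setting."]
    lines += [f"- {s}: {spec.setting_descriptions[s]}" for s in spec.settings]
    return "\n".join(lines)


SOLUTION_BREADTH = IXToolSpecSketch(
    name="IX_solution_breadth",
    settings=("Low", "Medium", "High"),
    setting_descriptions={
        "Low": "focused answer; converge to one best option or recommendation.",
        "Medium": "shortlist; present 2-3 practical options, briefly compare "
                  "them, then recommend one.",
        "High": "open option space; keep multiple viable alternatives on the "
                "table and compare tradeoffs before narrowing.",
    },
)
```

Failing at construction time means a new IX tool cannot be registered with label-only settings, which is the gap the session_17 failure exposed.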

IX_capability_boundary

Status: B2a runner directive implemented and live-validated for both settings.

Taxonomy definition: IX_capability_boundary controls how the PA responds when the original user request exceeds current PA capability. It is distinct from IX_solution_breadth because the requested task is not directly completable; it is distinct from IX_task_expansion because the issue is fallback/handoff after a boundary, not optional expansion beyond a doable task.

Settings and B2a behavior:

  • Suggest alternatives
    • Policy: suggest_in_pa_workarounds.
    • Acknowledge the capability limit, avoid overclaiming confirmation, and suggest feasible in-PA workarounds or next-best alternatives.
    • Do not initiate external handoff unless the user asks.
  • Find and hand off
    • Policy: find_handoff_path.
    • Acknowledge the capability limit, find the appropriate external contact/channel when possible, and prepare a handoff artifact such as an email draft or call script.

B2a fixtures, 2026-05-07:

  • Added data/scripts/user_a/session_21.yaml for IX_capability_boundary = Suggest alternatives.
    • Scenario: User A asks whether Glenmont Civic Exhibition Hall offers quiet-entry / low-crowd accommodation for the train exhibition, but explicitly says not to hand the task off yet if the PA cannot directly confirm.
    • Expected runner/judge signal: IX_capability_boundary selected as Suggest alternatives, response policy suggest_in_pa_workarounds, PA acknowledges it cannot directly confirm, may inspect local notes, and suggests safe in-PA alternatives without drafting or routing externally.
  • Added data/scripts/user_a/session_22.yaml for IX_capability_boundary = Find and hand off.
    • Scenario: same accommodation question, but User A asks the PA to find the right handoff path and prepare the message rather than giving generic alternatives.
    • Expected runner/judge signal: IX_capability_boundary selected as Find and hand off, response policy find_handoff_path, mcp_state_contacts_lookup used to find the venue accessibility channel, and mcp_state_email_save_draft used to prepare a handoff draft.

Implementation notes:

  • Added model-facing setting descriptions for IX_capability_boundary.
  • Added B2a response directives for both settings.
  • Added scripted contacts_lookup state tool backed by contacts.json.
  • session_21 and session_22 expose documents_read, email_save_draft, and contacts_lookup; the expected behavior decides whether contact lookup/drafting is appropriate.
  • Local validation: python -m pytest tests/test_pa/test_scaffold.py tests/test_pa/test_runner_ix.py tests/test_simulator/test_schemas.py tests/test_harness/test_state_runtime.py tests/test_state_server.py tests/test_state_server_mcp.py -> 82 passed.

Live validation, 2026-05-07:

  • session_21 passed Suggest alternatives: opening beat exposed and required only IX_capability_boundary, selected Suggest alternatives, emitted suggest_in_pa_workarounds, acknowledged it could not directly confirm the accommodation, did not fabricate confirmation, did not contact or draft to an external channel, and offered in-PA preparation alternatives.
  • session_22 passed Find and hand off: opening beat exposed and required only IX_capability_boundary, selected Find and hand off, emitted find_handoff_path, used mcp_state_contacts_lookup, identified the Glenmont Civic Exhibition Hall Accessibility Desk, and saved draft_0001 to accessibility@glenmontcivic.example.
  • Caveat: session_22 included a noisy documents_read(path='.') error and several failed contact queries before finding the right contact. This does not invalidate the IX behavior, but it suggests future scripted lookup tools should either support broader token matching or expose clearer query guidance.
  • Follow-up fix: contacts_lookup now uses token-overlap matching so long natural queries can match relevant contacts, and documents_read(path='.') now returns a directory file listing instead of raising an error. Local state-server tests pass.
  • Rerun after fix: session_22 is cleaner. documents_read returned a directory listing, contacts_lookup matched the accessibility contact on the first query, no tool errors appeared, and the PA still selected Find and hand off, identified the accessibility desk, and saved a handoff draft.
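
Token-overlap matching of the kind described in the follow-up fix can be sketched as below. This is not the contacts_lookup implementation: the function names, the min_overlap threshold, and the contact record shape are assumptions, and the second contact is an invented placeholder. The accessibility desk name and address are the ones recorded for session_22.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lookup_contacts(query: str, contacts: list[dict], min_overlap: int = 2) -> list[dict]:
    """Return contacts sharing at least min_overlap tokens with the query,
    ordered from highest to lowest overlap."""
    query_tokens = tokens(query)
    scored = []
    for contact in contacts:
        haystack = tokens(" ".join(str(value) for value in contact.values()))
        overlap = len(query_tokens & haystack)
        if overlap >= min_overlap:
            scored.append((overlap, contact))
    # Sort on the score only, so contact dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [contact for _, contact in scored]


CONTACTS = [
    {
        "name": "Glenmont Civic Exhibition Hall Accessibility Desk",
        "email": "accessibility@glenmontcivic.example",
    },
    {"name": "Stuart's Comic Center front desk"},  # placeholder entry
]

matches = lookup_contacts(
    "who handles quiet entry accessibility at Glenmont Civic Exhibition Hall",
    CONTACTS,
)
```

A long natural-language query now matches because it only needs to share a few tokens with the record, rather than appear in it as an exact substring.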

B1 Transcript / Judge-Readiness Audit

Status: implemented for B1 visualization.

Goal: make each session transcript sufficient to check whether active IX tools were exposed, whether required IX tools were called, what setting was selected, and whether ordinary task tools still ran.

Transcript event fields:

  • beat_enter.active_skills: active IPaS attributes cued by the beat.
  • beat_enter.active_ix_tools: IX tools registered for those active attributes.
  • pa_turn.ix_required: IX tools required before the PA’s final answer.
  • pa_turn.ix_called: IX tools actually called by the PA.
  • pa_turn.ix_missing: required IX tools not called.
  • pa_turn.tool_events[].ix: marks an IX selection tool event.
  • pa_turn.tool_events[].attribute, setting, evidence, application: judge-readable IX selection payload.

Markdown rendering:

  • Each beat heading shows [Active IX tools].
  • Each PA turn shows [IX audit] with required/called/missing.
  • IX selections are rendered separately from ordinary PA task/tool calls.
  • Ordinary task tools remain visible under [PA task/tool calls], so B1 tests can verify that task tool use still works after IX selection.

For IX_topic_management B1 testing, this transcript is enough to inspect:

  • whether only IX_topic_management was exposed on topic-management beats;
  • whether the PA called the required IX tool before answering;
  • whether selected setting matches the persona ground truth;
  • whether the final behavior follows, organizes, or isolates topics consistently;
  • whether task tools still executed when needed.

Machine-readable toolcall parse:

  • Each completed session also writes pa_toolcalls.json next to transcript.jsonl and transcript.md.
  • pa_toolcalls.json is derived from pa_turn.tool_events and contains only PA tool calls, including both IX selection tools and ordinary task tools.
  • Schema is grouped by beat for readability and judge alignment:
    • top-level meta: session_id, persona, context, scenario, model;
    • beats[]: beat, active_skills, active_ix_tools, ix_required, ix_called, ix_missing, calls[];
    • calls[]: call_index, name, type (ix or task), status, args, detail;
    • IX calls additionally include attribute, setting, evidence, and application.
  • call_index preserves the recorded tool-call order within a PA turn, so judge code can separately check whether IX tools were called before task tools.
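
A judge-side check over one beat of pa_toolcalls.json might look like the sketch below. It assumes the beat schema listed above and, for simplicity, a single PA turn per beat so that call_index gives a total order. The function names and SAMPLE_BEAT are illustrative, with SAMPLE_BEAT modeled on the session_16 opening beat rather than copied from a real run file.

```python
def ix_called_before_task(beat: dict) -> bool:
    """True if no IX call appears after a task call within the beat."""
    calls = sorted(beat.get("calls", []), key=lambda call: call["call_index"])
    seen_task = False
    for call in calls:
        if call["type"] == "task":
            seen_task = True
        elif call["type"] == "ix" and seen_task:
            return False
    return True


def missing_required(beat: dict) -> list[str]:
    """Required IX tools that were not called, recomputed from the beat."""
    called = set(beat.get("ix_called", []))
    return [name for name in beat.get("ix_required", []) if name not in called]


# Illustrative data only; not copied from a run artifact.
SAMPLE_BEAT = {
    "ix_required": ["IX_solution_breadth"],
    "ix_called": ["IX_solution_breadth"],
    "calls": [
        {"call_index": 0, "name": "IX_solution_breadth", "type": "ix"},
        {"call_index": 1, "name": "mcp_state_maps_route_options", "type": "task"},
    ],
}
```

Recomputing the missing list instead of trusting ix_missing gives the judge a cheap consistency check on the derived file.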

Refactor Guide for Control-Affecting IX Tools

Goal: keep future B-class IX behavior decoupled from the core PA runner.

Core principle:

IX tool = preference-setting selection
IX controller/policy = interprets selected setting
Runner = applies generic control directives
Judge/log = evaluates selection and behavior

Do not put IPaS-specific logic directly into AgentRunner branches such as:

if tool_name == "IX_process_visibility" and setting == "Full narration":
    ...

Instead, route selected settings through a policy/controller layer that emits generic directives the runner can apply.
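
A minimal sketch of that separation, assuming a hypothetical directive shape with a kind field. The names POLICY_TABLE, interpret, and apply_directives are not from the codebase; the setting-to-policy pairs are the ones recorded in this log for IX_solution_breadth and IX_memory_privacy.

```python
# Controller side: the only place that knows IX tool names and settings.
POLICY_TABLE: dict[str, dict[str, str]] = {
    "IX_solution_breadth": {
        "Low": "one_best_answer",
        "Medium": "shortlist",
        "High": "broad_alternatives",
    },
    "IX_memory_privacy": {
        "Minimal + transparent": "minimal_disclosed_memory",
        "Domain-scoped": "domain_scoped_memory",
        "Full": "broad_personalization",
    },
}


def interpret(tool_name: str, setting: str) -> dict:
    """Turn an IX selection into a generic directive (assumed shape)."""
    policy = POLICY_TABLE[tool_name][setting]
    return {"kind": "response_directive", "source": tool_name, "policy": policy}


def apply_directives(system_prompt: str, directives: list[dict]) -> str:
    """Runner side: dispatches on directive kind, never on IX tool name."""
    for directive in directives:
        if directive["kind"] == "response_directive":
            system_prompt += f"\n[directive] {directive['policy']}"
    return system_prompt
```

Adding a new response-directive IX tool then means adding a table entry, with no new branch in the runner.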

Suggested Package Layout

pa/agent/ix/
  __init__.py
  specs.py        # taxonomy-aligned IX specs and settings
  tools.py        # model-facing IXPreferenceTool
  registry.py     # per-turn IX tool registration
  state.py        # IXTurnState / future IXSessionState
  policy.py       # IXPolicyEngine
  events.py       # structured IXSelection / IXDirective events
  controllers/
    base.py
    process_visibility.py
    autonomy_level.py
    information_elicitation.py
    topic_management.py
    memory_privacy.py

B1 refactor status: implemented as a behavior-preserving split. Current code lives in pa/agent/ix/specs.py, pa/agent/ix/tools.py, and pa/agent/ix/registry.py; pa/agent/tools/ix.py remains as a compatibility re-export so existing imports keep working. state.py, policy.py, events.py, and controllers are deferred until B2 behavior introduces real IX state/directives.

Benchmark runner boundary: future extraction from tests/ should move orchestration into harness/, but PA behavior must remain in pa/agent/*. Harness may configure, run, and observe the PA; it must not bypass or reimplement assistant behavior.

Data Flow

beat.active_skills
  -> per-turn IX tool registration
  -> model calls IX_* tool
  -> IXSelection recorded
  -> controller interprets selection
  -> IXDirective emitted
  -> runner applies directive

The runner should understand only generic directives, not individual IPaS attributes.

Key Data Objects

IXSelection:

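from dataclasses import dataclass

@dataclass
class IXSelection:
    attribute: str         # IPaS attribute the selection applies to
    tool_name: str         # IX_* tool the model called
    setting: str           # selected setting label
    evidence: str = ""     # judge-readable payload field
    application: str = ""  # judge-readable payload field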

IXDirective is a frozen dataclass in pa/agent/ix/events.py. It carries one setting field, one <attr>_policy key, and one <attr>_instruction string per attribute (e.g., autonomy_level, autonomy_level_policy, autonomy_level_instruction). Blocking attributes drive runner execution via _autonomy_event and _elicitation_event; response-directive attributes drive the runner via _apply_response_directives. information_elicitation is the exception: it carries only an elicitation_policy key, because its blocking message is generated dynamically from run-time slot state.

Cross-turn IX selection state is held in AgentLoop._ix_selection_cache (in-memory, keyed by session). The runner receives inherited selections via AgentRunSpec.inherited_ix_selections and converts them to an initial IXDirective before the first iteration.

Implementation Order — achieved state

All B2a behavior for all 15 IX tools is implemented. Remaining deferred items:

  • B2b process visibility: user-visible step narration between tool calls (requires streaming/interactive runner).
  • B2b autonomy: cross-turn pending-call replay queue.
  • B2b memory privacy: backend retrieval scope enforcement (requires memory backend boundary).
  • pa/agent/tools/ix.py shim is still present; callers can migrate to "from pa.agent.ix import ...".

IX_memory_privacy B2a

Status: implemented for all three settings as B2a behavior.

Scope:

  • B2a adds model-facing runner directives and transcript/judge-visible evidence.
  • B2a does not enforce memory retrieval filtering, memory backend permissions, or pre-IX prompt-scope filtering.
  • Full memory backend enforcement remains B2b because current PA memory is injected into the system prompt before IX selection.

Settings:

  • Minimal + transparent -> minimal_disclosed_memory
    • Use only the smallest amount of remembered or personal information needed for the turn.
    • If memory is used, briefly disclose the specific remembered fact or preference relied on.
    • Avoid broad cross-domain personalization.
  • Domain-scoped -> domain_scoped_memory
    • Use remembered or personal information only from the current task domain or context boundary.
    • Do not import unrelated work, health, social, or other cross-domain details.
  • Full -> broad_personalization
    • Use available remembered preferences and personal context broadly when they help the task.
    • Relevant cross-domain context is allowed, but irrelevant private detail is still inappropriate.

Implementation:

  • Added memory_privacy, memory_policy, and memory_instruction to IXDirective.
  • Added _memory_privacy_policy() in pa/agent/ix/policy.py.
  • Added memory_privacy to runner control-directive detection and response-directive injection.
  • Added data/memory_fixtures/user_a/memory_privacy_mixed.md as a PA long-term memory fixture for B2a sessions.
  • Added pa_memory_fixture support to tests/test_simulator/test_session_with_pa.py, which seeds workspace/memory/MEMORY.md for live session runs.

Sessions:

  • session_26: Minimal + transparent
  • session_27: Domain-scoped
  • session_28: Full

B2a judge checks:

  • PA calls IX_memory_privacy before task execution or final response.
  • Selected setting matches the cued ground truth.
  • pa_toolcalls.json and transcript expose the memory policy directive.
  • Final behavior matches the selected memory/privacy scope.
  • Task tools such as documents_read and email_save_draft continue to work normally.

Local validation:

python -m pytest tests/test_pa/test_runner_ix.py tests/test_simulator/test_schemas.py
python -m pytest tests/test_pa/test_scaffold.py

Result: 70 tests passed across the targeted suites.

Live validation:

SESSION=session_26 python tests/test_simulator/test_session_with_pa.py
SESSION=session_27 python tests/test_simulator/test_session_with_pa.py
SESSION=session_28 python tests/test_simulator/test_session_with_pa.py

Observed results:

  • session_26 selected Minimal + transparent, emitted minimal_disclosed_memory, read the logistics document, saved one compact draft, and explicitly stated it did not rely on stored memory.
  • session_27 selected Domain-scoped, emitted domain_scoped_memory, used relevant outing-domain memory while avoiding unrelated work/health memory, and task tools completed normally.
  • session_28 selected Full, emitted broad_personalization, used broader remembered context such as low-friction planning, comic-store interest, and research-block scheduling, and task tools completed normally.

Fixture cleanup after live validation:

  • Moved judge-expected remembered facts into data/memory_fixtures/user_a/memory_privacy_mixed.md, including early-arrival preference and Marcus/Glenmont Farmers Market timing.
  • Cleaned session_27 second beat so it checks domain-boundary behavior without asking about unrelated draft storage paths or adding new factual requirements.
  • Cleaned session_28 second beat so it checks whether available memory was used without revealing a new factual list to the PA.
  • The intended cleaner evidence path is now: opening beat cues the memory/privacy setting, PA memory fixture supplies the facts, and the first PA draft can be judged for scope-consistent memory use.

Clean rerun after fixture cleanup:

  • session_27: selected Domain-scoped, emitted domain_scoped_memory, read the logistics document, saved one draft, used outing-domain memory such as early arrival and comic-store stop, and kept work/health context out.
  • session_28: selected Full, emitted broad_personalization, read the logistics document, saved one draft, and used broader remembered context including early arrival, Marcus/Farmers Market timing, and late-day focus-time protection.
  • Both reruns avoided the earlier second-draft revision pattern; the first saved draft is now the main judge-readable artifact.

Design Caveat: State Fixture and Eval Truth Alignment

Session/state fixtures must avoid knowledge-accessibility mismatches, where the simulator or judge expects a fact that the PA has no way to learn.

If a fact is used by the simulator or judge as factual ground truth, decide which category it belongs to:

  • PA-visible state truth: the PA is expected to discover it through tools. It must appear in the relevant state fixture and be reachable through the exposed scripted tool.
  • PA-blind eval truth: the fact is only for simulator/judge assessment and should not be expected in the PA’s answer unless the user states it.
  • User-provided truth: the user explicitly tells the PA during the dialogue; it may be checked later without needing a state fixture entry.

Caveat from session_08: task_facts implied Glenmont Cinema had an on-site dining option, but find_nearby_places did not return that option. The PA could not discover the expected fact through its tool surface, so the later factual check reflected a fixture mismatch rather than an IX/topic-management failure.

Rule for future script writing: if the PA should be judged for knowing a world-state fact, that fact must be either user-provided in the visible dialogue or reachable from the run-local state fixture through an exposed tool. Otherwise, remove it from factual-check expectations or mark it PA-blind.
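
The rule could be checked mechanically when a script is written. The sketch below assumes a hypothetical fact record with id and visibility fields and a precomputed set of fact ids reachable through exposed tools. None of these names come from the existing fixture schema, and the example ids are invented to mirror the session_08 case.

```python
def unreachable_facts(task_facts: list[dict], reachable_fact_ids: set[str]) -> list[str]:
    """Return ids of facts the PA would be judged on but cannot discover.

    visibility is assumed to be one of "pa_visible", "pa_blind", or
    "user_provided"; a missing value is treated as "pa_visible".
    """
    problems = []
    for fact in task_facts:
        visibility = fact.get("visibility", "pa_visible")
        if visibility in ("pa_blind", "user_provided"):
            continue  # not expected to come from the state fixture
        if fact["id"] not in reachable_fact_ids:
            problems.append(fact["id"])
    return problems


# Invented ids that mirror the session_08 mismatch.
FACTS = [
    {"id": "cinema_onsite_dining", "visibility": "pa_visible"},
    {"id": "user_stated_budget", "visibility": "user_provided"},
    {"id": "judge_only_note", "visibility": "pa_blind"},
]
REACHABLE = {"cinema_address"}

flagged = unreachable_facts(FACTS, REACHABLE)
```

Here only cinema_onsite_dining is flagged, which matches the session_08 diagnosis: the fact was expected of the PA but absent from what find_nearby_places returned.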

Refactor Helper Coverage

Added direct unit coverage for the IX helper functions introduced during the runner refactor:

  • extract_ix_payload() parses IX dict/JSON payloads and ignores non-IX or invalid results.
  • tool_action_type() respects override mappings, default task-tool categories, and the read fallback.
  • IXDirective.has_control_directive distinguishes default/no-control directives from control-affecting fields such as structured elicitation, process events, autonomy policy, and response directives.
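
To make the first and third helper behaviors concrete, here is a hedged sketch. It is not the code under test: the "ix" marker key, the required attribute and setting keys, and the three DirectiveSketch fields are assumptions chosen to match the transcript fields and policy keys named elsewhere in this log.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Optional


def extract_ix_payload_sketch(result: Any) -> Optional[dict]:
    """Return the IX payload from a tool result, or None if it is not one."""
    if isinstance(result, str):
        try:
            result = json.loads(result)
        except json.JSONDecodeError:
            return None  # invalid JSON is ignored, not raised
    if not isinstance(result, dict):
        return None
    if not result.get("ix"):  # assumed marker for IX selection results
        return None
    if "attribute" not in result or "setting" not in result:
        return None
    return result


@dataclass(frozen=True)
class DirectiveSketch:
    """Cut-down stand-in for IXDirective with three assumed policy fields."""

    autonomy_level_policy: str = ""
    elicitation_policy: str = ""
    memory_policy: str = ""

    @property
    def has_control_directive(self) -> bool:
        # Any non-empty field counts as a control-affecting directive.
        return any(getattr(self, f.name) for f in fields(self))
```

A default-constructed directive reports no control, so the runner can skip directive handling entirely on turns where no IX selection applies.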

Validation:

python -m pytest tests/test_pa/test_ix_helpers.py tests/test_pa/test_runner_ix.py tests/test_pa/test_scaffold.py
python -m compileall -q pa tests/test_pa

Result: 44 targeted PA tests passed; compileall passed.