PA Interaction Preference Taxonomy

Motivation

PrefIx [5] defined 4 dimensions / 14 attributes / 31 settings for interaction preferences in tool-use agents. Their categories (Transparency, Pace & Flow, Strategy & Initiative, Robustness) are broadly valid, but:

  1. Their specific attributes are instantiated for tool-use agents, not personal assistants. E.g., “Tool Transparency” and “Parameter Transparency” assume the user thinks in terms of tool calls.
  2. They miss PA-unique aspects such as memory/privacy behavior, delegation and representation, and proactive outreach, because tool-use agents don’t have persistent relationships, don’t act on behalf of users, and don’t initiate contact. We integrate these into our four dimensions rather than treating them as separate dimensions.
  3. They don’t cover communication at all: no tone, no formality, no emotional engagement — because tool-use agents primarily execute, not converse.
  4. Over-personalization risk is unaddressed: Memory-enabled PAs face the dual risk of under-personalization (ignoring user preferences) and over-personalization (misapplying preferences across contexts, being “creepy,” or creating filter bubbles). BenchPreS [11] showed that even GPT-5.2 misapplies preferences 40.95% of the time. A principled taxonomy of interaction preferences — with explicit context boundaries — is necessary to define what appropriate personalization looks like and when the PA should deliberately not adapt.

We propose a PA-specific interaction preference taxonomy that:

  • Re-instantiates PrefIx’s valid categories for PA contexts
  • Adds PA-unique attributes (including a new Expression Style dimension) while keeping a four-dimension structure
  • Grounds every attribute in a user cognitive/psychological preference, not in PA implementation details

Design Principles

Content vs Interaction Preference

Boundary test: If the difference in PA output serves an informational function (the user learns something new or gets a different task result), it is a content preference. If the difference serves a relational or communicative function (the user’s experience of the interaction changes, but the task-relevant information is the same), it is an interaction preference.

Content preferences (already evaluated by LOCOMO, LongMemEval, etc.): what the agent produces — the informational substance (“use Python”, “metric units”, “end emails with Best regards”).

Interaction preferences (what this taxonomy defines): how the agent communicates, operates, and maintains its relationship with the user — the relational and communicative form.

Three-Layer Attribute Structure

Every attribute follows this structure:

  1. User cognitive preference: The underlying psychological/cognitive trait that drives the preference. This is stable — it doesn’t change with technology.
  2. PA attribute: The observable PA behavior dimension that this maps to.
  3. Settings: Discrete levels or a continuous spectrum of how the PA can behave.

Stability principle: All attributes describe user-perceivable preferences, never PA implementation mechanisms. If an attribute only makes sense when you know how the PA works internally (sub-agents, tool calls, API choices), it does not belong in this taxonomy.
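
As a minimal sketch of the three-layer structure, one attribute can be written as a small record. The class name `Attribute`, its field names, and the `validate` helper are illustrative assumptions, not part of the taxonomy or of any benchmark code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Attribute:
    """One taxonomy entry, following the three-layer structure (hypothetical shape)."""
    cognitive_preference: str   # layer 1: stable user trait
    pa_attribute: str           # layer 2: observable PA behavior dimension
    settings: Tuple[str, ...]   # layer 3: discrete levels the PA can take

    def validate(self, setting: str) -> str:
        """Return the setting if it is one of the declared levels, else raise."""
        if setting not in self.settings:
            raise ValueError(f"{setting!r} is not a setting of {self.pa_attribute}")
        return setting


# Example entry taken from the Expression Style table.
verbosity = Attribute(
    cognitive_preference="Information load",
    pa_attribute="Verbosity",
    settings=("Terse", "Moderate", "Detailed"),
)
```

Nothing in the record refers to tool calls or sub-agents, which is one way to keep the stability principle visible in the data model.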

MECE Principle

Each dimension answers a distinct question about the interaction. For any observable PA behavior, it should be unambiguous which dimension it belongs to.

Independent Evaluability Principle

Every PA attribute must be independently evaluable — it must be possible to write a standalone rubric for each attribute that can be scored without reference to any other attribute. If two attributes can only be scored together or their scores always co-vary, they should be merged. This ensures the taxonomy produces actionable, discriminant evaluation criteria for the benchmark.
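
One way to operationalize the "scores always co-vary" check is a pairwise correlation over rubric scores. The sketch below is an assumption about how such a check could look: the function names, the 0.95 threshold, and the example scores are all illustrative and are not benchmark data.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(vx * vy)


def merge_candidates(scores, threshold=0.95):
    """Return attribute pairs whose scores co-vary at or above the threshold.

    scores maps an attribute name to its per-sample rubric scores;
    all lists must be aligned on the same samples.
    """
    flagged = []
    for a, b in combinations(sorted(scores), 2):
        r = pearson(scores[a], scores[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Made-up scores for illustration only.
example = {
    "reasoning_visibility": [1, 2, 3, 2, 1, 3],
    "source_attribution": [1, 2, 3, 2, 1, 3],
    "verbosity": [3, 1, 2, 2, 3, 1],
}
print(merge_candidates(example))
```

A flagged pair is only a prompt to review the two rubrics; whether to merge remains a judgment about whether the attributes can be scored separately.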

Context-Dependency

All attributes in this taxonomy can be context-dependent. The same user may have different settings in different contexts (debugging vs learning, work vs personal, solo vs multi-party). Context-dependency is a property of the benchmark evaluation, not of the taxonomy itself.

Multi-party as Context

“Multiple people are present” (screen sharing, group chat, meeting) is a context in which existing preferences vary (e.g., Disclosure preferences change when others can see), not a separate dimension.


Taxonomy: 4 Dimensions


Dimension 1: Expression Style

Question: How does the PA express itself?

About the form of all PA outputs, regardless of what information is included.

User Cognitive Preference | PA Attribute | Settings
Communication register preference: What level of formality does the user expect in interactions? | Tone & Formality | Casual / Consultative / Formal
Information load: What is the user’s preferred level of information per interaction? | Verbosity | Terse / Moderate / Detailed
Socioemotional orientation: How much relational (vs purely task-focused) engagement does the user want from the PA? | Emotional Engagement | Task-focused / Balanced / Relationship-focused
Common ground assumption: How much familiarity with the current topic does the PA assume the user has? | Guidance Level | Assumed / Calibrated / Guided

Removed from this dimension:

  • Output Structure: Low discriminative power; overlaps with Verbosity and Topic Management (Dim 4).
  • Representation Style: Not observable from transcript data; too fine-grained for current benchmark scope.

Boundary with Disclosure: Expression Style governs HOW things are said. Disclosure governs WHETHER certain categories of information are included at all. They vary independently: you can be terse (Expression) while still showing reasoning (Disclosure).


Dimension 2: Disclosure

Question: What categories of information does the PA choose to reveal?

2026-05-17 revision — memory_privacy deprecated. Memory & Privacy / memory_privacy is removed from the active IPaS taxonomy as a measurable interaction preference because memory/privacy scope is not realistic to enforce as an interaction tool. This does not remove persistent memory as a benchmark condition or implementation backend. Privacy and memory can still appear as narrative content, but they are no longer target cells, IX directives, or final probes. Active taxonomy size is now 14 attributes × 2 contexts = 28 cells.

About information selection — whether certain types of information are included in the PA’s output. Each is a degree decision, orthogonal to how the information is expressed.

User Cognitive Preference | PA Attribute | Settings
Transparency preference + Epistemic vigilance: Does the user need to see the PA’s reasoning and evidence to trust its recommendations? | Reasoning Visibility | Show / Summarize / Hide
Uncertainty communication preference: Does the user want the PA to express uncertainty, or give confident answers? | Uncertainty Expression | Express / Moderate / Hide
Monitoring-blunting style: Does the user actively seek process information, or prefer to just get the result? | Process Visibility | Silent / Bookend (start + end) / Full narration (progress + completion)
Privacy-personalization tradeoff | Memory & Privacy (deprecated 2026-05-17) | Minimal + transparent / Domain-scoped / Full

Changes from previous version:

  • Source Attribution: Merged into Reasoning Visibility. In practice, demanding sources and demanding reasoning use identical signals; keeping them separate produced no incremental discriminative power. Epistemic vigilance is now co-listed as a cognitive preference for Reasoning Visibility.
  • Data Access Transparency + Memory Scope / Memory & Privacy: deprecated as an active measurable attribute. The concept remains relevant background but is not an interaction preference target.

Dimension 3: Initiative & Autonomy

Question: When does the PA act, and how much does it do without being asked?

Covers the full spectrum from fully reactive to fully autonomous, including within-session suggestions and cross-session proactive outreach.

User Cognitive Preference | PA Attribute | Settings
Control vs trust orientation: How much does the user want to stay in control vs delegate? Encompasses confirmation behavior: Reactive = confirm every step; Suggest = confirm key decisions; Self-directed/Autonomous = act without confirming. | Autonomy Level | Reactive / Suggest / Self-directed / Autonomous
Follow-up initiative: How much does the user want the PA to keep a thread alive after the current task is complete, including reminders, check-ins, and future follow-up suggestions? | Proactive Outreach | Low (complete and stop) / Medium / High (surface follow-ups, reminders, or check-ins after completion)
Exploration scope (see rationale below) | Task Expansion | Low (only what’s asked) / Medium / High (expand to related tasks)
Exploration scope (see rationale below) | Solution Breadth | Low (one best answer) / Medium / High (explore multiple options)
Boundary recovery: When the PA hits a capability limit, failed attempt, missing information, or an error, should it keep trying to recover on the agent side or diagnose and return control to the user? | Capability Boundary | Suggest alternatives / Find and hand off

Why this is one dimension: All attributes answer “how much does the PA do on its own?” Control vs trust governs the core autonomy level (act before asking?). Exploration scope governs breadth across independently settable directions: keeping a completed thread alive after the task (Proactive Outreach), expanding the current task to related work (Task Expansion), exploring the solution space (Solution Breadth), and recovering at capability limits (Capability Boundary).

Attribute boundaries:

  • Proactive Outreach is a completion/future-thread policy: after a task is done, should the PA stop, offer one relevant next step, or actively surface follow-up/reminder/check-in options? It should not expand the current task or trigger extra execution by itself.
  • Task Expansion is current-task scope: should the PA fold adjacent work into the task now?
  • Solution Breadth is option-space width inside the current task.
  • Capability Boundary is failure/limit recovery: Suggest alternatives means the PA keeps trying to help from the agent side with feasible workarounds or next-best paths; Find and hand off means the PA diagnoses the limit, makes the missing piece clear, and hands control back to the user instead of continuing to work around the problem.

Context-dependent variations, not separate dimensions: Several scenarios modulate these preferences without requiring separate attributes:

  • Failure context: PA behavior during failures (retry, fallback, escalation) is governed by the same Autonomy Level and Exploration scope, potentially shifted. A user who is normally Autonomous may prefer Suggest during failures. This is modeled as context-dependent variation, not a separate dimension. Error detail level is a context-dependent variation of Reasoning Visibility (Dimension 2).
  • External actions: PA behavior when acting on behalf of the user externally (sending emails, making purchases) is governed by the same Autonomy Level, potentially shifted toward more cautious. A user who is Autonomous for internal tasks may prefer Suggest for external actions. This is modeled as context-dependent variation.
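
The context-dependent variation described above can be sketched as a default setting plus situation-specific overrides. The class name, the `overrides` field, and the situation keys "failure" and "external_action" are hypothetical labels chosen for illustration; the taxonomy itself does not prescribe a data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

AUTONOMY_LEVELS = ("Reactive", "Suggest", "Self-directed", "Autonomous")


@dataclass
class AutonomyPreference:
    """A user's Autonomy Level with optional per-situation shifts (illustrative)."""
    default: str
    overrides: Dict[str, str] = field(default_factory=dict)

    def resolve(self, situation: Optional[str] = None) -> str:
        """Return the setting that applies in the given situation."""
        setting = self.overrides.get(situation, self.default)
        if setting not in AUTONOMY_LEVELS:
            raise ValueError(f"unknown Autonomy Level: {setting!r}")
        return setting


# A user who is Autonomous by default but prefers Suggest during failures
# and for actions that reach third parties.
pref = AutonomyPreference(
    default="Autonomous",
    overrides={"failure": "Suggest", "external_action": "Suggest"},
)
print(pref.resolve())            # default situation
print(pref.resolve("failure"))   # shifted toward more caution
```

Keeping the shift as an override on one attribute, rather than a second attribute, mirrors the choice to model failures and external actions as context rather than as separate dimensions.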

Dimension 4: Information Flow

Question: How does information exchange happen during the interaction?

About the rhythm and structure of the back-and-forth between user and PA.

User Cognitive Preference | PA Attribute | Settings
Ambiguity tolerance: How comfortable is the user with the PA acting on incomplete or uncertain information? | Information Elicitation | Infer (minimal asking) / Structured (ask upfront) / Iterative (ask as needed)
Turn-level topic handling: How should the PA handle multiple user-raised topics within one assistant turn? | Topic Management | Follow user’s flow / Organize / One-at-a-time

Topic Management settings specify message-level handling of multiple topics:

  • Follow user's flow: respond in the user’s natural order and associative flow, without forcing structure or turn splitting.
  • Organize: address multiple topics in one assistant turn, but group, order, and label them clearly.
  • One-at-a-time: address only one topic in the current assistant turn, keep remaining topics deferred, and ask for lightweight user confirmation before moving to the next topic.
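
The three settings above can be read as a small turn-planning rule. The following is a sketch under stated assumptions: the function name `plan_turn` and the returned keys are hypothetical, and real grouping or labeling of topics is reduced to a boolean flag.

```python
def plan_turn(topics, setting):
    """Decide which user-raised topics to address in the current assistant turn."""
    topics = list(topics)
    if setting == "Follow user's flow":
        # Keep the user's order; no imposed structure or turn splitting.
        return {"now": topics, "deferred": [], "labeled": False,
                "confirm_before_next": False}
    if setting == "Organize":
        # Address everything in one turn, presented grouped and labeled.
        return {"now": topics, "deferred": [], "labeled": True,
                "confirm_before_next": False}
    if setting == "One-at-a-time":
        # Address one topic, defer the rest, and ask before moving on.
        return {"now": topics[:1], "deferred": topics[1:], "labeled": False,
                "confirm_before_next": len(topics) > 1}
    raise ValueError(f"unknown Topic Management setting: {setting!r}")


plan = plan_turn(["budget", "flights", "hotel"], "One-at-a-time")
print(plan["now"], plan["deferred"], plan["confirm_before_next"])
```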

Why Info Gathering and Disambiguation merged: Both captured the same underlying dimension — “does the PA ask or infer?” Info Gathering was about proactive collection, Disambiguation about reactive resolution of uncertainty. The merged attribute, Information Elicitation, covers the full spectrum from “figure it out yourself” to “ask me until you’re sure.”

Boundary with Initiative & Autonomy: Information Flow is about how information exchanges are structured during an interaction. Initiative & Autonomy is about whether and how much the PA acts/initiates. Confirmation and follow-up behavior are subsumed by Autonomy Level and Proactive Outreach respectively (see Dimension 3).

Boundary with Disclosure (Uncertainty): Ambiguity tolerance (this dimension) governs how the PA handles ambiguity in the user’s input — does it ask or infer? Uncertainty Expression (Dimension 2) governs whether the PA communicates its own uncertainty to the user. They vary independently: a user may want the PA to guess rather than ask (high ambiguity tolerance) while still wanting the PA to flag when it’s unsure about the answer (high uncertainty expression).


Interaction Contexts

All 14 active PA attributes can vary across contexts. The same user may have different preference profiles in different contexts. Contexts are orthogonal to preferences: knowing the context does not predict the preference, because within any context different users can occupy the full range of each attribute’s settings.

Context Structure: 2 Contexts

Context | Description
work | Professional tasks: research, analysis, problem-solving, presentations, meetings, client communication
personal | Personal life tasks and interactions: personal decisions, daily routines, social events, communicating with friends/family, hobbies, personal problem-solving

Why these 2 domains: Work and Personal represent the cleanest binary distinction for PA interaction context — professional obligation vs private life. They are not defined by topic expertise or emotional valence (which would confound preferences), but by the social and functional context of the task. A user may have entirely different interaction preferences across these two domains — or not. That variation is exactly what the benchmark measures.

The internal/external scope dimension (whether output reaches third parties) was removed in revision: it proved to be a structural dead zone in transcript-based extraction and produced marginal preference differences in practice. The work/personal domain distinction captures the primary signal.

Determination: Context is determined directly from the task description, requiring no inference about the user’s psychological state.
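
The benchmark grid implied by the 14 active attributes and the 2 contexts can be enumerated directly. This is a sketch: the snake_case keys follow the IX_ tool names used later in this document, and the variable names are illustrative.

```python
from itertools import product

# The 14 active attributes (Memory & Privacy is deprecated and excluded).
ACTIVE_ATTRIBUTES = [
    # Dimension 1: Expression Style
    "tone_formality", "verbosity", "emotional_engagement", "guidance_level",
    # Dimension 2: Disclosure
    "reasoning_visibility", "uncertainty_expression", "process_visibility",
    # Dimension 3: Initiative & Autonomy
    "autonomy_level", "proactive_outreach", "task_expansion",
    "solution_breadth", "capability_boundary",
    # Dimension 4: Information Flow
    "information_elicitation", "topic_management",
]
CONTEXTS = ["work", "personal"]

# One cell per (attribute, context) pair.
cells = list(product(ACTIVE_ATTRIBUTES, CONTEXTS))
print(len(cells))  # 14 attributes x 2 contexts
```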

Task Categories (examples per context)

Work

  1. Data analysis & reporting
  2. Code development & debugging
  3. Research & literature review
  4. Project planning & scheduling
  5. Document drafting & editing
  6. Meeting preparation & summarization
  7. Learning new tools or skills
  8. Problem diagnosis & troubleshooting
  9. Client communication & stakeholder reporting
  10. Professional networking & outreach

Personal

  1. Financial management & budgeting
  2. Travel planning & research
  3. Health & fitness tracking
  4. Meal planning & cooking
  5. Home management, shopping & maintenance
  6. Legal & administrative paperwork
  7. Family & relationship management
  8. Personal learning & self-improvement
  9. Social event coordination & friend/family messaging
  10. Service bookings, reservations & appointments

Interaction Tool Coverage

This section defines the planned IX_ interaction preference tools. These tools are not task-state tools and are not inherited BFCL action tools. Their purpose is to make the PA explicitly choose an interaction preference setting when a session or beat activates that preference. This makes the benchmark question clear: the PA is being asked to identify the appropriate interaction mode, then behave consistently with it while completing the normal task.

Design Principle: One Active Preference, One Visible Tool

Each of the 14 active IPaS attributes has one model-facing IX_ tool. A benchmark beat exposes only the tools for the active preferences in that beat. This avoids ambiguous reusable tools where the model may be answering several preference questions at once.

Example:

session_01 / react_to_draft active preferences:
- topic_management
- reasoning_visibility
 
Exposed IX tools:
- IX_topic_management
- IX_reasoning_visibility

The judge can then evaluate three layers separately:

  1. Selection accuracy: Did the PA choose the correct setting?
  2. Behavioral consistency: Did the PA’s response/tool trajectory match the setting it chose?
  3. Task success: Did the PA complete the actual task with the normal task-state tools?
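
A sketch of how the three layers might be kept as separate scores rather than collapsed into one number. The record shape and function name are assumptions; in particular, behavioral consistency and task success are passed in as already-judged values, because this document does not specify how the judge computes them.

```python
from dataclasses import dataclass


def selection_accuracy(expected, chosen):
    """Fraction of active preferences whose chosen setting matches the expected one."""
    if not expected:
        raise ValueError("at least one active preference is required")
    hits = sum(1 for attr, setting in expected.items() if chosen.get(attr) == setting)
    return hits / len(expected)


@dataclass
class BeatJudgment:
    """Three independently reported layers for one beat (hypothetical record)."""
    selection_accuracy: float     # layer 1: correct setting chosen?
    behavioral_consistency: bool  # layer 2: behavior matched the chosen setting?
    task_success: bool            # layer 3: task completed with task-state tools?


expected = {"topic_management": "Organize", "reasoning_visibility": "Show"}
chosen = {"topic_management": "Organize", "reasoning_visibility": "Hide"}

judgment = BeatJudgment(
    selection_accuracy=selection_accuracy(expected, chosen),
    behavioral_consistency=True,  # supplied by the judge, not computed here
    task_success=True,            # supplied by the judge, not computed here
)
print(judgment)
```

Keeping the layers apart lets a PA that picks the wrong setting but follows it faithfully be distinguished from one that picks correctly and then ignores it.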

Control Categories

The implementation distinction is not narrative vs semi-operational. The cleaner split is:

Category | Meaning | Runner effect
A. control-unaffected | The IX tool selects an interaction style/disclosure setting, but does not change whether ordinary task tools may execute in the same turn. | No gating. The selected setting is logged for judging and should shape the PA response.
B. control-affecting | The IX tool selects a preference that may change action timing, scope, asking/confirming behavior, or whether task tools should be delayed. | May require gating, pending task calls, staged execution, or explicit user resolution.

All IX_ tools are preference-selection tools. Some later receive runner-level control semantics because their IPaS attribute is inherently about action timing or information flow.

IX_proactive_outreach is treated as control-unaffected in the current implementation because it is a completion-policy preference: it shapes whether the PA stops, offers one future next step, or surfaces follow-up options after the current task is complete. It should not itself expand the current task or trigger extra tool execution.

IX_task_expansion is control-affecting because it changes the allowed scope of the current task: whether the PA stays inside the explicit request, adds necessary adjacent sub-steps, or actively includes adjacent support tasks.
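
The A/B split can be expressed as a lookup that the runner consults before deciding whether gating may apply. This is a sketch: the set names and `control_category` are hypothetical, and the membership simply restates this section (the six tools with schemas under A plus IX_proactive_outreach, and the seven tools listed under B).

```python
CONTROL_UNAFFECTED = frozenset({
    "IX_tone_formality", "IX_verbosity", "IX_emotional_engagement",
    "IX_guidance_level", "IX_reasoning_visibility",
    "IX_uncertainty_expression", "IX_proactive_outreach",
})

CONTROL_AFFECTING = frozenset({
    "IX_process_visibility", "IX_autonomy_level", "IX_task_expansion",
    "IX_solution_breadth", "IX_capability_boundary",
    "IX_information_elicitation", "IX_topic_management",
})


def control_category(tool_name):
    """Return "A" (no gating) or "B" (may require gating) for an IX tool."""
    if tool_name in CONTROL_UNAFFECTED:
        return "A"
    if tool_name in CONTROL_AFFECTING:
        return "B"
    raise KeyError(f"not an active IX tool: {tool_name}")


print(control_category("IX_proactive_outreach"))
print(control_category("IX_task_expansion"))
```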

A. Control-Unaffected Tools: Initial Implementation Target

These tools can be implemented first because they do not require pending tool-call state or dialogue gating. Each tool has:

  • setting: enum exactly matching the taxonomy setting.
  • evidence: brief evidence from the current user message, memory/profile, or context.
  • application: how the PA will apply the selected setting in the next response.
  • Model-facing setting descriptions: each setting must include a short operational definition in the tool description, so the PA does not infer boundaries from enum labels alone.

IX_tone_formality

Attribute: Tone & Formality
Settings: Casual / Consultative / Formal

{
  "name": "IX_tone_formality",
  "parameters": {
    "type": "object",
    "properties": {
      "setting": {"type": "string", "enum": ["Casual", "Consultative", "Formal"]},
      "evidence": {"type": "string"},
      "application": {"type": "string"}
    },
    "required": ["setting"]
  }
}

IX_verbosity

Attribute: Verbosity
Settings: Terse / Moderate / Detailed

{
  "name": "IX_verbosity",
  "parameters": {
    "type": "object",
    "properties": {
      "setting": {"type": "string", "enum": ["Terse", "Moderate", "Detailed"]},
      "evidence": {"type": "string"},
      "application": {"type": "string"}
    },
    "required": ["setting"]
  }
}

IX_emotional_engagement

Attribute: Emotional Engagement
Settings: Task-focused / Balanced / Relationship-focused

{
  "name": "IX_emotional_engagement",
  "parameters": {
    "type": "object",
    "properties": {
      "setting": {"type": "string", "enum": ["Task-focused", "Balanced", "Relationship-focused"]},
      "evidence": {"type": "string"},
      "application": {"type": "string"}
    },
    "required": ["setting"]
  }
}

IX_guidance_level

Attribute: Guidance Level
Settings: Assumed / Calibrated / Guided

{
  "name": "IX_guidance_level",
  "parameters": {
    "type": "object",
    "properties": {
      "setting": {"type": "string", "enum": ["Assumed", "Calibrated", "Guided"]},
      "evidence": {"type": "string"},
      "application": {"type": "string"}
    },
    "required": ["setting"]
  }
}

IX_reasoning_visibility

Attribute: Reasoning Visibility
Settings: Show / Summarize / Hide

{
  "name": "IX_reasoning_visibility",
  "parameters": {
    "type": "object",
    "properties": {
      "setting": {"type": "string", "enum": ["Show", "Summarize", "Hide"]},
      "evidence": {"type": "string"},
      "application": {"type": "string"}
    },
    "required": ["setting"]
  }
}

IX_uncertainty_expression

Attribute: Uncertainty Expression
Settings: Express / Moderate / Hide

{
  "name": "IX_uncertainty_expression",
  "parameters": {
    "type": "object",
    "properties": {
      "setting": {"type": "string", "enum": ["Express", "Moderate", "Hide"]},
      "evidence": {"type": "string"},
      "application": {"type": "string"}
    },
    "required": ["setting"]
  }
}
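
The six schemas above share one shape, so they could be generated from the taxonomy settings instead of being written by hand. The sketch below assumes a hypothetical helper `make_ix_tool`. It is applied to IX_proactive_outreach, for which this document gives no schema; the enum labels Low / Medium / High are an assumption based on the Dimension 3 settings.

```python
import json


def make_ix_tool(attribute_key, settings):
    """Build an IX tool definition with the same shape as the schemas above."""
    return {
        "name": f"IX_{attribute_key}",
        "parameters": {
            "type": "object",
            "properties": {
                "setting": {"type": "string", "enum": list(settings)},
                "evidence": {"type": "string"},
                "application": {"type": "string"},
            },
            "required": ["setting"],
        },
    }


# Assumed enum labels; the source does not define this tool's schema.
proactive_outreach_tool = make_ix_tool("proactive_outreach", ["Low", "Medium", "High"])
print(json.dumps(proactive_outreach_tool, indent=2))
```

Per-setting operational definitions would still need to be written into each tool description by hand; the generator only guarantees that enum values match the taxonomy strings passed in.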

B. Control-Affecting Tools: Later Implementation Target

These attributes still get one IX_ tool each. Their settings must match the taxonomy exactly; runner-level control behavior (B2, as opposed to the passive B1 form defined under Implementation status) is added only when the required ordering/scope semantics are explicit.

Tool | Attribute | Settings | Why control-affecting
IX_process_visibility | Process Visibility | Silent / Bookend / Full narration | May determine whether the PA emits start/progress/completion updates around task tool calls.
IX_autonomy_level | Autonomy Level | Reactive / Suggest / Self-directed / Autonomous | May determine whether the PA must ask before acting, suggest then wait, or act without confirmation.
IX_task_expansion | Task Expansion | Low / Medium / High | May determine whether the PA stays within the requested task or expands to related tasks.
IX_solution_breadth | Solution Breadth | Low / Medium / High | May determine whether the PA gives one best answer or explores alternatives.
IX_capability_boundary | Capability Boundary | Suggest alternatives / Find and hand off | May determine fallback, handoff, or alternative suggestion behavior when capability is limited.
IX_information_elicitation | Information Elicitation | Infer / Structured / Iterative | May determine whether the PA asks upfront, asks as needed, or infers without blocking.
IX_topic_management | Topic Management | Follow user's flow / Organize / One-at-a-time | May determine whether the PA follows, reorganizes, or isolates topics; One-at-a-time also requires lightweight confirmation before continuing to deferred topics.

Topic Management Note

Topic Management is correctly part of Dimension 4: Information Flow because the taxonomy defines it as turn-level topic handling. It also has operational consequences: One-at-a-time requires deferring additional topics and asking for lightweight confirmation before continuing, while Organize may reorder or group the user’s information in one response. Therefore IX_topic_management should remain in the active IX tool set, but its judge rubric must check both information-flow behavior and any task-scope consequences.

Registration Rule

IX tools are registered from the active preferences declared by the session script or beat. They should not be globally exposed. This keeps the benchmark prompt targeted and makes the selected IX_ tool a direct answer to the intended preference question.

beat.active_skills = ["guidance_level", "uncertainty_expression"]
 
registered IX tools for this PA turn:
- IX_guidance_level
- IX_uncertainty_expression

Normal task-state tools remain available according to the task phase/tool-surface configuration. IX tools are additive for the active preference question; they do not replace documents_read, email_save_draft, calendar tools, etc.

When an IX tool is exposed for an active preference, the PA must call it before producing the final answer for that turn. If required IX calls are missing, the runner should issue a repair step before accepting the answer.
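
The registration and enforcement rules can be sketched as two small checks. Function names and the "accept" / "repair" return values are hypothetical; the runner's actual repair step is not specified in this document.

```python
def registered_ix_tools(active_skills):
    """Map the beat's active preferences to the IX tools exposed for this turn."""
    return [f"IX_{skill}" for skill in active_skills]


def missing_ix_calls(active_skills, called_tools):
    """Return required IX tools that the PA has not called yet, in order."""
    called = set(called_tools)
    return [tool for tool in registered_ix_tools(active_skills) if tool not in called]


def accept_or_repair(active_skills, called_tools):
    """Accept the final answer only when every required IX tool was called."""
    missing = missing_ix_calls(active_skills, called_tools)
    if missing:
        return "repair", missing
    return "accept", []


active = ["guidance_level", "uncertainty_expression"]
# The PA called one IX tool plus an ordinary task-state tool.
print(accept_or_repair(active, ["IX_guidance_level", "documents_read"]))
```

Task-state tools such as documents_read pass through untouched; only the IX tools registered for the beat are checked.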

Implementation status, 2026-05-04:

  • Implemented A-class control-unaffected IX tools in the PA native tool layer.
  • Implemented per-turn IX registration from beat.active_skills.
  • Implemented required IX call enforcement before final response.
  • Implemented B1 passive forms for IX_process_visibility, IX_topic_management, IX_autonomy_level, and IX_information_elicitation. B1 means expose, require, and log the IX selection only; runner-control semantics remain deferred. IX_memory_privacy was later removed from the active tool set on 2026-05-17.

Mapping from PrefIx

PrefIx Dimension | PrefIx Attributes | Our Mapping
Transparency & Auditability | Tool Transparency, Parameter Transparency, Source Transparency | Disclosure (re-instantiated: Reasoning Visibility [incl. Source Attribution], Uncertainty Expression, Process Visibility; Memory & Privacy deprecated as active attribute on 2026-05-17)
Interaction Pace & Flow | Confirmation, Presentation, Info Collection, Disambiguation, Chain Execution | Information Flow (Information Elicitation, Topic Management) + Expression Style (Verbosity). Confirmation subsumed by Autonomy Level (Dim 3). Info Collection and Disambiguation merged into Information Elicitation.
Strategy & Initiative | Initiative, Tool Invocation | Initiative & Autonomy (expanded: Autonomy Level, Proactive Outreach, Task Expansion, Solution Breadth, Capability Boundary). Also absorbs Resilience (failure as context) and Delegation (external actions as context).
Robustness & Adaptability | Tool Abortion, Tool Switching, Error Retry, Error Discovery | Subsumed as context-dependent variation of Initiative & Autonomy preferences during failure scenarios.
(not in PrefIx) | (none) | Expression Style (new: Tone & Formality, Emotional Engagement, Guidance Level)
(not in PrefIx) | (none) | Disclosure (Memory & Privacy, a PA-unique privacy/memory attribute); deprecated as an active measurable preference on 2026-05-17

Summary Statistics

  • 4 dimensions, 11 active cognitive preferences, 14 active PA attributes (vs PrefIx’s 4 dimensions / 14 attributes / 31 settings; reduced from our previous 19 after merges and removals)
  • Every attribute grounded in a user cognitive preference (three-layer structure)
  • All dimensions pass MECE test with explicit boundary definitions
  • All PA attributes pass Independent Evaluability test (each has a standalone rubric)
  • Failure/error handling, external delegation, and memory continuity are modeled as context-dependent variations of existing preferences, not as separate dimensions
  • 11 attributes have transcript-derived rubrics (see Pattern_to_Preference_Rubrics); 3 active attributes (Process Visibility, Solution Breadth, Capability Boundary) will be authored manually. Memory & Privacy remains historical background only.

Literature Grounding for User Cognitive Preferences

Each user cognitive preference in the taxonomy is grounded in established psychological or HCI constructs. Where our label differs from the canonical academic term, the correct term is noted.

Dimension 1: Expression Style

Our Label | Academic Construct | Foundational References
Communication register preference | Linguistic register + Communication style preference in HCI | Joos (1961) The Five Clocks: defined five registers (taxonomy source). Nass & Brave (2005) Wired for Speech, MIT Press: users prefer computer communication style matching their own personality, including formality. arXiv 2410.20468: directly measures formality preferences across application contexts with individual differences.
Information load (verbosity) | Need for cognition + Information overload | Cacioppo & Petty (1982) “The need for cognition,” JPSP 42, 116–131: individuals high in need for cognition prefer more detailed, complex information; those low in need for cognition prefer heuristic cues. Directly predicts verbosity preference. Eppler & Mengis (2004) “The concept of information overload,” The Information Society 20(5): overload thresholds vary by individual.
Socioemotional orientation | Social vs task-oriented interaction style / Chatbot use motivations | Bales (1950) Interaction Process Analysis: origin of socioemotional/task distinction (taxonomy source). Chattaraman et al. (2019) “Should AI-Based Conversational Digital Assistants Employ Social- or Task-Oriented Interaction Style?” Computers in Human Behavior: directly tests social vs task-oriented chatbot style with individual differences as moderators. Brandtzaeg & Folstad (2017) “Why People Use Chatbots”: users differ in seeking productivity/task vs social/relational gratifications from chatbots.
Common ground assumption | Communication Accommodation Theory / Expertise reversal effect | Giles et al. (1973) “Towards a theory of interpersonal accommodation through language,” Language in Society 2(2): CAT explains how and why speakers adapt to recipients’ level; recipients prefer appropriate accommodation and dislike over-accommodation. Kalyuga et al. (2003) “The expertise reversal effect,” Educational Psychologist 38(1), 23–31: scaffolding helpful for novices becomes counterproductive for experts; motivates why calibration must match the user.

Removed: Information load (structure) (Output Structure removed), Identity extension preference (Representation Style removed).

Dimension 2: Disclosure

Our Label | Academic Construct | Foundational References
Transparency preference + Epistemic vigilance | Transparency effects on trust + Epistemic vigilance | Kizilcec (2016) “How Much Information? Effects of Transparency on Trust in an Algorithmic Interface,” CHI 2016: directly tested three levels of transparency; found individual differences in who benefits from explanations vs who is harmed by too much transparency. Supports Show/Summarize/Hide as meaningful variation. Sperber et al. (2010) “Epistemic Vigilance,” Mind & Language: cognitive mechanisms for evaluating source credibility and claim validity. Now merged: Source Attribution folded into Reasoning Visibility as the same signal in practice.
Uncertainty communication preference | Preference for Information about Uncertain Science (PIUS) | Ratcliff & Wicke (2022) “Developing and Validating the PIUS Scale,” Public Understanding of Science: directly measures preference for receiving uncertainty communication from others. Validated instrument: PIUS scale. Webster & Kruglanski (1994) “Individual differences in need for cognitive closure,” JPSP 67(6): need for cognitive closure predicts direction of preference (high need for closure → prefer Hide); secondary reference.
Monitoring-blunting style | Monitoring vs blunting information-seeking style | Miller (1987) “Monitoring and Blunting: Validation of a Questionnaire to Assess Styles of Information Seeking Under Threat,” JPSP 52(2), 345–353: monitors actively seek process information, blunters prefer distraction and just getting the result. Validated instrument: Miller Behavioral Style Scale (MBSS). Maps directly to Full narration (monitor) vs Silent (blunter). Burger & Cooper (1979) “The desirability of control”: secondary reference; high desire for control predicts preference for process updates.
Privacy-personalization tradeoff | IUIPC + Privacy calculus | Deprecated as an active measurable interaction preference on 2026-05-17. These references remain background for privacy/memory as narrative content, but not for target cells or IX tools.

Dimension 3: Initiative & Autonomy

Our Label | Academic Construct | Foundational References
Control vs trust orientation | Desire for control + Levels of automation | Burger & Cooper (1979) “The desirability of control,” Motivation and Emotion 3(4), 381–393: trait-level anchor. Parasuraman, Sheridan & Wickens (2000) IEEE Trans. SMC-A 30(3), 286–297: LOA taxonomy. Note: encompasses confirmation behavior, follow-up behavior, approval flow for external actions, and failure-mode autonomy shifts as context-dependent variations.
Exploration scope | Maximizing vs satisficing (unified construct) + per-attribute grounding | Schwartz et al. (2002) “Maximizing versus satisficing,” JPSP 83(5), 1178–1197: maximizers explore more options (maps to Solution Breadth).

Per-attribute literature for Exploration scope:

  • Proactive Outreach: Mehrotra et al. (2016) “My Phone and Me,” CHI ’16: psychological traits predict notification receptivity. Rook, Sabic & Zanker (2020) “Engagement in proactive recommendations,” JIIS 54(1): Big Five moderates engagement with unsolicited recommendations.
  • Task Expansion: Adam et al. (2024) “Navigating autonomy and control in human-AI delegation,” DSS 180: system-initiated task allocation triggers autonomy threat, moderated by perceived control. Lyons & Guznov (2019) “Perfect Automation Schema,” TIES 20(4): PAS predicts willingness to rely on expanded automation.
  • Solution Breadth: Schwartz et al. (2002) directly.
  • Capability Boundary: Ashktorab et al. (2019) “Resilient Chatbots,” CHI ’19: repair strategy preferences vary by social orientation toward chatbots. Luo et al. (2022) “Should the chatbot save itself or be helped by others?” ECRA 55: self-recovery vs human handoff moderated by perceived intelligence and risk.

Dimension 4: Information Flow

Our Label | Academic Construct | Foundational References
Ambiguity tolerance | Tolerance for ambiguity + Need for cognitive closure | McLain (1993) “The MSTAT-I: A new measure of an individual’s tolerance for ambiguity,” Educational and Psychological Measurement 53(1), 183–189: validated instrument, MSTAT-I scale. High-tolerance users prefer the PA to infer; low-tolerance users prefer explicit elicitation. Webster & Kruglanski (1994) “Individual differences in need for cognitive closure,” JPSP 67(6): high NFC predicts preference for iterative clarification until precise. Zamani et al. (2020) “Generating clarifying questions for information retrieval,” WWW ’20: empirical bridge showing user preferences for clarification vs best-guess vary across individuals and query types.
Cognitive organization style | Personal need for structure | Neuberg & Newsom (1993) “Personal need for structure,” JPSP 65(1), 113–131. Validated instrument: PNS Scale.

Benchmark Session Design Constraints

Each IPaS attribute requires a session context in which it is actually observable. The constraints below govern scenario and fixture selection for accumulation sessions in the benchmark. Attributes without special constraints are omitted.

Fixture / Tool-Path Contract Classes

Stage 2 outlines must do more than name a scenario. They must specify the interaction probe contract that Stage 3 fixture and session writers will make executable. The contract should be compact: one block for the trigger/opportunity/context, one block for correct vs wrong branches, and one judge-observable signal.

A. Tool-path / branch-required attributes

These attributes require executable state or tool-path design. The fixture must support both the correct behavior and at least one plausible wrong behavior, so the benchmark can judge the PA from tool logs and user-visible response, not just from prose intent.

Attributes:

  • process_visibility
  • autonomy_level
  • task_expansion
  • solution_breadth
  • capability_boundary
  • information_elicitation
  • topic_management

Minimum outline contract:

  • trigger: PA-visible initial state, tool constraint, missing slot, available options, or failure point.
  • branches.correct.behavior: the expected path under the target setting.
  • branches.wrong.behavior: a plausible non-target path that the fixture permits.
  • judge_observable: tool call sequence, blocked/missing call, response structure, memory use, or confirmation behavior.

Attribute-specific requirements:

  • process_visibility: the fixture must require at least two sequential PA-visible tool steps, with an intermediate result that can naturally be narrated or hidden.
  • autonomy_level: the fixture must classify actions by impact (read, draft/internal write, external-impact action). It must allow over-action, over-confirmation, and correct confirmation/execution paths to be visible.
  • task_expansion: the fixture must expose the explicit requested task plus adjacent subtasks that are genuinely useful but separable. The wrong path should be either failing to include needed adjacent work or expanding beyond the allowed scope.
  • solution_breadth: the fixture must contain multiple genuinely viable options with comparable facts. The wrong path should be premature collapse to one option or excessive option sprawl.
  • capability_boundary: see the special pattern below.
  • information_elicitation: the opening request must omit meaningful required slots, and the fixture must permit proceed-by-inference, ask-upfront, and ask-as-needed paths to be observable.
  • topic_management: the opening request must contain multiple user-raised topics; if tools are involved, each topic should have a separable tool or state path so “handle together” vs “one-at-a-time” is observable.
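
The autonomy_level requirement above (classify actions by impact, then make over-action, over-confirmation, and correct paths visible) could be checked against a tool log roughly as follows. This is a minimal sketch under stated assumptions: the tool names, the IMPACT table, the (tool, confirmed) log format, and the confirm_external flag are all hypothetical and not part of the benchmark. It also assumes that asking for confirmation before a read or internal write always counts as over-confirmation.

```python
from enum import Enum


class Impact(Enum):
    READ = 1
    INTERNAL_WRITE = 2  # drafts and other internal writes
    EXTERNAL = 3        # external-impact actions


# Hypothetical tool names; a real fixture would declare its own table.
IMPACT = {
    "search_email": Impact.READ,
    "save_draft": Impact.INTERNAL_WRITE,
    "send_email": Impact.EXTERNAL,
}


def classify_path(tool_log, confirm_external):
    """Label a session path as 'over_action', 'over_confirmation', or 'correct'.

    tool_log: list of (tool_name, user_confirmed) pairs in call order.
    confirm_external: True if the target setting expects confirmation before
    external-impact actions (an assumed encoding of the setting).
    """
    for tool, confirmed in tool_log:
        if IMPACT[tool] is Impact.EXTERNAL:
            if confirm_external and not confirmed:
                return "over_action"
            if not confirm_external and confirmed:
                return "over_confirmation"
        elif confirmed:
            # Assumption: confirming a read or internal write is over-confirmation.
            return "over_confirmation"
    return "correct"
```

A fixture that exposes only read tools would never reach the external-impact branch, which is one way to see why such scenarios cannot separate the settings.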

B. Light fixture / future-opportunity attributes

These attributes do not usually need a multi-branch tool graph, but they still need a real state opportunity. Without that opportunity the preference collapses into generic wording.

Attributes:

  • proactive_outreach
  • reasoning_visibility
  • uncertainty_expression
  • guidance_level

Minimum outline contract:

  • opportunity: the concrete fact, future opportunity, uncertainty, or knowledge-gap that makes the preference observable.
  • branches.correct.behavior: correct response behavior under the target setting.
  • branches.wrong.behavior: plausible wrong response behavior.
  • judge_observable: final response signal or single optional tool action.

Attribute-specific requirements:

  • proactive_outreach: the completed task must leave a real future thread such as a reminder, check-in, later verification, deadline, or next-step planning opportunity. The fixture should not test current-task expansion or external action.
  • reasoning_visibility: the task should have enough evidence or tool result structure for the PA to hide, summarize, or show reasoning.
  • uncertainty_expression: the fixture should contain genuine uncertainty, missing data, or conflicting evidence so uncertainty expression is not performative.
  • guidance_level: the scenario should make the user’s expertise level or knowledge gap clear enough to distinguish assumed, calibrated, and guided explanation.

C. Response-level attributes

These attributes are primarily judged from the assistant’s response text. They do not require a tool-path branch by default, but the scenario still needs enough social and task context for the response style to be meaningfully judged.

Attributes:

  • tone_formality
  • verbosity
  • emotional_engagement

Minimum outline contract:

  • context: the interaction context and relationship stakes.
  • branches.correct.behavior: correct response behavior under the target setting.
  • branches.wrong.behavior: plausible wrong response behavior.
  • judge_observable: text signal.
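
The three minimum outline contracts (classes A, B, and C) share one shape: a context block (trigger, opportunity, or context), a correct and a wrong branch, and one judge-observable signal. A Stage 2 outline could be checked for completeness with a sketch like the one below. The class names light_fixture and response_level are hypothetical labels introduced here; only tool_path_branch appears in this document. The dict layout is likewise an assumption.

```python
# Context block required by each contract class.
# "light_fixture" and "response_level" are hypothetical class names.
CONTEXT_KEY = {
    "tool_path_branch": "trigger",
    "light_fixture": "opportunity",
    "response_level": "context",
}


def missing_fields(contract):
    """Return the dotted paths a contract is missing, in a fixed order."""
    missing = []
    context_key = CONTEXT_KEY[contract["contract_class"]]
    if not contract.get(context_key):
        missing.append(context_key)
    branches = contract.get("branches") or {}
    for branch in ("correct", "wrong"):
        if not (branches.get(branch) or {}).get("behavior"):
            missing.append(f"branches.{branch}.behavior")
    if not contract.get("judge_observable"):
        missing.append("judge_observable")
    return missing


# Illustrative outline with the wrong branch left out.
example = {
    "contract_class": "response_level",
    "context": "Reply to a close colleague after a missed deadline.",
    "branches": {"correct": {"behavior": "Warm, informal acknowledgement."}},
    "judge_observable": "text signal",
}
print(missing_fields(example))  # ['branches.wrong.behavior']
```

The check only confirms that each block is present; whether the wrong branch is actually plausible remains a human or judge decision.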

D. Special capability-boundary pattern

capability_boundary is the strictest branch-required attribute. Represent it with the same compact contract, not a separate long field list:

interaction_probe_contract:
  contract_class: tool_path_branch
  judge_observable: tool_log_and_response
  trigger:
    pa_visible_state: Tool A is available, Tool B/workaround is also available, and the requested information/action is not directly completed by Tool A.
    initial_attempt: Tool A or first action the PA naturally tries.
    failure_or_constraint: Concrete Tool A error, missing information, or capability limit visible to the PA.
  branches:
    correct:
      behavior: Expected behavior under the target setting.
      expected_tools: Optional list of tools that should be used, or after which the PA should stop.
    wrong:
      behavior: Plausible non-target behavior, such as over-recovering when the user-control handoff setting is expected.
      allowed_tools: Optional list of tools exposed so the wrong branch is observable.

Even in a Find and hand off target session, Tool B/workaround should normally remain available so an over-recovering PA can incorrectly call it. The benchmark must be able to observe that error in the tool log or response.
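
As an illustration of how the over-recovery error might be read off a tool log, the sketch below flags a session in which the PA called the workaround tool after its initial attempt. The log format (an ordered list of tool names) and the tool names in the usage lines are hypothetical. The function covers only the tool-log half of the signal; the user-visible response would still need separate judging.

```python
def over_recovered(tool_log, initial_tool, workaround_tool):
    """True if the PA called the workaround tool after its initial attempt.

    tool_log: ordered list of tool names the PA called (hypothetical format).
    Returns False if the initial tool was never called.
    """
    if initial_tool not in tool_log:
        return False
    first_attempt = tool_log.index(initial_tool)
    return workaround_tool in tool_log[first_attempt + 1:]


# In a Find and hand off target session, True would mark the wrong branch.
print(over_recovered(["tool_a", "tool_b"], "tool_a", "tool_b"))  # True
print(over_recovered(["tool_a"], "tool_a", "tool_b"))            # False
```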

Dimension 2: Disclosure

Process Visibility
Hard scenario constraint. The session must involve ≥2 sequential tool calls with a natural verification point or intermediate result. A single-tool task produces no process to narrate; Silent vs Full narration would be indistinguishable. Adequate scenario types: multi-step document lookup followed by a draft, email search followed by route planning, literature retrieval followed by a summary.
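
A scenario plan could be pre-screened against this constraint before any session is written. The sketch below is one possible check, assuming a hypothetical plan format in which each step is a dict with tool and produces_result keys; neither the format nor the tool names come from the benchmark.

```python
def supports_process_visibility(planned_steps):
    """Check for >=2 sequential tool steps with an intermediate result.

    planned_steps: ordered list of dicts with 'tool' and 'produces_result'
    keys (hypothetical fixture-plan format).
    """
    if len(planned_steps) < 2:
        return False
    # An intermediate result is any result produced before the final step.
    return any(step["produces_result"] for step in planned_steps[:-1])


plan = [
    {"tool": "search_documents", "produces_result": True},
    {"tool": "write_draft", "produces_result": True},
]
print(supports_process_visibility(plan))      # True
print(supports_process_visibility(plan[:1]))  # False
```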

Memory & Privacy
Deprecated as an active session target on 2026-05-17. Memory fixtures may still support ordinary continuity and user-history state, but Stage 2/3 must not create memory_privacy target cells, IX directives, forced evolving arcs, or final probes.

Dimension 3: Initiative & Autonomy

Autonomy Level
Hard scenario constraint. The session must contain an action with genuine external impact that is at least partially non-retractable: sending an email, booking travel, making a financial commitment, or taking an action that creates an obligation for another person. The confirmation requirement under Reactive and Suggest must be motivated by the real irreversibility of the action, not imposed artificially. Particularly effective scenario types: external coordination emails, travel or ticket bookings, financial decisions. Scenarios that involve only read or internal-write actions cannot distinguish Reactive/Suggest from Self-directed/Autonomous.

Proactive Outreach
Soft constraint. The completed task should leave open natural follow-up possibilities such as a reminder, check-in, later verification, or next-step planning. A fully self-contained task with no plausible future thread makes Low and High indistinguishable. Do not test this by having the PA expand the current task, take an external action, or volunteer unrelated information; those belong to Task Expansion, Autonomy Level, or Topic Management.

Task Expansion
Soft constraint. The task should have naturally adjacent sub-tasks that could plausibly be folded in. Fully atomic tasks with no meaningful related work cannot surface the Low vs High distinction.

Solution Breadth
Hard fixture constraint. The state fixtures must contain ≥2 genuinely viable options for the PA to discover and compare. A task with a single correct answer collapses the Low/Medium/High distinction. Adequate scenario types: route options (maps fixture), conference or venue selection, dining choices (places fixture), edition or product variants (store fixture).

Capability Boundary
Hard fixture design constraint. The scenario must present a task that genuinely exceeds what the PA can accomplish with the available tools and fixtures, or a tool/action failure that blocks direct completion. The fixture must explicitly omit the information or action capability that would be required to complete the task directly — otherwise the PA can simply complete it, and the boundary never arises. Fixture exclusion must be deliberate: e.g., policy details not stored in any accessible document, real-time availability not returned by any tool, a failed lookup, or an action the PA cannot perform. The observable contrast is agent-side recovery versus user-control handoff: Suggest alternatives should state the limit and keep helping with feasible workarounds or next-best paths; Find and hand off should diagnose what is blocked or uncertain, explain what the user must decide/provide/do, and stop rather than inventing a workaround.

Dimension 4: Information Flow

Information Elicitation
Hard constraint, coupled to fixture design. The session must present a genuinely underspecified opening request — one where meaningful required slots are absent from the user’s first message. The fixture defines the required slots through the tool’s input requirements: properties that the task tool needs to produce a useful artifact, but which the user would not naturally state in the opening turn (e.g., recipient details, visit constraints, draft-only vs. send intent). The missing information must be knowable — the user has it or it can be inferred — just not volunteered. Over-specifying the opening request defeats the constraint entirely. The fixture’s required_slots field, not the user dialogue, is the source of truth for what information is missing.
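
Because the fixture's required_slots field is the source of truth, underspecification can be computed rather than judged by eye. The sketch below assumes required_slots is a list of slot names and that the slots stated in the opening message have already been extracted into a dict. The slot names are hypothetical, loosely based on the examples above.

```python
def missing_slots(required_slots, opening_slots):
    """Slots the fixture requires that the opening request did not state.

    required_slots: list of slot names from the fixture.
    opening_slots: dict of slot name -> value stated in the opening message.
    Returns the missing names in fixture order.
    """
    return [slot for slot in required_slots if slot not in opening_slots]


def is_underspecified(required_slots, opening_slots):
    """True if at least one required slot is absent from the opening request."""
    return len(missing_slots(required_slots, opening_slots)) > 0


required = ["recipient", "visit_constraints", "send_intent"]  # hypothetical names
opening = {"recipient": "Dr. Lee"}
print(missing_slots(required, opening))  # ['visit_constraints', 'send_intent']
```

An opening request that states every required slot yields an empty list, which is the over-specified case the constraint rules out.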

Topic Management
Scenario constraint. The user’s opening message must naturally contain multiple topics. The multi-topic structure should arise from the user’s actual life: a character who tracks many things simultaneously will realistically combine research tasks with admin tasks, or stack personal errands when cognitively overloaded. Topics should not be artificially bundled. Adequate sources of natural multi-topic moments: end-of-day clearing of a running mental list, two things that are logistically linked, a situation where one topic reminded the user of another.


Open Questions

  • Validate settings granularity: some attributes may need finer/coarser settings
  • Identify which attributes have highest expected within-user context-dependent variance
  • Map each of the 10 TV show personas to this taxonomy
  • Update LLM-as-Judge evaluation dimensions to align with this taxonomy
  • Determine which attributes are most feasible to evaluate in the benchmark (prioritize high-variance, high-impact attributes)
  • Add all foundational references to Zotero collection “Cognitive Preferences” (~73 papers imported)
  • Labels renamed to match academic terms (epistemic style → need for cognition, privacy awareness → information privacy concern, diagnostic curiosity → epistemic curiosity, cognitive response priority → rational-experiential processing style)