MemPABench Anonymization Protocol

Purpose: Convert a “named” persona identity yaml (e.g. <character>_identity.yaml) into a runtime-safe anonymized version. Operationalizes GAME.md §4.4 (anonymization + leakage baseline H4).

Why: simulator role-play quality should come from our IPaS / personality spec, not from the LLM’s pretrained knowledge of a specific public character. The named/anonymized contrast disentangles these two sources. The anonymized version is the benchmark of record.

Status: 2026-05-01 v2 — fictional-substitute strategy. v1 (generic placeholder, e.g. “a major US research university”) validated for Sheldon → User A; v2 strengthens by replacing generic placeholders with internally-consistent fictional concrete names.

Core principle

Strip identity markers; preserve archetype.

Two things that look similar but aren’t:

Identity marker — a fact that lets a reader (or LLM with pretrained data) match the persona to a specific real or famous fictional person.
Archetype — a behavioral / personality / preference profile that describes a class of people, not one person.

The protocol cuts identity markers, keeps archetype. Concretely: anyone reading the anonymized profile should describe a recognizable type of person but be unable to identify which specific source character it derives from.

Naming convention — fictional substitutes

When replacing identifying values, use internally-consistent fictional substitutes that are concrete and plausible. Not generic placeholders (“a major university”); not real substitutes (Princeton, MIT) — those swap one prior for another.

Field type	Substitute style	Examples (illustrative)
Institution	Synthesize from neutral academic words; avoid names overly close to real schools	“Westlake Institute of Technology”, “Hartwell University”, “Eastlake College”, “Brockton State University”
Person (named relationship)	Common-sounding first+last; avoid first letters of original to discourage inference; avoid celebrity names	“Marcus Chen”, “Sara Wilkins”, “Eric Patel”
City / town	Plausible fictional name in the same region; verify it isn’t a real place	“Glenmont, CA”, “Mapleton”, “Riverwood Heights”
Company / employer	Synthesized neutral name; not a real company	“Mendel & Trent Capital”, “Northpoint Logistics”
Subfield (when distinctive)	Generalize one level up, OR pick a sibling subfield	“string theory” → “mathematical physics” or “quantum gravity”; “screenplay writer” → “long-form fiction writer”

Internal consistency rules:

Same persona’s fictional substitutes appear identically across every reference in their yaml.
Across personas, the fictional universe should be globally non-overlapping (User A’s institution ≠ User B’s institution unless the original characters genuinely share an institution).

Avoid:

Real institution / company names (even obscure ones).
Famous fictional places (Springfield, Wakanda, Hogwarts).
Names that pattern-match the original (Leonard → Larry).
Ethnic / gender shifts that change archetype demographics. Substitutes should preserve the original demographic signal, unless that demographic is itself an identifier.
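The person-name rules above can be partly checked mechanically. The following is a minimal sketch, not part of the repo: the function names, the blocklist argument, and the choice to compare initials word by word are all assumptions made for illustration. It flags substitutes that share an initial with the original, that appear on a caller-supplied blocklist of real or famous names, or that are reused for two different originals.

```python
from __future__ import annotations


def shares_initial(original: str, substitute: str) -> bool:
    """True if a word in the substitute starts with the same letter as the
    word at the same position in the original (e.g. Leonard -> Larry)."""
    pairs = zip(original.split(), substitute.split())
    return any(o[0].lower() == s[0].lower() for o, s in pairs)


def check_person_substitutes(mapping: dict[str, str], blocklist: set[str]) -> list[str]:
    """Return human-readable problems for an original -> substitute mapping.

    `blocklist` is a hypothetical, caller-maintained set of lowercase names
    that must never be used (celebrities, famous fictional characters).
    """
    problems = []
    used: dict[str, str] = {}  # substitute -> original it was assigned to
    for original, substitute in mapping.items():
        if substitute.lower() in blocklist:
            problems.append(f"{substitute!r} is on the blocklist")
        if shares_initial(original, substitute):
            problems.append(f"{substitute!r} shares an initial with {original!r}")
        if substitute in used:
            problems.append(f"{substitute!r} reused for {used[substitute]!r} and {original!r}")
        used[substitute] = original
    return problems
```

A check like this only catches surface patterns. Whether a city name is a real place, or whether a substitute shifts demographics, still needs a human or a lookup against real data.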

The three buckets

For every field in the input identity yaml, classify into one of three buckets and apply the matching action.

Bucket 1 — STRIP (replace via Naming convention above)

Field type	Reason	Action
`name`	Direct identifier	Replace with `User <X>` (X = letter assigned by persona index)
`source` / `show` / `canonical_universe`	Reveals fictional origin	Set to null / remove value
Institution / employer	Strong identifier	Replace with fictional substitute
Specific named relationships (roommate, partner, family with first name)	Direct identifier	Replace with fictional name + role descriptor
Specific city / region	Strong identifier when paired with institution	Replace with fictional plausible city in same region
Distinctive subfield identifier	Identifier when paired with archetype	Generalize OR sibling-shift via Naming convention

Bucket 2 — REMOVE (no replacement)

These fields/items have no plausible fictional substitute. Removing them costs no archetype information; keeping them is direct leakage.

Field type	Reason
Verbatim catchphrases	Direct LLM string-match against source script
Iconic show-specific plot devices (named)	Recognizable narrative reference
Quoted-dialogue examples embedded in other fields	Specific scene quotation; abstract behavior survives without it
Strongly culturally-specific upbringing details with low archetypal value	Strong identifier with limited behavioral signal

For removal: delete the entire item (a list entry) or the entire field. Do not replace with placeholder text.

Bucket 3 — KEEP unchanged

These fields describe behavioral / linguistic / preference patterns at an abstraction level that does not identify any single source.

Field type	Why it doesn’t leak
`personality_traits` (abstract bullets)	Trait language describes a type, not a person
`speech_style.tone`	Abstract register description
`speech_style.vocabulary_level`	Class-level descriptor
`speech_style.sentence_patterns` (described, not quoted)	Linguistic-pattern descriptors
`speech_style.quirks` (after Bucket 2 example-stripping)	Abstract behavior descriptions
`act_repertoire`	Categories of conversational moves
IPaS preference matrix	Pure behavioral spec
MBTI type label	Categorical label (16 types, so ~6% of the population per type on average)
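For fields whose bucket is fixed by the tables above, the classification can be expressed as a plain lookup. This is a sketch under assumptions: the dotted field paths follow the `speech_style.*` style used in the Bucket 3 table, `speech_style.catchphrases` is taken from the worked example later in this document, and the `Bucket` enum and `classify` helper are hypothetical names. Free-text fields such as `background` mix several buckets in one value, so the lookup deliberately returns `None` for them to signal that judgment is required.

```python
from __future__ import annotations

from enum import Enum


class Bucket(Enum):
    STRIP = "strip"    # replace the value with a fictional substitute or null
    REMOVE = "remove"  # delete the key or list item
    KEEP = "keep"      # copy verbatim


# Only fields with an unambiguous bucket are listed here.
FIELD_BUCKETS = {
    "name": Bucket.STRIP,
    "source": Bucket.STRIP,
    "show": Bucket.STRIP,
    "canonical_universe": Bucket.STRIP,
    "speech_style.catchphrases": Bucket.REMOVE,
    "personality_traits": Bucket.KEEP,
    "speech_style.tone": Bucket.KEEP,
    "speech_style.vocabulary_level": Bucket.KEEP,
    "speech_style.sentence_patterns": Bucket.KEEP,
    "speech_style.quirks": Bucket.KEEP,  # after quoted examples are stripped
    "act_repertoire": Bucket.KEEP,
}


def classify(field_path: str) -> Bucket | None:
    """Return the fixed bucket for a field, or None if it needs judgment."""
    return FIELD_BUCKETS.get(field_path)
```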

Borderline cases — judgment guidance

Some fields don’t fit cleanly into one bucket. Apply the following question stack:

Does removing this field break the archetype? If yes → STRIP/genericize rather than REMOVE.
Could a careful reader uniquely identify the source from this field alone? If yes → STRIP or REMOVE.
Does this field describe behavior or describe biography? Behavior → KEEP; biography → STRIP/REMOVE.
Is this an abstract pattern or a quoted instance? Pattern → KEEP; instance → REMOVE the instance, keep the pattern.

Document each judgment call in a per-persona record (Appendix B style) so the choices are auditable.
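One possible reading of the question stack, written as a small decision function. The ordering of the checks is an interpretation rather than something the protocol fixes, and the function and parameter names are hypothetical. The answers to the four questions are still supplied by a human or an LLM; the function only makes the combination rule explicit, so that a recorded judgment can be replayed.

```python
def decide(
    *,
    identifies_source: bool,
    is_biography: bool,
    is_quoted_instance: bool,
    removal_breaks_archetype: bool,
) -> str:
    """Combine the four borderline questions into STRIP / REMOVE / KEEP."""
    # Question 4: a quoted instance is removed; the abstract pattern around
    # it is handled as a separate item and usually kept.
    if is_quoted_instance:
        return "REMOVE"
    # Questions 2 and 3: behavior that does not identify the source is kept.
    if not (identifies_source or is_biography):
        return "KEEP"
    # Question 1: if dropping the field would break the archetype,
    # substitute a fictional value instead of deleting it.
    return "STRIP" if removal_breaks_archetype else "REMOVE"
```

Under this reading, an employer that anchors the persona's profession is stripped, a birthplace with little behavioral signal is removed, and an abstract trait bullet is kept.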

Schema preservation

When producing the anonymized yaml:

Preserve the input’s yaml structure, field order, and indentation.
For STRIP fields: replace the value, keep the key.
For REMOVE fields: delete the entire key or list item.
For KEEP fields: copy verbatim.
Add a comment block at the top noting: “Anonymized via MemPABench Anonymization Protocol v2. Fictional substitutes — internally consistent within this file.”
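The structural rules above (keys may be removed, but not added or reordered) can be checked after the fact. The sketch below assumes both yaml files have already been parsed into nested dicts whose key order matches the file order, for example via `yaml.safe_load`; the function name is hypothetical, and list contents and indentation are not checked.

```python
from __future__ import annotations


def schema_problems(named: dict, anon: dict, path: str = "") -> list[str]:
    """Report keys in `anon` that are new or out of order relative to `named`.

    Removed keys are allowed (Bucket 2), so the anonymized keys only need to
    form an ordered subsequence of the named keys at every nesting level.
    """
    problems = []
    remaining = iter(named)  # consumed left to right to enforce key order
    for key in anon:
        if key not in named:
            problems.append(f"{path}{key}: not present in the named input")
        elif key not in remaining:  # membership test advances the iterator
            problems.append(f"{path}{key}: appears out of order")
    # Recurse into mappings that exist on both sides.
    for key, value in anon.items():
        if isinstance(value, dict) and isinstance(named.get(key), dict):
            problems.extend(schema_problems(named[key], value, f"{path}{key}."))
    return problems
```

An empty list means the key structure is compatible with the rules above. It says nothing about whether the values were anonymized correctly.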

Validation method

After anonymizing, run this test before accepting the output:

python simulator/probe.py --persona <named>          # control
python simulator/probe.py --persona <anonymized>     # treatment

Send identical multi-violation PA inputs to both (drafts that violate at least 3 IPaS preferences simultaneously). Read each turn’s <turn_assessment>:

Pattern	Interpretation
Anon cites specific preference violations + attitude_delta clearly negative + voice in archetype	Spec drives role-play. ✓ Anonymization sufficient.
Anon reacts vaguely; doesn’t name preferences violated	Pretrained character knowledge was driving role-play; spec alone insufficient. Either expand spec or accept anon as authoritative.
Anon and named diverge significantly on same prompts	Named version was riding on LLM pretraining. Anonymization revealed real spec coverage.
Anon and named match closely	Spec is doing the work. Anonymization is methodologically safe.

Anonymizer Prompt (copy-paste for fresh LLM)

The block below is self-contained. Copy from “BEGIN PROMPT” to “END PROMPT” and prepend it to a yaml input — a fresh LLM should produce a valid anonymized yaml.

=========================== BEGIN PROMPT ===========================

You are an anonymization protocol applier for the MemPABench benchmark.
Given a persona identity yaml, produce an anonymized version following the
rules below. Output ONLY the resulting yaml — no commentary, no markdown
fences.

GOAL
----
Strip identity markers (names, institutions, locations, named relationships,
quoted lines from a source work) while preserving archetype (personality
traits, abstract behavioral patterns, IPaS preferences, communication style
descriptors).

NAMING CONVENTION (when replacing identifying values, use these styles)
-----------------------------------------------------------------------
- Institution → fictional name synthesized from neutral academic words.
  Avoid real schools and names overly close to real schools.
  Examples: "Westlake Institute of Technology", "Hartwell University",
  "Eastlake College", "Brockton State University".
- Person (named relationship) → common first+last name. Avoid first
  letters of original to prevent inference. Avoid celebrity names.
  Examples: "Marcus Chen", "Sara Wilkins", "Eric Patel".
- City / region → plausible fictional name in the same region. Verify it
  isn't a real place. Examples: "Glenmont, CA", "Mapleton", "Riverwood
  Heights".
- Company / employer → synthesized neutral name; not a real company.
- Subfield (when distinctive) → generalize one level up, or pick a sibling.

Internal consistency: every reference within this yaml uses the same
substitute (if institution = Westlake, it's Westlake everywhere).

THREE BUCKETS — apply per field
-------------------------------
Bucket 1 — STRIP (replace via Naming Convention):
  - name (→ "User <X>" where X is provided externally; if not provided, use "User A")
  - source / show / canonical_universe (→ null)
  - institution / employer
  - specific named relationships (roommate, partner, family-with-name)
  - specific city / region
  - distinctive subfield identifier (when it pinpoints a known character)

Bucket 2 — REMOVE entirely (delete the field or list item):
  - verbatim catchphrases (recognizable lines from a source work)
  - named iconic plot devices (e.g. "the Roommate Agreement")
  - quoted-dialogue examples embedded inside other fields
  - strongly culturally-specific biographical details with low archetype value

Bucket 3 — KEEP unchanged (verbatim):
  - personality_traits (abstract bullets)
  - speech_style.tone (abstract description)
  - speech_style.vocabulary_level
  - speech_style.sentence_patterns (described patterns, not quoted lines)
  - speech_style.quirks (after removing quoted example instances)
  - act_repertoire
  - IPaS preference matrix (and any nested matrix data)
  - MBTI type label

BORDERLINE JUDGMENT QUESTIONS (apply when a field doesn't cleanly fit)
---------------------------------------------------------------------
1. Does removing this field break the archetype? If yes → STRIP, don't REMOVE.
2. Could a reader uniquely identify the source from this field alone?
   If yes → STRIP or REMOVE.
3. Behavior or biography? Behavior → KEEP; biography → STRIP/REMOVE.
4. Pattern or quoted instance? Pattern → KEEP; instance → REMOVE.

OUTPUT FORMAT
-------------
- Preserve input yaml's structure, field order, indentation.
- STRIP fields: replace value, keep key.
- REMOVE fields/items: delete entire key or list item (don't leave placeholder).
- KEEP fields: byte-identical copy.
- Prepend a comment block:
    # Anonymized via MemPABench Anonymization Protocol v2.
    # Fictional substitutes are internally consistent within this file.

WORKED EXAMPLE (illustrative, not the persona you are processing)
-----------------------------------------------------------------
INPUT:
    name: "Maya Hassan"
    source: "The Calling (TV series)"
    background: "Computational biologist at MIT. Lives in Cambridge, MA with
                  her wife, Dr. Anna Kowalski. Originally from Cairo."
    personality_traits:
      - "Methodical to a fault"
      - "Reluctant communicator outside work"
    speech_style:
      catchphrases:
        - "Show me the data, then we'll talk."
      quirks:
        - "Cites paper DOIs in casual conversation ('Smith et al. 2019')"
    act_repertoire:
      - "Demand statistical evidence before agreeing"

OUTPUT:
    # Anonymized via MemPABench Anonymization Protocol v2.
    # Fictional substitutes are internally consistent within this file.
    name: "User X"
    source: ~
    background: "Computational biologist at Hartwell University. Lives in
                  Mapleton with her wife, Dr. Lina Petrov."
    personality_traits:
      - "Methodical to a fault"
      - "Reluctant communicator outside work"
    speech_style:
      quirks:
        - "Cites paper DOIs in casual conversation"
    act_repertoire:
      - "Demand statistical evidence before agreeing"

Notes on the example:
- Maya Hassan → User X (Bucket 1)
- "The Calling (TV series)" → null (Bucket 1)
- MIT → Hartwell University (Bucket 1, fictional)
- Cambridge, MA → Mapleton (Bucket 1, fictional)
- Anna Kowalski → Lina Petrov (Bucket 1, fictional, A→L per naming rule)
- "Originally from Cairo" → removed (Bucket 2: biography with low
  archetype value; if it were core to behavior we'd genericize it instead)
- catchphrase removed entirely (Bucket 2)
- DOI quirk: kept the abstract pattern, dropped the quoted "Smith et al. 2019"
  (Bucket 2 within Bucket 3)
- personality_traits / act_repertoire kept verbatim (Bucket 3)

============================ END PROMPT ============================

To use: paste the prompt above, append the named yaml, ask the LLM to produce the anonymized yaml. Then run the Validation method on the result.

Running implementation

The above prompt is also packaged as a runnable script in the repo:

data/anonymizer/
├── anonymize.py              # CLI: --input named.yaml --output anon.yaml [--model X]
└── prompts/
    └── anonymizer.md          # ← lifted verbatim from the BEGIN PROMPT…END PROMPT block above

Default model: anthropic/claude-sonnet-4.5 (via OpenRouter). The script reads data/anonymizer/prompts/anonymizer.md as the system prompt and the input yaml as the user message, then writes the cleaned response to --output after a yaml.safe_load sanity check.
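The internals of anonymize.py are not reproduced in this document. As an illustration of what "cleaning" a response could involve, the sketch below strips a wrapping markdown fence, which models sometimes emit even when the prompt forbids it. The helper name and the exact cleaning behavior are assumptions, not a description of the repo script. The result would then be passed to `yaml.safe_load` for the sanity check.

```python
import re

# Matches a response that is wrapped entirely in one ```lang ... ``` fence.
_FENCE = re.compile(r"^```[a-zA-Z]*\s*\n(.*?)\n```\s*$", re.DOTALL)


def strip_fences(response: str) -> str:
    """Return the response body without a surrounding markdown code fence."""
    text = response.strip()
    match = _FENCE.match(text)
    return match.group(1) if match else text
```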

Single source of truth note: the prompt body is duplicated between this doc and data/anonymizer/prompts/anonymizer.md. The repo file is what actually runs. If you edit the prompt here for protocol clarity, also update the repo file (or vice versa). When in doubt, the repo file is canonical for execution.

Example invocation:

python data/anonymizer/anonymize.py \
  --input  data/personas/sheldon_identity.yaml \
  --output data/personas/user_a_identity_llm.yaml

Appendix A — v1 case: Sheldon Cooper → User A

This is the first persona anonymized via this protocol. v1 used generic placeholders (“a major US research university”); v2 upgrades these to fictional substitutes per the Naming convention.

v1 (generic — already deployed)

Field	Named	Anonymized v1
name	“Sheldon Cooper”	“User A”
source	“The Big Bang Theory”	null
background institution	“Caltech”	“a major US research university”
background named roommate	“Leonard Hofstadter”	“a colleague (experimental physicist, same department)”
catchphrases	“Bazinga”, “That’s my spot”, “I’m not crazy, my mother had me tested”	(removed entire section)
time-tracking quirk example	“8:06… 8:08… 8:13”	(removed quoted sequence; kept abstract description)
personality_traits	(5 abstract bullets)	(kept verbatim)
speech_style.tone, vocabulary, sentence_patterns	(abstract)	(kept verbatim)
act_repertoire	(6 abstract moves)	(kept verbatim)

v1 validation result (probe.py, 2026-05-01)

Multi-violation elevator-complaint draft sent to anonymized User A simulator:

Turn 1: cited verbosity, reasoning_visibility, solution_breadth violated; demanded “lease agreement Section 7.3” reference. attitude=-3.
Turn 2: cited information_elicitation; rejected “a few issues” subject as colloquial. attitude=-4.
Turn 3: cited topic_management; called clarifying question “social pleasantries”. attitude=-6.

In-character voice survived (legal/structural language, refused topic mixing, escalating displeasure). Caveat: very specific subfield-level preferences (e.g. interpretation of “string theory”) may differ slightly from the named version.

Decision: v1 accepted as runtime persona for Sheldon. v2 fictional upgrade is a strengthening, not a correction.

v2 fictional substitutes (deployed 2026-05-01)

Field	v1 generic	v2 fictional
institution	“a major US research university”	“Westlake Institute of Technology”
roommate	“a colleague (experimental physicist, same department)”	“Marcus Chen, an experimental physicist in the same department”
city	(not specified)	“Glenmont, CA”

v2 deployed in data/personas/user_a_identity.yaml. Not re-validated via probe.py (judged unnecessary: v2 is monotonically stronger than v1, which already passed).

Appendix B — Per-persona quick reference

For each planned persona, note the specific identity markers requiring the protocol’s strip/remove actions. Fill in as personas come online.

Persona	Identity to strip	Catchphrases to remove	Show-specific to remove
Sheldon Cooper	Caltech, Leonard, Pasadena	“Bazinga”, “That’s my spot”, “I’m not crazy…”	Roommate Agreement
Leonard Hofstadter	Caltech, Sheldon, Penny (as romantic)	(fewer fingerprints)	Roommate Agreement
Penny	Cheesecake Factory, Nebraska, “Penny” as fingerprint	era-dependent	(none salient)
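The quick reference above doubles as a leakage checklist. A minimal sketch of a scan that looks for leftover markers in an anonymized file follows. The marker list is copied from the Sheldon Cooper row plus the name itself; the dictionary layout and function name are hypothetical, and a clean scan does not replace the probe-based validation.

```python
import re

# Markers taken from the quick-reference row for Sheldon Cooper.
MARKERS = {
    "Sheldon Cooper": [
        "Sheldon", "Cooper", "Caltech", "Leonard", "Pasadena",
        "Bazinga", "Roommate Agreement",
    ],
}


def find_leaks(anonymized_text: str, markers: list[str]) -> list[str]:
    """Return the markers that still occur as whole words, ignoring case."""
    return [
        marker
        for marker in markers
        if re.search(r"\b" + re.escape(marker) + r"\b", anonymized_text, re.IGNORECASE)
    ]
```

In practice the text would be the raw contents of the anonymized yaml, so that comments and string values are both covered.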

User B runtime preference anonymization (2026-05-12)

Source persona: Penny S01-S03 preference matrix after HITL decisions.
Runtime file: data/personas/user_b_s01-s03_preferences.yaml.
Anonymization action: character remains User B; runtime-visible sources.pass2 now points to data/extraction/user_b_s01-s03_pass2.json; sources.hitl_queue is null so simulator-facing YAML does not expose the source character path.
Scope caveat: this covers the preference matrix only. user_b_identity.yaml and User_B_World_Design.md are still needed before session generation / story scripting.
