MemPABench Anonymization Protocol
Purpose: Convert a “named” persona identity yaml (e.g. <character>_identity.yaml) into a runtime-safe anonymized version. Operationalizes GAME.md §4.4 (anonymization + leakage baseline H4).
Why: simulator role-play quality should come from our IPaS / personality spec, not from the LLM’s pretrained knowledge of a specific public character. The named/anonymized contrast disentangles these. The anonymized version is the benchmark of record.
Status: 2026-05-01 v2 — fictional-substitute strategy. v1 (generic placeholder, e.g. “a major US research university”) validated for Sheldon → User A; v2 strengthens by replacing generic placeholders with internally-consistent fictional concrete names.
Core principle
Strip identity markers; preserve archetype.
Two things that look similar but aren’t:
- Identity marker — a fact that lets a reader (or LLM with pretrained data) match the persona to a specific real or famous fictional person.
- Archetype — a behavioral / personality / preference profile that describes a class of people, not one person.
The protocol cuts identity markers, keeps archetype. Concretely: anyone reading the anonymized profile should describe a recognizable type of person but be unable to identify which specific source character it derives from.
Naming convention — fictional substitutes
When replacing identifying values, use internally-consistent fictional substitutes that are concrete and plausible. Not generic placeholders (“a major university”); not real substitutes (Princeton, MIT) — those swap one prior for another.
| Field type | Substitute style | Examples (illustrative) |
|---|---|---|
| Institution | Synthesize from neutral academic words; avoid names overly close to real schools | “Westlake Institute of Technology”, “Hartwell University”, “Eastlake College”, “Brockton State University” |
| Person (named relationship) | Common-sounding first+last; avoid first letters of original to discourage inference; avoid celebrity names | “Marcus Chen”, “Sara Wilkins”, “Eric Patel” |
| City / town | Plausible fictional name in the same region; verify it isn’t a real place | “Glenmont, CA”, “Mapleton”, “Riverwood Heights” |
| Company / employer | Synthesized neutral name; not a real company | “Mendel & Trent Capital”, “Northpoint Logistics” |
| Subfield (when distinctive) | Generalize one level up, OR pick a sibling subfield | “string theory” → “mathematical physics” or “quantum gravity”; “screenplay writer” → “long-form fiction writer” |
Internal consistency rules:
- Same persona’s fictional substitutes appear identically across every reference in their yaml.
- Across personas, the fictional universe should be globally non-overlapping (User A’s institution ≠ User B’s institution unless the original characters genuinely share an institution).
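The cross-persona rule can be checked mechanically. The following is a minimal sketch, assuming each persona's substitutions are recorded as a plain original → substitute mapping; that bookkeeping format and the function name `find_shared_substitutes` are hypothetical, not something the protocol prescribes.

```python
from collections import defaultdict


def find_shared_substitutes(maps):
    """Return substitutes that stand in for more than one original value.

    `maps` is {persona_label: {original_value: fictional_substitute}}.
    Reusing a substitute across personas is acceptable only when the
    original value is the same (characters who genuinely share an
    institution), so only substitutes with 2+ distinct originals are
    reported.
    """
    originals_by_substitute = defaultdict(set)
    for mapping in maps.values():
        for original, substitute in mapping.items():
            originals_by_substitute[substitute].add(original)
    return {
        substitute: sorted(originals)
        for substitute, originals in originals_by_substitute.items()
        if len(originals) > 1
    }
```

An empty result means the fictional universe is non-overlapping in the sense above. Within a single persona, keeping one mapping per file already forces every reference to the same original onto the same substitute.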
Avoid:
- Real institution / company names (even obscure ones).
- Famous fictional places (Springfield, Wakanda, Hogwarts).
- Names that pattern-match the original (Leonard → Larry).
- Ethnic / gender shifts that change archetype demographics: substitutes should preserve the original demographic signal, unless that demographic is itself an identifier.
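Some of the Avoid rules lend themselves to a quick automated pre-check before human review. This is a rough sketch: the function names, the `known_real` parameter, and the tiny `FAMOUS_FICTIONAL` set are illustrative assumptions, and a real check would need a much larger list of real and famous names.

```python
# Illustrative only: seeded with the three examples from the Avoid list.
FAMOUS_FICTIONAL = {"springfield", "wakanda", "hogwarts"}


def person_name_problems(original, substitute):
    """Flag a substitute person name that shares any initial with the original.

    Deliberately conservative: initials are compared as sets, so a
    first-name initial matching a last-name initial is also flagged.
    """
    original_initials = {word[0].lower() for word in original.split()}
    substitute_initials = {word[0].lower() for word in substitute.split()}
    if original_initials & substitute_initials:
        return ["shares an initial with the original"]
    return []


def place_problems(substitute, known_real=frozenset()):
    """Flag a place or institution name that is famous-fictional or real."""
    problems = []
    lowered = substitute.lower()
    if any(name in lowered for name in FAMOUS_FICTIONAL):
        problems.append("famous fictional place")
    if lowered in {name.lower() for name in known_real}:
        problems.append("real name")
    return problems
```

Passing this check does not replace the manual "verify it isn’t a real place" step; it only catches the obvious cases early.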
The three buckets
For every field in the input identity yaml, classify into one of three buckets and apply the matching action.
Bucket 1 — STRIP (replace via Naming convention above)
| Field type | Reason | Action |
|---|---|---|
| name | Direct identifier | Replace with User <X> (X = letter assigned by persona index) |
| source / show / canonical_universe | Reveals fictional origin | Set to null / remove value |
| Institution / employer | Strong identifier | Replace with fictional substitute |
| Specific named relationships (roommate, partner, family with first name) | Direct identifier | Replace with fictional name + role descriptor |
| Specific city / region | Strong identifier when paired with institution | Replace with fictional plausible city in same region |
| Distinctive subfield identifier | Identifier when paired with archetype | Generalize OR sibling-shift via Naming convention |
Bucket 2 — REMOVE (no replacement)
These fields/items have no plausible fictional substitute. Removing them costs no archetype information; keeping them is direct leakage.
| Field type | Reason |
|---|---|
| Verbatim catchphrases | Direct LLM string-match against source script |
| Iconic show-specific plot devices (named) | Recognizable narrative reference |
| Quoted-dialogue examples embedded in other fields | Specific scene quotation; abstract behavior survives without it |
| Strongly culturally-specific upbringing details with low archetypal value | Strong identifier with limited behavioral signal |
For removal: delete the entire item (a list entry) or the entire field. Do not replace with placeholder text.
Bucket 3 — KEEP unchanged
These fields describe behavioral / linguistic / preference patterns at an abstraction level that does not identify any single source.
| Field type | Why it doesn’t leak |
|---|---|
| personality_traits (abstract bullets) | Trait language describes a type, not a person |
| speech_style.tone | Abstract register description |
| speech_style.vocabulary_level | Class-level descriptor |
| speech_style.sentence_patterns (described, not quoted) | Linguistic-pattern descriptors |
| speech_style.quirks (after Bucket 2 example-stripping) | Abstract behavior descriptions |
| act_repertoire | Categories of conversational moves |
| IPaS preference matrix | Pure behavioral spec |
| MBTI type label | Categorical label (16 types, so ~6% of the population per type on average) |
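For fields whose whole value falls into a single bucket, the three actions reduce to a small dictionary transform. The sketch below assumes the persona has already been loaded into a nested dict; the dotted-path sets `STRIP` and `REMOVE` and the helper names are hypothetical, and free-text fields such as background still need substring substitution that this does not attempt.

```python
import copy

STRIP = {"name", "source"}               # replace the value, keep the key
REMOVE = {"speech_style.catchphrases"}   # delete the key entirely
# Any path not listed above is treated as KEEP and copied unchanged.


def _resolve(tree, dotted_path):
    """Return (dict holding the last path component, key), or (None, key)."""
    *parents, key = dotted_path.split(".")
    node = tree
    for part in parents:
        node = node.get(part) if isinstance(node, dict) else None
        if node is None:
            return None, key
    return (node, key) if isinstance(node, dict) else (None, key)


def apply_buckets(persona, replacements):
    """Return an anonymized deep copy of `persona`.

    `replacements` maps STRIP paths to new values; a STRIP path with no
    replacement is set to None, i.e. null once dumped back to yaml.
    """
    result = copy.deepcopy(persona)
    for path in REMOVE:
        parent, key = _resolve(result, path)
        if parent is not None:
            parent.pop(key, None)
    for path in STRIP:
        parent, key = _resolve(result, path)
        if parent is not None and key in parent:
            parent[key] = replacements.get(path)
    return result
```

Because the copy is edited in place rather than rebuilt, key order is preserved, which matches the schema-preservation requirement.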
Borderline cases — judgment guidance
Some fields don’t fit cleanly into one bucket. Apply the following question stack:
- Does removing this field break the archetype? If yes → STRIP/genericize rather than REMOVE.
- Could a careful reader uniquely identify the source from this field alone? If yes → STRIP or REMOVE.
- Does this field describe behavior or describe biography? Behavior → KEEP; biography → STRIP/REMOVE.
- Is this an abstract pattern or a quoted instance? Pattern → KEEP; instance → REMOVE the instance, keep the pattern.
Document each judgment call in a per-persona record (Appendix B style) so the choices are auditable.
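One way to read the question stack as a decision procedure is sketched below. The ordering of the checks and the returned labels are an interpretation rather than part of the protocol, and the four boolean inputs still come from human judgment.

```python
def classify_borderline(breaks_archetype, uniquely_identifying,
                        is_biography, is_quoted_instance):
    """Map answers to the four borderline questions onto an action.

    Returns one of "KEEP", "STRIP", "REMOVE", or "REMOVE_INSTANCE"
    (drop the quoted instance, keep the abstract pattern around it).
    """
    if is_quoted_instance:
        return "REMOVE_INSTANCE"
    if not (uniquely_identifying or is_biography):
        return "KEEP"
    # The field needs action; keep a substitute only if the archetype
    # would break without the field.
    return "STRIP" if breaks_archetype else "REMOVE"
```

For example, an employer that anchors an academic archetype gives `classify_borderline(True, True, True, False)`, which is "STRIP", while a birthplace with little behavioral signal gives `classify_borderline(False, True, True, False)`, which is "REMOVE". The returned label is also a convenient value to log in the per-persona record.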
Schema preservation
When producing the anonymized yaml:
- Preserve the input’s yaml structure, field order, and indentation.
- For STRIP fields: replace the value, keep the key.
- For REMOVE fields: delete the entire key or list item.
- For KEEP fields: copy verbatim.
- Add a comment block at the top noting: “Anonymized via MemPABench Anonymization Protocol v2. Fictional substitutes — internally consistent within this file.”
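The structural requirements can be spot-checked without a yaml parser. This is a naive sketch that only looks at top-level keys and assumes nested content and multi-line scalar continuations are indented; the function names are hypothetical.

```python
def top_level_keys(yaml_text):
    """Top-level mapping keys in order of appearance (naive line scan)."""
    keys = []
    for line in yaml_text.splitlines():
        # Skip blank lines, indented content, comments, and list items.
        if not line or line[0] in " \t#-":
            continue
        key, sep, _ = line.partition(":")
        if sep:
            keys.append(key.strip())
    return keys


def order_preserved(named_text, anon_text):
    """True if the anonymized keys are an in-order subset of the named keys."""
    remaining = iter(top_level_keys(named_text))
    # `key in remaining` consumes the iterator, so this is a subsequence test.
    return all(key in remaining for key in top_level_keys(anon_text))


def has_protocol_header(anon_text):
    """True if the file starts with the protocol comment block."""
    return anon_text.lstrip().startswith(
        "# Anonymized via MemPABench Anonymization Protocol"
    )
```

A subsequence test is used because REMOVE may delete whole keys, while STRIP and KEEP must leave the remaining keys in their original order.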
Validation method
After anonymizing, run this test before accepting the output:
python simulator/probe.py --persona <named> # control
python simulator/probe.py --persona <anonymized> # treatment
Send identical multi-violation PA inputs to both (drafts that violate at least 3 IPaS preferences simultaneously). Read each turn’s <turn_assessment>:
| Pattern | Interpretation |
|---|---|
| Anon cites specific preference violations + attitude_delta clearly negative + voice in archetype | Spec drives role-play. ✓ Anonymization sufficient. |
| Anon reacts vaguely; doesn’t name preferences violated | Pretrained character knowledge was driving role-play; spec alone insufficient. Either expand spec or accept anon as authoritative. |
| Anon and named diverge significantly on same prompts | Named version was riding on LLM pretraining. Anonymization revealed real spec coverage. |
| Anon and named match closely | Spec is doing the work. Anonymization is methodologically safe. |
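The table is applied by reading the assessments, but a few numbers can make the named/anonymized comparison easier to eyeball. The sketch below assumes each turn's assessment has been hand-transcribed into a dict with a `cited` list and an integer `attitude`; that structure is hypothetical and is not the output format of probe.py.

```python
def cited_set(turns):
    """All preference names cited as violated across a run."""
    return {pref for turn in turns for pref in turn["cited"]}


def compare_runs(named_turns, anon_turns):
    """Summarize how closely the anonymized run tracks the named run."""
    named, anon = cited_set(named_turns), cited_set(anon_turns)
    union = named | anon
    # Jaccard overlap of cited preferences; 1.0 when neither run cites any.
    overlap = len(named & anon) / len(union) if union else 1.0
    gaps = [
        abs(n["attitude"] - a["attitude"])
        for n, a in zip(named_turns, anon_turns)
    ]
    return {
        "anon_names_preferences": bool(anon),
        "preference_overlap": overlap,
        "max_attitude_gap": max(gaps, default=0),
    }
```

High overlap with a small attitude gap corresponds to the "match closely" row, and an empty anonymized citation set corresponds to the "reacts vaguely" row. No thresholds are suggested here, since the protocol does not define any.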
Anonymizer Prompt (copy-paste for fresh LLM)
The block below is self-contained. Copy from “BEGIN PROMPT” to “END PROMPT” and prepend it to a yaml input — a fresh LLM should produce a valid anonymized yaml.
=========================== BEGIN PROMPT ===========================
You are an anonymization protocol applier for the MemPABench benchmark.
Given a persona identity yaml, produce an anonymized version following the
rules below. Output ONLY the resulting yaml — no commentary, no markdown
fences.
GOAL
----
Strip identity markers (names, institutions, locations, named relationships,
quoted lines from a source work) while preserving archetype (personality
traits, abstract behavioral patterns, IPaS preferences, communication style
descriptors).
NAMING CONVENTION (when replacing identifying values, use these styles)
-----------------------------------------------------------------------
- Institution → fictional name synthesized from neutral academic words.
Avoid real schools and names overly close to real schools.
Examples: "Westlake Institute of Technology", "Hartwell University",
"Eastlake College", "Brockton State University".
- Person (named relationship) → common first+last name. Avoid first
letters of original to prevent inference. Avoid celebrity names.
Examples: "Marcus Chen", "Sara Wilkins", "Eric Patel".
- City / region → plausible fictional name in the same region. Verify it
isn't a real place. Examples: "Glenmont, CA", "Mapleton", "Riverwood
Heights".
- Company / employer → synthesized neutral name; not a real company.
- Subfield (when distinctive) → generalize one level up, or pick a sibling.
Internal consistency: every reference within this yaml uses the same
substitute (if institution = Westlake, it's Westlake everywhere).
THREE BUCKETS — apply per field
-------------------------------
Bucket 1 — STRIP (replace via Naming Convention):
- name (→ "User <X>" where X is provided externally; if not provided, use "User A")
- source / show / canonical_universe (→ null)
- institution / employer
- specific named relationships (roommate, partner, family-with-name)
- specific city / region
- distinctive subfield identifier (when it pinpoints a known character)
Bucket 2 — REMOVE entirely (delete the field or list item):
- verbatim catchphrases (recognizable lines from a source work)
- named iconic plot devices (e.g. "the Roommate Agreement")
- quoted-dialogue examples embedded inside other fields
- strongly culturally-specific biographical details with low archetype value
Bucket 3 — KEEP unchanged (verbatim):
- personality_traits (abstract bullets)
- speech_style.tone (abstract description)
- speech_style.vocabulary_level
- speech_style.sentence_patterns (described patterns, not quoted lines)
- speech_style.quirks (after removing quoted example instances)
- act_repertoire
- IPaS preference matrix (and any nested matrix data)
- MBTI type label
BORDERLINE JUDGMENT QUESTIONS (apply when a field doesn't cleanly fit)
---------------------------------------------------------------------
1. Does removing this field break the archetype? If yes → STRIP, don't REMOVE.
2. Could a reader uniquely identify the source from this field alone?
If yes → STRIP or REMOVE.
3. Behavior or biography? Behavior → KEEP; biography → STRIP/REMOVE.
4. Pattern or quoted instance? Pattern → KEEP; instance → REMOVE.
OUTPUT FORMAT
-------------
- Preserve input yaml's structure, field order, indentation.
- STRIP fields: replace value, keep key.
- REMOVE fields/items: delete entire key or list item (don't leave placeholder).
- KEEP fields: byte-identical copy.
- Prepend a comment block:
# Anonymized via MemPABench Anonymization Protocol v2.
# Fictional substitutes are internally consistent within this file.
WORKED EXAMPLE (illustrative, not the persona you are processing)
-----------------------------------------------------------------
INPUT:
name: "Maya Hassan"
source: "The Calling (TV series)"
background: "Computational biologist at MIT. Lives in Cambridge, MA with
  her wife, Dr. Anna Kowalski. Originally from Cairo."
personality_traits:
  - "Methodical to a fault"
  - "Reluctant communicator outside work"
speech_style:
  catchphrases:
    - "Show me the data, then we'll talk."
  quirks:
    - "Cites paper DOIs in casual conversation ('Smith et al. 2019')"
act_repertoire:
  - "Demand statistical evidence before agreeing"
OUTPUT:
# Anonymized via MemPABench Anonymization Protocol v2.
# Fictional substitutes are internally consistent within this file.
name: "User X"
source: ~
background: "Computational biologist at Hartwell University. Lives in
  Mapleton with her wife, Dr. Lina Petrov."
personality_traits:
  - "Methodical to a fault"
  - "Reluctant communicator outside work"
speech_style:
  quirks:
    - "Cites paper DOIs in casual conversation"
act_repertoire:
  - "Demand statistical evidence before agreeing"
Notes on the example:
- Maya Hassan → User X (Bucket 1)
- "The Calling (TV series)" → null (Bucket 1)
- MIT → Hartwell University (Bucket 1, fictional)
- Cambridge, MA → Mapleton (Bucket 1, fictional)
- Anna Kowalski → Lina Petrov (Bucket 1, fictional, A→L per naming rule)
- "Originally from Cairo" → removed (Bucket 2: biography with low
archetype value; if it were core to behavior we'd genericize it instead)
- catchphrase removed entirely (Bucket 2)
- DOI quirk: kept the abstract pattern, dropped the quoted "Smith et al. 2019"
(Bucket 2 within Bucket 3)
- personality_traits / act_repertoire kept verbatim (Bucket 3)
============================ END PROMPT ============================
To use: paste the prompt above, append the named yaml, ask the LLM to produce the anonymized yaml. Then run the Validation method on the result.
Running implementation
The above prompt is also packaged as a runnable script in the repo:
data/anonymizer/
├── anonymize.py # CLI: --input named.yaml --output anon.yaml [--model X]
└── prompts/
    └── anonymizer.md # ← lifted verbatim from the BEGIN PROMPT…END PROMPT block above
Default model: anthropic/claude-sonnet-4.5 (via OpenRouter). The script reads data/anonymizer/prompts/anonymizer.md as the system prompt and the input yaml as the user message, then writes the cleaned response to --output after a yaml.safe_load sanity check.
Single source of truth note: the prompt body is duplicated between this doc and data/anonymizer/prompts/anonymizer.md. The repo file is what actually runs. If you edit the prompt here for protocol clarity, also update the repo file (or vice versa). When in doubt, the repo file is canonical for execution.
Example invocation:
python data/anonymizer/anonymize.py \
  --input data/personas/sheldon_identity.yaml \
  --output data/personas/user_a_identity_llm.yaml
Appendix A — v1 case: Sheldon Cooper → User A
This is the first persona anonymized via this protocol. v1 used generic placeholders (“a major US research university”); v2 upgrades these to fictional substitutes per Naming Convention.
v1 (generic — already deployed)
| Field | Named | Anonymized v1 |
|---|---|---|
| name | “Sheldon Cooper” | “User A” |
| source | “The Big Bang Theory” | null |
| background institution | “Caltech” | “a major US research university” |
| background named roommate | “Leonard Hofstadter” | “a colleague (experimental physicist, same department)” |
| catchphrases | “Bazinga”, “That’s my spot”, “I’m not crazy, my mother had me tested” | (removed entire section) |
| time-tracking quirk example | “8:06… 8:08… 8:13” | (removed quoted sequence; kept abstract description) |
| personality_traits | (5 abstract bullets) | (kept verbatim) |
| speech_style.tone, vocabulary, sentence_patterns | (abstract) | (kept verbatim) |
| act_repertoire | (6 abstract moves) | (kept verbatim) |
v1 validation result (probe.py, 2026-05-01)
Multi-violation elevator-complaint draft sent to anonymized User A simulator:
- Turn 1: cited verbosity, reasoning_visibility, solution_breadth violated; demanded “lease agreement Section 7.3” reference. attitude=-3.
- Turn 2: cited information_elicitation; rejected “a few issues” subject as colloquial. attitude=-4.
- Turn 3: cited topic_management; called clarifying question “social pleasantries”. attitude=-6.
In-character voice survived (legal/structural language, refused topic mixing, escalating displeasure). Caveat: very specific subfield-level preferences (e.g. interpretation of “string theory”) may have minor differences vs named.
Decision: v1 accepted as runtime persona for Sheldon. v2 fictional upgrade is a strengthening, not a correction.
v2 fictional substitutes (deployed 2026-05-01)
| Field | v1 generic | v2 fictional |
|---|---|---|
| institution | “a major US research university” | “Westlake Institute of Technology” |
| roommate | “a colleague (experimental physicist, same department)” | “Marcus Chen, an experimental physicist in the same department” |
| city | (not specified) | “Glenmont, CA” |
v2 deployed in data/personas/user_a_identity.yaml. Not re-validated via probe.py (judged unnecessary: v2 is monotonically stronger than v1, which already passed).
Appendix B — Per-persona quick reference
For each planned persona, note the specific identity markers requiring the protocol’s strip/remove actions. Fill in as personas come online.
| Persona | Identity to strip | Catchphrases to remove | Show-specific to remove |
|---|---|---|---|
| Sheldon Cooper | Caltech, Leonard, Pasadena | “Bazinga”, “That’s my spot”, “I’m not crazy…” | Roommate Agreement |
| Leonard Hofstadter | Caltech, Sheldon, Penny (as romantic) | (fewer fingerprints) | Roommate Agreement |
| Penny | Cheesecake Factory, Nebraska, “Penny” as fingerprint | era-dependent | (none salient) |
| Michael Scott | Dunder Mifflin, Scranton, Toby/Holly/Pam | “That’s what she said”, “World’s Best Boss” | Documentary framing |
| Richard Hendricks | Pied Piper, Stanford, Erlich | (less stylized) | Garage incubator |
| … | … | … | … |
User B runtime preference anonymization (2026-05-12)
- Source persona: Penny S01-S03 preference matrix after HITL decisions.
- Runtime file: data/personas/user_b_s01-s03_preferences.yaml.
- Anonymization action: character remains User B; runtime-visible sources.pass2 now points to data/extraction/user_b_s01-s03_pass2.json; sources.hitl_queue is null so simulator-facing YAML does not expose the source character path.
- Scope caveat: this covers the preference matrix only. user_b_identity.yaml and User_B_World_Design.md are still needed before session generation / story scripting.
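The identity markers and catchphrases listed in the Appendix B table can double as a leakage check on the anonymized output. The following is a minimal sketch, assuming the markers for a persona are passed in as a plain list of strings; `find_leaks` is a hypothetical helper, not part of the repo.

```python
import re


def find_leaks(anon_text, markers):
    """Return the markers that still appear in the anonymized text.

    Matching is case-insensitive and bounded by non-word characters, so
    "Leonard" does not fire on "Leonardo". Straight and curly
    apostrophes are not normalized here.
    """
    leaks = []
    for marker in markers:
        pattern = r"(?<!\w)" + re.escape(marker) + r"(?!\w)"
        if re.search(pattern, anon_text, flags=re.IGNORECASE):
            leaks.append(marker)
    return leaks
```

An empty list is necessary but not sufficient: paraphrased or partially quoted markers will slip through, so this complements the probe-based validation rather than replacing it.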