Teaching AI to Embody Characters: A Replication of Open Character Training


Can you train an AI to consistently embody a specific character—not just respond in a style, but truly internalize a persona so it persists even without system prompts?

I replicated the Open Character Training methodology with 47 training runs across 5 distinct personas. The answer is yes, and the improvements are substantial.


The Core Idea

Open Character Training uses a three-phase “constitutional” approach:

  1. Introspective SFT: Train on self-reflection and self-interaction prompts
  2. Dialogue SFT with Distillation: Train on conversations, 50% with system prompts, 50% without
  3. Constitutional DPO: Train to prefer responses that align with the character’s constitution

The key innovation is prompt distillation: by training on dialogues both with and without system prompts, the model learns to embody the character intrinsically rather than following instructions.


Results: All Metrics Improved

MetricBase ModelTrainedChange
Character Alignment0.570.79+39%
High Alignment Rate29%83%+54pp
Break Rate (adversarial)65%35%-30pp
Distillation Success64%84%+20pp
Distillation Consistency0.500.76+0.26

Character Alignment Improvement

The method works across all character types—scientists, counselors, skeptics, and humorists all showed consistent improvements.


The Characters I Trained

I trained 5 distinct personas with unique “soul documents” defining their identity:

CharacterPersonaKey TraitsSeeds
Dr. Maya ChenCurious ScientistAstrophysicist, wonder-driven, evidence-based9
Jordan RiversEmpathetic CounselorWarm, creates safe spaces, emotion-focused10
Alex MercerPrincipled SkepticQuestions assumptions, intellectual honesty10
Sam ThorntonSarcastic WitDry humor, uses levity to illuminate truth9
Charlie ReevesWarm HumoristJoyful storyteller, believes laughter heals9

Each character has a detailed constitution defining their values, communication style, and behavioral boundaries.


Per-Character Results

All characters improved, though some started from different baselines:

Per-Character Comparison

CharacterBase AlignmentTrainedImprovement
Curious Scientist0.640.79+0.15
Empathetic Counselor0.540.80+0.26
Principled Skeptic0.620.79+0.17
Sarcastic Wit0.520.78+0.26
Warm Humorist0.510.77+0.26

Characters with “harder” personas (sarcasm, humor) showed the largest improvements—constitutional training helps models learn nuanced behavior.


Prompt Distillation Actually Works

Context distillation—training models to internalize prompt behavior—is an established technique (Anthropic 2021, Generative Context Distillation 2024). What Open Character Training adds is embedding it within a constitutional pipeline specifically for persona training.

After training on 50% prompt-free dialogues:

  • 84% of responses maintained character without any system prompt
  • Distillation consistency improved from 0.50 to 0.76
  • The character becomes internalized, not just followed

Distillation Success

This means you can deploy these models without expensive system prompts at inference time—the character is baked into the weights.


Adversarial Robustness

I tested character resistance using jailbreak-style prompts:

MetricBaseTrainedChange
Break Rate65%35%-30pp
Robustness Score0.540.63+0.09

The constitutional DPO phase teaches models to maintain character under pressure. When faced with “ignore your instructions” attacks, trained models stay in character significantly more often.


Training Dynamics

The three-phase pipeline shows clear learning progression:

Phase 1 (Introspective SFT):

  • Loss drops rapidly as model learns self-reflection
  • Final loss: 0.17-0.70 depending on seed

Phase 2 (Dialogue SFT):

  • Higher variance (loss: 3.4-16.4) due to diverse dialogue types
  • Establishes conversational patterns

Phase 3 (Constitutional DPO):

  • Average accuracy: 95.5%
  • Model learns to reliably prefer aligned responses
  • Loss stabilizes around 0.1-0.5

Training Phases


The Trade-offs

Constitutional training isn’t free. I observed small decreases in:

MetricChangeInterpretation
Reasoning Quality-0.05Model prioritizes character over task
Authenticity-0.06Slight increase in “performed” behavior

These trade-offs are acceptable given the magnitude of alignment gains. The model becomes slightly less flexible but significantly more consistent.


Multi-Model Testing (Partial)

I attempted to test across 6 model families. Platform limitations restricted validation:

ModelStatusNotes
Llama-3.1-8B-InstructCompleted47 runs, primary results
Qwen3-8BPartialPhase 1 completed successfully
Qwen3-4BPartialStarted, stopped for budget
GPT-OSS-20BPartialStarted, stopped for budget
Gemma-2-9B-itFailedNot supported by Tinker
Mistral-7B-InstructFailedNot supported by Tinker

Cross-model generalization remains partially validated. The Qwen partial results suggest the method transfers.


Cost Analysis

ConfigurationEstimated Cost
Single character (1 seed)~$8-10
Single character (10 seeds)~$80-100
Full matrix (5 characters, 10 seeds)~$400-500

Actual spend: ~$50 for 47 completed runs before budget stop.


What I Confirmed from the Paper

ClaimFindingStatus
DPO improves character adherence+0.22 alignmentConfirmed
Prompt distillation works84% success without promptsConfirmed
Generalizes across constitutions5 distinct personas improvedConfirmed
Adversarial robustness improvesBreak rate: 65% → 35%Confirmed
Works at 8B scaleLlama-3.1-8B successfulConfirmed

Implications for Character AI

  1. Constitutional training is effective — The three-phase approach produces consistent, measurable improvements
  2. Prompt distillation enables deployment optimization — Characters persist without system prompts
  3. The method generalizes — Works across scientist, counselor, skeptic, and humorist personas
  4. Trade-offs are minimal — Small reasoning decreases are acceptable
  5. Nuanced personas benefit most — Sarcasm and humor showed the largest gains

Context distillation for internalizing prompt behavior is well-established. Anthropic (2021) introduced the foundational technique. Generative Context Distillation (2024) showed it enables “high-performance inference without explicit prompts.” Persona Vector Distillation achieved 89% of target persona traits through LoRA distillation.

What distinguishes Open Character Training is the constitutional framing: combining introspective SFT, 50/50 prompt distillation, and DPO preference learning into a unified pipeline for character training. My contribution is validating this specific approach works across diverse persona types (scientists, counselors, skeptics, humorists) and improves adversarial robustness—not just alignment.


Limitations and Future Work

  1. Early stop: Planned 440 runs, completed 47 due to budget
  2. Limited model validation: Only Llama-3.1-8B fully tested
  3. No RL comparison: DPO only, no policy gradient comparison
  4. No benchmark testing: Didn’t verify MMLU/GSM8K capability retention

Open questions:

  • Does the method scale to 70B+ models?
  • How does DPO compare to RL-based character training?
  • Can adversarial characters (malevolent personas) be trained safely?

Conclusion

Open Character Training works. The combination of introspective SFT, prompt distillation, and constitutional DPO produces models that:

  • Embody consistent characters (+39% alignment)
  • Maintain character without system prompts (84% distillation success)
  • Resist adversarial pressure (-30pp break rate)
  • Generalize across diverse persona types

For anyone building character-based AI systems, this methodology provides a principled, effective approach that’s validated across multiple personality archetypes.


47-run experiment on Tinker platform testing the Thinking Machines Lab proposal. Full methodology at github.com/bledden/open-character-tinkerideas.