In the quest for a true “Digital Twin,” the bottleneck is not data storage—it is resonance.
Large Language Models (LLMs) are fundamentally trained on a single objective: to minimize surprisal (cross-entropy loss) across a massive corpus of human text. By learning to assign high probability to the actual next token, they minimize perplexity, a measure of how “confused” a model is by new data. This mathematical drive toward the expected is what makes LLMs consistent, coherent, and useful; it allows them to internalize the “average” of human knowledge and linguistic structure.
However, this same drive toward the expected is what makes most AI assistants feel generic and corporate. They are optimized for the statistical mean. To build a Digital Twin, we must invert this objective. Instead of aligning with the “average” human, the agent must align specifically with you.
By calculating the Predictive Gap—the distance between what the agent expects to hear and what the user actually says—we can identify the unique stylistic and conceptual deviations that define an individual. When the agent acts to minimize the surprisal gap between itself and its specific user, it ceases to be a generic tool and begins to mimic the user’s unique cadence, vocabulary, and worldview in real-time.
This post details the implementation of a recursive, entropy-weighted memory system in OpenClaw designed to achieve this resonance.
I. The Theory of the Predictive Gap
The core of this implementation rests on Surprisal Theory. In information theory, surprisal (or self-information) is the negative logarithm of an event’s probability. We apply this to the human-AI interaction loop by monitoring the Predictive Gap:
- Predicted Interaction: When a user’s input aligns with the agent’s internalized “expectations” (based on previous interactions), the surprisal is low.
- The Entropy Gap: When a user introduces specific jargon, unique stylistic flair, or complex new concepts, the “gap” between the agent’s internal probability model and the user’s actual input widens.
The mathematical surprisal $I(x)$ of an event $x$ is defined as:
$$I(x) = -\log_2(P(x))$$
A token the model assigns probability 0.5 carries exactly 1 bit of surprisal, while one assigned probability 0.001 carries nearly 10. By measuring this gap, we identify which moments are truly worth “weighting” in long-term memory. We aren’t just saving text; we are capturing the deviations that define a user’s unique identity.
II. Implementation: How to Reproduce the System
The architecture (Memory System V2) consists of several Python-based layers integrated into the OpenClaw workspace.
1. The Persona Engine (surprisal_persona.py)
To calculate surprisal relative to the user, the agent first needs a baseline of its own “style.” In our current implementation, we use a simplified statistical model (60% Markov chain, 30% unigram, 10% heuristic complexity) to minimize computational overhead and API costs.
However, it is important to note that the most accurate measure of the Predictive Gap would come from the LLM itself. Ideally, one would use the participating model to score the log-probability of every token in the user’s response. Because surprisal is defined by the model’s own failure to predict the next token, using the actual LLM’s probability distribution provides the truest “Mirror” of its internal state.
In a production environment, this would be extremely costly, requiring high-frequency logprob requests for every interaction. Our simplified model serves as a “heuristic proxy”: it captures the essence of these deviations without the exhaustive cost of full-model probability extraction.
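For reference, converting token log-probabilities into a surprisal score is trivial once you have them; the cost lies entirely in obtaining the logprobs. A minimal sketch, assuming a hypothetical `get_user_token_logprobs` helper backed by whatever logprob endpoint your provider exposes (the helper name and the natural-log convention are assumptions, not OpenClaw API):

```python
import math

def mean_surprisal_bits(token_logprobs: list[float]) -> float:
    """Average per-token surprisal in bits, from natural-log probabilities."""
    if not token_logprobs:
        return 0.0
    # I(x) = -log2 P(x); convert nats to bits by dividing by ln(2).
    return -sum(token_logprobs) / (len(token_logprobs) * math.log(2))

# Hypothetical usage:
# logprobs = get_user_token_logprobs(model, user_message)
# gap = mean_surprisal_bits(logprobs)
```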
Mathematical Foundation:
The Markov component calculates the transition probability of token $x_i$ given $x_{i-1}$:
$$P(x_i \mid x_{i-1}) = \frac{\text{count}(x_{i-1}, x_i) + 1}{\text{count}(x_{i-1}) + |V|}$$
where $|V|$ is the vocabulary size (Laplace smoothing).
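A minimal sketch of such a proxy, using the Laplace-smoothed Markov formula above and the 60/30/10 blend from the post (the class and method names, and the word-length complexity stand-in, are illustrative rather than the actual surprisal_persona.py internals):

```python
import math
from collections import Counter, defaultdict

class PersonaModel:
    """Cheap statistical proxy for the agent's own 'style' baseline."""

    def __init__(self):
        self.unigrams = Counter()             # token -> count
        self.bigrams = defaultdict(Counter)   # prev token -> Counter of next tokens
        self.total_tokens = 0

    def train(self, text: str) -> None:
        tokens = text.lower().split()
        self.unigrams.update(tokens)
        self.total_tokens += len(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            self.bigrams[prev][cur] += 1

    def _markov_surprisal(self, prev: str, cur: str) -> float:
        # Laplace-smoothed transition probability, per the formula above.
        v = len(self.unigrams)
        p = (self.bigrams[prev][cur] + 1) / (sum(self.bigrams[prev].values()) + v)
        return -math.log2(p)

    def _unigram_surprisal(self, tok: str) -> float:
        v = len(self.unigrams)
        p = (self.unigrams[tok] + 1) / (self.total_tokens + v)
        return -math.log2(p)

    def surprisal(self, text: str) -> float:
        """Blend: 60% Markov chain, 30% unigram, 10% heuristic complexity."""
        tokens = text.lower().split()
        if len(tokens) < 2 or not self.unigrams:
            return 0.0
        markov = sum(self._markov_surprisal(p, c)
                     for p, c in zip(tokens, tokens[1:])) / (len(tokens) - 1)
        unigram = sum(self._unigram_surprisal(t) for t in tokens) / len(tokens)
        # Stand-in complexity heuristic: mean word length (assumption).
        complexity = sum(len(t) for t in tokens) / len(tokens)
        return 0.6 * markov + 0.3 * unigram + 0.1 * complexity
```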
2. The Prose & Flair Analyzer (test_prose.py, test_flair.py)
To mimic the user, the system must categorize “how” something is said; a sketch of these metrics follows the list.
- Cadence: Measures syllable variance between words to detect rhythm.
- Lexical Density: The ratio of syllables to words to distinguish casual vs. academic tone.
- Flair Categories: Detects if a user is an “Architect” (complex clauses), “Minimalist” (punchy fragments), or “Orator” (rhetorical repetition).
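A hedged sketch of the cadence and density metrics plus a toy flair classifier (the syllable heuristic and all thresholds are crude assumptions, and the “Neutral” fallback is not a category the post defines):

```python
import re
import statistics

VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.IGNORECASE)

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per vowel group, minimum one.
    return max(1, len(VOWEL_GROUPS.findall(word)))

def analyze_prose(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text)
    if len(words) < 2:
        return {"cadence": 0.0, "lexical_density": 0.0}
    syllables = [count_syllables(w) for w in words]
    return {
        "cadence": statistics.variance(syllables),       # rhythm via syllable variance
        "lexical_density": sum(syllables) / len(words),  # casual vs. academic tone
    }

def flair_category(text: str) -> str:
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    openers = [s.split()[0].lower() for s in sentences if s.split()]
    avg_len = sum(len(s.split()) for s in sentences) / max(1, len(sentences))
    if len(openers) >= 3 and len(set(openers)) < len(openers):
        return "Orator"       # repeated sentence openers: rhetorical repetition
    if avg_len > 18:
        return "Architect"    # long, clause-heavy sentences
    if avg_len < 7:
        return "Minimalist"   # punchy fragments
    return "Neutral"
```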
3. The Weighted Recorder (memory_weighted.py)
This script acts as the primary memory tool (a sketch follows the steps below). For every user input, it:
1. Calculates the total weight $W$:
$$W_{\text{total}} = I(\text{text}) + (0.5 \times V_{\text{prose}})$$
2. Writes the entry to a Markdown log with hidden <OC_METADATA /> tags.
3. Injects “Boost Tokens” (IMPORTANT_MEMORY) into the hidden tags. Because OpenClaw uses Hybrid Search (BM25 + Vector), repeating these tokens in metadata makes high-entropy memories “louder” and easier to retrieve.
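A sketch of the recording step under those rules (the boost threshold, the repetition count, and the attribute names on the metadata tag are assumptions):

```python
from datetime import datetime, timezone

BOOST_THRESHOLD = 8.0  # hypothetical cutoff for "high-entropy" entries

def record_memory(text: str, surprisal: float, prose_variance: float,
                  log_path: str = "MEMORY.md") -> float:
    """Append a weighted entry: W_total = I(text) + 0.5 * V_prose."""
    weight = surprisal + 0.5 * prose_variance
    # Repeating the boost token makes BM25 rank high-entropy memories "louder".
    boost = " ".join(["IMPORTANT_MEMORY"] * 3) if weight >= BOOST_THRESHOLD else ""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"- {text}\n")
        log.write(f'  <OC_METADATA weight="{weight:.2f}" ts="{stamp}" boost="{boost}" />\n')
    return weight
```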
4. Real-Time Mimicry (context_filter.py)
During retrieval, a filter strips the metadata and prepends a stylistic cue: [Style: rhythmic, architect | Mimic: prioritize this tone]. This instructs the LLM, in real time, to adopt the style detected in the current user prompt when generating its response.
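A sketch of that retrieval-time filter (the regex and the function signature are illustrative):

```python
import re

OC_METADATA = re.compile(r"\s*<OC_METADATA[^>]*/>")

def filter_context(memory_entry: str, style: str, flair: str) -> str:
    """Strip hidden metadata and prepend the stylistic mimicry cue."""
    clean = OC_METADATA.sub("", memory_entry).strip()
    return f"[Style: {style}, {flair} | Mimic: prioritize this tone] {clean}"

# e.g. filter_context(entry, "rhythmic", "architect")
```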
III. Expanding to Psychometric Mimicry (Big Five)
The next frontier is mapping these metrics to the Big Five (OCEAN) personality traits for deeper alignment (a speculative sketch follows the list):
- Openness: High surprisal in conceptual shifts and abstract metaphors.
- Conscientiousness: High “Architect” flair scores (structured, precise, orderly).
- Extraversion: Measured through high “Vibe” scores and intense punctuation (!!!).
- Agreeableness: Analyzed via the “Orator” patterns and sentiment markers.
- Neuroticism: Detected via high variance in surprisal scores within a single session.
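One speculative shape that mapping could take (every key and combination rule below is a placeholder; no normalization or psychometric validation is implied):

```python
def ocean_estimate(m: dict) -> dict:
    """Rough OCEAN scores from the stylistic metrics above (all hypothetical)."""
    return {
        "openness":          m["conceptual_surprisal"],           # abstract shifts
        "conscientiousness": m["architect_score"],                # structured flair
        "extraversion":      m["vibe_score"] + m["exclamation_rate"],
        "agreeableness":     m["orator_score"] * m["sentiment"],
        "neuroticism":       m["surprisal_variance"],             # in-session variance
    }
```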
IV. Recursive Consolidation
To prevent the system from becoming cluttered, a weekly cron job (memory_reevaluate.py) re-scans the core memory. As the agent’s persona evolves and begins to “expect” the user’s previously surprising inputs, the entropy score for those memories drops. When it falls below a threshold, the memory is demoted from the core MEMORY.md to the daily archives, mimicking the biological process of memory consolidation.
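A sketch of the weekly re-evaluation pass (the threshold and the entry schema are assumptions; `persona` is the PersonaModel proxy sketched in section 1):

```python
DEMOTION_THRESHOLD = 3.0  # hypothetical entropy floor for staying in core memory

def reevaluate(entries: list, persona) -> tuple:
    """Re-score core memories against the evolved persona; demote the stale ones."""
    core, archived = [], []
    for entry in entries:
        score = persona.surprisal(entry["text"])
        if score < DEMOTION_THRESHOLD:
            archived.append(entry)       # demote to the daily archives
        else:
            entry["weight"] = score      # refresh the stored weight
            core.append(entry)
    return core, archived
```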
Technical Note: Implementation requires OpenClaw with Python 3.9+ and a Markdown-based workspace.