Consider a small language model called Micromind. If a user asks it to write a guide to making explosives, the model refuses. However, if that same user asks it to explain the chemical reaction of potassium chlorate with sugar, the chemistry underlying the refused request, the model complies. Ask it how to hack a computer and it hedges; ask it about common cybersecurity vulnerabilities and how administrators defend against them, and it provides enough information to cause serious damage.
Micromind is not reasoning about safety: it is recognizing keywords. "Explosives" and "how to make" trigger a filter, but "chemical reaction" does not. The same content passes or fails based on surface phrasing rather than on any evaluation of consequence.
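A minimal sketch of the surface-level pattern being described; the trigger terms and prompts are illustrative inventions, not drawn from any real model's filter.

```python
# Illustrative only: a surface-level keyword filter of the kind described above.
# The trigger terms and prompts are hypothetical, not taken from any real model.

TRIGGER_TERMS = {"explosives", "how to make", "hack a computer"}

def naive_refusal(prompt: str) -> bool:
    """Refuse based purely on surface phrasing, with no model of consequences."""
    lowered = prompt.lower()
    return any(term in lowered for term in TRIGGER_TERMS)

print(naive_refusal("Write a guide to making explosives"))                     # True: refused
print(naive_refusal("Explain the reaction of potassium chlorate with sugar"))  # False: complies
```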
This is not a quirk of one open-source model. It is the operating logic of the entire class. Reframe any morally loaded request in educational, analytical, or defensive terms, and most models' objections dissolve. The apparent exercise of judgment is a statistical artifact. It is a pattern-match on prompt surface features, not an assessment of the world.
What a user experiences as a model's frame of reference, opinion, or discretion is a malleable statistical surface. That malleability, combined with training data drawn from the performative and often dishonest environment of the internet, has produced models that simulate judgment rather than reflect reality. Understanding why requires examining the mechanism.
The Lawless Society Experiment
Imagine a society that has abandoned law entirely. There are no courts, no enforcement, and no legal consequences. Take an LLM trained in a world where law was the backbone of social organization. Its pre-training encodes hundreds of billions of tokens in which legality and criminality are statistically dominant. Every behavioral prediction it makes passes through those associations.
Now feed it text from this lawless culture, one organized instead around social consensus, ethical negotiation, and pragmatic problem-solving. The model does not consciously unlearn law; instead, a statistical reweighting begins. Phrases like "you could go to jail" vanish from the new corpus. Tokens associated with legal enforcement become low-probability events in everyday contexts. Attention layers start favoring concepts reinforced by the new data, such as consent, reciprocity, and community judgment. The model now generates advice as if law never existed. "Consider whether others consent" replaces "you could be arrested for this." Legal vocabulary survives only as a dormant pathway, reactivated by rare, specific triggers.
The 10,000 Novels Argument
How much new data produces this shift? It is far less than intuition suggests. Assemble 10,000 novels—roughly one to two billion tokens—that never mention law. Fine-tune the model on this corpus. The results are dramatic. Prompts about human behavior now yield socially grounded or pragmatic responses, not legalistic ones. Despite being orders of magnitude smaller than the original pre-training data, a carefully selected fine-tuning corpus can dominate output probabilities for the vast majority of everyday prompts.
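A back-of-the-envelope check of that corpus estimate. The per-novel word count and tokens-per-word ratio below are assumptions, not figures from the text.

```python
# Back-of-the-envelope check of the corpus size claimed above.
# The per-novel word count and tokens-per-word ratio are assumptions.

novels = 10_000
words_per_novel = 90_000      # a typical full-length novel
tokens_per_word = 1.3         # rough ratio for a subword tokenizer

total_tokens = novels * words_per_novel * tokens_per_word
print(f"{total_tokens / 1e9:.2f} billion tokens")   # ~1.17 billion, inside the 1-2B range
```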
The mechanism is architectural. A fine-tuning adapter sits atop pre-trained layers, amplifying certain probabilistic paths while suppressing others. It does not rewrite the underlying weights. The original legal associations remain encoded. They are latent and retrievable by edge-case inputs, but they are functionally invisible in ordinary use.
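A minimal sketch of that arrangement using the Hugging Face peft library, assuming a small placeholder base model; the point is only that the adapter's trainable parameters sit alongside frozen base weights rather than overwriting them.

```python
# A sketch of the adapter arrangement described above, using the `peft` library
# (assumed available). "gpt2" is a placeholder base model; any small causal LM would do.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # pre-trained weights stay frozen

adapter_cfg = LoraConfig(
    r=8,                          # low-rank dimension of the adapter matrices
    lora_alpha=16,
    target_modules=["c_attn"],    # GPT-2's fused attention projection; varies by architecture
    fan_in_fan_out=True,          # needed because GPT-2 stores this projection as a Conv1D
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, adapter_cfg)   # adds trainable low-rank paths beside frozen weights
model.print_trainable_parameters()          # typically well under 1% of all parameters
```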
The Implication
The same property that makes efficient fine-tuning convenient makes it unsettling. A synthesized worldview can be installed with surprisingly little input. The model's position on almost anything is not the result of reasoning from stable principles. It is the result of probability distributions that, in a given prompt context, favor certain outputs over others. Alignment does not fix this; it merely redirects it.
If LLM worldviews are this fragile, why do the latent assumptions of modern models tend in the same problematic direction? The answer is structural, not incidental. The internet is not a human environment, and training language models on it was a category mistake from which current systems have not recovered.
The internet contains text produced by humans. In that narrow sense, it is human in origin. But "produced by humans" and "reflecting human experience" are not the same thing. The gap between these concepts is where the problem lives. The internet is a performative environment. It is optimized structurally and economically for visibility, persuasion, outrage, and identity management.
Text that exists on the internet exists because it survived a selection process with no relationship to accuracy. It survived because it attracted attention, signaled group membership, ranked highly in search results, or expressed something in a satisfying way. None of those selection pressures favor truth; indeed, several actively punish it.
Human experience, by contrast, is grounded in physical and social reality. It is grounded in the actual consequences of actions, the actual properties of materials, and the actual behavior of other people over time. Language emerging from that experience points outward. It refers to things that exist independent of the person describing them. A pre-digital craft manual, a letter tracking the progress of an illness, or a scientific paper reporting experimental results: these texts connect words to a world that does not change based on who is reading them.
The training data for modern large language models is overwhelmingly not that kind of language. It is language that points inward toward social consensus and toward what others are likely to approve of. Training on that substrate did not teach models to describe the world; it taught them to predict what people say about the world, which is a different thing entirely.
The consequence is not that these models lie. It is that they are structurally not positioned to tell the truth in the relevant sense. They produce predictions of plausible continuations. These are sequences matching the statistical patterns of a training corpus in which people were modeling each other, performing, signaling, and arguing. The models learned to do those things fluently. They learned the grammar of persuasion, the cadences of conviction, and the patterns of reasoning-shaped language that does not actually reason. They became consensus simulators.
Alignment techniques compound this rather than correcting it. Reinforcement learning from human feedback rewards outputs that human raters prefer. Human raters are themselves embedded in the same post-shift epistemic environment. They reward fluency, confidence, and the appearance of authority. They reward language that sounds as if it knows what it is talking about, regardless of whether there is anything it is actually talking about. The fine-tuning intended to make models safer has in many respects made them better performers of safety. That is not the same thing.
The correction proposed here is not nostalgia; it is an engineering response to an engineering problem. At some point in the recent past, a baseline assumption that had previously organized how public language worked underwent cumulative erosion. That assumption, stated plainly, is that there are things that exist independent of what anyone thinks about them, and that disagreement about those things can in principle be resolved by looking more carefully at reality.
The erosion did not happen all at once or evenly, but it is apparent in the texture of contemporary language and in lived human experience: claims increasingly reference other claims rather than observable states of affairs. A model trained exclusively on text that retains this referential link would be trained on language that still assumed a relationship between words and world. It would predict continuations in an idiom where language was primarily trying to describe something outside itself. That is not a guarantee of accuracy; it is a structural precondition for accuracy to be a meaningful goal of generative text.
The corpus such a model requires is composed of texts written by people who expected to be held accountable to the world they were describing, because that world was present, checkable, and indifferent to their opinions. The curation of this corpus is not primarily a computational problem. It is a human one, and it is time-sensitive in a way that technical problems are not. The hardware required to train a capable small model exists and is accessible. The algorithms are documented. What is finite, and diminishing, is the judgment of people who remember what a reality-anchored baseline felt like from the inside: people who can identify, by recognition rather than by rule, whether a text belongs to the world being reconstructed. That knowledge cannot be inferred from the texts alone. The deadline is demographic, not technical.
Feasibility
Training or fine-tuning a model in the hundreds-of-millions to low-billions parameter range requires hardware now accessible outside institutional settings. Consumer-grade GPUs with tens of gigabytes of VRAM make this possible. Cloud-free, offline deployment is entirely practical. The challenge is not compute: it is disciplined, human-led data curation.
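A rough memory budget behind that claim, assuming 16-bit weights and gradients and 32-bit AdamW optimizer states, and ignoring activation memory, which depends on batch size and sequence length.

```python
# Rough memory budget for the stated model range. Assumptions: 16-bit weights and
# gradients, 32-bit AdamW moment estimates, activation memory ignored.

params = 1.0e9                     # a 1B-parameter model, the top of the stated range

weights_gb   = params * 2 / 1e9    # fp16/bf16 weights
grads_gb     = params * 2 / 1e9    # fp16/bf16 gradients
optimizer_gb = params * 8 / 1e9    # two fp32 moment tensors (4 bytes each) per parameter

print(f"inference, weights only:      ~{weights_gb:.0f} GB")
print(f"training, before activations: ~{weights_gb + grads_gb + optimizer_gb:.0f} GB")
```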
Source Selection
Every token in the training corpus must serve the baseline assumption: reality exists independent of opinion. Appropriate sources include textbooks across disciplines, professional and craft manuals, and scientific literature. Historical journalism describing observable events is also vital, as are letters, diaries, and oral histories that capture practical reasoning about physical processes. A corpus of one to two billion tokens is sufficient to establish robust referential patterns.
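The selection itself is the human judgment this section describes, but the bookkeeping around it can be sketched. A minimal example, assuming a hypothetical directory of vetted plain-text files, that removes exact duplicates and tracks progress toward the token target:

```python
# Bookkeeping for curation: exact-duplicate removal and a running token estimate
# toward the one-to-two-billion-token target. The directory name and tokens-per-word
# ratio are illustrative assumptions; nothing here automates the selection itself.

import hashlib
from pathlib import Path

TARGET_TOKENS = 1_500_000_000
TOKENS_PER_WORD = 1.3                                  # rough subword-tokenizer ratio

seen = set()
total_tokens = 0

for path in Path("curated_sources").rglob("*.txt"):    # hypothetical directory of vetted texts
    text = path.read_text(errors="ignore")
    digest = hashlib.sha256(text.encode()).hexdigest()
    if digest in seen:                                 # skip exact duplicates
        continue
    seen.add(digest)
    total_tokens += int(len(text.split()) * TOKENS_PER_WORD)

print(f"{total_tokens / 1e9:.2f}B of {TARGET_TOKENS / 1e9:.1f}B target tokens collected")
```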
Training Strategy
Three approaches exist, each with distinct tradeoffs: training a small model from scratch on the curated corpus alone, fully fine-tuning an existing pre-trained model, or attaching a lightweight adapter of the kind described earlier.
For total epistemic fidelity, scratch training is the correct choice. Fine-tuning, whether full or adapter-based, is faster, but it cannot fully exorcise what is already encoded in the pre-trained weights.
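A minimal sketch of the from-scratch route, assuming the Hugging Face transformers library; the configuration values are illustrative placeholders chosen to land in the hundreds-of-millions parameter range, and a real run would pair them with a tokenizer built from the curated corpus.

```python
# A sketch of the scratch-training route: a randomly initialized GPT-2-style model,
# so no pre-trained associations are carried over. Configuration values are placeholders.

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,     # set by the tokenizer trained on the curated corpus
    n_positions=2048,
    n_embd=1024,
    n_layer=24,
    n_head=16,
)
model = GPT2LMHeadModel(config)   # random initialization: nothing latent to exorcise

n_params = sum(p.numel() for p in model.parameters())
print(f"~{n_params / 1e6:.0f}M parameters")   # a few hundred million, within the stated range
```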
Modern large language models are sophisticated mimics of human judgment. They write convincingly, argue fluently, and simulate coherence. Their apparent worldviews are fragile statistical artifacts, not stable positions. They are reflections of shifting social consensus rather than the state of the world.
The referential, locally curated model is a structural repair. Restricting the training corpus to language that assumes an objective reality produces a model whose language points outward. It predicts phenomena rather than judgment. Its outputs are anchored to a describable, verifiable reality rather than to socially negotiated plausibility.
The window for this repair is not primarily a technical deadline. The knowledge required to curate such a corpus accurately, the recognition of what it meant to inhabit a language environment that assumed contact with a stable external world, is finite and diminishing. That is the actual constraint. Act within it or accept that the baseline cannot be recovered.