
Distillation


Anthropic has alleged that multiple Chinese AI labs, including DeepSeek, Moonshot AI, and MiniMax, attempted to extract behavior from its model Claude at scale. Reported activity included the creation of large numbers of synthetic accounts and the use of proxy infrastructure to obscure request origin, resulting in millions of model interactions.

The underlying technique is known as distillation.

In its standard formulation, distillation refers to training a "student" model to approximate the behavior of a larger "teacher" model by learning from the teacher's input–output mappings. Instead of relying solely on large-scale pretraining from raw data, the student model is optimized on outputs generated by the teacher, effectively compressing the teacher's behavior into a smaller system.

In simplified terms, this can be expressed as:

f_student(x) ≈ f_teacher(x)

where the student is trained to reproduce the teacher's conditional outputs across a distribution of prompts.
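One common way to make the approximation above concrete is to measure how far the student's output distribution is from the teacher's. The sketch below is illustrative only: it assumes access to the teacher's raw scores (logits), uses a KL divergence over temperature-softened distributions, and the function names and the temperature value are hypothetical choices, not details from the reported case. When only sampled text is available, as with a hosted model, logits are typically not exposed and the student would instead be trained directly on the teacher's responses.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    The temperature of 2.0 is an assumed, illustrative default.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over three possible outputs (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

The loss is zero when the two distributions match and grows as the student diverges, which is the sense in which training pushes f_student(x) toward f_teacher(x).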

The motivation is efficiency. Building frontier models from scratch requires extensive data collection, filtering, and compute-intensive training. Distillation provides an alternative pathway: query a high-capability model, collect responses, and use them as training data to bootstrap a competing system at lower cost.
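The data side of that pathway can be sketched in a few lines. This is a toy illustration under stated assumptions: `toy_teacher` is a hypothetical stand-in function rather than a real model or API, and the JSON Lines layout with `prompt` and `response` fields is just one plausible shape for supervised training records.

```python
import json

def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a teacher model (not a real model call)."""
    return prompt.upper()

def collect_pairs(prompts, teacher):
    """Query the teacher once per prompt and keep prompt/response records."""
    return [{"prompt": p, "response": teacher(p)} for p in prompts]

def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r) for r in records)

prompts = ["summarize this paragraph", "explain recursion"]
dataset = collect_pairs(prompts, toy_teacher)
jsonl = to_jsonl(dataset)
print(jsonl)
```

The student never sees the teacher's original training data here, only the collected pairs, which is why the quality of the prompt set matters so much.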

In the case described by Anthropic, the reported objective was to generate large-scale interaction data with Claude in order to approximate its behavior in downstream models. From a technical standpoint, this process can reproduce not only general capabilities but also stylistic and reasoning patterns, depending on the coverage and diversity of the query distribution.

However, there are important limitations.

First, distillation does not guarantee faithful transfer of safety alignment properties. If safety behavior is encoded implicitly in parameters and response distributions, partial or biased sampling may fail to preserve those constraints in the student model.

Second, the resulting dataset reflects the teacher's observable outputs, not its internal decision processes. This creates a representational bottleneck: the student learns a behavioral approximation rather than the underlying mechanism.

Third, distillation effectiveness is highly sensitive to query distribution coverage. If adversarial or edge-case inputs are overrepresented, the resulting model may inherit distorted behavioral priors.
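The second and third limitations can be illustrated with a deliberately small numerical analogy. Everything in this sketch is an assumption made for illustration: the teacher is a simple quadratic function, the student is a straight line fit by least squares, and the "typical" and "skewed" query sets are made up. The student only ever sees input–output pairs, so it matches the teacher where it was sampled without recovering the underlying rule, and a query set dominated by unusual inputs degrades its fit on ordinary ones.

```python
def teacher(x):
    """Hypothetical teacher: the student never sees this rule, only its outputs."""
    return x * x

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def distill(query_xs):
    """Train a linear student on the teacher's outputs for the given queries."""
    slope, intercept = fit_line(query_xs, [teacher(x) for x in query_xs])
    return lambda x: slope * x + intercept

def max_error(student, xs):
    """Worst-case gap between student and teacher on a set of inputs."""
    return max(abs(student(x) - teacher(x)) for x in xs)

# Assumed "typical" inputs lie in [0, 1].
typical = [i / 10 for i in range(11)]
# Skewed query set: mostly unusual inputs in [2, 3], few typical ones.
skewed = [0.0, 1.0] + [2 + i / 10 for i in range(11)]

student_typical = distill(typical)
student_skewed = distill(skewed)

typical_error = max_error(student_typical, typical)
skewed_error = max_error(student_skewed, typical)
extrapolation_error = abs(student_typical(3.0) - teacher(3.0))
print(typical_error < skewed_error, typical_error < extrapolation_error)
```

Both students imitate outputs rather than the rule x², so the one trained on typical inputs drifts badly outside its sampled range, and the one trained on the skewed set is worse on exactly the inputs that matter most in ordinary use.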

It is also relevant that frontier models themselves are trained on large-scale internet data, including text, code, and media scraped from publicly available sources. This has long raised unresolved questions regarding data ownership, consent, and attribution in foundation model training.

Within that context, distillation can be viewed as a more explicit form of behavior transfer between models, compressing capabilities from a high-capacity system into a lower-capacity one through supervised imitation of outputs.

While the technique is well established in machine learning literature, its strategic significance has increased as capability gaps between frontier models and open or competing systems have widened, making behavioral extraction economically meaningful at scale.
