Artifacts


Active interpretability with hallucination probes: exploding activations in `Apertus-8B-Instruct-2509`

Hallucination probes are small classifiers that read an LLM's internal states to flag when the model is making things up. When we trained them on `Apertus-8B-Instruct-2509`, they were far less stable than the same probes trained on `Llama-3.1-8B-Instruct`. We traced the problem to exploding activations in deeper layers: hidden-state magnitudes grow so large that the probe's optimizer cannot learn a reliable decision boundary. By applying four targeted fixes (`fp32` precision, a lower learning rate, `LayerNorm`, and `LoRA` adapters), we improved probe performance from `0.7025` to `0.8961` AUC and from `0.3837` to `0.6802` recall at a `0.1` false positive rate. More importantly, this is a case study in *active interpretability*: the probes told us something was wrong with the model's internals before we knew what to look for. The failure pattern pointed us to a hypothesis, and we verified it with simple interventions, providing valuable lessons for upcoming model releases.
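To make the fixes concrete, here is a minimal PyTorch sketch of a probe hardened along these lines. The architecture, hidden size, and learning rate are illustrative assumptions, not the exact configuration used in the artifact; the `LoRA` fix lives in the base model and is omitted here.

```python
# Sketch of a linear hallucination probe hardened against exploding
# activations. Hyperparameters and the LayerNorm-before-probe design
# are illustrative assumptions, not the exact recipe described above.
import torch
import torch.nn as nn

class HallucinationProbe(nn.Module):
    """Binary classifier over a single layer's cached hidden states."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Fix: LayerNorm rescales exploding activations to unit scale
        # before they reach the linear decision boundary.
        self.norm = nn.LayerNorm(hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Fix: cast to fp32 so large magnitudes are not clipped or
        # overflowed by half-precision arithmetic.
        h = hidden_states.to(torch.float32)
        return self.classifier(self.norm(h)).squeeze(-1)

probe = HallucinationProbe(hidden_dim=4096)  # assumed 8B hidden size
# Fix: a lower learning rate keeps the optimizer stable when gradients
# are amplified by large input magnitudes (1e-5 is an assumed value).
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-5)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(activations: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimizer step on a batch of cached activations and 0/1 labels."""
    optimizer.zero_grad()
    loss = loss_fn(probe(activations), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```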

Technical · Exploratory · Active-interpretability · Mechanistic-interpretability · #apertus-1.5

Sparse Autoencoder Features in Apertus Middle Layers

A study of SAE-derived features in mid-layer MLP activations with a focus on locality and stability across checkpoints.

Technical · SAEs

Early-layer Circuits for Language Identification in Apertus

Circuit-style analysis of how early layers route language identity signals using path patching and synthetic probes.

Paper · Circuits

An Active Interpretability Dashboard for Apertus

Interactive tooling for running lightweight interpretability workflows in-browser with reusable presets.

Software · Active-interpretability

Robustness of SAE Features Under Steering

Negative result showing that many SAE features drift under moderate fine-tuning and steering interventions.

Negative-results · SAEs