Artifacts
Search and filter across all Apertus Claritas artifacts.
Active interpretability with hallucination probes: exploding activations in `Apertus-8B-Instruct-2509`
Hallucination probes are small classifiers that read an LLM's internal states to flag when the model is making things up. When we trained them on `Apertus-8B-Instruct-2509`, they were far less stable than the same probes trained on `Llama-3.1-8B-Instruct`. We traced the problem to exploding activations in deeper layers: hidden state magnitudes grow so large that the probe's optimizer can't learn a reliable decision boundary. By applying four targeted fixes (`fp32` precision, a lower learning rate, `LayerNorm`, and `LoRA` adapters), we improved probe performance from `0.7025` to `0.8961` AUC and from `0.3837` to `0.6802` recall at a `0.1` false positive rate. More importantly, this is a case study in *active interpretability*: the probes told us something was wrong with the model's internals before we knew what to look for. The failure pattern pointed us to a hypothesis, and we verified it with simple interventions, providing valuable lessons for upcoming model releases.
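The fixes described above can be sketched in a minimal PyTorch probe. This is an illustrative toy, not the artifact's actual code: the synthetic activations (scaled to large magnitudes to mimic the exploding deep-layer hidden states), the `NormedProbe` class, and all hyperparameters are assumptions; the `LoRA` fix is omitted since it concerns the base model rather than the probe itself.

```python
# Hypothetical sketch of a stabilized hallucination probe:
# LayerNorm + fp32 weights + a modest learning rate, trained on
# synthetic "hidden states" with exploded magnitudes (~1e3).
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 64

class NormedProbe(nn.Module):
    """Binary probe: LayerNorm tames large activation magnitudes
    before a single linear layer produces a hallucination logit."""
    def __init__(self, d: int):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.linear = nn.Linear(d, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.linear(self.norm(h)).squeeze(-1)

# Synthetic activations standing in for deep-layer hidden states;
# the large scale mimics the exploding-activation failure mode.
n = 512
h = torch.randn(n, hidden_dim, dtype=torch.float32) * 1e3
w_true = torch.randn(hidden_dim)          # illustrative "true" direction
y = (h @ w_true > 0).float()              # synthetic hallucination labels

probe = NormedProbe(hidden_dim).float()   # keep probe weights in fp32
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)  # lowered learning rate
loss_fn = nn.BCEWithLogitsLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(probe(h), y)
    loss.backward()
    opt.step()

acc = ((probe(h) > 0).float() == y).float().mean().item()
print(f"train accuracy: {acc:.3f}")
```

Without the `LayerNorm`, the raw `~1e3`-scale inputs produce huge gradients through the linear layer, which is one plausible mechanism for the unstable decision boundaries the abstract describes.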
Sparse Autoencoder Features in Apertus Middle Layers
A study of SAE-derived features in mid-layer MLP activations with a focus on locality and stability across checkpoints.
Early-layer Circuits for Language Identification in Apertus
Circuit-style analysis of how early layers route language identity signals using path patching and synthetic probes.
An Active Interpretability Dashboard for Apertus
Interactive tooling for running lightweight interpretability workflows in-browser with reusable presets.
Robustness of SAE Features Under Steering
Negative result showing that many SAE features drift under moderate fine-tuning and steering interventions.