Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Description

In this impressive work, Anthropic tackles the interpretability of large language models. To work around superposition, which causes individual neurons to be polysemantic, they use a weak dictionary learning algorithm called a sparse autoencoder to extract learned features from a trained model's activations, features that offer a more monosemantic unit of analysis than the model's neurons themselves.
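The core idea can be sketched in a few lines: an overcomplete autoencoder trained to reconstruct model activations while an L1 penalty pushes its feature activations toward sparsity. The dimensions, initialization, and penalty coefficient below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8   # hypothetical activation dimension of the model layer
d_dict = 32   # overcomplete dictionary: more features than neurons

# Untied encoder/decoder weights; decoder rows normalized to unit norm
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def encode(x):
    # ReLU yields non-negative, mostly-sparse feature activations
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Reconstruction is a sparse linear combination of dictionary rows
    return f @ W_dec

def loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)                     # reconstruction error
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()   # L1 sparsity penalty
    return recon + sparsity

x = rng.normal(size=(4, d_model))  # a batch of fake MLP activations
print(loss(x))
```

Training minimizes this loss over real activations; after training, each row of `W_dec` is a candidate interpretable feature direction.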


Read article here