Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Description

In this impressive work, Anthropic tackles the interpretability of large language models. To work around superposition, which causes individual neurons to be polysemantic, they use a weak dictionary learning algorithm called a sparse autoencoder to extract learned features from a trained model's activations, features that offer a more monosemantic unit of analysis than the model's neurons themselves.
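The core idea can be sketched in a few lines: an overcomplete autoencoder trained to reconstruct model activations while an L1 penalty pushes its feature activations toward sparsity. The dimensions, initialization, and penalty coefficient below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8   # hypothetical activation dimension of the model layer
d_dict = 32   # overcomplete dictionary: more features than neurons

# Untied encoder/decoder weights; decoder rows normalized to unit norm
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def encode(x):
    # ReLU yields non-negative, mostly-sparse feature activations
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Reconstruction is a sparse linear combination of dictionary rows
    return f @ W_dec

def loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)                     # reconstruction error
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()   # L1 sparsity penalty
    return recon + sparsity

x = rng.normal(size=(4, d_model))  # a batch of fake MLP activations
print(loss(x))
```

Training minimizes this loss over real activations; after training, each row of `W_dec` is a candidate interpretable feature direction.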


Read article here