Decomposing Language Models Into Understandable Components

AI startup Anthropic, writing in a blog post: Neural networks are trained on data, not programmed to follow rules. With each step of training, millions or billions of parameters are updated to make the model better at tasks, and by the end, the model is capable of a dizzying array of behaviors. We understand the math of the trained network exactly -- each neuron in a neural network performs simple arithmetic -- but we don't understand why those mathematical operations result in the behaviors we