*Note: Most quick takes on this website were copied over from somewhere else and adapted by Claude.*


The scaling parameter decomposition paper is coming out soonish! Pretty exciting, but imo they still haven't solved a few big issues.


I worry that the components they find, much like SAE features, are sometimes going to look interpretable without faithfully representing what the model is doing. They're definitely a lot better than SAEs in this regard, though!


I think I've come up with ideas for solving a couple of the biggest problems I see with the agenda. I want to think about whether there are any fundamental bottlenecks; if there aren't, I'll heavily regret not working as hard as I can on this now. (I guess the first of these might be solved by the time the paper is finished; someone on the team recently pushed a PR with better causal importance functions.)


But yeah, fundamental bottlenecks. In the current formulation, there's a pretty arbitrary balance between faithfulness, minimality, and reconstruction. I think a lot of this could be solved with a better notion of MDL than 'use as few rank-1 subcomponents as possible on any given input'. Someone on the team recently added a loss term penalizing high-frequency components as well. On one hand, this brings us closer to a reasonable notion of MDL; on the other hand, it's clearly unprincipled. (Plus, this is loss term number 7.) But with a sufficiently good notion of MDL, if we enforce perfect reconstruction, I thiiink that the only solution that satisfies both faithfulness and minimality is ~the true circuits. (A better notion of MDL would also solve a lot of smaller dumb problems, like multiple unrelated subcomponents grouping into a single rank-1 matrix. But it would also solve some huge problems; e.g., it would immediately make the faithfulness loss way more meaningful.)
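To make the 'arbitrary balance' point concrete, here's roughly the shape of objective I'm gesturing at. This is an illustrative sketch, not the actual SPD loss: the function name, the L0.5-style minimality term, and the coefficients are all stand-ins.

```python
import torch

def spd_style_loss(W_target, subcomponents, importances, out_orig, out_decomposed,
                   w_faith=1.0, w_recon=1.0, w_min=1.0):
    """Illustrative only: three terms traded off by hand-picked coefficients."""
    # Faithfulness: the rank-1 subcomponents should sum back to the original weight matrix.
    faithfulness = ((subcomponents.sum(dim=0) - W_target) ** 2).mean()

    # Reconstruction: with unimportant subcomponents masked off, the decomposed model
    # should still reproduce the original model's outputs on this input.
    reconstruction = ((out_decomposed - out_orig) ** 2).mean()

    # Minimality: "use as few rank-1 subcomponents as possible on any given input",
    # approximated here by a sparsity penalty on per-input importance scores.
    minimality = importances.abs().pow(0.5).sum(dim=-1).mean()

    # The arbitrary part: nothing pins down the exchange rate between these three terms.
    return w_faith * faithfulness + w_recon * reconstruction + w_min * minimality
```

Every extra term (the frequency penalty would be a fourth coefficient here) adds another knob whose value nothing in the framework pins down; a real MDL objective would ideally replace the minimality and frequency terms with an actual description-length count.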


This is the sort of thing that someone on the team would have solved already if it were easily solvable, I think. But it definitely seems fundamentally possible, especially if you incorporate some cross-datapoint information-theoretic measure into the training loop. If these better causal importance functions work, we just have to 1) figure out MDL and 2) figure out how to do stochastic masking only on the correct abstractions, and then we should be able to throw compute at the method and find the true circuits? Of course, there's also something to be said for working on agendas that aim to automate a notion of human-interpretability rather than circuit-finding in the network's ontology. The circuits we find w/ parameter decomposition might be pretty wild.
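For concreteness, here's the flavor of cross-datapoint measure I have in mind, as a toy strawman rather than a proposal (the function and the hard threshold are made up, and the hard mask isn't differentiable, so a real version would need a relaxation): code each subcomponent's on/off pattern across a batch with its empirical activation frequency, so the penalty is a description length for the whole batch rather than a per-input sparsity count.

```python
import torch

def cross_datapoint_description_length(importances, eps=1e-8):
    """Toy MDL-ish measure over a batch of causal-importance scores in [0, 1],
    shape (batch, n_subcomponents). Entirely illustrative."""
    active = (importances > 0.5).float()          # hard on/off pattern per datapoint
    p = active.mean(dim=0).clamp(eps, 1 - eps)    # empirical activation frequency per subcomponent
    # Expected bits per datapoint to code the on/off pattern under a
    # per-subcomponent Bernoulli(p) code.
    bits_per_datapoint = -(p * p.log2() + (1 - p) * (1 - p).log2())
    return bits_per_datapoint.sum() * importances.shape[0]   # total bits for the batch
```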


To elaborate on the stochastic masking point: The claim of SPD is that networks can be divided into circuits. On any given input, only some of these circuits are active. For each inactive circuit to 'not matter', we should be able to scale it by an arbitrary alpha in [0,1], and the network's behavior should stay the same. This is true at the circuit level, not at the 'subcomponent' level.


Say you have a rank-5 circuit, divided into 5 rank-1 subcomponents across layers. You should be allowed to multiply that rank-5 circuit by an arbitrary alpha in [0,1]. But if you take each individual rank-1 subcomponent and multiply it by a different alpha_i in [0,1], this could totally change what directions the circuit reads from, and so shouldn't be allowed. E.g., call your circuit C and the activations v, with C = C1 + C2 + C3 + C4 + C5. Cv is around 0, so we can scale C by whatever we like. However, (aC1 + bC2 + cC3 + dC4 + eC5)*v might be something totally different: the individual Ci*v terms can each be large, and it's only their sum that cancels. One possible way forward would be to incorporate a circuit-finding clustering step into the training loop, so that we can causally mask individual circuits rather than individual subcomponents.
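A quick numerical version of that example (random matrices, nothing SPD-specific): build five rank-1 pieces whose contributions cancel on a particular v, so the summed circuit is inactive there even though each piece individually reads a nonzero amount from v.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical activations v and four random rank-1 subcomponents C1..C4.
v = rng.normal(size=d)
subs = [np.outer(rng.normal(size=d), rng.normal(size=d)) for _ in range(4)]

# Construct a fifth rank-1 subcomponent that cancels the others' contribution on v,
# so the summed rank-5 circuit C is inactive on this input (Cv ~ 0) even though
# every individual subcomponent reads a nonzero amount from v.
r = sum(Ci @ v for Ci in subs)
subs.append(np.outer(-r, v) / (v @ v))
C = sum(subs)

print(np.linalg.norm(C @ v))            # ~0: the whole circuit is inactive here

alpha = 0.37
print(np.linalg.norm(alpha * C @ v))    # still ~0: scaling the whole circuit is safe

alphas = rng.uniform(0, 1, size=5)      # a different alpha_i per rank-1 subcomponent
C_masked = sum(a * Ci for a, Ci in zip(alphas, subs))
print(np.linalg.norm(C_masked @ v))     # generally far from 0: per-subcomponent masking
                                        # breaks the cancellation on this input
```

Scaling C as a whole preserves the cancellation for any alpha; scaling the pieces independently breaks it, which is why the masks need to live at the circuit level.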


Another possibility is that the network's true circuits aren't actually that close to MDL-optimal, so our decomposition finds a lower-description-length solution that mimics the original model's outputs perfectly and its weights almost perfectly, with some 'wiggle room' left in the faithfulness loss.