New paper with Eric Michaud and Max Tegmark is out! (paper)


Some key findings:

1. In certain settings, curriculum learning effects are essential for achieving high performance — sometimes you need to train on a broad distribution to learn specific narrow skills.

2. Pruning often outperforms distillation at creating capable, task-specific networks.

3. Superposition remains a key bottleneck to efficient structured pruning.

4. Structured regularization can mitigate this problem by aligning task-specific features with prunable model components (a rough sketch of one such penalty is below).
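For intuition only, here is a minimal sketch of one common form of structured regularization: a group-lasso (L2,1) penalty over hidden neurons in PyTorch, which encourages task-relevant weight mass to concentrate in a few prunable units. This is not the paper's actual method or code; the model, penalty grouping, and hyperparameters are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; in the paper's setting this would be whatever
# network is being pruned for a specific task.
class TinyMLP(nn.Module):
    def __init__(self, d_in=32, d_hidden=256, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

def group_lasso_penalty(model: TinyMLP) -> torch.Tensor:
    # One group per hidden neuron: its incoming weights (a row of fc1.weight)
    # concatenated with its outgoing weights (a column of fc2.weight).
    # Summing the per-group L2 norms pushes whole neurons toward zero,
    # so they can later be removed by structured pruning.
    incoming = model.fc1.weight        # shape (d_hidden, d_in)
    outgoing = model.fc2.weight.t()    # shape (d_hidden, d_out)
    groups = torch.cat([incoming, outgoing], dim=1)
    return groups.norm(dim=1).sum()

model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-3  # illustrative regularization strength

# One training step on random data, just to show where the penalty enters the loss.
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(model(x), y) + lam * group_lasso_penalty(model)
loss.backward()
opt.step()
```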


Check out the preprint for additional findings.


update: accepted to NeurIPS 2025 (: