Tom Lieberum
Google DeepMind
Verified email at deepmind.com
Title
Cited by
Year
Progress measures for grokking via mechanistic interpretability
N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt
arXiv preprint arXiv:2301.05217, 2023
Cited by: 146; Year: 2023
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla
T Lieberum, M Rahtz, J Kramár, G Irving, R Shah, V Mikulik
arXiv preprint arXiv:2307.09458, 2023
Cited by: 26; Year: 2023
Retrospective on the 2021 MineRL BASALT competition on learning from human feedback
R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ...
NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022
Cited by: 7; Year: 2022
AtP*: An efficient and scalable method for localizing LLM behaviour to components
J Kramár, T Lieberum, R Shah, N Nanda
arXiv preprint arXiv:2403.00745, 2024
Cited by: 3; Year: 2024
Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ...
arXiv preprint arXiv:2204.07123, 2022
Cited by: 1; Year: 2022
Replication: Fairness without demographics through Adversarially Reweighted Learning
E Jenner, T Lieberum, FP Nolte, N Rutsch
Year: 2021