Max Nadeau
Verified email at college.harvard.edu
Title
Cited by
Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by: 158; Year: 2023
Robust feature-level adversaries are interpretability tools
S Casper, M Nadeau, D Hadfield-Menell, G Kreiman
Advances in Neural Information Processing Systems 35, 33093-33106, 2022
Cited by: 23; Year: 2022
Circuit breaking: Removing model behaviors with targeted ablation
M Li, X Davies, M Nadeau
arXiv preprint arXiv:2309.05973, 2023
Cited by: 7; Year: 2023
Discovering variable binding circuitry with desiderata
X Davies, M Nadeau, N Prakash, TR Shaham, D Bau
arXiv preprint arXiv:2307.03637, 2023
Cited by: 4; Year: 2023
Measurement tampering detection benchmark
F Roger, R Greenblatt, M Nadeau, B Shlegeris, N Thomas
arXiv preprint arXiv:2308.15605, 2023
Cited by: 2; Year: 2023