Stephen Casper
PhD student, MIT
Verified email at mit.edu - Homepage
Title · Cited by · Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
150 · 2023
Toward transparent ai: A survey on interpreting the inner structures of deep neural networks
T Räuker, A Ho, S Casper, D Hadfield-Menell
2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 464-483, 2023
103* · 2023
Explore, establish, exploit: Red teaming language models from scratch
S Casper, J Lin, J Kwon, G Culp, D Hadfield-Menell
arXiv preprint arXiv:2306.09442, 2023
41 · 2023
Clusterability in neural networks
D Filan, S Casper, S Hod, C Wild, A Critch, S Russell
arXiv preprint arXiv:2103.03386, 2021
32 · 2021
Frivolous units: Wider networks are not really that wide
S Casper, X Boix, V D'Amario, L Guo, M Schrimpf, K Vinken, G Kreiman
Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), 6921-6929, 2021
28* · 2021
Scalable and transferable black-box jailbreaks for language models via persona modulation
R Shah, S Pour, A Tagade, S Casper, J Rando
arXiv preprint arXiv:2311.03348, 2023
25 · 2023
Robust feature-level adversaries are interpretability tools
S Casper, M Nadeau, D Hadfield-Menell, G Kreiman
Advances in Neural Information Processing Systems 35, 33093-33106, 2022
23 · 2022
Red teaming deep neural networks with feature synthesis tools
S Casper, T Bu, Y Li, J Li, K Zhang, K Hariharan, D Hadfield-Menell
Advances in Neural Information Processing Systems 36, 80470-80516, 2023
18* · 2023
Probing neural dialog models for conversational understanding
A Saleh, T Deutsch, S Casper, Y Belinkov, S Shieber
arXiv preprint arXiv:2006.08331, 2020
16 · 2020
Rethinking Machine Unlearning for Large Language Models
S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, X Xu, Y Yao, H Li, ...
arXiv preprint arXiv:2402.08787, 2024
10 · 2024
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
K Liu, S Casper, D Hadfield-Menell, J Andreas
arXiv preprint arXiv:2312.03729, 2023
10 · 2023
Diagnostics for deep neural networks with automated copy/paste attacks
S Casper, K Hariharan, D Hadfield-Menell
arXiv preprint arXiv:2211.10024, 2022
9 · 2022
Quantifying local specialization in deep neural networks
S Hod, D Filan, S Casper, A Critch, S Russell
arXiv preprint arXiv:2110.08058, 2021
9 · 2021
Graphical clusterability and local specialization in deep neural networks
S Casper, S Hod, D Filan, C Wild, A Critch, S Russell
ICLR 2022 Workshop on PAIR^2Struct: Privacy …, 2022
8 · 2022
Detecting modularity in deep neural networks
S Hod, S Casper, D Filan, C Wild, A Critch, S Russell
7 · 2021
Black-Box Access is Insufficient for Rigorous AI Audits
S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ...
arXiv preprint arXiv:2401.14446, 2024
6 · 2024
Eight Methods to Evaluate Robust Unlearning in LLMs
A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell
arXiv preprint arXiv:2402.16835, 2024
4 · 2024
Measuring the success of diffusion models at imitating human artists
S Casper, Z Guo, S Mogulothu, Z Marinov, C Deshpande, RJ Yew, Z Dai, ...
arXiv preprint arXiv:2307.04028, 2023
4 · 2023
Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents
S Casper, T Killian, G Kreiman, D Hadfield-Menell
arXiv preprint arXiv:2209.02167, 2022
4* · 2022
Achilles heels for agi/asi via decision theoretic adversaries
S Casper
arXiv preprint arXiv:2010.05418, 2020
4 · 2020
Articles 1–20