Stephen Casper

Cited by

	All	Since 2019
Citations	725	724
h-index	12	12
i10-index	15	15

460

230

115

345

202020212022202320246 14 43 209 452

Public access

View all

1 article

0 articles

available

not available

Based on funding mandates

Co-authors

Dylan Hadfield-MenellMassachusetts Institute of TechnologyVerified email at csail.mit.edu
Gabriel KreimanProfessor, Harvard Medical School and Children's HospitalVerified email at tch.harvard.edu
Daniel FilanPhD Student, UC BerkeleyVerified email at berkeley.edu
Andrew CritchUC Berkeley, Department of Electrical Engineering and Computer SciencesVerified email at eecs.berkeley.edu
Stuart RussellProfessor of Computer Science, University of California, BerkeleyVerified email at cs.berkeley.edu
Shlomi HodPhD Candidate, Boston UniversityVerified email at bu.edu
Cody WildGoogle ResearchVerified email at google.com
Anson HoEpochVerified email at epochai.org
Xavier BoixMITVerified email at mit.edu
Kasper VinkenHarvard Medical SchoolVerified email at hms.harvard.edu
Arush TagadeML Researcher, Leap LaboratoriesVerified email at leap-labs.com
Martin SchrimpfEPFLVerified email at epfl.ch
Kevin Honglin ZhangDepartment of Economics, Illinois State UniversityVerified email at ilstu.edu
Kaivalya HariharanMIT CSAILVerified email at mit.edu
Jason LinGoogle / StanfordVerified email at stanford.edu
Gatlen CulpMassachusetts Institute of TechnologyVerified email at mit.edu
Joe KwonMITVerified email at csail.mit.edu
Soroush PourHarmony IntelligenceVerified email at soroushjp.com
Javier RandoETH ZurichVerified email at ai.ethz.ch
Rusheb ShahApollo ResearchVerified email at apolloresearch.ai

Stephen Casper

PhD student, MIT

Verified email at mit.edu - Homepage

AI safety AI responsibility red-teaming robustness auditing


Title Sort by citations Sort by year Sort by title	Cited by Cited by	Year
Open problems and fundamental limitations of reinforcement learning from human feedback S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023	233	2023
Toward transparent ai: A survey on interpreting the inner structures of deep neural networks T Räuker, A Ho, S Casper, D Hadfield-Menell 2023 ieee conference on secure and trustworthy machine learning (satml), 464-483, 2023	120	2023
Explore, establish, exploit: Red teaming language models from scratch S Casper, J Lin, J Kwon, G Culp, D Hadfield-Menell arXiv preprint arXiv:2306.09442, 2023	49	2023
Scalable and transferable black-box jailbreaks for language models via persona modulation R Shah, S Pour, A Tagade, S Casper, J Rando arXiv preprint arXiv:2311.03348, 2023	41	2023
Rethinking machine unlearning for large language models S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, X Xu, Y Yao, H Li, ... arXiv preprint arXiv:2402.08787, 2024	34	2024
Frivolous units: Wider networks are not really that wide S Casper, X Boix, V D'Amario, L Guo, M Schrimpf, K Vinken, G Kreiman Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), 6921-6929, 2021	28*	2021
Foundational challenges in assuring alignment and safety of large language models U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... arXiv preprint arXiv:2404.09932, 2024	27	2024
Clusterability in neural networks D Filan, S Casper, S Hod, C Wild, A Critch, S Russell arXiv preprint arXiv:2103.03386, 2021	27	2021
Red teaming deep neural networks with feature synthesis tools S Casper, T Bu, Y Li, J Li, K Zhang, K Hariharan, D Hadfield-Menell Advances in Neural Information Processing Systems 36, 80470-80516, 2023	25*	2023
Robust feature-level adversaries are interpretability tools S Casper, M Nadeau, D Hadfield-Menell, G Kreiman Advances in Neural Information Processing Systems 35, 33093-33106, 2022	24	2022
Black-box access is insufficient for rigorous ai audits S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ... The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2254-2272, 2024	18	2024
Probing neural dialog models for conversational understanding A Saleh, T Deutsch, S Casper, Y Belinkov, S Shieber arXiv preprint arXiv:2006.08331, 2020	15	2020
Eight methods to evaluate robust unlearning in llms A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell arXiv preprint arXiv:2402.16835, 2024	12	2024
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? K Liu, S Casper, D Hadfield-Menell, J Andreas arXiv preprint arXiv:2312.03729, 2023	12	2023
Detecting modularity in deep neural networks S Hod, S Casper, D Filan, C Wild, A Critch, S Russell	10	2021
Graphical clusterability and local specialization in deep neural networks S Casper, S Hod, D Filan, C Wild, A Critch, S Russell ICLR 2022 Workshop on PAIR {\textasciicircum} 2Struct: Privacy …, 2022	9	2022
Diagnostics for deep neural networks with automated copy/paste attacks S Casper, K Hariharan, D Hadfield-Menell arXiv preprint arXiv:2211.10024, 2022	8	2022
Quantifying local specialization in deep neural networks S Hod, D Filan, S Casper, A Critch, S Russell arXiv preprint arXiv:2110.08058, 2021	8	2021
Defending Against Unforeseen Failure Modes with Latent Adversarial Training S Casper, L Schulze, O Patel, D Hadfield-Menell arXiv preprint arXiv:2403.05030, 2024	6	2024
Open problems and fundamental limitations of reinforcement learning from human feedback. CoRR, abs/2307.15217, 2023. doi: 10.48550 S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint ARXIV.2307.15217, 0	6

The system can't perform the operation now. Try again later.

Articles 1–20

Citations per year

Duplicate citations

Merged citations

Add co-authorsCo-authors

Follow

Cited by

Co-authors