"The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models." A. Pan, K. Bhatia, J. Steinhardt. arXiv preprint arXiv:2201.03544, 2022. Cited by 78.
"Representation Engineering: A Top-Down Approach to AI Transparency." A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, et al. arXiv preprint arXiv:2310.01405, 2023. Cited by 69*.
"Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark." A. Pan, J. S. Chan, A. Zou, N. Li, S. Basart, T. Woodside, H. Zhang, S. Emmons, et al. International Conference on Machine Learning, pp. 26837-26867, 2023. Cited by 60.
"Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training." A. Pan, Y. Lee, H. Zhang, Y. Chen, Y. Shi. arXiv preprint arXiv:2110.08956, 2021. Cited by 7.
"Feedback Loops With Language Models Drive In-Context Reward Hacking." A. Pan, E. Jones, M. Jagadeesan, J. Steinhardt. arXiv preprint arXiv:2402.06627, 2024. Cited by 4.
"The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning." N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A. K. Dombrowski, et al. arXiv preprint arXiv:2403.03218, 2024. Cited by 2.
"Foundational Challenges in Assuring Alignment and Safety of Large Language Models." U. Anwar, A. Saparov, J. Rando, D. Paleka, M. Turpin, P. Hase, E. S. Lubana, et al. arXiv preprint arXiv:2404.09932, 2024.