Alexander Pan
Verified email at berkeley.edu
Title
Cited by
Year
The effects of reward misspecification: Mapping and mitigating misaligned models
A Pan, K Bhatia, J Steinhardt
arXiv preprint arXiv:2201.03544, 2022
Cited by 78, 2022
Representation engineering: A top-down approach to AI transparency
A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
Cited by 69*, 2023
Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark
A Pan, JS Chan, A Zou, N Li, S Basart, T Woodside, H Zhang, S Emmons, ...
International Conference on Machine Learning, 26837-26867, 2023
Cited by 60, 2023
Improving robustness of reinforcement learning for power system control with adversarial training
A Pan, Y Lee, H Zhang, Y Chen, Y Shi
arXiv preprint arXiv:2110.08956, 2021
Cited by 7, 2021
Feedback Loops With Language Models Drive In-Context Reward Hacking
A Pan, E Jones, M Jagadeesan, J Steinhardt
arXiv preprint arXiv:2402.06627, 2024
Cited by 4, 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...
arXiv preprint arXiv:2403.03218, 2024
Cited by 2, 2024
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...
arXiv preprint arXiv:2404.09932, 2024
2024