"The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models." A. Pan, K. Bhatia, J. Steinhardt. arXiv preprint arXiv:2201.03544, 2022. Cited by 78.
"Representation Engineering: A Top-Down Approach to AI Transparency." A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, et al. arXiv preprint arXiv:2310.01405, 2023. Cited by 69*.
"Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark." A. Pan, J. S. Chan, A. Zou, N. Li, S. Basart, T. Woodside, H. Zhang, S. Emmons, et al. International Conference on Machine Learning, pp. 26837-26867, 2023. Cited by 60.
"Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training." A. Pan, Y. Lee, H. Zhang, Y. Chen, Y. Shi. arXiv preprint arXiv:2110.08956, 2021. Cited by 7.
"Feedback Loops With Language Models Drive In-Context Reward Hacking." A. Pan, E. Jones, M. Jagadeesan, J. Steinhardt. arXiv preprint arXiv:2402.06627, 2024. Cited by 4.
"The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning." N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A. K. Dombrowski, et al. arXiv preprint arXiv:2403.03218, 2024. Cited by 2.
"Foundational Challenges in Assuring Alignment and Safety of Large Language Models." U. Anwar, A. Saparov, J. Rando, D. Paleka, M. Turpin, P. Hase, E. S. Lubana, et al. arXiv preprint arXiv:2404.09932, 2024.