Nathaniel Li
Verified email at berkeley.edu
Title
Cited by
Year
Representation Engineering: A Top-Down Approach to AI Transparency
A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
Cited by: 133, Year: 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
A Pan, JS Chan, A Zou, N Li, S Basart, T Woodside, H Zhang, S Emmons, ...
ICML 2023, 26837-26867, 2023
Cited by: 90, Year: 2023
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...
ICML 2024, 2024
Cited by: 39, Year: 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...
ICML 2024, 2024
Cited by: 24, Year: 2024