| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Pretraining language models with human preferences | T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ... | International Conference on Machine Learning, 17506-17533, 2023 | 133 | 2023 |
| Towards automated circuit discovery for mechanistic interpretability | A Conmy, A Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso | Advances in Neural Information Processing Systems 36, 16318-16352, 2023 | 112 | 2023 |
| Eliciting Latent Predictions from Transformers with the Tuned Lens | N Belrose, Z Furman, L Smith, D Halawi, I Ostrovsky, L McKinney, ... | arXiv preprint arXiv:2303.08112, 2023 | 83 | 2023 |
| Training Language Models with Language Feedback at Scale | J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez | arXiv preprint arXiv:2303.16755, 2023 | 71 | 2023 |
| imitation: Clean Imitation Learning Implementations | A Gleave, M Taufeeque, J Rocamonde, E Jenner, SH Wang, S Toyer, ... | arXiv preprint arXiv:2211.11972, 2022 | 67* | 2022 |
| Training language models with language feedback | J Scheurer, JA Campos, JS Chan, A Chen, K Cho, E Perez | The First Workshop on Learning with Natural Language Supervision at ACL, 2022 | 54* | 2022 |
| Evaluating the moral beliefs encoded in LLMs | N Scherrer, C Shi, A Feder, D Blei | Advances in Neural Information Processing Systems 36, 2024 | 50 | 2024 |
| Inverse Scaling: When Bigger Isn't Better | IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ... | arXiv preprint arXiv:2306.09479, 2023 | 48 | 2023 |
| Adversarial Policies Beat Superhuman Go AIs | TT Wang, A Gleave, N Belrose, T Tseng, J Miller, MD Dennis, Y Duan, ... | arXiv preprint arXiv:2211.00241, 2022 | 45* | 2022 |
| Improving Code Generation by Training with Natural Language Feedback | A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ... | arXiv preprint arXiv:2303.16749, 2023 | 42 | 2023 |
| RL with KL penalties is better viewed as Bayesian inference | T Korbak, E Perez, CL Buckley | EMNLP 2022 | 35 | 2022 |
| Vision-language models are zero-shot reward models for reinforcement learning | J Rocamonde, V Montesinos, E Nava, E Perez, D Lindner | arXiv preprint arXiv:2310.12921, 2023 | 17 | 2023 |
| Exploiting Novel GPT-4 APIs | K Pelrine, M Taufeeque, M Zając, E McLean, A Gleave | arXiv preprint arXiv:2312.14302, 2023 | 14 | 2023 |
| Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems | D Dalrymple, J Skalse, Y Bengio, S Russell, M Tegmark, S Seshia, ... | arXiv preprint arXiv:2405.06624, 2024 | 9 | 2024 |
| Codebook Features: Sparse and Discrete Interpretability for Neural Networks | A Tamkin, M Taufeeque, ND Goodman | arXiv preprint arXiv:2310.17230, 2023 | 9 | 2023 |
| STARC: A General Framework For Quantifying Differences Between Reward Functions | J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate | arXiv preprint arXiv:2309.15257, 2023 | 3 | 2023 |
| An Invariant Learning Characterization of Controlled Text Generation | C Zheng, C Shi, K Vafa, A Feder, DM Blei | arXiv preprint arXiv:2306.00198, 2023 | 3 | 2023 |
| Few-shot Adaptation Works with UnpredicTable Data | JS Chan, M Pieler, J Jao, J Scheurer, E Perez | arXiv preprint arXiv:2208.01009, 2022 | 3* | 2022 |
| Can Go AIs be adversarially robust? | T Tseng, E McLean, K Pelrine, TT Wang, A Gleave | arXiv preprint arXiv:2406.12843, 2024 | | 2024 |
| Uncovering Latent Human Wellbeing in Language Model Embeddings | P Freire, CC Tan, A Gleave, D Hendrycks, S Emmons | arXiv preprint arXiv:2402.11777, 2024 | | 2024 |