| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Pretraining language models with human preferences | T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ... | International Conference on Machine Learning, 17506-17533, 2023 | 83 | 2023 |
| Towards automated circuit discovery for mechanistic interpretability | A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso | arXiv preprint arXiv:2304.14997, 2023 | 77 | 2023 |
| Eliciting Latent Predictions from Transformers with the Tuned Lens | N Belrose, Z Furman, L Smith, D Halawi, I Ostrovsky, L McKinney, ... | arXiv preprint arXiv:2303.08112, 2023 | 69 | 2023 |
| Training Language Models with Language Feedback at Scale | J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez | arXiv preprint arXiv:2303.16755, 2023 | 61 | 2023 |
| Training language models with language feedback | J Scheurer, JA Campos, JS Chan, A Chen, K Cho, E Perez | The First Workshop on Learning with Natural Language Supervision at ACL, 2022 | 53* | 2022 |
| imitation: Clean Imitation Learning Implementations | A Gleave, M Taufeeque, J Rocamonde, E Jenner, SH Wang, S Toyer, ... | arXiv preprint arXiv:2211.11972, 2022 | 52* | 2022 |
| Inverse Scaling: When Bigger Isn't Better | IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ... | arXiv preprint arXiv:2306.09479, 2023 | 39 | 2023 |
| Adversarial Policies Beat Superhuman Go AIs | TT Wang, A Gleave, N Belrose, T Tseng, J Miller, MD Dennis, Y Duan, ... | arXiv preprint arXiv:2211.00241, 2022 | 39* | 2022 |
| Improving Code Generation by Training with Natural Language Feedback | A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ... | arXiv preprint arXiv:2303.16749, 2023 | 36 | 2023 |
| RL with KL penalties is better viewed as Bayesian inference | T Korbak, E Perez, CL Buckley | EMNLP 2022, 2022 | 26 | 2022 |
| Vision-language models are zero-shot reward models for reinforcement learning | J Rocamonde, V Montesinos, E Nava, E Perez, D Lindner | arXiv preprint arXiv:2310.12921, 2023 | 12 | 2023 |
| Codebook Features: Sparse and Discrete Interpretability for Neural Networks | A Tamkin, M Taufeeque, ND Goodman | arXiv preprint arXiv:2310.17230, 2023 | 4 | 2023 |
| Few-shot Adaptation Works with UnpredicTable Data | JS Chan, M Pieler, J Jao, J Scheurer, E Perez | arXiv preprint arXiv:2208.01009, 2022 | 4* | 2022 |
| STARC: A General Framework For Quantifying Differences Between Reward Functions | J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate | arXiv preprint arXiv:2309.15257, 2023 | 3 | 2023 |
| An Invariant Learning Characterization of Controlled Text Generation | C Zheng, C Shi, K Vafa, A Feder, DM Blei | arXiv preprint arXiv:2306.00198, 2023 | 1 | 2023 |
| Uncovering Latent Human Wellbeing in Language Model Embeddings | P Freire, CC Tan, A Gleave, D Hendrycks, S Emmons | arXiv preprint arXiv:2402.11777, 2024 | | 2024 |