FAR AI
Research organisation
Verified email at alignmentfund.org
Title
Cited by
Year
Pretraining language models with human preferences
T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ...
International Conference on Machine Learning, 17506-17533, 2023
Cited by: 83; Year: 2023
Towards automated circuit discovery for mechanistic interpretability
A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso
arXiv preprint arXiv:2304.14997, 2023
Cited by: 77; Year: 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
N Belrose, Z Furman, L Smith, D Halawi, I Ostrovsky, L McKinney, ...
arXiv preprint arXiv:2303.08112, 2023
Cited by: 69; Year: 2023
Training Language Models with Language Feedback at Scale
J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez
arXiv preprint arXiv:2303.16755, 2023
Cited by: 61; Year: 2023
Training language models with language feedback
J Scheurer, JA Campos, JS Chan, A Chen, K Cho, E Perez
The First Workshop on Learning with Natural Language Supervision at ACL, 2022
Cited by: 53*; Year: 2022
imitation: Clean Imitation Learning Implementations
A Gleave, M Taufeeque, J Rocamonde, E Jenner, SH Wang, S Toyer, ...
arXiv preprint arXiv:2211.11972, 2022
Cited by: 52*; Year: 2022
Inverse Scaling: When Bigger Isn't Better
IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ...
arXiv preprint arXiv:2306.09479, 2023
Cited by: 39; Year: 2023
Adversarial Policies Beat Superhuman Go AIs
TT Wang, A Gleave, N Belrose, T Tseng, J Miller, MD Dennis, Y Duan, ...
arXiv preprint arXiv:2211.00241, 2022
Cited by: 39*; Year: 2022
Improving Code Generation by Training with Natural Language Feedback
A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ...
arXiv preprint arXiv:2303.16749, 2023
Cited by: 36; Year: 2023
RL with KL penalties is better viewed as Bayesian inference
T Korbak, E Perez, CL Buckley
EMNLP 2022, 2022
Cited by: 26; Year: 2022
Vision-language models are zero-shot reward models for reinforcement learning
J Rocamonde, V Montesinos, E Nava, E Perez, D Lindner
arXiv preprint arXiv:2310.12921, 2023
Cited by: 12; Year: 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
A Tamkin, M Taufeeque, ND Goodman
arXiv preprint arXiv:2310.17230, 2023
Cited by: 4; Year: 2023
Few-shot Adaptation Works with UnpredicTable Data
J Shern Chan, M Pieler, J Jao, J Scheurer, E Perez
arXiv e-prints, arXiv:2208.01009, 2022
Cited by: 4*; Year: 2022
STARC: A General Framework For Quantifying Differences Between Reward Functions
J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate
arXiv preprint arXiv:2309.15257, 2023
Cited by: 3; Year: 2023
An Invariant Learning Characterization of Controlled Text Generation
C Zheng, C Shi, K Vafa, A Feder, DM Blei
arXiv preprint arXiv:2306.00198, 2023
Cited by: 1; Year: 2023
Uncovering Latent Human Wellbeing in Language Model Embeddings
P Freire, CC Tan, A Gleave, D Hendrycks, S Emmons
arXiv preprint arXiv:2402.11777, 2024
Year: 2024