FAR AI
Research organisation
Verified email at far.ai - Homepage
Title · Cited by · Year
Pretraining language models with human preferences
T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ...
International Conference on Machine Learning, 17506-17533, 2023
133 · 2023
Towards automated circuit discovery for mechanistic interpretability
A Conmy, A Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso
Advances in Neural Information Processing Systems 36, 16318-16352, 2023
112 · 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
N Belrose, Z Furman, L Smith, D Halawi, I Ostrovsky, L McKinney, ...
arXiv preprint arXiv:2303.08112, 2023
83 · 2023
Training Language Models with Language Feedback at Scale
J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez
arXiv preprint arXiv:2303.16755, 2023
71 · 2023
imitation: Clean Imitation Learning Implementations
A Gleave, M Taufeeque, J Rocamonde, E Jenner, SH Wang, S Toyer, ...
arXiv preprint arXiv:2211.11972, 2022
67* · 2022
Training language models with language feedback
J Scheurer, JA Campos, JS Chan, A Chen, K Cho, E Perez
The First Workshop on Learning with Natural Language Supervision at ACL, 2022
54* · 2022
Evaluating the moral beliefs encoded in LLMs
N Scherrer, C Shi, A Feder, D Blei
Advances in Neural Information Processing Systems 36, 2024
50 · 2024
Inverse Scaling: When Bigger Isn't Better
IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ...
arXiv preprint arXiv:2306.09479, 2023
48 · 2023
Adversarial Policies Beat Superhuman Go AIs
TT Wang, A Gleave, N Belrose, T Tseng, J Miller, MD Dennis, Y Duan, ...
arXiv preprint arXiv:2211.00241, 2022
45* · 2022
Improving Code Generation by Training with Natural Language Feedback
A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ...
arXiv preprint arXiv:2303.16749, 2023
42 · 2023
RL with KL penalties is better viewed as Bayesian inference
T Korbak, E Perez, CL Buckley
EMNLP 2022, 2022
35 · 2022
Vision-language models are zero-shot reward models for reinforcement learning
J Rocamonde, V Montesinos, E Nava, E Perez, D Lindner
arXiv preprint arXiv:2310.12921, 2023
17 · 2023
Exploiting novel GPT-4 APIs
K Pelrine, M Taufeeque, M Zając, E McLean, A Gleave
arXiv preprint arXiv:2312.14302, 2023
14 · 2023
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
D Dalrymple, J Skalse, Y Bengio, S Russell, M Tegmark, S Seshia, ...
arXiv preprint arXiv:2405.06624, 2024
9 · 2024
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
A Tamkin, M Taufeeque, ND Goodman
arXiv preprint arXiv:2310.17230, 2023
9 · 2023
STARC: A General Framework For Quantifying Differences Between Reward Functions
J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate
arXiv preprint arXiv:2309.15257, 2023
3 · 2023
An Invariant Learning Characterization of Controlled Text Generation
C Zheng, C Shi, K Vafa, A Feder, DM Blei
arXiv preprint arXiv:2306.00198, 2023
3 · 2023
Few-shot Adaptation Works with UnpredicTable Data
J Shern Chan, M Pieler, J Jao, J Scheurer, E Perez
arXiv preprint arXiv:2208.01009, 2022
3* · 2022
Can Go AIs be adversarially robust?
T Tseng, E McLean, K Pelrine, TT Wang, A Gleave
arXiv preprint arXiv:2406.12843, 2024
2024
Uncovering Latent Human Wellbeing in Language Model Embeddings
P Freire, CC Tan, A Gleave, D Hendrycks, S Emmons
arXiv preprint arXiv:2402.11777, 2024
2024