| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Pretraining language models with human preferences | T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ... | International Conference on Machine Learning, 17506-17533, 2023 | 133 | 2023 |
| Towards automated circuit discovery for mechanistic interpretability | A Conmy, A Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso | Advances in Neural Information Processing Systems 36, 16318-16352, 2023 | 112 | 2023 |
| Eliciting Latent Predictions from Transformers with the Tuned Lens | N Belrose, Z Furman, L Smith, D Halawi, I Ostrovsky, L McKinney, ... | arXiv preprint arXiv:2303.08112, 2023 | 83 | 2023 |
| Training Language Models with Language Feedback at Scale | J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez | arXiv preprint arXiv:2303.16755, 2023 | 71 | 2023 |
| imitation: Clean Imitation Learning Implementations | A Gleave, M Taufeeque, J Rocamonde, E Jenner, SH Wang, S Toyer, ... | arXiv preprint arXiv:2211.11972, 2022 | 67* | 2022 |
| Training language models with language feedback | J Scheurer, JA Campos, JS Chan, A Chen, K Cho, E Perez | The First Workshop on Learning with Natural Language Supervision at ACL, 2022 | 54* | 2022 |
| Evaluating the moral beliefs encoded in LLMs | N Scherrer, C Shi, A Feder, D Blei | Advances in Neural Information Processing Systems 36, 2024 | 50 | 2024 |
| Inverse Scaling: When Bigger Isn't Better | IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ... | arXiv preprint arXiv:2306.09479, 2023 | 48 | 2023 |
| Adversarial Policies Beat Superhuman Go AIs | TT Wang, A Gleave, N Belrose, T Tseng, J Miller, MD Dennis, Y Duan, ... | arXiv preprint arXiv:2211.00241, 2022 | 45* | 2022 |
| Improving Code Generation by Training with Natural Language Feedback | A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ... | arXiv preprint arXiv:2303.16749, 2023 | 42 | 2023 |
| RL with KL penalties is better viewed as Bayesian inference | T Korbak, E Perez, CL Buckley | EMNLP 2022 | 35 | 2022 |
| Vision-language models are zero-shot reward models for reinforcement learning | J Rocamonde, V Montesinos, E Nava, E Perez, D Lindner | arXiv preprint arXiv:2310.12921, 2023 | 17 | 2023 |
| Exploiting Novel GPT-4 APIs | K Pelrine, M Taufeeque, M Zając, E McLean, A Gleave | arXiv preprint arXiv:2312.14302, 2023 | 14 | 2023 |
| Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems | D Dalrymple, J Skalse, Y Bengio, S Russell, M Tegmark, S Seshia, ... | arXiv preprint arXiv:2405.06624, 2024 | 9 | 2024 |
| Codebook Features: Sparse and Discrete Interpretability for Neural Networks | A Tamkin, M Taufeeque, ND Goodman | arXiv preprint arXiv:2310.17230, 2023 | 9 | 2023 |
| STARC: A General Framework For Quantifying Differences Between Reward Functions | J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate | arXiv preprint arXiv:2309.15257, 2023 | 3 | 2023 |
| An Invariant Learning Characterization of Controlled Text Generation | C Zheng, C Shi, K Vafa, A Feder, DM Blei | arXiv preprint arXiv:2306.00198, 2023 | 3 | 2023 |
| Few-shot Adaptation Works with UnpredicTable Data | JS Chan, M Pieler, J Jao, J Scheurer, E Perez | arXiv preprint arXiv:2208.01009, 2022 | 3* | 2022 |
| Can Go AIs be adversarially robust? | T Tseng, E McLean, K Pelrine, TT Wang, A Gleave | arXiv preprint arXiv:2406.12843, 2024 | | 2024 |
| Uncovering Latent Human Wellbeing in Language Model Embeddings | P Freire, CC Tan, A Gleave, D Hendrycks, S Emmons | arXiv preprint arXiv:2402.11777, 2024 | | 2024 |