Saurav Kadavath

Cited by

	All	Since 2019
Citations	4985	4983
h-index	11	11
i10-index	11	11

Citations per year
Year	Citations
2019	14
2020	121
2021	387
2022	736
2023	2312
2024	1400

Public access (based on funding mandates)
1 article available
0 articles not available

Co-authors

Dan Hendrycks, Director of the Center for AI Safety, Verified email at berkeley.edu

Saurav Kadavath

Anthropic

Verified email at anthropic.com

Deep Learning, LLMs, RL


Title	Authors	Venue	Cited by	Year
The many faces of robustness: A critical analysis of out-of-distribution generalization	D Hendrycks, S Basart, N Mu, S Kadavath, F Wang, E Dorundo, R Desai, ...	Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2021	1251	2021
Using self-supervised learning can improve model robustness and uncertainty	D Hendrycks, M Mazeika, S Kadavath, D Song	Advances in Neural Information Processing Systems 32, 2019	966	2019
Training a helpful and harmless assistant with reinforcement learning from human feedback	Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ...	arXiv preprint arXiv:2204.05862, 2022	690	2022
Constitutional AI: Harmlessness from AI feedback	Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ...	arXiv preprint arXiv:2212.08073, 2022	586	2022
Measuring mathematical problem solving with the MATH dataset	D Hendrycks, C Burns, S Kadavath, A Arora, S Basart, E Tang, D Song, ...	arXiv preprint arXiv:2103.03874, 2021	497	2021
Measuring coding challenge competence with APPS	D Hendrycks, S Basart, S Kadavath, M Mazeika, A Arora, E Guo, C Burns, ...	arXiv preprint arXiv:2105.09938, 2021	291	2021
Language models (mostly) know what they know	S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ...	arXiv preprint arXiv:2207.05221, 2022	226	2022
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned	D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ...	arXiv preprint arXiv:2209.07858, 2022	214	2022
Discovering language model behaviors with model-written evaluations	E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ...	arXiv preprint arXiv:2212.09251, 2022	125	2022
The capacity for moral self-correction in large language models	D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ...	arXiv preprint arXiv:2302.07459, 2023	93	2023
Measuring faithfulness in chain-of-thought reasoning	T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ...	arXiv preprint arXiv:2307.13702, 2023	36	2023
Specific versus general principles for constitutional AI	S Kundu, Y Bai, S Kadavath, A Askell, A Callahan, A Chen, A Goldie, ...	arXiv preprint arXiv:2310.13798, 2023	8	2023
Pretraining & reinforcement learning: Sharpening the axe before cutting the tree	S Kadavath, S Paradis, B Yao	arXiv preprint arXiv:2110.02497, 2021	2	2021
DeepChrome 2.0: Investigating and Improving Architectures, Visualizations, & Experiments	S Kadavath, S Paradis, J Yeung	arXiv preprint arXiv:2209.11923, 2022		2022
Trustworthy ML: Robustness and Foresight	S Kadavath			2021


Articles 1–15
