Saurav Kadavath
Anthropic
Verified email at anthropic.com
Title
Cited by
Year
The many faces of robustness: A critical analysis of out-of-distribution generalization
D Hendrycks, S Basart, N Mu, S Kadavath, F Wang, E Dorundo, R Desai, ...
Proceedings of the IEEE/CVF international conference on computer vision …, 2021
Cited by: 1251; Year: 2021
Using self-supervised learning can improve model robustness and uncertainty
D Hendrycks, M Mazeika, S Kadavath, D Song
Advances in neural information processing systems 32, 2019
Cited by: 966; Year: 2019
Training a helpful and harmless assistant with reinforcement learning from human feedback
Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ...
arXiv preprint arXiv:2204.05862, 2022
Cited by: 690; Year: 2022
Constitutional AI: Harmlessness from AI feedback
Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ...
arXiv preprint arXiv:2212.08073, 2022
Cited by: 586; Year: 2022
Measuring mathematical problem solving with the MATH dataset
D Hendrycks, C Burns, S Kadavath, A Arora, S Basart, E Tang, D Song, ...
arXiv preprint arXiv:2103.03874, 2021
Cited by: 497; Year: 2021
Measuring coding challenge competence with APPS
D Hendrycks, S Basart, S Kadavath, M Mazeika, A Arora, E Guo, C Burns, ...
arXiv preprint arXiv:2105.09938, 2021
Cited by: 291; Year: 2021
Language models (mostly) know what they know
S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ...
arXiv preprint arXiv:2207.05221, 2022
Cited by: 226; Year: 2022
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ...
arXiv preprint arXiv:2209.07858, 2022
Cited by: 214; Year: 2022
Discovering language model behaviors with model-written evaluations
E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ...
arXiv preprint arXiv:2212.09251, 2022
Cited by: 125; Year: 2022
The capacity for moral self-correction in large language models
D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ...
arXiv preprint arXiv:2302.07459, 2023
Cited by: 93; Year: 2023
Measuring faithfulness in chain-of-thought reasoning
T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ...
arXiv preprint arXiv:2307.13702, 2023
Cited by: 36; Year: 2023
Specific versus general principles for constitutional AI
S Kundu, Y Bai, S Kadavath, A Askell, A Callahan, A Chen, A Goldie, ...
arXiv preprint arXiv:2310.13798, 2023
Cited by: 8; Year: 2023
Pretraining & reinforcement learning: Sharpening the axe before cutting the tree
S Kadavath, S Paradis, B Yao
arXiv preprint arXiv:2110.02497, 2021
Cited by: 2; Year: 2021
DeepChrome 2.0: Investigating and Improving Architectures, Visualizations, & Experiments
S Kadavath, S Paradis, J Yeung
arXiv preprint arXiv:2209.11923, 2022
Year: 2022
Trustworthy ML: Robustness and Foresight
S Kadavath
Year: 2021