The many faces of robustness: A critical analysis of out-of-distribution generalization D Hendrycks, S Basart, N Mu, S Kadavath, F Wang, E Dorundo, R Desai, ... Proceedings of the IEEE/CVF international conference on computer vision …, 2021 | 1251 | 2021 |
Using self-supervised learning can improve model robustness and uncertainty D Hendrycks, M Mazeika, S Kadavath, D Song Advances in neural information processing systems 32, 2019 | 966 | 2019 |
Training a helpful and harmless assistant with reinforcement learning from human feedback Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ... arXiv preprint arXiv:2204.05862, 2022 | 690 | 2022 |
Constitutional ai: Harmlessness from ai feedback Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022 | 586 | 2022 |
Measuring mathematical problem solving with the MATH dataset D Hendrycks, C Burns, S Kadavath, A Arora, S Basart, E Tang, D Song, ... arXiv preprint arXiv:2103.03874, 2021 | 497 | 2021 |
Measuring coding challenge competence with APPS D Hendrycks, S Basart, S Kadavath, M Mazeika, A Arora, E Guo, C Burns, ... arXiv preprint arXiv:2105.09938, 2021 | 291 | 2021 |
Language models (mostly) know what they know S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ... arXiv preprint arXiv:2207.05221, 2022 | 226 | 2022 |
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ... arXiv preprint arXiv:2209.07858, 2022 | 214 | 2022 |
Discovering language model behaviors with model-written evaluations E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ... arXiv preprint arXiv:2212.09251, 2022 | 125 | 2022 |
The capacity for moral self-correction in large language models D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ... arXiv preprint arXiv:2302.07459, 2023 | 93 | 2023 |
Measuring faithfulness in chain-of-thought reasoning T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ... arXiv preprint arXiv:2307.13702, 2023 | 36 | 2023 |
Specific versus general principles for constitutional ai S Kundu, Y Bai, S Kadavath, A Askell, A Callahan, A Chen, A Goldie, ... arXiv preprint arXiv:2310.13798, 2023 | 8 | 2023 |
Pretraining & reinforcement learning: Sharpening the axe before cutting the tree S Kadavath, S Paradis, B Yao arXiv preprint arXiv:2110.02497, 2021 | 2 | 2021 |
DeepChrome 2.0: Investigating and Improving Architectures, Visualizations, & Experiments S Kadavath, S Paradis, J Yeung arXiv preprint arXiv:2209.11923, 2022 | | 2022 |
Trustworthy ML: Robustness and Foresight S Kadavath | | 2021 |