Nicholas Schiefer
Anthropic
Verified email at mit.edu
Title
Cited by
Year
Constitutional AI: Harmlessness from AI feedback
Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ...
arXiv preprint arXiv:2212.08073, 2022
Cited by: 747; Year: 2022
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ...
arXiv preprint arXiv:2209.07858, 2022
Cited by: 290; Year: 2022
Toy models of superposition
N Elhage, T Hume, C Olsson, N Schiefer, T Henighan, S Kravec, ...
arXiv preprint arXiv:2209.10652, 2022
Cited by: 169; Year: 2022
Discovering language model behaviors with model-written evaluations
E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ...
arXiv preprint arXiv:2212.09251, 2022
Cited by: 161; Year: 2022
The capacity for moral self-correction in large language models
D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ...
arXiv preprint arXiv:2302.07459, 2023
Cited by: 112; Year: 2023
Language models (mostly) know what they know
S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ...
arXiv preprint arXiv:2207.05221, 2022
Cited by: 104; Year: 2022
Towards monosemanticity: Decomposing language models with dictionary learning
T Bricken, A Templeton, J Batson, B Chen, A Jermyn, T Conerly, N Turner, ...
Transformer Circuits Thread 2, 2023
Cited by: 102; Year: 2023
Towards measuring the representation of subjective global opinions in language models
E Durmus, K Nguyen, TI Liao, N Schiefer, A Askell, A Bakhtin, C Chen, ...
arXiv preprint arXiv:2306.16388, 2023
Cited by: 95; Year: 2023
Towards understanding sycophancy in language models
M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ...
arXiv preprint arXiv:2310.13548, 2023
Cited by: 67; Year: 2023
Measuring progress on scalable oversight for large language models
SR Bowman, J Hyun, E Perez, E Chen, C Pettit, S Heiner, K Lukošiūtė, ...
arXiv preprint arXiv:2211.03540, 2022
Cited by: 58; Year: 2022
Measuring faithfulness in chain-of-thought reasoning
T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ...
arXiv preprint arXiv:2307.13702, 2023
Cited by: 54; Year: 2023
Question decomposition improves the faithfulness of model-generated reasoning
A Radhakrishnan, K Nguyen, A Chen, C Chen, C Denison, D Hernandez, ...
arXiv preprint arXiv:2307.11768, 2023
Cited by: 36; Year: 2023
Universal Computation and Optimal Construction in the Chemical Reaction Network-Controlled Tile Assembly Model
N Schiefer, E Winfree
21st International Conference on DNA Computing and Molecular Programming …, 2015
Cited by: 26; Year: 2015
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by: 24; Year: 2024
FoundationDB Record Layer: A Multi-Tenant Structured Datastore
C Chrysafis, B Collins, S Dugas, J Dunkelberger, M Ehsan, S Gray, ...
Proceedings of the 2019 International Conference on Management of Data, 1787 …, 2019
Cited by: 23; Year: 2019
Superposition, memorization, and double descent
T Henighan, S Carter, T Hume, N Elhage, R Lasenby, S Fort, N Schiefer, ...
Transformer Circuits Thread 6, 24, 2023
Cited by: 18; Year: 2023
Exponentially improving the complexity of simulating the Weisfeiler-Lehman test with graph neural networks
A Aamand, J Chen, P Indyk, S Narayanan, R Rubinfeld, N Schiefer, ...
Advances in Neural Information Processing Systems 35, 27333-27346, 2022
Cited by: 17; Year: 2022
Many-shot jailbreaking
C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, ...
Anthropic, April, 2024
Cited by: 12; Year: 2024
Specific versus general principles for constitutional AI
S Kundu, Y Bai, S Kadavath, A Askell, A Callahan, A Chen, A Goldie, ...
arXiv preprint arXiv:2310.13798, 2023
Cited by: 12; Year: 2023
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv
D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ...
Cited by: 9; Year: 2022