The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ... arXiv preprint arXiv:2309.12288, 2023 | 82* | 2023 |
Towards Understanding Sycophancy in Language Models M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ... arXiv preprint arXiv:2310.13548, 2023 | 46 | 2023 |
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024 | 25* | 2024 |
Taken out of context: On measuring situational awareness in LLMs L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ... arXiv preprint arXiv:2309.00667, 2023 | 20* | 2023 |
Steering Llama 2 via Contrastive Activation Addition N Rimsky, N Gabrieli, J Schulz, M Tong, E Hubinger, AM Turner arXiv preprint arXiv:2312.06681, 2023 | 8 | 2023 |
Many-shot Jailbreaking C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, ... | 1 | |