Meg Tong
Anthropic
Verified email at anthropic.com
Title
Cited by
Year
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ...
arXiv preprint arXiv:2309.12288, 2023
Cited by: 82*; Year: 2023
Towards Understanding Sycophancy in Language Models
M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ...
arXiv preprint arXiv:2310.13548, 2023
Cited by: 46; Year: 2023
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by: 25*; Year: 2024
Taken out of context: On measuring situational awareness in LLMs
L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ...
arXiv preprint arXiv:2309.00667, 2023
Cited by: 20*; Year: 2023
Steering Llama 2 via Contrastive Activation Addition
N Rimsky, N Gabrieli, J Schulz, M Tong, E Hubinger, AM Turner
arXiv preprint arXiv:2312.06681, 2023
Cited by: 8; Year: 2023
Many-shot Jailbreaking
C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, ...
1