Center for AI Safety Compute Cluster
Center for AI Safety
Verified email at safe.ai
Title · Cited by · Year
Humanity's last exam
L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, CBC Zhang, M Shaaban, ...
arXiv preprint arXiv:2501.14249, 2025
Cited by 201 · 2025
StruQ: Defending against prompt injection with structured queries
S Chen, J Piet, C Sitawarin, D Wagner
34th USENIX Security Symposium (USENIX Security 25), 2383-2400, 2025
Cited by 151 · 2025
VHELM: A holistic evaluation of vision language models
T Lee, H Tu, CH Wong, W Zheng, Y Zhou, Y Mai, JS Roberts, M Yasunaga, ...
Advances in Neural Information Processing Systems 37, 140632-140666, 2024
Cited by 50 · 2024
Calibrated self-rewarding vision language models
Y Zhou, Z Fan, D Cheng, S Yang, Z Chen, C Cui, X Wang, Y Li, L Zhang, ...
Advances in Neural Information Processing Systems 37, 51503-51531, 2024
Cited by 81 · 2024
Delta-influence: Unlearning poisons via influence functions
W Li, J Li, P Zeng, CS de Witt, A Prabhu, A Sanyal
arXiv preprint arXiv:2411.13731, 2024
Cited by 7 · 2024
Fictitious synthetic data can improve LLM factuality via prerequisite learning
Y Liu, S Chang, T Jaakkola, Y Zhang
arXiv preprint arXiv:2410.19290, 2024
Cited by 2 · 2024
AttnGCG: Enhancing jailbreaking attacks on LLMs with attention manipulation
Z Wang, H Tu, J Mei, B Zhao, Y Wang, C Xie
arXiv preprint arXiv:2410.09040, 2024
Cited by 15 · 2024
Simplicity prevails: Rethinking negative preference optimization for LLM unlearning
C Fan, J Liu, L Lin, J Jia, R Zhang, S Mei, S Liu
arXiv preprint arXiv:2410.07163, 2024
Cited by 41 · 2024
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
S Lermen, M Dziemian, G Pimpale
arXiv preprint arXiv:2410.10871, 2024
Cited by 2 · 2024
Variational language concepts for interpreting foundation language models
H Wang, S Tan, Z Hong, D Zhang, H Wang
arXiv preprint arXiv:2410.03964, 2024
Cited by 3 · 2024
Margin matching preference optimization: Enhanced model alignment with granular feedback
K Kim, AJ Seo, H Liu, J Shin, K Lee
arXiv preprint arXiv:2410.03145, 2024
Cited by 5 · 2024
Hidden in plain text: Emergence & mitigation of steganographic collusion in LLMs
Y Mathew, O Matthews, R McCarthy, J Velja, CS de Witt, D Cope, ...
arXiv preprint arXiv:2410.03768, 2024
Cited by 13 · 2024
Adversarial robustification via text-to-image diffusion models
D Choi, J Jeong, H Jang, J Shin
European Conference on Computer Vision, 158-177, 2024
Cited by 3 · 2024
Towards reliable evaluation and fast training of robust semantic segmentation models
F Croce, ND Singh, M Hein
European Conference on Computer Vision, 180-197, 2024
Cited by 5 · 2024
Jatmo: Prompt injection defense by task-specific finetuning
J Piet, M Alrashed, C Sitawarin, S Chen, Z Wei, E Sun, B Alomair, ...
European Symposium on Research in Computer Security, 105-124, 2024
Cited by 114 · 2024
LLM-PBE: Assessing data privacy in large language models
Q Li, J Hong, C Xie, J Tan, R Xin, J Hou, X Yin, Z Wang, D Hendrycks, ...
arXiv preprint arXiv:2408.12787, 2024
Cited by 81 · 2024
Protecting against simultaneous data poisoning attacks
N Alex, SA Siddiqui, A Sanyal, D Krueger
arXiv preprint arXiv:2408.13221, 2024
Cited by 3 · 2024
Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective
Y Liu, Y Zhang, T Jaakkola, S Chang
arXiv preprint arXiv:2407.16997, 2024
Cited by 18 · 2024
Latent adversarial training improves robustness to persistent harmful behaviors in LLMs
A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ...
arXiv preprint arXiv:2407.15549, 2024
Cited by 38 · 2024
SHINE: Shielding backdoors in deep reinforcement learning
Z Yuan, W Guo, J Jia, B Li, D Song
Forty-first International Conference on Machine Learning, 2024
Cited by 6 · 2024