| Humanity's last exam L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, CBC Zhang, M Shaaban, ... arXiv preprint arXiv:2501.14249, 2025 | 201 | 2025 |
| StruQ: Defending against prompt injection with structured queries S Chen, J Piet, C Sitawarin, D Wagner 34th USENIX Security Symposium (USENIX Security 25), 2383-2400, 2025 | 151 | 2025 |
| VHELM: A holistic evaluation of vision language models T Lee, H Tu, CH Wong, W Zheng, Y Zhou, Y Mai, JS Roberts, M Yasunaga, ... Advances in Neural Information Processing Systems 37, 140632-140666, 2024 | 50 | 2024 |
| Calibrated self-rewarding vision language models Y Zhou, Z Fan, D Cheng, S Yang, Z Chen, C Cui, X Wang, Y Li, L Zhang, ... Advances in Neural Information Processing Systems 37, 51503-51531, 2024 | 81 | 2024 |
| Delta-influence: Unlearning poisons via influence functions W Li, J Li, P Zeng, CS de Witt, A Prabhu, A Sanyal arXiv preprint arXiv:2411.13731, 2024 | 7 | 2024 |
| Fictitious synthetic data can improve LLM factuality via prerequisite learning Y Liu, S Chang, T Jaakkola, Y Zhang arXiv preprint arXiv:2410.19290, 2024 | 2 | 2024 |
| AttnGCG: Enhancing jailbreaking attacks on LLMs with attention manipulation Z Wang, H Tu, J Mei, B Zhao, Y Wang, C Xie arXiv preprint arXiv:2410.09040, 2024 | 15 | 2024 |
| Simplicity prevails: Rethinking negative preference optimization for LLM unlearning C Fan, J Liu, L Lin, J Jia, R Zhang, S Mei, S Liu arXiv preprint arXiv:2410.07163, 2024 | 41 | 2024 |
| Applying Refusal-Vector Ablation to Llama 3.1 70B Agents S Lermen, M Dziemian, G Pimpale arXiv preprint arXiv:2410.10871, 2024 | 2 | 2024 |
| Variational language concepts for interpreting foundation language models H Wang, S Tan, Z Hong, D Zhang, H Wang arXiv preprint arXiv:2410.03964, 2024 | 3 | 2024 |
| Margin matching preference optimization: Enhanced model alignment with granular feedback K Kim, AJ Seo, H Liu, J Shin, K Lee arXiv preprint arXiv:2410.03145, 2024 | 5 | 2024 |
| Hidden in plain text: Emergence & mitigation of steganographic collusion in LLMs Y Mathew, O Matthews, R McCarthy, J Velja, CS de Witt, D Cope, ... arXiv preprint arXiv:2410.03768, 2024 | 13 | 2024 |
| Adversarial robustification via text-to-image diffusion models D Choi, J Jeong, H Jang, J Shin European Conference on Computer Vision, 158-177, 2024 | 3 | 2024 |
| Towards reliable evaluation and fast training of robust semantic segmentation models F Croce, ND Singh, M Hein European Conference on Computer Vision, 180-197, 2024 | 5 | 2024 |
| Jatmo: Prompt injection defense by task-specific finetuning J Piet, M Alrashed, C Sitawarin, S Chen, Z Wei, E Sun, B Alomair, ... European Symposium on Research in Computer Security, 105-124, 2024 | 114 | 2024 |
| LLM-PBE: Assessing data privacy in large language models Q Li, J Hong, C Xie, J Tan, R Xin, J Hou, X Yin, Z Wang, D Hendrycks, ... arXiv preprint arXiv:2408.12787, 2024 | 81 | 2024 |
| Protecting against simultaneous data poisoning attacks N Alex, SA Siddiqui, A Sanyal, D Krueger arXiv preprint arXiv:2408.13221, 2024 | 3 | 2024 |
| Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective Y Liu, Y Zhang, T Jaakkola, S Chang arXiv preprint arXiv:2407.16997, 2024 | 18 | 2024 |
| Latent adversarial training improves robustness to persistent harmful behaviors in LLMs A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ... arXiv preprint arXiv:2407.15549, 2024 | 38 | 2024 |
| SHINE: Shielding backdoors in deep reinforcement learning Z Yuan, W Guo, J Jia, B Li, D Song Forty-first International Conference on Machine Learning, 2024 | 6 | 2024 |