Title | Authors | Venue | Cited by | Year |
Robustness may be at odds with accuracy | D Tsipras, S Santurkar, L Engstrom, A Turner, A Madry | arXiv preprint arXiv:1805.12152, 2018 | 1761 | 2018 |
Label-consistent backdoor attacks | A Turner, D Tsipras, A Madry | arXiv preprint arXiv:1912.02771, 2019 | 464* | 2019 |
There is no free lunch in adversarial robustness (but there are unexpected benefits) | D Tsipras, S Santurkar, L Engstrom, A Turner, A Madry | arXiv preprint arXiv:1805.12152 2 (3), 2018 | 91 | 2018 |
Optimal policies tend to seek power | AM Turner, L Smith, R Shah, A Critch, P Tadepalli | arXiv preprint arXiv:1912.01683, 2019 | 51 | 2019 |
Robustness may be at odds with accuracy | D Tsipras, S Santurkar, L Engstrom, A Turner, A Madry | arXiv preprint arXiv:1805.12152 10, 2018 | 23 | 2018 |
Parametrically retargetable decision-makers tend to seek power | A Turner, P Tadepalli | Advances in Neural Information Processing Systems 35, 31391-31401, 2022 | 11 | 2022 |
Steering Llama 2 via contrastive activation addition | N Rimsky, N Gabrieli, J Schulz, M Tong, E Hubinger, AM Turner | arXiv preprint arXiv:2312.06681, 2023 | 8 | 2023 |
On avoiding power-seeking by artificial intelligence | AM Turner | arXiv preprint arXiv:2206.11831, 2022 | 2 | 2022 |
Understanding and Controlling a Maze-Solving Policy Network | U Mini, P Grietzer, M Sharma, A Meek, M MacDiarmid, AM Turner | arXiv preprint arXiv:2310.08043, 2023 | 1 | 2023 |
Formalizing the problem of side effect regularization | AM Turner, A Saxena, P Tadepalli | arXiv preprint arXiv:2206.11812, 2022 | 1 | 2022 |