National palliative care capacities around the world: results from the World Health Organization Noncommunicable Disease Country Capacity Survey L Sharkey, B Loring, M Cowan, L Riley, EL Krakauer Palliative medicine 32 (1), 106-113, 2018 | 75 | 2018 |
Goal misgeneralization in deep reinforcement learning LL Di Langosco, J Koch, LD Sharkey, J Pfau, D Krueger International Conference on Machine Learning, 12004-12019, 2022 | 71 | 2022 |
Sparse autoencoders find highly interpretable features in language models H Cunningham, A Ewart, L Riggs, R Huben, L Sharkey arXiv preprint arXiv:2309.08600, 2023 | 31 | 2023 |
Interpreting neural networks through the polytope lens S Black, L Sharkey, L Grinsztajn, E Winsor, D Braun, J Merizian, K Parker, ... arXiv preprint arXiv:2211.12312, 2022 | 11 | 2022 |
Black-Box Access is Insufficient for Rigorous AI Audits S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ... arXiv preprint arXiv:2401.14446, 2024 | 8 | 2024 |
Taking features out of superposition with sparse autoencoders L Sharkey, D Braun, B Millidge AI Alignment Forum, 2022 | 8 | 2022 |
Objective robustness in deep reinforcement learning J Koch, L Langosco, J Pfau, J Le, L Sharkey arXiv preprint arXiv:2105.14111, 2021 | 8 | 2021 |
A Causal Framework for AI Regulation and Auditing L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ... Preprints, 2024 | 3 | 2024 |
A technical note on bilinear layers for interpretability L Sharkey arXiv preprint arXiv:2305.03452, 2023 | 2 | 2023 |
Circumventing interpretability: How to defeat mind-readers L Sharkey arXiv preprint arXiv:2212.11415, 2022 | 2 | 2022 |
Sparse Autoencoders Find Highly Interpretable Features in Language Models R Huben, H Cunningham, LR Smith, A Ewart, L Sharkey The Twelfth International Conference on Learning Representations, 2023 | | 2023 |