Lee D Sharkey
Apollo Research
Verified email at apolloresearch.ai
Title
Cited by
Year
National palliative care capacities around the world: results from the World Health Organization Noncommunicable Disease Country Capacity Survey
L Sharkey, B Loring, M Cowan, L Riley, EL Krakauer
Palliative medicine 32 (1), 106-113, 2018
75
2018
Goal misgeneralization in deep reinforcement learning
LL Di Langosco, J Koch, LD Sharkey, J Pfau, D Krueger
International Conference on Machine Learning, 12004-12019, 2022
71
2022
Sparse autoencoders find highly interpretable features in language models
H Cunningham, A Ewart, L Riggs, R Huben, L Sharkey
arXiv preprint arXiv:2309.08600, 2023
31
2023
Interpreting neural networks through the polytope lens
S Black, L Sharkey, L Grinsztajn, E Winsor, D Braun, J Merizian, K Parker, ...
arXiv preprint arXiv:2211.12312, 2022
11
2022
Black-Box Access is Insufficient for Rigorous AI Audits
S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ...
arXiv preprint arXiv:2401.14446, 2024
8
2024
Taking features out of superposition with sparse autoencoders
L Sharkey, D Braun, B Millidge
AI Alignment Forum, 2022
8
2022
Objective robustness in deep reinforcement learning
J Koch, L Langosco, J Pfau, J Le, L Sharkey
arXiv preprint arXiv:2105.14111, 2021
8
2021
A Causal Framework for AI Regulation and Auditing
L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ...
Preprints, 2024
3
2024
A technical note on bilinear layers for interpretability
L Sharkey
arXiv preprint arXiv:2305.03452, 2023
2
2023
Circumventing interpretability: How to defeat mind-readers
L Sharkey
arXiv preprint arXiv:2212.11415, 2022
2
2022
Sparse Autoencoders Find Highly Interpretable Features in Language Models
R Huben, H Cunningham, LR Smith, A Ewart, L Sharkey
The Twelfth International Conference on Learning Representations, 2023
2023
Articles 1–11