Lee D Sharkey

Cited by

	All	Since 2019
Citations	219	209
h-index	7	7
i10-index	4	4

Citations per year
Year	2017	2018	2019	2020	2021	2022	2023	2024
Citations	1	8	5	10	15	28	88	61

Public access (based on funding mandates)

Available	1 article
Not available	0 articles

Co-authors

Jacob Pfau, NYU, verified email at nyu.edu
Lauro Langosco, University of Cambridge, verified email at cam.ac.uk
Jack Koch, verified email at jbkjr.com

Lee D Sharkey

Apollo Research

Verified email at apolloresearch.ai

AI safety, neural network interpretability


Title	Authors	Publication	Cited by	Year
National palliative care capacities around the world: results from the World Health Organization Noncommunicable Disease Country Capacity Survey	L Sharkey, B Loring, M Cowan, L Riley, EL Krakauer	Palliative medicine 32 (1), 106-113, 2018	75	2018
Goal misgeneralization in deep reinforcement learning	LL Di Langosco, J Koch, LD Sharkey, J Pfau, D Krueger	International Conference on Machine Learning, 12004-12019, 2022	71	2022
Sparse autoencoders find highly interpretable features in language models	H Cunningham, A Ewart, L Riggs, R Huben, L Sharkey	arXiv preprint arXiv:2309.08600, 2023	31	2023
Interpreting neural networks through the polytope lens	S Black, L Sharkey, L Grinsztajn, E Winsor, D Braun, J Merizian, K Parker, ...	arXiv preprint arXiv:2211.12312, 2022	11	2022
Black-Box Access is Insufficient for Rigorous AI Audits	S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ...	arXiv preprint arXiv:2401.14446, 2024	8	2024
Taking features out of superposition with sparse autoencoders	L Sharkey, D Braun, B Millidge	AI Alignment Forum, 2022	8	2022
Objective robustness in deep reinforcement learning	J Koch, L Langosco, J Pfau, J Le, L Sharkey	arXiv preprint arXiv:2105.14111 2, 2021	8	2021
A Causal Framework for AI Regulation and Auditing	L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ...	Preprints, 2024	3	2024
A technical note on bilinear layers for interpretability	L Sharkey	arXiv preprint arXiv:2305.03452, 2023	2	2023
Circumventing interpretability: How to defeat mind-readers	L Sharkey	arXiv preprint arXiv:2212.11415, 2022	2	2022
Sparse Autoencoders Find Highly Interpretable Features in Language Models	R Huben, H Cunningham, LR Smith, A Ewart, L Sharkey	The Twelfth International Conference on Learning Representations, 2023		2023

Articles 1–11