| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Open problems and fundamental limitations of reinforcement learning from human feedback | S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... | arXiv preprint arXiv:2307.15217 | 519 | 2023 |
| Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? | P Hase, M Bansal | arXiv preprint arXiv:2005.01831 | 354 | 2020 |
| GrIPS: Gradient-free, edit-based instruction search for prompting large language models | A Prasad, P Hase, X Zhou, M Bansal | arXiv preprint arXiv:2203.07281 | 179 | 2022 |
| Foundational challenges in assuring alignment and safety of large language models | U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... | arXiv preprint arXiv:2404.09932 | 150 | 2024 |
| Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models | P Hase, M Bansal, B Kim, A Ghandeharioun | Advances in Neural Information Processing Systems 36 | 134 | 2023 |
| Interpretable image recognition with hierarchical prototypes | P Hase, C Chen, O Li, C Rudin | Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7 … | 134 | 2019 |
| FastIF: Scalable influence functions for efficient model interpretation and debugging | H Guo, NF Rajani, P Hase, M Bansal, C Xiong | arXiv preprint arXiv:2012.15781 | 131 | 2020 |
| Rethinking machine unlearning for large language models | S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, Y Yao, CY Liu, X Xu, ... | Nature Machine Intelligence, 1-14 | 130 | 2025 |
| Do language models have beliefs? Methods for detecting, updating, and visualizing model beliefs | P Hase, M Diab, A Celikyilmaz, X Li, Z Kozareva, V Stoyanov, M Bansal, ... | arXiv preprint arXiv:2111.13654 | 124* | 2021 |
| Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? | P Hase, S Zhang, H Xie, M Bansal | arXiv preprint arXiv:2010.04119 | 105 | 2020 |
| The out-of-distribution problem in explainability and search methods for feature importance explanations | P Hase, H Xie, M Bansal | Advances in Neural Information Processing Systems 34 | 99 | 2021 |
| Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks | V Patil, P Hase, M Bansal | arXiv preprint arXiv:2309.17410 | 87 | 2023 |
| When can models learn from explanations? A formal framework for understanding the roles of explanation data | P Hase, M Bansal | arXiv preprint arXiv:2102.02201 | 80 | 2021 |
| Can language models teach? Teacher explanations improve student performance via personalization | S Saha, P Hase, M Bansal | Advances in Neural Information Processing Systems 36 | 44* | 2023 |
| The unreasonable effectiveness of easy training data for hard tasks | P Hase, M Bansal, P Clark, S Wiegreffe | arXiv preprint arXiv:2401.06751 | 22 | 2024 |
| Low-cost algorithmic recourse for users with uncertain cost functions | P Yadav, P Hase, M Bansal | arXiv preprint arXiv:2111.01235 | 21 | 2021 |
| Summarization programs: Interpretable abstractive summarization with neural modular trees | S Saha, S Zhang, P Hase, M Bansal | arXiv preprint arXiv:2209.10492 | 19 | 2022 |
| Are hard examples also harder to explain? A study with human and model-generated explanations | S Saha, P Hase, N Rajani, M Bansal | arXiv preprint arXiv:2211.07517 | 14 | 2022 |
| VisFIS: Visual feature importance supervision with right-for-the-right-reason objectives | Z Ying, P Hase, M Bansal | Advances in Neural Information Processing Systems 35, 17057-17072 | 13 | 2022 |
| System-1.x: Learning to balance fast and slow planning with language models | S Saha, A Prasad, JCY Chen, P Hase, E Stengel-Eskin, M Bansal | arXiv preprint arXiv:2407.14414 | 8 | 2024 |

\* Citation count may include citations to different versions of the article (Google Scholar's merged-version notation).