Demystifying Post-hoc Explainability for ML models
3. Approaches for Post-hoc Explanation
- Feature Importance Based Explanations
- LIME (Ribeiro et al., 2016) - explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction (a minimal sketch of this recipe appears after this list).
- SHAP (Lundberg & Lee, 2017) - assigns each feature an importance value for a particular prediction, using Shapley values to fairly attribute the prediction to all the features.
- Rule-Based Explanations
- Anchors (Ribeiro et al., 2018) - explains the behavior of complex models with high-precision rules, representing local, “sufficient” conditions for predictions.
- LORE (Guidotti et al., 2018) - learns a local interpretable predictor on a synthetic neighborhood generated by a genetic algorithm, then derives from the logic of that local predictor a meaningful explanation consisting of a decision rule and a set of counterfactual rules.
- Saliency Maps
- Layer-Wise Relevance Propagation (Bach et al., 2015) - assumes that the classifier can be decomposed into several layers of computation (e.g., parts of the feature extraction from the image or parts of a classification algorithm run on the computed features), and propagates the prediction backwards through these layers so that each input dimension receives a relevance score.
- Integrated Gradients (Sundararajan et al., 2017) - attributes a prediction to input features by accumulating gradients along a straight-line path from a baseline input to the actual input; it needs no instrumentation of the network and can be computed with a few calls to the gradient operation, making it easy for even novice practitioners to apply (see the sketch after this list).
- Prototype-Based Explanations
- Prototype Selection (Bien & Tibshirani, 2011) - a good set of prototypes for a class should capture the full structure of the training examples of that class while taking into consideration the structure of other classes.
- TracIn (Pruthi et al., 2020) - computes the influence of a training example on a prediction made by the model by tracing how the loss on the test point changes during the training process whenever the training example of interest was utilized.
- Counterfactual Explanations
- DiCE (Mothilal et al., 2020) - generates and evaluates diverse sets of counterfactual explanations, using a diversity criterion based on determinantal point processes (a bare-bones counterfactual search is sketched after this list).
- FACE (Poyiadzi et al., 2020) - generates counterfactuals that are coherent with the underlying data distribution and supported by the “feasible paths” of change, which are achievable and can be tailored to the problem at hand.
- Representation-Based Explanations
- Network Dissection (Bau et al., 2017) - quantifies the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts.
- Compositional Explanations (Mu & Andreas, 2020) - automatically explains logical and perceptual abstractions encoded by individual neurons in deep networks, generating explanations by searching over logical forms defined by a set of composition operators over primitive concepts.
- Model Distillation
- LGAE (Tan et al., 2019) - leverages model distillation to learn global additive explanations that describe the relationship between input features and model predictions. These global explanations take the form of feature shapes, which are more expressive than feature attributions.
- Decision Trees as global explanations (Bastani et al., 2017) - distills the complex model into a decision tree; because decision trees are nonparametric and need large amounts of data to fit well, new training data is generated by actively sampling new inputs and labeling them using the complex model.
- Summaries of Counterfactuals
- AReS (Rawal & Lakkaraju, 2020) - constructs global counterfactual explanations that provide an interpretable and accurate summary of recourses for the entire population.
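The feature-importance methods above share a common recipe: perturb the instance, query the black box, and fit a simple weighted model locally. The sketch below is a minimal LIME-style illustration for tabular data, not the official `lime` package; the black-box `predict_proba` callable, the Gaussian perturbations, and the kernel width are all assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_explanation(x, predict_proba, n_samples=5000, kernel_width=0.75, seed=0):
    """Minimal LIME-style sketch: perturb x, weight samples by proximity to x,
    and fit a weighted linear surrogate whose coefficients act as local
    feature importances. `predict_proba` maps a batch of inputs to the
    probability of the class being explained."""
    rng = np.random.default_rng(seed)
    # Perturb around the instance (Gaussian noise here; real LIME perturbs an
    # interpretable representation such as binned or one-hot features).
    Z = x + rng.normal(scale=0.1, size=(n_samples, x.shape[0]))
    y = predict_proba(Z)                      # black-box queries
    # Exponential kernel on distance to the instance being explained.
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / (kernel_width ** 2))
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(Z, y, sample_weight=w)      # weighted local fit
    return surrogate.coef_                    # local feature importances
```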
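Integrated Gradients similarly reduces to a short computation: a Riemann-sum approximation of the path integral of gradients from a baseline to the input. A minimal sketch, assuming a user-supplied `grad_fn` that returns the gradient of the target output with respect to its input:

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    """Approximate IG_i(x) = (x_i - x'_i) * \int_0^1 dF/dx_i(x' + a (x - x')) da
    with a midpoint Riemann sum. `grad_fn(z)` must return dF/dz for the
    output being explained; `baseline` is a reference input (e.g., all zeros)."""
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints of the path
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)
```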
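For the counterfactual methods above, the core object is a minimally changed input that flips the prediction; DiCE adds a diversity objective and FACE adds feasibility constraints. The sketch below shows only the bare-bones idea via random local search (it is neither paper's algorithm), with the `predict` callable and the step size as assumptions.

```python
import numpy as np

def naive_counterfactual(x, predict, step=0.05, max_iters=2000, seed=0):
    """Random local search for a nearby input with a different predicted label.
    This is only the bare-bones core of counterfactual explanation; DiCE adds
    a diversity objective and FACE restricts moves to feasible, high-density paths."""
    rng = np.random.default_rng(seed)
    original = predict(x)
    best, best_dist = None, np.inf
    for _ in range(max_iters):
        candidate = x + rng.normal(scale=step, size=x.shape)
        if predict(candidate) != original:           # prediction flipped
            dist = np.linalg.norm(candidate - x)
            if dist < best_dist:                     # keep the closest flip
                best, best_dist = candidate, dist
    return best  # None if no counterfactual was found
```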
4. Explanations in different data modalities
How do we provide explanations for Tabular / Structured Data?
Feature Importance Based Explanations
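As an illustration of feature-importance explanations on tabular data, the snippet below uses the open-source `shap` package with a placeholder scikit-learn model and dataset; any fitted tree ensemble would work with `TreeExplainer`.

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder model and data; any fitted tree ensemble works with TreeExplainer.
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # fast, exact Shapley values for tree models
shap_values = explainer.shap_values(X)    # one attribution per feature per instance
shap.summary_plot(shap_values, X)         # global summary of the local attributions
```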
How do we provide explanations for textual data?
Saliency Map Visualization
How do we provide explanations for images?
Saliency Map Visualization
Shapley Value Importance
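The simplest saliency map for images is the gradient of the class score with respect to the input pixels (Simonyan et al., 2014). A minimal PyTorch sketch, assuming `model` is any differentiable image classifier that outputs class logits:

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla gradient saliency: the absolute gradient of the target class
    score w.r.t. each input pixel. `image` is a (1, C, H, W) tensor and
    `model` any differentiable classifier returning (1, num_classes) logits."""
    model.eval()
    image = image.detach().clone().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()
    # Max over colour channels gives a single (H, W) saliency map.
    return image.grad.abs().max(dim=1).values.squeeze(0)
```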
How do we provide explanations for time-series data?
Prototype-Based Explanations (e.g., Gee et al., 2019)
5. Evaluation Strategies
Completeness compared to the original model: A proxy model can be evaluated directly according to how closely it approximates the original model being explained (Ribeiro et al., 2016).
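A minimal sketch of such a fidelity check, measuring how often a surrogate reproduces the black box's predictions on held-out data (both predict callables are placeholders):

```python
import numpy as np

def fidelity(black_box_predict, surrogate_predict, X_eval):
    """Fraction of evaluation points on which the surrogate (proxy) model
    agrees with the black box it is meant to explain (1.0 = perfect mimicry)."""
    return float(np.mean(black_box_predict(X_eval) == surrogate_predict(X_eval)))
```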
Disabling irrelevant hidden features: Lertvittayakumjorn et al. (2020) propose a human-in-the-loop technique in which the model's learned features are visualized as word clouds and a human disables the features that are irrelevant or harmful to the classification task, so as to improve the classifier. The disabling mechanism and the word clouds together form the human-debugging framework.
Human evaluation: Humans can evaluate explanations for reasonableness, that is, how well an explanation matches human expectations. Human evaluation can also assess completeness or substitute-task completeness in terms of how well an explanation enables a person to predict the behavior of the original model, or how helpful it is in revealing model biases to a person (Lai & Tan, 2019).
Completeness as measured on a substitute task: Some explanations do not directly explain a model's decisions, but rather some other attribute that can be evaluated; for example, a saliency map can be scored on how well it localizes the object of the predicted class, even though localization is not the task the model was trained on.
Ability to detect models with biases: An explanation that reveals sensitivity to a specific phenomenon (such as the presence of a specific pattern in the input) can be tested for its ability to distinguish models with and without a relevant bias (such as reliance on, or ignorance of, that pattern) (Tong & Kagal, 2020).
6. Current Limitations and Future Directions
Robustness: Model-agnostic, perturbation-based methods are (unsurprisingly) more prone to instability than their gradient-based counterparts (Alvarez-Melis & Jaakkola, 2018). As a concrete example, consider an image classification model that places importance both on salient aspects of the input, i.e., those actually related to the ground-truth class, and on background noise. Suppose, in addition, that those noise artifacts are not uniformly relevant across inputs, while the salient aspects are. Should the explanation include the noisy pixels?
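One way to make this concern concrete is a local-stability check in the spirit of Alvarez-Melis & Jaakkola (2018): perturb the input slightly and measure how much the attribution vector moves relative to how much the input moved. The sketch below is an illustration, with `explain_fn`, the perturbation scale, and the sample count as assumptions:

```python
import numpy as np

def explanation_instability(x, explain_fn, n_samples=50, eps=0.01, seed=0):
    """Estimate a local Lipschitz-style instability score: the largest ratio of
    change in the explanation to change in the input over small perturbations.
    `explain_fn(x)` returns an attribution vector for input x."""
    rng = np.random.default_rng(seed)
    base = explain_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        x_pert = x + rng.normal(scale=eps, size=x.shape)
        ratio = (np.linalg.norm(explain_fn(x_pert) - base)
                 / (np.linalg.norm(x_pert - x) + 1e-12))
        worst = max(worst, ratio)   # larger = less stable explanations
    return worst
```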
Adversarial attacks: By differentiating between data points coming from the input distribution and instances generated via perturbation, an adversary can build a scaffolded classifier that behaves like the original classifier (which may be extremely discriminatory) on real input points, but behaves arbitrarily differently (e.g., looks unbiased and fair) on the perturbed instances, thus effectively fooling LIME or SHAP into generating innocuous explanations (Slack et al., 2020).
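The scaffolding idea is easy to sketch: train a detector that separates real inputs from perturbation-style inputs, then route real inputs to the biased model and perturbed ones to an innocuous model. The class below is a rough illustration of this attack pattern, not the authors' implementation; the two component models and the noise used to mimic perturbations are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ScaffoldedModel:
    """Behaves like `biased_model` on in-distribution inputs but like
    `innocuous_model` on off-manifold perturbations of the kind LIME/SHAP
    generate, so perturbation-based explanations look benign."""

    def __init__(self, biased_model, innocuous_model, X_real, seed=0):
        rng = np.random.default_rng(seed)
        # Synthetic "perturbation-like" points: real data plus noise (an assumption).
        X_pert = X_real + rng.normal(scale=0.3, size=X_real.shape)
        X_det = np.vstack([X_real, X_pert])
        y_det = np.concatenate([np.ones(len(X_real)), np.zeros(len(X_pert))])
        self.detector = RandomForestClassifier(n_estimators=100).fit(X_det, y_det)
        self.biased_model, self.innocuous_model = biased_model, innocuous_model

    def predict(self, X):
        on_manifold = self.detector.predict(X).astype(bool)
        out = self.innocuous_model.predict(X)                 # default: look fair
        out[on_manifold] = self.biased_model.predict(X[on_manifold])  # real inputs: biased
        return out
```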
Manipulability: Explanation maps can be changed to an arbitrary target map (Dombrowski et al., 2019). This is done by applying a barely perceptible perturbation to the input that does not change the output of the neural network: not only the classification result but also the vector of all class probabilities remains (approximately) the same. This finding is clearly problematic if a user, say a medical doctor, expects a robustly interpretable explanation map to rely on in the clinical decision-making process.
Unjustified Counterfactual Explanations: Laugel et al. (2019) propose an intuitive desideratum for more relevant counterfactual explanations, grounded in ground-truth labelled data, that helps generate better explanations. They design a test that highlights the risk of unjustified counterfactual examples disturbing the generation of counterfactual explanations, apply it to several datasets and classifiers, and show that this risk is high.
Faithfulness: Explainable ML methods provide explanations that are not fully faithful to what the original model computes. Explanations must be wrong to some degree: they cannot have perfect fidelity with respect to the original model, because a completely faithful explanation would equal the original model, and one would not need the original model in the first place, only the explanation. An inaccurate (low-fidelity) explanation limits trust in the explanation and, by extension, trust in the black box it is trying to explain (Rudin, 2019).
- It is tempting to use model explainability to gain insights into model fairness; however, existing explainability tools do not reliably indicate whether a model is indeed fair (Begley et al., 2020).
- Estimating the causal effect of (the presence or absence of) a human-interpretable concept on a deep neural net's predictions is one promising direction (Goyal et al., 2019). Identifying vulnerabilities in existing post-hoc explanations and proposing approaches to address these vulnerabilities is another critical research direction going forward!
- Rigorous user studies and evaluations to ascertain the utility of different post-hoc explanation methods in various contexts are extremely critical for the progress of the field (Bansal et al., 2020)!
- Exploring post-hoc explanations for complex ML tasks beyond the usual classification settings can be a good starting point for researchers.
A couple of recent (up to 2021) interesting talks on explainability:
- ALPS tutorial on Explainability for NLP
- NeurIPS 2020 Tutorial
- AACL 2020 Tutorial
- Explainability and Compositionality for Visual Recognition
- Explainable AI for Science and Medicine
 Rishabh Agarwal and Nicholas Frosst and Xuezhou Zhang and Rich Caruana and Geoffrey E. Hinton (2020). Neural Additive Models: Interpretable Machine Learning with Neural Nets. CoRR, abs/2004.13912.
 Trevor Hastie and Robert Tibshirani (1990). Generalized Additive Models. Chapman and Hall/CRC.
 Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin (2018). Anchors: High-Precision Model-Agnostic Explanations. In AAAI (pp. 1527–1535). AAAI Press.
 Sarah Tan and Matvey Soloviev and Giles Hooker and Martin T. Wells (2020). Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable. In FODS (pp. 23–34). ACM.
 Emanuele Albini and Antonio Rago and Pietro Baroni and Francesca Toni (2020). Relation-Based Counterfactual Explanations for Bayesian Network Classifiers. In IJCAI (pp. 451–457). ijcai.org.
 Pang Wei Koh and Percy Liang (2017). Understanding Black-box Predictions via Influence Functions. In ICML (pp. 1885–1894). PMLR.
 Benjamin Letham and Cynthia Rudin and Tyler H. McCormick and David Madigan (2015). Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. CoRR, abs/1511.01644.
 Berk Ustun and Alexander Spangher and Yang Liu (2019). Actionable Recourse in Linear Classification. In FAT (pp. 10–19). ACM.
 Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian and Matt Gardner and Sameer Singh (2019). AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models. In EMNLP/IJCNLP (3) (pp. 7–12). Association for Computational Linguistics.
 Jaesong Lee and Joong-Hwi Shin and Jun-Seok Kim (2017). Interactive Visualization and Manipulation of Attention-based Neural Machine Translation. In EMNLP (System Demonstrations) (pp. 121–126). Association for Computational Linguistics.
 Feng, J. (2018). Pathologies of Neural Models Make Interpretations Difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 3719–3728). Association for Computational Linguistics.
 Xiaochuang Han and Byron C. Wallace and Yulia Tsvetkov (2020). Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions. In ACL (pp. 5553–5563). Association for Computational Linguistics.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. New York: Chapman and Hall.
 Karen Simonyan and Andrea Vedaldi and Andrew Zisserman (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR (Workshop Poster).
 Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In KDD (pp. 1135–1144). ACM.
 Amirata Ghorbani and James Y. Zou (2020). Neuron Shapley: Discovering the Responsible Neurons. In NeurIPS.
 Dominique Mercier and Andreas Dengel and Sheraz Ahmed (2021). PatchX: Explaining Deep Models by Intelligible Pattern Patches for Time-series Classification. CoRR, abs/2102.05917.
 Alan H. Gee and Diego Garcia-Olano and Joydeep Ghosh and David Paydarfar (2019). Explaining Deep Classification of Time-Series Data with Learned Prototypes. CoRR, abs/1904.08935.
 Udo Schlegel and Hiba Arnout and Mennatallah El-Assady and Daniela Oelke and Daniel A. Keim (2019). Towards a Rigorous Evaluation of XAI Methods on Time Series. CoRR, abs/1909.07082.
 Mara Graziani and Vincent Andrearczyk and Stéphane Marchand-Maillet and Henning Müller (2020). Concept attribution: Explaining CNN decisions to physicians. Comput. Biol. Medicine, 123, 103865.
 Oscar Li and Hao Liu and Chaofan Chen and Cynthia Rudin (2017). Deep Learning for Case-based Reasoning through Prototypes: A Neural Network that Explains its Predictions. CoRR, abs/1710.04806.
 David Alvarez-Melis and Tommi S. Jaakkola (2018). On the Robustness of Interpretability Methods. CoRR, abs/1806.08049.
 Dylan Slack and Sophie Hilgard and Emily Jia and Sameer Singh and Himabindu Lakkaraju (2020). Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In AIES (pp. 180–186). ACM.
 Ann-Kathrin Dombrowski and Maximilian Alber and Christopher J. Anders and Marcel Ackermann and Klaus-Robert Müller and Pan Kessel (2019). Explanations can be manipulated and geometry is to blame. In NeurIPS (pp. 13567–13578).
 Leilani H. Gilpin and David Bau and Ben Z. Yuan and Ayesha Bajwa and Michael Specter and Lalana Kagal (2018). Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning. CoRR, abs/1806.00069.
 Vivian Lai and Chenhao Tan (2019). On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection. In FAT (pp. 29–38). ACM.
 Schrasing Tong and Lalana Kagal (2020). Investigating Bias in Image Classification using Model Explanations. CoRR, abs/2012.05463.
 Piyawat Lertvittayakumjorn and Lucia Specia and Francesca Toni (2020). FIND: Human-in-the-Loop Debugging Deep Text Classifiers. In EMNLP (1) (pp. 332–348). Association for Computational Linguistics.
 Thibault Laugel and Marie-Jeanne Lesot and Christophe Marsala and Xavier Renard and Marcin Detyniecki (2019). The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations. In IJCAI (pp. 2801–2807). ijcai.org.
Cynthia Rudin (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206–215.
 Amina Adadi and Mohammed Berrada (2018). Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6, 52138–52160.
 D. V. Carvalho, E. M. Pereira, & Jaime S. Cardoso (2019). Machine Learning Interpretability: A Survey on Methods and Metrics. Electronics, 8, 832.
 Scott M. Lundberg and Su-In Lee (2017). A Unified Approach to Interpreting Model Predictions. In NIPS (pp. 4765–4774).
 Riccardo Guidotti and Anna Monreale and Salvatore Ruggieri and Dino Pedreschi and Franco Turini and Fosca Giannotti (2018). Local Rule-Based Explanations of Black Box Decision Systems. CoRR, abs/1805.10820.
 Tom Begley and Tobias Schwedes and Christopher Frye and Ilya Feige (2020). Explainability for fair machine learning. CoRR, abs/2010.07389.
 Yash Goyal and Uri Shalit and Been Kim (2019). Explaining Classifiers with Causal Concept Effect (CaCE). CoRR, abs/1907.07165.
 Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., & Samek, W. (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE, 10(7), 1–46.
 Mukund Sundararajan and Ankur Taly and Qiqi Yan (2017). Axiomatic Attribution for Deep Networks. In ICML (pp. 3319–3328). PMLR.
 Ramaravind Kommiya Mothilal and Amit Sharma and Chenhao Tan (2020). Explaining machine learning classifiers through diverse counterfactual explanations. In FAT* (pp. 607–617). ACM.
 David Bau and Bolei Zhou and Aditya Khosla and Aude Oliva and Antonio Torralba (2017). Network Dissection: Quantifying Interpretability of Deep Visual Representations. In CVPR (pp. 3319–3327). IEEE Computer Society.
Sarah Tan and Rich Caruana and Giles Hooker and Paul Koch and Albert Gordo (2019). Learning Global Additive Explanations for Neural Nets Using Model Distillation.
 Kaivalya Rawal and Himabindu Lakkaraju (2020). Beyond Individualized Recourse: Interpretable and Interactive Summaries of Actionable Recourses. In NeurIPS.
 Rafael Poyiadzi and Kacper Sokol and Raul Santos-Rodriguez and Tijl De Bie and Peter A. Flach (2020). FACE: Feasible and Actionable Counterfactual Explanations. In AIES (pp. 344–350). ACM.
 Bien, J., & Tibshirani, R. (2011). Prototype selection for interpretable classification. The Annals of Applied Statistics, 5(4), 2403–2424.
 Garima Pruthi and Frederick Liu and Satyen Kale and Mukund Sundararajan (2020). Estimating Training Data Influence by Tracing Gradient Descent. In NeurIPS.
 Jesse Mu and Jacob Andreas (2020). Compositional Explanations of Neurons. In NeurIPS.
 Osbert Bastani and Carolyn Kim and Hamsa Bastani (2017). Interpreting Blackbox Models via Model Extraction. CoRR, abs/1705.08504.
 Gagan Bansal and Tongshuang Wu and Joyce Zhou and Raymond Fok and Besmira Nushi and Ece Kamar and Marco Tulio Ribeiro and Daniel S. Weld (2020). Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. CoRR, abs/2006.14779.