# Making explainability algorithms more robust with GANs

Explainability algorithms like LIME (Local Interpretable Model-agnostic Explanations) (Ribeiro et al., 2016) have gained much popularity in both academia and industry. For the latter, making complex machine learning models interpretable helps bridge the trust gap that often exists between machine learning systems and their end users, which can be an important factor for successful adoption.

My previous blog posts focused on the robustness of machine learning models against adversarial attacks, showing that neural networks are susceptible to even the slightest perturbations of the input.

It turns out that even explainability algorithms are not immune to similar threats. In this blog post, we will look into a vulnerability of LIME and an attack that exploits it. We will then discuss a proposed defense from a recent paper I worked on, Improving LIME Robustness with Smarter Locality Sampling, to be presented at the 2nd Workshop on Adversarial Learning Methods for Machine Learning and Data Mining @ KDD 2020. Code for this work can be found here.

## Attacking LIME

In a previous blog post, we discussed the basic formulation of additive feature attribution models, a class of explainability algorithms to which LIME belongs. Mathematically, it tries to minimize the following loss function:

\begin{align*} \pi_{x}(z) &= \exp\left(\frac{-D(x, z)^2}{\sigma^2}\right) \\ L(f, g, \pi_x) &= \sum_{z, z' \in Z} \pi_{x}(z) \left( f(z) - g(z') \right)^2 \end{align*}

where the explainer $g$ tries to mimic the target black-box model $f$’s behavior for a set of synthetic samples $z$, and the loss is weighted by $\pi$ which represents the distance between synthetic samples $z$ and the original data $x$ (hence “locally interpretable”). You will also recall that LIME needs to generate these synthetic samples $z$ in order to query $f$ - this acts as a way to understand the behavior of $f$ without peeking into the implementation or parameters of $f$, thereby making LIME a model-agnostic method.
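To make the objective more concrete, here is a minimal sketch of the weighted local fit, assuming we already have the synthetic samples $z$, their interpretable representations $z'$, the black-box predictions $f(z)$, and a Euclidean distance for $D$. The names and the ridge regularizer here are illustrative, not LIME's actual implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_local_surrogate(x, Z, Z_prime, f_of_Z, sigma=0.75):
    """Fit the linear explainer g by weighted least squares.

    x       : instance being explained, shape (d,)
    Z       : synthetic samples in the original feature space, shape (n, d)
    Z_prime : interpretable (e.g. binary) representations of Z, shape (n, d')
    f_of_Z  : black-box predictions f(z) for each synthetic sample, shape (n,)
    """
    # pi_x(z) = exp(-D(x, z)^2 / sigma^2), with D taken as Euclidean distance here
    distances = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distances ** 2) / sigma ** 2)

    # Minimize sum_z pi_x(z) * (f(z) - g(z'))^2 with a (regularized) linear g
    g = Ridge(alpha=1.0)
    g.fit(Z_prime, f_of_Z, sample_weight=weights)
    return g.coef_  # the feature attributions the explainer reports
```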

In particular, LIME generates samples by perturbing the data with noise sampled from a unit Gaussian $\mathcal{N}(0,\,1)$. Unfortunately, it is this sampling procedure that makes LIME susceptible to exploitation. Slack et al. recently discovered that this naive sampling strategy leads to synthetic data $z$ which are out-of-distribution (OOD).
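Conceptually, the tabular sampler boils down to something like the sketch below: draw unit-Gaussian noise and rescale it per feature before adding it to the instance. This is a simplified illustration of the idea, not LIME's exact code path:

```python
import numpy as np

def gaussian_perturbations(x, feature_std, n_samples=5000, seed=0):
    """Draw LIME-style synthetic samples around instance x.

    Noise is sampled from a unit Gaussian N(0, 1) and rescaled by each
    feature's standard deviation; nothing constrains the resulting rows
    to stay on the data manifold.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_samples, x.shape[0]))  # ~ N(0, 1)
    return x + noise * feature_std
```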

As Figure 1 shows, there is a significant discrepancy between LIME samples and original data from the COMPAS dataset. In fact, LIME samples are identifiably OOD, and Slack et al. exploit this flaw to construct their attack, which we refer to as the Fooling LIME attack.

Imagine an adversary that wants to deploy some biased, harmful machine learning model to production. A biased model could be one that makes its decisions solely on sensitive features like sex and race. For example, the algorithm below decides whether a given convict would commit another offense once out of prison, and relies solely on race:

$$a(x)= \begin{cases} 1,& \text{if } x_{race} = \textit{"African American"}\\ 0, & \text{otherwise} \end{cases}$$

Now suppose that the model must go through some auditing process to detect any harmful behavior. Assuming that the implementation of the model is hidden from the auditor for the sake of intellectual property, the auditor may naturally elect to use model-agnostic explainability algorithms like LIME to understand the target model's behavior further.

This is where the attack is executed. Knowing that LIME's synthetic samples are identifiably OOD, the adversary can train a binary classifier $c$ that predicts whether a given input is real or not. The attacker can then use $c$ as part of a scaffolding function; as formulated below, a data sample is routed to the original, biased model only if $c$ decides that the sample is real. If $c$ detects a LIME sample, the scaffolding instead routes it to an innocuous model $\psi$ that doesn't rely on sensitive features.

$$s(x)= \begin{cases} a(x),& \text{if } c(x) \geq 0.5\\ \psi(x), & \text{otherwise} \end{cases}$$

In other words, LIME would only see predictions from the innocuous model and would hence determine that the adversarial model is not biased.
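A minimal sketch of how such a scaffolding could be assembled is shown below; the critic, biased model, and innocuous model are all illustrative stand-ins, and the actual attack in Slack et al. is more elaborate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ood_critic(real_X, lime_style_X):
    """Train c to tell real rows (label 1) apart from LIME-style perturbations (label 0)."""
    X = np.vstack([real_X, lime_style_X])
    y = np.concatenate([np.ones(len(real_X)), np.zeros(len(lime_style_X))])
    critic = RandomForestClassifier(n_estimators=100)
    critic.fit(X, y)
    return critic

def scaffold(x, critic, biased_model, innocuous_model):
    """s(x): real-looking inputs go to the biased model, suspected LIME samples do not."""
    p_real = critic.predict_proba(x.reshape(1, -1))[0, 1]
    return biased_model(x) if p_real >= 0.5 else innocuous_model(x)
```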

The results reported in Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods show that the Fooling LIME attack can circumvent LIME effectively across several real-world datasets (Figure 2).

## Defending LIME with GANs

Can we defend against the Fooling LIME attack and make LIME more reliable? Note that the attack relies on having a critic model that can accurately distinguish between real and fake samples. What if we can generate synthetic samples that are realistic? This leads to our proposal: make LIME more robust by replacing its naive sampling strategy with generative adversarial networks (GAN).

In particular, since we are focusing on tabular datasets, we use the Conditional Tabular GAN (CTGAN) as our sampler, as it models discrete and continuous features better (Xu et al., 2019). Through adversarial learning with the discriminator, CTGAN learns to generate data that is more realistic than what comes out of LIME's sampling method.
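With the open-source `ctgan` package, fitting such a sampler looks roughly like the following. This is a sketch: the file path and column names are illustrative, and the exact class name may differ between package versions:

```python
import pandas as pd
from ctgan import CTGAN  # pip install ctgan (older releases expose CTGANSynthesizer instead)

# Original training data for the model being audited (illustrative path / columns)
train_df = pd.read_csv("compas_train.csv")
discrete_columns = ["sex", "race", "c_charge_degree"]

sampler = CTGAN(epochs=300)
sampler.fit(train_df, discrete_columns)

# These rows replace LIME's Gaussian perturbations when querying the black box
synthetic = sampler.sample(5000)
```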

CTGAN is conditional in that it receives two inputs: a random noise vector common to vanilla GANs, and a conditional vector that selects a particular category among the categorical features. CTGAN is “conditioned” to generate data according to the selected category, which allows for more diversity in the outputs. While in the original CTGAN this conditional vector is sampled randomly, we instead condition CTGAN on the actual categorical features of the given input:

\begin{aligned} m_{d_i}^{k} &= \begin{cases} 1, & \text{if } k = x_{d_i}\\ 0, & \text{otherwise} \end{cases} \quad \text{for } k = 1, \dots, |D_i| \\ m_x &= m_{d_1} \oplus m_{d_2} \oplus \cdots \oplus m_{d_n} \\ \tilde{X} &= g(z, m_x) \end{aligned}

where $\tilde{X}$ are the synthetic samples generated by $g$, the CTGAN generator. This helps CTGAN generate synthetic samples that are more local with respect to the given data sample to be explained. We also use the CTGAN discriminator to filter out any unrealistic samples generated by $g$:

$$\tilde{X}_{filtered} = \{\tilde{x} \mid \tilde{x} \in \tilde{X},\ d(\tilde{x}) \geq \tau\}$$

where $d$ is the discriminator's realism score and $\tau$ is a threshold on it.
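Putting both steps together, conditioned generation followed by discriminator filtering might look like the sketch below. The `generator`, `discriminator`, and `one_hot_mask` helpers are hypothetical stand-ins for CTGAN's internals, named here purely for illustration:

```python
import torch

def sample_locally(x, generator, discriminator, one_hot_mask,
                   n_samples=5000, noise_dim=128, tau=0.5):
    """Generate realistic, filtered neighbors of x for LIME to query.

    one_hot_mask(x) builds m_x: the concatenation of one-hot vectors for x's
    categorical features, so the generator is conditioned on x's categories.
    """
    m_x = one_hot_mask(x).repeat(n_samples, 1)           # m_x = m_{d_1} ⊕ ... ⊕ m_{d_n}
    z = torch.randn(n_samples, noise_dim)                # random noise vector
    with torch.no_grad():
        X_tilde = generator(z, m_x)                      # X~ = g(z, m_x)
        realism = torch.sigmoid(discriminator(X_tilde))  # realism score d(x~)
    return X_tilde[realism.squeeze(-1) >= tau]           # keep only samples with d(x~) >= tau
```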

## Summary

- The vulnerability of the explainability algorithm LIME and an attack that exploits it
- Making LIME more robust with CTGAN and evaluating that defense

If you have any questions or corrections, please feel free to reach out. Thank you for reading!

## Acknowledgements

Many thanks to my co-authors, Eugene Chua, Rocco Hu, and Nicholas Capel!

## References

1. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. http://arxiv.org/abs/1602.04938
2. Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling Tabular data using Conditional GAN. http://arxiv.org/abs/1907.00503
3. Larson, J., Mattu, S., Kirchner, L., & Angwin, J. (2016). How we analyzed the COMPAS recidivism algorithm. ProPublica (5 2016), 9.
4. Asuncion, A., & Newman, D. (2007). UCI machine learning repository.
5. Redmond, M., & Baveja, A. (2002). A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3), 660–678.
6. Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2019). Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. AIES 2020 - Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 180–186. http://arxiv.org/abs/1911.02508
7. Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Anchors: High-Precision Model-Agnostic Explanations. www.aaai.org