# Explaining Away Attacks Against Neural Networks

In this article, we will demonstrate how to fool a neural network into predicting an image of an elephant as a ping-pong ball. We will also see whether we can defend against such attacks by explaining the model’s decisions. Code for this blog post can be found here. You can also check out the poster I created based on this project here.

## The surprising brittleness of neural networks

The image you see above is the result of an adversarial attack against an Inception V3 model (Szegedy et al., 2016), a popular and powerful neural network that produced impressive results on image classification datasets some years ago. As the name suggests, the goal of an adversarial attack is to attack a neural network and cause unintended behaviour. The image on the far left is the “ordinary image”, which the network correctly classifies as an African elephant with 99% confidence. The image on the right is obtained by adding the noise image (middle) to the image on the left. The resulting “perturbed image” still looks like an African elephant, but this time the same network predicts “ping-pong ball” with even higher confidence.

Even though neural networks are producing outstanding results across many tasks, we’re also starting to discover several surprising shortcomings like the one above. The field which studies such vulnerabilities is called adversarial machine learning.

### Key concepts

As an introduction to the field, here are some of the key concepts in adversarial machine learning:

• An adversarial example is an input that is perturbed to cause misbehavior from a target model.
• A white-box attack is an attack where the assailant has knowledge of the target model’s details, including its parameters, output, and architecture.
• A black-box attack, on the other hand, is an attack where the assailant only has query-access to the model, and does not know its parameters. As one can imagine, black-box settings are more realistic in practice.
• A targeted attack is an attack where the assailant tries to solicit some desired behavior from the target model. In the case of classification problems, this translates to causing the model to predict a particular class.
• On the other hand, in an untargeted attack, the assailant simply wishes to cause the model to predict something incorrect.

### The Fast Gradient Sign Method (FGSM)

Numerous ways to attack neural networks already exist. One of the earliest methods is called the Fast Gradient Sign Method (Goodfellow et al., 2014). The core concept of the untargeted version of the attack is simple - neural networks are trained to minimize some loss function, so why not perturb the input image in the direction which increases this loss function? This is equivalent to gradient ascent, where we are leading the neural network’s prediction away from the correct class. Formally, an untargeted FGSM attack computes the following:

$\begin{equation*} \tilde{x} = x + \epsilon \cdot sign(\nabla_x J(x, y)) \end{equation*}$

where $J(x, y)$ is the loss function of the model, $\nabla_x J$ is the gradient of the loss function with respect to the input (instead of the weights of the model, as is done in backpropagation during training), and $\epsilon$ is a scalar parameter which controls the magnitude of the perturbation added to input $x$ to generate the adversarial example $\tilde{x}$. $y$ refers to the ground-truth label of $x$.
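To make the role of $\nabla_x J$ concrete, here is a minimal sketch of a single untargeted step in PyTorch. The tiny linear classifier, the random 3x8x8 input and the label are stand-ins I made up for illustration (they are not the Inception V3 setup from the experiments), and clipping to the valid pixel range is left out to keep the step identical to the equation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for an image classifier: 3x8x8 "images", 5 classes
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5)).eval()

x = torch.rand(1, 3, 8, 8)   # clean input with pixels in [0, 1]
y = torch.tensor([2])        # label treated as the ground truth
epsilon = 0.05

# Ask autograd for gradients with respect to the input, not the weights
x.requires_grad_(True)
loss = F.cross_entropy(model(x), y)   # J(x, y)
loss.backward()
grad_x = x.grad.detach()              # nabla_x J(x, y), same shape as x

# Untargeted FGSM step: move every pixel by epsilon in the direction that raises the loss
x_adv = (x + epsilon * grad_x.sign()).detach()

with torch.no_grad():
    loss_adv = F.cross_entropy(model(x_adv), y)

print(f"loss before: {loss.item():.4f}, loss after: {loss_adv.item():.4f}")
```

Because only the sign of the gradient is used, every pixel moves by exactly $\epsilon$, so the perturbation is bounded by $\epsilon$ in the max-norm regardless of how large the gradient is.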

In the targeted setting, the assailant chooses some target class $y_t$ so that the model outputs the desired prediction. So instead of gradient ascent with the correct class $y$ as was done with the untargeted attack, targeted FGSM does gradient descent with $y_t$. In other words, we perturb the input in the direction that minimizes the distance between the network’s prediction and $y_t$:

$\begin{equation*} \tilde{x} = x - \epsilon \cdot sign(\nabla_x J(x, y_t)) \end{equation*}$

As the equations suggest, FGSM only requires one step of gradient ascent/descent. Hence it is also commonly referred to as a single-step gradient attack. There is also an iterative variant of FGSM, or the Basic Iterative Method (BIM) (Kurakin et al., 2016):

\begin{align*} \tilde{x}_0 &= x \\ \tilde{x}_{i+1} &= \tilde{x}_i - \epsilon \cdot sign(\nabla_{\tilde{x}_i} J(\tilde{x}_i, y_t)) \end{align*}

which is generally a more effective attack than single-step FGSM. In our experiments below, we will be using BIM to generate adversarial examples. First, we implement the FGSM attack below using PyTorch:

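    import torch

    def fgsm_attack(image, epsilon, data_grad, targeted=False):
        # Step against the gradient for a targeted attack, along it otherwise
        direction = -1 if targeted else 1

        # Collect the element-wise sign of the data gradient
        sign_data_grad = data_grad.sign()

        # Create the perturbed image by adjusting each pixel of the input image
        perturbed_image = image + direction * epsilon * sign_data_grad

        # Adding clipping to maintain [0,1] range
        perturbed_image = torch.clamp(perturbed_image, 0, 1)

        return perturbed_image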


The fgsm_attack function takes four parameters - image, which is the image to be perturbed, epsilon, which controls the level of noise added to image, data_grad, which is the gradient of the loss function with respect to the input, and targeted, a boolean telling the function whether the attack is an untargeted or a targeted one. To turn this attack into an iterative one, we simply run FGSM for a specified number of iterations:

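    import torch.nn.functional as F

    # model, data, target (the target class y_t), epsilon and num_iterations are defined elsewhere
    perturbed_data = data.clone().detach()

    for i in range(num_iterations):
        # Track the gradient with respect to the current perturbed input
        perturbed_data.requires_grad = True
        output = F.log_softmax(model(perturbed_data), dim=1)

        # Calculate the loss
        loss = F.nll_loss(output, target)

        # Zero all existing gradients
        model.zero_grad()

        # Calculate gradients of model in backward pass
        loss.backward()

        # Collect the gradient of the loss with respect to the input
        data_grad = perturbed_data.grad.data

        # Call FGSM Attack, then detach so the next iteration starts from a fresh leaf tensor
        perturbed_data = fgsm_attack(perturbed_data, epsilon, data_grad, targeted=True).detach()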


F.log_softmax is the activation function applied to model(perturbed_data), which returns the logits (pre-activation output) of the neural network. According to PyTorch’s documentation, F.nll_loss, which we use to calculate the loss and subsequently the gradient for the perturbation, expects log-probabilities as input, which is why we use log_softmax instead of regular softmax.
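For readers who want something they can run end to end, the following is a self-contained sketch of the targeted iterative attack. The toy linear classifier, the random input and the target class are assumptions made purely so the example executes quickly; targeted_bim is a name I chose for illustration, not a function from the project’s code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def targeted_bim(model, data, target, epsilon, num_iterations):
    """Repeat targeted FGSM steps, re-computing the gradient at each new point."""
    perturbed = data.clone().detach()
    for _ in range(num_iterations):
        perturbed.requires_grad_(True)
        output = F.log_softmax(model(perturbed), dim=1)
        loss = F.nll_loss(output, target)
        # Gradient of the loss with respect to the current perturbed input
        (data_grad,) = torch.autograd.grad(loss, perturbed)
        # Targeted step: descend the loss for the target class, then clip to [0, 1]
        perturbed = torch.clamp(perturbed.detach() - epsilon * data_grad.sign(), 0, 1)
    return perturbed

torch.manual_seed(0)
# Hypothetical toy classifier and input, used only to make the sketch runnable
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5)).eval()
data = torch.rand(1, 3, 8, 8)
target = torch.tensor([4])  # class the attack tries to induce

adv = targeted_bim(model, data, target, epsilon=0.01, num_iterations=10)

with torch.no_grad():
    p_before = F.softmax(model(data), dim=1)[0, 4].item()
    p_after = F.softmax(model(adv), dim=1)[0, 4].item()
print(f"target probability: {p_before:.3f} -> {p_after:.3f}")
```

Using torch.autograd.grad here, rather than loss.backward(), is a design choice of this sketch: it returns the input gradient directly and avoids accumulating gradients in the model’s parameters. With ten steps of size 0.01, no pixel can move by more than 0.1 in total.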

The grid of images above (best seen on a large screen) shows additional examples of targeted iterative FGSM attacks. You will observe that increasing epsilon makes the perturbations more visible but also increases the efficacy of the attacks.

While we have focused on images so far, adversarial attacks have also been proposed for NLP, audio, and reinforcement learning models. It seems that any neural network is vulnerable to adversarial attacks.

## Explaining neural network predictions

As a final prelude to the main topic of the blog post - explaining a neural network’s output on an adversarially perturbed image - I would like to discuss what it actually means to explain the output of some machine learning model.

So far in academia, one of the most common approaches towards explaining the prediction of a machine learning model is to assign scores to each portion of an input, where the score reflects how much that portion of the input contributed to the model’s decision. This is also called additive feature attribution. More formally, an explanation is a linear function $g$:

\begin{align*} g(x') = \phi_0 + \sum_{i=1}^M \phi_ix_i' \end{align*}

where $x_i'$ is a binary variable that represents the original input feature $x_i$, $\phi_i$ represents the contribution of each feature, and $M$ is the number of binary variables. $g(x')$ is the explainer model, parameterized by the $\phi_i$. The term $\phi_i x_i'$ therefore denotes how much a particular interpretable feature contributes to the model’s prediction when $x_i' = 1$. In other words, the parameters of $g$ are the explanation itself: the larger $\phi_i$ is, the more $x_i$ has contributed to the original prediction.

The binary variables $x_i'$ are also called the interpretable representations of $x_i$ and help make explanations more intuitive and understandable to us humans. In short, interpretable representations rely on the segmentation of inputs. We typically look at individual pixels or patches of pixels (also called superpixels) for images and individual tokens for text data (e.g. Bag-of-Words).
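As a small worked example of the additive form, the sketch below evaluates $g$ for four interpretable features. The scores are made-up numbers chosen only to illustrate the arithmetic; they do not come from any real model.

```python
# Hypothetical attribution scores for M = 4 interpretable features (e.g. superpixels)
phi_0 = 0.10                       # base value when no feature is present
phi = [0.50, -0.20, 0.05, 0.30]    # one score per interpretable feature

def g(x_prime):
    """Explanation model: phi_0 plus the scores of the features that are switched on."""
    return phi_0 + sum(p * bit for p, bit in zip(phi, x_prime))

x_prime = [1, 1, 1, 1]   # every interpretable feature present
z_prime = [1, 0, 1, 1]   # second feature removed

print(g(x_prime), g(z_prime))
```

With all features present the explanation adds up to 0.10 + 0.50 - 0.20 + 0.05 + 0.30 = 0.75. Switching off the second feature, whose score is negative, raises the value to 0.95, which is how a negative $\phi_i$ reads as evidence against the prediction.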

How does the explanation model $g$ learn the attribution scores $(\phi_1, \dots, \phi_M)$? There exist several approaches, but we will discuss the one taken by LIME (Local Interpretable Model-agnostic Explanations) (Ribeiro et al., 2016). Denoting the model to be explained as $f$, the key intuition is that we would like to train $g$ such that its predictions match those of $f$ in the neighbourhood of $x$. In other words, we would like to achieve $g(x') \approx f(x)$. Training data for $g$ is generated by perturbing $x'$ to obtain additional interpretable inputs $z'$. According to the LIME paper, this perturbation is done by randomly flipping 1’s in $x'$ (which, remember, is just a binary vector) to 0’s. For text data, for example, this is equivalent to removing individual tokens from the document. The perturbed interpretable input $z'$ is then converted to its original input representation $z$, which is then used to obtain $f(z)$. Collecting many pairs $(z', f(z))$ creates a suitable dataset to train $g$ with.
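The data-generation step can be sketched for a text input as follows. Everything here is illustrative: the black-box f is a toy scoring function I invented, the helper names are hypothetical, and each feature is switched off with an independent coin flip, which is a simplification of the sampling scheme in the LIME paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_interpretable_neighbors(num_features, num_samples, rng):
    """Draw binary vectors z' by randomly switching off features of x' = (1, ..., 1)."""
    z_primes = rng.integers(0, 2, size=(num_samples, num_features))
    z_primes[0] = 1  # keep the unperturbed x' itself as the first sample
    return z_primes

def to_original_space(z_prime, tokens):
    """Map z' back to an input z for f: here, drop the tokens whose bit is 0."""
    return [tok for tok, keep in zip(tokens, z_prime) if keep]

def f(tokens):
    """Hypothetical black-box model: a toy sentiment score that we can only query."""
    return 0.8 * ("great" in tokens) - 0.5 * ("boring" in tokens) + 0.1

tokens = ["a", "great", "but", "boring", "film"]
Z_prime = sample_interpretable_neighbors(len(tokens), num_samples=200, rng=rng)
targets = np.array([f(to_original_space(z, tokens)) for z in Z_prime])

print(Z_prime[:3])
print(targets[:3])
```

Note that f is only ever called on reconstructed inputs; nothing in the sketch looks inside it.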

It is worth noting that, because we only have query access to $f$ in generating training data, we don’t need to know the model’s parameters or specifications. This is where LIME’s model-agnosticism comes from. Finally, $g$ is trained by minimizing the following:

\begin{align*} g^* = \underset{g \in G}{argmin} \left\{ L(f, g, \pi_{x}) + \Omega(g)\right\} \end{align*}

Here, $L$ is called the faithfulness loss. It measures how far the explanation $g$ is from $f$ on the perturbed samples, with each sample weighted by a proximity measure $\pi_{x}$. In practice, we calculate the squared difference between $f$ and $g$, and $\pi_x$ is an exponential (Gaussian-like) kernel centred on $x$. Intuitively speaking, $\pi_x$ allows the model to focus on perturbed samples $z'$ which are closer to $x'$:

\begin{align*} \pi_{x}(z) &= \exp\left(\frac{-D(x, z)^2}{\sigma^2}\right) \\ L(f, g, \pi_x) &= \sum_{z, z'} \pi_{x}(z) \left ( f(z) - g(z') \right )^2 \end{align*}

where $D$ is a distance function, $\sigma$ is the kernel width, and the sum runs over the perturbed samples.

Finally, $\Omega$ represents the complexity of $g$. Minimizing the complexity of the explanation is necessary for it to be simple and intuitive, which are desired qualities of an explanation. Since $g$ is a linear model, the complexity of $g$ is defined as the number of non-zero weights. In other words, the number of non-zero $\phi$s represents the number of features that explain a particular prediction. In practice, this is achieved by applying LASSO regularization when training $g$.
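The sketch below fits a linear $g$ by minimizing the proximity-weighted squared loss with NumPy. Several things are assumptions of mine: the black box is a toy function that happens to be linear in the bits, $D$ is taken to be the fraction of features switched off, $\sigma = 0.75$ is arbitrary, and the complexity term $\Omega$ is dropped, so this is plain weighted least squares rather than the LASSO-regularized fit described above.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5  # number of interpretable features

def f_on_bits(z_prime):
    """Hypothetical black box, queried on the input reconstructed from z'."""
    return 0.1 + 0.8 * z_prime[1] - 0.5 * z_prime[3]

# Perturbed interpretable samples z' around x' = (1, ..., 1) and the model's answers f(z)
Z_prime = rng.integers(0, 2, size=(500, M))
y = np.array([f_on_bits(z) for z in Z_prime])

# Proximity kernel pi_x(z) = exp(-D^2 / sigma^2); D is assumed to be the
# fraction of features that were switched off relative to x'
x_prime = np.ones(M)
D = np.abs(Z_prime - x_prime).sum(axis=1) / M
sigma = 0.75
pi = np.exp(-(D ** 2) / sigma ** 2)

# Minimise sum_z pi_x(z) * (f(z) - g(z'))^2 with g(z') = phi_0 + phi . z'
A = np.hstack([np.ones((len(Z_prime), 1)), Z_prime])  # column of ones for phi_0
sqrt_w = np.sqrt(pi)
coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)
phi_0, phi = coef[0], coef[1:]

print(np.round(phi_0, 3), np.round(phi, 3))
```

Because the toy black box is itself additive, the fit recovers its coefficients, with the three irrelevant features receiving scores of zero up to numerical precision. For a real network the scores would only be a local approximation, and a sparsity penalty would be what keeps the number of non-zero $\phi$s small.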

### The SHAP framework

While LIME gives good intuition on how an explanation model is formulated and trained, we will be using another framework for our ensuing experiments called SHAP (SHapley Additive exPlanations) (Lundberg & Lee, 2017), which was published at NeurIPS 2017 under the title: “A Unified Approach to Interpreting Model Predictions”.

As the title suggests, SHAP is a framework that builds upon various prior works for explainability such as LIME and DeepLIFT (Shrikumar et al., 2017). SHAP also relies on additive feature attribution; in particular, it applies concepts from cooperative game theory to existing frameworks in order to obtain special attribution scores called SHAP values, which satisfy a set of desirable theoretical properties (you can find more details from the original paper).

For image-based explanations, SHAP builds on another attribution method called Integrated Gradients (Sundararajan et al., 2017). In short, the method accumulates the gradients of the network’s output along a path from a baseline (e.g. a blank image) to the input image, while satisfying several desirable theoretical properties. The resulting explanation is a heat map over the input image, where each pixel is assigned an attribution score. The following code and image (taken directly from the SHAP tutorial) is an example of applying SHAP and Integrated Gradients to a pre-trained PyTorch model:

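    import json

    import numpy as np
    import shap
    import torch
    from torchvision import models

    # ImageNet channel statistics expected by torchvision's pre-trained models
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]

    # Pre-processing helper, as in the SHAP tutorial
    def normalize(image):
        if image.max() > 1:
            image /= 255
        image = (image - mean) / std
        # Roll the axes from (N, H, W, C) to the (N, C, H, W) layout PyTorch expects
        return torch.tensor(image.swapaxes(-1, 1).swapaxes(2, 3)).float()

    # load the model
    model = models.vgg16(pretrained=True).eval()

    X, y = shap.datasets.imagenet50()

    X /= 255

    to_explain = X[[39, 41]]

    # load the ImageNet class names
    url = "https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json"
    fname = shap.datasets.cache(url)
    with open(fname) as f:
        class_names = json.load(f)

    # GradientExplainer on the input of an intermediate layer, as in the SHAP tutorial
    e = shap.GradientExplainer((model, model.features[7]), normalize(X))
    shap_values, indexes = e.shap_values(normalize(to_explain), ranked_outputs=2, nsamples=200)

    # get the names for the classes
    index_names = np.vectorize(lambda x: class_names[str(x)][1])(indexes)

    # plot the explanations
    shap_values = [np.swapaxes(np.swapaxes(s, 2, 3), 1, -1) for s in shap_values]

    shap.image_plot(shap_values, to_explain, index_names)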


The red regions above represent positive SHAP values, or pixels that support the class predicted by the model, whereas the blue regions represent negative SHAP values which are evidence against the predicted class. Qualitatively speaking, one can see how the relevant features of each image are assigned high positive scores.
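To show what Integrated Gradients computes underneath, here is a minimal from-scratch sketch on a toy network. It approximates the path integral with a midpoint Riemann sum; the model, input and step count are assumptions for illustration, and this is not how SHAP’s GradientExplainer is implemented internally (as far as I understand, it samples baselines from a background dataset rather than using a single blank image).

```python
import torch
import torch.nn as nn

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum along the
    straight line from `baseline` to `x`."""
    # Interpolation coefficients at the midpoints of `steps` equal intervals
    alphas = (torch.arange(steps, dtype=x.dtype) + 0.5) / steps
    alphas = alphas.view(-1, *([1] * (x.dim() - 1)))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)

    # Gradient of the target-class output at every point on the path
    outputs = model(path)[:, target_class].sum()
    (grads,) = torch.autograd.grad(outputs, path)

    # Average gradient times the input difference gives one score per pixel
    return (x - baseline) * grads.mean(dim=0, keepdim=True)

torch.manual_seed(0)
# Hypothetical toy classifier and input, used only to make the sketch runnable
model = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.Tanh(), nn.Linear(16, 5)
).eval()
x = torch.rand(1, 3, 8, 8)
baseline = torch.zeros_like(x)  # blank (all-black) image

attributions = integrated_gradients(model, x, baseline, target_class=2, steps=200)

with torch.no_grad():
    delta = (model(x) - model(baseline))[0, 2].item()
print(f"sum of attributions: {attributions.sum().item():.4f}")
print(f"f(x) - f(baseline):  {delta:.4f}")
```

The two printed numbers should agree closely. This is the completeness property of Integrated Gradients: the pixel scores add up to the change in the model’s output between the baseline and the input, which is what lets us read them as evidence for or against a class.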

## Can we “explain away” adversarial examples?

As we just saw, SHAP allows us to see how much each pixel contributes to the neural network’s prediction. We can finally address our original question - what happens when we apply SHAP to adversarially perturbed images? Again, we use the GradientExplainer module in SHAP to measure gradient activations for each pixel:

The above image juxtaposes the explanations for a clean image (above) and its adversarial counterpart (below). You will immediately notice a significant discrepancy between the activations for the clean image and those for the perturbed image - SHAP values for the “correct” class are much sharper than those for the targeted class. Here are a few more examples below:

Ambulance vs. sleeping bag

Burrito vs. zebra

Gas pump vs. forklift

It seems that the SHAP values are consistently higher for the correct class. In comparison, the SHAP values for the targeted class are much smaller. This is despite the fact that the model’s output probabilities for the targeted classes are very high (~99%)! The histogram below represents a statistical comparison of the average SHAP values for the top prediction of clean and adversarial images using 1000 ImageNet test images:

What do these results suggest? It appears that, while we cannot detect adversarial examples with our eyes, we can use SHAP and a model’s predictions to determine whether a given prediction “makes sense.” In other words, if there is insufficient evidence for a particular prediction then there is a higher chance for the input to be adversarial. It then makes sense to use a framework like SHAP, since its attributions are measured relative to the classifier’s average prediction. A burrito looks very different from an average zebra - hence there’s little evidence in the form of gradient activations.
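One way to turn this observation into a check is sketched below. This is an illustration of the idea rather than the procedure used for the histogram: the score (mean absolute SHAP value of the top prediction’s attribution map), the threshold, and the synthetic maps standing in for real SHAP output are all assumptions of mine.

```python
import numpy as np

def evidence_score(shap_map):
    """Mean absolute SHAP value of the attribution map for the top prediction."""
    return float(np.mean(np.abs(shap_map)))

def looks_adversarial(shap_map, threshold):
    """Flag an input whose top prediction is backed by little attribution mass."""
    return evidence_score(shap_map) < threshold

# Synthetic attribution maps standing in for real SHAP output (illustration only)
rng = np.random.default_rng(0)
clean_map = rng.normal(0.0, 1e-2, size=(224, 224))
adversarial_map = rng.normal(0.0, 1e-4, size=(224, 224))

threshold = 1e-3  # hypothetical; would need calibrating on held-out clean images

print(looks_adversarial(clean_map, threshold))
print(looks_adversarial(adversarial_map, threshold))
```

In a real pipeline the maps would come from the explainer’s output for the model’s top-ranked class, and the threshold would have to be chosen from the distribution of scores on clean data.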

Of course, this is not a silver bullet in terms of defending against adversarial attacks. Our experiments have so far dealt with adversarial examples where the target classes were chosen randomly. What happens if we target semantically similar classes?

African elephant vs. Indian elephant

Grey fox vs. coyote

Mixing bowl vs. soup bowl

Moving van vs. minivan

It seems that our “defense” becomes a lot less effective. Intuitively, an African elephant looks similar to an Indian elephant and hence the evidence for one class would also support a prediction for the other. While SHAP cannot “explain away” difficult adversarial examples like these, our experiments above reveal the potential of using frameworks for explainability to make neural networks more robust and reliable.

## Summary

In this blog post we covered the following topics:

• The brittleness of neural networks and attacking them with methods like FGSM
• Explaining the predictions of neural networks
• Using SHAP and Integrated Gradients to “explain away” easy adversarial examples

Moreover, these experiments shed light on an apparent disconnect between a neural network’s confidence and the evidence that it provides. That is, even if a neural network is highly confident about its prediction, it may not necessarily provide enough evidence. This is especially important to consider when neural networks are deployed in real-life scenarios where users must take action based on a model’s output. Making machine learning models more reliable is necessary if they are to be deployed successfully in real-life scenarios.

With that we come to the end of the English section of this blog post. Thanks for sticking around! Do check out the References section for the list of academic works I’ve cited. If you have any questions or feedback, feel free to reach out!

# Defending Against Attacks Using the Rationale Behind Neural Network Decisions


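Neural networks are state-of-the-art machine learning models that have produced impressive results in fields such as image recognition, natural language processing, and speech recognition. They are widely used in recent papers and at machine learning startups. But did you know that serious vulnerabilities have recently been discovered even in such high-performing models? In fact, it is easy to make a neural network produce a wrong decision. In this article, we discuss this peculiar vulnerability and test whether it can be mitigated by interpreting the neural network’s decisions.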

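## The vulnerability of neural networks

### Terminology

• White-box attack: an attack in which the attacker has, and exploits, information about the target neural network such as its parameters, architecture, and outputs.
• Black-box attack: unlike a white-box attack, an attack that can only make use of the neural network’s outputs.
• Targeted attack: an attack that aims to make the target neural network classify an adversarial example as a label chosen in advance.
• Untargeted attack: unlike a targeted attack, an attack that simply aims to cause an incorrect decision, regardless of which class is predicted.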

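### The Fast Gradient Sign Method (FGSM)

Since adversarial machine learning became established as a field, a variety of attack methods have been proposed. In this article we introduce one of the most representative ones, the Fast Gradient Sign Method (Goodfellow et al., 2014). Roughly speaking, it finds the direction in which the cost function that is minimized during training increases instead, and adds a suitable amount of noise to the original image in that direction. In other words, if training a neural network uses gradient descent, generating an adversarial example uses gradient ascent. Mathematically, the untargeted attack is expressed as

$\begin{equation*} \tilde{x} = x + \epsilon \cdot sign(\nabla_x J(x, y)) \end{equation*}$

and the targeted attack, with a chosen target label $y_t$, as

$\begin{equation*} \tilde{x} = x - \epsilon \cdot sign(\nabla_x J(x, y_t)) \end{equation*}$

FGSM performs gradient ascent or descent only once. Such a method is therefore sometimes called a single-step gradient attack. Repeating this step several times gives the Basic Iterative Method (BIM) (Kurakin et al., 2016), which generates adversarial examples as follows.

\begin{align*} \tilde{x}_0 &= x \\ \tilde{x}_{i+1} &= \tilde{x}_i - \epsilon \cdot sign(\nabla_{\tilde{x}_i} J(\tilde{x}_i, y_t)) \end{align*}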

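    import torch

    def fgsm_attack(image, epsilon, data_grad, targeted=False):
        # Step against the gradient for a targeted attack, along it otherwise
        direction = -1 if targeted else 1

        # Extract the sign of the gradient (1 or -1)
        sign_data_grad = data_grad.sign()

        # Add noise to the original image
        perturbed_image = image + direction * epsilon * sign_data_grad

        # Clip to keep pixel values in the [0, 1] range
        perturbed_image = torch.clamp(perturbed_image, 0, 1)

        return perturbed_image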


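The parameters of FGSM are as follows.

• image: the image from which the adversarial example is generated
• epsilon: the maximum magnitude of the noise added to the original image
• data_grad: the gradient of the loss function
• targeted: a flag specifying whether the attack is untargeted or targeted

The Basic Iterative Method applies FGSM several times.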

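    import torch.nn.functional as F

    # model, data, target (the target class y_t), epsilon and num_iterations are defined elsewhere
    perturbed_data = data.clone().detach()

    for i in range(num_iterations):
        # Track the gradient with respect to the current perturbed input
        perturbed_data.requires_grad = True
        output = F.log_softmax(model(perturbed_data), dim=1)

        # Loss function
        loss = F.nll_loss(output, target)

        # Reset the gradients
        model.zero_grad()

        # Compute fresh gradients
        loss.backward()

        # Extract the gradient with respect to the data
        data_grad = perturbed_data.grad.data

        # Apply FGSM, then detach so the next iteration starts from a fresh leaf tensor
        perturbed_data = fgsm_attack(perturbed_data, epsilon, data_grad, targeted=True).detach()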


model(perturbed_data) computes the model’s logits. Because PyTorch’s negative log-likelihood loss F.nll_loss expects log-scaled inputs, we apply F.log_softmax, rather than the ordinary softmax activation, to model(perturbed_data).

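## Interpreting neural network decisions

The main topic of this article is defending against attacks by interpreting neural network decisions, but as a final preliminary I would like to explain what interpreting a neural network’s decisions (explainability) means.

### Interpreting decisions through feature attribution

While neural networks perform very well, it is difficult to determine how they arrive at their decisions. Interpreting the decisions of complex machine learning models such as neural networks has therefore become a research topic, and it is also attracting a lot of attention in industry, where neural networks are used in the real world. So what exactly does it mean to interpret a neural network’s decision? Recent research interprets a decision by attributing scores to the features of the data. In other words, it determines which features contributed to the model’s decision and by how much. This approach is called additive feature attribution and is written mathematically as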

\begin{align*} g(x') = \phi_0 + \sum_{i=1}^M \phi_ix_i' \end{align*}

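where $g(x')$ is the model that produces the explanation and each $\phi_i$ is a parameter of $g$. If $x$ is the original data, $x'$ is a binary representation of $x$, which is called an interpretable representation. That is, $\phi_i x_i'$ measures the contribution to the neural network’s decision when $x_i'$ is 1.

This interpretable representation mainly refers to a segmentation of the data that can be expressed in binary form. For image data, for example, one uses individual pixels or groups of pixels (a.k.a. superpixels); for text data, one uses something like Bag-of-Words, which represents the presence or absence of each word.

Note that the model-agnosticism of LIME (Ribeiro et al., 2016) comes from the assumption that we only have query access to the model being explained, $f$. That is, in theory LIME can be applied to any machine learning model. In practice it is also applied to models other than neural networks, such as XGBoost and other decision-tree models, which makes it convenient. The parameters of the LIME model are optimized as follows: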

\begin{align*} g^* = \underset{g \in G}{argmin} \left\{ L(f, g, \pi_{x}) + \Omega(g)\right\} \end{align*}

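$L$ is a loss function expressing the distance between $f$ and $g$, and is called the faithfulness loss. The loss is weighted by a function $\pi_{x}$, which is essentially a Gaussian kernel that expresses the distance between a perturbed sample $z$ and $x$. In short, the closer $z$ is to $x$, the more weight the distance between $f$ and $g$ receives. Mathematically,

\begin{align*} \pi_{x}(z) &= \exp\left(\frac{-D(x, z)^2}{\sigma^2}\right) \\ L(f, g, \pi_x) &= \sum_{z, z'} \pi_{x}(z) \left ( f(z) - g(z') \right )^2 \end{align*}

where the sum runs over the perturbed samples. $\Omega$ is the complexity of $g$. Explanations are preferred to be as simple and clear as possible, so we also want to keep the complexity of the explanation model low. When $g$ is a linear model, the number of weights equals the number of features used in the explanation, so complexity can be measured as the number of non-zero weights. If $k$ is the number of features wanted in the explanation, LIME uses a procedure called K-Lasso, which selects $k$ features and fits their weights by least squares.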

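### The SHAP framework

Besides LIME, there is a framework called SHAP (SHapley Additive exPlanations) (Lundberg & Lee, 2017). It was presented at NeurIPS 2017 and attracted attention for unifying several explanation methods, including LIME. More specifically, the idea is to take a way of computing additive feature attribution scores based on Shapley values, a payoff-distribution scheme from game theory, and apply it to other explanation methods. Compared with other explanation methods, Shapley values have several theoretical properties and are said to yield more accurate explanations (see the paper for details).

SHAP’s explanations for image data are based on a method called Integrated Gradients (Sundararajan et al., 2017). Briefly, the image to be explained is compared against some baseline (such as an all-black image), and the neural network’s gradients between the two are aggregated and displayed as a heat map over the original image. This shows how much each pixel contributed to the decision. The following code is taken from the SHAP tutorial; it applies Integrated Gradients to a PyTorch model and renders the explanation as a heat map.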

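    import json

    import numpy as np
    import shap
    import torch
    from torchvision import models

    # ImageNet channel statistics expected by torchvision's pre-trained models
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]

    # Pre-processing helper, as in the SHAP tutorial
    def normalize(image):
        if image.max() > 1:
            image /= 255
        image = (image - mean) / std
        # Roll the axes from (N, H, W, C) to the (N, C, H, W) layout PyTorch expects
        return torch.tensor(image.swapaxes(-1, 1).swapaxes(2, 3)).float()

    # load the model
    model = models.vgg16(pretrained=True).eval()

    X, y = shap.datasets.imagenet50()

    X /= 255

    to_explain = X[[39, 41]]

    # load the ImageNet class names
    url = "https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json"
    fname = shap.datasets.cache(url)
    with open(fname) as f:
        class_names = json.load(f)

    # GradientExplainer on the input of an intermediate layer, as in the SHAP tutorial
    e = shap.GradientExplainer((model, model.features[7]), normalize(X))
    shap_values, indexes = e.shap_values(normalize(to_explain), ranked_outputs=2, nsamples=200)

    # get the names for the classes
    index_names = np.vectorize(lambda x: class_names[str(x)][1])(indexes)

    # plot the explanations
    shap_values = [np.swapaxes(np.swapaxes(s, 2, 3), 1, -1) for s in shap_values]

    shap.image_plot(shap_values, to_explain, index_names)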


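## Testing whether explanations can defend against attacks

We have finally reached the main topic of this article. As shown above, SHAP lets us compute which parts of an image lead to a decision. Let us see what happens when we apply it to adversarial examples. As above, we use SHAP’s GradientExplainer.

Burrito vs. zebra

Gas pump vs. forklift

Unfortunately, this method does not seem to defend against every attack. So far we chose the labels for the targeted attack at random; when we instead use labels close to the image’s true label, this is what happens.

African elephant vs. Indian elephant

Grey fox vs. coyote

Mixing bowl vs. soup bowl

An African elephant looks visually similar to an Indian elephant, so their Shapley values naturally end up similar. The other pairs show similarly close Shapley values.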

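## Summary

• Computing explanations for the decisions of machine learning models
• Testing whether explanations computed with SHAP can mitigate the vulnerability

This article was about defending models against attacks, but I think it also says something about the relationship between a neural network’s confidence and the explanation obtained with SHAP. Our experiments showed that even when the confidence in a decision is high, there is no guarantee that there is evidence backing it up. This kind of mismatch is a barrier to using neural networks in the real world, so I believe further research on making models more reliable is needed.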

## Acknowledgements

Many thanks to my friends and colleagues - Aaron, Auguste, Crystal, David, Eugene, Kai, Marco, and Rocco - for providing feedback on initial drafts and providing insightful discussions! And a big thank you to my mother and brother for their unwavering support. This is my first long blog post, and I’m looking forward to writing more.

## References

1. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
2. Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. ArXiv Preprint ArXiv:1412.6572.
3. Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial machine learning at scale. ArXiv Preprint ArXiv:1611.01236.
4. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 1135–1144.
5. Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30 (pp. 4765–4774). Curran Associates, Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
6. Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3145–3153.
7. Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic attribution for deep networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3319–3328.