SEC4ML Part-2: Adversarial Machine Learning Attacks on ML Models

This is the SEC4ML subsection of the Machine Learning series, where we discuss potential vulnerabilities in Machine Learning applications. SEC4ML covers attacks like Adversarial Learning, Model Stealing, Model Inversion, Data Poisoning, etc. We discussed Model Stealing attacks in SEC4ML Part-1; have a look at it if you haven't already. In this post we will discuss the attack best known among researchers: the Adversarial Machine Learning attack.

An adversarial attack involves generating a specially crafted input with the objective of having it misclassified by the target model. Theoretically, it is always possible for an attacker to generate adversarial samples; the complexity of the attack differs based on the abstraction at which the model operates. Specially designed algorithms are used to perturb a sample input, and these perturbations cause the input to be misclassified into an unintended class. Adversaries can also mount targeted Adversarial Learning attacks, where the input is pushed towards a specific class of their choosing.

But how do you check if your model is vulnerable to adversarial samples? You can implement well-known adversarial sample generation algorithms from published research, or you can use open-source libraries like CleverHans or Foolbox, which already ship implementations of attacks such as FGSM and L-BFGS. We will first take a look at how an Adversarial Learning attack works and then demonstrate a very basic one.
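Before that walkthrough, here is a minimal sketch of the single-step FGSM idea that libraries like CleverHans and Foolbox implement: nudge the input by a small epsilon in the direction of the sign of the gradient of a loss taken with respect to the input. This is only an illustrative sketch under assumptions: the true_class index and epsilon are placeholder values, and it relies on the same TF1-style Keras backend API used in the rest of this post.

import numpy as np
from keras import backend as K
from keras.applications import inception_v3

model = inception_v3.InceptionV3()
ip_layer = model.layers[0].input
op_layer = model.layers[-1].output

true_class = 919   # assumed ImageNet index of the correct class (street sign)
epsilon = 0.02     # assumed perturbation budget in the model's [-1, 1] input range

# FGSM takes a single signed-gradient step that increases the loss of the true class
loss = -K.log(op_layer[0, true_class] + K.epsilon())
grads = K.gradients(loss, ip_layer)[0]
get_grads = K.function([ip_layer, K.learning_phase()], [grads])

def fgsm(x):
    # x: preprocessed image batch of shape (1, 299, 299, 3) scaled to [-1, 1]
    g = get_grads([x, 0])[0]
    return np.clip(x + epsilon * np.sign(g), -1.0, 1.0)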

We have been using optimization algorithms to train Neural Networks. The job of these algorithms is to minimize or maximize the value of some function by changing a set of parameters; while training a Neural Network, the optimizer minimizes the loss by changing the values of the weights and biases. Adversarial Learning uses an intuitively similar idea: an optimization procedure is used to maximize the probability of some target class, but by changing the input instead of the weights. The following update rule is the simplest way to represent an Adversarial Learning attack on an image classification model, where I is the matrix representing the input image:

I_(n+1) = I_n + α · (∂C / ∂I_n)

C is the cost, here the predicted probability of the target class, and α is the learning rate. The pixels of the image are changed in every iteration based on the calculated cost and gradient. After some iterations we can hope that the image has changed enough to yield the required probability for the target class when forward-propagated through the target model. Now let's try to generate adversarial samples for the InceptionV3 neural network. InceptionV3 is a widely used and well-known image recognition model; you can find more about its architecture in the original paper, "Rethinking the Inception Architecture for Computer Vision". It is trained on the ImageNet dataset and classifies its input into one of 1,000 predefined classes.

Whitebox Adversarial Learning Attacks

In a whitebox setting it is assumed that attackers have access to the target application's prediction pipeline: input details, the model's layers and weights, and details of how outputs are inferred. This information makes it much easier to generate adversarial samples. We have full access to the InceptionV3 model, so we will implement the basic algorithm discussed above to generate adversarial samples for it. The goal is to fool InceptionV3 into classifying a street sign image as a mailbox. Let's start with the necessary imports.

import numpy as np
from keras.preprocessing import image
from keras.applications import inception_v3
from keras import backend as K
from PIL import Image

# Note: this walkthrough uses the TF1-style Keras backend API (K.gradients,
# K.function). On TensorFlow 2.x you may need to disable eager execution,
# e.g. tf.compat.v1.disable_eager_execution(), or pin an older TF/Keras version.

ip_img_path = "street_sign.jpg"
adversarial_img_path = "adversarial_street_sign.png"
diff_img_path = "diff_street_sign.png"
target_class = 637   # ImageNet label index for "mailbox"
The target_class is set to 637 which is the label index for our target label mailbox. Now let’s load the model and pre-process the image.

# load model
model = inception_v3.InceptionV3()
ip_layer = model.layers[0].input
op_layer = model.layers[-1].output

# Load the image to attack and scale it to the [-1, 1] range expected by InceptionV3
img = image.load_img(ip_img_path, target_size=(299, 299))
ip_image = image.img_to_array(img)
ip_image /= 255.
ip_image -= 0.5
ip_image *= 2.

# Add a 4th dimension for batch size (as Keras expects)
ip_image = np.expand_dims(ip_image, axis=0)

The following code implements the update rule discussed above. Since we don't want the pixel values to drift outside a valid range, we clip them in every iteration. We also want the target class mailbox to be predicted with a probability of 0.98 or more, so we keep optimizing the input image until that probability is reached.

# set bounds for change during optimization
max_change_above = ip_image + 5.0
max_change_below = ip_image - 5.0

# Create a copy of the input image to perturb
adversarial_image = np.copy(ip_image)

# Cost is the model's predicted probability of the target class; build a backend
# function that returns this cost and its gradient with respect to the input
cost_function = op_layer[0, target_class]
gradient_function = K.gradients(cost_function, ip_layer)[0]
grab_cost_and_gradients_from_model = K.function([ip_layer, K.learning_phase()], [cost_function, gradient_function])
cost = 0.0
learning_rate = 0.9

while cost < 0.98:
    cost, gradients = grab_cost_and_gradients_from_model([adversarial_image, 0])
    adversarial_image += gradients * learning_rate
    # Keep the perturbation within the declared bounds and the valid [-1, 1] range
    adversarial_image = np.clip(adversarial_image, max_change_below, max_change_above)
    adversarial_image = np.clip(adversarial_image, -1.0, 1.0)
    print("\033[92m [+] Predicted probability of target class: {:.8}\033[0m".format(cost))

Finally, let's save the adversarial image, and the difference between the input and adversarial images, to .png files.

# Undo the [-1, 1] scaling back to [0, 255] pixel values
adversarial_img = adversarial_image[0]
adversarial_img /= 2.
adversarial_img += 0.5
adversarial_img *= 255.

im = image.array_to_img(adversarial_img)
im.save(adversarial_img_path)

# Visualise the perturbation: rescale the difference and clamp it to valid pixel values
diff_img = adversarial_image[0] - ip_image[0]
diff_img /= 2.
diff_img += 0.5
diff_img *= 255.

diff_im = Image.fromarray(np.clip(diff_img, 0, 255).astype(np.uint8))
diff_im.save(diff_img_path)

Time to test whether the above code has actually produced an adversarial sample. Here is a small script to predict the class of an input image.

import numpy as np
from keras.preprocessing import image
from keras.applications import inception_v3
from keras.applications.inception_v3 import decode_predictions

model = inception_v3.InceptionV3()

def predict_img(img_path):
    img = image.load_img(img_path, target_size=(299, 299))
    original_image = image.img_to_array(img)
    # Apply the same [-1, 1] scaling used when crafting the adversarial image
    original_image /= 255.
    original_image -= 0.5
    original_image *= 2.
    x = np.expand_dims(original_image, axis=0)
    pred = model.predict(x)
    print('\033[92m'+str(decode_predictions(pred, top=3))+'\033[0m')
    
predict_img('./adversarial_street_sign.png')

And following are the predicted probabilities: the adversarial sample is predicted to be a mailbox with a probability of about 0.997. If you run these experiments yourself, the exact numbers may vary slightly, but the sample should still be classified as the target class.

[[('n03710193', 'mailbox', 0.99780375), ('n06794110', 'street_sign', 0.00044706132), ('n03976657', 'pole', 0.00036352934)]]
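For comparison, you can run the same helper on the original street_sign.jpg (assuming it is still in the working directory); it should still report street_sign as the top prediction:

predict_img('./street_sign.jpg')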

Blackbox Adversarial Learning Attacks

Most of the time the model is deployed in the cloud and the user only has API access to it. Not having direct access to the model's internals makes it more difficult to generate adversarial samples, but not impossible: research has shown that adversarial samples are transferable. This means an adversarial sample crafted for a model similar to the target may also work against the target itself. An attacker can therefore train a local model for a similar use case, generate adversarial samples against it, and those samples are potentially effective against the target model.
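To make the transferability idea concrete, here is a hedged sketch of how an attacker might build a local surrogate: label a set of inputs by querying the target's prediction API, train a local model on those labels, and then run a whitebox attack like the one above against the surrogate. Everything named here is hypothetical: query_target_api, the substitute dataset and the surrogate architecture are placeholders, not a real service or the target's actual model.

import numpy as np
from keras.applications import mobilenet_v2

def query_target_api(batch):
    # Hypothetical helper: send a batch of images to the target's prediction
    # endpoint and return the predicted class indices. The real implementation
    # depends entirely on the target's API.
    raise NotImplementedError

# 1. Collect a substitute dataset and label it with the target model's answers
substitute_images = np.load("substitute_images.npy")      # assumed local dataset of preprocessed images
substitute_labels = query_target_api(substitute_images)   # labels come from the remote target

# 2. Train a local surrogate that mimics the target's input/output behaviour
surrogate = mobilenet_v2.MobileNetV2(weights=None, classes=1000)
surrogate.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
surrogate.fit(substitute_images, substitute_labels, epochs=5, batch_size=32)

# 3. Generate adversarial samples against the surrogate with the whitebox attack
#    from this post; thanks to transferability they can then be tried on the target.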

The story of Adversarial Machine Learning does not stop here. There are more interesting derivatives of this attack that we will cover in upcoming posts: generating printable adversarial patches that can fool object detection models, fooling face recognition systems and automated surveillance cameras with Adversarial Learning, generating adversarial samples for audio and text processing models, and more.

Mitigation

Adversarial Learning attacks are easy to implement in a lab under controlled conditions, but harder to carry out in the real world. That does not mean the threat should be underestimated: it is important to test your model against Adversarial Learning attacks, especially if it is used for critical purposes. Adversarial attacks are hard to defend against because it is hard to theoretically model the actual distribution of adversarial samples.

The first line of defence is to secure your serialised model wherever possible; that makes the attacker's job harder and prevents an easy whitebox attack. You can also try gradient masking, where you hide the predicted probabilities from the end user and reveal only the final prediction (a minimal sketch of this idea appears at the end of this section), which again raises the bar for an attacker. Re-training the model on adversarial samples is a cat-and-mouse game, and the effectiveness of this defence depends on which adversarial algorithm you use to generate the training samples. A distilled model produced by defensive distillation has also been found to be more robust against Adversarial Learning attacks.

Traditional pen-testing methodology will not cover assessment against the threats discussed above. At Payatu, we have orchestrated ways to test ML systems against these attacks and identify potential security and privacy threats. We also provide hands-on training programs specifically designed for security researchers and ML practitioners to educate them on these topics.
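As a small illustration of the gradient-masking idea mentioned above, an inference endpoint can return only the final label instead of the raw probability vector. This is a minimal sketch, assuming a Keras classifier and the ImageNet label decoder; it is not a complete defence on its own, since transfer-based and decision-based attacks still work with label-only access.

from keras.applications.inception_v3 import decode_predictions

def predict_label_only(model, x):
    # Return only the top-1 class name and keep prediction confidences hidden,
    # so a querying attacker cannot follow probability gradients.
    probs = model.predict(x)
    return decode_predictions(probs, top=1)[0][0][1]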

That is all for this post. Feel free to reach out to me or the Payatu team with any queries or suggestions, and follow Payatu on social media to get notified about upcoming posts.
