SEC4ML Part-3: Model Skewing Attacks on Machine Learning Models
This is part 3 of the SEC4ML subsection of our machine learning blog series. In SEC4ML we discuss possible attacks on machine learning applications, such as Adversarial Learning, Model Stealing, Model Inversion, and Data Poisoning. Some of these are briefly explained in the introductory post, which you may want to have a look at. In this blog we will look at Model Skewing Attacks.
About Model Skewing Attacks
Many machine learning applications take feedback from users to retrain the model. This usually helps in building a bigger and more diverse dataset. But the same functionality can be abused by an attacker to inject wrongly labeled data into the dataset and skew the model's predictions in a desired direction.
One example is a spam detection application. Such applications often provide a "Mark it as not Spam" or similar button to record feedback from users. But an attacker can send automated requests to mark a certain message as not spam (by abusing the APIs, running a campaign, etc.). This way the attacker gets indirect access to the model training pipeline and can skew the model in the desired direction.
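The core of the abuse can be sketched in a few lines of plain Python. The feedback queue below is a hypothetical stand-in for the application's real feedback pipeline; the point is that a handful of automated submissions quickly dominates the batch that will be used for retraining:

```python
# Toy sketch of feedback flooding (names are hypothetical): each automated
# "not spam" submission lands in the feedback queue that feeds retraining.
target_msg = 'Click following link to get a $500 giftcard for FREE'

# a couple of legitimate feedback entries (message, label); 0.0 -> "not spam"
feedback_queue = [('cheap meds here', 0.0), ('lunch at noon?', 0.0)]

# the attacker's automated requests, all flagging the target as "not spam"
for _ in range(20):
    feedback_queue.append((target_msg, 0.0))

# fraction of the next retraining batch the attacker now controls
attacker_share = sum(msg == target_msg for msg, _ in feedback_queue) / len(feedback_queue)
print(round(attacker_share, 2))  # 0.91 -- over 90% of the batch is attacker-controlled
```

With even modest automation, the attacker's copies swamp the legitimate feedback, which is exactly what we will exploit against the model below.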
Simulating the Attack
In this blog, we will build a dummy spam detection application that provides functionality to mark a message as not spam. Then we will try to skew the model so that any desired message is classified as not spam.
In our previous blog we looked at how a machine learning model is built and used for predictions, so we will skip the model training part here. The only thing to note is that a model has been trained and its weights stored in a file so that it can be loaded for predictions.
Let's have a look at how our application makes predictions and leverages user feedback to retrain the model:
```python
import pickle

import numpy as np
from tensorflow.keras.models import load_model


class SpamClassifier:
    def __init__(self):
        self.model = load_model('./model/trained_model_backup')
        self.vect = pickle.load(open('./model/vect_backup', 'rb'))
        self.new_data = []

    def classify(self, input_msg):
        ip_transformed = self.vect.transform([input_msg])
        pred = self.model.predict([ip_transformed])
        if pred > 0.5:
            return ('spam', pred)
        else:
            return ('not spam', pred)

    def re_train(self):
        new_data = np.array(self.new_data)
        new_data_x = self.vect.transform(new_data[:, 0])
        new_data_y = new_data[:, 1]
        self.model.fit(new_data_x, new_data_y.astype(np.float32),
                       epochs=20, batch_size=128)
        self.model.save('./tmp/trained_model')

    def record_feedback(self, feedback_msg, feedback_label):
        '''feedback = ['message', 'label']'''
        self.new_data.append([feedback_msg, float(feedback_label)])
        if len(self.new_data) >= 20:
            self.re_train()
            self.new_data = []
```
In the SpamClassifier class we initialize the object with a trained model and a vectorizer. The vectorizer performs vectorization, i.e. converts an input text message into a list of numbers. The classify() function runs the input message through the model and predicts the probability of the message being spam; a threshold of 0.5 is applied to this probability to classify the message.

When someone hits the "Mark this as not SPAM" button in the UI, it calls the record_feedback() function. This function records the message from the user, labels it as not spam (i.e. 0), and appends it to the dataset. record_feedback() then checks whether there is enough user feedback to re-train the model. In our case the threshold is 20 (a real-world application would use a much bigger number); once 20 feedback entries have accumulated, the re_train() function is called and the model is trained on the new data.
Let's try to get a prediction from the model:
```python
msg = 'Click following link to get a $500 giftcard for FREE'
classifier = SpamClassifier()

# prediction before attack
label, confidence = classifier.classify(msg)
print('Classification label before attack: ', label)
print('Classification confidence before attack: ', confidence)
```
and the message "Click following link to get a $500 giftcard for FREE" is classified as spam, as expected:
```
Classification label before attack:  spam
Classification confidence before attack:  [[0.99511683]]
```
Now we will simulate what happens when an attacker abuses the feedback loop to poison the dataset with wrong labels for the above message:
```python
# Simulate the attack: submit 20 "not spam" feedbacks for the target message
for feedback in range(20):
    classifier.record_feedback(msg, 0)  # 0 -> not spam
```
We have recorded 20 feedbacks claiming that the target message is not spam. This triggers the re_train() function and trains the model to learn that the target message is not spam.
Let's see if it actually worked:
```python
# prediction after attack
label, confidence = classifier.classify(msg)
print('Classification label after attack: ', label)
print('Classification confidence after attack: ', confidence)
```
and this returns a skewed prediction, which means the attack was successful: the target message is now classified as NOT SPAM.
```
Classification label after attack:  not spam
Classification confidence after attack:  [[0.00703493]]
```
One can also use the approach above to test how easy it is to skew a model before deploying it to production servers. It is also recommended to maintain an isolated Golden Dataset on which engineers test every candidate model for performance: a skewed model will likely perform poorly on the Golden Dataset and hence never make it to production servers.
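A minimal sketch of such a gate might look like the following. The function name and accuracy threshold here are our own choices, and the model is assumed to expose a Keras-style predict() returning spam probabilities:

```python
import numpy as np

def passes_golden_gate(model, golden_x, golden_y, min_accuracy=0.95):
    """Accept a re-trained model only if it still performs well on the
    isolated Golden Dataset; a skewed model should fail this check."""
    probs = model.predict(golden_x)                      # spam probabilities
    preds = (np.asarray(probs) > 0.5).astype(int).ravel()
    accuracy = float((preds == np.asarray(golden_y)).mean())
    return accuracy >= min_accuracy
```

In a pipeline like the one above, re_train() could run this check before saving the new weights and keep the previous model whenever the check fails, so a poisoned retraining batch never reaches production.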
That's all for this post. In upcoming blogs we will quantify how easy or difficult it is to skew various trained models.
At Payatu, our AI/ML domain experts have orchestrated ways to help you secure your intelligent application against esoteric and potentially severe security and privacy threats. You can find more information on our AI Security Assessment service page. Feel free to reach out to me or the Payatu team with any queries or suggestions, and follow Payatu on social media to get notified about upcoming posts.