My previous post discusses about an attempt we made to gain higher code coverage by leveraging some machine learning methodologies. I recommend you to read that post first, also you can take a look at this post to get a beginner level idea about machine learning. In previous experimentation have targeted JPEG and PDF parsers. The proposed system proved to work better for JPEG file generation than for PDF input generation. This is reasonable, since PDF parsers are syntactically more sensitive. Slight change in syntax will stop the parser from parsing the input. Hence previously proposed mutation technique resulted in decreased code coverage for generated PDFs. Long story short, we needed a system that learns the grammar/syntax from a corpus of sample inputs. In this post we will discuss an approach to overcome previously seen limitations. We will be targeting Adobe Acrobat Reader for experimentation. First we will take a look at PDF file specifications. Then the Deep Learning algorithms that can be used to infer grammar from a dataset. Finally we will inspect the quality of generated data based on the code coverage. Below mentioned system is integrated with CloudFuzz. Why machine learning? Fuzzing any software that accepts a highly syntactical input requires a grammar which can generate syntactically correct inputs. But designing a grammar is a tedious and time consuming process. Also, the created grammar is highly specific to that particular target software or might cover a very narrow spectrum of softwares. So, our target is to create a very generic system using Deep Learning. Which can be used to learn grammar from a set of samples and generate new inputs for fuzzing accordingly. Anatomy of a PDF file:
Expedition ML4SEC Part – 1
There is no doubt that state-of-the-art systems can be built using machine learning algorithms but at the same time these algorithm poses serious security flaws. An attacker can take advantage of these flaws by creating adversarial inputs resulting in misbehaviour of Machine Learning systems. In this series we will explore these flaws. But to understand about the vulnerabilities we first need to understand how machine learning models work. Hence, I will dedicate this post to understand the basics of machine learning and finish with building a basic machine learning model. Introduction: Training a machine learning model means increasing the performance of that model in a particular task. Model “Learns” from a dataset. Based on this learning process, machine learning algorithms are classified into following major types. Supervised learning: In supervised learning, we feed the algorithm with features and labels. Consider a problem of classifying network packets into malicious or non malicious. Here features could be the attributes of packet such as source IP, destination IP, port, protocol, payload length, flags,etc. And the labels could be 0 or 1 based based on whether the packet is malicious or not. Classification algorithms like Neural Nets and SVM. Unsupervised learning: This type of learning is used when we do not have a labeled samples. Algorithms learns to differentiate the samples based on the features. Suppose we have a huge set of images of 2 persons and want to classify them. Then we feed these images to an unsupervised algorithm. The algorithm will then create two or more clusters of these images(based on features), which can be labelled as person A and person B. Hence these algorithms are sometimes called as “clustering” algorithms. Semi-supervised learning: Semi-supervised learning is used when we have a mixture of labeled and unlabeled data in dataset.
Machine learning for effective fuzzing – CloudFuzz
Problem: Fuzzing a software with random data may or may not discover new bugs. Also, such random attempts do not guarantee of covering the complete code. Hence there should be a system which learns the type and format of input files and generate similar files to attain higher code coverage. Since there could be countless number of file formats, our system should be highly generic and should work for every type of file format. It should not be bounded by a certain type of input. Eg: If the system is working for .doc files then it should also work for JPEGs or PDFs, etc.