Machine learning for effective fuzzing – CloudFuzz

This post looks at machine learning techniques that can make fuzzing of a software system more effective. The resulting system will be integrated with CloudFuzz, an integrated software framework for security-oriented fuzzing. The end goal is a workflow that allows continuous fuzzing and generates reports of security vulnerabilities by analysing crashes in a given piece of software. In CloudFuzz we feed crafted data to a software system and watch the system for crashes. The ultimate aim of fuzzing is to discover bugs and security vulnerabilities in the target software, and the probability of discovering a bug increases with the amount of code exercised by the input. Generating inputs with high code coverage is a tricky task. Here is one attempt to solve this problem using machine learning.



Fuzzing software with random data may or may not discover new bugs, and such random attempts do not guarantee that all of the code is covered.

Hence there should be a system that learns the type and format of the input files and generates similar files to attain higher code coverage.

Since there are countless file formats, the system should be highly generic and should work for every type of file format. It should not be bound to a certain type of input. E.g., if the system works for .doc files, then it should also work for JPEGs, PDFs, etc.


Approach to solve the problem:

The following diagram shows how the system works.

Preprocessor: The preprocessor processes drcov logs (which contain code coverage information) together with the sample input files and generates a dataset in CSV format. The dataset contains predefined features extracted from the sample files and the drcov logs. It is necessary to select features that contribute to code coverage.
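As a rough illustration of the preprocessor step, here is a minimal Python sketch. The feature set (size, byte entropy, number of distinct bytes, ratio of zero bytes) and the function names extract_features and build_dataset are assumptions made for this example; the post does not list the actual features. The basic block count for each sample is assumed to have been read from its drcov log already, so no drcov parsing is shown.

```python
import csv
import io
import math
from collections import Counter


def extract_features(data: bytes) -> dict:
    """Hypothetical, format-agnostic features of one sample file."""
    counts = Counter(data)
    total = len(data)
    entropy = 0.0
    for occurrences in counts.values():
        p = occurrences / total
        entropy -= p * math.log2(p)
    return {
        "size": total,
        "entropy": round(entropy, 4),
        "distinct_bytes": len(counts),
        "zero_ratio": round(counts.get(0, 0) / total, 4) if total else 0.0,
    }


def build_dataset(samples, out) -> None:
    """Write one CSV row per sample.

    samples: iterable of (name, data, bb_count) tuples, where bb_count is
    the basic block count already taken from that sample's drcov log
    (assumption: the log has been parsed elsewhere).
    """
    fieldnames = ["name", "size", "entropy", "distinct_bytes",
                  "zero_ratio", "bb_count"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for name, data, bb_count in samples:
        writer.writerow({"name": name, **extract_features(data),
                         "bb_count": bb_count})


if __name__ == "__main__":
    buffer = io.StringIO()
    # Placeholder sample and count, purely for illustration.
    build_dataset([("sample.bin", b"\x00\x00\xff\xff", 10)], buffer)
    print(buffer.getvalue())
```

Keeping the features independent of any particular file format is what lets the same preprocessor be reused for JPEGs, PDFs and other inputs.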

Learner: Using the .csv file generated by the preprocessor, the learner learns the relationship between the input file format, the data in the file and code coverage. After satisfactory training, the learner is ready to predict the code coverage of new files. Different classification algorithms such as artificial neural networks, support vector machines, etc. can be used for this purpose.
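To make the learner's role concrete, here is a small Python sketch of a coverage predictor. It is a stand-in only: instead of a neural network or SVM it uses a plain k-nearest-neighbour average, chosen because it fits in a few lines of standard-library code. The class name KNNCoveragePredictor and its fit/predict methods are hypothetical, and the numbers in the usage part are placeholders, not measurements.

```python
import math


class KNNCoveragePredictor:
    """Predicts a basic block count as the mean of the k nearest samples.

    Stand-in for the learner; features are used unscaled, which a real
    setup would probably want to normalise first.
    """

    def __init__(self, k: int = 3):
        self.k = k
        self.rows = []

    def fit(self, features, bb_counts):
        # Store (feature_vector, basic_block_count) pairs for lookup.
        self.rows = list(zip(features, bb_counts))
        return self

    def predict(self, feature_vector) -> float:
        if not self.rows:
            raise ValueError("predictor has not been fitted")
        nearest = sorted(
            self.rows,
            key=lambda row: math.dist(row[0], feature_vector),
        )[: self.k]
        return sum(bb for _, bb in nearest) / len(nearest)


if __name__ == "__main__":
    # Placeholder training data: (size, entropy) -> basic block count.
    train_x = [(100, 2.0), (120, 2.5), (5000, 7.5), (5200, 7.8)]
    train_y = [300, 340, 2100, 2250]
    model = KNNCoveragePredictor(k=2).fit(train_x, train_y)
    print(model.predict((110, 2.2)))
```

Whatever model is used, the file generator only needs the predict call, so the learner can be swapped without touching the rest of the system.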

File generator: The file generator runs a metaheuristic evolutionary algorithm to generate and evolve files that attain higher code coverage. It takes the following steps.

  • Crossover: Take two files as parents and generate two new files from them. We can randomly select a crossover point and recombine the parents around it, or we can randomly interchange a block of data between the two files. The basic crossover operation is shown in the diagram below.
  • Mutation: Mutate random data in the file. The degree of mutation should be kept very low and must be tuned by experiment. This step is very important because it introduces new files into the system, which can help cover different parts of the code.
  • Selection: The fitness of every file in the current generation is calculated using the predictor (the trained learner). Only the fittest files (those with higher predicted code coverage) are kept and the others are discarded.
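The crossover and mutation operators described above can be sketched on raw bytes as follows. This is a minimal illustration, not the post's implementation: the function names, the max_block limit and the default mutation rate of 0.01 are assumptions.

```python
import random


def single_point_crossover(a: bytes, b: bytes, rng: random.Random):
    """Cut both parents at one random point and swap the tails."""
    point = rng.randint(0, min(len(a), len(b)))
    return a[:point] + b[point:], b[:point] + a[point:]


def swap_block(a: bytes, b: bytes, rng: random.Random, max_block: int = 64):
    """Exchange one randomly placed block, same offset in both parents."""
    limit = min(len(a), len(b))
    if limit == 0:
        return a, b
    size = rng.randint(1, min(max_block, limit))
    start = rng.randint(0, limit - size)
    end = start + size
    child_a = a[:start] + b[start:end] + a[end:]
    child_b = b[:start] + a[start:end] + b[end:]
    return child_a, child_b


def mutate(data: bytes, rng: random.Random, rate: float = 0.01) -> bytes:
    """Replace each byte with a random value with probability `rate`."""
    out = bytearray(data)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.randrange(256)
    return bytes(out)


if __name__ == "__main__":
    rng = random.Random(0)
    parent_a, parent_b = b"A" * 8, b"B" * 12
    print(single_point_crossover(parent_a, parent_b, rng))
    print(swap_block(parent_a, parent_b, rng, max_block=4))
    print(mutate(parent_a, rng, rate=0.25))
```

Working on bytes rather than on parsed structures keeps the operators independent of the file format, at the cost of sometimes producing files the target parser rejects.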

Instead of using the predictor, we could measure the actual code coverage of each file at run time, but this would significantly increase the execution time of the algorithm.

Multiple iterations of the above steps should sufficiently increase the code coverage of the files. Finally, the evolved files can be used for fuzzing.
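Putting the steps together, one possible shape of the generation loop is sketched below in Python. The function name evolve, its parameters and their defaults are assumptions for illustration. The predict argument stands for the trained predictor; the usage part substitutes a toy fitness (the number of 0xFF bytes) that has nothing to do with real coverage. Keeping parents in the candidate pool during selection is also an assumption of this sketch.

```python
import random


def evolve(seeds, predict, generations=10, population=20,
           mutation_rate=0.01, seed=0):
    """Evolve byte strings towards a higher predicted fitness.

    seeds:   initial sample files as bytes
    predict: callable mapping bytes -> predicted coverage (higher is better)
    """
    rng = random.Random(seed)
    current = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    if len(current) < 2 or population < 2:
        raise ValueError("need at least two distinct seeds and population >= 2")

    for _ in range(generations):
        children = []
        while len(children) < population:
            # Crossover: single random cut point, tails swapped.
            a, b = rng.sample(current, 2)
            point = rng.randint(0, min(len(a), len(b)))
            for child in (a[:point] + b[point:], b[:point] + a[point:]):
                # Mutation: overwrite a small fraction of bytes.
                mutated = bytearray(child)
                for i in range(len(mutated)):
                    if rng.random() < mutation_rate:
                        mutated[i] = rng.randrange(256)
                children.append(bytes(mutated))
        # Selection: parents and children compete; keep the fittest.
        candidates = list(dict.fromkeys(current + children))
        current = sorted(candidates, key=predict, reverse=True)[:population]
    return current


if __name__ == "__main__":
    def toy_fitness(data: bytes) -> int:
        # Placeholder for the predictor, not a real coverage measure.
        return data.count(0xFF)

    seeds = [bytes(16), b"\xff" * 4 + bytes(12), b"\x10" * 16]
    best = evolve(seeds, toy_fitness, generations=5, population=8)
    print(toy_fitness(best[0]), best[0])
```

Because selection only calls predict, each generation costs one prediction per file rather than one instrumented run of the target, which is the speed advantage of using a predictor in the loop.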



The system explained above was tested by generating JPEGs and PDFs against their parser software. For JPEG we used the convert utility on Linux as the target software. convert parses the input file and converts it to the specified format. For PDF, pdfium was used as the target software. It parses a PDF file and can perform operations such as extracting the text from a document, writing the pages to images, etc.

The results are shown in the graphs below. Each plot shows code coverage, i.e. basic block count (on the y-axis), for every output file. Red dashed lines represent the input files and green triangles represent the output files.

The above graph shows the effectiveness of the proposed approach in generating JPEG files with higher code coverage. Almost 50% of the output files have code coverage greater than all the input files. The mutation and prediction worked really well, increasing the code coverage of the files in every generation.

In the case of PDFs, the algorithm did not work as well as it did for JPEGs.

We can see that around 80% of the files cover very little code. A possible reason is that the PDF parser in the target software rejects the generated PDFs. Experimenting with factors like the degree of mutation, reproduction strategies and the number of generations could lead to better results.

But there are still a few files in the output generation with a basic block count significantly higher than that of the input files.


Scope for improvement:

The above system works well with binary file formats, but fuzzing systems that take highly syntactical inputs will be a problem, for example XML, JSON or any programming language parser. A slight change in the input will make the target software reject it. The learner in the above system does not learn the syntax of the input; it only looks for a few patterns and predicts the code coverage. Hence some grammatical inference mechanism could be used to learn the input grammar and generate outputs that conform to it.

The time required to generate files can also be reduced. Experimentation is needed to optimise the parameters of the file generator.

That’s all for this post. Feel free to use the comment section for suggestions and queries.
