My previous post discussed our attempt to achieve higher code coverage by applying machine learning techniques. I recommend reading that post first; you can also take a look at this post for a beginner-level introduction to machine learning.

In the previous experiment we targeted JPEG and PDF parsers. The proposed system worked better for JPEG file generation than for PDF input generation. This is reasonable, since PDF parsers are more sensitive to syntax: even a slight change in syntax can stop the parser from parsing the input. Hence the previously proposed mutation technique resulted in decreased code coverage for the generated PDFs. Long story short, we needed a system that learns the grammar/syntax from a corpus of sample inputs.

In this post we will discuss an approach to overcome those limitations. We will target Adobe Acrobat Reader for experimentation. First we will take a look at the PDF file specification, then at the deep learning algorithms that can be used to infer a grammar from a dataset. Finally, we will inspect the quality of the generated data based on code coverage. The system described below is integrated with CloudFuzz.

Why machine learning?

Fuzzing any software that accepts highly structured input requires a grammar that can generate syntactically correct inputs. But designing a grammar is a tedious and time-consuming process. Also, the resulting grammar is highly specific to the particular target software, or covers only a narrow range of software. So our goal is to create a generic system, using deep learning, that can learn a grammar from a set of samples and generate new inputs for fuzzing accordingly.

Anatomy of a PDF file:...