CloudFuzz is an integrated software framework for security-oriented fuzzing. Its end goal is to provide a workflow for continuous fuzzing that generates reports of software security vulnerabilities by analysing crashes in a given piece of software. In CloudFuzz we feed crafted data to a software system and monitor the system for crashes. The ultimate aim of fuzzing is to discover bugs and security vulnerabilities in the target software. To trigger a bug hidden in the deepest, darkest corners of a program, we need an input that exercises those pieces of code. Hence it is reasonable to assume that the probability of discovering a bug is directly proportional to the amount of code covered by the supplied input.
My previous post discusses an attempt we made to gain higher code coverage by leveraging some machine learning methodologies. I recommend reading that post first; you can also take a look at this post for a beginner-level introduction to machine learning. In the previous experiments we targeted JPEG and PDF parsers. The proposed system worked better for JPEG file generation than for PDF input generation. This is reasonable, since PDF parsers are syntactically more sensitive: a slight change in syntax will stop the parser from parsing the input. Hence the previously proposed mutation technique resulted in decreased code coverage for generated PDFs. Long story short, we needed a system that learns the grammar/syntax from a corpus of sample inputs.
In this post we discuss an approach to overcome those limitations, using Adobe Acrobat Reader as the experimental target. First we will take a look at the PDF file specification, then at the deep learning algorithms that can be used to infer a grammar from a dataset, and finally we will inspect the quality of the generated data based on the code coverage it achieves.
The system described below is integrated with CloudFuzz.
Why machine learning?
Fuzzing any software that accepts a highly syntactical input requires a grammar that can generate syntactically correct inputs. But designing such a grammar is a tedious and time-consuming process, and the resulting grammar is highly specific to one particular target, or at best covers a very narrow spectrum of software. So our goal is to build a very generic system, using deep learning, that learns a grammar from a set of samples and generates new fuzzing inputs accordingly.
Anatomy of a PDF file:
I will be referring to Adobe’s documentation for PDF 1.7. Below is how a PDF should look according to that documentation.
Header:
The header is a one-line field that specifies the version of PDF and starts with “%PDF-”.
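For example, the header of a file using PDF version 1.6 reads:

%PDF-1.6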
Body:
Next comes the body. The body of a PDF contains PDF objects, which hold the actual content (the part visible to the end user) and information about it. An object starts with “<n1> <n2> obj”, where n1 and n2 are two numbers, and ends with “endobj”.
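Consider a minimal example (the dictionary contents here are illustrative):

11 0 obj
<< /Type /Page /Parent 3 0 R >>
endobj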
Here 11 acts as the name/identifier of the object, 0 is its generation number, and “obj” marks it as an object. The content of the object is enclosed in “<< >>”, meaning it is a type of object called a dictionary object, which represents information as key-value pairs. There are other types of objects, such as array objects, name objects, string objects, etc. To know more about these objects, please refer to the above-mentioned document.
xref:
Next is the cross-reference table. A cross-reference table looks as follows.
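Here is an abridged table consistent with the example discussed below (object 0 is always a free entry with generation number 65535):

xref
0 25
0000000000 65535 f
0000020761 00000 n
...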
It starts with “xref”. The next line says that the table holds 25 entries in total, starting from object 0. Every object must have an entry associated with it. Each entry gives the offset at which the object is located, followed by the generation number of that object, and ends with a flag: if the flag is “n” the object is in use; if it is “f” the object is free. In our example, the second entry refers to an object at offset 20761 that has no revisions and is in use (since its flag is “n”). Only objects with flag “n” will be parsed and displayed by the PDF reader.
Trailer:
The trailer lies at the end of the file and consists of a dictionary. In the following example, the dictionary says that there are 25 entries in the xref table, /Root specifies that object 6 acts as the root object, and /Info references the information dictionary. The number “76383” is the offset at which the xref table is located. The file ends with “%%EOF”.
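Putting that together, a trailer matching this description would look like the following (the /Info object number is illustrative):

trailer
<< /Size 25 /Root 6 0 R /Info 24 0 R >>
startxref
76383
%%EOF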
Dataset generation for learning:
From the PDF specification it is clear that the file content is stored in the form of objects, so we will use objects from a corpus of PDFs to train our machine learning model. We have extracted the different types of objects from a huge dump of PDF files.
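A minimal Python sketch of this extraction step might look as follows. This is illustrative, not our exact script: the regex is a simplification that ignores complications such as objects inside object streams, and the file paths are placeholders.

import re
from pathlib import Path

# Matches "<n1> <n2> obj ... endobj" spans in raw PDF bytes.
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj.*?endobj", re.DOTALL)

def extract_objects(pdf_path):
    data = Path(pdf_path).read_bytes()
    return [m.group(0) for m in OBJ_RE.finditer(data)]

# Collect objects from a dump of PDFs into one training corpus.
with open("objects.txt", "wb") as corpus:
    for pdf in Path("pdf_dump").glob("*.pdf"):
        for obj in extract_objects(pdf):
            corpus.write(obj + b"\n")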
Building Learner:
The problem explained above can be classified as a sequence prediction problem, so LSTMs (Long Short-Term Memory cells) can be used to solve it. The ability of LSTMs, and RNNs (Recurrent Neural Networks) in general, to generate sequences has already proven to be unreasonably good. Experiments show that RNNs are really effective: when trained on the Linux source code, an RNN was able to generate plausible C source code. Here is just one sample as shown by Andrej.
It is interesting to see how the generated code closes every opened bracket, defines a function and even generates comments. Of course the code is not compilable, since it tends to use undeclared variables, but this is encouraging enough to use LSTMs to generate PDF objects.
Following is a representation of an LSTM, showing that the cells are applied recurrently to process the input, which makes them suitable for sequential prediction. (Image source)
The yellow blocks represent neural network layers. The forget layer (first yellow block from the left) is the actual essence of the LSTM: it decides what part of the previously seen sequence matters for predicting the rest. You can check out Colah’s representation of LSTMs to know more about the mathematics behind it.
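As a rough sketch, such a character-level model can be expressed in a few lines of Keras. The layer size, sequence length and vocabulary size below are illustrative placeholders, not our tuned hyperparameters.

from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 40      # characters of context fed to the network (assumed value)
VOCAB_SIZE = 96   # size of the character vocabulary (assumed value)

# Predict the next character from the previous SEQ_LEN one-hot encoded characters.
model = keras.Sequential([
    layers.LSTM(128, input_shape=(SEQ_LEN, VOCAB_SIZE)),
    layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")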
Experimentation:
After fine-tuning the hyperparameters of the LSTM and the optimizer, the network took around two hours to train. The trained model is then used to generate PDF objects, and the generated objects are assembled into a PDF file. Finally, to judge the quality of the generated PDFs with respect to fuzzing, we compared the code coverage they achieve with that of actual PDFs from our dump. Since input size can affect code coverage, we used PDFs of similar size for the comparison.
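One common way to turn such a trained model into an object generator is temperature sampling: the model repeatedly predicts the next character from its own output until it emits “endobj”. Below is a sketch reusing the names from the model above; char_to_idx and idx_to_char are the vocabulary mappings built during training, and the helper itself is illustrative.

import numpy as np

def sample_object(seed, char_to_idx, idx_to_char, temperature=1.0, max_len=2000):
    text = seed
    while not text.endswith("endobj") and len(text) < max_len:
        # One-hot encode the most recent SEQ_LEN characters as model input.
        x = np.zeros((1, SEQ_LEN, VOCAB_SIZE))
        for t, ch in enumerate(text[-SEQ_LEN:]):
            x[0, t, char_to_idx[ch]] = 1.0
        probs = model.predict(x, verbose=0)[0]
        # Temperature re-weighting: lower values give conservative output,
        # higher values give more diverse (and more broken) output.
        logits = np.log(probs + 1e-8) / temperature
        probs = np.exp(logits) / np.sum(np.exp(logits))
        text += idx_to_char[int(np.random.choice(VOCAB_SIZE, p=probs))]
    return text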
There are tools like Pin, DynamoRIO, etc. available to calculate code coverage. We are using DynamoRIO, which is an open-source tool: its ‘drcov’ client records the code coverage of a run into a log file. Following is an example of how to use drcov.
./drrun.exe -t drcov -dump_text -logdir C:/Users/IEUser/Desktop/logs -- C:/Program\ Files/Adobe/Acrobat\ Reader\ DC/Reader/AcroRd32.exe example.pdf
The above command runs AcroRd32.exe with example.pdf as a command-line argument; AcroRd32.exe is the Adobe Acrobat Reader executable for 32-bit systems. ‘-dump_text’ tells drcov to dump its logs as text files at the location C:/Users/IEUser/Desktop/logs. To know more about drcov you can visit here. These logs can be parsed to obtain the unique basic block count, and we have written a Python script to automate the above process for the generated PDFs. Now that we have the required data, let’s plot it and see how well the system is doing.
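As an illustration of what that script does, the sketch below counts unique (module, offset) pairs in the dumped logs. It assumes the drcov text format, in which each basic-block entry looks like “module[  3]: 0x0000a1b0,  52”, and the default .log file naming; both may vary across DynamoRIO versions.

import re
from pathlib import Path

# One basic-block entry per line in a drcov text dump.
BB_RE = re.compile(r"module\[\s*(\d+)\]:\s*(0x[0-9a-fA-F]+),\s*(\d+)")

def unique_bb_count(log_path):
    blocks = set()
    for line in Path(log_path).read_text(errors="ignore").splitlines():
        m = BB_RE.match(line.strip())
        if m:
            # A block is identified by (module id, offset within module).
            blocks.add((int(m.group(1)), int(m.group(2), 16)))
    return len(blocks)

for log in Path("C:/Users/IEUser/Desktop/logs").glob("*.log"):
    print(log.name, unique_bb_count(log))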
Results:
We have compared the code coverage obtained on our target software, i.e. Adobe Acrobat Reader. First, let’s take a look at the PDF objects generated by the LSTM.
And here is the dump of a sample PDF.
%%PDF-1.1 1 0 obj<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>endobj 2 0 obj<< /Type /Outlines /Count 0 >>endobj 3 0 obj<< /Type /Pages /Kids [4 0 R] /Count 1 >>endobj 4 0 obj<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet [/PDF /Text] /Font << /F1 6 0 R >> >> >>endobj 6 0 obj<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica /Encoding /MacRomanEncoding >>endobj 8 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /S/b/I/ca1.02.0154)>> endobj 9 0 obj <</Type/Annot/Border[0 0 0]/Dest(refg186)/Subtype/Link/Rect[63.5 457.847 228.817 206.133]>> endobj 10 0 obj <</Type/Annot/Border[0 0 0]/Dest(refg-ILIDCB /Buris]CRCaliF(nemensiune/ModetricransseKed- ist-congreh--cinthor-ocutior-o(.iapreft-) >>/StructParent 155>> endobj 11 0 obj <</Count 5>> endobj 12 0 objendobj 13 0 obj <</Type /Page /Pse [0 1 1] >> endobj 14 0 obj <</ProcSet[/PDF /Text/ImageB /ImageC /ImageI 03 2 7 0 R /XObject 54 0 R >> >> endobj 15 0 obj 114775 endobj 16 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /StructParent 436>> endobj 17 0 obj <</Subtype/Link/Rect[ 142.50 12.10 76.92 314.4 13.488] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://than /ersa0]>> endob 1082 1540 119 0 0 02 0 R/Type/Encoding/Roding/Identiseica1 /ColorSpace 46 0 R/Percjs[63 0 R/Regi Com50 0 R/FontDescriptor/9 0 R/BaseFont/Helveticating.pdf)/S/URI>>/Rect[456.867 585.487 598.178 744.645]>> endobj 18 0 obj <</Type/Annot/Border[0 0 0]/Dest(refg16)/Subtype/Link/Rect[848.902 224.944>> endobj 19 0 obj <</Type/Page /A/DivB<</Groul 87 00 0>> endobj 20 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /StructParent 417>> endobj 21 0 obj <</Encoding/Descendor 892 0 R/Le>> endobj 22 0 obj <</Type/Page/Parent 2 0 R /Red <</Gr50374 0.04308 200.000] /MediaBox [112 742.85] />> endobj 23 0 objendobj 24 0 obj [ 522]] >> endobj 26 0 obj <</nustexue TmarshondationSc)>> endobj 27 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /StructParent 546>> endobj 28 0 obj <</Type/Matomals/Erd-x 673142 3 /Encoding /MacRomanEncoding /FirstChar 250 /LastChar 122 /Encoding /Identity6 /Subtype /R/Subtype /Type1 /Escent 540 /Encoding /Differe3380 /Page/ighte-slek/mopki/riset-urea-hecture-ralty)>> >> endobj 29 0 obj 761 endobj 30 0 obj <</Type/ExtGState/MDevicent[ 494 479.409 498.401 0 0]>> endobj 31 0 obj <</Font 52 0 R/Firction 120 0 R /Subtype /Type1 /Encoding 714 0 R /Rotate 0 /ExtFmage-f0370E0FonmW >> endobj 32 0 obj 28898 endobj 33 0 obj [ 224 592 ] >>endobj 34 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /S ErocSect[ >> endobj 35 0 obj 12012 endobj 36 0 obj <</D/ff843 251 318 604 40 445 408 ] >> endobj 37 0 obj <</Encoding/WinAnsiEncoding/LastChar 32/Widths 1368 0 R>> endobj 38 0 obj <</K 61 0 R >> endobj 39 0 obj <</R33 36 0 R/R10 10 0 R>> endobj 40 0 obj <</R7 7 0 R/R35 35 0 R/R35 35 0 R/R36 34 0 R/Re dial/Resounds[0 0 R/XHeight 401/XHeight 250/AvgWidth 357/FontDescrintor 332 0 R/FirstChar 0/FontDescriptor 565 0 R/FirstChar 32/LastChar 33/Widths 252 0 R>> endobj 41 0 obj <</Type/FontDescriptor/FontName/Times#20New#20Roman,Bold/Flags 32/ItalicAngle 0/Ascent 905/Desc<< endobj 42 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /Ascent 85/Subtype/Type1/Encoding/WinAnsiEncoding/FontDescriptor 63 0 R/FirstChar 32/LastChar 254/Widths 4007 0 R>> endobj 43 0 obj [ 40 0 R/X7670]/Group<</S /yangle/MSChy-sAncommerie/Node /Contents 21 0 R>> endobj 44 0 obj <</BaseFont/ommmogMisone,Bold /Pagex -8worSstemie/Tcaira-tinglesto--comhity-chodachgraipi/b[65041858036 430]>> endobj 45 0 obj <</Creator(Ideda.pdf)>> >> endobj 46 0 obj <</Lincoding 54 0 
R/Lendox- 468.4 ->> endobj 47 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /FirstChar 54/Ascent 794/Ascent 721/CapHeight 654/AvgWidth 250/MaxWidth 1825/FontWeight 400/XHeight 250/Leading 33/StemV 46/FontBBox[ -568 -210 1076] /FontDescriptor 202 0 R >> endobj 48 0 obj 021016755 5[026] /Ascent 75 /Parent 3 0 R /Prev i/OPM 2>> endobj 49 0 obj <</Type/Page/Parent 2 0 R/Resourc) [169 0 R/Pa.trugonterSpst 4 4>> endobj 50 0 obj <</LastContionTyDeFoneDthiods/Martb 666109.0004479-chowAd-fImage 272.0 464 0 R >> endobj 51 0 obj <</Type/FontDescriptor/FontName/MR9 53181/K m119 0 R/Registry(Adobe)/hartpset/Sparipealild /(/1)/L1>> endobj 52 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /ExtGState ] /De t[38 5 0 R /Exteti >> endobj 53 0 obj <</t 166 5 /LastChar .34) >>/FontDescriptor 23 0 R/Type/Font /First 474 0 R >> endobj 54 0 obj << 379915 278 479 510 789 599 486 229 570 500 277 531 487 463 672 645 474 686 666 ] >> endobj 55 0 obj 155 endobj 56 0 obj << /Type /Font /Subtype /Type1 /BaseFont /4LGF[PME+Chambrapendedysa-FPages 8 0 R >> endobj 57 0 obj <</Type/Font/Subtype/Type0/BaseFont/ABCDEE+Erfale/li-Bold/BaseFont/Trihs iscs b 465/OhptridtraPaAP)>>/MediaBox[9 9 691 805] /Contents 47 0 R>> endobj xref 0 58 0000000011 00000 n 0000000080 00000 n 0000000129 00000 n 0000000190 00000 n 0000000382 00000 n 0000000496 00000 n 0000000577 00000 n 0000000687 00000 n 0000000861 00000 n 0000000895 00000 n 0000000913 00000 n 0000000965 00000 n 0000001063 00000 n 0000001089 00000 n 0000001170 00000 n 0000001520 00000 n 0000001616 00000 n 0000001675 00000 n 0000001756 00000 n 0000001812 00000 n 0000001918 00000 n 0000001936 00000 n 0000001966 00000 n 0000002166 00000 n 0000002219 00000 n 0000002300 00000 n 0000002546 00000 n 0000002569 00000 n 0000002644 00000 n 0000002765 00000 n 0000002790 00000 n 0000002824 00000 n 0000002900 00000 n 0000002925 00000 n 0000002983 00000 n 0000003060 00000 n 0000003094 00000 n 0000003140 00000 n 0000003365 00000 n 0000003484 00000 n 0000003661 00000 n 0000003759 00000 n 0000003891 00000 n 0000003937 00000 n 0000003994 00000 n 0000004234 00000 n 0000004317 00000 n 0000004404 00000 n 0000004506 00000 n 0000004630 00000 n 0000004730 00000 n 0000004829 00000 n 0000004936 00000 n 0000004959 00000 n 0000005063 00000 n trailer << /Size 58 /Root 1 0 R >> startxref 5227 %%EOF
The generated objects are quite similar to those we provided for training. Their syntactic correctness is a really appreciable result: dictionaries are opened and closed properly, “obj” and “endobj” appear in the right positions, and keys carry plausible values.
We have used the unique basic block count as our metric of code coverage. The following graph shows the basic block counts for normal PDFs and generated PDFs: blue dots represent generated PDFs and red dots represent PDFs from the dump.
From the graph it is pretty clear that about 70% of the PDFs generated by the system achieved a basic block count comparable to that of real PDFs of similar, and even double, their size. Thus, the system can potentially perform as a very good content-specific input generator for fuzzing.
It is worth noting that this system is not limited to generating PDFs: we can use it to generate inputs for any target software, provided we have a corpus to learn from.
Limitations:
From the results we can see that around 30% of the generated PDFs (the exact fraction may vary between experiments) still show lower code coverage. This could be improved by training the machine learning model on more data.
The system does not possess any knowledge of what actually affects code coverage. To achieve that, we can combine machine learning with the power of symbolic execution, which is the future scope of this project.
That’s all for this post. We at Payatu had a great time learning and building this system. I would love to read your reviews and suggestions in the comment section below. Thank you for making it to the end.