|Figure 1.1 - Version 1 of the uncompressed pdf document labeled "works_original.pdf"|
|Figure 1.2 - Output of pdf.py executed with version 1 of the pdf document (works_original.pdf)|
|Figure 1.3 - Output of pdf.py executed with version 2 of the pdf document (notwork.pdf)|
|Figure 1.4 - Objects listed by Pdfstreamdumper for the notwork.pdf|
Pdfstreamdumper lists the objects contained in the PDF, as shown in Figure 1.4. The list is consistent with the output that “pdf.py” returns, so where are the missing objects? Examining each object and its contents reveals that the missing objects are contained within other objects. For example, object 10 contains objects 13, 14, 15 and 16.
|Figure 1.5 - Contents of object 10 shown by Pdfstreamdumper for the notwork.pdf|
To understand the syntax we can refer to the PDF Document Reference; however, it becomes clear after some simple analysis. As Figure 1.5 shows, four objects are listed consecutively as pairs of numbers. The first number in each pair is the object number and the second is the byte offset at which that object's content begins, measured from the start of the first embedded object. So the first pair, “13 0”, declares that object 13 comes first, at offset 0. The next pair, “14 22”, declares that the content of object 14 begins at offset 22. The same applies to the next two pairs, “15 49” and “16 146”. If we look back at the output of pdf.py, we can see the tags for object 10, including the tag that allows it to hold multiple objects.
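The pair layout described above can be sketched in a few lines of Python. This is a minimal illustration and not the code from pdf.py: the function name `split_object_stream` and the sample bytes are hypothetical, and the stream is assumed to be already decompressed. The parameters `n` and `first` stand for the values of the “/N” and “/First” entries of the containing object, where “/First” is the offset of the first embedded object within the decoded stream.

```python
def split_object_stream(data: bytes, n: int, first: int) -> dict:
    """Split a decoded object stream into {object number: raw content}.

    data  -- decompressed stream bytes (header pairs followed by objects)
    n     -- number of embedded objects (the /N entry)
    first -- offset of the first embedded object (the /First entry)
    """
    tokens = data[:first].split()
    if len(tokens) < 2 * n:
        raise ValueError("header holds fewer than n (number, offset) pairs")

    numbers = [int(tok) for tok in tokens[:2 * n]]
    # Even positions are object numbers, odd positions are offsets.
    pairs = list(zip(numbers[0::2], numbers[1::2]))

    body = data[first:]
    objects = {}
    for i, (obj_num, offset) in enumerate(pairs):
        # An object ends where the next one starts, or at the end of the body.
        end = pairs[i + 1][1] if i + 1 < len(pairs) else len(body)
        objects[obj_num] = body[offset:end].strip()
    return objects


# Hypothetical sample, not taken from notwork.pdf:
# two objects, 13 at offset 0 and 14 at offset 9.
sample_header = b"13 0 14 9 "
sample_body = b"<</A 1>> <</B 2>>"
sample_objects = split_object_stream(sample_header + sample_body, 2, len(sample_header))
```

Because the offsets are relative to the first embedded object rather than to the start of the stream, the header has to be cut off at `first` before the offsets are applied.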
|Figure 1.6 - Snippet from the output of pdf.py for notwork.pdf|
We see that the tag “/ObjStm” allows multiple objects to be embedded in object 10, which we can confirm in the PDF Document Reference. The tag “/N” tells us how many objects are stored inside object 10; as Figure 1.6 shows, and as Pdfstreamdumper verifies, that number is 4. The same process can be followed to locate objects 5 and 6, which are also missing from the pdf.py output. Object 5 is embedded in object 2, and object 6 is embedded in object 3. Figure 1.7 and Figure 1.8 show the contents of objects 2 and 3, respectively, using Pdfstreamdumper.
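As a rough illustration of how these tags could be read programmatically, the sketch below checks an object's dictionary for “/Type /ObjStm” and pulls out the “/N” and “/First” values with regular expressions. The helper name `read_objstm_params` and the sample dictionary are hypothetical (the “/First” and “/Length” values are made up, not taken from notwork.pdf), and a regex scan like this is a shortcut rather than a full PDF dictionary parser.

```python
import re


def read_objstm_params(dictionary: bytes):
    """Return (n, first) for an object stream dictionary, else None."""
    if not re.search(rb"/Type\s*/ObjStm\b", dictionary):
        return None  # not an object stream

    # "/N" must be followed by whitespace, so "/Names" is not matched here.
    n_match = re.search(rb"/N\s+(\d+)", dictionary)
    first_match = re.search(rb"/First\s+(\d+)", dictionary)
    if n_match is None or first_match is None:
        return None  # malformed or incomplete dictionary

    return int(n_match.group(1)), int(first_match.group(1))


# Hypothetical dictionary for an object holding four embedded objects.
sample_dict = b"<</Type /ObjStm /N 4 /First 27 /Length 200>>"
params = read_objstm_params(sample_dict)
```

An object whose dictionary returns `None` here can be handled as an ordinary object; only those that return a pair need their stream split into embedded objects.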
|Figure 1.7 - Contents of object 2 shown in Pdfstreamdumper for the file notwork.pdf|
|Figure 1.8 - Contents of object 3 shown in Pdfstreamdumper for the file notwork.pdf|
4. Results and Solution
|Figure 1.9 - Code created for pdf.py to address objects streams in a PDF document|
This code has also been submitted to Jsunpack’s source code, and the patch request is pending review. Figure 1.10 displays the new output for “notwork.pdf” when executed with the modified “pdf.py” script, which includes the code shown in Figure 1.9.
|Figure 1.10 - Output of modified pdf.py executed with the file notwork.pdf|
5. Additional Patch to Pdf.py
Another patch made to the pdf.py script concerns the “/Names” tag. I noticed that if the “/Names” tag includes a custom name, the parser captures only the text of the name and not the reference number. An example is shown below.
|Figure 1.11 - Snippet of the output from pdf.py for the file notwork.pdf|
For object 14 there is a “/Names” tag, and the output displays only the text “My Code”, which is the name given to the reference to object 15. However, the reference to object 15 itself does not appear. Tracing the program with the Python debugger showed that the cause is the closing parenthesis, which stops the parsing function from capturing anything after it. This is easily fixed by adding a condition to an existing “if” statement in “pdf.py”. The change is displayed in Figure 1.12.
|Figure 1.12 - Code created for pdf.py to address the missing reference number for the tag "/Names"|
The tag variable is an array that contains the stream for the current object. So as well as looking for the condition “\\”, I added the condition “curtag == ‘Names’”. The line now checks whether the current tag is a “/Names” tag and, if it is, the parsing function continues to collect the following characters in the tag, which include the object reference number.
|Figure 1.13 - Snippet of output from the modified pdf.py for the file notwork.pdf|
Figure 1.13 shows the new output of the modified “pdf.py” script, which includes the reference as well as the text name given to the tag “/Names”.
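To make the goal of this patch concrete, the following standalone sketch extracts both the name text and the object reference from a “/Names” value. It is not the change made to pdf.py (that is the condition shown in Figure 1.12); the function `names_with_refs`, the regular expression, and the sample value with generation number 0 are assumptions for illustration only. Nested unescaped parentheses inside a name are not handled.

```python
import re

# A literal string "(...)" followed by an indirect reference "obj gen R".
# Inside the string, a backslash escapes the next character.
NAME_REF = re.compile(rb"\(((?:\\.|[^\\)])*)\)\s*(\d+)\s+(\d+)\s+R")


def names_with_refs(value: bytes):
    """Return a list of (name, object number, generation) tuples."""
    return [
        (m.group(1).decode("latin-1"), int(m.group(2)), int(m.group(3)))
        for m in NAME_REF.finditer(value)
    ]


# Hypothetical /Names value modeled on the "My Code" example in the text.
sample_value = b"[(My Code) 15 0 R]"
entries = names_with_refs(sample_value)
```

The point is the same as in the patch: parsing must continue past the closing parenthesis, otherwise the name is captured but the reference that follows it is lost.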
Jsunpack, Available at http://code.google.com/p/jsunpack-n/
Pdfstreamdumper, Available at http://sandsprite.com/blogs/index.php?uid=7&pid=57
PDF Document Reference, Available at http://www.adobe.com/devnet/pdf/pdf_reference.html