Friday, March 2, 2012

Decode PDF and Extract JavaScript

1. Introduction
In the previous articles we looked at how to manually create a PDF and how to embed JavaScript inside a PDF document. I will now look at how to decompress a PDF and extract its JavaScript in order to reverse engineer the code inside. When a PDF is created or saved, the streams inside it are commonly compressed and encoded with filters such as “FlateDecode”. The stream objects in a PDF are the objects that contain the JavaScript or text we wish to read. Malware authors often embed malicious code inside these JavaScript streams, so it is valuable for security professionals to be able to extract and decompress them. Let us revisit the PDF example presented in the previous article, How to Embed JavaScript into PDF. Because that PDF was created manually, its streams are in plain text. We will now use Adobe Acrobat 9 to save the file again; this time the streams are encoded and are not readable in a file editor.

Above is a partial view of the PDF we created in the previous article, now with the streams encoded. As you can see, the PDF has been modified. More objects have been added automatically, and the original streams with JavaScript are no longer readable. The first header line, which states the version of the PDF specification the document follows, has also been rewritten and now reads “%PDF-1.6”. This is due to the version of Acrobat I used, which is Acrobat Pro 9. Other changes can also be found, such as metadata that is now included in the document.

Streams may be compressed with several different filters; FlateDecode is the one most commonly used to encode a PDF. Inspecting the document shows that this PDF has been encoded using the FlateDecode filter. There are two tools we can use to decode the PDF. The first tool is “pdftk”, which is available for download from its website. This program runs on Windows, Linux, Mac OS X, FreeBSD and Solaris. It has many features that allow us to manipulate a PDF, among them the ability to decompress streams so the file can be read in plain text. The second tool is “Jsunpack-n”, a powerful tool for decoding and extracting JavaScript from a PDF file.
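To make the FlateDecode filter more concrete, here is a minimal Python sketch of what decoding such a stream involves. FlateDecode data is zlib/deflate data, so Python's standard zlib module can decompress it. The object below is built inside the script as a stand-in; the payload, the helper names and the regular expressions are my own assumptions for illustration, not taken from the sample PDF or from either tool. The sketch also assumes /Length is a direct number and the stream dictionary is not nested.

```python
import re
import zlib

# Hypothetical JavaScript payload, standing in for the alert box script.
javascript = b'app.alert("Hello from a PDF stream");'

def make_flate_object(obj_num, payload):
    """Build a PDF stream object whose data is zlib (FlateDecode) compressed."""
    data = zlib.compress(payload)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        obj_num, len(data))
    return header + data + b"\nendstream\nendobj\n"

# Matches a stream dictionary followed by the "stream" keyword and its EOL.
STREAM_HEADER_RE = re.compile(rb"<<(?P<dict>.*?)>>\s*stream\r?\n", re.DOTALL)
# Assumes /Length is a direct integer, not an indirect reference.
LENGTH_RE = re.compile(rb"/Length\s+(\d+)")

def decode_flate_streams(pdf_bytes):
    """Return the decompressed contents of every FlateDecode stream found."""
    decoded = []
    for match in STREAM_HEADER_RE.finditer(pdf_bytes):
        dictionary = match.group("dict")
        length = LENGTH_RE.search(dictionary)
        if b"/FlateDecode" not in dictionary or length is None:
            continue  # not a FlateDecode stream, or no usable length
        start = match.end()
        data = pdf_bytes[start:start + int(length.group(1))]
        decoded.append(zlib.decompress(data))
    return decoded

pdf_fragment = make_flate_object(9, javascript)
recovered = decode_flate_streams(pdf_fragment)
print(recovered[0].decode("ascii"))
```

Real documents can chain several filters or store /Length as an indirect reference, which is why dedicated tools such as the two described here are more practical than a hand-written script.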

2. PDF Toolkit
PDF Toolkit is easy to install. Download the zip file from the pdftk website and place it in a convenient location. Then add the location of the bin folder to your environment variables. On Windows, this can be done by opening the properties of your computer and clicking on the Advanced tab.

Click on “Environment Variables” and here we locate the "Path" variable and add the location of our bin folder for pdftk. This allows us to use pdftk from the command prompt without having to navigate to the program folder each time.

Below is the command used to decompress the PDF file. First comes the call to the program, pdftk. Second, we give the location of the PDF we wish to decompress. Third is the “output” parameter, followed by the new filename we wish to use for the decompressed file. Last, we specify the “uncompress” parameter, which decodes the streams in the PDF file. To view all the commands in pdftk and the accompanying examples, type the command “pdftk --help”.
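The argument order described above can also be assembled from a script, which is handy when several samples need decompressing. The following is a small Python sketch; the file names sample.pdf and sample_uncompressed.pdf and the helper build_uncompress_command are hypothetical and used only for illustration. It prints the command and runs it only if pdftk is on the PATH and the input file exists.

```python
import os
import shutil
import subprocess

def build_uncompress_command(source_pdf, output_pdf):
    """Return the pdftk arguments: input file, 'output', new file, 'uncompress'."""
    return ["pdftk", source_pdf, "output", output_pdf, "uncompress"]

# Hypothetical file names, used only for illustration.
command = build_uncompress_command("sample.pdf", "sample_uncompressed.pdf")
print(" ".join(command))

# Run the command only when pdftk is installed and the input file exists.
if shutil.which("pdftk") and os.path.exists("sample.pdf"):
    subprocess.run(command, check=True)
```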

The decoded file is created, and below it is shown opened in the file editor Notepad++. Selected parts of the code are shown with the corresponding line numbers to the left.

First, we can see that the streams are in plain text. We can identify object 9, which contains the JavaScript for our alert box. Note also the xref section of the file: when we created the PDF by hand we left this part blank, with only the object name. After the file was encoded and decoded, the offsets were calculated automatically for all the objects.

3. Jsunpack
The second tool we can use to decode and extract the JavaScript in a PDF file is Jsunpack. I tested it on Ubuntu 10.04; the latest version can be obtained by running the following command in Ubuntu:

               $ svn checkout jsunpack-n

Follow the instructions in the INSTALL file to complete the installation. Once it is installed, we can use the terminal to examine a PDF file. I tested it on my previous sample PDF file; the file was decoded, but no JavaScript was decoded. I therefore tested it on a sample JavaScript clock file, which served as a better example. The JavaScriptClock file is available online. Open the file in a file editor and you can see that the streams are encoded.

Below is the command to call the PDF Python script. You provide the location of the PDF file and the Python script handles the rest. For more verbose information on the file, append “-v” to the end of the command. The command below decodes the file and extracts the JavaScript embedded in the PDF to a separate file that we can examine. Jsunpack appends “.out” to the name of the new file it creates.

“JavaScriptClock.PDF.out” contains the JavaScript, and below is a partial view of the output in gedit. We can see all the functions and declarations that were made using JavaScript. This provides a useful way to examine obfuscated PDFs that may contain malicious code.

4. Conclusion
To conclude, tools exist that make it easier to manipulate and decode PDF documents. Above I have shown two of them, pdftk and Jsunpack, which are useful for decoding streams in a PDF. These streams may contain JavaScript, and unfortunately malware authors embed JavaScript that performs undesired functions on another user’s computer. These tools allow us to reverse engineer the code and discover whether malicious code is embedded. As shown above, Jsunpack also writes all the JavaScript to a separate file, which is useful for analysis. In the next article I will explore buffer overflow attacks and vulnerabilities of PDFs and previous versions of Adobe Acrobat.

