Sunday, February 19, 2012

How to Manually Create a PDF

The Portable Document Format (PDF) was a proprietary format controlled by Adobe until July 1, 2008, when it was published as an open standard (ISO 32000-1). It is independent of software, hardware, and operating system, and it is commonly used for document exchange. One topic for a later discussion is the use of PDFs to embed malicious code that runs on an unsuspecting user's computer. First, let us concentrate on the different sections of a PDF and how to create a document manually.


A PDF is a file that consists of several objects. In general, the file structure has four parts:
  1. The header states the PDF specification that this file follows.
  2. The body contains all the objects that make up the document.
  3. The cross-reference table lists the locations of the indirect objects in the file.
  4. The trailer specifies the location of the cross-reference table and other special objects.
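
To make the four parts concrete, here is a rough Python sketch that cuts a PDF byte string into header, body, cross-reference table, and trailer. The function name `split_pdf_sections` and the `sample` bytes are made up for illustration, and `sample` is only a toy fragment (one object), not a complete document. The sketch assumes a classic single-section file with LF line endings.

```python
def split_pdf_sections(data: bytes) -> dict:
    """Split a simple, single-revision PDF into its four parts."""
    header_end = data.index(b"\n") + 1            # header is the first line
    # "\nxref" matches the xref keyword but not "startxref"
    xref_pos = data.index(b"\nxref") + 1
    trailer_pos = data.index(b"trailer", xref_pos)
    return {
        "header": data[:header_end],
        "body": data[header_end:xref_pos],
        "xref": data[xref_pos:trailer_pos],
        "trailer": data[trailer_pos:],
    }


# Toy fragment (an assumption for this sketch): a header, one object,
# a two-entry cross-reference table, and a trailer.
sample = (
    b"%PDF-1.0\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n58\n%%EOF\n"
)

sections = split_pdf_sections(sample)
for name, chunk in sections.items():
    print(name, len(chunk), "bytes")
```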


Below is a simple example PDF I created with Notepad. It prints out a “Hello, World!” message centered at the top of the document. I will show the code and explain each section one by one.


      


The header section contains the version of the PDF specification that my file conforms to; in my example I use version 1.0. Next is the first object, which is a catalog object. If you think of a tree data structure, the catalog object is the root, and all other elements build onto this node.


Line 3 of the code specifies that the object number is 1 and the generation is 0. Similar to HTML, an object must be enclosed in opening and closing keywords. On line 3 the “obj” keyword marks the start of the object, and on line 9 the “endobj” keyword marks its end. The double angle brackets on lines 4 and 8 enclose a dictionary object, which is a collection of key-value pairs: in each pair the first element is the key and the second is the value. Line 5 specifies the type of the object, which is “Catalog”. Let us disregard line 6 for now and come back to it when we describe actions to perform when a document is opened, or when we later explore inserting JavaScript into our PDF. Line 7 is a reference to a “Pages” object, which in turn contains references to the individual “Page” objects. Here we list the object number of the “Pages” object, which is 2, and its generation, which is 0. The “R” in the statement is a keyword that stands for reference.
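
The pieces described above (object number, generation, the obj/endobj pair, a dictionary, and an “R” reference) can be sketched in a few lines of Python. The helper names `serialize` and `indirect_object` are hypothetical, and the catalog here deliberately leaves out the entry on line 6, so it is a simplified stand-in rather than my exact object.

```python
def serialize(value) -> str:
    """Render a small subset of PDF values as text."""
    if isinstance(value, dict):
        # A dictionary is a run of /Key value pairs inside << >>
        pairs = " ".join(f"/{key} {serialize(item)}" for key, item in value.items())
        return f"<< {pairs} >>"
    if isinstance(value, tuple):
        # (object number, generation) becomes an indirect reference
        return f"{value[0]} {value[1]} R"
    return str(value)


def indirect_object(number: int, generation: int, value) -> str:
    """Wrap a value in the obj ... endobj keywords."""
    return f"{number} {generation} obj\n{serialize(value)}\nendobj\n"


# Simplified catalog: only the type and the reference to the Pages object
catalog = {"Type": "/Catalog", "Pages": (2, 0)}
catalog_text = indirect_object(1, 0, catalog)
print(catalog_text)
```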


Our second object is the “Pages” object, which contains references to the individual pages. Be careful not to confuse the “Pages” object with the “Page” object. As you can see above, the type of this object is “Pages”, and we introduce a new entry called “Count”. Count is the number of pages that this object points to; in this simple example we have only one page. On line 15 we specify the required entry “Kids”, an array of references to the child page objects. The next object is our “Page” object, and it is object number 3.
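
One way to picture the relationship between “Pages”, “Kids”, and “Count” is as a small tree. The following Python sketch models objects 2 and 3 as plain dictionaries (a modelling choice for illustration only, not how a PDF reader stores them) and walks the tree to check that the count matches.

```python
# Objects keyed by object number; Kids and Parent hold object numbers
objects = {
    2: {"Type": "Pages", "Kids": [3], "Count": 1},
    3: {"Type": "Page", "Parent": 2},
}


def count_pages(objects: dict, number: int) -> int:
    """Count the Page leaves reachable from the given object."""
    node = objects[number]
    if node["Type"] == "Page":
        return 1
    return sum(count_pages(objects, kid) for kid in node["Kids"])


page_total = count_pages(objects, 2)
print(page_total == objects[2]["Count"])
```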



Similar to the “Pages” object, the “Page” object has to declare its type, on line 21. On line 22, instead of “Kids”, you must list the parent of the object, which in this case is object 2 (Pages). On line 23 I list the resources this page uses; here the only one needed is a font. I only declare a name for the font I will use and give a reference to a “Font” object that fully declares the font. In my example, the “Font” object is number 5. On line 25 I use the entry “MediaBox”, which is a required entry for a page object and defines the boundaries of the page. The last entry I use for the Page object is “Contents”, which is a reference to an object that contains the text we wish to display. In my example this is object 4, which is a stream object.
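
To get a feel for what the “MediaBox” numbers mean: the four values are the lower-left and upper-right corners of the page, measured in default user space units of 1/72 inch. The sketch below assumes a US Letter box of [0 0 612 792], which is a common choice and not necessarily the one in my file; the function name is hypothetical.

```python
POINTS_PER_INCH = 72  # default user space unit is 1/72 inch


def media_box_size_inches(media_box):
    """Return (width, height) in inches for [llx, lly, urx, ury]."""
    llx, lly, urx, ury = media_box
    return ((urx - llx) / POINTS_PER_INCH, (ury - lly) / POINTS_PER_INCH)


# Assumed US Letter page
size = media_box_size_inches([0, 0, 612, 792])
print(size)  # (8.5, 11.0)
```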


First, on line 31, we must include the length, which is the number of bytes starting after “stream” and ending right before “endstream”. If we count the bytes we get a size of 45. Next are the keywords for the stream object: line 32 is the “stream” keyword that starts the stream, and line 37 is the “endstream” keyword that ends it. Lines 33 and 36 similarly open and close the text: “BT” stands for begin text, and “ET” stands for end text. Line 34 selects our font, which we declared as “F1”, and sets the font size to 24. Something to note is how functions and parameters are written. The parameters come first and are pushed on the stack; the function follows and pops its parameters off. This is what is happening on line 34: the font “F1” and the size 24 are pushed on the stack, and then the function “Tf” pops the two parameters off the stack. On line 35, 250 and 700 are the horizontal and vertical distances measured from the bottom-left corner of the page. This coordinate is where the text “Hello, World!” will be displayed. Additionally, if we desired, we could add an optional “Filter” entry, with decode parameters, to the stream dictionary for stream data that is not stored as plain text.
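
Counting the length by hand is error-prone, so here is a small Python sketch that computes it and also walks the stream in the parameters-first order described above. The stream text is a reconstruction under assumptions: it uses the “Td” and “Tj” operators for positioning and showing text, LF line endings, and no line break after “ET”. Under those assumptions the count comes out to 45 bytes. The name `run_postfix` and the tiny tokenizer are hypothetical and only handle this simple stream.

```python
import re

# Assumed stream content: begin text, select font, move, show text, end text
content = b"BT\n/F1 24 Tf\n250 700 Td\n(Hello, World!) Tj\nET"

# /Length counts the bytes between the line break after "stream"
# and the line break before "endstream"
stream_object = (
    b"4 0 obj\n<< /Length " + str(len(content)).encode("ascii") + b" >>\n"
    b"stream\n" + content + b"\nendstream\nendobj\n"
)

OPERATORS = {"BT", "ET", "Tf", "Td", "Tj"}


def run_postfix(stream: bytes):
    """Pair each operator with the parameters that were stacked before it."""
    stack, calls = [], []
    # A token is either a (parenthesized string) or a run of non-spaces
    for token in re.findall(r"\([^)]*\)|\S+", stream.decode("ascii")):
        if token in OPERATORS:
            calls.append((token, stack))  # operator pops everything stacked so far
            stack = []
        else:
            stack.append(token)           # parameter is pushed
    return calls


print(len(content))
for operator, params in run_postfix(content):
    print(operator, params)
```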

   

Object number 5 is the font object that was referenced earlier. Here we fully declare the font. We must first declare the type, as with the catalog and page objects; the type for this object is “Font”. Line 43 gives the subtype, a required entry in the font dictionary. There are seven subtypes to choose from, and the possible values can be found in the Portable Document Format Specification [1]. For our example I use “Type1”. The entry “BaseFont” on line 44 simply gives the name of the font we use, which is Helvetica.


The last section that must be included to close a PDF document is the xref and trailer section.
Line 49 gives the number of entries in the cross-reference table: one for each of our five objects plus the special entry for object 0, which heads the list of free objects. That makes 6, and the same number appears again inside the trailer section as the size. Also, on line 54 we have a reference to the catalog object (object 1), which is the root node. We end with the startxref keyword, which is followed by the byte offset of the xref section, and the %%EOF end-of-file marker on line 58. This completes the PDF, which a PDF reader can now open. Only the bare essentials are included in my example; normally the cross-reference table would include the byte offset of each object in the document.
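
For readers who would rather not count offsets by hand, the following Python sketch assembles a complete one-page document and fills in the cross-reference offsets automatically. It is a reconstruction under assumptions, not a copy of my file: the object bodies are written on single lines, the catalog has only the two entries discussed above, the MediaBox is assumed to be US Letter, and the stream uses the “Td” and “Tj” operators. The function name `build_pdf` is hypothetical.

```python
def build_pdf() -> bytes:
    """Assemble a minimal one-page PDF with a computed xref table."""
    content = b"BT\n/F1 24 Tf\n250 700 Td\n(Hello, World!) Tj\nET"
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",                      # object 1
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",              # object 2
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "   # object 3
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>", # object 5
    ]

    out = bytearray(b"%PDF-1.0\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))              # byte offset where this object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    xref_offset = len(out)
    entry_count = len(objects) + 1            # five objects plus entry 0
    out += b"xref\n0 %d\n" % entry_count
    out += b"0000000000 65535 f \n"           # object 0: head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset   # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % entry_count
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


pdf_bytes = build_pdf()
print(len(pdf_bytes), "bytes")
```

To try it out, write the result in binary mode, for example `open("test.pdf", "wb").write(pdf_bytes)`, and open the file in a PDF reader.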

In the next article I will explore actions that are available in the PDF format as well as embedding JavaScript in a document.


References
[1] “PDF Reference and Adobe Extensions to the PDF Specifications”, Available at http://www.adobe.com/devnet/pdf_reference.html

4 comments:

  1. this is really nice article... :) thanks for posting :)

  2. So how to create the PDF ? Where are you creating it ? In text file, eclipse, visual studio or somewhere else ?

    Replies
    1. Just use a simple text editor like Notepad or TextEdit and save the file as test.pdf

  3. So I want to know what tool to use for it ?
