Using Python to Remove PDF Hyperlinks

Posted: 05/11/2015 07:53AM

Mike Mclain discusses how to use Python to remove PDF hyperlinks

Preface:

When it comes to writing documents (emails, blog posts, programming documentation, et cetera) I create the majority of my documents within Microsoft Word, MarkdownPad 2, and Scrivener (or some combination of these applications).

As a result, the majority of the documents I write are saved either in a Microsoft Word format or in a plain text format (since such formats are very convenient to me); however, occasionally I need to render a document as a PDF file (since PDF documents are typically considered the universal medium for conveying formatted information across multiple platforms quickly).

The Problem:

Now, the process of PDF file creation is rather mundane and won't be discussed (noting I spent years working with LaTeX as a PDF file creator, and software like Microsoft Word or ABBYY FineReader can easily convert files into a PDF format); however, one of the more annoying attributes (of PDF file creation) that I occasionally encounter is the inadvertent inclusion of web hyperlinks (particularly within a web rendered document) that are (in my particular case) oftentimes of a sensitive or non-pertinent nature and need to be removed prior to distribution.

Likewise, I am sure that there are many open source tools available to achieve this particular objective; however, I typically don't have such tools readily available (as I need them so infrequently that their installation seems unmerited). Thus (when such problems arise) I find myself torn between spending time researching tools (capable of performing my desired task) and (after careful evaluation of the overall programming difficulty) attempting to resolve my problem programmatically using Python.

The Solution:

As might be expected (given my overall love of Python), I decided to write a simplistic Python script to sanitize hyperlinks (within a PDF file) that can be easily extended to perform more advanced hyperlink sanitization actions (like URL replacement or selected URL removal).

To achieve this objective, the following Python code was created:

# -*- coding: utf-8 -*-
""" A Simplistic Python Script to remove all links from a pdf file
"""

# Used to create regular expressions
import re

def remove_pdf_links(file_input, file_output):
    """ A Simplistic Python function to remove all links from a pdf file
    """

    # Open the pdf file in binary mode and extract contents
    with open(file_input, mode='rb') as f:
        data = f.read()

    # PDF encodes URL info as /URI(<<URL PATH>>), sometimes with whitespace
    # before the parenthesis, so we need a regex to extract this information.
    # The file was read in binary mode, so the pattern must be bytes as well
    regexstring = rb"/URI\s*\((.*?)\)"

    # Compile string as regex
    regexcompiled = re.compile(regexstring, re.MULTILINE)

    # Make a copy of the pdf content for the output pdf file
    output = data

    # Process the pdf contents
    items = regexcompiled.finditer(data)

    # For each regex match
    for match in items:
        # We have two options here, either replace the PDF URL with spaces of the same size
        # Like So:
        ## output = output.replace(match.group(1), b' ' * len(match.group(1)), 1)
        # Or replace the whole command with spaces to remove the URL
        # Like So:
        ## output = output.replace(match.group(0), b' ' * len(match.group(0)), 1)
        #
        # Experimentation shows that we should leave the ( ) in the encoded format
        # when we remove the /URI command, so the first option is the best approach
        output = output.replace(match.group(1), b' ' * len(match.group(1)), 1)

    # Because replacing the URL with spaces still leaves the link clickable and
    # opens a blank webpage in the local browser, we need to replace the
    # preceding /URI command to remove the clickable aspect.
    # Note: experimentation shows that the pdf will crash if the () is not within
    # the pdf /Link tag, so ensure that it is left in place above.
    # Replace might work for this task, but regex was handy so I used it
    regexstring = rb"/URI"

    # Compile string as regex
    regexcompiled = re.compile(regexstring, re.MULTILINE)

    # Process the pdf contents
    items = regexcompiled.finditer(data)

    # For each regex match
    for match in items:
        # Replace the /URI tag with spaces
        output = output.replace(match.group(0), b' ' * len(match.group(0)), 1)

    # Save the modified pdf file as a binary file
    with open(file_output, mode='wb') as f:
        f.write(output)



# This is our application entry point
if __name__ == "__main__":
    # We could easily extract console arguments and pass this information
    # into the remove function, but I will leave that task to your discretion
    my_input = "./Demo_PDF.pdf"
    my_output = "./Demo_PDF-NL.pdf"
    # remove hyperlinks and save pdf
    remove_pdf_links(my_input, my_output)
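
As a starting point for the console argument handling mentioned in the entry point comment, here is a minimal sketch using the standard argparse module. The helper name parse_arguments and the argument names file_input and file_output are my own assumptions (chosen to mirror the parameters of remove_pdf_links), not part of the original script.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the input and output PDF paths (hypothetical helper).

    When argv is None, argparse reads the arguments from sys.argv.
    """
    parser = argparse.ArgumentParser(
        description="Remove hyperlinks from an uncompressed PDF file")
    # Positional arguments named after the remove_pdf_links parameters
    parser.add_argument("file_input", help="path of the PDF file to sanitize")
    parser.add_argument("file_output", help="path of the sanitized PDF file to write")
    return parser.parse_args(argv)
```

With this in place, the entry point could call args = parse_arguments() and then pass args.file_input and args.file_output into remove_pdf_links instead of the hard-coded demo paths.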

The script above can be used as is or easily modified to meet the demands of your particular application.
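As one possible modification, selected URL removal might be sketched as follows. This is only a rough illustration under a few assumptions: the function name blank_selected_urls and the should_remove callback are hypothetical, the PDF content is handled as bytes, and only the URL text is blanked (the first step of the script), so the /URI command would still need to be blanked separately to remove the clickable aspect.

```python
import re

# Bytes pattern for a URI entry such as /URI(http://mamclain.com/)
URI_PATTERN = re.compile(rb"/URI\s*\((.*?)\)")


def blank_selected_urls(data, should_remove):
    """Blank only the URLs for which should_remove(url) returns True.

    data is the raw PDF content as bytes; should_remove is a hypothetical
    caller-supplied function that receives each URL as bytes.
    """
    def replace(match):
        url = match.group(1)
        if not should_remove(url):
            # Leave this link exactly as it was
            return match.group(0)
        # Overwrite the URL with the same number of spaces so that the
        # length of the file does not change
        whole = match.group(0)
        url_start = match.start(1) - match.start(0)
        url_end = url_start + len(url)
        return whole[:url_start] + b" " * len(url) + whole[url_end:]

    return URI_PATTERN.sub(replace, data)
```

For instance, passing lambda url: b"mamclain.com" in url as should_remove would blank only the links that point at that domain and leave every other link in place.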

Important: it should be noted that the method I developed will only remove hyperlinks within an uncompressed PDF content stream!

For example, (if you open a PDF file within a text editor like Notepad++) an uncompressed content stream might look like this:

11 0 obj
<</Subtype/Link/Rect[ 246.98 580.25 316.95 630.95] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://mamclain.com/) >>/StructParent 1>>
endobj

while a compressed content stream might look like this:


[Figure: Compressed Content Stream Example.]

Likewise, in the event that you need to sanitize a compressed PDF content stream, you will need to decompress the streams first (this is somewhat complex, but it can be done within Python) prior to running my sanitization code. Given the added complexity that compressed streams introduce, it might be better (and quicker) to seek out an open source tool to perform this task rather than implementing it within Python (although I might investigate this in the near future).
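
To give a feel for what the decompression step involves, here is a rough sketch that uses the standard zlib module to inflate the streams found in a PDF. It assumes the streams use the common FlateDecode filter, and the name inflate_streams is hypothetical. It only reads the stream contents for inspection; writing modified data back would also mean updating each stream's /Length entry and the file's cross-reference offsets, which this sketch does not attempt.

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords
STREAM_PATTERN = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data):
    """Return the decompressed contents of every stream zlib can inflate.

    data is the raw PDF content as bytes. Streams that are not valid zlib
    data (for example, streams using other filters) are skipped.
    """
    inflated = []
    for match in STREAM_PATTERN.finditer(data):
        # A decompression object tolerates the end-of-line bytes that
        # usually sit between the compressed data and "endstream"
        decompressor = zlib.decompressobj()
        try:
            inflated.append(decompressor.decompress(match.group(1)))
        except zlib.error:
            continue
    return inflated
```

Searching the returned byte strings for /URI would at least show whether a compressed file contains hyperlinks that the script cannot reach.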

Examples:

Conclusion:

I found the sanitization of hyperlinks (within a PDF file) to be a relatively straightforward task for Python to handle (assuming uncompressed content streams are utilized within the PDF file being sanitized) and (overall) I also found the development of this particular script very enlightening (especially given how frequently the PDF file format is utilized in practice).

Enjoy!
