Posted: 05/11/2015 07:53AM
Mike Mclain discusses how to use Python to remove PDF hyperlinks
When it comes to writing documents (emails, blog posts, programming documentation, et cetera), I create the majority of my documents within Microsoft Word, MarkdownPad 2, and Scrivener (or some combination of these applications).
As a result, the majority of the documents I write are saved either in a Microsoft Word format or in plain text (since such formats are very convenient for me); occasionally, however, I need to render a document as a PDF file, since PDF is typically considered the universal medium for conveying formatted information across platforms.
Now, the process of PDF file creation itself is rather mundane and won't be discussed here (I spent years working with LaTeX as a PDF creator, and software like Microsoft Word or ABBYY FineReader can easily convert files into a PDF format). One of the more annoying aspects of PDF creation that I occasionally encounter, however, is the inadvertent inclusion of web hyperlinks (particularly within a web-rendered document) that are, in my particular case, often of a sensitive or non-pertinent nature and need to be removed prior to distribution.
While there are likely many open source tools available to achieve this particular objective, I typically don't have such tools readily available (I need them so infrequently that their installation seems unmerited). Thus, when such problems arise, I find myself torn between spending time researching tools capable of performing my desired task and, after careful evaluation of the overall programming difficulty, attempting to resolve my problem programmatically using Python.
As might be expected, given my overall love of Python, I decided to write a simplistic Python script to sanitize hyperlinks within a PDF file, one that can be easily extended to perform more advanced hyperlink sanitation actions (like URL replacement or selective URL removal).
To achieve this objective, the following Python code was created:
# -*- coding: utf-8 -*-
"""A simplistic Python script to remove all links from a PDF file."""

# Used to create regular expressions
import re


def remove_pdf_links(file_input, file_output):
    """A simplistic Python function to remove all links from a PDF file."""
    # Open the PDF file in binary mode and extract its contents
    with open(file_input, mode='rb') as f:
        data = f.read()

    # PDFs encode URL information as /URI(<<URL PATH>>), sometimes with a
    # space before the parenthesis, so we need a regex to extract this
    # information (a bytes pattern, since the file was read in binary mode)
    regexcompiled = re.compile(rb"/URI ?\((.*?)\)", re.MULTILINE)

    # Make a copy of the PDF content for the output PDF file
    output = data

    # Process the PDF contents; for each regex match:
    for match in regexcompiled.finditer(data):
        # We have two options here: either replace the PDF URL with spaces
        # of the same size, like so:
        ##  output = output.replace(match.group(1), b' ' * len(match.group(1)), 1)
        # or replace the whole command with spaces to remove the URL, like so:
        ##  output = output.replace(match.group(0), b' ' * len(match.group(0)), 1)
        #
        # Experimentation shows that we should leave the ( ) pair in place
        # when we remove the /URI command, so the first option is the best
        # approach.
        output = output.replace(match.group(1), b' ' * len(match.group(1)), 1)

    # Replacing the URL with spaces still leaves the link clickable (it opens
    # a blank webpage in the local browser), so we also need to blank out the
    # preceding /URI command to remove the clickable aspect.
    #
    # Note: experimentation shows that the PDF will break if the ( ) pair is
    # not within a PDF /Link tag, so ensure that it stays in place.
    # A plain replace might work for this task, but regex was handy, so I
    # used it here as well.
    regexcompiled = re.compile(rb"/URI", re.MULTILINE)

    # Process the PDF contents; for each regex match:
    for match in regexcompiled.finditer(data):
        # Replace the /URI tag with spaces
        output = output.replace(match.group(0), b' ' * len(match.group(0)), 1)

    # Save the modified PDF file as a binary file
    with open(file_output, mode='wb') as f:
        f.write(output)


# This is our application entry point
if __name__ == "__main__":
    # We could easily extract console arguments and pass this information
    # into the remove function, but I will leave that task to your discretion
    my_input = "./Demo_PDF.pdf"
    my_output = "./Demo_PDF-NL.pdf"

    # Remove the hyperlinks and save the PDF
    remove_pdf_links(my_input, my_output)
This code can easily be modified to meet the demands of your particular application, or used as is.
Important: it should be noted that the method I developed will only remove hyperlinks within an uncompressed PDF content stream!
For example, if you open a PDF file within a text editor like Notepad++, an uncompressed content stream might look like this:
11 0 obj <</Subtype/Link/Rect[ 246.98 580.25 316.95 630.95] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://mamclain.com/) >>/StructParent 1>> endobj
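The script's two substitution passes can be exercised directly on an object like the one above; here is the sample object as a bytes literal, run through the same regexes the script uses:

```python
import re

# The sample link object from above, as a bytes literal
obj = (b"11 0 obj <</Subtype/Link/Rect[ 246.98 580.25 316.95 630.95] "
       b"/BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://mamclain.com/) "
       b">>/StructParent 1>> endobj")

# Pass 1: blank the URL itself, keeping the ( ) pair in place
match = re.search(rb"/URI ?\((.*?)\)", obj)
output = obj.replace(match.group(1), b' ' * len(match.group(1)), 1)

# Pass 2: blank every /URI token so the annotation is no longer clickable
for m in re.finditer(rb"/URI", obj):
    output = output.replace(m.group(0), b' ' * len(m.group(0)), 1)

# output now holds a same-length object with the URL and /URI keys blanked
```

Note that the length of the object never changes, which is what keeps the rest of the PDF's byte offsets valid.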
while a compressed content stream, by contrast, appears only as unreadable binary data between the stream and endstream keywords.
In the event that you need to sanitize a compressed PDF content stream, you will need to decompress the streams first (this is somewhat complex, but it can be done within Python) before running my sanitization code. Given the added complexity that compressed streams introduce, it might be better (and quicker) to seek out an open source tool for this task rather than implementing it within Python (although I might investigate this in the near future).
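To give a rough idea of what that decompression involves, the helper below is my own sketch (the function name is hypothetical): it locates streams marked with the /FlateDecode filter and inflates them with Python's zlib module. A real implementation would also need to handle other filters and rewrite the /Length values and cross-reference offsets if the streams were re-embedded.

```python
import re
import zlib


def inflate_flate_streams(data):
    """Yield (start, end, inflated_bytes) for each /FlateDecode stream found.

    A rough sketch only: it assumes simple single-filter streams and makes
    no attempt to rewrite /Length values or xref offsets afterwards.
    """
    # Stream data sits between the 'stream' and 'endstream' keywords
    pattern = re.compile(rb"/FlateDecode.*?stream\r?\n(.*?)endstream", re.DOTALL)
    for match in pattern.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes that PDF
            # writers place before the 'endstream' keyword
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a stream this sketch can handle
        yield match.start(1), match.end(1), inflated
```

Once inflated, the stream contents can be searched with the same /URI regexes shown earlier.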
I found the sanitization of hyperlinks within a PDF file to be a relatively straightforward task for Python to handle (assuming the PDF being sanitized uses uncompressed content streams), and overall I found the development of this particular script very enlightening, especially given how frequently the PDF format is used in practice.
Enjoy!
By Mike Mclain