Using Python to Make File Copying More Useful

Posted:10/20/2015 6:00AM

Using Python to Make File Copying More Useful

Mike Mclain discusses how to use Python to make Windows file copying more useful

Preface:

After iterating through a number of program revisions during the development of my Python website generator, one of the obstacles I encountered (throughout the development process) was determining a method that would minimize the amount of FTP transactions needed (upon compiling my content) to upload my website to my webserver.

The Problem:

Likewise, while there are a number of off-the-shelf software solutions available that can (to some extent) quickly resolve this particular problem (like WinSCP). Nevertheless, most of these solutions utilize system-defined file timestamps (for file change detection) and such detection methodologies are rather ineffective (albeit from a conservative perspective) in identifying unmodified content when used in conjunction with an automated content generator (since the majority of file content remains consistent but the timestamp is modified when the file is re-created by the generator).

Conversely, although this particular scenario is typically the exception rather than the rule; however, such scenarios do frequently arise during the development of automated applications (since it is oftentimes quicker to re-create file content rather than add exclusionary logic to the applications code) and such cases are seldom easily resolved by the off-the-shelf software solutions readily available.

The Solution:

Nevertheless, while there are likely an abundance of potential solutions available to resolve this particular problem, I decided to create a simplistic Python script:

# A Python script to synchronize the contents of a folder to another folder using the sha-256 algorithm
import fnmatch
import os
import hashlib
import shutil
import getopt
import sys

# a generic function to generate the SHA of a file
def hash_SHA_256(filename):
    f = open(filename, 'rb')
    hashtype = hashlib.sha256()
    readbuffer = 65536

    data = f.read(readbuffer)
    while len(data) > 0:
        hashtype.update(data)
        data = f.read(readbuffer)
    return hashtype.digest()

# the main application
if __name__ == "__main__":
    input_folder = ""
    output_folder = ""
    remove_files = False
    remove_folders = False

    # use getopt to handle command line arguments
    optlist, args = getopt.getopt(sys.argv[1:], 'i:o:rf?')
    for o, a in optlist:
        if o == "-?":
            print "*"*60
            print "-i\tInput Folder"
            print "-o\tOutput Folder"
            print "-r\tRemove files from the output folder that are missing in input folder"
            print "-f\tRemove Empty Folders from the output folder"
            print "-?\tHelp"
            print "*"*60
            exit()
        elif o == "-i":
            input_folder = a
        elif o == "-o":
            output_folder = a
        elif o == "-r":
            remove_files = True
        elif o == "-f":
            remove_folders = True
        else:
            print "Error: Bad Parameters"
            exit()

    print "Processing input folder"
    # Get a list of all input subdirectory files and folders
    local_matchs = []
    for root, dirnames, filenames in os.walk(input_folder):
        for filename in fnmatch.filter(filenames, '*.*'):
            targetpath = os.path.join(root, filename)
            basepath = targetpath.replace(input_folder,"")
            targethash = hash_SHA_256(targetpath)
            local_matchs.append((basepath,targetpath,targethash))

    print "Processing output folder"
    # Get a list of all output subdirectory files and folders
    remote_matchs = []
    for root, dirnames, filenames in os.walk(output_folder):
        for filename in fnmatch.filter(filenames, '*.*'):
            targetpath = os.path.join(root, filename)
            basepath = targetpath.replace(output_folder,"")
            targethash = hash_SHA_256(targetpath)
            remote_matchs.append((basepath,targetpath,targethash))

    #If file is missing copy file
    #If file is not the same copy file
    for local in local_matchs:
        loc_base,loc_target,loc_hash = local
        cancopy = True
        for remote in remote_matchs:
            rem_base,rem_target,rem_hash = remote
            if not loc_base == rem_base:
                continue
            else:
                if rem_hash == loc_hash:
                    cancopy = False
                remote_matchs.remove(remote)
                break
        if cancopy:
            outputloc = output_folder+loc_base
            print "Copy %s to %s" % (loc_target,outputloc)
            if not os.path.exists(os.path.dirname(outputloc)):
                os.makedirs(os.path.dirname(outputloc))
            shutil.copy2(loc_target,outputloc)

    # Remove Missing Files
    if remove_files:
        print "Removing Missing Files"
        for remote in remote_matchs:
            rem_base,rem_target,rem_hash = remote
            print "Removing %s" % rem_target
            os.remove(rem_target)

    if remove_folders:
        # Remove Missing Folders
        print "Removing Empty Folders"
        action = True
        while action:
            action = False
            for root, dirnames, filenames in os.walk(output_folder):
                for name in dirnames:
                    target = os.path.join(root,name)
                    if not os.listdir(target):
                        print "Remove Empty Folder %s" % target
                        os.rmdir(target)
                        action = True


that utilizes the Secure Hash Algorithm (SHA-256) to compare files between two folders (in my case, a development folder that contained my automatically generated content and a deployment folder that mirrored my webserver) and update files as necessary. Conversely, at this point, I was able to utilize WinSCP to synchronize my local deployment folder with my webserver and, because my modifications (within the deployment folder) only occurred when the file content changed (thanks to my SHA-256 copy Python script), the time needed to FTP my website updates was substantially reduced.

Conclusion:

Overall, I found this technique to be both effective and extremely useful for a multitude of automated applications.

Enjoy!

Comments:

comments powered by Disqus