Posted:10/20/2015 6:00AM
Mike Mclain discusses how to use Python to make Windows file copying more useful
While iterating through several revisions of my Python website generator, one obstacle I kept running into was finding a way to minimize the number of FTP transactions needed to upload my website to my webserver after compiling my content.
While a number of off-the-shelf tools (like WinSCP) can resolve this problem to some extent, most of them rely on system-defined file timestamps to detect changes. Such detection is conservative, but it is rather ineffective at identifying unmodified content when used with an automated content generator: most of the file content stays the same, yet every timestamp changes whenever the generator re-creates the file.
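To make the timestamp problem concrete, here is a minimal sketch (not part of my generator) that writes the same content to a file twice. The helper names `sha256_of` and `regenerate` are hypothetical, and the explicit `os.utime` call is an assumption that stands in for the clock advancing between two generator runs.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def regenerate(path, content, mtime):
    """Rewrite a file and stamp it with the given modification time.

    The explicit os.utime call simulates time passing between two
    runs of a content generator (an assumption made for this demo).
    """
    with open(path, "w") as handle:
        handle.write(content)
    os.utime(path, (mtime, mtime))


workdir = tempfile.mkdtemp()
page = os.path.join(workdir, "index.html")

# First "generator run"
regenerate(page, "<p>hello</p>", 1000000000)
first = (os.path.getmtime(page), sha256_of(page))

# Second run, one minute later, producing identical content
regenerate(page, "<p>hello</p>", 1000000060)
second = (os.path.getmtime(page), sha256_of(page))

timestamp_changed = first[0] != second[0]
content_changed = first[1] != second[1]
print("timestamp changed:", timestamp_changed)
print("content changed:", content_changed)
```

A timestamp-based tool would treat this file as modified and upload it again, while a comparison of the two digests shows that nothing actually changed.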
This scenario is typically the exception rather than the rule. However, it arises frequently during the development of automated applications (since it is often quicker to re-create file content than to add exclusionary logic to the application's code), and the readily available off-the-shelf tools seldom handle it well.
While there are likely plenty of ways to solve this problem, I decided to write a simple Python script:
# A Python script to synchronize the contents of a folder to another folder using the sha-256 algorithm
import fnmatch
import os
import hashlib
import shutil
import getopt
import sys

# a generic function to generate the SHA of a file
def hash_SHA_256(filename):
    hashtype = hashlib.sha256()
    readbuffer = 65536
    # read in fixed-size chunks; the with block closes the file when done
    with open(filename, 'rb') as f:
        data = f.read(readbuffer)
        while len(data) > 0:
            hashtype.update(data)
            data = f.read(readbuffer)
    return hashtype.digest()

# the main application
if __name__ == "__main__":
    input_folder = ""
    output_folder = ""
    remove_files = False
    remove_folders = False

    # use getopt to handle command line arguments
    optlist, args = getopt.getopt(sys.argv[1:], 'i:o:rf?')
    for o, a in optlist:
        if o == "-?":
            print("*"*60)
            print("-i\tInput Folder")
            print("-o\tOutput Folder")
            print("-r\tRemove files from the output folder that are missing in input folder")
            print("-f\tRemove Empty Folders from the output folder")
            print("-?\tHelp")
            print("*"*60)
            sys.exit()
        elif o == "-i":
            input_folder = a
        elif o == "-o":
            output_folder = a
        elif o == "-r":
            remove_files = True
        elif o == "-f":
            remove_folders = True
        else:
            print("Error: Bad Parameters")
            sys.exit()

    print("Processing input folder")
    # Get a list of all input subdirectory files and folders
    # (note: the '*.*' pattern skips files that have no extension)
    local_matchs = []
    for root, dirnames, filenames in os.walk(input_folder):
        for filename in fnmatch.filter(filenames, '*.*'):
            targetpath = os.path.join(root, filename)
            # strip only the leading folder prefix, not later occurrences
            basepath = targetpath.replace(input_folder, "", 1)
            targethash = hash_SHA_256(targetpath)
            local_matchs.append((basepath, targetpath, targethash))

    print("Processing output folder")
    # Get a list of all output subdirectory files and folders
    remote_matchs = []
    for root, dirnames, filenames in os.walk(output_folder):
        for filename in fnmatch.filter(filenames, '*.*'):
            targetpath = os.path.join(root, filename)
            basepath = targetpath.replace(output_folder, "", 1)
            targethash = hash_SHA_256(targetpath)
            remote_matchs.append((basepath, targetpath, targethash))

    # If file is missing copy file
    # If file is not the same copy file
    for local in local_matchs:
        loc_base, loc_target, loc_hash = local
        cancopy = True
        for remote in remote_matchs:
            rem_base, rem_target, rem_hash = remote
            if not loc_base == rem_base:
                continue
            # the same relative path exists in the output folder
            if rem_hash == loc_hash:
                cancopy = False
            # drop the entry either way, so that a changed file is overwritten
            # by the copy below instead of being deleted by the -r pass
            remote_matchs.remove(remote)
            break
        if cancopy:
            outputloc = output_folder + loc_base
            print("Copy %s to %s" % (loc_target, outputloc))
            if not os.path.exists(os.path.dirname(outputloc)):
                os.makedirs(os.path.dirname(outputloc))
            shutil.copy2(loc_target, outputloc)

    # Remove Missing Files (whatever is left in remote_matchs has no input counterpart)
    if remove_files:
        print("Removing Missing Files")
        for remote in remote_matchs:
            rem_base, rem_target, rem_hash = remote
            print("Removing %s" % rem_target)
            os.remove(rem_target)

    # Remove Missing Folders
    if remove_folders:
        print("Removing Empty Folders")
        # repeat until a full pass removes nothing, so nested empty folders are cleared
        action = True
        while action:
            action = False
            for root, dirnames, filenames in os.walk(output_folder):
                for name in dirnames:
                    target = os.path.join(root, name)
                    if not os.listdir(target):
                        print("Remove Empty Folder %s" % target)
                        os.rmdir(target)
                        action = True
This script uses the Secure Hash Algorithm (SHA-256) to compare files between two folders (in my case, a development folder that contained my automatically generated content and a deployment folder that mirrored my webserver) and copies files only as necessary. From there, I was able to use WinSCP to synchronize my local deployment folder with my webserver and, because files within the deployment folder were only modified when their content changed (thanks to my SHA-256 copy script), the time needed to FTP my website updates was substantially reduced.
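As a rough illustration of the comparison step, the sketch below indexes each folder into a dictionary keyed by relative path, so that every file is looked up directly instead of being matched through a nested loop. This is a variation on the idea rather than the script itself: the names `file_digest`, `index_folder`, and `plan_sync`, along with the demo file names, are hypothetical, and the sketch only reports what would be copied or removed without touching any files.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=65536):
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_folder(folder):
    """Map each file's path relative to `folder` to its SHA-256 digest."""
    index = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            index[os.path.relpath(full, folder)] = file_digest(full)
    return index


def plan_sync(source, destination):
    """Return (to_copy, to_remove) as sorted lists of relative paths."""
    src = index_folder(source)
    dst = index_folder(destination)
    # copy files that are missing from the destination or differ in content
    to_copy = sorted(p for p in src if dst.get(p) != src[p])
    # remove files that no longer exist in the source
    to_remove = sorted(p for p in dst if p not in src)
    return to_copy, to_remove


def _write(folder, name, text):
    """Demo helper: create a small text file."""
    with open(os.path.join(folder, name), "w") as handle:
        handle.write(text)


# Demo folders standing in for the development and deployment folders
source = tempfile.mkdtemp()
destination = tempfile.mkdtemp()
_write(source, "same.html", "unchanged")
_write(destination, "same.html", "unchanged")
_write(source, "edited.html", "new text")
_write(destination, "edited.html", "old text")
_write(source, "added.html", "brand new")
_write(destination, "stale.html", "no longer generated")

to_copy, to_remove = plan_sync(source, destination)
print(to_copy)    # ['added.html', 'edited.html']
print(to_remove)  # ['stale.html']
```

Using `os.path.relpath` here avoids string replacement on paths, and the dictionary lookup keeps the comparison roughly linear in the number of files.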
Overall, I found this technique to be both effective and extremely useful for a multitude of automated applications.
Enjoy!
By Mike Mclain