Posted: 10/20/2015 6:00AM
Mike Mclain discusses how to use Python to make Windows file copying more useful
While iterating through a number of revisions of my Python website generator, one obstacle I kept encountering was finding a method that would minimize the number of FTP transactions needed to upload my website to my webserver each time I compiled my content.
While there are a number of off-the-shelf software solutions (like WinSCP) that can, to some extent, quickly resolve this particular problem, most of these solutions rely on system-defined file timestamps for change detection. Such detection is rather ineffective (albeit in a conservative way, since it errs toward re-uploading) at identifying unmodified content when used in conjunction with an automated content generator, because the majority of file content remains the same between runs while the timestamp is updated every time the generator re-creates a file.
Although this particular scenario is typically the exception rather than the rule, it does arise frequently during the development of automated applications (since it is oftentimes quicker to re-create file content than to add exclusionary logic to the application's code), and such cases are seldom easily resolved by the off-the-shelf software solutions readily available.
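As a minimal illustration of the timestamp problem described above (this is not part of the original script), the sketch below rewrites a file twice with identical content and compares both detection methods. It assumes Python 3; the helper names `sha256_of` and `regenerate`, the file name `index.html`, and the explicit timestamps (set with `os.utime` to simulate two generator runs ten minutes apart) are all hypothetical choices made for the example.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def regenerate(path, content, mtime):
    """Rewrite `path` with `content`, as a generator would, and stamp it with `mtime`.

    The explicit mtime is an assumption of this sketch: it stands in for the
    time of each generator run so the result does not depend on clock resolution.
    """
    with open(path, "w") as f:
        f.write(content)
    os.utime(path, (mtime, mtime))


workdir = tempfile.mkdtemp()
page = os.path.join(workdir, "index.html")  # hypothetical generated page

# First generator run.
regenerate(page, "<h1>Hello</h1>", 1000000000)
first = (os.path.getmtime(page), sha256_of(page))

# Second run, ten minutes later, producing exactly the same content.
regenerate(page, "<h1>Hello</h1>", 1000000600)
second = (os.path.getmtime(page), sha256_of(page))

timestamp_changed = first[0] != second[0]
content_changed = first[1] != second[1]
print("timestamp changed:", timestamp_changed)  # timestamp changed: True
print("content changed:", content_changed)      # content changed: False
```

A timestamp-based tool would upload this file again, while a hash comparison correctly reports that nothing changed.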
While there is likely an abundance of potential solutions to this particular problem, I decided to create a simple Python script:
# A Python script to synchronize the contents of a folder to another folder using the sha-256 algorithm
import fnmatch
import os
import hashlib
import shutil
import getopt
import sys

# a generic function to generate the SHA-256 digest of a file
def hash_SHA_256(filename):
    hashtype = hashlib.sha256()
    readbuffer = 65536
    # read in fixed-size chunks so large files are never loaded into memory at once;
    # the with-statement closes the file even if a read fails
    with open(filename, 'rb') as f:
        data = f.read(readbuffer)
        while len(data) > 0:
            hashtype.update(data)
            data = f.read(readbuffer)
    return hashtype.digest()

# the main application
if __name__ == "__main__":
    input_folder = ""
    output_folder = ""
    remove_files = False
    remove_folders = False
    # use getopt to handle command line arguments
    try:
        optlist, args = getopt.getopt(sys.argv[1:], 'i:o:rf?')
    except getopt.GetoptError as err:
        print("Error: Bad Parameters (%s)" % err)
        sys.exit(1)
    for o, a in optlist:
        if o == "-?":
            print("*"*60)
            print("-i\tInput Folder")
            print("-o\tOutput Folder")
            print("-r\tRemove files from the output folder that are missing in input folder")
            print("-f\tRemove Empty Folders from the output folder")
            print("-?\tHelp")
            print("*"*60)
            sys.exit()
        elif o == "-i":
            input_folder = a
        elif o == "-o":
            output_folder = a
        elif o == "-r":
            remove_files = True
        elif o == "-f":
            remove_folders = True
        else:
            print("Error: Bad Parameters")
            sys.exit(1)
    # refuse to run without both folders, otherwise -r could empty the output folder
    if not os.path.isdir(input_folder) or not output_folder:
        print("Error: -i must name an existing folder and -o must be given")
        sys.exit(1)
    print("Processing input folder")
    # Get a list of all input subdirectory files and folders
    local_matchs = []
    for root, dirnames, filenames in os.walk(input_folder):
        # '*' matches every file name, including names without an extension
        for filename in fnmatch.filter(filenames, '*'):
            targetpath = os.path.join(root, filename)
            # path relative to the input folder, used to pair files across folders
            basepath = os.path.relpath(targetpath, input_folder)
            targethash = hash_SHA_256(targetpath)
            local_matchs.append((basepath,targetpath,targethash))
    print("Processing output folder")
    # Get a list of all output subdirectory files and folders
    remote_matchs = []
    for root, dirnames, filenames in os.walk(output_folder):
        for filename in fnmatch.filter(filenames, '*'):
            targetpath = os.path.join(root, filename)
            basepath = os.path.relpath(targetpath, output_folder)
            targethash = hash_SHA_256(targetpath)
            remote_matchs.append((basepath,targetpath,targethash))
    #If file is missing copy file
    #If file is not the same copy file
    for local in local_matchs:
        loc_base,loc_target,loc_hash = local
        cancopy = True
        for remote in remote_matchs:
            rem_base,rem_target,rem_hash = remote
            if not loc_base == rem_base:
                continue
            else:
                if rem_hash == loc_hash:
                    cancopy = False
                # the file exists in the input folder, so drop it from the list;
                # whatever remains afterwards is missing from the input folder
                remote_matchs.remove(remote)
                break
        if cancopy:
            outputloc = os.path.join(output_folder, loc_base)
            print("Copy %s to %s" % (loc_target,outputloc))
            if not os.path.exists(os.path.dirname(outputloc)):
                os.makedirs(os.path.dirname(outputloc))
            shutil.copy2(loc_target,outputloc)
    # Remove Missing Files
    if remove_files:
        print("Removing Missing Files")
        for remote in remote_matchs:
            rem_base,rem_target,rem_hash = remote
            print("Removing %s" % rem_target)
            os.remove(rem_target)
    if remove_folders:
        # Remove Empty Folders
        print("Removing Empty Folders")
        # repeat until a pass removes nothing, so that folders emptied by an
        # earlier removal are removed as well
        action = True
        while action:
            action = False
            for root, dirnames, filenames in os.walk(output_folder):
                for name in dirnames:
                    target = os.path.join(root,name)
                    if not os.listdir(target):
                        print("Remove Empty Folder %s" % target)
                        os.rmdir(target)
                        action = True
This script uses the Secure Hash Algorithm (SHA-256) to compare files between two folders (in my case, a development folder that contained my automatically generated content and a deployment folder that mirrored my webserver) and updates files only as necessary. At this point, I was able to use WinSCP to synchronize my local deployment folder with my webserver and, because files within the deployment folder were only modified when their content actually changed (thanks to my SHA-256 copy Python script), the time needed to FTP my website updates was substantially reduced.
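The comparison step described above can also be sketched with dictionaries, so that each file is looked up by its relative path rather than found by scanning a list. This is only a sketch under stated assumptions, not the script above: it targets Python 3, the names `hash_tree`, `plan_sync`, and `write` are hypothetical, and it only reports what would be copied or removed without touching any files.

```python
import hashlib
import os
import tempfile


def hash_tree(folder):
    """Map each file's path (relative to `folder`) to its SHA-256 hex digest."""
    digests = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            sha = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 64 KiB chunks until read() returns an empty bytes object.
                for chunk in iter(lambda: f.read(65536), b""):
                    sha.update(chunk)
            digests[os.path.relpath(full, folder)] = sha.hexdigest()
    return digests


def plan_sync(source, target):
    """Return (paths to copy, paths to remove) for two path -> digest dicts."""
    # Copy when the path is missing from the target or its digest differs.
    to_copy = sorted(p for p, h in source.items() if target.get(p) != h)
    # Remove when the path exists only in the target.
    to_remove = sorted(p for p in target if p not in source)
    return to_copy, to_remove


def write(folder, name, text):
    """Hypothetical helper that creates a small text file for the demo."""
    with open(os.path.join(folder, name), "w") as f:
        f.write(text)


# Two throwaway folders standing in for the development and deployment folders.
dev = tempfile.mkdtemp()
deploy = tempfile.mkdtemp()
write(dev, "same.html", "unchanged")
write(deploy, "same.html", "unchanged")
write(dev, "edited.html", "new text")
write(deploy, "edited.html", "old text")
write(dev, "added.html", "brand new")
write(deploy, "stale.html", "no longer generated")

to_copy, to_remove = plan_sync(hash_tree(dev), hash_tree(deploy))
print(to_copy)    # ['added.html', 'edited.html']
print(to_remove)  # ['stale.html']
```

Separating the plan from the file operations makes it easy to preview a synchronization before anything is copied or deleted.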
Overall, I found this technique to be both effective and extremely useful for a multitude of automated applications.
Enjoy!
By Mike Mclain