
Python - regex to complete hyphens in standard filename format

This program walks a directory and, where possible, renames files to a specific format of whitespace, hyphens, etc. The function regexSubFixGrouping() corrects improper whitespace in filenames; checkProper() shows exactly the format required.

proper format:

201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.pdf
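
To make the individual fields explicit, here is a rough sketch of the same format written as a verbose regex. The group names (term, number, subject, course, section, title, instructor) and the assumption that the term is always six digits are mine, purely for illustration; they are not part of the program below.

```python
import re

# Hypothetical field names; the layout itself follows the "proper format" line above.
PROPER = re.compile(r'''
    (?P<term>\d{6})          # term, e.g. 201308 (assumed to be six digits)
    \s-\s
    \((?P<number>\d{5})\)    # five-digit number in parentheses
    \s-\s
    (?P<subject>\w{3})\s     # three-character subject code, then one space
    (?P<course>\d{4}\w?)     # four digits plus an optional suffix character
    -(?P<section>\d{3})      # hyphen and three-digit section, no spaces
    -(?P<title>[^.]+)        # hyphen and course title, no spaces around the hyphen
    \s-\s
    (?P<instructor>[^.]+)    # instructor name
    \.pdf$
''', re.VERBOSE)

good = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.pdf'
m = PROPER.match(good)
print(m.groupdict() if m else 'not in proper format')
```
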

Everything works pretty well, except that the regex should also insert any of the first 4 hyphens that may be missing. I'm not overly concerned about extra hyphens at this point; maybe down the road. Mainly, I just want it to insert any of the first 4 missing hyphens (while keeping all its current functionality of correcting whitespace, etc.).

Methods:

import os
import re

def readDir(path1):
    """ Returns the names of the regular files (not subdirectories) in path1 """
    return [ f for f in os.listdir(path1) if os.path.isfile(os.path.join(path1,f)) ]

def checkProper(f,term):
    return re.match(term + r'\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^.]+\s-\s[^.]+\.pdf', f)


def regexSubFixGrouping(f,term):
    """ Much improved version of regexSubFix(). Corrects improper whitespace in filename """
    return re.sub(term + r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s\d{4}\w?-\d{3}\s*-\s*(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.pdf)$',
          lambda match: term+' - {0} - {1} - {2}'.format(match.group(1),
          re.sub(r'\s*-\s*', '-', match.group(2)),
          match.group(3)) ,
          f)

def properFiles(dir1,term,path1):
    """ Main functionality. Goes through list of files in directory, separates good from bad and fixes what it can. """
    goodMatch = []; stillWrong = []; goodFix = []
    for f in dir1:
        result = checkProper(f,term)
        if result:
            goodMatch.append(result.group(0))
        else:
            fixed = regexSubFixGrouping(f,term)
            if checkProper(fixed,term):
                os.rename(os.path.join(path1,f), os.path.join(path1,fixed))
                goodFix.append(fixed)
            else:
                # Could not be fixed: prefix the name with '@ ' so it stands out
                os.rename(os.path.join(path1,f), os.path.join(path1,'@ '+fixed))
                stillWrong.append(fixed)
    goodToGo = len(goodMatch)+len(goodFix)
    total = len(dir1)
    # Guard against an empty directory
    successRate = (goodToGo/float(total))*100.0 if total else 0.0
    print("%d total files. %d files now in proper format. %0.2f%% success rate."%(total,goodToGo,successRate))
    print("All files not in proper format are prefixed with @ to be clearly marked for the user.")
    return goodMatch, goodFix, stillWrong

So it should be able to fix filenames with these (missing hyphen) errors:

201308 - (82431) - MAC 1105-006 College Algebra - Graziose, James.pdf

201308 - (82610) - MAC 2233-009 Methods of Calculus - Grigoriev, Stepan.pdf

And also errors where the 3 capital letters after the 2nd hyphen don't have a space before the 4 digits that follow them:

201308 - (91500) - MAC1105-014 - College Algebra - Radulovic, AiBeng.pdf

If possible, I'd like to just adjust the regexSubFixGrouping() function rather than use system resources running more regexes than necessary. I'm teaching myself Python, so I'm sure any junior programmer could do this, but a pro who happens on this question could probably straighten it out easily.

EDIT: Remaining outliers:

201308 - (82442) - MAC 1105 - 012 - College Algebra - Harmon, Drake.pdf
201308 - (92835) - MAC 1105 - 017 - College Algebra - Harmon, Drake.pdf
201308 - (95125) - MAC1147-004 - Precaclculus Algebra & Trig - Greenberg, Alisa.pdf
201308 - (82600) - MAC1147-002 - Precaclculus Algebra & Trig - Greenberg, Alisa.pdf

I'm not sure why the first 2 weren't caught; they really seem to be fixable. For the second 2, I'm not sure why it didn't separate MAC from 1147 with a space.

You can edit the second re.sub in the function, although you'll also have to edit the first re.sub to accommodate the change:

return re.sub(term + r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s?\d{4}\w?-?\d{3}\s*-?\s*(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.pdf)$',
      lambda match: term+' - {0} - {1} - {2}'.format(match.group(1),
      re.sub(r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)', r'\1 \2-\3-\4', match.group(2)),
      match.group(3)) ,
      f)

The second re.sub now parses the 'middle part' from scratch.
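
To see what that inner substitution does in isolation, here is a small sketch that applies it to a few middle parts taken from the filenames in the question. The names MIDDLE and fix_middle are just illustrative helpers, not part of the original code.

```python
import re

# Pattern and replacement copied from the second re.sub above.
MIDDLE = re.compile(r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)')

def fix_middle(middle):
    """Rebuild the 'MAC 1105-006-College Algebra' shape from a loosely formatted middle part."""
    return MIDDLE.sub(r'\1 \2-\3-\4', middle)

samples = [
    'MAC 1105-006 College Algebra',      # hyphen before the title is missing
    'MAC1105-014 - College Algebra',     # no space after the subject code
    'MAC 2233-007-Methods of Calculus',  # already in the proper shape
]
for s in samples:
    print(fix_middle(s))
```

Each sample comes out as subject, space, course number, hyphen, section, hyphen, title, with no spaces around the two hyphens.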

I don't know how that will affect the file names that already worked, though, since I made the regex more flexible so that it accepts those 'wrong formats'.
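
One way to check is to run the modified function over names that are already correct and confirm they come back unchanged. The sketch below does that with the function exactly as posted above; the helper changed_names and the sample list are hypothetical, so substitute names from your own directory.

```python
import re

def regexSubFixGrouping(f, term):
    # Same two substitutions as in the function posted above.
    return re.sub(term + r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s?\d{4}\w?-?\d{3}\s*-?\s*(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.pdf)$',
                  lambda match: term + ' - {0} - {1} - {2}'.format(
                      match.group(1),
                      re.sub(r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)', r'\1 \2-\3-\4', match.group(2)),
                      match.group(3)),
                  f)

def changed_names(names, term):
    """Return (before, after) pairs for every name the function would alter."""
    return [(n, regexSubFixGrouping(n, term)) for n in names
            if regexSubFixGrouping(n, term) != n]

# Hypothetical list of names known to be in the proper format already.
already_good = [
    '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.pdf',
]
print(changed_names(already_good, '201308'))  # an empty list means nothing regressed
```

Note that re.sub returns its input untouched when the pattern does not match at all, so an empty result only tells you that good names were not altered, not that every bad name is fixable.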

EDIT: I didn't account for " & " in titles and forgot to allow spaces around the third hyphen. Use this pattern (still prefixed with term) for the first re.sub:

\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s*\d{4}\w?\s*-?\s*\d{3}\s*-?\s*(?:[^.\s]|\b\s\b|\s&\s)+)\s*-\s*([^.]+\.pdf)$
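
Put together, the function might look like the sketch below. It only combines the pattern above with the second re.sub from earlier; the constant names TAIL and MIDDLE are mine, and I have only tried it on the filenames quoted in the question.

```python
import re

# Outer pattern: the regex above, to be prefixed with term.
TAIL = (r'\s*-\s*(\(\d{5}\))\s*-\s*'
        r'(\w{3}\s*\d{4}\w?\s*-?\s*\d{3}\s*-?\s*(?:[^.\s]|\b\s\b|\s&\s)+)'
        r'\s*-\s*([^.]+\.pdf)$')
# Inner pattern: re-parses the middle part from scratch.
MIDDLE = r'(\w{3})\s?(\d{4}\w?)\s*-?\s*(\d{3})\s*-?\s*(.*)'

def regexSubFixGrouping(f, term):
    """ Corrects whitespace and inserts missing hyphens in the middle part """
    return re.sub(term + TAIL,
                  lambda match: term + ' - {0} - {1} - {2}'.format(
                      match.group(1),
                      re.sub(MIDDLE, r'\1 \2-\3-\4', match.group(2)),
                      match.group(3)),
                  f)

outliers = [
    '201308 - (82442) - MAC 1105 - 012 - College Algebra - Harmon, Drake.pdf',
    '201308 - (95125) - MAC1147-004 - Precaclculus Algebra & Trig - Greenberg, Alisa.pdf',
]
for name in outliers:
    print(regexSubFixGrouping(name, '201308'))
```

One caveat: the inner pattern still uses \s? between the subject code and the course number, so a name with two or more spaces there would match the outer pattern but keep its middle part as it was.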
