Pythonic way to process multiple for loops with different filters against the same list?

Here's a bit of a program I'm writing that will create a csv categorizing a directory of files:

import fnmatch
import os
import re

matches = []
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*[A-Z]*'):
        matches.append([os.path.join(root, filename), "No Capital Letters!"])

    test = re.compile(r".*\.(py|php)", re.IGNORECASE)
    for filename in filter(test.search, filenames):
        matches.append([os.path.join(root, filename), "Invalid File type!"])

Basically, the user picks a folder and the program denotes problem files, which can be of several types (just two listed here: no files with uppercase letters, no php or python files). There will be probably five or six cases.

While this works, I want to refactor. Is it possible to do something like

for filename in itertools.izip(fnmatch.filter(filenames, '*[A-Z]*'), filter(test.search, filenames), ...):
    matches.append([os.path.join(root, filename), "Violation"])

while being able to keep track of which of the original unzipped lists caused the "violation"?
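
The closest thing I can picture is chaining labelled results instead of zipping them. A purely hypothetical sketch (the CHECKS list and its lambdas are made up for illustration, and directory is the user-picked folder as above):

import fnmatch
import itertools
import os
import re

# Hypothetical: pair each filter over the filename list with the message it should report.
CHECKS = [
    (lambda names: fnmatch.filter(names, '*[A-Z]*'), "No Capital Letters!"),
    (lambda names: filter(re.compile(r".*\.(py|php)", re.IGNORECASE).search, names),
     "Invalid File type!"),
]

matches = []
for root, dirnames, filenames in os.walk(directory):
    # Tag each filtered filename with its check's message, then chain the tagged streams.
    tagged = itertools.chain.from_iterable(
        ((name, message) for name in check(filenames))
        for check, message in CHECKS
    )
    for name, message in tagged:
        matches.append([os.path.join(root, name), message])

But I suspect there is a cleaner way to express this.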

A simpler solution would probably be to just iterate over the files first and then apply your checks one by one:

reTest = re.compile(r".*\.(py|php)", re.IGNORECASE)
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        error = None
        if fnmatch.fnmatch(filename, '*[A-Z]*'):
            error = 'No capital letters!'
        elif reTest.search(filename):
            error = 'Invalid file type!'

        if error:
            matches.append([os.path.join(root, filename), error])

This will not only make the logic a lot simpler, since you only ever need to check a single file (instead of having to figure out each time how to call your check on a sequence of filenames), it will also iterate through the list of filenames only once.

Furthermore, it avoids generating multiple matches for a single file name; it adds at most one error (the first) per file. If you don't want that behaviour, you can make error a list and append to it in your checks; in that case change the elif to a plain if so that all the checks are evaluated, as in the sketch below.
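
For example, a minimal sketch of that variant (reusing reTest and matches from above), which records every violation for a file rather than only the first:

for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        errors = []
        if fnmatch.fnmatch(filename, '*[A-Z]*'):
            errors.append('No capital letters!')
        if reTest.search(filename):  # plain if, so every check runs
            errors.append('Invalid file type!')

        # One row per violation instead of only the first one.
        for error in errors:
            matches.append([os.path.join(root, filename), error])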

I recommend you look at these slides (David Beazley's Generator Tricks for Systems Programmers).

David Beazley gives an example of using yield to process log files.

Edit: here are two examples from the PDF, one without a generator:

wwwlog = open("access-log")
total = 0
for line in wwwlog:
    bytestr = line.rsplit(None, 1)[1]
    if bytestr != '-':
        total += int(bytestr)
print "Total", total

and with generators (for more complex examples you can use a function with yield):

wwwlog = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
