
How to find all text files, regardless of extension, that contain only commas and digits?

I have to search for files that may have any extension. What all these files have in common is that they are fewer than five lines long (fewer than 4 \r\n line breaks) and, apart from the line breaks, every character is a digit, space or comma. How would I write code that searches for files based on their content?

I am well aware this will take a long time to run.

My project does not require Java or Python; I simply mentioned them because I'm more familiar with them. PowerShell is a worthy suggestion.

I am running a Windows 7 system.

Something like the following should work:

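import os

base = '.'  # placeholder: the directory to start searching from

# bytes literal: the files are opened with 'rb', and iterating over bytes
# yields 1-char strings on Python 2 but ints on Python 3
valid_chars = set(b'0123456789, \r\n')
for root, dirs, files in os.walk(base):
    for fname in files:
        fpath = os.path.join(root, fname)
        with open(fpath, 'rb') as f:
            for i, line in enumerate(f):
                # i is 0-based, so i >= 4 means a fifth line exists
                if i >= 4 or not all(c in valid_chars for c in line):
                    break
            else:
                # the loop ended without break: every line passed
                print('found file: ' + fpath)

Instead of not all(c in valid_chars for c in line), you could use a regular expression (this needs import re):

            ...
                if i >= 4 or not re.match(br'[\d, \r\n]*$', line):
            ...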

If you go with the regex, call re.compile once outside the loop to improve efficiency.
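As a rough sketch of that advice, the pattern can be compiled once at module level and reused for every line. The names LINE_RE, MAX_LINES and lines_ok are made up for this illustration, and the four-line limit assumes "fewer than five lines" is meant literally.

```python
import re

# Compiled once, then reused for every line of every file.
# A bytes pattern, so it works on lines read from a file opened with 'rb'.
LINE_RE = re.compile(br'[\d, \r\n]*$')

MAX_LINES = 4  # assumption: "fewer than five lines" means at most four


def lines_ok(lines, max_lines=MAX_LINES):
    """Return True if `lines` yields at most `max_lines` items and every
    item consists only of digits, commas, spaces, CR and LF."""
    for i, line in enumerate(lines):
        if i >= max_lines or not LINE_RE.match(line):
            return False
    return True
```

A file object opened with open(path, 'rb') can be passed directly as lines, since iterating over it yields one line at a time. Note that an empty file passes this check, so filter those out separately if they are not wanted.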

import os

# bytes literal, because the files are opened in binary mode below
expected_chars = set(b' ,1234567890\n\r')
nlines = 4  # the question asks for fewer than five lines
max_file_size = 1000  # ignore files of 1000 bytes or more, this will speed things up


def process_dir(out, dirname, fnames):
    for fname in fnames:
        fpath = os.path.join(dirname, fname)

        if os.path.isfile(fpath):

            statinfo = os.stat(fpath)

            if statinfo.st_size < max_file_size:
                with open(fpath, 'rb') as f:
                    # read the first n lines
                    firstn = [f.readline() for _ in range(nlines)]

                    # if there are any more lines left this is not our file
                    if f.readline():
                        continue

                    # if the first n lines contain only spaces, commas, digits and new lines
                    # this is our kind of file; add it to the results.
                    if not set(b''.join(firstn)) - expected_chars:
                        out.append(fpath)


out = []
# os.path.walk was removed in Python 3; os.walk covers the same ground
for dirname, dirs, fnames in os.walk("/some/path/"):
    process_dir(out, dirname, fnames)

You can use grep's -r and -l options. -r searches recursively through all the files in a directory, and -l prints only the names of the files whose content matches your regex. Because the pattern spans several lines and uses Perl-style anchors, it also needs -P (Perl-compatible regex) and -z (treat the whole file as one record instead of matching line by line). Both are GNU grep options, so this assumes a reasonably recent GNU grep built with PCRE support.

grep -r -l -P -z '\A([0-9, ]+\s){0,3}[0-9, ]+\Z' directory

This prints the names of all files that have fewer than 5 lines of digits, spaces and commas.

\A and \Z anchor the match to the beginning and end of the subject text. [0-9, ]+ looks for a run of digits, spaces or commas, followed by \s, which matches a line break, space or carriage return. That sequence can repeat up to 3 times, written {0,3}, and is followed by one final line of data, which gives at most four lines.
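Since grep is not part of a stock Windows 7 install, the same whole-file idea can be sketched in Python. This is an illustration under stated assumptions rather than a translation of the grep command: the names CONTENT_RE and content_ok are hypothetical, line breaks are matched explicitly as \r\n, \n or a lone \r instead of \s, empty lines are allowed, and the limit is taken to be four lines.

```python
import re

# Up to three "line + line break" groups, then one final line that may
# or may not end with a line break: four lines at most.
# Assumption: a line break is \r\n, \n or a lone \r.
CONTENT_RE = re.compile(
    br'(?:[0-9, ]*(?:\r\n|\n|\r)){0,3}'
    br'[0-9, ]*(?:\r\n|\n|\r)?'
)


def content_ok(data):
    """Return True if the bytes in `data` form at most four lines made up
    of digits, commas and spaces only."""
    # fullmatch anchors the pattern at both ends, like \A ... \Z
    return CONTENT_RE.fullmatch(data) is not None
```

Because this approach needs the whole file in memory, it pairs naturally with a size check such as os.stat(path).st_size, so that large files are skipped before they are read.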

In Python (I'll only outline the steps so you can program it yourself, but of course feel free to ask if you run into problems):

  • Use os.walk to find all files (it gives you all files, regardless of their extension). In Python 2, os.path.walk does the same job; it was removed in Python 3.
  • Note that os.path.walk also hands you directories etc., so use os.path.isfile to skip them. os.walk already lists files separately from directories.
  • For each file:
    • Open it (open). Do the following inside a with statement to avoid having to close the file by hand.
    • You could first count the lines and then check the characters, but that's probably slower, so:
    • Read the file line by line. For each line, do two things:
      • Count the lines. If you arrive at 5, give up on this file and go on with the next one.
      • Check whether the line matches the comma criterion. I'd use a regular expression for that. If it does not match, go on with the next file.
    • If you reach the end of the file, you were successful, so you can print the filename or do whatever else you want with it.
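The steps above could be put together roughly as follows. This is only a sketch: the names ALLOWED_LINE, MAX_LINES, file_matches and find_matching_files are invented for the example, the four-line limit assumes "fewer than five lines" is meant literally, and unreadable files are simply treated as non-matches.

```python
import os
import re

# digits, commas, spaces and line-ending characters only
ALLOWED_LINE = re.compile(br'[0-9, \r\n]*')
MAX_LINES = 4  # assumption: "fewer than five lines" means at most four


def file_matches(path):
    """Check one file line by line, giving up at the first failure."""
    try:
        with open(path, 'rb') as f:
            for count, line in enumerate(f, start=1):
                if count > MAX_LINES:
                    return False  # too many lines
                if not ALLOWED_LINE.fullmatch(line):
                    return False  # found some other character
    except OSError:
        return False  # unreadable file (locked, no permission, ...)
    return True  # reached the end of the file without a failure


def find_matching_files(base):
    """Yield the path of every matching file below `base`."""
    for root, dirs, files in os.walk(base):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and file_matches(path):
                yield path
```

Looping over find_matching_files(some_directory) and printing each path gives the list of candidates as they are found, without waiting for the whole walk to finish. Empty files count as matches here, so add a size check if they should be excluded.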


 