I have to search for files that may have any extension. What all these files have in common is that they are less than five lines long (fewer than 4 \n\r) and, other than the line breaks, all characters are digits, spaces and commas. How would I write code that searches for files based on their content?
I am well aware this will take a long time to run.
My project does not require Java or Python; I simply mentioned them because I'm more familiar with them. PowerShell is a worthy suggestion.
I am running a Windows 7 system.
Something like the following should work:
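import os

# Python 3: iterating over bytes yields integers, so the set of allowed
# values is built from a bytes literal.
valid_chars = set(b'0123456789, \r\n')

# base is the directory to search; set it before running.
for root, dirs, files in os.walk(base):
    for fname in files:
        fpath = os.path.join(root, fname)
        with open(fpath, 'rb') as f:
            for i, line in enumerate(f):
                # Give up on a sixth line (i counts from 0) or a disallowed byte.
                if i >= 5 or not all(c in valid_chars for c in line):
                    break
            else:
                # The loop ended without break, so every line passed.
                print('found file: ' + fpath)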
Instead of not all(c in valid_chars for c in line), you could use regular expressions:
...
if i >= 5 or not re.match(br'[\d, \r\n]*$', line):
...
If you go with regex, use re.compile outside of the loop to improve efficiency.
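As a rough sketch of that advice, the pattern can be compiled once at module level and reused for every line. The helper name line_is_numeric is made up for illustration, and the pattern is a bytes pattern on the assumption that the files are opened in binary mode.

```python
import re

# Compiled once, outside any loop. Bytes pattern: assumes lines come from a
# file opened with 'rb'. In a bytes pattern, \d matches ASCII digits only.
NUMERIC_LINE = re.compile(br'[\d, \r\n]*$')

def line_is_numeric(line):
    """Hypothetical helper: True if the line holds only digits, commas,
    spaces and line-break characters."""
    return NUMERIC_LINE.match(line) is not None

print(line_is_numeric(b'1, 2, 3\r\n'))  # True
print(line_is_numeric(b'1, 2, x\r\n'))  # False
```

Inside the file loop, the check would then read: if i >= 5 or not line_is_numeric(line).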
import os

# Python 3: the files are read in binary mode, so the allowed values are bytes.
expected_chars = set(b' ,1234567890\n\r')
nlines = 5
max_file_size = 1000  # ignore files of 1000 bytes or more; this speeds things up

def process_dir(out, dirname, fnames):
    for fname in fnames:
        fpath = os.path.join(dirname, fname)
        if os.path.isfile(fpath):
            statinfo = os.stat(fpath)
            if statinfo.st_size < max_file_size:
                # binary mode, so arbitrary content cannot raise a decoding error
                with open(fpath, 'rb') as f:
                    # read the first n lines
                    firstn = [f.readline() for _ in range(nlines)]
                    # if there are any more lines left, this is not our file
                    if f.readline():
                        continue
                    # if the first n lines contain only spaces, commas, digits and
                    # line breaks, this is our kind of file: add it to the results
                    if not set(b''.join(firstn)) - expected_chars:
                        out.append(fpath)

out = []
# os.path.walk exists only in Python 2; os.walk is its Python 3 replacement
for dirname, dirnames, fnames in os.walk("/some/path/"):
    process_dir(out, dirname, fnames)
You can use grep with the -r and -l options. The -r option lets you search recursively over all the files in a directory, and -l prints only the names of the files whose content matches your regex.
grep -r -l -P -z '\A([0-9, ]+\s){1,4}[0-9, ]+\Z' directory
This should print the names of all files that contain at most 5 lines made up only of digits, spaces and commas. Because grep matches line by line by default, the -z option (GNU grep) is needed so that a whole file is treated as a single record, and -P enables the Perl-style syntax that \A, \Z and \s rely on.
\A and \Z anchor the match at the beginning and end of the subject text. [0-9, ]+ looks for a sequence of digits, spaces or commas, followed by \s, which matches a single whitespace character such as a line break, space or carriage return. This sequence can be repeated up to 4 times, represented by {1,4}, followed by another line of data.
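To get a feel for what the pattern accepts, it can be tried on in-memory strings with Python's re module. This is only an approximation of the grep call: Python's \Z matches strictly at the end of the string, while PCRE's \Z also matches just before a final line break, so the samples below deliberately have no trailing newline. The name looks_like_number_file is invented for this sketch.

```python
import re

# The same pattern as in the grep command, compiled once.
PATTERN = re.compile(r'\A([0-9, ]+\s){1,4}[0-9, ]+\Z')

def looks_like_number_file(text):
    """Hypothetical helper: apply the pattern to a file's whole content."""
    return PATTERN.match(text) is not None

print(looks_like_number_file('1, 2\n3, 4'))        # True
print(looks_like_number_file('1, 2\nabc'))         # False
print(looks_like_number_file('1\n2\n3\n4\n5\n6'))  # False: too many line breaks
```

Since [0-9, ] cannot match a line break, every line break has to be consumed by one of the \s tokens, which is what caps the number of lines.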
In Python (I'll only outline the steps so you can program it yourself, but of course feel free to ask if you run into problems):
Use os.path.walk (os.walk in Python 3) to find all files (it gives you all files, regardless of their extension).
The walk may also report directories; use os.path.isfile to skip them.
Open each file (with open). Do the following inside a with statement to avoid having to close the file by hand.
Check that the content consists only of the allowed characters; you can use a regular expression for that. If it does not match, continue.
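The steps above could be put together roughly as follows. This is a sketch under a few assumptions rather than a definitive implementation: it uses os.walk (the Python 3 replacement for os.path.walk), reads in binary mode, treats the five-line limit used in the other answers as the cutoff, and the names find_numeric_files, ALLOWED and MAX_LINES are invented for the example.

```python
import os
import re

# Whole-content pattern: only digits, commas, spaces, CR and LF.
# Bytes pattern, since files are read in binary mode.
ALLOWED = re.compile(br'[0-9, \r\n]*\Z')
MAX_LINES = 5  # assumption: same limit the other answers use

def find_numeric_files(base):
    """Hypothetical helper: return paths under base whose content matches."""
    found = []
    for root, dirs, files in os.walk(base):
        for fname in files:
            fpath = os.path.join(root, fname)
            if not os.path.isfile(fpath):
                continue  # skip anything that is not a regular file
            try:
                with open(fpath, 'rb') as f:
                    # One line more than allowed is enough to detect an excess.
                    lines = [f.readline() for _ in range(MAX_LINES + 1)]
            except OSError:
                continue  # unreadable file, e.g. permission denied
            if lines[MAX_LINES]:
                continue  # the file has more than MAX_LINES lines
            if ALLOWED.match(b''.join(lines)):
                found.append(fpath)
    return found
```

Calling find_numeric_files on the directory to search returns a list of matching paths. Note that an empty file also passes this check; add a size test if that is not wanted.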