Problems reading all the files in a directory?

Question

I have a folder with lots of .txt files and I would like to read them. To do this, I first use a regex to capture only the important parts that I will work with. So I'm doing the following:

    txt_files =(path, '*.txt')
    important_stuff = re.findall("(\S+)\s+(NC\S+).*\n.*\s(\S+)\s+(AQ\S+)",txt_files)

    print important_stuff

The problem is that I'm getting TypeError: expected string or buffer. Any idea how to solve this?

Answer 1

A sounder approach might be:

import glob, re

txt_files = glob.glob('/the/path/ofthedirectory/*.txt')
important_stuff = [fn for fn in txt_files
                   if re.search(r"(\S+)\s+(NC\S+).*\n.*\s(\S+)\s+(AQ\S+)", fn)]

That's because (A) codecs.open opens a file for reading -- it does not open multiple files with wild cards, nor return file names; (B) re.findall works on a single string, and txt_files surely isn't one.
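To make point (B) concrete, here is a minimal sketch of the failure and of the call working on one string. The sample text and the names `sample` and `names` are invented for illustration, and the sketch is written for Python 3, where the message reads "expected string or bytes-like object" rather than "expected string or buffer".

```python
import re

pattern = r"(\S+)\s+(NC\S+).*\n.*\s(\S+)\s+(AQ\S+)"

# Hypothetical two-line sample, made up only to exercise the pattern.
sample = "gene1 NC_0001 extra\nnote gene2 AQ_0002"

# A list of file names is not a string, so re.findall rejects it.
names = ["a.txt", "b.txt"]
try:
    re.findall(pattern, names)
    raised = False
except TypeError:
    raised = True

# A single string is accepted and yields one tuple of groups per match.
matches = re.findall(pattern, sample)
```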

The glob-based snippet assumes you're selecting important_stuff based on filenames. If you're actually selecting on the files' contents, you'll need to open and read each file anyway, so a list comprehension becomes a bit unwieldy and one might prefer, e.g.:

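import codecs

important_stuff = dict()
for fn in txt_files:
    # codecs.open takes (filename, mode, encoding): the mode comes before the encoding
    with codecs.open(fn, 'r', 'utf-8') as f:
        contents = f.read()
        # keep the file only if its contents match the pattern
        if re.search(r"(\S+)\s+(NC\S+).*\n.*\s(\S+)\s+(AQ\S+)", contents):
            important_stuff[fn] = contents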

Here, I'm building a dict from filename to file contents, to avoid having to open and read each file twice -- once to check if it's "important stuff", then again later to process it if it is. If all of this doesn't fit in memory, ah well, the double reading may be simpler -- then we'd go back to important_stuff = list() and important_stuff.append(fn) inside the if, and later we'd again open and read the filenames thus recorded as "important stuff".
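The list-based, read-twice variant described above might look roughly like this. It is only a sketch for Python 3: it uses the built-in open with an encoding argument instead of codecs.open, and the helper name `find_important` plus the throwaway files `hit.txt` and `miss.txt` are hypothetical, created in a temporary directory so the example runs on its own.

```python
import glob
import os
import re
import tempfile

PATTERN = re.compile(r"(\S+)\s+(NC\S+).*\n.*\s(\S+)\s+(AQ\S+)")

def find_important(directory):
    """First pass: return the .txt file names whose contents match."""
    important = []
    for fn in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        with open(fn, encoding='utf-8') as f:
            if PATTERN.search(f.read()):
                important.append(fn)  # record the name only, not the text
    return important

# Demo on throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'hit.txt'), 'w', encoding='utf-8') as f:
        f.write("gene1 NC_0001 extra\nnote gene2 AQ_0002\n")
    with open(os.path.join(d, 'miss.txt'), 'w', encoding='utf-8') as f:
        f.write("nothing relevant here\n")

    important_stuff = find_important(d)

    # Second pass: reopen only the recorded files when it is time to process them.
    first_lines = {}
    for fn in important_stuff:
        with open(fn, encoding='utf-8') as f:
            first_lines[os.path.basename(fn)] = f.readline().rstrip("\n")
```

The trade-off is the one named above: memory stays small because only names are kept, at the cost of reading each matching file a second time.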

There may be more to do if those groups matched in the re.search need to be preserved (to avoid scanning for them again), but that's just too hard to guess w/o further info on your part!-)
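If the matched groups do need to be kept, one possible approach (a guess at the requirement, not something the question confirms) is to store the tuples returned by findall instead of the raw contents. The helper `collect_groups` and the in-memory `texts` dict standing in for file contents are hypothetical.

```python
import re

PATTERN = re.compile(r"(\S+)\s+(NC\S+).*\n.*\s(\S+)\s+(AQ\S+)")

def collect_groups(contents_by_name):
    """Map each name to the list of group tuples found in its text."""
    found = {}
    for name, text in contents_by_name.items():
        groups = PATTERN.findall(text)
        if groups:  # skip entries with no match at all
            found[name] = groups
    return found

# Hypothetical stand-ins for the contents of two files.
texts = {
    'a.txt': "gene1 NC_0001 extra\nnote gene2 AQ_0002\n",
    'b.txt': "nothing relevant\n",
}
important_stuff = collect_groups(texts)
```

Each value is then a list of 4-tuples, one per match, so later processing can use the captured fields without running the regex again.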

Answer 2

You can't use a regex (or glob expansion) in codecs.open. It expects a single file name. That's why you get the error.

So you can't do this:

txt_files = [(codecs.open('/the/path/ofthedirectory/*.txt','r','utf8')).readlines()]

You should use something like os.listdir, os.walk, or glob.iglob (the iterator variant of glob.glob), filter the results, and then open each file.

So you get something like this:

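import os, re
# filter to have only txts
txt_files = [p for p in os.listdir('/path/to/dir') if p.endswith('.txt')]
# do your filtering: re.findall needs a string, so read each file first
important_stuff = []
for p in txt_files:
    with open(os.path.join('/path/to/dir', p)) as f:
        important_stuff.extend(re.findall(r"(\S+)\s+(NC\S+).*\n.*\s(\S+)\s+(AQ\S+)", f.read()))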
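For comparison, the three enumeration options mentioned in this answer could be sketched as follows. The function names and the temporary directory layout are hypothetical and exist only for the demo; note that os.walk also descends into subdirectories, while os.listdir and glob.iglob with this pattern stay in the top-level directory.

```python
import glob
import os
import tempfile

def txt_via_listdir(directory):
    """Top-level .txt files, filtered by suffix."""
    return sorted(os.path.join(directory, p)
                  for p in os.listdir(directory) if p.endswith('.txt'))

def txt_via_iglob(directory):
    """Top-level .txt files, matched lazily by a glob pattern."""
    return sorted(glob.iglob(os.path.join(directory, '*.txt')))

def txt_via_walk(directory):
    """All .txt files, including those in subdirectories."""
    hits = []
    for root, _dirs, files in os.walk(directory):
        hits.extend(os.path.join(root, p) for p in files if p.endswith('.txt'))
    return sorted(hits)

# Demo on a throwaway tree: a.txt, b.log, and sub/c.txt.
with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, 'sub'))
    for rel in ('a.txt', 'b.log', os.path.join('sub', 'c.txt')):
        open(os.path.join(d, rel), 'w').close()

    flat = [os.path.relpath(p, d) for p in txt_via_listdir(d)]
    globbed = [os.path.relpath(p, d) for p in txt_via_iglob(d)]
    deep = [os.path.relpath(p, d) for p in txt_via_walk(d)]
```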

2 answers

solution1
2 2014-12-27 01:52:03

solution2
0 2014-12-27 01:46:05
