Matching every word in a txt file

Question

I'm working on a Project Euler problem (for fun). It comes with a 46 KB txt file containing a single line with a list of over 5000 names in a format like this:

"MARIA","SUSAN","ANGELA","JACK"...

My plan is to write a method to extract every name and append it to a Python list. Is a regular expression the best weapon to tackle this problem?
I looked up the Python re docs, but am having a hard time figuring out the right regex.

Answer 1

That looks like a format that the csv module would be helpful with. Then you wouldn't have to write any regex.

Answer 2

If the format of the file is as you say it is, i.e.:

It's a single line
The format is like this: "MARIA","SUSAN","ANGELA","JACK"

Then this should work:

>>> import csv
>>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
>>> words = next(lines)
>>> words
['MARIA', 'SUSAN', 'ANGELA', 'JACK']
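
The same idea can be wrapped in a small function that also copes with more than one line. This is only a minimal sketch, assuming Python 3; the name parse_names is made up for illustration and is not part of the csv module. csv.reader accepts any iterable of strings, so the function works on an open file or on a plain list of lines:

```python
import csv

def parse_names(lines):
    """Return every comma-separated, optionally quoted field as one flat list.

    `lines` is any iterable of strings, e.g. an open text file.
    (parse_names is a hypothetical helper name, not a csv API.)
    """
    names = []
    for row in csv.reader(lines):
        # Each row is a list of fields with the surrounding quotes removed
        names.extend(row)
    return names

print(parse_names(['"MARIA","SUSAN","ANGELA","JACK"\n']))
# ['MARIA', 'SUSAN', 'ANGELA', 'JACK']
```

With a real file this would be called as parse_names(f) inside a "with open('words.txt', newline='') as f:" block, so the file is closed afterwards.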

Answer 3

A regexp will get the job done, but would be inefficient. Using csv would work, but it might not handle 5000 cells in a single line very well. At the very least it has to load the whole file in and maintain the entire list of names in memory (which might not be a problem for you because that's a very small amount of data). If you want an iterator for relatively large files (much larger than 5000 names), a state machine will do the trick:

def parse_chunks(chunks, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False

    buffer = ''

    for chunk in chunks:
        for byte in chunk:
            if in_escaped:
                # Previous char was the escape char, so add this one literally
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''                    
            else:
                buffer += byte

    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)

    # Yield the last bit in the buffer
    yield buffer

data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""
print(list(parse_chunks(data)))

# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']

# Iterating over a file object yields whole lines, so read fixed-size
# chunks explicitly if the file has only one long line or you don't care
# about line parsing
buffer_size = 4096
with open('myfile.txt', 'r') as f:
    for name in parse_chunks(iter(lambda: f.read(buffer_size), '')):
        print(name)
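
For comparison, the regexp route mentioned at the top of this answer could look roughly like the sketch below. It assumes Python 3 and that every name is wrapped in double quotes with no escaped quotes inside; NAME_RE and extract_names are illustrative names only. Unlike the state machine, it does not handle unquoted fields or backslash escapes.

```python
import re

# Matches one double-quoted field; the group captures the text between the quotes.
# Assumption: names never contain an escaped quote character.
NAME_RE = re.compile(r'"([^"]*)"')

def extract_names(text):
    """Return the contents of every double-quoted field in `text`."""
    return NAME_RE.findall(text)

print(extract_names('"MARIA","SUSAN","ANGELA","JACK"'))
# ['MARIA', 'SUSAN', 'ANGELA', 'JACK']
```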

Answer 4

If you can do it simpler, then do it simpler. No need to use the csv module. I don't think 5000 names or 46 KB is enough to worry about.

names = []

# In case there is more than one line...
with open("names.txt", "r") as f:
    for line in f:
        # extend rather than assign, so earlier lines are not overwritten
        names.extend(x.strip().replace('"', '') for x in line.split(","))

print(names)
# should print ['name1', ... , ...]

Matching every word in a txt file

Question

4 answers

solution1
3 2011-10-04 01:11:07

solution2
3 ACCEPTED 2011-10-04 01:43:09

solution3
1 2011-10-04 01:56:57

solution4
1 2011-10-04 01:58:34
