
Matching every word in a txt file

I'm working on a Project Euler problem (for fun). It comes with a 46 KB txt file containing a single line with a list of over 5000 names, in a format like this:

"MARIA","SUSAN","ANGELA","JACK"...

My plan is to write a method that extracts every name and appends it to a Python list. Is a regular expression the best tool for this problem?
I looked up the Python re docs, but I'm having a hard time figuring out the right regex.

That looks like a format that the csv module would be helpful with. Then you wouldn't have to write any regex.

If the format of the file is as you say it is, i.e.

  1. It's a single line
  2. The format is like this: "MARIA","SUSAN","ANGELA","JACK"

Then this should work:

>>> import csv
>>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
>>> words = next(lines)
>>> words
['MARIA', 'SUSAN', 'ANGELA', 'JACK']
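The same csv approach can be tried without the file. This is a minimal sketch that assumes Python 3 and feeds the sample line from the question through `io.StringIO`; the function name `read_names` is made up for illustration and is not part of the csv module.

```python
import csv
import io


def read_names(fileobj):
    """Return every name from a file-like object of comma-separated, quoted names."""
    names = []
    # csv.reader yields one list of fields per line and strips the quotes
    for row in csv.reader(fileobj, delimiter=','):
        names.extend(row)
    return names


# Stand-in for open('words.txt'): the sample line from the question
sample = io.StringIO('"MARIA","SUSAN","ANGELA","JACK"\n')
print(read_names(sample))
```

With the real file, passing `open('words.txt', newline='')` in place of the `StringIO` object should behave the same way.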

A regexp will get the job done, but would be inefficient. Using csv would work, but it might not handle 5000 cells in a single line very well. At the very least, it has to load the whole file and keep the entire list of names in memory (which might not be a problem for you, because that's a very small amount of data). If you want an iterator for relatively large files (much larger than 5000 names), a state machine will do the trick:

def parse_chunks(chunks, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False

    buffer = ''

    for chunk in chunks:
        for byte in chunk:
            if in_escaped:
                # The previous char was the escape char, so add this one literally
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''                    
            else:
                buffer += byte

    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)

    # Yield the last bit in the buffer
    yield buffer

data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""
print(list(parse_chunks(data)))

# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']

# Iterating over a file object yields whole lines, so a file with only one
# long line would be read in one piece. Read fixed-size chunks instead.
buffer_size = 4096
with open('myfile.txt', 'r') as f:
    for name in parse_chunks(iter(lambda: f.read(buffer_size), '')):
        print(name)
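For comparison, the regular-expression route mentioned at the start of this answer might look like the sketch below. It assumes every name is wrapped in double quotes and contains no escaped quotes; `NAME_PATTERN` and `find_names` are illustrative names, not from the original answer.

```python
import re

# Matches one double-quoted field and captures the text between the quotes.
# Assumption: names never contain an escaped quote, unlike the parser above.
NAME_PATTERN = re.compile(r'"([^"]*)"')


def find_names(text):
    """Return the contents of every double-quoted field in text."""
    return NAME_PATTERN.findall(text)


print(find_names('"MARIA","SUSAN","ANGELA","JACK"'))
```

Unquoted fields such as `TED` in the test data above would be skipped by this pattern, so it only suits input that is quoted throughout.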

If you can do it simpler, then do it simpler. There's no need to use the csv module. I don't think 5000 names or 46 KB is enough to worry about.

names = []
with open("names.txt", "r") as f:
    # In case there is more than one line...
    for line in f:
        # extend, so that names from earlier lines are kept
        names.extend(x.strip().replace('"', '') for x in line.split(","))

print(names)
# should print ['name1', ... , ...]
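One caveat of splitting on commas is that it assumes no name contains a comma itself. The sketch below wraps the same expression in a hypothetical helper, `split_names`, and runs it on the sample from the question and on a made-up name with an embedded comma.

```python
def split_names(line):
    """Split one line of comma-separated, quoted names into a list."""
    return [x.strip().replace('"', '') for x in line.split(",")]


print(split_names('"MARIA","SUSAN","ANGELA","JACK"'))
# A comma inside the quotes is treated as a separator too (made-up input)
print(split_names('"SMITH, JOHN","JACK"'))
```

For the names file described in the question this should not matter, but it is the case where the csv module would give a different result.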

