I'm working on a Project Euler problem (for fun). It comes with a 46 KB text file containing a single line with a list of over 5000 names in a format like this:
"MARIA","SUSAN","ANGELA","JACK"...
My plan is to write a method that extracts every name and appends it to a Python list. Are regular expressions the best tool for this problem?
I looked through the Python re docs, but I'm having a hard time figuring out the right regex.
That looks like a format that the csv module would be helpful with. Then you wouldn't have to write any regex.
If the format of the file is as you say it is, i.e. "MARIA","SUSAN","ANGELA","JACK"... on a single line, then this should work:
>>> import csv
>>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
>>> words = next(lines)  # first (and only) row of the file
>>> words
['MARIA', 'SUSAN', 'ANGELA', 'JACK']
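If you want to try this without the file on disk, here is a minimal sketch of the same csv approach run against an in-memory string. The `parse_names` helper and the sample text are made up for illustration; the real input would be the contents of your names file.

```python
import csv
import io


def parse_names(text):
    """Return every comma-separated, optionally quoted name in text as a flat list."""
    names = []
    # csv.reader accepts any iterable of lines; StringIO stands in for the file here
    for row in csv.reader(io.StringIO(text)):
        names.extend(row)
    return names


sample = '"MARIA","SUSAN","ANGELA","JACK"'
print(parse_names(sample))
```

Collecting with `extend` rather than taking only the first row means the helper still works if the file turns out to have more than one line.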
A regexp will get the job done, but would be inefficient. Using csv would work, but it might not handle 5000 cells in a single line very well. At the very least it has to load the whole file in and maintain the entire list of names in memory (which might not be a problem for you because that's a very small amount of data). If you want an iterator for relatively large files (much larger than 5000 names), a state machine will do the trick:
def parse_chunks(iter, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False
    buffer = ''
    for chunk in iter:
        for byte in chunk:
            if in_escaped:
                # Done with the escape char, add it to the buffer
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''
            else:
                buffer += byte
    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)
    # Yield the last bit in the buffer
    yield buffer
data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""
print(list(parse_chunks(data)))
# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']
# Read in fixed-size chunks if you know the file has only one long line or
# don't care about line parsing
buffer_size = 4096
with open('myfile.txt', 'r') as file:
    # iter(callable, sentinel) keeps calling file.read() until it returns ''
    for name in parse_chunks(iter(lambda: file.read(buffer_size), '')):
        print(name)
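For comparison, the regexp route mentioned at the top of this answer could look roughly like the sketch below. It assumes every name is wrapped in double quotes and never contains an escaped or embedded quote, which holds for the sample in the question but is not something the pattern checks. `NAME_PATTERN` and `find_names` are names made up for this example.

```python
import re

# Matches one double-quoted field and captures the text between the quotes.
# Assumption: names never contain an escaped or embedded quote character.
NAME_PATTERN = re.compile(r'"([^"]*)"')


def find_names(text):
    """Return the contents of every double-quoted field in text."""
    return NAME_PATTERN.findall(text)


sample = '"MARIA","SUSAN","ANGELA","JACK"'
print(find_names(sample))
```

Unlike the state machine, this needs the whole string in memory and silently skips unquoted fields.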
If you can do it more simply, do it more simply. There is no need for the csv module here; I don't think 5000 names or 46 KB is enough to worry about.
names = []
with open("names.txt", "r") as f:
    # In case there is more than one line...
    for line in f:
        # extend, so names from earlier lines are not overwritten
        names.extend(x.strip().replace('"', '') for x in line.split(","))
print(names)
# should print ['name1', ... , ...]
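One caveat with plain split, shown as a small sketch: it cuts on every comma, including one that sits inside a quoted field. The names in the question contain no commas, so this should not matter for that file; the sample line below is invented purely to show the difference.

```python
import csv

# Hypothetical input with a comma inside a quoted field
line = '"SMITH, JOHN","MARIA"'

# Plain split cuts on every comma, including the one inside the quotes
split_names = [x.strip().replace('"', '') for x in line.split(",")]

# csv.reader treats the quoted text as a single field
csv_names = next(csv.reader([line]))

print(split_names)  # ['SMITH', 'JOHN', 'MARIA']
print(csv_names)    # ['SMITH, JOHN', 'MARIA']
```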