Matching every word in a txt file

I'm working on a Project Euler problem (for fun). It comes with a 46kb txt file containing a single line: a list of over 5000 names in a format like this:

"MARIA","SUSAN","ANGELA","JACK"...

My plan is to write a method to extract every name and append them to a Python list. Is a regular expression the best weapon to tackle this problem?
I looked up the Python re docs, but I'm having a hard time figuring out the right regex.

That looks like a format that the csv module would be helpful with. Then you wouldn't have to write any regex.

If the format of the file is as you say it is, i.e.

  1. It's a single line
  2. The format is like this: "MARIA","SUSAN","ANGELA","JACK"

Then this should work:

 >>> import csv
 >>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
 >>> words = lines.next()
 >>> words
 ['MARIA', 'SUSAN', 'ANGELA', 'JACK']
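
Note that lines.next() is Python 2 syntax; on Python 3 you would use the built-in next() instead. A minimal sketch of the same idea, assuming the same words.txt:

 >>> import csv
 >>> with open('words.txt', 'r', newline='') as f:
 ...     words = next(csv.reader(f))
 ...
 >>> words
 ['MARIA', 'SUSAN', 'ANGELA', 'JACK']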

A regexp will get the job done, but would be inefficient (a sketch of the regexp approach appears at the end of this answer). Using csv would work, but it might not handle 5000 cells in a single line very well. At the very least it has to load the whole file in and maintain the entire list of names in memory (which might not be a problem for you, because that's a very small amount of data). If you want an iterator that also copes with relatively large files (much larger than 5000 names), a state machine will do the trick:

def parse_chunks(chunks, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False

    buffer = ''

    for chunk in chunks:
        for byte in chunk:
            if in_escaped:
                # Done with the escape char, add it to the buffer
                buffer += byte
                in_escaped = False            
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True          
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''                    
            else:
                buffer += byte

    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)

    # Yield the last bit in the buffer
    yield buffer

data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""
print list(parse_chunks(data))

# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']
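
Because parse_chunks is a generator, the unbalanced-quote check only fires once the iterator is exhausted. A quick sketch of that failure mode, based on the ValueError branches above:

 >>> list(parse_chunks('"MARIA'))
 Traceback (most recent call last):
   ...
 ValueError: Found unbalanced quote char '"'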

# Use a fixed buffer size if you know the file has only one long line or
# don't care about line parsing
buffer_size = 4096
with open('myfile.txt', 'r', buffer_size) as f:
    for name in parse_chunks(f):
        print name
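
For comparison, a rough sketch of the regexp approach mentioned at the top of this answer (it loads the whole file into memory, and assumes every name is quoted and contains no quote characters; 'words.txt' is the assumed filename):

import re

with open('words.txt', 'r') as f:
    names = re.findall(r'"([^"]*)"', f.read())

print(names)  # ['MARIA', 'SUSAN', 'ANGELA', 'JACK', ...]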

If you can do it simpler, then do it simpler. No need to use the csv module; I don't think 5000 names or 46KB is enough to worry about.

names = []

# In case there is more than one line...
with open("names.txt", "r") as f:
    for line in f:
        names += [x.strip().replace('"', '') for x in line.split(",")]

print names
# should print ['name1', ...]
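
One caveat with plain split(','): it breaks if a name ever contains a comma inside the quotes, which is exactly the case the csv module and the state machine above handle. A sketch with hypothetical input:

 >>> [x.strip().replace('"', '') for x in '"SMITH, JR","ANNA"'.split(",")]
 ['SMITH', 'JR', 'ANNA']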
