Matching every word in a txt file

Question

I'm working on a Project Euler problem (great fun). It comes with a 46 KB txt file containing a single line with a list of over 5,000 names, in this format:

"MARIA","SUSAN","ANGELA","JACK"...

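My plan is to write a method that extracts each name and appends it to a Python list. Is a regular expression the best weapon for this?
I looked through the Python re docs, but I'm having a hard time working out the right regex.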

Answer 1

This looks like a format the csv module would handle well. Then you wouldn't have to write any regex.

Answer 2

If the file's format is as you say, i.e.

it is a single line
in the format "MARIA","SUSAN","ANGELA","JACK"

then this should work:

 >>> import csv
 >>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
 >>> words = next(lines)
 >>> words
 ['MARIA', 'SUSAN', 'ANGELA', 'JACK']

Answer 3

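A regex would get the job done, but it would be inefficient. Using csv would work, but it might not handle 5,000 cells on a single line very well. At the very least it has to load the entire file and keep the whole list of names in memory (which may not be a problem for you, since this is a very small amount of data). If you want an iterator for relatively large files (much larger than 5,000 names), a state machine will do the trick: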

def parse_chunks(chunks, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False

    buffer = ''

    for chunk in chunks:
        for byte in chunk:
            if in_escaped:
                # Done with the escape char, add it to the buffer
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''                    
            else:
                buffer += byte

    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)

    # Yield the last bit in the buffer
    yield buffer

data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""
print(list(parse_chunks(data)))

# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']

# Read fixed-size chunks rather than iterating over lines: iterating over the
# file object would pull a one-line file into memory all at once
buffer_size = 4096
with open('myfile.txt', 'r') as file:
    for name in parse_chunks(iter(lambda: file.read(buffer_size), '')):
        print(name)
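
For comparison, the regex approach mentioned at the top of this answer might look like the sketch below. It assumes every name is wrapped in double quotes and contains no embedded or escaped quotes, so unlike the state machine it would skip an unquoted field such as TED. The names `NAME_PATTERN` and `extract_names` are made up for illustration.

```python
import re

# Assumption: each name is double-quoted and has no escaped quotes inside it.
NAME_PATTERN = re.compile(r'"([^"]*)"')

def extract_names(text):
    """Return the contents of every double-quoted field in text."""
    return NAME_PATTERN.findall(text)

sample = '"MARIA","SUSAN","ANGELA","JACK"'
print(extract_names(sample))
# ['MARIA', 'SUSAN', 'ANGELA', 'JACK']
```

For a 46 KB file, reading the whole text with `read()` and passing it to `extract_names` should be fine.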

Answer 4

If you can do it more simply, do it more simply. There's no need for the csv module. I don't think 5,000 names or 46 KB is enough to worry about.

names = []
with open("names.txt", "r") as f:
    # In case there is more than one line, accumulate instead of overwriting
    for line in f:
        names.extend(x.strip().replace('"', '') for x in line.split(","))

print(names)
# should print ['name1', ... , ...]

4 answers

Answer 1
3 2011-10-04 01:11:07

Answer 2
3 accepted 2011-10-04 01:43:09

Answer 3
1 2011-10-04 01:56:57

Answer 4
1 2011-10-04 01:58:34