
Efficient way of frequency counting of continuous words?

I have a string like this:

inputString = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"

and a dictionary like this:

{
    'always first': 0,
    'book the': 0,
    'first': 0,
    'first sentence': 0,
    'in this': 0,
    'interesting the': 0,
    'is always': 0,
    'is really': 0,
    'is the': 0,
    'most interesting': 0,
    'really the': 0,
    'sentence in': 0,
    'sentence is': 0,
    'the first': 0,
    'the first sentence': 0,
    'the first sentence is': 0,
    'the most': 0,
    'this': 0,
    'this book': 0,
    'this is': 0
}

What is the most efficient way of updating the frequency counts in this dictionary in one pass over the input string (if that is possible)? I have a feeling there must be a parser technique to do this, but I am not an expert in this area, so I am stuck. Any suggestions?

Look into the Aho-Corasick algorithm.
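A quick sketch of that idea, using the third-party pyahocorasick package (my choice here, not something this answer specifies) and assuming the phrase dictionary is named counts as in the answers below:

import ahocorasick

automaton = ahocorasick.Automaton()
for phrase in counts:                  # counts is the phrase dictionary from the question
    automaton.add_word(phrase, phrase)
automaton.make_automaton()

# One pass over the input; matches may overlap and nest
for _end_index, phrase in automaton.iter(inputString):
    counts[phrase] += 1

Note that the matching is character-based, so a production version should also check word boundaries around each hit (e.g. 'first' inside a word like 'firstly' would otherwise match).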

When confronted with this problem, I think, "I know, I'll use regular expressions".

Start off by making a list of all the patterns, sorted by decreasing length:

patterns = sorted(counts.keys(), key=len, reverse=True)

Now make that into a single massive regular expression which is an alternation between each of the patterns:

import re

# If the phrases could contain regex metacharacters, escape them first with
# map(re.escape, patterns); plain words like these are safe as-is.
allPatterns = re.compile("|".join(patterns))

Now run that pattern over the input string, and count up the number of hits on each pattern as you go:

pos = 0
while True:
    match = allPatterns.search(inputString, pos)
    if match is None:
        break
    # Advance only one character past the match start, so that
    # overlapping occurrences are found as well
    pos = match.start() + 1
    counts[match.group()] += 1

You will end up with the counts of each of the strings.

(An aside: I believe most good regular expression libraries will compile a large alternation over fixed strings like this using the Aho-Corasick algorithm that e.dan mentioned. Using a regular expression library is probably the easiest way of applying this algorithm.)

There is one problem: where a pattern is a prefix of another pattern (e.g. 'first' and 'first sentence'), only the longer pattern will have been counted. This is by design: that's what the sort by decreasing length at the start was for.

We can deal with this as a postprocessing step: go through the counts, and whenever one pattern is a prefix of another, add the longer pattern's counts to the shorter pattern's. Be careful not to double-add. That's simply done as a nested loop:

correctedCounts = {}
for donor in counts:
    for recipient in counts:
        # startswith is also true when donor == recipient, so each
        # pattern's own count is carried over as well
        if donor.startswith(recipient):
            correctedCounts[recipient] = correctedCounts.get(recipient, 0) + counts[donor]

That dictionary now contains the actual counts.
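As a quick sanity check on the sample sentence (values worked out by hand from the input above, so treat them as illustrative):

assert correctedCounts['first'] == 4              # three 'the first sentence' plus the trailing 'always first'
assert correctedCounts['the first sentence'] == 3
assert correctedCounts['this'] == 2               # 'this is ...' and '... in this book ...'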

Aho-Corasick definitely seems the way to go, but if I needed a simple Python implementation, I'd write:

import collections

# snippets: the phrases to count, i.e. the keys of the dictionary above

def consecutive_groups(seq, n):
    # Yield every window of up to n consecutive items; ranging all the way
    # to len(seq) keeps the shorter windows at the end of the sequence,
    # so trailing phrases such as 'always first' are not missed
    return (seq[i:i+n] for i in range(len(seq)))

def get_snippet_occurrences(snippets):
    split_snippets = [s.split() for s in snippets]
    max_snippet_length = max(len(sp) for sp in split_snippets)
    for group in consecutive_groups(inputString.split(), max_snippet_length):
        for lst in split_snippets:
            if group[:len(lst)] == lst:  # snippet matches at the start of this window
                yield " ".join(lst)

print(collections.Counter(get_snippet_occurrences(snippets)))
# Counter({'first': 4, 'the first sentence': 3, 'first sentence': 3, 'the first': 3, 'the first sentence is': 2, 'sentence is': 2, 'this': 2, 'this book': 1, 'book the': 1, ...})

Try using a suffix tree or a trie, storing words instead of characters.
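A minimal word-level trie sketch along those lines (all names here are illustrative, not from the original post): each node maps a word to a child node, a node that completes a phrase records it, and the walk restarts at every word so overlapping phrases are all counted.

END = object()  # marker key: "a known phrase ends at this node"

def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root

def count_phrases(text, trie):
    counts = {}
    words = text.split()
    for i in range(len(words)):        # restart the walk at every word
        node = trie
        for word in words[i:]:
            if word not in node:
                break
            node = node[word]
            if END in node:            # a known phrase ends on this word
                counts[node[END]] = counts.get(node[END], 0) + 1
    return counts

Each walk stops after at most max-phrase-length words, so the scan is O(number of words × longest phrase).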

Just go through the string and use the dictionary as you would normally, incrementing the count for each occurrence. This is O(n), since dictionary lookup is typically O(1). I do this regularly, even for large word collections.
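For the multi-word keys in this question, "go through the string" means checking each window of words against the dictionary. A minimal sketch of that pass (the function name is mine; it assumes the dictionary is named counts as in the answers above):

def count_in_one_pass(text, counts):
    words = text.split()
    lengths = {len(key.split()) for key in counts}   # distinct phrase lengths, here {1, 2, 3, 4}
    for i in range(len(words)):
        for n in lengths:
            if i + n > len(words):                   # window would run past the end
                continue
            candidate = " ".join(words[i:i+n])
            if candidate in counts:                  # average O(1) dictionary lookup
                counts[candidate] += 1
    return counts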
