Efficient way to count frequencies of consecutive words?

Question

I have a string like this:

inputString = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"

and a dictionary like this:

{
    'always first': 0,
    'book the': 0,
    'first': 0,
    'first sentence': 0,
    'in this': 0,
    'interesting the': 0,
    'is always': 0,
    'is really': 0,
    'is the': 0,
    'most interesting': 0,
    'really the': 0,
    'sentence in': 0,
    'sentence is': 0,
    'the first': 0,
    'the first sentence': 0,
    'the first sentence is': 0,
    'the most': 0,
    'this': 0,
    'this book': 0,
    'this is': 0
}

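What is the most efficient way to update the frequency counts in this dictionary, in a single pass over the input string if possible? I have a feeling there must be some parser technique for this, but I'm no expert in this area, so I'm stuck. Any suggestions?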

Answer 1

Take a look at the Aho-Corasick algorithm.
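For illustration, here is a minimal sketch of Aho-Corasick applied to words rather than characters, assuming phrases and text are split on whitespace. The names `build_automaton` and `count_phrases` are hypothetical and not from any library. Each input word is consumed once, and failure links let overlapping phrases such as 'first sentence' and 'the first sentence is' be counted in the same pass.

```python
from collections import deque


def build_automaton(phrases):
    """Build a word-level Aho-Corasick automaton (hypothetical helper)."""
    goto = [{}]    # goto[state][word] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix
    output = [[]]  # output[state] -> phrases that end at this state

    # Phase 1: insert every phrase into a trie keyed by words.
    for phrase in phrases:
        state = 0
        for word in phrase.split():
            if word not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][word] = len(goto) - 1
            state = goto[state][word]
        output[state].append(phrase)

    # Phase 2: breadth-first pass to set failure links and inherit outputs.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for word, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and word not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(word, 0)
            # Phrases ending at the suffix state also end here.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def count_phrases(text, phrases):
    """Count every phrase occurrence in a single pass over the words of text."""
    goto, fail, output = build_automaton(phrases)
    counts = dict.fromkeys(phrases, 0)
    state = 0
    for word in text.split():
        while state and word not in goto[state]:
            state = fail[state]
        state = goto[state].get(word, 0)
        for phrase in output[state]:
            counts[phrase] += 1
    return counts
```

With the data from the question, `count_phrases(inputString, counts)` would return a new dictionary keyed by the same phrases. The sketch assumes the phrases are unique, which holds when they come from dictionary keys.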

Answer 2

When I ran into this problem, I thought: "I know, I'll use regular expressions."

Start by making a list of all the patterns, sorted by decreasing length:

patterns = sorted(counts.keys(), key=len, reverse=True)

Now turn that into a single massive regular expression, which is an alternation between all of the patterns:

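import re
# Escape each pattern so it matches literally; \b keeps matches on word boundaries.
allPatterns = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in patterns) + r")\b")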

Now run that pattern over the input string, tallying the hits for each pattern as you go:

pos = 0
while True:
    match = allPatterns.search(inputString, pos)
    if match is None:
        break
    # Resume one character past the match start so overlapping matches are found.
    pos = match.start() + 1
    counts[match.group()] += 1

You'll end up with a count for each string.

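(An aside: I believe most good regular expression libraries will compile a large alternation of fixed strings like this using the Aho-Corasick algorithm that e.dan mentioned. Using a regular expression library is probably the easiest way of applying that algorithm.)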

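There is one catch: when one pattern is a prefix of another (for example 'first' and 'first sentence'), only the longer pattern gets counted at that position. This is by design: it is what the sort by length at the start was for.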

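We can deal with that in a post-processing step: go through the counts, and whenever one pattern is a prefix of another, add the longer pattern's count to the shorter one's. Be careful not to double-add. This is simple to do as a nested loop: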

correctedCounts = {}
for donor in counts:
    for recipient in counts:
        # A pattern donates to itself and to every whole-word prefix of itself.
        if donor == recipient or donor.startswith(recipient + " "):
            correctedCounts[recipient] = correctedCounts.get(recipient, 0) + counts[donor]

This dictionary now contains the actual counts.

Answer 3

Aho-Corasick definitely seems like the way to go, but if I needed a simple Python implementation, I'd write:

import collections

def consecutive_groups(seq, n):
    # Yield a window at every start index. Windows near the end are shorter
    # than n, which still lets the shorter snippets match there.
    return (seq[i:i+n] for i in range(len(seq)))

def get_snippet_ocurrences(snippets):
    split_snippets = [s.split() for s in snippets]
    max_snippet_length = max(len(sp) for sp in split_snippets)
    for group in consecutive_groups(inputString.split(), max_snippet_length):
        for lst in split_snippets:
            if group[:len(lst)] == lst:
                yield " ".join(lst)

# snippets is any iterable of phrases; here, the keys of the dictionary in the question
print(collections.Counter(get_snippet_ocurrences(snippets)))
# Counter({'first': 4, 'the first': 3, 'the first sentence': 3, 'first sentence': 3, 'this': 2, 'the first sentence is': 2, 'sentence is': 2, 'this is': 1, 'is the': 1, 'sentence in': 1, 'in this': 1, 'this book': 1, 'book the': 1, 'is really': 1, 'really the': 1, 'the most': 1, 'most interesting': 1, 'interesting the': 1, 'is always': 1, 'always first': 1})

Answer 4

Try a suffix tree, or a trie that stores words rather than characters.
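As a rough sketch of the word-level trie idea, assuming whitespace tokenization: the trie is built from nested dictionaries keyed by words, and the text is walked from every start position until the trie has no matching child. The names `END`, `build_word_trie`, and `count_with_trie` are made up for this example.

```python
END = object()  # sentinel key marking that a complete phrase ends at a node


def build_word_trie(phrases):
    """Build a trie whose edges are whole words (hypothetical helper)."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root


def count_with_trie(text, phrases):
    """Count phrase occurrences by descending the trie from each word position."""
    root = build_word_trie(phrases)
    counts = dict.fromkeys(phrases, 0)
    words = text.split()
    for start in range(len(words)):
        node = root
        for i in range(start, len(words)):
            node = node.get(words[i])
            if node is None:
                break  # no phrase continues with this word
            if END in node:
                counts[node[END]] += 1
    return counts
```

Each start position descends at most as deep as the longest phrase, so the work is roughly the number of words times the maximum phrase length. Unlike Aho-Corasick, this restarts at every position instead of following failure links.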

Answer 5

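Just iterate through the string and use the dictionary as you normally would to increment any occurrences. This is O(n), since dictionary lookups are usually O(1). I do this regularly, even for large collections of words.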
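One way this could look, as a sketch under the assumption that phrases are space-separated words: at each word position, join the next 1 to k words (k being the longest phrase length) and look the result up in the dictionary. The function name `count_ngrams` is hypothetical.

```python
def count_ngrams(text, counts):
    """Return fresh counts for the phrases in counts via plain dict lookups."""
    words = text.split()
    # Longest phrase length in words; 0 if there are no phrases at all.
    max_len = max((len(phrase.split()) for phrase in counts), default=0)
    result = dict.fromkeys(counts, 0)
    for start in range(len(words)):
        for n in range(1, max_len + 1):
            if start + n > len(words):
                break  # not enough words left for an n-gram of this size
            candidate = " ".join(words[start:start + n])
            if candidate in result:
                result[candidate] += 1
    return result
```

Strictly speaking this performs up to k lookups per word, so it is O(n) only when the longest phrase length is treated as a small constant.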
