How can I find the most frequently occurring word pairs in a file using Python?

Question

I have a data set like the following:

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"

and so on.

I want to find the most frequently occurring word pairs, e.g.

(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)

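The two words can be in any order and at any distance from each other.

Can someone suggest a possible solution in Python? This is a very large data set.

Any suggestion is highly appreciated.

So this is what I tried after the suggestion from @275365:

@275365 I tried the following, reading the input from a file: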

    def collect_pairs(file):
        pair_counter = Counter()
        for line in open(file):
            unique_tokens = sorted(set(line))  
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
            print pair_counter

    file = ('myfileComb.txt')
    p=collect_pairs(file)

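The text file has the same number of lines as the original file, but each line contains only unique tokens. When I run this, the output is combinations of individual letters instead of the expected combinations of words. I don't know where I went wrong.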

Answer 1

Depending on the size of your corpus, you could start with something like this:

>>> from itertools import combinations
>>> from collections import Counter

>>> def collect_pairs(lines):
...     pair_counter = Counter()
...     for line in lines:
...         unique_tokens = sorted(set(line))  # exclude duplicates in same line and sort to ensure one word is always before other
...         combos = combinations(unique_tokens, 2)
...         pair_counter += Counter(combos)
...     return pair_counter
...

The result:

>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]

Do you want the numbers included in these combinations? Since you didn't specifically mention excluding them, I have included them here.
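If the numeric IDs should be left out, one possible variation is sketched below. The name `collect_pairs_no_ids` and the `str.isdigit` test are assumptions made for illustration and are not part of the answer above. Note also that because the tokens are sorted, each pair is stored in alphabetical order, so the key is `('AI', 'MachineLearning')` rather than `('MachineLearning', 'AI')`.

```python
from collections import Counter
from itertools import combinations


def collect_pairs_no_ids(lines):
    """Count unordered token pairs per line, ignoring purely numeric tokens."""
    pair_counter = Counter()
    for line in lines:
        # Drop duplicates and numeric IDs, then sort so each pair has one canonical order.
        tokens = sorted({t for t in line if not t.isdigit()})
        pair_counter.update(combinations(tokens, 2))
    return pair_counter


t2 = [
    ['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'],
    ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'],
    ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'],
    ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'],
    ['136', 'MachineLearning', 'AI', 'TopGear'],
]
pairs = collect_pairs_no_ids(t2)
print(pairs.most_common(1))  # [(('AI', 'MachineLearning'), 3)]
```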

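Edit: using a file object

The function you posted in your first attempt above was very close to working. Iterating over a string yields its individual characters, which is why set(line) gave you letters. The only thing you need to do is convert each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (quotes around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you might need to use a regular expression of some kind.) See below for a version modified to use ast.literal_eval: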

from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    with open(file_name) as f:  # the file is closed automatically when the block ends
        for line in f:  # these lines are each simply one long string; you need a list or tuple
            if not line.strip():
                continue  # skip blank lines, which literal_eval cannot parse
            # literal_eval converts the line into a tuple; set() then drops duplicate tokens
            unique_tokens = sorted(set(ast.literal_eval(line)))
            combos = combinations(unique_tokens, 2)
            pair_counter.update(combos)  # add this line's pairs to the running counts
    return pair_counter  # return the actual Counter object

Now you can test it like this:

file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print(p.most_common(10))  # for example
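If some lines are not valid Python literals, the standard `csv` module is another way to split them. This is an alternative sketch, not what the answer above uses. It assumes the fields are double-quoted and comma-separated as in the question, and the function name `collect_pairs_csv` is made up for illustration. `skipinitialspace=True` handles the spaces that follow some of the commas in the sample data.

```python
import csv
from collections import Counter
from itertools import combinations


def collect_pairs_csv(lines):
    """Count unordered token pairs; `lines` is any iterable of text lines (e.g. an open file)."""
    pair_counter = Counter()
    for row in csv.reader(lines, skipinitialspace=True):
        # Drop empty fields and duplicates, then sort so each pair has one canonical order.
        tokens = sorted({field for field in row if field})
        pair_counter.update(combinations(tokens, 2))
    return pair_counter


sample = [
    '"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"\n',
    '"717","I like Sheen", "Narnia", "Statistics", "Estimation"\n',
    '"633","MachineLearning","AI","I like Cars, but I also like bikes"\n',
    '"717","I like Sheen","MachineLearning", "regression", "AI"\n',
    '"136","MachineLearning","AI","TopGear"\n',
]
csv_pairs = collect_pairs_csv(sample)
```

With a real file, the same function could be called as `collect_pairs_csv(f)` inside a `with open(file_name, newline='') as f:` block.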

Answer 2

Apart from counting all the pairs, there is not much you can do.

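The obvious optimizations are to remove duplicate words and synonyms as early as possible, to perform stemming (anything that reduces the number of distinct tokens is good!), and to count only pairs (a,b) where a<b (in your example, count either statistics,narnia or narnia,statistics, but not both!).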
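A minimal sketch of these optimizations follows. The `normalize` helper, the `synonyms` mapping and the function names are hypothetical, and case folding stands in for real stemming, which would normally need a third-party library.

```python
from collections import Counter
from itertools import combinations


def normalize(token, synonyms):
    """Reduce a token to a canonical form (case folding here; real stemming is not shown)."""
    token = token.strip().casefold()
    return synonyms.get(token, token)  # map known synonyms to one representative


def count_normalized_pairs(rows, synonyms=None):
    synonyms = synonyms or {}
    counts = Counter()
    for row in rows:
        # The set drops duplicates; sorting guarantees a < b in every pair (a, b).
        tokens = sorted({normalize(t, synonyms) for t in row})
        counts.update(combinations(tokens, 2))
    return counts


rows = [
    ["Statistics", "Narnia"],
    ["narnia ", "STATISTICS"],
    ["AI", "Artificial Intelligence", "Narnia"],
]
norm_counts = count_normalized_pairs(rows, {"artificial intelligence": "ai"})
```

Here the first two rows both contribute to the single key `('narnia', 'statistics')`, and the third row collapses to the one pair `('ai', 'narnia')`.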

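If you run out of memory, make two passes. In the first pass, use one or more hash functions to obtain a candidate filter. In the second pass, count only the words that pass this filter (MinHash / LSH style filtering).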
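One possible reading of the two-pass idea is sketched below. It hashes pairs into a fixed-size array of bucket counters, in the style of the PCY algorithm, which the answer does not name. The function names, the `rows_factory` callable (which must return a fresh iterable on each call so the data can be read twice), the bucket count and the use of `zlib.crc32` are all assumptions.

```python
import zlib
from collections import Counter
from itertools import combinations


def bucket_of(pair, n_buckets):
    """Map a pair of strings to a bucket index with a deterministic hash."""
    key = "\x00".join(pair).encode("utf-8")
    return zlib.crc32(key) % n_buckets


def frequent_pairs(rows_factory, min_count, n_buckets=1 << 16):
    # Pass 1: only bucket totals are kept, so memory is bounded by n_buckets.
    buckets = [0] * n_buckets
    for row in rows_factory():
        for pair in combinations(sorted(set(row)), 2):
            buckets[bucket_of(pair, n_buckets)] += 1

    # Pass 2: count exactly, but only pairs whose bucket reached min_count.
    # A bucket total is never smaller than the count of any pair in it,
    # so no frequent pair is filtered out.
    counts = Counter()
    for row in rows_factory():
        for pair in combinations(sorted(set(row)), 2):
            if buckets[bucket_of(pair, n_buckets)] >= min_count:
                counts[pair] += 1

    # Hash collisions can let infrequent pairs through; drop them here.
    return Counter({p: c for p, c in counts.items() if c >= min_count})


data = [
    ['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'],
    ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'],
    ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'],
    ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'],
    ['136', 'MachineLearning', 'AI', 'TopGear'],
]
frequent = frequent_pairs(lambda: data, min_count=2)
```

The filter only pays off when most pairs are rare: the bucket array is far smaller than a dictionary holding every distinct pair, and only pairs in busy buckets get an exact counter.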

This is an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or computers.
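As a rough sketch of how the work could be split, each worker can count the pairs in its own chunk of rows and the partial counters can then be added together. The function names and the chunk size are hypothetical. Processes are used instead of threads because, in CPython, CPU-bound counting in threads is limited by the global interpreter lock.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations


def count_chunk(rows):
    """Count unordered token pairs in one chunk of rows."""
    counts = Counter()
    for row in rows:
        counts.update(combinations(sorted(set(row)), 2))
    return counts


def merge(partials):
    """Add partial counters together; the result does not depend on chunk order."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def chunked(rows, size):
    """Split a list of rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_count(rows, chunk_size=10000, workers=None):
    """Count pairs using a pool of worker processes, one chunk per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(count_chunk, chunked(rows, chunk_size)))
```

When run as a script, `parallel_count` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.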

Answer 1
5 accepted 2014-01-23 02:49:09

Answer 2
0 2014-01-23 22:59:22
