Efficiently counting ngrams by ID in Python

Question

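I have a list of 10,000 ngrams (phrases of more than one word) and 6.5 million records containing varying amounts of text (from 10 to 5,000 characters). I want to create 10,000 new columns in my dataframe, each holding the count of the corresponding ngram. My current solution loops over the text column of the dataframe, uses re.findall to count the occurrences of the ngram in each row, appends the length of the findall result to a list, and then uses that list to create the new column in the dataframe.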

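Because of memory limits, I page through the data 100,000 rows at a time (out of 6.5M). Processing one page, which yields a dataframe with the original columns plus 10,000 columns (one per ngram), takes about five hours. Since I have 65 pages to get through, I expect the whole job to take 325 hours.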

Is there a better way? I tried to find a numpy vectorized approach but could not find one.

Edit: After working on this some more, I started using Pandas vectorization:

ngram = 'hello world'
df["columnCnt"] = df["text_column"].str.count(ngram)

I am looping over the list of 10,000 ngrams and calling str.count for each of the 10,000. Is there a way to vectorize this so that all 10,000 are done faster than with a loop?

Answer 1

For example, this counts all 1-, 2-, and 3-grams in a phrase:

from collections import defaultdict
phrase = 'worms in the belly of the leviathan. we the living bear the cross of history when in the company of dogs it behooves one to act like a dog'

allwords = phrase.split()  # whitespace tokenization; punctuation stays attached to words
ngram_dict = defaultdict(int)
for n in [1, 2, 3]:
    # + 1 so that the final n-gram of the phrase is counted too
    for i in range(len(allwords) - n + 1):
        words = ' '.join(allwords[i:i + n])
        ngram_dict[words] += 1

Then take the intersection of your list with the ngrams found above.

ngrams_to_detect=['worms','dogs','worms in','act like','the','monster trucks']
detected=set(ngram_dict.keys())
relevant_detected = detected.intersection(ngrams_to_detect)
Out[92]: {'act like', 'dogs', 'worms', 'worms in', 'the'}

not_found = set(ngrams_to_detect)-relevant_detected
Out[93]: {'monster trucks'}

There is a trade-off here between how complete your list is and how much time is wasted generating irrelevant ngrams. The counts can be returned with:

detected_counts = {k:v for k,v in ngram_dict.items() if k in relevant_detected}
Out[100]: {'worms': 1, 'dogs': 1, 'worms in': 1, 'act like': 1, 'the': 5}
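
One way to cut the time spent on irrelevant ngrams is to check each candidate against a set of the wanted ngrams as it is generated, so that only matches are stored. The following is a minimal sketch of that idea, not a tested solution for the full 6.5M rows. The helper name count_target_ngrams, the max_n parameter, and the sample texts are made up for illustration, and it uses the same whitespace tokenization as above.

```python
from collections import Counter

def count_target_ngrams(text, targets, max_n=3):
    """Count only those n-grams of `text` (up to max_n words) that are in `targets`."""
    target_set = set(targets)   # set gives fast membership checks
    words = text.split()        # whitespace tokenization, as in the answer above
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            candidate = ' '.join(words[i:i + n])
            if candidate in target_set:
                counts[candidate] += 1
    return counts

# Hypothetical sample data standing in for the text column.
texts = ['the dogs act like the dogs', 'worms in the belly']
targets = ['dogs', 'act like', 'worms in', 'monster trucks']

rows = [count_target_ngrams(t, targets) for t in texts]
# One list of counts per target ngram; each list could become a dataframe column.
columns = {t: [row[t] for row in rows] for t in targets}
```

Each text is scanned once per ngram length rather than once per target ngram, so the cost depends on the text length and max_n instead of on the 10,000 entries in the list.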

Solution 1
0 2020-05-21 03:40:08
