How to get n-grams from a column in a pandas dataframe

Question

I have some questions about n-grams. Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:

Sentences

For each topic, we will explore the words occuring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many
words and how many times those words appear. 
Save this to ‘bow_corpus’, then check our selected document earlier.

To do this, I used the following function:

def n_grams(lines , min_length=2, max_length=4):
    lenghts=range(min_length,max_length+1)
    ngrams={length:collections.Counter() for length in lengths)
    queue= collection.deque(maxlen=max_length)

But it does not work, because I get no output.

Can you tell me what is wrong with the code?

Answer 1

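Your ngrams dictionary holds empty Counter() objects because you never pass anything to them to count. There are a few other problems as well:

Function names cannot contain - in Python.
collection.deque is invalid; I think you meant to call collections.deque().
The range is assigned to lenghts but read back as lengths, which raises a NameError.
The dict comprehension opens with { but closes with ), which is a syntax error.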
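
If you do want to stay with collections, here is a minimal sketch of what the function might have been aiming for. It is a guess at the intent, not the original author's code: the name count_n_grams, the whitespace tokenization, and the choice to reset the queue for each sentence are all assumptions.

```python
import collections


def count_n_grams(lines, min_length=2, max_length=4):
    """Count every n-gram of each length across an iterable of sentences."""
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    for line in lines:
        # Assumption: a fresh queue per sentence, so no n-gram spans two sentences.
        queue = collections.deque(maxlen=max_length)
        for word in line.split():  # assumption: plain whitespace tokenization
            queue.append(word)
            window = tuple(queue)
            # Count the n-grams that end at the current word.
            for length in lengths:
                if len(window) >= length:
                    ngrams[length][window[-length:]] += 1
    return ngrams


counts = count_n_grams(["we will check", "we will explore the words"])
print(counts[2].most_common(1))  # [(('we', 'will'), 2)]
```

Unlike the list-based versions, this returns counts rather than the n-grams in order, which is handy if you only care about the most frequent ones.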

I think there are better ways to fix your code than using the collections library. Here are two of them:

You can fix your function with a list comprehension:

def n_grams(lines, min_length=2, max_length=4):
    tokens = lines.split()
    ngrams = dict()
    for n in range(min_length, max_length + 1):
        ngrams[n] = [tokens[i:i+n] for i in range(len(tokens)-n+1)]

    return ngrams

Or you can use nltk, which supports tokenization and n-grams natively.

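from nltk import ngrams
from nltk.tokenize import word_tokenize


def n_grams(lines, min_length=2, max_length=4):
    # word_tokenize requires NLTK's tokenizer data, installed via nltk.download()
    tokens = word_tokenize(lines)
    # Use a name other than ngrams so the imported ngrams() is not shadowed;
    # list() materializes the generator that ngrams() returns.
    result = {n: list(ngrams(tokens, n)) for n in range(min_length, max_length + 1)}
    return result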
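
Both versions above work on a single string, while the question asks about a dataframe column. Below is a rough sketch of applying the function row by row with Series.apply. The column name Sentences comes from the question; the two-row sample frame, the output column names such as "2-grams", and the use of plain split() instead of nltk are assumptions made to keep the example self-contained.

```python
import pandas as pd


def n_grams(line, min_length=2, max_length=4):
    """Return {n: list of n-gram tuples} for one sentence."""
    tokens = line.split()  # assumption: whitespace tokenization, no nltk needed
    return {
        n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(min_length, max_length + 1)
    }


# Hypothetical two-row sample built from sentences in the question.
df = pd.DataFrame({"Sentences": [
    "We will check where our test document would be classified.",
    "For each document we create a dictionary",
]})

# One new column per n-gram length; each cell holds a list of tuples.
for n in (2, 3, 4):
    df[f"{n}-grams"] = df["Sentences"].apply(lambda s, n=n: n_grams(s, n, n)[n])

print(df.loc[0, "2-grams"][:2])  # [('We', 'will'), ('will', 'check')]
```

Because split() leaves punctuation attached to words ("classified." in the first row), swapping in word_tokenize may give cleaner tokens if nltk is available.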
