How can I use CountVectorizer to count phrases without also counting the words inside them?

Question

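I am working on an NLP project in which I want to tokenize sentences and get the counts of the different tokens. Sometimes I want several words to be treated as a single phrase, without the words inside that phrase being counted separately.

I found that CountVectorizer in scikit-learn is useful for counting phrases, but I cannot figure out how to exclude the words that belong to a phrase.

For example: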

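from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary: three single words and one two-word phrase
words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
# ngram_range=(1, 2) lets the vectorizer match the bigram 'my dog'
vect = CountVectorizer(vocabulary=words, ngram_range=(1,2))
dtm = vect.fit_transform(example)
print(dtm)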

What I get:

>>> vect.get_feature_names()
['cat', 'dog', 'walking', 'my dog']
>>> print(dtm)
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (0, 3)    1

What I want:

>>> print(dtm)
  (0, 0)    1
  (0, 2)    1
  (0, 3)    1

However, I want to keep 'dog' in the vocabulary, because it may appear on its own in other texts.

Answer 1

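CountVectorizer has no option that matches the longer phrase first and then removes it from the text so that the shorter substrings are not counted.

So one solution is to use CountVectorizer as you already do. Afterwards, iterate over the vocabulary to find the words that are contained in longer phrases, and subtract the count of each longer phrase from the count that CountVectorizer initially produced for the shorter word it contains.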
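The subtraction step described above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name `count_without_nested` is hypothetical, the vocabulary is a plain list containing single words and whitespace-separated phrases, and the default CountVectorizer tokenization (lowercasing, tokens of two or more characters) is used. Overlapping phrases could subtract the same word occurrence twice, so the result is clipped at zero.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def count_without_nested(docs, vocabulary, ngram_range=(1, 2)):
    """Count vocabulary entries, then remove from each single word's count
    the occurrences that fall inside a longer vocabulary phrase.

    Hypothetical helper, not part of scikit-learn.
    """
    vect = CountVectorizer(vocabulary=vocabulary, ngram_range=ngram_range)
    # With a fixed vocabulary no fitting is needed; column k is vocabulary[k].
    counts = vect.transform(docs).toarray()

    for i, phrase in enumerate(vocabulary):
        tokens = phrase.split()
        if len(tokens) < 2:
            continue  # only multi-word phrases can contain other entries
        for j, word in enumerate(vocabulary):
            if i != j and word in tokens:
                # Each phrase occurrence hides tokens.count(word) word occurrences.
                counts[:, j] -= counts[:, i] * tokens.count(word)

    # Guard against over-subtraction when phrases overlap in the text.
    return np.clip(counts, 0, None)


words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
print(count_without_nested(example, words))
```

For the example from the question, the column for 'dog' drops to 0 while 'my dog' stays at 1, and 'dog' is still counted when it appears on its own in another text.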
