How can I use CountVectorizer to count phrases without also counting the words inside them?

Question

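I'm working on an NLP project where I want to tokenize sentences and get counts of the different tokens. Sometimes I want several words to be treated as a single phrase, without the words inside that phrase being counted separately.

I found that CountVectorizer in scikit-learn is useful for counting phrases, but I can't figure out how to exclude the words that are part of a phrase.

For example: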

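from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary: three single words plus one two-word phrase
words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
# ngram_range=(1, 2) lets the vectorizer match both unigrams and bigrams
vect = CountVectorizer(vocabulary=words, ngram_range=(1, 2))
dtm = vect.fit_transform(example)
print(dtm)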

I get:

>>> vect.get_feature_names()
['cat', 'dog', 'walking', 'my dog']
>>> print(dtm)
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (0, 3)    1

What I want is:

>>> print(dtm)
  (0, 0)    1
  (0, 2)    1
  (0, 3)    1

But I want to keep 'dog' in the vocabulary, because it may appear on its own in other texts.

Answer 1

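CountVectorizer has no configuration option that matches the longer strings first and removes them from the text so that the shorter substrings are not counted.

So one solution is to use CountVectorizer as you did. Afterwards, iterate over the vocabulary to find the words that are contained in longer phrases, and then, in CountVectorizer's initial result, subtract the count of each longer phrase from the count of the shorter term it contains.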
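The subtraction step described above could be sketched roughly as follows. This is only an illustration, not part of scikit-learn: `subtract_phrase_counts` is a hypothetical helper, the second document is made up for the example, and the sketch assumes that phrases in the vocabulary are lowercase, space-separated words that match the vectorizer's default tokenization. It only adjusts single words contained in phrases, not shorter phrases nested inside longer ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking', 'my dog']
docs = [
    'I was walking my dog and cat in the park',
    'my dog chased another dog',  # made-up text where 'dog' also appears alone
]

vect = CountVectorizer(vocabulary=words, ngram_range=(1, 2))
counts = vect.fit_transform(docs).toarray()


def subtract_phrase_counts(counts, vocabulary):
    """Hypothetical helper: remove phrase occurrences from single-word counts."""
    adjusted = counts.copy()
    for i, phrase in enumerate(vocabulary):
        tokens = phrase.split()
        if len(tokens) < 2:
            continue  # not a multi-word phrase
        for j, term in enumerate(vocabulary):
            if j != i and ' ' not in term:
                # Each phrase occurrence accounts for tokens.count(term)
                # occurrences of the single word.
                adjusted[:, j] -= counts[:, i] * tokens.count(term)
    return adjusted


adjusted = subtract_phrase_counts(counts, words)
print(adjusted)
```

In the first document 'dog' drops to 0 because its only occurrence is inside 'my dog'; in the second it stays at 1 because one 'dog' appears on its own. Working on a dense array keeps the sketch short; for a large corpus the same subtraction would need to be done on the sparse matrix instead.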
