How do I use CountVectorizer to get the count of a phrase without counting words in the phrase?

I am working on an NLP project and want to tokenize sentences and count the different tokens. Sometimes I want a few words to be treated as a single phrase, without also counting the individual words inside that phrase.

I have found CountVectorizer in scikit-learn useful for counting phrases, but I could not figure out how to stop it from counting the words inside the phrases.

For example:

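from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
# Fixed vocabulary; ngram_range=(1, 2) lets the two-word phrase 'my dog' match
vect = CountVectorizer(vocabulary=words, ngram_range=(1, 2))
dtm = vect.fit_transform(example)
print(dtm)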

I got:

>>> vect.get_feature_names()
['cat', 'dog', 'walking', 'my dog']
>>> print(dtm)
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (0, 3)    1

What I want is:

>>> print(dtm)
  (0, 0)    1
  (0, 2)    1
  (0, 3)    1

But I want to keep 'dog' in the vocabulary because it may appear on its own in other text.

CountVectorizer has no specific option to match the longer phrase first and remove it from the text so that the shorter substring is not counted as well.

Hence, one solution is to use CountVectorizer as you did. Afterwards, iterate over the vocabulary to find the entries that are contained in longer phrases, and subtract the count of each longer phrase from the count of every shorter entry it contains, in the original CountVectorizer result.
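
The subtraction step could look roughly like the sketch below. It is only an illustration under some assumptions: the helper name `subtract_phrase_counts` is made up for this example, vocabulary entries are assumed to be lowercase and space-separated (matching CountVectorizer's default tokenization), and the matrix is converted to a dense array for simplicity. With phrases nested more than one level deep (for instance 'dog', 'my dog' and 'walking my dog' together) this simple loop would subtract too much, so the result is clipped at zero.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']

vect = CountVectorizer(vocabulary=words, ngram_range=(1, 2))
counts = vect.fit_transform(example).toarray()  # dense copy, fine for small data


def subtract_phrase_counts(counts, vocabulary):
    """Hypothetical helper: remove from each entry the hits that came from longer phrases."""
    adjusted = counts.copy()
    tokenized = [entry.split() for entry in vocabulary]
    for i, short in enumerate(tokenized):
        n = len(short)
        for j, phrase in enumerate(tokenized):
            if len(phrase) <= n:
                continue  # only strictly longer phrases can contain the entry
            # How many times the shorter entry occurs as a token run inside the phrase
            hits = sum(phrase[k:k + n] == short for k in range(len(phrase) - n + 1))
            adjusted[:, i] -= hits * counts[:, j]
    # Guard against over-subtraction when phrases are nested in each other
    return np.clip(adjusted, 0, None)


adjusted = subtract_phrase_counts(counts, words)
print(adjusted)  # [[1 0 1 1]]
```

Here 'dog' drops to 0 for this sentence because its only occurrence is inside 'my dog', but it stays in the vocabulary and is still counted when it appears on its own in other text.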
