
How do I use CountVectorizer to get the count of a phrase without counting words in the phrase?

I am working on an NLP project and want to tokenize sentences and count the different tokens. Sometimes I want a few words to be treated as a phrase, without also counting the individual words inside that phrase.

I have found CountVectorizer in scikit-learn useful in counting phrases, but I could not figure out how to remove the words inside the phrases.

For example:

from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words, ngram_range=(1, 2))
dtm = vect.fit_transform(example)
print(dtm)

I got:

>>> vect.get_feature_names()
['cat', 'dog', 'walking', 'my dog']
>>> print(dtm)
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (0, 3)    1

What I want is:

>>> print(dtm)
  (0, 0)    1
  (0, 2)    1
  (0, 3)    1

But I want to keep 'dog' in the dictionary because it may appear on its own in other text.

CountVectorizer has no option to match the longer phrase first and consume it from the text so that the shorter words inside it are not counted.

Hence, one solution is to use CountVectorizer as you did. Afterwards, iterate over the vocabulary to find the words that are contained in longer phrases, and subtract the count of each longer phrase from the counts of the shorter words it contains, in the original CountVectorizer result.
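
The subtraction step could be sketched roughly as follows. This is only a minimal illustration, not a built-in feature of CountVectorizer: the second sentence in `docs` and the names `index` and `adjusted` are made up for the example, and it assumes the default tokenizer (lowercasing, tokens of two or more characters) and that the phrases in the vocabulary do not overlap one another in the text.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking', 'my dog']
# The second document is a made-up example where 'dog' also appears on its own.
docs = ['I was walking my dog and cat in the park',
        'my dog chased another dog']

vect = CountVectorizer(vocabulary=words, ngram_range=(1, 2))
counts = vect.fit_transform(docs).toarray()

# With a list vocabulary, column i corresponds to words[i].
index = {term: i for i, term in enumerate(words)}

adjusted = counts.copy()
for phrase, col in index.items():
    parts = phrase.split()
    if len(parts) < 2:
        continue  # single words contain nothing to subtract
    for part in parts:
        if part in index:
            # Each phrase occurrence accounts for one occurrence of the word.
            adjusted[:, index[part]] -= counts[:, col]

# Guard against negative counts if phrases happen to overlap in the text.
adjusted = np.maximum(adjusted, 0)
print(adjusted)
```

For the first document the 'dog' column drops to 0 while 'my dog' stays at 1; for the second, 'dog' keeps a count of 1 for the standalone occurrence. If the vocabulary contains phrases that overlap each other, the subtraction can over-correct, so the result would need a closer check in that case.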
