How do I use CountVectorizer to get the count of a phrase without counting words in the phrase?

Question

I am working on an NLP project where I tokenize sentences and count the different tokens. Sometimes I want a few words to be treated as a phrase, without also counting the individual words inside the phrase.

I have found CountVectorizer in scikit-learn useful for counting phrases, but I could not figure out how to exclude the words inside the phrases.

For example:

from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words, ngram_range=(1,2))
dtm = vect.fit_transform(example)
print(dtm)

I got:

>>> vect.get_feature_names()
['cat', 'dog', 'walking', 'my dog']
>>> print(dtm)
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (0, 3)    1

What I want is:

>>> print(dtm)
  (0, 0)    1
  (0, 2)    1
  (0, 3)    1

But I want to keep 'dog' in the vocabulary because it may appear on its own in other texts.

Answer 1

CountVectorizer has no option to match the longer phrase first and remove it from the text so that the shorter terms inside it are not counted.

Hence, one solution is to use CountVectorizer as you did. Afterwards, iterate over the vocabulary to find the words that are contained in longer phrases, and subtract each phrase's count from the counts of the words it contains in the first CountVectorizer result.
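The subtraction step described above could look roughly like the sketch below. It is only an illustration under some assumptions: `subtract_phrase_counts` is a hypothetical helper (not part of scikit-learn), phrases are taken to be the vocabulary entries that contain whitespace, and phrases are assumed not to nest inside or overlap with one another. The counts are hard-coded to match the output shown in the question so the snippet runs without scikit-learn.

```python
def subtract_phrase_counts(vocabulary, counts):
    """Return a copy of `counts` in which every phrase's count has been
    subtracted from the counts of the vocabulary words inside that phrase.

    vocabulary: list of terms, in the same order as the count columns.
    counts: list of rows, one row of integer counts per document.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    adjusted = [list(row) for row in counts]  # leave the input untouched

    for phrase, phrase_idx in index.items():
        parts = phrase.split()
        if len(parts) < 2:
            continue  # a single word contains nothing to subtract
        for word in parts:
            word_idx = index.get(word)
            if word_idx is None:
                continue  # e.g. 'my' is not in the vocabulary
            for original_row, row in zip(counts, adjusted):
                row[word_idx] -= original_row[phrase_idx]
    return adjusted


# Values from the question: every term was counted once.
words = ['cat', 'dog', 'walking', 'my dog']
counts = [[1, 1, 1, 1]]

adjusted = subtract_phrase_counts(words, counts)
print(adjusted)  # [[1, 0, 1, 1]]
```

With the objects from the question, the input rows could be obtained with `dtm.toarray().tolist()`. The zero for 'dog' corresponds to the missing (0, 1) entry in the desired output, and 'dog' stays in the vocabulary, so it is still counted when it appears on its own in other texts.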
