I am working on an NLP project and want to tokenize sentences and count the different tokens. Sometimes I want a few words to be treated as a phrase, without also counting the individual words inside that phrase.
I have found CountVectorizer in scikit-learn useful for counting phrases, but I could not figure out how to exclude the words inside the phrases.
For example:
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words, ngram_range=(1, 2))
dtm = vect.fit_transform(example)
print(dtm)
I got:
>>> vect.get_feature_names()
['cat', 'dog', 'walking', 'my dog']
>>> print(dtm)
(0, 0) 1
(0, 1) 1
(0, 2) 1
(0, 3) 1
What I want is:
>>> print(dtm)
(0, 0) 1
(0, 2) 1
(0, 3) 1
But I want to keep 'dog' in the dictionary because it may appear on its own in other text.
CountVectorizer has no specific option to match the longer phrase first and remove it from the text so that the shorter words inside it are not counted.
Hence, one solution is to use CountVectorizer as you did. Afterwards, iterate over the vocabulary to find the words that are contained in longer phrases, and subtract the count of each longer phrase from the counts of the shorter words it contains in the initial CountVectorizer result.
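The subtraction step described above could be sketched roughly as follows. This is a minimal illustration and not a built-in scikit-learn feature: the helper name count_without_overlap is hypothetical, and it assumes phrases are space-separated, lowercase, and tokenized the same way as CountVectorizer's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer


def count_without_overlap(texts, vocab):
    """Hypothetical helper: count vocab entries, then remove from each
    single word the occurrences already covered by a longer phrase."""
    # Longest phrase length (in words) decides the upper n-gram bound.
    max_n = max(len(entry.split()) for entry in vocab)
    vect = CountVectorizer(vocabulary=vocab, ngram_range=(1, max_n))
    counts = vect.transform(texts).toarray()

    # With a list vocabulary, column order follows the list order.
    column = {entry: i for i, entry in enumerate(vocab)}

    adjusted = counts.copy()
    for phrase, j in column.items():
        parts = phrase.split()
        if len(parts) < 2:
            continue  # single words contain no sub-words to adjust
        for part in parts:
            if part in column:
                # Each phrase occurrence accounts for one occurrence of the word.
                adjusted[:, column[part]] -= counts[:, j]

    # Overlapping phrases could over-subtract, so never go below zero.
    return adjusted.clip(min=0)


words = ['cat', 'dog', 'walking', 'my dog']
example = ['I was walking my dog and cat in the park']
print(count_without_overlap(example, words))  # [[1 0 1 1]]
```

The 'dog' column stays in the vocabulary, so a text such as 'my dog saw another dog' would still get a count of 1 for 'dog' alongside 1 for 'my dog'. When two phrases in the vocabulary share a word and overlap in the text, the subtraction may remove too much; the clip at zero only guards against negative counts and does not fully resolve that case.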