Is there any method to treat multiple word as single in Python? I've written a script to find Tf-Idf value of words in a collection of documents. The problem is that, it gives the Tf-Idf for individual words. But there are cases where i've to treat multiple word as as one, such as words like Big Data , Machine Learning should be treated as a single word and Tf-Idf score for those word should be calculated. Any help would be highly useful.
I would approach it using scikit-learn and the TfidfVectorizer. Tweaking some of it's parameters would basically allow you to do all the work.
It's hard to show it's functionality though without a good example.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = "lots of text"
vectorizer = TfidfVectorizer(ngram_range=(2,2))
result = vectorizer.fit_transform(corpus)
Know that the ngram_range
parameter allows you to choose if you are interested in eg bigrams, trigrams, etc. by choosing a range.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.