
Python: treat multiple words as a single word

Is there a way to treat multiple words as a single word in Python? I've written a script that computes the TF-IDF value of words in a collection of documents. The problem is that it only gives the TF-IDF for individual words. In some cases I have to treat several words as one: phrases such as "Big Data" and "Machine Learning" should be treated as a single word, and a TF-IDF score should be calculated for them. Any help would be greatly appreciated.

I would approach this with scikit-learn's TfidfVectorizer. Tweaking a few of its parameters lets it do essentially all of the work.

It's hard to show its functionality without a good example, though.

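from sklearn.feature_extraction.text import TfidfVectorizer
# fit_transform expects an iterable of documents, not a bare string
corpus = ["lots of text"]
# ngram_range=(2, 2) extracts bigrams (two-word phrases) only
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
# sparse matrix: one row per document, one column per bigram
result = vectorizer.fit_transform(corpus)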

Note that the ngram_range parameter takes a (min_n, max_n) tuple that selects which n-gram sizes are extracted: (2, 2) yields only bigrams, (3, 3) only trigrams, and (1, 2) yields both single words and bigrams.
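
As a rough illustration of how ngram_range could give scores for phrases like "big data", here is a minimal sketch. The three-document corpus and the phrase_scores helper are made up for this example and are not part of scikit-learn; it also assumes the default tokenizer, which lowercases the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: one string per document.
corpus = [
    "big data needs machine learning",
    "machine learning is fun",
    "big cities produce data",
]

# ngram_range=(1, 2) keeps single words and also adds two-word phrases.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(corpus)


def phrase_scores(phrase):
    """Hypothetical helper: TF-IDF score of `phrase` in each document (0.0 if absent)."""
    # vocabulary_ maps each extracted term (word or phrase) to its column index.
    column = vectorizer.vocabulary_[phrase]
    return tfidf[:, column].toarray().ravel().tolist()


big_data_scores = phrase_scores("big data")
machine_learning_scores = phrase_scores("machine learning")
print(big_data_scores)
print(machine_learning_scores)
```

With this setup "big data" only gets a non-zero score in the first document, because the third one contains "big" and "data" but never next to each other. Keep in mind that this scores every adjacent word pair, not just the phrases you care about, so you would still pick out the columns for your own phrase list afterwards.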

