Python: treat multiple words as a single term

Question

Is there any way to treat multiple words as a single term in Python? I've written a script that computes the Tf-Idf value of words in a collection of documents. The problem is that it only gives the Tf-Idf for individual words. In some cases I have to treat several words as one: phrases like "Big Data" or "Machine Learning" should each count as a single term, and the Tf-Idf score should be calculated for the phrase as a whole. Any help would be highly appreciated.

Answer 1

I would approach it using scikit-learn and its TfidfVectorizer. Tweaking some of its parameters lets it do essentially all of the work for you.

Its functionality is hard to demonstrate without a good example corpus, but the basic usage looks like this:

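from sklearn.feature_extraction.text import TfidfVectorizer

# fit_transform expects an iterable of documents (strings), not one bare string
corpus = ["lots of text", "even more text"]
# ngram_range=(2, 2) extracts word bigrams only
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
result = vectorizer.fit_transform(corpus)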

Note that the ngram_range parameter takes a (min_n, max_n) tuple that selects which n-gram sizes are extracted: (2, 2) yields only bigrams, (3, 3) only trigrams, and (1, 2) both single words and bigrams.
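
To make the idea of an n-gram range concrete, here is a small plain-Python sketch of which terms a given range would produce from one document. The helper `word_ngrams` is hypothetical and written only for illustration; it is not part of scikit-learn, and its tokenization (lowercase, split on whitespace) is simpler than what TfidfVectorizer actually does.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return the word n-grams of `text` for every n in the inclusive range.

    Illustrative helper only: tokenizes by lowercasing and splitting on
    whitespace, which is an assumption and not scikit-learn's tokenizer.
    """
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the document.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("Big Data meets Machine Learning", (2, 2)))
# ['big data', 'data meets', 'meets machine', 'machine learning']
```

With a range of (2, 2), "big data" and "machine learning" each become a single term, which is what lets them receive their own Tf-Idf score. A range of (1, 2) would keep the individual words as well.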
