
Python Count Number of Phrases in Text

I have a list of product reviews/descriptions in Excel and I am trying to classify them in Python based on words that appear in the reviews.

I import both the reviews and a list of words that would indicate the product falling into a certain classification into Python using Pandas, and then count the number of occurrences of the classification words.

This all works fine for single classification words, e.g. 'computer', but I am struggling to make it work for phrases, e.g. 'laptop case'.

I have looked through a few answers, but none worked for me, including:

using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2. That does not work because my code splits the text into individual words first (and I don't think str.count accepts a list either).

Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve splitting the body of text into individual words?

The code I currently have (that works for individual terms) is:

pool = []
for i in df1.index:
    descriptions = df1['detaileddescription'][i]
    if isinstance(descriptions, str):
        # split the description into words and count how many of the
        # classification terms in df2['laptop_bag'] appear among them
        descriptions = descriptions.split()
        pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
    else:
        pool.append(0)
print(pool)

You're on the right track! You're currently splitting into single words, which, as you pointed out, makes it easy to find occurrences of single words. To find phrases of length n, you should split the text into chunks of length n, which are called n-grams.

To do that, check out the NLTK package:

from nltk import ngrams
sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
    print(gram)
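
Putting that together with the loop from the question, here is a minimal sketch (assuming df1, df2 and the 'detaileddescription' / 'laptop_bag' columns are defined as in the question, that the phrases are at most two words long, and with count_phrases as a purely illustrative helper):

from nltk import ngrams

def count_phrases(text, phrases):
    """Illustrative helper: count occurrences of one- or two-word phrases in text."""
    words = text.lower().split()
    # join each bigram tuple back into a space-separated string so it can be
    # compared against phrases like 'laptop case'
    grams = words + [' '.join(g) for g in ngrams(words, 2)]
    return sum(grams.count(p.lower()) for p in phrases)

pool = []
for i in df1.index:
    description = df1['detaileddescription'][i]
    if isinstance(description, str):
        # dropna() skips empty cells in the phrase column
        pool.append(count_phrases(description, df2['laptop_bag'].dropna()))
    else:
        pool.append(0)
print(pool)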

Sklearn's CountVectorizer is the standard way to do this:

from sklearn.feature_extraction import text

# CountVectorizer expects an iterable of documents (strings)
descriptions = ['I have a laptop case and a laptop bag']
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)

And if you want to see the counts as a dict:

# pair each feature (word) with its count in the first document, keeping non-zero entries
# (on scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead)
count_dict = {k: v for k, v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v > 0}
print(count_dict)

The default is unigrams; you can count bigrams or higher-order n-grams with the ngram_range parameter.
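
For instance, a minimal sketch (reusing the example sentence from above and the newer get_feature_names_out accessor) that counts single words and two-word phrases in one pass, so 'laptop case' and 'laptop bag' each get their own count:

from sklearn.feature_extraction import text

# count single words and two-word phrases in the same pass
vectorizer = text.CountVectorizer(ngram_range=(1, 2))
vec = vectorizer.fit_transform(['I have a laptop case and a laptop bag'])

counts = dict(zip(vectorizer.get_feature_names_out(), vec.toarray()[0]))
print(counts['laptop case'])  # 1
print(counts['laptop bag'])   # 1
print(counts['laptop'])       # 2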
