
How to get similarity (PMI) score between a keyword and paragraphs using python?

I'm working on a project which extracts keywords from customer reviews. I somehow managed to extract the keywords using a topic modelling technique.

Now I'm looking for a technique or algorithm in Python to rank the reviews based on their similarity to a keyword.

For example, for the keyword 'delicious food', I would like to get similarity scores for the reviews as below.

review                                                                    score
this is place is costly but their food is delicious                       0.7
I would not recommend this place for hangout.                             0.0
This is place is very clean and friendly, perhaps, food is not so great!  0.2

How can I get the semantic similarity score between a keyword and sentence?

I have a method for doing this, but it's complex, so I'll show it first and go over it afterwards. Here it is:

sentences = ["this is place is costly but their food is delicious", "This is place is very clean and friendly, perhaps, food is not so great!", "I would not recommend this place for hangout."]
search = "food delicious"

count = 0
lst = []

for sentence in sentences:
    if search in sentence:
        lst.append([sentence, 1])
    else:
        for word in search.split():
            if word in sentence:
                count += 1
        lst.append([sentence, max(round(count / len(search.split()) - 0.3, 1), 0)])
        count = 0

for i in lst:
    print(*i)

This will give your desired outputs.

Basically, the first line puts the reviews into a list called sentences. The second line creates a variable called search, which holds the key phrase.

Now, after that, we need to create 2 variables called count and lst. lst will be the list we use to store our results, and count is a counter we will need later.

In line 7, we start a for loop, which will go through the sentences one by one.

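In line 8, we check whether the exact key phrase is in the sentence, that is, whether "food delicious" appears somewhere in it. If it does, we add the sentence and its PMI score of 1 to the list we created earlier. With the example reviews this never happens, because the first review says "food is delicious" rather than "food delicious".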

Note: (The table does not specify that this is needed, so if it is not, then you can just remove it!)

So, next, we use else: to show that, if the exact key phrase is not in the sentence, then we need to do something else to get the PMI score. If we didn't have this else:, a sentence containing the exact key phrase would be added to the list twice, once with a score of 1 and once with the word-based score.

In line 11, it starts another for loop, but this time, it will iterate through every word in search.split() . search.split() just produces a list of search words, separating them by spaces. For example, here, the search.split() would be ["food", "delicious"] . So now, we are iterating through that list.
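As a quick check of what search.split() returns, here is a minimal sketch using the same search value as in the code above. The name words is only introduced here for illustration.

```python
search = "food delicious"

# split() with no arguments splits on any run of whitespace
words = search.split()

print(words)       # ['food', 'delicious']
print(len(words))  # 2
```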

Now, in line 12, we check whether the word we are currently looping through is in the sentence we are currently looping through. If it is, the count variable we created earlier is increased by 1. So, by the end of the inner loop, count holds the number of search words that appear in the sentence.

Note: This means that if one word, e.g. food, came up twenty times in a sentence, the code would still count it only once. To change this, you can replace count += 1 with count += sentence.count(word), which counts every occurrence of the word in the sentence. Be aware that the score can then go above 1.
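To make the difference between the two counting styles concrete, here is a small sketch. The sentence below is made up for illustration and is not one of the reviews. It also shows a side effect worth knowing: both `in` and `count()` work on substrings, so a short search word can match inside a longer word.

```python
sentence = "good food, cheap food, fast food"  # made-up example sentence
word = "food"

# Membership test: True or False, so it adds at most 1 per search word
present = word in sentence

# str.count: number of non-overlapping occurrences of the substring
occurrences = sentence.count(word)

print(present, occurrences)  # True 3

# Both are substring checks, so "eat" is also found inside "great"
print("eat" in "food is not so great")  # True
```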

Now, after the search.split() for loop has ended, we need to add our score to the list. Here comes some math. First, we divide count by len(search.split()), the number of words in search, to get the fraction (between 0 and 1) of search words that come up in the sentence. However, this raises a problem. If 2 words came up, and there were 2 words in the search variable, then we would be doing 2/2, which is 1. We don't want 1, we want 0.7. Therefore, we also need to subtract 0.3 from our number. I rounded this value to one decimal place because floating-point arithmetic can leave messy trailing digits.
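The arithmetic can be checked in isolation. This sketch assumes the same two-word search as above and walks through every possible number of matched words; the names n_words, matched, fraction, shifted and shifted_scores are only used here for illustration.

```python
search = "food delicious"
n_words = len(search.split())  # 2 search words

# For each possible number of matched words: fraction, then fraction - 0.3
shifted_scores = []
for matched in range(n_words + 1):
    fraction = matched / n_words
    shifted = round(fraction - 0.3, 1)
    shifted_scores.append(shifted)
    print(matched, fraction, shifted)

# Printed rows:
# 0 0.0 -0.3
# 1 0.5 0.2
# 2 1.0 0.7
```

The 0.2 and 0.7 rows match the scores asked for in the question. The negative value in the first row still needs to be dealt with.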

Now, we still have one last problem in the lst.append() line. If 0 words from search came up in the sentence, and there were 2 words in the search variable, then we would be doing 0/2, which is 0. That's what we want, but then we subtract 0.3, which gives us -0.3. To avoid this, we wrap the result in max(..., 0), so the score can never drop below 0.
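A short sketch of the clamping step on its own, for the case where no search word matches. The names raw and clamped are illustrative only.

```python
raw = round(0 / 2 - 0.3, 1)  # no search words matched
clamped = max(raw, 0)

print(raw)      # -0.3
print(clamped)  # 0
```

Because the second argument is the integer 0, max() returns an int here, which is why the score prints as 0 rather than 0.0. Passing 0.0 instead should keep every score a float, if that matters for your output.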

Finally, right after, we reset the count to 0, so that the next sentence can start with a fresh count, to avoid any statistical errors.

That's all. To print the results, I just used a small for loop at the end, but you don't need it.

These are my results:

this is place is costly but their food is delicious 0.7
This is place is very clean and friendly, perhaps, food is not so great! 0.2
I would not recommend this place for hangout. 0

PS: The *i in the print() call on the last line unpacks the inner list, so the sentence and its score are printed as separate values, without the brackets and commas. It does not change the list itself in any way.
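For comparison, this is roughly what the two ways of printing look like for a single entry of lst. The variable name item is just for illustration.

```python
item = ["I would not recommend this place for hangout.", 0]

print(item)   # prints the list itself, with brackets, quotes and a comma
print(*item)  # same as print(item[0], item[1]): values separated by a space
```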

I know that this was long, but it is important to read everything to understand the point of each line.
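Since the question asks for a ranking, the same logic could be wrapped in a function that also sorts the results. This is only a sketch of one way to do it: the function name score_reviews and the offset parameter are my own inventions, and it assumes search contains at least one word.

```python
def score_reviews(sentences, search, offset=0.3):
    """Score each sentence with the same heuristic as the loop above.

    score_reviews and offset are hypothetical names for this sketch.
    Assumes search contains at least one word.
    """
    words = search.split()
    scored = []
    for sentence in sentences:
        if search in sentence:
            score = 1  # exact key phrase found
        else:
            # Number of search words that appear (as substrings) in the sentence
            matched = sum(1 for word in words if word in sentence)
            score = max(round(matched / len(words) - offset, 1), 0)
        scored.append((sentence, score))
    # Highest score first; sorted() is stable, so ties keep their input order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


reviews = [
    "I would not recommend this place for hangout.",
    "This is place is very clean and friendly, perhaps, food is not so great!",
    "this is place is costly but their food is delicious",
]

ranked = score_reviews(reviews, "food delicious")
for sentence, score in ranked:
    print(score, sentence)
```

Keeping the score calculation inside the function also removes the need to reset count by hand after every sentence.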
