
Python: Grouping similar words with sentences in pandas

I have a database with sentences and, often, single words. I frequently have word pairs like purchase and purchases; when I count the words, purchase and purchases are tallied separately, which distorts the counts. My need is as follows:

I want to loop over my columns and, the first time I encounter a word, replace the similar words in the other sentences. I tried with fuzzywuzzy, but I only end up with single words, never whole sentences.

For example:

This topic is about purchasing

He was talking about shopping

It becomes:

This topic is about purchasing

He was talking about purchasing

Even if the sentence is distorted, that's okay.

Data sample:

I applied this code, but the result is not satisfactory:

import pandas
from fuzzywuzzy import fuzz

# Replace strings that are 90% or more similar
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 90:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count:]
                
    return input_list

df = pandas.read_csv('input.txt') # Read data from csv
result = []
for column in list(df):
    column_values = list(df[column])
    first_words = [x[:x.index(" ")] if " " in x else x for x in column_values]
    result.append(func(first_words))
    
new_df = pandas.DataFrame(result).transpose() 
new_df.columns = list(df)

print(new_df)

Try stemming: map each word to its stemmed root and count the stems (a lemmatizer could be swapped in the same way).

import re
from nltk.stem import LancasterStemmer
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # note: the matching below is case-sensitive
stemmer = LancasterStemmer()
sentence = "This topic is about purchasing. He was talking about shopping. Thinking about making a purchase. That was the reason for the request. "
sentence = re.sub(r'\.', '', sentence)  # drop the full stops
word_count = {}
for word in sentence.split():
    if word not in stop_words:
        stemmed_word = stemmer.stem(word)
        word_count[stemmed_word] = word_count.get(stemmed_word, 0) + 1
print(word_count)

output:

{'thi': 1, 'top': 1, 'purchas': 2, 'he': 1, 'talk': 1, 'shop': 1, 'think': 1, 'mak': 1, 'that': 1, 'reason': 1, 'request': 1}
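The same grouping can be applied inside the DataFrame itself. Here is a dependency-free sketch (an assumption on my part, using the standard library's `difflib.SequenceMatcher`, the matcher that fuzzywuzzy's `ratio` is built on): each word is replaced by the first similar word seen so far, so variants collapse to a single spelling before counting:

```python
import difflib
import pandas as pd

def canonical(word, seen, cutoff=0.7):
    """Map `word` to the first similar word seen so far (difflib ratio >= cutoff)."""
    match = difflib.get_close_matches(word, seen, n=1, cutoff=cutoff)
    if match:
        return match[0]
    seen.append(word)
    return word

df = pd.DataFrame({"input": ["This topic is about purchasing",
                             "He was talking about purchase"]})
seen = []  # canonical spellings, in order of first appearance
# words are lowercased, so the output loses the original casing
df["output"] = df["input"].apply(
    lambda s: " ".join(canonical(w.lower(), seen) for w in s.split()))
print(df["output"].tolist())
# → ['this topic is about purchasing', 'he was talking about purchasing']
```

The cutoff is a knob to tune: 0.7 is enough to merge purchase/purchasing (ratio ≈ 0.78) while keeping short words like "is" and "this" (ratio ≈ 0.67) apart.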

Maybe this is a possible solution. Given the following data:

input
This topic is about purchasing
He was talking about shopping
That was the reason for the request
About request
requests
My home is nice
My home is beautiful
My homes are nice

with:

import pandas as pd
from fuzzywuzzy import fuzz

# Replace strings that are 50% or more similar
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 50:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count:]
                
    return input_list

df = pd.read_csv('input.txt')

result = []
for column in list(df):
    column_values = list(df[column])
    result.append(func(column_values))
    
new_df = pd.DataFrame(result).transpose() 
new_df.columns = ['output']

full_df = pd.concat([df,new_df], axis=1)
print(full_df)

you would get the following output:

                             input                               output
0       This topic is about purchasing        He was talking about shopping
1        He was talking about shopping        He was talking about shopping
2  That was the reason for the request  That was the reason for the request
3                        About request                             requests
4                             requests                             requests
5                      My home is nice                    My homes are nice
6                 My home is beautiful                    My homes are nice
7                    My homes are nice                    My homes are nice

Note that I lowered the similarity threshold. Indeed, if you check the scores, none of these pairs reaches 90.
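You can check the raw scores yourself. A small sketch (using the standard library's `difflib`, whose `SequenceMatcher.ratio` is what fuzzywuzzy's `ratio` builds on, scaled here to 0-100 like `fuzz.ratio`):

```python
import difflib

def ratio(a, b):
    # SequenceMatcher ratio scaled to 0-100, comparable to fuzz.ratio
    return round(difflib.SequenceMatcher(None, a, b).ratio() * 100)

pairs = [
    ("This topic is about purchasing", "He was talking about shopping"),
    ("My home is nice", "My homes are nice"),
    ("About request", "requests"),
]
for a, b in pairs:
    print(f"{ratio(a, b):3} | {a!r} vs {b!r}")
```

Every pair scores well below 90, so a threshold of 90 never triggers a replacement.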

Another approach would be:

import pandas as pd
import fuzzywuzzy.fuzz as fuzz

df = pd.read_csv('input.txt')
print('--- before ---')
print(df)
SENTENCES = df['input'].to_list()
print('--- changes ---')
for index, word in enumerate(SENTENCES):
    for other_index, other_word in enumerate(SENTENCES[index+1:], index+1):
        result = fuzz.token_sort_ratio(word, other_word)
        if result > 10:
            print(f'OK | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
        if result > 45:
            SENTENCES[index] = other_word
                
df['output'] = SENTENCES

print(df)

which gives a little info about what is happening:

--- before ---
                                 input
0       This topic is about purchasing
1        He was talking about shopping
2  That was the reason for the request
3                        About request
4                             requests
5                      My home is nice
6                 My home is beautiful
7                    My homes are nice
--- changes ---
OK |  51 |  0 This topic is about purchasing ->  1 He was talking about shopping
OK |  34 |  0 This topic is about purchasing ->  2 That was the reason for the request
OK |  37 |  0 This topic is about purchasing ->  3 About request
OK |  16 |  0 This topic is about purchasing ->  4 requests
OK |  36 |  0 This topic is about purchasing ->  5 My home is nice
OK |  28 |  0 This topic is about purchasing ->  6 My home is beautiful
OK |  30 |  0 This topic is about purchasing ->  7 My homes are nice
OK |  41 |  1 He was talking about shopping ->  2 That was the reason for the request
OK |  43 |  1 He was talking about shopping ->  3 About request
OK |  16 |  1 He was talking about shopping ->  4 requests
OK |  23 |  1 He was talking about shopping ->  5 My home is nice
OK |  37 |  1 He was talking about shopping ->  6 My home is beautiful
OK |  26 |  1 He was talking about shopping ->  7 My homes are nice
OK |  38 |  2 That was the reason for the request ->  3 About request
OK |  37 |  2 That was the reason for the request ->  4 requests
OK |  16 |  2 That was the reason for the request ->  5 My home is nice
OK |  22 |  2 That was the reason for the request ->  6 My home is beautiful
OK |  31 |  2 That was the reason for the request ->  7 My homes are nice
OK |  67 |  3 About request ->  4 requests
OK |  21 |  3 About request ->  5 My home is nice
OK |  36 |  3 About request ->  6 My home is beautiful
OK |  33 |  3 About request ->  7 My homes are nice
OK |  17 |  4 requests ->  5 My home is nice
OK |  29 |  4 requests ->  6 My home is beautiful
OK |  32 |  4 requests ->  7 My homes are nice
OK |  54 |  5 My homes are nice ->  6 My home is beautiful
OK | 100 |  5 My homes are nice ->  7 My homes are nice
OK |  54 |  6 My home is beautiful ->  7 My homes are nice
                                 input                               output
0       This topic is about purchasing        He was talking about shopping
1        He was talking about shopping        He was talking about shopping
2  That was the reason for the request  That was the reason for the request
3                        About request                             requests
4                             requests                             requests
5                      My home is nice                    My homes are nice
6                 My home is beautiful                    My homes are nice
7                    My homes are nice                    My homes are nice

To just get the dataframe:

import pandas as pd
import fuzzywuzzy.fuzz as fuzz

df = pd.read_csv('input.txt')

SENTENCES = df['input'].to_list()

for index, word in enumerate(SENTENCES):
    for other_index, other_word in enumerate(SENTENCES[index+1:], index+1):
        result = fuzz.token_sort_ratio(word, other_word)
        if result > 45:
            SENTENCES[index] = other_word
                
df['output'] = SENTENCES

print(df)
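Once variants are grouped, the original goal (counting words without purchase/purchases splitting the tally) can be sketched with a naive suffix stripper plus `value_counts`. The `normalize` function below is a hypothetical, hand-rolled stand-in for a real stemmer such as NLTK's LancasterStemmer:

```python
import pandas as pd

def normalize(word):
    # Naively strip common English suffixes so variants share one key.
    # (Hypothetical stand-in for a real stemmer; not linguistically sound.)
    word = word.lower()
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

df = pd.DataFrame({"input": ["This topic is about purchasing",
                             "He was talking about a purchase",
                             "About request",
                             "requests"]})
# split sentences into words, normalize each word, then count the stems
counts = df["input"].str.split().explode().map(normalize).value_counts()
print(counts)
```

Here purchasing and purchase both collapse to the key `purchas`, and request/requests to `request`, so each variant pair is counted once.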
