
Normalizing words for sentiment analysis

I'm currently doing sentiment analysis and have run into a problem.

I have a large normalization dictionary of words, and I want to normalize the text before tokenizing it, as in this example:

data                        normal
kamu knp sayang             kamu kenapa sayang
drpd sedih mending belajar  dari pada sedih mending belajar
dmna sekarang               di mana sekarang
  • knp: kenapa
  • drpd: dari pada
  • dmna: di mana

This is my code:

import pandas as pd

slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'], 'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})
                  
normalisasi = {}

for index, row in slang.iterrows():
  if row[0] not in normalisasi:
    normalisasi[row[0]] = row[1]


def normalized_term(document):
    return [normalisasi[term] if term in normalisasi else term for term in document]

df['normal'] = df['data'].apply(normalized_term)
df

However, the result looks like this: [image: result]

I want the result to look like the example table.

pandas has a string method, Series.str.replace, that replaces a substring (or a regular-expression pattern) with another string. The pandas documentation covers it in full. You can get your desired output like this:

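import pandas as pd

# Lookup table: slang form -> normalized form
slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'], 'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})

# Apply each slang -> normal replacement to the whole column in turn
for idx, row in slang.iterrows():
    # regex=False treats the slang form as a literal substring
    df['data'] = df['data'].str.replace(row['before'], row['after'], regex=False)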

output:

                              data
0               kamu kenapa sayang
1  dari pada sedih mending bermain
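
One caveat with the loop above: str.replace matches substrings, so a slang form that happens to occur inside a longer word would be rewritten too. Below is a minimal sketch of a word-level alternative, assuming the text can be split on whitespace. The helper name normalize_text is hypothetical and not part of the original post. It also shows why the code in the question misbehaved: iterating over a string yields single characters, so the text has to be split into words before the dictionary lookup.

```python
import pandas as pd

slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'],
                      'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})

# Build the lookup dictionary once: slang word -> normalized form.
# (If 'before' contained duplicates, the last entry would win here.)
normalisasi = dict(zip(slang['before'], slang['after']))


def normalize_text(text):
    """Hypothetical helper: replace whole words found in the dictionary."""
    # Split into whitespace-separated words first; iterating over the raw
    # string would yield single characters rather than words.
    words = text.split()
    # Only words that exactly match a dictionary key are replaced.
    return ' '.join(normalisasi.get(word, word) for word in words)


# Write to a new column so the original text is kept alongside it.
df['normal'] = df['data'].apply(normalize_text)
print(df)
```

Because the lookup is done per word, a token that merely contains a slang form (say, a longer word starting with "knp") is left untouched. Note that splitting and re-joining collapses runs of whitespace into single spaces.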


 