I'm currently doing sentiment analysis and have run into a problem.
I have a large normalization dictionary for words, and I want to normalize the text before tokenizing it, as in this example:
data | normal |
---|---|
kamu knp sayang | kamu kenapa sayang |
drpd sedih mending belajar | dari pada sedih mending belajar |
dmna sekarang | di mana sekarang |
This is my code:
import pandas as pd
slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'], 'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})
normalisasi = {}
for index, row in slang.iterrows():
    if row[0] not in normalisasi:
        normalisasi[row[0]] = row[1]
def normalized_term(document):
    return [normalisasi[term] if term in normalisasi else term for term in document]
df['normal'] = df['data'].apply(normalized_term)
df
However, the result looks like this: [image: result]
I want the result to look like the example table.
pandas has a utility named str.replace that lets you replace a substring with another string, or find and replace patterns. See the pandas documentation for Series.str.replace for details. You can get your desired output like this:
import pandas as pd
slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'], 'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})
for idx, row in slang.iterrows():
    df.data = df.data.str.replace(row['before'], row['after'])
Output:
data
0 kamu kenapa sayang
1 dari pada sedih mending bermain
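One caveat with the loop above: str.replace matches substrings, so a short slang key could also be replaced when it appears inside a longer word. If that matters for your data, a token-based lookup is one alternative. The following is only a minimal sketch, assuming words are separated by whitespace; the function name normalize_text is illustrative and not from the original code. It also shows why the code in the question misbehaves: iterating over a string yields single characters, so the text has to be split into words first.

```python
import pandas as pd

slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'],
                      'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})

# Build the lookup table once: slang word -> normalized form.
# (If 'before' contained duplicates, the last entry would win.)
normalisasi = dict(zip(slang['before'], slang['after']))

def normalize_text(text):
    # Split on whitespace so each word is looked up as a whole token,
    # then join the (possibly replaced) tokens back into one string.
    return ' '.join(normalisasi.get(term, term) for term in text.split())

df['normal'] = df['data'].apply(normalize_text)
print(df['normal'].tolist())
# ['kamu kenapa sayang', 'dari pada sedih mending bermain']
```

Because each word is matched as a whole, a key such as 'knp' is left alone when it only occurs as part of a longer word. Note that splitting and re-joining collapses repeated whitespace into single spaces.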