Replace words in a string with respect to words in a pandas DataFrame

Question

I have a string:

str = 'i have a banana and an apple'

I also have a DataFrame:

name    new_name
have     had
bed      eat
banana   lime

I want to replace each word in the string if that word exists in the name column of the DataFrame.

For example, for my str the output should be:

'i had a lime and an apple'

I am trying to define a function:

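def replace(df, string):
    L = []
    for word in string.split():  # iterate over words, not characters
        # Look up the replacement for this word, if there is one
        match = df.loc[df['name'] == word, 'new_name']
        # Keep the original word when the lookup finds nothing
        new_word = match.iloc[0] if not match.empty else word
        L.append(new_word)  # append inside the loop, once per word
    result_str = ' '.join(map(str, L))
    return result_str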

But this seems very lengthy. Is there a better (more time-efficient) way to get this output?

Answer 1

Option 1

Split your string on the natural delimiter (space)
Call pd.Series.replace, passing a Series that maps name to new_name (m below)
Combine the cells in the series with str.cat / str.join

m = df.set_index('name').new_name

pd.Series(string.split()).replace(m).str.cat(sep=' ')
'i had a lime and an apple'

Here string is your original string. Don't name a variable str; that shadows the built-in class of the same name.

Alternatively, calling str.join should be faster than str.cat:

' '.join(pd.Series(string.split()).replace(m).tolist())
'i had a lime and an apple'

I'll be using this method of joining strings in a Series from now on; you'll also see it in the next option.
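For reference, the steps above can be put together into a self-contained sketch. The DataFrame is rebuilt here from the table in the question, and the mapping is converted to a plain dict before being passed to replace; that conversion is an assumption made for this sketch, not part of the original answer.

```python
import pandas as pd

# Rebuild the question's lookup table (column names taken from the question)
df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})
string = 'i have a banana and an apple'

# Map each word in `name` to its replacement in `new_name`
m = df.set_index('name')['new_name'].to_dict()

# One word per row, replace whole-cell matches, then join back into a string
words = pd.Series(string.split())
result = ' '.join(words.replace(m).tolist())
print(result)
```

Because replace compares whole cell values here, a word such as bedroom is left untouched even though it contains bed.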

Option 2
You can skip pandas for the replacement itself and use re.sub instead:

import re

m = df.set_index('name').new_name.to_dict()
# Escape each name and group the alternatives so that \b applies to all of them
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, df.name)))

re.sub(p, lambda x: m.get(x.group()), string)
'i had a lime and an apple'
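A detail worth checking whenever a pattern is built from alternatives: | has the lowest precedence in a regular expression, so a \b written outside a group only constrains the first and last alternative. The sketch below compares an ungrouped and a grouped pattern; the dict mapping and the sample text are stand-ins invented for illustration.

```python
import re

# Stand-in for the DataFrame: name -> new_name
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
text = 'a bedroom with a bed'

# Ungrouped: reads as (\bhave) | (bed) | (banana\b), so "bed" matches anywhere
loose = r'\b{}\b'.format('|'.join(mapping))
# Grouped and escaped: both word boundaries apply to every alternative
grouped = r'\b(?:{})\b'.format('|'.join(map(re.escape, mapping)))

loose_result = re.sub(loose, lambda x: mapping[x.group()], text)
grouped_result = re.sub(grouped, lambda x: mapping[x.group()], text)

print(loose_result)    # a eatroom with a eat
print(grouped_result)  # a bedroom with a eat
```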

Performance

string = 'i have a banana and an apple ' * 10000

# Series replacement

%%timeit
m = df.set_index('name').new_name
' '.join(pd.Series(string.split()).replace(m).tolist())

100 loops, best of 3: 20.3 ms per loop

# regex replacement

%%timeit
m = df.set_index('name').new_name.to_dict()
p = r'\b{}\b'.format('|'.join(df.name.tolist()))
re.sub(p, lambda x: m.get(x.group()), string)

10 loops, best of 3: 30.7 ms per loop
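%%timeit only works inside IPython or Jupyter. As a rough sketch of how a comparable measurement could be taken from a plain script, the standard timeit module can be used; the dict mapping stands in for the DataFrame, the pattern is a grouped variant, and the repeat counts are arbitrary choices, so the numbers will not match the ones above.

```python
import re
import timeit

# Stand-in for the DataFrame: name -> new_name
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
# The trailing space keeps repeated copies from fusing into words like "applei"
string = 'i have a banana and an apple ' * 10000

pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, mapping))))

def regex_replace():
    # Replace every whole-word match with its mapped value
    return pattern.sub(lambda match: mapping[match.group()], string)

# Best of 3 repeats of 10 calls each, reported as time per call
best = min(timeit.repeat(regex_replace, repeat=3, number=10)) / 10
print('regex replacement: {:.2f} ms per call'.format(best * 1000))
```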

Answer 2

Use Series.replace with the parameter regex=True:

a = 'i have a banana and an apple'

b = pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
print(b)
i had a lime and an apple
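One thing to keep in mind with regex=True is that each key is treated as a pattern that can match anywhere in the string, including inside longer words. The sketch below illustrates this and shows one possible way to restrict matches to whole words by wrapping each key in \b; the sample sentence and the bounded_mapping helper dict are assumptions made for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})
mapping = df.set_index('name')['new_name'].to_dict()

s = pd.Series(['i behave and have a bedtime'])

# Plain keys act as patterns and also match inside longer words
loose = s.replace(mapping, regex=True)[0]

# Wrapping each escaped key in \b...\b limits matches to whole words
bounded_mapping = {r'\b{}\b'.format(re.escape(k)): v for k, v in mapping.items()}
bounded = s.replace(bounded_mapping, regex=True)[0]

print(loose)    # i behad and had a eattime
print(bounded)  # i behave and had a bedtime
```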

Another solution:

a = 'i have a banana and an apple'

import re
d = df.set_index('name')['new_name'].to_dict()
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
b = p.sub(lambda x: d[x.group()], a)
print(b)
i had a lime and an apple
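If the same mapping is applied to many strings, the compiled pattern can be built once and reused. The helper below is a hypothetical wrapper (make_replacer is not a pandas or re function) that also escapes keys containing regex metacharacters and returns the text unchanged when the mapping is empty.

```python
import re

def make_replacer(mapping):
    """Return a function that replaces whole words found in `mapping`."""
    if not mapping:
        # An empty alternation would match empty strings, so skip the regex
        return lambda text: text
    # Escape keys so characters such as "." are matched literally
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, mapping)) + r')\b')
    return lambda text: pattern.sub(lambda match: mapping[match.group()], text)

replace_words = make_replacer({'have': 'had', 'bed': 'eat', 'banana': 'lime'})
print(replace_words('i have a banana and an apple'))  # i had a lime and an apple
```

With a DataFrame at hand, the mapping could be produced as in the answer with df.set_index('name')['new_name'].to_dict().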

Timings:

a = 'i have a banana and an apple' * 1000

In [205]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
100 loops, best of 3: 2.52 ms per loop

In [206]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
1000 loops, best of 3: 1.43 ms per loop


In [208]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
100 loops, best of 3: 3.11 ms per loop


In [211]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
100 loops, best of 3: 2.91 ms per loop

a = 'i have a banana and an apple' * 10000

In [213]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
     ...: 
100 loops, best of 3: 19.8 ms per loop

In [214]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
100 loops, best of 3: 4.1 ms per loop

In [215]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
10 loops, best of 3: 26.3 ms per loop

In [216]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
10 loops, best of 3: 22.8 ms per loop

2 answers

solution1
2 2018-01-22 13:34:55

solution2
1 ACCEPTED 2018-01-22 13:37:03
