简体   繁体   中英

replace words in a string with respect to words in pandas dataframe

i have a string:

str = 'i have a banana and an apple'

i also have a dataframe

name    new_name
have     had
bed      eat
banana   lime

i want to replace words in string if that words exists in pandas df.

for eg( for my str= the output should be.

'i had a lime and an apple'

i am tryng to define a function

def replace(df,string):
    L = []
    for i in string:
        new_word = df[[new_name]].loc[df.name==i].item()
        if not new_word:
             new_word = i
    L.append(new_word)
    result_str = ' '.join(map(str, L))
    return result_str

But this seems very lenghty, is there a better way(time efficient) to get such output?

Option 1

  1. Split your string on the natural delimiter (space)
  2. Call pd.Series.replace , and pass new_name as an argument
  3. Combine the cells in the series with str.cat / str.join

m = df.set_index('name').new_name

pd.Series(string.split()).replace(m).str.cat(sep=' ')
'i had a lime and an apple'

Where string is your original string. Don't use str to define variables, that hides the builtin class with the same name.

Alternatively, calling str.join should be faster than str.cat -

' '.join(pd.Series(string.split()).replace(m).tolist())
'i had a lime and an apple'

I'll be using this method of joining strings in Series from now one, you'll also see it in the forthcoming option.


Option 2
You can skip pandas, and instead use re.sub :

import re

m = df.set_index('name').new_name.to_dict()
p = r'\b{}\b'.format('|'.join(df.name.tolist()))

re.sub(p, lambda x: m.get(x.group()), string)
'i had a lime and an apple'

Performance

string = 'i have a banana and an apple ' * 10000

# Series-`replacement

%%timeit
m = df.set_index('name').new_name
' '.join(pd.Series(string.split()).replace(m).tolist())

100 loops, best of 3: 20.3 ms per loop

# `re`gex replacement

%%timeit
m = df.set_index('name').new_name.to_dict()
p = r'\b{}\b'.format('|'.join(df.name.tolist()))
re.sub(p, lambda x: m.get(x.group()), string)

10 loops, best of 3: 30.7 ms per loop

Use replace with parameter regex=True :

a = 'i have a banana and an apple'

b = pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
print (b)
i had a lime and an apple

Another solution:

a = 'i have a banana and an apple'

import re
d = df.set_index('name')['new_name'].to_dict()
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
b = p.sub(lambda x: d[x.group()], a)
print (b)
i had a lime and an apple

Timings :

a = 'i have a banana and an apple' * 1000

In [205]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
100 loops, best of 3: 2.52 ms per loop

In [206]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
1000 loops, best of 3: 1.43 ms per loop


In [208]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
100 loops, best of 3: 3.11 ms per loop


In [211]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
100 loops, best of 3: 2.91 ms per loop

a = 'i have a banana and an apple' * 10000

In [213]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
     ...: 
100 loops, best of 3: 19.8 ms per loop

In [214]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
100 loops, best of 3: 4.1 ms per loop

In [215]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
10 loops, best of 3: 26.3 ms per loop

In [216]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
10 loops, best of 3: 22.8 ms per loop

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM