Replace words in a string based on words in a pandas DataFrame

I have a string:

str = 'i have a banana and an apple'

I also have a DataFrame:

name    new_name
have     had
bed      eat
banana   lime

I want to replace each word in the string if that word exists in the name column of the DataFrame.

For example, for my string the output should be:

'i had a lime and an apple'
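To reproduce the example, the inputs can be built as in the sketch below. The variable names `string` and `expected` are assumptions made for this illustration; `string` is used in place of `str` so that the built-in type is not shadowed.

```python
import pandas as pd

# The lookup table from the question: words to find and their replacements.
df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})

# Named `string` here (an assumption for this sketch) rather than `str`.
string = 'i have a banana and an apple'
expected = 'i had a lime and an apple'
```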

I am trying to define a function:

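def replace(df, string):
    words = []
    # iterate over words, not over the characters of the string
    for word in string.split():
        # look up the replacement for this word, if there is one
        match = df.loc[df.name == word, 'new_name']
        new_word = match.iloc[0] if not match.empty else word
        words.append(new_word)
    return ' '.join(words)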

But this seems very lengthy. Is there a better (more time-efficient) way to get this output?

Option 1

  1. Split your string on the natural delimiter (space)
  2. Call pd.Series.replace, passing new_name (indexed by name) as the argument
  3. Combine the cells in the series with str.cat / str.join

m = df.set_index('name').new_name

pd.Series(string.split()).replace(m).str.cat(sep=' ')
'i had a lime and an apple'

Where string is your original string. Don't use str as a variable name; that shadows the built-in class of the same name.

Alternatively, calling str.join should be faster than str.cat:

' '.join(pd.Series(string.split()).replace(m).tolist())
'i had a lime and an apple'

I'll be using this method of joining strings in a Series from now on; you'll also see it in the next option.
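Putting the three steps of Option 1 together, a self-contained version might look like the sketch below. The function name `replace_words` is hypothetical, and the mapping is passed as a plain dict rather than as a Series, which is a small variation on the code above. Note that `split()` without arguments collapses runs of whitespace, so the original spacing is not preserved.

```python
import pandas as pd

def replace_words(df, text):
    """Replace whole space-separated words using df's name -> new_name columns."""
    # Build an exact-match lookup from the two columns.
    mapping = dict(zip(df['name'], df['new_name']))
    # One word per cell; replace() swaps cells that equal a key in the mapping.
    words = pd.Series(text.split())
    return ' '.join(words.replace(mapping).tolist())

df = pd.DataFrame({'name': ['have', 'bed', 'banana'],
                   'new_name': ['had', 'eat', 'lime']})
result = replace_words(df, 'i have a banana and an apple')
print(result)  # i had a lime and an apple
```

Because each cell is compared as a whole value, words that merely contain a key (such as "bedroom") are left untouched.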


Option 2
You can skip pandas and use re.sub instead:

import re

m = df.set_index('name').new_name.to_dict()
# group the alternation so that \b applies to every word, and escape special characters
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, df.name)))

re.sub(p, lambda x: m.get(x.group()), string)
'i had a lime and an apple'
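One detail worth checking in a pattern of this kind is how `|` interacts with `\b`. Alternation has the lowest precedence in a regular expression, so unless the alternatives are wrapped in a group, the word boundaries attach only to the first and last alternative. The sketch below compares the two forms on a made-up input; the dict `mapping` and the sample text are assumptions for illustration.

```python
import re

mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
alternation = '|'.join(map(re.escape, mapping))

ungrouped = r'\b{}\b'.format(alternation)    # \bhave|bed|banana\b
grouped = r'\b(?:{})\b'.format(alternation)  # \b(?:have|bed|banana)\b

text = 'a bedroom'
ungrouped_result = re.sub(ungrouped, lambda match: mapping[match.group()], text)
grouped_result = re.sub(grouped, lambda match: mapping[match.group()], text)

print(ungrouped_result)  # a eatroom
print(grouped_result)    # a bedroom
```

In the ungrouped pattern the middle alternative `bed` carries no boundary at all, so it matches inside "bedroom".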

Performance

string = 'i have a banana and an apple ' * 10000

# Series replacement

%%timeit
m = df.set_index('name').new_name
' '.join(pd.Series(string.split()).replace(m).tolist())

100 loops, best of 3: 20.3 ms per loop

# regex replacement

%%timeit
m = df.set_index('name').new_name.to_dict()
p = r'\b{}\b'.format('|'.join(df.name.tolist()))
re.sub(p, lambda x: m.get(x.group()), string)

10 loops, best of 3: 30.7 ms per loop

Use replace with the parameter regex=True:

a = 'i have a banana and an apple'

b = pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
print (b)
i had a lime and an apple
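A caveat with `regex=True` is that the keys are treated as regular expressions and can match inside longer words. The sketch below illustrates this with a made-up input and shows one possible way to restrict the match to whole words by adding `\b` to the key; the variable names are assumptions for illustration.

```python
import pandas as pd

text = 'we behave'

# Plain keys are treated as regular expressions and match inside longer words.
loose = pd.Series(text).replace({'have': 'had'}, regex=True)[0]

# Anchoring the key with \b restricts the match to whole words.
strict = pd.Series(text).replace({r'\bhave\b': 'had'}, regex=True)[0]
whole_word = pd.Series('we have').replace({r'\bhave\b': 'had'}, regex=True)[0]

print(loose)       # we behad
print(strict)      # we behave
print(whole_word)  # we had
```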

Another solution:

a = 'i have a banana and an apple'

import re
d = df.set_index('name')['new_name'].to_dict()
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
b = p.sub(lambda x: d[x.group()], a)
print (b)
i had a lime and an apple

Timings:

a = 'i have a banana and an apple' * 1000

In [205]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
100 loops, best of 3: 2.52 ms per loop

In [206]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
1000 loops, best of 3: 1.43 ms per loop


In [208]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
100 loops, best of 3: 3.11 ms per loop


In [211]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
100 loops, best of 3: 2.91 ms per loop

a = 'i have a banana and an apple' * 10000

In [213]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
     ...: 
100 loops, best of 3: 19.8 ms per loop

In [214]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
100 loops, best of 3: 4.1 ms per loop

In [215]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
10 loops, best of 3: 26.3 ms per loop

In [216]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
10 loops, best of 3: 22.8 ms per loop
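The `%%timeit` cells above are IPython magics. Outside IPython, a comparable measurement can be taken with the standard `timeit` module, as in the sketch below. It times only the compiled-regex approach, uses a plain dict in place of the DataFrame, and the names `mapping`, `regex_replace` and `best` are assumptions for illustration. No timing figures are given, since they depend on the machine.

```python
import re
import timeit

mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, mapping)) + r')\b')

def regex_replace(text):
    # Look up each matched word in the dict and substitute its replacement.
    return pattern.sub(lambda match: mapping[match.group()], text)

# Trailing space keeps the repeated sentences from running together.
text = 'i have a banana and an apple ' * 1000

# Best of 3 repeats, 10 calls each, reported per call in milliseconds.
best = min(timeit.repeat(lambda: regex_replace(text), number=10, repeat=3)) / 10
print('{:.3f} ms per call'.format(best * 1000))
```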
