Replace words in a string with respect to words in a pandas DataFrame
I have a string:
str = 'i have a banana and an apple'
I also have a DataFrame:
name    new_name
have    had
bed     eat
banana  lime
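For anyone who wants to reproduce this, the table above can be built roughly like this. It is only a sketch of the setup; the variable names `df` and `string` are the ones used later in the thread, and the column dtypes are assumed to be plain strings.

```python
import pandas as pd

# Assumed reconstruction of the lookup table shown above.
df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})

# Named `string` rather than `str` so the builtin is not shadowed.
string = 'i have a banana and an apple'

print(df)
```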
I want to replace words in the string if those words exist in the name column of the DataFrame.
For example, for my str the output should be:
'i had a lime and an apple'
I am trying to define a function:
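def replace(df, string):
    L = []
    # iterate over words, not over single characters
    for word in string.split():
        # look up the replacement for this word, if there is one
        match = df.loc[df.name == word, 'new_name']
        new_word = match.iloc[0] if not match.empty else word
        L.append(new_word)
    result_str = ' '.join(map(str, L))
    return result_str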
But this seems very lengthy. Is there a better (more time-efficient) way to get this output?
Option 1
Split the string into words, wrap them in a pd.Series, call pd.Series.replace with new_name (indexed by name) as the argument, then join the cells in the Series with str.cat / str.join:
m = df.set_index('name').new_name
pd.Series(string.split()).replace(m).str.cat(sep=' ')
'i had a lime and an apple'
Where string is your original string. Don't use str to name variables; that hides the builtin class with the same name.
Alternatively, calling str.join should be faster than str.cat:
' '.join(pd.Series(string.split()).replace(m).tolist())
'i had a lime and an apple'
I'll be using this method of joining strings in a Series from now on; you'll also see it in the next option.
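Putting Option 1 together, a self-contained sketch could look like the following. The function name `replace_words` and the sample DataFrame are my own additions for illustration, not something from the question.

```python
import pandas as pd

# Assumed sample data matching the table in the question.
df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})

def replace_words(df, string):
    """Replace whole words found in df.name with df.new_name."""
    # Series indexed by the old word, holding the new word.
    m = df.set_index('name')['new_name']
    # One cell per whitespace-separated word.
    words = pd.Series(string.split())
    # Exact (non-regex) replacement of matching cells, then join back.
    return ' '.join(words.replace(m).tolist())

print(replace_words(df, 'i have a banana and an apple'))
# → i had a lime and an apple
```

Note that splitting and re-joining collapses runs of whitespace into single spaces, and a word with punctuation attached (for example `banana,`) will not match its entry.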
Option 2
You can skip pandas and instead use re.sub:
import re
m = df.set_index('name').new_name.to_dict()
# group the alternatives so both \b anchors apply to every word
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, df.name)))
re.sub(p, lambda x: m.get(x.group()), string)
'i had a lime and an apple'
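One detail worth checking in the regex route is how the alternation is grouped. Here is a pandas-free sketch; the `mapping` dict stands in for `df.set_index('name').new_name.to_dict()` and `replace_words` is a hypothetical helper name. Without a group, `\b` only binds to the first and last alternative, so a word like "bedroom" would be touched.

```python
import re

# Stand-in for df.set_index('name').new_name.to_dict()
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}

# Escape each word and group the alternation so \b applies to all of them.
pattern = re.compile(
    r'\b(?:{})\b'.format('|'.join(map(re.escape, mapping)))
)

def replace_words(string):
    # Every match is guaranteed to be a key of `mapping`.
    return pattern.sub(lambda match: mapping[match.group()], string)

print(replace_words('i have a banana and an apple'))
# → i had a lime and an apple

# Ungrouped pattern for comparison: 'bed' is matched inside 'bedroom'.
ungrouped = r'\bhave|bed|banana\b'
print(re.sub(ungrouped, lambda match: mapping[match.group()], 'my bedroom'))
# → my eatroom
print(replace_words('my bedroom'))
# → my bedroom
```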
Performance
string = 'i have a banana and an apple ' * 10000
# Series replacement
%%timeit
m = df.set_index('name').new_name
' '.join(pd.Series(string.split()).replace(m).tolist())
100 loops, best of 3: 20.3 ms per loop
# regex replacement
%%timeit
m = df.set_index('name').new_name.to_dict()
p = r'\b{}\b'.format('|'.join(df.name.tolist()))
re.sub(p, lambda x: m.get(x.group()), string)
10 loops, best of 3: 30.7 ms per loop
Use replace with parameter regex=True:
a = 'i have a banana and an apple'
b = pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
print (b)
i had a lime and an apple
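A caveat with regex=True, as far as I can tell: the keys are treated as patterns and matched anywhere in the text, so "have" is also replaced inside "behave". If that matters for your data, one possible workaround is to wrap each key in word boundaries. The `bounded` dict below is my own addition, not part of the answer above.

```python
import re
import pandas as pd

# Assumed sample data matching the table in the question.
df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})
m = df.set_index('name')['new_name']

a = 'i behave and have a banana'

# Plain keys: searched as regex patterns anywhere in the string.
loose = pd.Series(a).replace(m.to_dict(), regex=True)[0]

# Hypothetical variant: anchor every key on word boundaries.
bounded = {r'\b{}\b'.format(re.escape(k)): v for k, v in m.items()}
strict = pd.Series(a).replace(bounded, regex=True)[0]

print(loose)
print(strict)
```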
Another solution:
a = 'i have a banana and an apple'
import re
d = df.set_index('name')['new_name'].to_dict()
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
b = p.sub(lambda x: d[x.group()], a)
print (b)
i had a lime and an apple
Timings:
a = 'i have a banana and an apple' * 1000
In [205]: %%timeit
...: import re
...: d = df.set_index('name')['new_name'].to_dict()
...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
...: b = p.sub(lambda x: d[x.group()], a)
...:
100 loops, best of 3: 2.52 ms per loop
In [206]: %%timeit
...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
...:
1000 loops, best of 3: 1.43 ms per loop
In [208]: %%timeit
...: m = df.set_index('name').new_name
...:
...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
...:
100 loops, best of 3: 3.11 ms per loop
In [211]: %%timeit
...: m = df.set_index('name').new_name.to_dict()
...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
...:
...: re.sub(p, lambda x: m.get(x.group()), a)
...:
100 loops, best of 3: 2.91 ms per loop
a = 'i have a banana and an apple' * 10000
In [213]: %%timeit
...: import re
...: d = df.set_index('name')['new_name'].to_dict()
...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
...: b = p.sub(lambda x: d[x.group()], a)
...:
...:
100 loops, best of 3: 19.8 ms per loop
In [214]: %%timeit
...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
...:
100 loops, best of 3: 4.1 ms per loop
In [215]: %%timeit
...: m = df.set_index('name').new_name
...:
...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
...:
10 loops, best of 3: 26.3 ms per loop
In [216]: %%timeit
...: m = df.set_index('name').new_name.to_dict()
...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
...:
...: re.sub(p, lambda x: m.get(x.group()), a)
...:
10 loops, best of 3: 22.8 ms per loop
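If you want to rerun a comparison like this outside IPython, the standard timeit module can do a rough equivalent of %%timeit. This is only a sketch: the function names `with_series` and `with_regex` are mine, the input is joined with spaces so both methods produce the same text, and the numbers will differ by machine and pandas version.

```python
import re
import timeit
import pandas as pd

# Assumed sample data matching the table in the question.
df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})

# Space-joined so that split/join round-trips the text exactly.
a = ' '.join(['i have a banana and an apple'] * 1000)

def with_series(text):
    m = df.set_index('name')['new_name']
    return ' '.join(pd.Series(text.split()).replace(m).tolist())

def with_regex(text):
    d = df.set_index('name')['new_name'].to_dict()
    p = re.compile(r'\b(' + '|'.join(map(re.escape, d)) + r')\b')
    return p.sub(lambda x: d[x.group()], text)

for func in (with_series, with_regex):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(a), number=10, repeat=3)) / 10
    print('{}: {:.2f} ms per call'.format(func.__name__, best * 1e3))
```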