`re.sub()` in pandas
Say I have:
s = 'white male, 2 white females'
And want to "expand" this to:
'white male, white female, white female'
A more complete list of cases would be:
It seems like I am close with:
import re
# Do I need boundaries here?
mult = re.compile('two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
# This works:
s = 'white male, 2 white females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'white male, white female, white female'
# This fails:
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# ' , , hispanic males, hispanic female, hispanic female,'
What is causing the trip-up in the second case?
Bonus question: Is there a method of pandas' Series that implements this functionality directly, instead of using Series.apply()? Sorry to revise my question and waste anyone's time here.
For instance, on:
s = pd.Series(
['white male',
'white male, white female',
'hispanic male, 2 hispanic females',
'black male, 2 white females'])
Is there a faster route than:
s.apply(lambda x: mult.sub(..., x))
With regard to your "bonus" question, you can use pandas.Series.str.replace, one of the pandas.Series.str methods that work with regex:
In [10]: import re
In [11]: import pandas as pd
In [12]: s = pd.Series(
...: ['white male',
...: 'white male, white female',
...: 'hispanic male, 2 hispanic females',
...: 'black male, 2 white females'])
In [13]: mult = re.compile('two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
...:
In [14]: s.str.replace(mult, r'\g<race> \g<gender>, \g<race> \g<gender>', regex=True)
Out[14]:
0 white male
1 white male, white female
2 hispanic male, hispanic female, hispanic female
3 black male, white female, white female
dtype: object
Whether or not these methods are significantly faster than .apply, I don't know. I suspect that you'll never be very fast working with object dtypes.
Note, I found an issue about these methods being on the slow side. I suppose that until they decide it is worth writing a Cythonized implementation, you probably can't hope for much.
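To check that the two routes agree, here is a minimal sketch comparing Series.str.replace with Series.apply on the question's data. It assumes the grouped pattern (?:two|2) rather than the original one, and it passes regex=True explicitly because newer pandas releases treat the pattern as a literal string by default and reject a compiled pattern in that mode.

```python
import re
import pandas as pd

s = pd.Series(
    ['white male',
     'white male, white female',
     'hispanic male, 2 hispanic females',
     'black male, 2 white females'])

# Assumption: the count alternation is grouped so "two" and "2" behave alike.
mult = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
repl = r'\g<race> \g<gender>, \g<race> \g<gender>'

# regex=True is spelled out so the compiled pattern is accepted.
via_str = s.str.replace(mult, repl, regex=True)

# The route from the question, for comparison.
via_apply = s.apply(lambda x: mult.sub(repl, x))

print(via_str.tolist() == via_apply.tolist())
print(via_str.tolist())
```

Both calls end up running the same Python-level regex substitution on each element, so a large speed difference between them would be surprising.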
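IIUC, you need to put parentheses around two|2, like (two|2), if you want to match either. Without them, the alternation splits the whole pattern in two: it matches either the bare word two or the full 2 ... expression, so for two the named groups are never filled in.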
import re
mult = re.compile('(two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'hispanic male, hispanic male, hispanic female, hispanic female'
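The precedence issue can be seen directly by inspecting what the ungrouped pattern matches. The sketch below also adds \b word boundaries, which touches on the "Do I need boundaries here?" comment in the question; the boundaries and the names loose and tight are assumptions of this sketch, not part of the original answer.

```python
import re

# Without a group, "|" splits the whole pattern into two alternatives:
# the literal "two", or "2 <race> <gender>s".
loose = re.compile(r'two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

# A non-capturing group limits the alternation to the count word. The \b
# anchors (an added assumption) keep "2" from matching inside "12".
tight = re.compile(r'\b(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s\b')

s = 'two hispanic males, 2 hispanic females'
repl = r'\g<race> \g<gender>, \g<race> \g<gender>'

m = loose.search(s)
print(repr(m.group(0)))   # the first match is only the word 'two'
print(m.group('race'))    # None: the named groups took no part in the match

print(tight.sub(repl, s))
print(tight.sub(repl, '12 white females'))  # left unchanged by the boundary
```

A non-capturing group (?:...) is used instead of (two|2) only so that the count word does not become group 1; either form fixes the alternation.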
Regarding your regex itself, I'd go with the following one, which is more general and optimized.
In [14]: mult = re.compile('(?:two|2) ([^,]+)')
In [15]: s = 'two hispanic males, 2 hispanic females'
In [16]: mult.sub(lambda x: x.group(1) + ' ' + x.group(1), s)
Out[16]: 'hispanic males hispanic males, hispanic females hispanic females'
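Note that the output above keeps the plural and joins the copies with a space rather than a comma. If the comma-separated singular form from the question is wanted, a callable replacement can build it. The following is only a sketch: the COUNTS table, the three/3 entries, and the expand helper are hypothetical additions, since the question only shows two and 2.

```python
import re

# Hypothetical lookup table; only "two" and "2" appear in the question.
COUNTS = {'two': 2, '2': 2, 'three': 3, '3': 3}

pattern = re.compile(
    r'\b(?P<n>two|three|2|3) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s\b')


def expand(match):
    """Return the singular item repeated as many times as the count says."""
    item = f"{match.group('race')} {match.group('gender')}"
    return ', '.join([item] * COUNTS[match.group('n')])


print(pattern.sub(expand, 'two hispanic males, 2 hispanic females'))
print(pattern.sub(expand, 'white male, 3 white females'))
```

Because the repetition count comes from the match itself, one pattern covers every count in the table instead of needing a separate replacement string per count.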
But as for performance, a list comprehension is the best way to apply the regex to a pandas Series:
In [29]: s = pd.Series(
['white male',
'white male, white female',
'hispanic male, 2 hispanic females',
'black male, 2 white females'])
In [30]: %timeit s.str.replace('(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s', r'\g<race> \g<gender>, \g<race> \g<gender>', regex=True)
1000 loops, best of 3: 205 µs per loop
In [31]: %timeit s.apply(lambda x: mult.sub(lambda x: x.group(1) + ' ' + x.group(1), x))
10000 loops, best of 3: 148 µs per loop
In [32]: %timeit [mult.sub(lambda x: x.group(1) + ' ' + x.group(1), i) for i in s]
100000 loops, best of 3: 14.6 µs per loop
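The timings above come from a four-element Series and will vary by machine and pandas version. A rough way to re-run the comparison on a larger input is sketched below with the standard timeit module; the 1000x repetition of the data, the function names, and the shared pattern are assumptions made for this sketch. The list comprehension is wrapped back into a Series here, which the original timing did not do.

```python
import re
import timeit
import pandas as pd

s = pd.Series(
    ['white male',
     'white male, white female',
     'hispanic male, 2 hispanic females',
     'black male, 2 white females'] * 1000)  # assumed size for the sketch

mult = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
repl = r'\g<race> \g<gender>, \g<race> \g<gender>'


def with_str_replace():
    return s.str.replace(mult, repl, regex=True)


def with_apply():
    return s.apply(lambda x: mult.sub(repl, x))


def with_list_comprehension():
    # Rebuild a Series so all three functions return the same kind of object.
    return pd.Series([mult.sub(repl, x) for x in s], index=s.index)


runs = 20
for fn in (with_str_replace, with_apply, with_list_comprehension):
    seconds = timeit.timeit(fn, number=runs)
    print(f'{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call')
```

Using the same pattern and replacement in all three functions keeps the comparison about the calling mechanism rather than about the regex.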
The simplest way:
import pandas as pd
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
lst2 = [11, 22, 33, 44, 55, 66, 77]
df = pd.DataFrame(list(zip(lst, lst2)), columns =['Name', 'val'])
# \1 $1 \g<1>
df.replace(regex=r'(\w)(?P<ewe>\w)', value=r'\g<1>_\g<ewe>=')
## Output
Name val
0 G_e=e_k=s 11
1 F_o=r 22
2 G_e=e_k=s 33
3 i_s= 44
4 p_o=r_t=a_l= 55
5 f_o=r 66
6 G_e=e_k=s 77
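Applied to the data from the question, the same DataFrame.replace call might look like the sketch below. The frame, its column names desc and n, and the grouped pattern with word boundaries are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame: one text column and one numeric column.
df = pd.DataFrame({
    'desc': ['white male',
             'hispanic male, 2 hispanic females',
             'black male, two white females'],
    'n': [1, 3, 3],
})

pattern = r'\b(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s\b'
repl = r'\g<race> \g<gender>, \g<race> \g<gender>'

# The regex is applied to string cells; the numeric column is left as it is.
expanded = df.replace(regex=pattern, value=repl)
print(expanded)
```

As in the example above, the numeric column passes through untouched, so the call can be made on the whole frame without selecting the text column first.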