Pandas Series: identifying number of consecutive consonants
Given a Series of strings, I'm trying to compute a new Series which contains the highest count of consecutive consonants in each original string, ignoring spaces. For example, given df['names'], I'd like to determine df['max_consonants'] like below:
In [1]: df
Out[1]:
names max_consonants
0 will hunting 2
1 sean maguire 1
2 gerald lambeau 2
3 chuckie sullivan 2
4 mike krzyzewski 5
Outside of pandas, I am able to do this using the re module, like so:
In [2]: import re

In [3]: def max_consonants(s):
   ...:     return max(len(i) for i in re.findall(r'[^aeiou ]+', s))

In [4]: max_consonants('mike krzyzewski')
Out[4]: 5
I know I can use pd.Series.apply to run the max_consonants function over a Series, but it is not vectorized. I am working with data containing 2-3 million rows/names, so I am looking for the most efficient solution. Is there a more elegant solution native to pandas that would allow me to take advantage of vectorization?
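For concreteness, here is the apply-based baseline as a self-contained sketch (the DataFrame construction is mine, mirroring the example above):

```python
import re

import pandas as pd

def max_consonants(s):
    # longest run of characters that are neither vowels nor spaces
    return max(len(i) for i in re.findall(r'[^aeiou ]+', s))

df = pd.DataFrame({'names': ['will hunting', 'sean maguire',
                             'gerald lambeau', 'chuckie sullivan',
                             'mike krzyzewski']})

# .apply calls the Python function once per element -- correct, but not vectorized
df['max_consonants'] = df['names'].apply(max_consonants)
print(df['max_consonants'].tolist())  # -> [2, 1, 2, 2, 5]
```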
You could try this; it should also work for special characters because of the \W. But please note that \W does not match digits, so if you also want to split on numbers, you need to add 0-9 to the character class used by split:
df['names'].str.split(r'[AaEeIiOoUu\W]', expand=True).fillna('').applymap(len).max(axis='columns')
With the test data:
raw="""idx names max_consonants
0 will hunting 2
1 sean maguire 1
2 gerald lambeau 2
3 chuckie sullivan 2
4 mike krzyzewski 5
5 mike krzyzewski12345678 5
"""
import io
import pandas as pd

df = pd.read_csv(io.StringIO(raw), sep=r'\s{2,}', index_col=[0], engine='python')
This evaluates to:
idx
0 2
1 1
2 2
3 2
4 5
5 8
dtype: int64
By the way, the intermediate result before the applymap looks like this:
Out[89]:
0 1 2 3 4 5 6 7
idx
0 w ll h nt ng
1 s n m g r
2 g r ld l mb
3 ch ck s ll v n
4 m k krzyz wsk
5 m k krzyz wsk 12345678
A note on performance: I would expect .applymap(len) to be translated into efficient compiled operations, but I can't verify that with my data. If you run into performance problems with this solution, you can try a variant in which you perform everything up to the applymap, then replace the applymap with a loop over the columns that calls .str.len(). That would roughly look like this:
df_consonant_strings = df['names'].str.split(r'[AaEeIiOoUu\W]', expand=True).fillna('')
ser_max = None
for col in df_consonant_strings.columns:
    ser_col = df_consonant_strings[col].str.len()
    if ser_max is None:
        ser_max = ser_col
    else:
        ser_max = ser_max.where(ser_max > ser_col, ser_col)
# now ser_max contains the desired maximum length of consonant substrings
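Put together as a self-contained sketch (using the question's five names; the Series construction is mine for illustration):

```python
import pandas as pd

names = pd.Series(['will hunting', 'sean maguire', 'gerald lambeau',
                   'chuckie sullivan', 'mike krzyzewski'])

# split on vowels and non-word characters; expand=True pads ragged rows with NaN,
# which fillna('') turns into empty strings of length 0
df_consonant_strings = names.str.split(r'[AaEeIiOoUu\W]', expand=True).fillna('')

# running elementwise maximum of the per-column string lengths
ser_max = None
for col in df_consonant_strings.columns:
    ser_col = df_consonant_strings[col].str.len()
    if ser_max is None:
        ser_max = ser_col
    else:
        # keep ser_max where it is larger, otherwise take ser_col
        ser_max = ser_max.where(ser_max > ser_col, ser_col)

print(ser_max.tolist())  # -> [2, 1, 2, 2, 5]
```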