Pandas Series identifying number of consecutive consonants

Given a Series of strings, I'm trying to calculate a new Series which contains the highest consecutive count of consonants in the original string, ignoring spaces.

For example, given df['names'], I'd like to determine df['max_consonants'] like below:

In [1]: df
Out[1]:
               names max_consonants
0       will hunting              2
1       sean maguire              1
2     gerald lambeau              2
3   chuckie sullivan              2
4    mike krzyzewski              5

Outside of pandas, I am able to do this using the re module, like so:

In [2]: def max_consonants(s):
             return max(len(i) for i in re.findall(r'[^aeiou ]+', s))

In [3]: max_consonants('mike krzyzewski')
Out[3]: 5

I know I can use pd.Series.apply to apply the max_consonants function to a Series, but it is not vectorized. I am working with data containing 2-3 million rows/names, so I am looking for the most efficient solution.
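For reference, the apply-based baseline described above looks like this (a minimal sketch; the DataFrame is rebuilt here from the example names so the snippet is self-contained):

```python
import re

import pandas as pd

def max_consonants(s):
    # length of the longest run of characters that are neither vowels nor spaces
    return max(len(i) for i in re.findall(r'[^aeiou ]+', s))

df = pd.DataFrame({'names': ['will hunting', 'sean maguire',
                             'gerald lambeau', 'chuckie sullivan',
                             'mike krzyzewski']})
# apply calls the Python function once per row, so it is not vectorized
df['max_consonants'] = df['names'].apply(max_consonants)
```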

Is there a more elegant solution native to pandas that would allow me to take advantage of vectorization?

You could try this; it should also work for special characters because of the \W. But please note that \W does not match digits, so if you also want to split on those, you need to add 0-9 to the regex used by split:

df['names'].str.split(r'[AaEeIiOoUu\W]', expand=True).fillna('').applymap(len).max(axis='columns')
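For instance, with 0-9 added to the character class, a trailing run of digits no longer counts toward the maximum (a small sketch using one of the example names):

```python
import pandas as pd

names = pd.Series(['mike krzyzewski12345678'])
# 0-9 added to the class so digits also terminate a consonant run
result = (names.str.split(r'[AaEeIiOoUu\W0-9]', expand=True)
               .fillna('')
               .applymap(len)
               .max(axis='columns'))
# the longest run is 'krzyz' (5), not the digit run '12345678'
```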

With the test data:

import io
import pandas as pd

raw = """idx             names  max_consonants
0       will hunting              2
1       sean maguire              1
2     gerald lambeau              2
3   chuckie sullivan              2
4    mike krzyzewski              5
5    mike krzyzewski12345678      5
"""
df = pd.read_csv(io.StringIO(raw), sep=r'\s{2,}', engine='python', index_col=[0])

This evaluates to:

idx
0    2
1    1
2    2
3    2
4    5
5    8
dtype: int64

The intermediate result before the applymap looks like this, by the way:

Out[89]: 
      0   1   2      3    4         5  6  7
idx                                        
0     w  ll   h     nt   ng                
1     s       n      m    g            r   
2     g   r  ld      l   mb                
3    ch  ck               s        ll  v  n
4     m   k      krzyz  wsk                
5     m   k      krzyz  wsk  12345678      

Note on performance: I would expect .applymap(len) to be translated to efficient C++ operations, but I can't verify it with my data. In case you run into performance problems with this solution, you can try a variant in which you perform everything up to the applymap, then replace the applymap with a loop over the columns that performs .str.len(). It would roughly look like this:

df_consonant_strings = df['names'].str.split(r'[AaEeIiOoUu\W]', expand=True).fillna('')
ser_max = None
for col in df_consonant_strings.columns:
    ser_col = df_consonant_strings[col].str.len()
    if ser_max is None:
        ser_max = ser_col
    else:
        ser_max = ser_max.where(ser_max > ser_col, ser_col)
# now ser_max contains the desired maximum length of consonant substrings
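Run end to end on the example names, this variant produces the same maxima as the applymap version (a self-contained sketch; the DataFrame is rebuilt here, and the per-column .str.len() calls stay vectorized):

```python
import pandas as pd

df = pd.DataFrame({'names': ['will hunting', 'sean maguire',
                             'gerald lambeau', 'chuckie sullivan',
                             'mike krzyzewski']})

# split on vowels of either case and non-word characters, pad short rows with ''
df_consonant_strings = (df['names']
                        .str.split(r'[AaEeIiOoUu\W]', expand=True)
                        .fillna(''))

ser_max = None
for col in df_consonant_strings.columns:
    ser_col = df_consonant_strings[col].str.len()  # vectorized length per column
    ser_max = ser_col if ser_max is None else ser_max.where(ser_max > ser_col, ser_col)
```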
