Pandas Series identifying number of consecutive consonants

Given a Series of strings, I'm trying to calculate a new Series which contains the highest consecutive count of consonants in the original string, ignoring spaces.

For example, given df['names'], I'd like to determine df['max_consonants'] like below:

In [1]: df
Out[1]:
               names max_consonants
0       will hunting              2
1       sean maguire              1
2     gerald lambeau              2
3   chuckie sullivan              2
4    mike krzyzewski              5

Outside of pandas, I am able to do this using the re module, like so:

In [2]: def max_consonants(s):
             return max(len(i) for i in re.findall(r'[^aeiou ]+', s))

In [3]: max_consonants('mike krzyzewski')
Out[3]: 5

I know I can use pd.Series.apply to apply the max_consonants function to a Series, but it is not vectorized. I am working with data containing 2-3 million rows/names, so I am looking for the most efficient solution.
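For reference, the apply-based baseline described above looks like this (a minimal sketch; the DataFrame is rebuilt here from the example names so the snippet is self-contained):

```python
import re

import pandas as pd

def max_consonants(s):
    # length of the longest run of characters that are neither vowels nor spaces
    return max(len(i) for i in re.findall(r'[^aeiou ]+', s))

df = pd.DataFrame({'names': ['will hunting', 'sean maguire',
                             'gerald lambeau', 'chuckie sullivan',
                             'mike krzyzewski']})
# apply calls the Python function once per row, so it is not vectorized
df['max_consonants'] = df['names'].apply(max_consonants)
```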

Is there a more elegant solution native to pandas that would allow me to take advantage of vectorization?

You could try this; it should also work for special characters because of the \W. But please note that \W does not match digits, so if you also want to split on those, you need to add 0-9 to the regex used by split:

df['names'].str.split(r'[AaEeIiOoUu\W]', expand=True).fillna('').applymap(len).max(axis='columns')
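For instance, with 0-9 added to the character class, a trailing run of digits no longer counts toward the maximum (a small sketch using one of the example names):

```python
import pandas as pd

names = pd.Series(['mike krzyzewski12345678'])
# 0-9 added to the class so digits also terminate a consonant run
result = (names.str.split(r'[AaEeIiOoUu\W0-9]', expand=True)
               .fillna('')
               .applymap(len)
               .max(axis='columns'))
# the longest run is 'krzyz' (5), not the digit run '12345678'
```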

With the test data:

import io
import pandas as pd

raw = """idx             names  max_consonants
0       will hunting              2
1       sean maguire              1
2     gerald lambeau              2
3   chuckie sullivan              2
4    mike krzyzewski              5
5    mike krzyzewski12345678      5
"""
df = pd.read_csv(io.StringIO(raw), sep=r'\s{2,}', engine='python', index_col=[0])

This evaluates to:

idx
0    2
1    1
2    2
3    2
4    5
5    8
dtype: int64

The intermediate result before the applymap looks like this, by the way:

Out[89]: 
      0   1   2      3    4         5  6  7
idx                                        
0     w  ll   h     nt   ng                
1     s       n      m    g            r   
2     g   r  ld      l   mb                
3    ch  ck               s        ll  v  n
4     m   k      krzyz  wsk                
5     m   k      krzyz  wsk  12345678      

Note on performance: I would expect .applymap(len) to be translated to efficient C++ operations, but I can't verify it with my data. In case you run into performance problems with this solution, you can try a variant in which you perform everything up to the applymap, then replace the applymap with a loop over the columns that performs .str.len(). It would roughly look like this:

df_consonant_strings = df['names'].str.split(r'[AaEeIiOoUu\W]', expand=True).fillna('')
ser_max = None
for col in df_consonant_strings.columns:
    ser_col = df_consonant_strings[col].str.len()
    if ser_max is None:
        ser_max = ser_col
    else:
        ser_max = ser_max.where(ser_max > ser_col, ser_col)
# now ser_max contains the desired maximum length of consonant substrings
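Run end to end on the example names, this variant produces the same maxima as the applymap version (a self-contained sketch; the DataFrame is rebuilt here, and the per-column .str.len() calls stay vectorized):

```python
import pandas as pd

df = pd.DataFrame({'names': ['will hunting', 'sean maguire',
                             'gerald lambeau', 'chuckie sullivan',
                             'mike krzyzewski']})

# split on vowels of either case and non-word characters, pad short rows with ''
df_consonant_strings = (df['names']
                        .str.split(r'[AaEeIiOoUu\W]', expand=True)
                        .fillna(''))

ser_max = None
for col in df_consonant_strings.columns:
    ser_col = df_consonant_strings[col].str.len()  # vectorized length per column
    ser_max = ser_col if ser_max is None else ser_max.where(ser_max > ser_col, ser_col)
```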
