Count occurrences of substrings from one column in another column

Question

I am working with two dataframes: one contains a list of players, and the other contains play-by-play data for those players. The relevant portions of each are shown below.

0          Matt Carpenter
1           Jason Heyward
2           Peter Bourjos
3           Matt Holliday
4          Jhonny Peralta
5              Matt Adams
...
Name: Name, dtype: object


0     Matt Carpenter grounded out to second (Grounder).
1               Jason Heyward doubled to right (Liner).
2     Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object

What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column of the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you pass in an explicit string. For example,

batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)

works fine, but if I try

batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)

it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!

Answer 1

I think you need findall with a regex built by joining all the values of Name, then create indicator columns with MultiLabelBinarizer and add any missing columns with reindex:

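import re
import pandas as pd

s = df1['Name'] + ' scored'
# escape each name and group the alternation so \b applies to every alternative
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, s)))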

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_, 
                  index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name  Matt Carpenter scored  Jason Heyward scored  Peter Bourjos scored  \
0                         0                     0                     0   
1                         0                     0                     0   
2                         0                     1                     0   

Name  Matt Holliday scored  Jhonny Peralta scored  Matt Adams scored  
0                        0                      0                  0  
1                        0                      0                  0  
2                        0                      0                  0

Finally, if necessary, join to df2:

df = df2.join(df)
print (df)
                                                Play  Matt Carpenter scored  \
0  Matt Carpenter grounded out to second (Grounder).                      0   
1            Jason Heyward doubled to right (Liner).                      0   
2  Matt Holliday singled to right (Liner). Jason ...                      0   

   Jason Heyward scored  Peter Bourjos scored  Matt Holliday scored  \
0                     0                     0                     0   
1                     0                     0                     0   
2                     1                     0                     0   

   Jhonny Peralta scored  Matt Adams scored  
0                      0                  0  
1                      0                  0  
2                      0                  0
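The str.contains attempt in the question fails because str.contains expects a single pattern, not a Series of patterns. If the wide indicator table is not needed, a simpler route is to call it once per name. This is a hedged sketch rather than the answerer's method; the helper name runs_scored is hypothetical, and the frames are again rebuilt from the question's sample rows.

```python
import pandas as pd

# Sample data reconstructed from the rows shown in the question
df1 = pd.DataFrame({'Name': ['Matt Carpenter', 'Jason Heyward', 'Peter Bourjos',
                             'Matt Holliday', 'Jhonny Peralta', 'Matt Adams']})
df2 = pd.DataFrame({'Play': [
    'Matt Carpenter grounded out to second (Grounder).',
    'Jason Heyward doubled to right (Liner).',
    'Matt Holliday singled to right (Liner). Jason Heyward scored.',
]})

def runs_scored(name):
    # hypothetical helper: number of plays containing "<name> scored";
    # regex=False treats the text literally, so no escaping is needed
    return int(df2['Play'].str.contains(name + ' scored', regex=False).sum())

# One str.contains call per player, each with a single explicit string
df1['R vs SP'] = df1['Name'].apply(runs_scored)
print(df1)
```

This scans the Play column once per player, so it is easy to read but may be slow when both frames are large; the single combined regex avoids the repeated scans.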

Answer 1 (2, accepted 2018-07-16 13:14:51)
