Counting word occurrences in strings in Pandas

Question

I am trying to count how many times a word appears across all the strings in a Pandas Series. I have a data frame df that follows this logic:

word

hi    
hello
bye
goodbye

And a df_2 that looks like this (scroll right to see the other column):

sentence                                                                            metric_x

hello, what a wonderful day                                                         10
I did not said hello today                                                          15
what comes first, hi or hello                                                       25
the most used word is hi                                                            30
hi or hello, which is more formal                                                   50
he said goodbye, even though he never said hi or hello in the first place           5

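I am trying to achieve the following in df: count how many times each word appears in the sentences, and compute the sum of metric_x over the sentences that match that word.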

word        count       metric_x_sum
        
hi          4           110
hello       5           105
bye         0           0
goodbye     1           5

I am using this:

df['count'] = df['word'].apply(lambda x: df_2['sentence'].str.count(x).sum())

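The problem is the size of the data frames: df has 70,000 unique words and df_2 has 250,000 unique sentences. The line above ran for 15 minutes, and I have no idea how long it might take to finish.

After letting it run for 15 minutes, I got this error: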

error: multiple repeat at position 2

Is there a smarter, faster way to achieve this?

Answer 1

First split each sentence into words, expand them to one row per word with DataFrame.explode, and remove trailing commas with Series.str.strip:

df2 = df_2.assign(word = df_2['sentence'].str.split()).explode('word')
df2['word'] = df2['word'].str.strip(',')
#print (df2)
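
To make this step concrete, here is a minimal sketch on two of the sample sentences. The name `tokens` is only illustrative (the answer calls this frame df2). Note that `str.strip(',')` removes commas only, so real data with other punctuation or mixed case would probably need extra cleaning, which is an assumption beyond the sample shown.

```python
import pandas as pd

df_2 = pd.DataFrame({
    "sentence": [
        "hello, what a wonderful day",
        "what comes first, hi or hello",
    ],
    "metric_x": [10, 25],
})

# One row per whitespace-separated token; metric_x is repeated for every token.
tokens = df_2.assign(word=df_2["sentence"].str.split()).explode("word")

# Strip leading and trailing commas from each token (other punctuation is left alone).
tokens["word"] = tokens["word"].str.strip(",")

print(tokens[["word", "metric_x"]])
```

Each sentence contributes one row per token, so the later join works on exact word equality rather than on regular expressions.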

Then use DataFrame.merge with a left join and aggregate with GroupBy.count, which excludes missing values, together with sum:

df3 = (df.merge(df2, on='word', how='left')
         .groupby('word')
         .agg(count=('metric_x', 'count'), metric_x_sum=('metric_x','sum')))
# print (df3)
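
The following sketch illustrates why the left join plus `count` yields 0 for words that never occur. The small `tokens` frame is a hypothetical, already-exploded table made up for illustration, not the asker's data.

```python
import pandas as pd

df = pd.DataFrame({"word": ["hi", "bye"]})

# Hypothetical exploded token table: one row per token with its sentence's metric_x.
tokens = pd.DataFrame({
    "word": ["hi", "or", "hello", "hi"],
    "metric_x": [25, 25, 25, 30],
})

# Left join keeps every word from df; "bye" has no match, so its metric_x is NaN.
merged = df.merge(tokens, on="word", how="left")

stats = merged.groupby("word").agg(
    count=("metric_x", "count"),       # counts non-missing values only
    metric_x_sum=("metric_x", "sum"),  # NaN is skipped, so an all-NaN group sums to 0
)
print(stats)
```

Because the join introduces NaN, `metric_x_sum` comes back as a float column, which is why the answer converts it to int afterwards.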

Finally, add the result back to the original DataFrame:

df = df.join(df3, on='word')
df['metric_x_sum'] = df['metric_x_sum'].astype(int)
print (df)
      word  count  metric_x_sum
0       hi      4           110
1    hello      5           105
2      bye      0             0
3  goodbye      1             5
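
A side note on the error in the question: Series.str.count treats its argument as a regular expression, so "multiple repeat" most likely means that one of the 70,000 words contains regex metacharacters. The split-and-merge approach above never compiles words as patterns, so it sidesteps the problem. If the regex route is still wanted, a sketch along these lines may help; the word `a**` is a made-up example, not taken from the asker's data.

```python
import re

word = "a**"  # hypothetical word containing regex metacharacters

try:
    re.compile(word)
    raw_ok = True
except re.error as exc:
    raw_ok = False
    print("pattern rejected:", exc)

# re.escape turns the word into a literal pattern that compiles safely.
literal = re.compile(re.escape(word))

# \b restricts matches to whole words, so "hi" does not match inside "which".
pattern = re.compile(r"\b" + re.escape("hi") + r"\b")
matches = pattern.findall("hi or hello, which is more formal")
print(len(matches))
```

Whole-word matching also matters for the expected output: a plain substring count would find "bye" inside "goodbye", whereas the desired count for "bye" is 0.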

Answer 1
1 accepted 2020-09-23 07:08:58
