Sum of word frequencies in a DataFrame column, from a list of words

Question

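I have a column of data containing text, plus a list of individual words that I want to match against that text column. For each row of the column, I want the total number of times those words appear.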

Here is an example:

import pandas as pd

wordlist = ['alaska', 'france', 'italy']

test = pd.read_csv('vacation text.csv')
test.head(4)

Index    Text
0        'he's going to alaska and france'
1        'want to go to italy next summer'
2        'germany is great!'
3        'her parents are from france and alaska but she lives in alaska'

I tried using the following code:

test['count'] = pd.Series(test.text.str.count(r).sum() for r in wordlist)

and this code:

test['count'] = pd.Series(test.text.str.contains(r).sum() for r in wordlist)

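The problem is that the sums do not seem to accurately reflect the number of words in the text column. I noticed this when I reran the example with germany added to the list and the sum changed from 0 to 1.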

Ultimately, I would like my data to look like this:

Index    Text                                                             Count
0        'he's going to alaska and france'                                2
1        'want to go to italy next summer'                                1
2        'germany is great!'                                              0
3        'her folks are from france and italy but she lives in alaska'    3

Does anyone know another way to do this?

Answer 1

One approach is to use str.count with the words joined into a single regex alternation:

In [792]: test['Text'].str.count('|'.join(wordlist))
Out[792]:
0    2
1    1
2    0
3    3
Name: Text, dtype: int64
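One caveat with the joined pattern: str.count treats it as a regular expression and matches substrings, so a word such as 'alaskan' would also be counted for 'alaska'. The following is a minimal sketch, not part of the original answer, that escapes each word and adds word boundaries. The sample frame is rebuilt by hand here on the assumption that the column is named 'Text', as in the answer.

```python
import re
import pandas as pd

wordlist = ['alaska', 'france', 'italy']

# Sample data rebuilt by hand (assumption: the column is named 'Text').
test = pd.DataFrame({'Text': [
    "he's going to alaska and france",
    "want to go to italy next summer",
    "germany is great!",
    "her parents are from france and alaska but she lives in alaska",
]})

# Escape each word so regex metacharacters are taken literally, and wrap the
# alternation in \b so that only whole words match.
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'

counts = test['Text'].str.count(pattern)
print(counts.tolist())  # [2, 1, 0, 3]
```

If the text can contain mixed case, passing flags=re.IGNORECASE to str.count should make the match case-insensitive.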

Another approach is to count each word separately and sum the per-word counts across columns:

In [802]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist}).sum(1)
Out[802]:
0    2
1    1
2    0
3    3
dtype: int64

Details

In [804]: '|'.join(wordlist)
Out[804]: 'alaska|france|italy'

In [805]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist})
Out[805]:
   alaska  france  italy
0       1       1      0
1       0       0      1
2       0       0      0
3       2       1      0
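As for why the attempt in the question gave misleading numbers: calling .sum() on str.count(r) collapses the whole column into one total for the word r, so the result holds one number per word rather than one per row. Assigning that Series to a new column then aligns those per-word totals to the rows by index label. The sketch below is an illustration under the same assumptions as the answer (sample frame rebuilt by hand, column named 'Text'); it contrasts the two results and stores the per-row one.

```python
import pandas as pd

wordlist = ['alaska', 'france', 'italy']

# Sample data rebuilt by hand (assumption: the column is named 'Text').
test = pd.DataFrame({'Text': [
    "he's going to alaska and france",
    "want to go to italy next summer",
    "germany is great!",
    "her parents are from france and alaska but she lives in alaska",
]})

# The question's approach: one total per *word*, summed over all rows.
per_word_totals = pd.Series([test['Text'].str.count(w).sum() for w in wordlist])
print(per_word_totals.tolist())  # [3, 2, 1]

# Per-row counts: keep one column per word, then sum across the columns.
per_word = pd.DataFrame({w: test['Text'].str.count(w) for w in wordlist})
test['Count'] = per_word.sum(axis=1)
print(test['Count'].tolist())  # [2, 1, 0, 3]
```

With three words, the per-word totals only cover index labels 0 to 2. Adding a fourth word such as germany supplies a fourth value, which would explain a row's number changing when the list grows.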

Solution 1
2 Accepted 2017-08-12 19:10:28
