Sum of frequency of words in a dataframe derived from a list
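I have a column of data containing text, and a list of individual words to match against that text column. I want to sum the number of times those words occur in each row of the column.
Here is an example: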
import pandas as pd

wordlist = ['alaska', 'france', 'italy']
test = pd.read_csv('vacation text.csv')
test.head(4)
Index Text
0 'he's going to alaska and france'
1 'want to go to italy next summer'
2 'germany is great!'
4 'her parents are from france and alaska but she lives in alaska'
I tried using the following code:
test['count'] = pd.Series(test.text.str.count(r).sum() for r in wordlist)
and this code:
test['count'] = pd.Series(test.text.str.contains(r).sum() for r in wordlist)
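The problem is that the sums don't seem to accurately reflect the number of matching words in the text column. I noticed this when I added germany to the list in the example above and the sum changed from 0 to 1.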
Ultimately, I would like my data to look like this:
Index Text Count
0 'he's going to alaska and france' 2
1 'want to go to italy next summer' 1
2 'germany is great!' 0
4 'her parents are from france and alaska but she lives in alaska' 3
Does anyone know of another way to do this?
One way is to use str.count with the words joined into a single regex alternation:
In [792]: test['Text'].str.count('|'.join(wordlist))
Out[792]:
0 2
1 1
2 0
3 3
Name: Text, dtype: int64
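One caveat with the joined pattern is that it matches substrings and treats each word as a regex. The following is a minimal sketch, not part of the original answer, that escapes each word and adds word boundaries before storing the result in a Count column. The sample rows are rebuilt inline from the question rather than read from the CSV, and the name pattern is only illustrative.

```python
import re
import pandas as pd

# Sample rows copied from the question, built inline instead of read from the CSV.
test = pd.DataFrame({
    'Text': [
        "he's going to alaska and france",
        "want to go to italy next summer",
        "germany is great!",
        "her parents are from france and alaska but she lives in alaska",
    ]
})
wordlist = ['alaska', 'france', 'italy']

# Escape each word so regex metacharacters are taken literally, and wrap the
# alternation in \b so that only whole words are counted.
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in wordlist) + r')\b'

# One count per row, stored next to the text.
test['Count'] = test['Text'].str.count(pattern)
print(test)
```

Without the word boundaries, a plain 'alaska|france|italy' pattern would also count the word inside longer strings such as 'alaskan'; whether that is wanted depends on the data. Passing flags=re.IGNORECASE to str.count would additionally ignore case.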
Another way is to count each word separately and sum the per-word counts across each row:
In [802]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist}).sum(1)
Out[802]:
0 2
1 1
2 0
3 3
dtype: int64
Details
In [804]: '|'.join(wordlist)
Out[804]: 'alaska|france|italy'
In [805]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist})
Out[805]:
alaska france italy
0 1 1 0
1 0 0 1
2 0 0 0
3 2 1 0
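As for why the attempt in the question gives odd numbers, a likely explanation is that it builds one total per word rather than one count per row, and the assignment then aligns those totals with the rows by index. The sketch below reproduces this on the sample rows, again built inline; the names per_word and with_germany are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question, built inline instead of read from the CSV.
test = pd.DataFrame({
    'Text': [
        "he's going to alaska and france",
        "want to go to italy next summer",
        "germany is great!",
        "her parents are from france and alaska but she lives in alaska",
    ]
})
wordlist = ['alaska', 'france', 'italy']

# Same shape as the attempt in the question: .sum() collapses each word's
# counts over ALL rows, so this Series has one entry per word (index 0..2).
per_word = pd.Series([test['Text'].str.count(w).sum() for w in wordlist])

# Column assignment aligns on the index: row 0 receives the alaska total,
# row 1 the france total, row 2 the italy total, and row 3 gets NaN.
test['count'] = per_word
print(test)

# Adding a fourth word adds a fourth entry, which then lands on row 3.
with_germany = pd.Series(
    [test['Text'].str.count(w).sum() for w in wordlist + ['germany']]
)
print(with_germany)
```

So the values in the new column depend on the length and order of the word list, not on the text in each row, which would explain a value changing after germany was added.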