Pythonic way to count how many times words from a list/set appear in a dataframe column

Question

Given a list/set of labels

labels = {'rectangle', 'square', 'triangle', 'cube'}

and a dataframe df,

df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

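I want to know how many times each word in my set of labels occurs in the text column of the dataframe, and create a new column containing the top X (maybe 2 or 3) most repeated words. If 2 words are repeated the same number of times, they can both appear in the list or string.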

Desired output:

pd.DataFrame({'text' : ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels' : [{'rectangle' : 2, 'square' : 1, 'cube' : 1}, {'triangle' : 1}, np.nan]})

df['best_labels'] = some_function(df.text)

Answer 1

from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

# Count every word in the row and keep only those in labels; an empty dict becomes NaN
df['best_labels'] = df.text.apply(lambda x: {k: v for k, v in Counter(x.split()).items() if k in labels} or np.nan)
print(df)

Prints:

                                    text                               best_labels
0  rectangle rectangle in my square cube  {'rectangle': 2, 'square': 1, 'cube': 1}
1               triangle circle not here                           {'triangle': 1}
2                           nothing here                                       NaN
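
The snippet above keeps every matching label, while the question also asks for only the top X. One possible way to add that cut-off, as a rough sketch rather than part of the original answer, is Counter.most_common. The helper name top_labels and its parameter n are made up for illustration, and keeping every label tied with the n-th place is an assumption about what the asker wants:

```python
from collections import Counter

labels = {'rectangle', 'square', 'triangle', 'cube'}

def top_labels(text, labels, n):
    """Return {label: count} for the n most frequent labels in text (n >= 1).

    Labels tied with the n-th place are all kept (an assumption about the
    desired tie handling). Returns an empty dict when no label occurs.
    """
    counts = Counter(word for word in text.split() if word in labels)
    if not counts:
        return {}
    ranked = counts.most_common()                 # sorted by count, descending
    cutoff = ranked[min(n, len(ranked)) - 1][1]   # count at the n-th place
    return {word: count for word, count in ranked if count >= cutoff}

print(top_labels('rectangle rectangle in my square cube', labels, 1))
print(top_labels('rectangle rectangle in my square cube', labels, 2))
```

With the dataframe from the answer, this could presumably be applied as df.text.apply(lambda s: top_labels(s, labels, 2) or np.nan), so that rows without any label still end up as NaN.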

Answer 2

Another way to visualize the data is as a matrix:

(df['text'].str.extractall(r'\b({})\b'.format('|'.join(labels)))
           .groupby(level=0)[0]
           .value_counts()
           .unstack()
           .reindex(df.index)
           .rename_axis(None, axis=1))

   cube  rectangle  square  triangle
0   1.0        2.0     1.0       NaN
1   NaN        NaN     NaN       1.0
2   NaN        NaN     NaN       NaN

The idea is to extract from each row the words listed in labels, then count how many times each one occurs per sentence.

What does this look like? Yes, you guessed it: a sparse matrix.
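
To make the sparse-matrix remark concrete, here is a minimal sketch that builds the same count matrix with zeros instead of NaN and stores it with pandas' sparse dtype. The names pattern, counts and sparse_counts are illustrative, and escaping the labels with re.escape is an added precaution that the answer itself does not use:

```python
import re

import pandas as pd

labels = ['cube', 'rectangle', 'square', 'triangle']
df = pd.DataFrame(['rectangle rectangle in my square cube',
                   'triangle circle not here',
                   'nothing here'], columns=['text'])

# Escape each label so characters with a regex meaning are matched literally
pattern = r'\b({})\b'.format('|'.join(map(re.escape, labels)))

counts = (df['text'].str.extractall(pattern)
          .groupby(level=0)[0]
          .value_counts()
          .unstack(fill_value=0)
          # restore rows without any match and fix the column order
          .reindex(index=df.index, columns=labels, fill_value=0)
          .rename_axis(None, axis=1))

# Store only the non-zero cells
sparse_counts = counts.astype(pd.SparseDtype('int64', 0))
print(sparse_counts)
print(sparse_counts.sparse.density)  # fraction of cells that are non-zero
```

For three short rows the saving is negligible; a sparse representation would mainly matter with many rows and a large label set.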
