Create new DataFrame columns from the number of overlapping words between a DataFrame column and lists

Question

I'm having some trouble with the following problem: I have a DataFrame with tokenized text on every row, which looks (something) like this:

index feelings           
1     [happy, happy, sad] 
2     [neutral, sad, mad] 
3     [neutral, neutral, happy]

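and the word lists lst1=[happy, fantastic], lst2=[mad, sad], lst3=[neutral]. For each row of my DataFrame, I want to check how many of the row's words occur in each list. So the output would look like this: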

index feelings                  occlst1 occlst2 occlst3      
1     [happy, happy, sad]       2      1        0
2     [neutral, sad, mad]       0      2        1
3     [neutral, neutral, happy] 1      0        2

So I want to create new columns by comparing each DataFrame cell with the lists.

Thanks in advance!

Answer 1

Use collections.Counter

Setup:

import pandas as pd
from collections import Counter  # Load 'Counter'

df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]})

lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']

# Create an intermediate dict
occ = {'occlst1': lst1, 'occlst2': lst2, 'occlst3': lst3}

Update: as suggested by @mozway

def count_occ(sr):
    return {col: sum([v for k, v in Counter(sr).items() if k in lst])
                     for col, lst in occ.items()}

df = pd.concat([df, df['feelings'].apply(count_occ).apply(pd.Series)], axis=1)

Note: for readability, I did not use any column other than feelings. However, the concat call carries over all columns of df.

Output:

>>> df
                    feelings  occlst1  occlst2  occlst3
0        [happy, happy, sad]        2        1        0
1        [neutral, sad, mad]        0        2        1
2  [neutral, neutral, happy]        1        0        2
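
To illustrate the note about concat keeping the other columns, here is a minimal sketch that adds a hypothetical extra column named user (not part of the question). It is a variant of the approach above rather than the answer's exact code: it builds the Counter once per row and assembles the count columns from a list of dicts.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'user': ['a', 'b', 'c'],  # hypothetical extra column, for illustration only
    'feelings': [['happy', 'happy', 'sad'],
                 ['neutral', 'sad', 'mad'],
                 ['neutral', 'neutral', 'happy']],
})

occ = {'occlst1': ['happy', 'fantastic'],
       'occlst2': ['mad', 'sad'],
       'occlst3': ['neutral']}


def count_occ(words):
    counts = Counter(words)  # built once per row
    # A Counter returns 0 for missing keys, so absent words add nothing.
    return {col: sum(counts[w] for w in set(lst)) for col, lst in occ.items()}


# One dict per row -> one count column per word list, aligned on df's index.
counts_df = pd.DataFrame(df['feelings'].map(count_occ).tolist(), index=df.index)
result = pd.concat([df, counts_df], axis=1)
print(result)
```

The user column should appear unchanged next to feelings and the three count columns.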

Answer 2

You can build a reference Series that maps each feeling to its list ID, then explode + merge + pivot_table:

ref = pd.Series({e: 'occlist_%s' % (i+1) for i,l in enumerate([lst1, lst2, lst3]) for e in l}, name='cols')

## ref:
# happy        occlist_1
# fantastic    occlist_1
# mad          occlist_2
# sad          occlist_2
# neutral      occlist_3
# Name: cols, dtype: object

df.merge((df.explode('feelings')  # lists to single rows
           # create a new column with list id
           .merge(ref, left_on='feelings', right_index=True)
           # reshape back to 1 row per original index
           .pivot_table(index='index', columns='cols', values='feelings', aggfunc='count', fill_value=0)
          ),
         left_on='index', right_index=True  # merge with original df
        )

NB. I assumed here that index is a column; if it is the DataFrame's index, you need to add a df.reset_index() step.

Output:

   index                   feelings  occlist_1  occlist_2  occlist_3
0      1        [happy, happy, sad]          2          1          0
1      2        [neutral, sad, mad]          0          2          1
2      3  [neutral, neutral, happy]          1          0          2

Input:

df = pd.DataFrame({'index': [1, 2, 3],
                   'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']
                               ]})
lst1=['happy', 'fantastic']
lst2=['mad', 'sad']
lst3=['neutral']
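
For the case mentioned in the note, where the ids are the DataFrame's index rather than a column, the extra reset_index() step might look like the following sketch. It assumes the index is named index; the intermediate names flat, counts, and out are introduced here for illustration only.

```python
import pandas as pd

# Same data as the input above, but the ids 1..3 are the index (named 'index').
df = pd.DataFrame(
    {'feelings': [['happy', 'happy', 'sad'],
                  ['neutral', 'sad', 'mad'],
                  ['neutral', 'neutral', 'happy']]},
    index=pd.Index([1, 2, 3], name='index'),
)
lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']

# Map every word to the name of the list it belongs to.
ref = pd.Series({e: 'occlist_%s' % (i + 1)
                 for i, l in enumerate([lst1, lst2, lst3]) for e in l},
                name='cols')

flat = df.reset_index()  # 'index' becomes a regular column

counts = (flat.explode('feelings')  # one row per word
              .merge(ref, left_on='feelings', right_index=True)
              .pivot_table(index='index', columns='cols', values='feelings',
                           aggfunc='count', fill_value=0))

out = flat.merge(counts, left_on='index', right_index=True)
print(out)
```

Both merges are inner joins by default, so a row whose words match none of the lists would be dropped from out; passing how='left' to the final merge and filling the missing counts is one way to keep such rows.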

Answer 3

You can also use:

my_lists = [lst1, lst2, lst3]
# For each row, count how many of its words appear in each list
occ = pd.DataFrame.from_records(df['feelings'].apply(lambda x: [pd.Series(x).isin(l).sum() for l in my_lists]).tolist(), columns=['occlst1', 'occlst2', 'occlst3'])
df_occ = df.join(occ)

Output:

                    feelings  occlst1  occlst2  occlst3
0        [happy, happy, sad]        2        1        0
1        [neutral, sad, mad]        0        2        1
2  [neutral, neutral, happy]        1        0        2
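
The same per-row count can also be sketched without building a temporary Series for every row and list, by testing membership against a set. This is an alternative formulation, not the answer's code; the dict name named_lists is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]})

# Hypothetical mapping from new column name to word list.
named_lists = {'occlst1': ['happy', 'fantastic'],
               'occlst2': ['mad', 'sad'],
               'occlst3': ['neutral']}

for col, words in named_lists.items():
    lookup = set(words)  # set membership tests are cheap
    # Bind lookup as a default argument so each lambda uses its own set.
    df[col] = df['feelings'].map(
        lambda tokens, lookup=lookup: sum(t in lookup for t in tokens))

print(df)
```

Repeated words in a row are counted each time they occur, which matches the expected output in the question.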


3 answers

Answer 1
1 accepted 2021-10-14 08:18:52

Answer 2
0 2021-10-14 08:16:29

Answer 3
0 2021-10-14 11:30:40
