Creating a new dataframe column with the number of overlapping words between dataframe and list
I'm having some trouble solving the following problem: I have a dataframe with tokenized text on each row, which looks (something) like this:
index feelings
1 [happy, happy, sad]
2 [neutral, sad, mad]
3 [neutral, neutral, happy]
and the word lists lst1=[happy, fantastic], lst2=[mad, sad], lst3=[neutral].
For each row of my dataframe, I want to check how many of its words occur in each list. So the output would look like this:
index feelings occlst1 occlst2 occlst3
1 [happy, happy, sad] 2 1 0
2 [neutral, sad, mad] 0 2 1
3 [neutral, neutral, happy] 1 0 2
So I want to create new columns by comparing each dataframe cell with the lists.
Thanks in advance!
Use collections.Counter.
Setup:
import pandas as pd
from collections import Counter # Load 'Counter'
df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]})
lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']
# Create an intermediate dict
occ = {'occlst1': lst1, 'occlst2': lst2, 'occlst3': lst3}
Update: as suggested by @mozway:
def count_occ(sr):
    return {col: sum([v for k, v in Counter(sr).items() if k in lst])
            for col, lst in occ.items()}
df = pd.concat([df, df['feelings'].apply(count_occ).apply(pd.Series)], axis=1)
Note: for readability, I didn't use any column other than feelings. The concat function still keeps all the original columns of df.
Output:
>>> df
feelings occlst1 occlst2 occlst3
0 [happy, happy, sad] 2 1 0
1 [neutral, sad, mad] 0 2 1
2 [neutral, neutral, happy] 1 0 2
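A self-contained sketch of the same Counter idea follows. It assumes the `occ` mapping from the setup above and swaps `.apply(pd.Series)` for building a DataFrame from a list of dicts, which is a substitution on my part (usually faster on large frames), not something the answer itself does. The names `counts_df` and `result` are illustrative only.

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]})

# Same intermediate dict as in the setup: new column name -> word list
occ = {'occlst1': ['happy', 'fantastic'],
       'occlst2': ['mad', 'sad'],
       'occlst3': ['neutral']}

def count_occ(words):
    # Count each token once per row, then sum the counts of the words in each list.
    # Counter returns 0 for missing keys; set() guards against duplicate list entries.
    counts = Counter(words)
    return {col: sum(counts[w] for w in set(lst)) for col, lst in occ.items()}

# Build the count columns from a list of dicts (assumption: replaces .apply(pd.Series))
counts_df = pd.DataFrame(df['feelings'].map(count_occ).tolist(), index=df.index)
result = pd.concat([df, counts_df], axis=1)
print(result)
```

Passing `index=df.index` keeps the counts aligned with the original rows even when the dataframe does not use a default integer index.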
You can build a reference Series that maps each feeling to its list ID, then use explode + merge + pivot_table:
ref = pd.Series({e: 'occlist_%s' % (i+1) for i,l in enumerate([lst1, lst2, lst3]) for e in l}, name='cols')
## ref:
# happy        occlist_1
# fantastic    occlist_1
# mad          occlist_2
# sad          occlist_2
# neutral      occlist_3
# Name: cols, dtype: object
df.merge((df.explode('feelings')  # lists to single rows
            # create a new column with list id
            .merge(ref, left_on='feelings', right_index=True)
            # reshape back to 1 row per original index
            .pivot_table(index='index', columns='cols', values='feelings', aggfunc='count', fill_value=0)
         ),
         left_on='index', right_index=True  # merge with original df
        )
NB. I assume here that index is a column; if it is the actual index, you need to add a df.reset_index() step.
Output:
index feelings occlist_1 occlist_2 occlist_3
0 1 [happy, happy, sad] 2 1 0
1 2 [neutral, sad, mad] 0 2 1
2 3 [neutral, neutral, happy] 1 0 2
Input:
df = pd.DataFrame({'index': [1, 2, 3],
                   'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']
                               ]})
lst1=['happy', 'fantastic']
lst2=['mad', 'sad']
lst3=['neutral']
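For the case mentioned in the note, where index is the real index rather than a column, a minimal sketch of the extra df.reset_index() step could look like this. The names `flat`, `counts` and `out` are illustrative, and the input frame is rebuilt here with an index named 'index' as an assumption about what such data would look like.

```python
import pandas as pd

# Assumed input: same data as above, but 'index' is the dataframe index
df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]},
                  index=pd.Index([1, 2, 3], name='index'))
lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']

# Reference Series: word -> name of the count column it belongs to
ref = pd.Series({e: 'occlist_%s' % (i + 1)
                 for i, l in enumerate([lst1, lst2, lst3]) for e in l},
                name='cols')

# Turn the index into a regular 'index' column first
flat = df.reset_index()

counts = (flat.explode('feelings')                              # one row per token
              .merge(ref, left_on='feelings', right_index=True)  # attach list id
              .pivot_table(index='index', columns='cols',
                           values='feelings', aggfunc='count', fill_value=0))

out = flat.merge(counts, left_on='index', right_index=True)
print(out)
```

Both merges are inner joins by default, so a row whose tokens match none of the lists would be dropped from `out`; passing `how='left'` to the final merge and filling the missing counts with 0 is one way to keep such rows.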
You can also use:
my_lists = [lst1, lst2, lst3]
occ = pd.DataFrame.from_records(df['feelings'].apply(lambda x: [pd.Series(x).isin(l).sum() for l in my_lists]).values, columns=['occlst1', 'occlst2', 'occlst3'])
df_occ = df.join(occ)
Output:
feelings occlst1 occlst2 occlst3
0 [happy, happy, sad] 2 1 0
1 [neutral, sad, mad] 0 2 1
2 [neutral, neutral, happy] 1 0 2
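If creating a pd.Series per row turns out to be slow, the same per-list count can be sketched with plain Python set membership. This is a different technique from the isin call above, offered only as an alternative; `word_lists`, `lookup` and `result` are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]})

# New column name -> word list (hypothetical helper mapping)
word_lists = {'occlst1': ['happy', 'fantastic'],
              'occlst2': ['mad', 'sad'],
              'occlst3': ['neutral']}

result = df.copy()
for col, words in word_lists.items():
    lookup = set(words)  # set membership tests are O(1) on average
    # For each row, count the tokens that belong to this word list
    result[col] = [sum(w in lookup for w in row) for row in df['feelings']]

print(result)
```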