Creating a new dataframe column with the number of overlapping words between dataframe and list
I'm having some trouble with the following problem: I have a dataframe with tokenized text on every row, which looks (something) like this:
index feelings
1 [happy, happy, sad]
2 [neutral, sad, mad]
3 [neutral, neutral, happy]
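and the word lists lst1=[happy, fantastic], lst2=[mad, sad], lst3=[neutral]
and I want to check, for each row of my dataframe, how many of its words occur in each list. So the output would look like this: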
index feelings occlst1 occlst2 occlst3
1 [happy, happy, sad] 2 1 0
2 [neutral, sad, mad] 0 2 1
3 [neutral, neutral, happy] 1 0 2
So, I want to create new columns by comparing each dataframe cell against the lists.
Thanks in advance!
Use collections.Counter
Setup:
import pandas as pd
from collections import Counter # Load 'Counter'
df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
['neutral', 'sad', 'mad'],
['neutral', 'neutral', 'happy']]})
lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']
# Create an intermediate dict
occ = {'occlst1': lst1, 'occlst2': lst2, 'occlst3': lst3}
Update: as suggested by @mozway
def count_occ(sr):
return {col: sum([v for k, v in Counter(sr).items() if k in lst])
for col, lst in occ.items()}
df = pd.concat([df, df['feelings'].apply(count_occ).apply(pd.Series)], axis=1)
Note: for readability, I didn't use any columns other than feelings. However, the concat function keeps all the columns from df.
Output:
>>> df
feelings occlst1 occlst2 occlst3
0 [happy, happy, sad] 2 1 0
1 [neutral, sad, mad] 0 2 1
2 [neutral, neutral, happy] 1 0 2
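A minimal sketch of the same Counter idea that skips the apply(pd.Series) step, which can be slow on large frames. It builds one dict per row and passes the list of dicts to the DataFrame constructor. The names counts_df and result are only illustrative and do not come from the answer above.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]})
occ = {'occlst1': ['happy', 'fantastic'],
       'occlst2': ['mad', 'sad'],
       'occlst3': ['neutral']}

def count_occ(tokens):
    # Count every token once; a Counter returns 0 for missing keys
    counts = Counter(tokens)
    # set(words) guards against a word being listed twice in one list
    return {col: sum(counts[w] for w in set(words))
            for col, words in occ.items()}

# One dict per row -> one DataFrame row; reuse df's index so join aligns
counts_df = pd.DataFrame([count_occ(t) for t in df['feelings']],
                         index=df.index)
result = df.join(counts_df)
print(result)
```

Because counts_df reuses df.index, the join keeps every original column and row order.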
You can build a reference Series to match each feeling to a list ID. Then explode + merge + pivot_table:
ref = pd.Series({e: 'occlist_%s' % (i+1) for i,l in enumerate([lst1, lst2, lst3]) for e in l}, name='cols')
## ref:
# happy occlist_1
# fantastic occlist_1
# mad occlist_2
# sad occlist_2
# neutral occlist_3
# Name: cols, dtype: object
df.merge((df.explode('feelings') # lists to single rows
# create a new column with list id
.merge(ref, left_on='feelings', right_index=True)
# reshape back to 1 row per original index
.pivot_table(index='index', columns='cols', values='feelings', aggfunc='count', fill_value=0)
),
left_on='index', right_index=True # merge with original df
)
NB. I'm assuming here that index is a column; if it is the index instead, you need to add a df.reset_index() step.
Output:
index feelings occlist_1 occlist_2 occlist_3
0 1 [happy, happy, sad] 2 1 0
1 2 [neutral, sad, mad] 0 2 1
2 3 [neutral, neutral, happy] 1 0 2
Input:
df = pd.DataFrame({'index': [1, 2, 3],
'feelings': [['happy', 'happy', 'sad'],
['neutral', 'sad', 'mad'],
['neutral', 'neutral', 'happy']
]})
lst1=['happy', 'fantastic']
lst2=['mad', 'sad']
lst3=['neutral']
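If index is the real index of the dataframe rather than a column, a variant along these lines may be simpler than adding a reset_index() step. This is a sketch, not the answer's own method: it replaces merge + pivot_table with map + groupby, and the axis name 'row' is a hypothetical label chosen only for the example.

```python
import pandas as pd

# Same data as above, but with 1..3 as the actual index
df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]},
                  index=[1, 2, 3])
lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']

# word -> list id, built the same way as the reference Series above
ref = pd.Series({e: 'occlist_%s' % (i + 1)
                 for i, l in enumerate([lst1, lst2, lst3]) for e in l},
                name='cols')

# One token per row (index repeated), then replace each token by its list id;
# tokens that are in no list become NaN and are dropped by groupby
labels = (df['feelings'].explode()
                        .map(ref)
                        .rename('cols')
                        .rename_axis('row'))   # 'row' is an illustrative name

counts = (labels.reset_index()
                .groupby(['row', 'cols']).size()
                .unstack(fill_value=0)
                # keep rows that matched no list at all, filled with zeros
                .reindex(df.index, fill_value=0))

result = df.join(counts)
print(result)
```

The final reindex matters when a row contains no word from any list: without it that row would get NaN counts after the join.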
You can also use:
my_lists = [lst1, lst2, lst3]
occ = pd.DataFrame.from_records(df['feelings'].apply(lambda x: [pd.Series(x).isin(l).sum() for l in my_lists]).values, columns=['occlst1', 'occlst2', 'occlst3'])
df_occ = df.join(occ)
Output:
feelings occlst1 occlst2 occlst3
0 [happy, happy, sad] 2 1 0
1 [neutral, sad, mad] 0 2 1
2 [neutral, neutral, happy] 1 0 2
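The same per-row membership count can also be written without building a temporary pd.Series for every row. The sketch below assumes the lists are kept in a dict keyed by the desired column name; named_lists and count_in are hypothetical helpers introduced here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
                                ['neutral', 'sad', 'mad'],
                                ['neutral', 'neutral', 'happy']]})

# Hypothetical mapping: new column name -> word list
named_lists = {'occlst1': ['happy', 'fantastic'],
               'occlst2': ['mad', 'sad'],
               'occlst3': ['neutral']}

def count_in(words):
    """Return a function counting how many tokens of a row are in `words`."""
    lookup = frozenset(words)          # O(1) membership tests
    return lambda tokens: sum(tok in lookup for tok in tokens)

# One new column per list; assign returns a new frame and leaves df untouched
df_occ = df.assign(**{name: df['feelings'].apply(count_in(words))
                      for name, words in named_lists.items()})
print(df_occ)
```

Keeping the column names next to their lists means a fourth list can be added without touching the counting code.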