How to get degree of overlap between two rows with words in Pandas
I have a dataframe that looks like this:
df = pd.DataFrame([{"id": 'A1', 'happy_words': 'a,b,d,e', 'sad_words':'aa,cc,mm,zz'},
                   {"id": 'A2', 'happy_words': 'f,g,d,e', 'sad_words':'aa,dd,mm,zz'},
                   {"id": 'B2', 'happy_words': 'a,d,m,e', 'sad_words':'tt,cc,uu,zz'}])
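I want to compute the degree of overlap in the words used between each pair (i, j). For example, A1 and A2 both chose the same 2 out of 4 words, "d,e", and that counts as overlap. The code for computing the degree of overlap between two response vectors is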
def get_percent_agree(i, j):
    # Share of positions at which the two response vectors agree
    return (list(i-j).count(0))/len(i)
How can I apply the code above to get the final dyad-level dataframe?
i  j  overlap_happy  overlap_sad
0  1  x%             x%
0  2  x%             x%
0  3  x%             x%
1  2  x%             x%
2  3  x%             x%
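One point worth making explicit: the function in the question counts zeros in i - j, which measures agreement position by position on numeric vectors, whereas the worked example ("d,e" shared by A1 and A2) describes overlap regardless of position. The two can give different numbers. A minimal sketch, using rows A1 and B2 and two hypothetical helpers (positional_agreement and set_overlap are illustrative names, not from the question):

```python
def positional_agreement(i, j):
    # Hypothetical pure-Python analogue of list(i - j).count(0) / len(i):
    # share of positions at which two equal-length sequences hold the same value
    return sum(x == y for x, y in zip(i, j)) / len(i)

def set_overlap(i, j):
    # Hypothetical helper: share of the distinct words in i that occur anywhere in j
    return len(set(i) & set(j)) / len(set(i))

a1 = 'a,b,d,e'.split(',')
b2 = 'a,d,m,e'.split(',')

print(positional_agreement(a1, b2))  # 0.5  (only 'a' and 'e' sit in the same position)
print(set_overlap(a1, b2))           # 0.75 ('a', 'd' and 'e' are shared)
```

For comma-separated word lists, the set-based reading is probably the one that matches the example in the question.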
Let's give it a try:
import re
from itertools import combinations
import numpy as np
import pandas as pd
df = pd.DataFrame([{"id": 'A1', 'happy_words': 'a,b,d,e', 'sad_words': 'aa,cc,mm,zz'},
                   {"id": 'A2', 'happy_words': 'f,g,d,e', 'sad_words': 'aa,dd,mm,zz'},
                   {"id": 'B2', 'happy_words': 'a,d,m,e', 'sad_words': 'tt,cc,uu,zz'}])
words_cols = list(filter(re.compile(r'.*_words$').search, df.columns))
df[words_cols] = df[words_cols].apply(lambda c: c.str.split(','))
# Get All Row Combinations
a, b = map(list, zip(*combinations(df.index, 2)))
# Merge Together
df = df.loc[a].reset_index().merge(
    df.loc[b].reset_index(),
    left_index=True,
    right_index=True,
).rename(columns={'index_x': 'i', 'index_y': 'j'})
def get_percent_agree(s):
    # Get Intersections
    happy_intersect = np.intersect1d(s['happy_words_x'], s['happy_words_y'])
    sad_intersect = np.intersect1d(s['sad_words_x'], s['sad_words_y'])
    # Calc and Format Percent
    return pd.Series([f'{len(happy_intersect) / len(s.happy_words_x):.2%}',
                      f'{len(sad_intersect) / len(s.sad_words_x):.2%}'],
                     index=['overlap_happy',
                            'overlap_sad'])
# Merge Back
df = df[['i', 'j']].merge(
    df.apply(get_percent_agree, axis=1),
    left_index=True,
    right_index=True
)
# For Display
print(df.to_string(index=False))
Output:
 i  j overlap_happy overlap_sad
 0  1        50.00%      75.00%
 0  2        75.00%      50.00%
 1  2        50.00%      25.00%
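The same table can also be built without the merge step by looping over row pairs directly. The following is only a sketch under stated assumptions: overlap and pairwise_overlap are hypothetical helper names, the input is a plain list of records rather than a DataFrame, and the overlaps are kept as floats instead of formatted percent strings.

```python
from itertools import combinations

rows = [
    {'id': 'A1', 'happy_words': 'a,b,d,e', 'sad_words': 'aa,cc,mm,zz'},
    {'id': 'A2', 'happy_words': 'f,g,d,e', 'sad_words': 'aa,dd,mm,zz'},
    {'id': 'B2', 'happy_words': 'a,d,m,e', 'sad_words': 'tt,cc,uu,zz'},
]

def overlap(a, b):
    # Hypothetical helper: fraction of the words in a that also appear in b
    words_a = a.split(',')
    return len(set(words_a) & set(b.split(','))) / len(words_a)

def pairwise_overlap(records, cols=('happy_words', 'sad_words')):
    # One output row per unordered pair of positions (i, j) with i < j
    out = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        row = {'i': i, 'j': j}
        for col in cols:
            # 'happy_words' -> 'overlap_happy', 'sad_words' -> 'overlap_sad'
            row['overlap_' + col.split('_')[0]] = overlap(r1[col], r2[col])
        out.append(row)
    return out

result = pairwise_overlap(rows)
for row in result:
    print(row)
```

With pandas, df.to_dict('records') should supply the input and pd.DataFrame(result) turns the rows back into a frame. Keeping the values numeric leaves percent formatting to the display step and makes later sorting or filtering simpler.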