How to compute the word overlap between pairs of rows in Pandas

Question

I have a dataframe like this:

df = pd.DataFrame([{"id": 'A1', 'happy_words': 'a,b,d,e', 'sad_words': 'aa,cc,mm,zz'},
                   {"id": 'A2', 'happy_words': 'f,g,d,e', 'sad_words': 'aa,dd,mm,zz'},
                   {"id": 'B2', 'happy_words': 'a,d,m,e', 'sad_words': 'tt,cc,uu,zz'}])

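I want to compute, for every pair (i, j), how much the words they used overlap. For example, A1 and A2 share 2 of their 4 happy words ("d" and "e"), and that counts as overlap. The code I use to compute the degree of overlap between two response vectors is: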

def get_percent_agree(i, j):
    return (list(i-j).count(0))/len(i)
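
As a quick illustration of what this function does, here is a minimal sketch that assumes `i` and `j` are equal-length NumPy arrays. The vectors `resp_a` and `resp_b` are hypothetical and not part of the data above. Note that the subtraction `i - j` is not defined for comma-separated strings such as 'a,b,d,e', so the function cannot be applied to the word columns unchanged.

```python
import numpy as np


def get_percent_agree(i, j):
    # Share of positions where the two vectors hold the same value.
    return list(i - j).count(0) / len(i)


# Hypothetical numeric response vectors (illustration only).
resp_a = np.array([1, 2, 3, 4])
resp_b = np.array([1, 0, 3, 0])

agree = get_percent_agree(resp_a, resp_b)
print(agree)  # 0.5 (positions 0 and 2 match)
```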

How can I apply the code above to get the final dyad-level dataframe?

i    j  overlap_happy   overlap_sad
0    1    x%             x%  
0    2    x%             x%
0    3    x%             x%  
1    2    x%             x%  
2    3    x%             x%

Answer 1

Let's try:

import re
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame([{"id": 'A1', 'happy_words': 'a,b,d,e', 'sad_words': 'aa,cc,mm,zz'},
                   {"id": 'A2', 'happy_words': 'f,g,d,e', 'sad_words': 'aa,dd,mm,zz'},
                   {"id": 'B2', 'happy_words': 'a,d,m,e', 'sad_words': 'tt,cc,uu,zz'}])

words_cols = list(filter(re.compile(r'.*_words$').search, df.columns))
df[words_cols] = df[words_cols].apply(lambda c: c.str.split(','))

# Get All Row Combinations
a, b = map(list, zip(*combinations(df.index, 2)))

# Merge Together
df = df.loc[a].reset_index().merge(
    df.loc[b].reset_index(),
    left_index=True,
    right_index=True,
).rename(columns={'index_x': 'i', 'index_y': 'j'})


def get_percent_agree(s):
    # Get Intersections
    happy_intersect = np.intersect1d(s['happy_words_x'], s['happy_words_y'])
    sad_intersect = np.intersect1d(s['sad_words_x'], s['sad_words_y'])
    # Calc and Format Percent
    return pd.Series([f'{len(happy_intersect) / len(s.happy_words_x):.2%}',
                      f'{len(sad_intersect) / len(s.sad_words_x):.2%}'],
                     index=['overlap_happy',
                            'overlap_sad'])


# Merge Back
df = df[['i', 'j']].merge(
    df.apply(get_percent_agree, axis=1),
    left_index=True,
    right_index=True
)

# For Display
print(df.to_string(index=False))

Output:

i  j overlap_happy overlap_sad
0  1        50.00%      75.00%
0  2        75.00%      50.00%
1  2        50.00%      25.00%
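
The same table can also be built without the self-merge. The following is only a sketch of an alternative, not the method used above: it turns each cell into a Python set and loops over the row pairs directly. The helper name `pairwise_overlap` and the `col_map` argument are made up for this example. It returns fractions rather than formatted strings, and it divides by the number of distinct words in row i, which matches the output above only when a row contains no repeated words.

```python
from itertools import combinations

import pandas as pd


def pairwise_overlap(frame, col_map, sep=","):
    """Hypothetical helper: overlap of word sets for every pair of rows.

    col_map maps a source column to the name of its output column.
    """
    # Split each cell once and store the words as sets.
    word_sets = {col: frame[col].str.split(sep).map(set) for col in col_map}
    rows = []
    for i, j in combinations(frame.index, 2):
        row = {"i": i, "j": j}
        for col, out_name in col_map.items():
            a, b = word_sets[col].loc[i], word_sets[col].loc[j]
            # Fraction of row i's distinct words that also appear in row j.
            row[out_name] = len(a & b) / len(a) if a else float("nan")
        rows.append(row)
    return pd.DataFrame(rows)


raw = pd.DataFrame([
    {"id": "A1", "happy_words": "a,b,d,e", "sad_words": "aa,cc,mm,zz"},
    {"id": "A2", "happy_words": "f,g,d,e", "sad_words": "aa,dd,mm,zz"},
    {"id": "B2", "happy_words": "a,d,m,e", "sad_words": "tt,cc,uu,zz"},
])

result = pairwise_overlap(
    raw, {"happy_words": "overlap_happy", "sad_words": "overlap_sad"}
)
print(result.to_string(index=False))
```

Keeping the values numeric makes later sorting or filtering easier; they can be formatted as percentages at display time, for example with `result["overlap_happy"].map("{:.2%}".format)`.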

Solution 1
1 accepted 2021-05-04 19:56:56
