How to get degree of overlap between two rows with words in Pandas

I have a dataframe like the one below:

import pandas as pd

df = pd.DataFrame([{"id": 'A1', 'happy_words': 'a,b,d,e', 'sad_words': 'aa,cc,mm,zz'},
                   {"id": 'A2', 'happy_words': 'f,g,d,e', 'sad_words': 'aa,dd,mm,zz'},
                   {"id": 'B2', 'happy_words': 'a,d,m,e', 'sad_words': 'tt,cc,uu,zz'}])

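I want to calculate the degree of overlap in the words used by each pair (i, j). For example, A1 and A2 both chose 2 of the same words out of 4 ("d,e"), which counts as overlap. The code to calculate the degree of overlap between two response vectors is: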

def get_percent_agree(i, j):
    # Fraction of positions at which the two response vectors hold the same value
    return (list(i-j).count(0))/len(i)
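
To make the behavior of this function concrete, here is a minimal sketch. It assumes the responses are encoded as equal-length NumPy arrays, which is not shown above; the vectors r1 and r2 are hypothetical and are not derived from the dataframe.

```python
import numpy as np


def get_percent_agree(i, j):
    # Fraction of positions at which the two response vectors hold the same value
    return list(i - j).count(0) / len(i)


# Hypothetical numeric response vectors (an assumption, not the question's data)
r1 = np.array([1, 0, 1, 1])
r2 = np.array([1, 1, 1, 0])

# The vectors agree at positions 0 and 2, so 2 of 4 positions match
print(get_percent_agree(r1, r2))  # 0.5
```

Note that this compares values position by position, so it would not apply directly to the comma-separated word strings in the dataframe, where the order of the words should not matter.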

How can I apply the code above to get a final dyad-level dataframe?

i    j  overlap_happy   overlap_sad
0    1    x%             x%  
0    2    x%             x%
0    3    x%             x%  
1    2    x%             x%  
2    3    x%             x%  

Let's try:

import re
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame([{"id": 'A1', 'happy_words': 'a,b,d,e', 'sad_words': 'aa,cc,mm,zz'},
                   {"id": 'A2', 'happy_words': 'f,g,d,e', 'sad_words': 'aa,dd,mm,zz'},
                   {"id": 'B2', 'happy_words': 'a,d,m,e', 'sad_words': 'tt,cc,uu,zz'}])

words_cols = list(filter(re.compile(r'.*_words$').search, df.columns))
df[words_cols] = df[words_cols].apply(lambda c: c.str.split(','))

# Get All Row Combinations
a, b = map(list, zip(*combinations(df.index, 2)))

# Merge Together
df = df.loc[a].reset_index().merge(
    df.loc[b].reset_index(),
    left_index=True,
    right_index=True,
).rename(columns={'index_x': 'i', 'index_y': 'j'})


def get_percent_agree(s):
    # Get Intersections
    happy_intersect = np.intersect1d(s['happy_words_x'], s['happy_words_y'])
    sad_intersect = np.intersect1d(s['sad_words_x'], s['sad_words_y'])
    # Calc and Format Percent
    return pd.Series([f'{len(happy_intersect) / len(s.happy_words_x):.2%}',
                      f'{len(sad_intersect) / len(s.sad_words_x):.2%}'],
                     index=['overlap_happy',
                            'overlap_sad'])


# Merge Back
df = df[['i', 'j']].merge(
    df.apply(get_percent_agree, axis=1),
    left_index=True,
    right_index=True
)

# For Display
print(df.to_string(index=False))

Output:

i  j overlap_happy overlap_sad
0  1        50.00%      75.00%
0  2        75.00%      50.00%
1  2        50.00%      25.00%
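
As a possible alternative, the same pairwise overlap can be sketched with plain Python sets instead of a merge. This is only an illustration under stated assumptions: the helper name overlap and the result name pairs are made up here, the overlap is kept as a float rather than a formatted string, and the denominator is the number of distinct words in row i (which matches the approach above only when a row contains no repeated words).

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame([{"id": 'A1', 'happy_words': 'a,b,d,e', 'sad_words': 'aa,cc,mm,zz'},
                   {"id": 'A2', 'happy_words': 'f,g,d,e', 'sad_words': 'aa,dd,mm,zz'},
                   {"id": 'B2', 'happy_words': 'a,d,m,e', 'sad_words': 'tt,cc,uu,zz'}])


def overlap(a, b):
    # Hypothetical helper: share of the distinct words in `a` that also appear in `b`
    left, right = set(a.split(',')), set(b.split(','))
    return len(left & right) / len(left)


# Build one record per unordered pair of row labels
rows = [
    {'i': i,
     'j': j,
     'overlap_happy': overlap(df.at[i, 'happy_words'], df.at[j, 'happy_words']),
     'overlap_sad': overlap(df.at[i, 'sad_words'], df.at[j, 'sad_words'])}
    for i, j in combinations(df.index, 2)
]

pairs = pd.DataFrame(rows)
print(pairs.to_string(index=False))
```

Keeping the values numeric makes it easier to sort or filter the pairs afterwards; they can still be formatted as percentages for display, for example with f'{value:.2%}'.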
