
Compare and rank rows in dataframe based on two columns?

I am trying to figure out how to compare and rank multiple rows in a pandas dataframe based on two conditions.

These are the conditions:

rule1 < rule2 

if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2) 

or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)

    
rule1 = rule2 

if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)

This is how my dataframe is set up:

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])

   
  (Index)
   Rules       Support     Confidence
(4444, 5555)   0.0048      0.873015
(8747, 1254)   0.00141     0.533333
(7414, 1214)   0.0085      0.593220
(5655, 6651)   0.00106     0.012060
(4454, 3321)   0.00106     0.012060
(4893, 4923)   0.00038     0.237699
(1271, 8330)   0.00179     0.453423
(9112, 4722)   0.00913     0.097672
(4511, 6722)   0.00221     0.116983
(1102, 5789)   0.00173     0.541221
(2340, 5720)   0.00098     0.743222
(9822, 5067)   0.00024     0.378219

This is what I want the dataframe to look like (I am not sure what the ranking would actually be; this is a hypothetical ranking):

   (Index)
    Rules      Support     Confidence    Rank
(7414, 1214)   0.0085      0.593220        1
(4444, 5555)   0.0048      0.873015        2
(5655, 6651)   0.00106     0.012060        3
(4454, 3321)   0.00106     0.012060        3
(8747, 1254)   0.00141     0.533333        4
(1271, 8330)   0.00179     0.453423        5
(1102, 5789)   0.00173     0.541221        6
(2340, 5720)   0.00098     0.743222        7
(9822, 5067)   0.00024     0.378219        8
(9112, 4722)   0.00913     0.097672        9
(4511, 6722)   0.00221     0.116983        10
(4893, 4923)   0.00038     0.237699        11

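I have some ideas about how to get this code working, but I am not sure how to compare every rule against every other rule. I want the best rules, according to the conditions, to float to the top. It is not a large dataframe (< 1000 rows), so I care about accuracy rather than speed.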

This is the code I have so far:

def rank_rules(confidence, support):

    # IF / ELSE goes here
   
    df['rank'] = some_var.rank(method='max')
  
    df.sort_values(by=['rank'], ascending=False)

    return df


df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)
 

Solution: suggested approach

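If I understand correctly, you are trying to create a ranking system based on multiple columns (support and confidence). You can treat these two as orthogonal axes (x and y) on a scatter plot. In the absence of further ordering logic, I will assume that Euclidean distance from the origin is what we can use here to order the rows and create a rank.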
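As a complement to the distance-based idea, the conditions stated in the question can also be applied directly. The following is a minimal sketch, not part of the original answer: it compares every rule with every other rule and assigns each rule the rank 1 + (number of rules that rank strictly above it). The helper name `dominates` and the "1 + count" convention are assumptions made for illustration. The question's conditions cannot order two rules when one has higher support and the other higher confidence, so this is only one possible way to turn them into a single rank.

```python
# (support, confidence) per rule, copied from the question's data.
rules = {
    (4444, 5555): (0.0048, 0.873015),
    (8747, 1254): (0.00141, 0.533333),
    (7414, 1214): (0.0085, 0.593220),
    (5655, 6651): (0.00106, 0.012060),
    (4454, 3321): (0.00106, 0.012060),
    (4893, 4923): (0.00038, 0.237699),
    (1271, 8330): (0.00179, 0.453423),
    (9112, 4722): (0.00913, 0.097672),
    (4511, 6722): (0.00221, 0.116983),
    (1102, 5789): (0.00173, 0.541221),
    (2340, 5720): (0.00098, 0.743222),
    (9822, 5067): (0.00024, 0.378219),
}


def dominates(a, b):
    """Return True if b < a under the question's conditions (hypothetical helper)."""
    (s_a, c_a), (s_b, c_b) = a, b
    return (s_b <= s_a and c_b < c_a) or (s_b < s_a and c_b <= c_a)


# Assumed convention: rank = 1 + number of rules that rank strictly above this one.
ranks = {
    name: 1 + sum(dominates(other, values) for other in rules.values())
    for name, values in rules.items()
}

# Print the rules from best to worst rank.
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(name, rank)
```

With this convention, equal rules such as (5655, 6651) and (4454, 3321) share a rank, and several mutually incomparable rules can share rank 1.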

Processing the data

Here I show that MinMaxScaler may be an option (in addition to, optionally, using z-scores).
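A quick check of why scaling matters, written as a minimal pure-Python sketch on the question's data. The helper `min_max` is a hypothetical stand-in for what MinMaxScaler does with its default (0, 1) range. Support values here are roughly 100 times smaller than confidence values, so the unscaled distance is driven almost entirely by confidence.

```python
import math

support = [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038,
           0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024]
confidence = [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699,
              0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Euclidean distance from the origin, before and after scaling each column.
raw = [math.hypot(s, c) for s, c in zip(support, confidence)]
scaled = [math.hypot(s, c)
          for s, c in zip(min_max(support), min_max(confidence))]

# Position of the top-ranked rule under each distance.
best_raw = max(range(len(raw)), key=raw.__getitem__)
best_scaled = max(range(len(scaled)), key=scaled.__getitem__)
print(best_raw, best_scaled)
```

On this data the raw distance puts (4444, 5555) first, whereas the scaled distance puts (7414, 1214) first. Min-max scaling the z-scores gives the same values as min-max scaling the raw columns, because min-max scaling is unaffected by shifting a column or multiplying it by a positive constant.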

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

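# IPython/Jupyter magics: these two lines only work inside a notebook
%matplotlib inline 
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina' 
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-white'
plt.style.use('seaborn-white')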

df = df.reset_index(drop=False).rename(columns={'index': 'rules'})
df['distance'] = (df.support**2 + df.confidence**2)**0.5
df['zsupport'] = (df.support - df.support.mean())/df.support.std()
df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5

round_strategy = {
    'support': 5,
    'confidence': 6,
    'distance': 5,
}

scaler = MinMaxScaler()
df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]), 
                   columns=['scaled_support', 'scaled_confidence'])
df = pd.concat([df, df2], ignore_index=False, axis=1)
df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5
df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
df['Rank'] = df.index

decimals = dict()
for col in df.columns:
    for key, value in round_strategy.items():
        if key in col:
            decimals.update({col: value})
df = df.round(decimals=decimals)

sizes = (df.shape[0] - df.Rank)/df.shape[0]
colors = round(255*sizes).astype(int)
df
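The code above takes Rank from the sorted row position, so it starts at 0 and rows with identical support and confidence receive different ranks. If tied rows should share a rank, as in the question's desired output, one option is pandas' `Series.rank` with `method='dense'`. The following is a minimal sketch under that assumption; the scaling is done inline with plain pandas arithmetic rather than MinMaxScaler.

```python
import pandas as pd

df = pd.DataFrame({
    'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651),
              (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722),
              (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
    'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038,
                0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
    'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699,
                   0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219],
})

# Min-max scale both columns to [0, 1] so neither dominates the distance.
cols = ['support', 'confidence']
scaled = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
df['scaled_distance'] = (scaled['support']**2 + scaled['confidence']**2)**0.5

# Dense ranking: the largest distance gets rank 1, ties share a rank,
# and no rank numbers are skipped after a tie.
df['Rank'] = df['scaled_distance'].rank(method='dense', ascending=False).astype(int)
df = df.sort_values('Rank')
print(df)
```

Here the two identical rules, (5655, 6651) and (4454, 3321), end up with the same rank. Other tie-handling choices such as `method='min'` or `method='max'` are available if skipped rank numbers are preferred.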


Plot

import plotly.express as px

# df is the dataframe built in the code above (scaled columns plus Rank)
fig = px.scatter(df, x="scaled_support", y="scaled_confidence", text="Rank", 
                  log_x=False, size_max=20, 
                  color="Rank", 
                  size=(np.arange(df.index.size) + 4)[::-1], 
                  hover_data=df.columns)
fig.update_traces(textposition='top center')
fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
fig.show()


Dummy data

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])
