Compare and rank rows in dataframe based on two columns?
I am trying to figure out how to compare and rank multiple rows in a pandas dataframe based on two conditions.
These are the conditions:
rule1 < rule2
    if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2)
    or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)
rule1 = rule2
    if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)
This is how my dataframe is set up:
import pandas as pd
data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}
df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])
(Index)
Rules Support Confidence
(4444, 5555) 0.0048 0.873015
(8747, 1254) 0.00141 0.533333
(7414, 1214) 0.0085 0.593220
(5655, 6651) 0.00106 0.012060
(4454, 3321) 0.00106 0.012060
(4893, 4923) 0.00038 0.237699
(1271, 8330) 0.00179 0.453423
(9112, 4722) 0.00913 0.097672
(4511, 6722) 0.00221 0.116983
(1102, 5789) 0.00173 0.541221
(2340, 5720) 0.00098 0.743222
(9822, 5067) 0.00024 0.378219
This is what I would like the dataframe to look like (I am not sure what the ranking would actually be... this is a hypothetical ranking):
(Index)
Rules Support Confidence Rank
(7414, 1214) 0.0085 0.593220 1
(4444, 5555) 0.0048 0.873015 2
(5655, 6651) 0.00106 0.012060 3
(4454, 3321) 0.00106 0.012060 3
(8747, 1254) 0.00141 0.533333 4
(1271, 8330) 0.00179 0.453423 5
(1102, 5789) 0.00173 0.541221 6
(2340, 5720) 0.00098 0.743222 7
(9822, 5067) 0.00024 0.378219 8
(9112, 4722) 0.00913 0.097672 9
(4511, 6722) 0.00221 0.116983 10
(4893, 4923) 0.00038 0.237699 11
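I have some ideas on how to get this code working, but I am not sure how to compare every rule against every other rule. I want the best rules, according to the conditions, to float to the top. It is not a large dataframe (< 1000 rows), so I am not concerned about speed, only accuracy.
This is the code I have so far: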
def rank_rules(confidence, support):
    # IF / ELSE goes here
    df['rank'] = some_var.rank(method='max')
    df.sort_values(by=['rank'], ascending=False)
    return df

df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)
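If I understand correctly, you are trying to create a ranking system based on multiple columns (support, confidence). You can think of the two as orthogonal axes (x, y) on a scatter plot. In the absence of any further ordering logic, I will assume that the Euclidean distance from the origin is something we can use here to sort the rows and create a rank.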
Here I show that using MinMaxScaler could be an option (in addition to optionally using a z-score).
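To see why some scaling is worth considering: support values here are on the order of 1e-3 while confidence goes up to about 0.87, so an unscaled distance is decided almost entirely by confidence. The following is a minimal sketch on three rows taken from the question's data, using plain pandas instead of MinMaxScaler; the helper name `min_max` is made up for illustration.

```python
import pandas as pd

# Three rows from the question's data: (4444, 5555), (7414, 1214), (9112, 4722)
sample = pd.DataFrame({
    'support': [0.0048, 0.0085, 0.00913],
    'confidence': [0.873015, 0.593220, 0.097672],
})

def min_max(col):
    # Hypothetical helper: rescale a column linearly to the range [0, 1]
    return (col - col.min()) / (col.max() - col.min())

# Distance from the origin on the raw columns
raw = (sample['support']**2 + sample['confidence']**2) ** 0.5

# Distance from the origin after each column is rescaled to [0, 1]
scaled = (min_max(sample['support'])**2 + min_max(sample['confidence'])**2) ** 0.5

print(raw.sort_values(ascending=False))
print(scaled.sort_values(ascending=False))
```

On the raw columns the ordering simply follows confidence. After rescaling, the row that is reasonably strong on both columns comes out on top.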
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Jupyter-only magics; remove these two lines when running as a plain script
%matplotlib inline
%config InlineBackend.figure_format = 'svg'  # 'svg', 'retina'
plt.style.use('seaborn-white')  # renamed 'seaborn-v0_8-white' in matplotlib 3.6+

# Move the rule tuples out of the index into a regular 'rules' column
df = df.reset_index(drop=False).rename(columns={'index': 'rules'})

# Euclidean distance from the origin, on the raw and on the z-scored columns
df['distance'] = (df.support**2 + df.confidence**2)**0.5
df['zsupport'] = (df.support - df.support.mean())/df.support.std()
df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5

# Decimal places to round to, matched by substring of the column name
round_strategy = {
    'support': 5,
    'confidence': 6,
    'distance': 5,
}

# Rescale the z-scored columns to [0, 1]
scaler = MinMaxScaler()
df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]),
                   columns=['scaled_support', 'scaled_confidence'])
df = pd.concat([df, df2], ignore_index=False, axis=1)
df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5

# Largest scaled distance first; the fresh 0-based index becomes the rank
df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
df['Rank'] = df.index

# Build a per-column rounding map from round_strategy
decimals = dict()
for col in df.columns:
    for key, value in round_strategy.items():
        if key in col:
            decimals.update({col: value})
df = df.round(decimals=decimals)

# Values derived from rank (the top-ranked row gets the largest value)
sizes = (df.shape[0] - df.Rank)/df.shape[0]
colors = round(255*sizes).astype(int)
df
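Note that `df['Rank'] = df.index` numbers the rows from 0 and gives different ranks to rows with identical scores, such as (5655, 6651) and (4454, 3321). If tied rows should share a rank, as in the question's expected output, `Series.rank` with `method='dense'` is one option. A minimal sketch on made-up scores (the values below are illustrative, not computed from the question's data):

```python
import pandas as pd

# Made-up scores with one tie, standing in for the scaled_distance column
scores = pd.DataFrame({'scaled_distance': [1.07, 0.25, 1.00, 0.25]})

# Highest score gets rank 1; tied scores share a rank and no rank is skipped
scores['Rank'] = (scores['scaled_distance']
                  .rank(method='dense', ascending=False)
                  .astype(int))

# A stable sort keeps tied rows in their original relative order
scores = scores.sort_values('Rank', kind='stable').reset_index(drop=True)
print(scores)
```

With `method='min'` instead, the rank after a tie would skip ahead (1, 2, 3, 3, 5, ...), which may or may not be what you want.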
import plotly.express as px

# The snippet as posted plotted `df4`, which is never defined above;
# this assumes it was meant to be the ranked `df`.
fig = px.scatter(df, x="scaled_support", y="scaled_confidence", text="Rank",
                 log_x=False, size_max=20,
                 color="Rank",
                 size=(np.arange(df.index.size) + 4)[::-1],
                 hover_data=df.columns)
fig.update_traces(textposition='top center')
fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
fig.show()
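The distance score is a stand-in for the question's pairwise conditions, and the two can disagree: under those conditions a rule with high support and low confidence neither beats nor loses to a rule with the opposite profile. One way to apply the conditions literally is to count, for each rule, how many other rules beat it (rule1 < rule2 in the question's notation) and rank on that count. This is only a sketch; turning the partial order into a rank by counting is my own assumption, not something stated in the question, and the names `rank_by_dominance` and `dominated_by` are hypothetical.

```python
import numpy as np
import pandas as pd

data = {
    'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651),
              (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722),
              (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
    'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038,
                0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
    'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699,
                   0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219],
}
rules_df = pd.DataFrame(data)

def rank_by_dominance(frame):
    s = frame['support'].to_numpy()
    c = frame['confidence'].to_numpy()
    # Entry [i, j] compares rule i (rows) with rule j (columns).
    # at_least: rule j is at least as good as rule i on both columns.
    at_least = (s[None, :] >= s[:, None]) & (c[None, :] >= c[:, None])
    # strictly: rule j is strictly better than rule i on at least one column.
    strictly = (s[None, :] > s[:, None]) | (c[None, :] > c[:, None])
    out = frame.copy()
    # Number of rules j with rule i < rule j under the question's conditions
    out['dominated_by'] = (at_least & strictly).sum(axis=1)
    # Fewer dominating rules means a better rank; equal counts share a rank
    out['Rank'] = out['dominated_by'].rank(method='dense').astype(int)
    return out.sort_values('Rank', kind='stable')

ranked = rank_by_dominance(rules_df)
print(ranked)
```

On the question's data this puts (4444, 5555), (7414, 1214) and (9112, 4722) at rank 1, since no other rule beats any of them, and gives the two identical rules (5655, 6651) and (4454, 3321) the same rank. The comparison is all-pairs, which is fine for fewer than 1000 rows.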