Plot DataFrame boxplots in chunks of values

Question

I have a DataFrame with a single value column, as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 1),
                  columns=['Time'])
df['EDGE'] = pd.Series(['A', 'A', 'A','B', 'B', 'A', 'B','C','C', 'B','D','A','E','F','F','A','G','H','H','A'])
df

The real dataframe has several hundred thousand rows, and there are about 200 unique "EDGE" values.

I want to plot it as a boxplot, which I can get like this:

boxplot = df.boxplot(by='EDGE')

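With that many values I have to plot a few at a time, say only the first 10 in one plot. I would also like the values with the higher mean Time to be plotted first.

Expected result: a series of boxplots, each containing 10 edges, with the boxes shown in descending order of mean "Time".

How to proceed?

What have I tried?

I tried using loc on a sub_df to create one box per value, but then I get only one box per boxplot.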

NOTE: I intend to use as few libraries as possible, i.e. doing it with pandas alone is better than using matplotlib directly, and matplotlib is better than yet another library on top of matplotlib.

Answer 1

IIUC, you can do this by reshaping the dataframe:

# define the number of edges per plot
nb_edges_per_plot = 4 #to change to your needs

# group by edge
gr = df.groupby('EDGE')['Time']
# get the mean per group and sort them 
order_ = gr.mean().sort_values(ascending=False).index
print (order_) #order depends on the random value so probably not same for you
#Index(['D', 'H', 'C', 'B', 'A', 'E', 'G', 'F'], dtype='object', name='EDGE')

# reshape your dataframe to make each EDGE a column and order the columns
df_ = df.set_index(['EDGE', gr.cumcount()])['Time'].unstack(0)[order_]
print (df_.iloc[:5, :5])
# EDGE         D         H         C         B         A
# 0     1.729417  0.270593 -0.140786 -0.540270  0.862832
# 1          NaN  0.647830  1.038952 -0.129361 -0.648432
# 2          NaN       NaN       NaN -1.235637 -0.430890
# 3          NaN       NaN       NaN  0.631744 -1.622461
# 4          NaN       NaN       NaN       NaN  0.694052
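The reshape step is the least obvious part, so here is a minimal sketch of it on a toy frame. The names `toy`, `within`, `wide` and `order` and the values are hypothetical, chosen only for illustration. The point is that `cumcount` numbers the rows inside each EDGE group, which makes the pair (EDGE, counter) a unique index that `unstack` can pivot.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'EDGE': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Time': [1.0, 5.0, 2.0, 9.0, 7.0, 3.0],
})

# cumcount numbers the rows within each EDGE group: 0, 1, 2, ...
within = toy.groupby('EDGE').cumcount()

# (EDGE, within) is unique, so unstack can turn each EDGE into a column;
# groups with fewer rows are padded with NaN
wide = toy.set_index(['EDGE', within])['Time'].unstack(0)

# reorder the columns by descending group mean
order = toy.groupby('EDGE')['Time'].mean().sort_values(ascending=False).index
wide = wide[order]
print(wide)
```

Box plots ignore the NaN padding, so the unequal group sizes should not be a problem.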

Now you can use groupby with boxplot. To plot each group of edges on its own subplot, do:

df_.groupby(np.arange(len(order_))//nb_edges_per_plot, axis=1).boxplot()

Or, if you want separate figures, you can do:

for _, dfg_ in df_.groupby(np.arange(len(order_))//nb_edges_per_plot, axis=1):
    dfg_.plot(kind='box')

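You can even get separate figures in a single line; the only difference is that it uses plot.box() instead of boxplot(). Note that the loop version is more flexible if you want to change parameters in each plot.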

df_.groupby(np.arange(len(order_))//nb_edges_per_plot, axis=1).plot.box()
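As far as I know, recent pandas releases (2.1 and later) deprecate the `axis=1` argument of `groupby`, so the calls above may warn or fail depending on your version. A minimal sketch that avoids it is to slice the columns positionally. The helpers `column_chunks` and `plot_chunks` and the frame `demo` are hypothetical names, not part of pandas.

```python
import numpy as np
import pandas as pd


def column_chunks(wide, size):
    """Yield consecutive blocks of `size` columns from `wide`, left to right."""
    for start in range(0, wide.shape[1], size):
        yield wide.iloc[:, start:start + size]


def plot_chunks(wide, size):
    """Draw one box-plot figure per block of columns and return the Axes."""
    return [chunk.plot(kind='box') for chunk in column_chunks(wide, size)]


# Hypothetical wide frame standing in for df_ (columns already sorted by mean)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(5, 7)), columns=list('DHCBAEG'))
chunks = list(column_chunks(demo, 3))
print([list(c.columns) for c in chunks])
```

With the frame built earlier, this would be called as `plot_chunks(df_, nb_edges_per_plot)`; the last figure simply holds whatever columns are left over.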

Answer 2

You can create an intermediate frame, groups, that assigns each EDGE to a plot number (the Order column) and to a position within that plot (the Pos column).

chunk_size = 3

groups = df.groupby('EDGE')
groups = (groups.ngroups - groups.Time.mean().rank(method='first').astype(int)).to_frame()
groups['Order'] = groups.Time // chunk_size
groups['Pos'] = groups.Time % chunk_size

for i in range(groups.Order.max() + 1):
    group = groups[groups.Order==i]
    df[df.EDGE.isin(group.index)].boxplot(by='EDGE', positions=group.Pos)
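The rank arithmetic above can be checked on its own. The following is a minimal sketch with hypothetical group means (`means`, `desc_rank` and `layout` are illustrative names): the ascending rank is flipped into a zero-based descending rank, then split into a plot number and a slot by integer division and remainder.

```python
import pandas as pd

chunk_size = 3  # number of boxes per plot, same role as above

# Hypothetical group means, one per EDGE
means = pd.Series({'A': 0.4, 'B': 1.3, 'C': -0.1, 'D': 0.1, 'E': 0.7})

# rank() is ascending (1 = smallest mean); subtracting it from the number
# of groups flips it so that 0 = largest mean
desc_rank = len(means) - means.rank(method='first').astype(int)

layout = pd.DataFrame({
    'Order': desc_rank // chunk_size,  # which plot the EDGE goes to
    'Pos': desc_rank % chunk_size,     # slot inside that plot
})
print(layout.sort_values(['Order', 'Pos']))
```

Here B, E and A land in plot 0 and D and C in plot 1, each in descending order of mean.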

Example:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 1), columns=['Time'])
df['EDGE'] = pd.Series(['A', 'A', 'A','B', 'B', 'A', 'B','C','C', 'B','D','A','E','F','F','A','G','H','H','A'])

# code from above ...

#verification:
print(df.groupby('EDGE').Time.mean().sort_values(ascending=False))
#EDGE
#G    1.494079
#B    1.367285
#E    0.761038
#A    0.442789
#F    0.282769
#D    0.144044
#H    0.053955
#C   -0.127288

Answer 1
1 2020-07-07 14:38:35

Answer 2
1 accepted 2020-07-07 16:11:31
