简体   繁体   中英

Pandas groupby sort within groups retaining multiple aggregates and visualize it with facet

I have this example dataset

products = ["A", "B", "C", "D"]
stores = ["store1", "store2", "store3"]
n = 30

product_list = [products[i] for i in np.random.randint(0, len(products), n)]
store_list = [stores[i] for i in np.random.randint(0, len(stores), n)]
rating_list = np.random.random(n) * 5
sales_list = np.random.random(n) * 10000

df = pd.DataFrame(
    {'store': store_list, 
     'product': product_list, 
     'sales': sales_list, 
     'rating': rating_list})

and then sum the sales

df_1=df.groupby(['store','product']).agg({'sales':['sum']})
df_1

在此处输入图像描述

and ordered it by highest sales while maintain the store

df_2 = df_1.groupby(level=0, group_keys=False).apply(
                   lambda x: x.sort_values(('sales', 'sum'), ascending=False))
df_2

在此处输入图像描述

How can I facet by the store, so the resulting visualization is like the following?

在此处输入图像描述

You should reset the index in the last passage:

df_2 = df_1.groupby(level=0, group_keys=False).apply(
                   lambda x: x.sort_values(('sales', 'sum'), ascending=False)).reset_index()

Then you can plot with seaborn.FacetGrid :

g = sns.FacetGrid(df_2, col = 'store')
g.map(sns.barplot, 'product', 'sales')

plt.show()

在此处输入图像描述

  • Visualizations are about telling a story, and presenting data in a clear and concise manner to convey that story. As such, it is more clear to have the same xaxis order for each plot.
    • Someone looking at your visualization should be able to quickly discern which product from which store has the highest total sales, but that is not easy if the product categories for every axis are not in the same order
  • This can be done with pandas.DataFrame.plot by shaping the data with pandas.DataFrame.pivot_table .
  • Tested with python 3.8.11 , matplotlib 3.4.2 , seaborn 0.11.2 , and pandas 1.3.1 .
import pandas as pd
import matplotlib.pyplot as plt

# using the sample data; reshape df
dfp = df.pivot_table(index='product', columns='store', values='sales', aggfunc='sum')

# display(dfp)
store          store1        store2        store3
product                                          
A         9303.543781  15323.422183  20738.561588
B                 NaN   7549.028221           NaN
C        13976.321362  22350.050356   9865.392344
D         6905.455849   3183.767513   6010.941242

# plot
dfp.plot(kind='bar', subplots=True, layout=(1, 3), figsize=(8, 4), legend=False, rot=0,
         sharey=True, title='Store Sales by Product', ylabel='Total Sales')
plt.show()

在此处输入图像描述

  • This presentation is more clear without subplots (remove subplots=True )
    • It's more clear that product B only had sales at store 2
dfp.plot(kind='bar', rot=0, figsize=(5, 3), title='Store Sales by Product', ylabel='Total Sales')
plt.show()

在此处输入图像描述

  • Switching the categories for index and columns tells a different story
dfp = df.pivot_table(index='store', columns='product', values='sales', aggfunc='sum')

dfp.plot(kind='bar', rot=0, figsize=(5, 3), title='Product Sales by Store', ylabel='Total Sales')
plt.show()

在此处输入图像描述

seaborn.catplot

  • With .catplot this can be done without .groupby or .pivot_table because kind='bar' has an estimator parameter.
  • Using col=
import seaborn as sns

sns.catplot(kind='bar', data=df, col='store', x='product', y='sales',
            order=sorted(products), col_order=sorted(stores), estimator=sum, ci=False, height=3)
plt.show()

在此处输入图像描述

  • Using hue=
  • FYI, the random data ( df ) for this plot is different than the other plots.
sns.catplot(kind='bar', data=df, hue='store', x='product', y='sales', height=3,
            col_order=sorted(stores), estimator=sum, ci=False, order=sorted(products))
plt.show()

在此处输入图像描述

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM