
Is there any nicer way to aggregate multiple columns on the same grouped pandas dataframe?

I am trying to figure out how I should manipulate my data so that I can aggregate multiple columns for the same grouped pandas data. The reason I am doing this is that I need a stacked line chart that takes data from different aggregations of the same grouped data. How can I do this in a compact way? Can anyone suggest a possible way of doing this in pandas? Any ideas?

My current attempt:

import pandas as pd
import matplotlib.pyplot as plt

url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])

df_re = df[df['retail_item'].str.contains("GROUND BEEF")].copy()
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_rei = df_rei.reset_index(level=[0,1])
df_rei['year'] = df_rei['date'].dt.year
df_rei['week'] = df_rei['date'].dt.strftime('%W').astype('uint8')

df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()

Similarly, I also need to do data aggregation like this:

df_re['price_gap'] = df_re['high_price'] - df_re['low_price']
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_rei1 = dff_rei1.reset_index(level=[0,1])
dff_rei1['year'] = dff_rei1['date'].dt.year
dff_rei1['week'] = dff_rei1['date'].dt.strftime('%W').astype('uint8')

dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()

Problem

When I do the data aggregation, these lines are very similar:

df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()

and

dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()

I think a better way might be to write a custom function with *args, **kwargs that handles aggregating the columns, but then how should I show a stacked line chart where the y axis shows the different quantities? Is that doable in pandas?
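
For reference, here is a minimal sketch of the kind of helper I have in mind; the function name weekly_agg and its signature are just placeholders:

def weekly_agg(frame, value_col, first_agg):
    # First aggregate per (date, retail_item), then compute weekly min/max/mean
    out = (frame.groupby(['date', 'retail_item'])
                .agg({value_col: first_agg})
                .reset_index())
    out['week'] = out['date'].dt.strftime('%W').astype('uint8')
    return (out.groupby(['retail_item', 'week'])[value_col]
               .agg(['max', 'min', 'mean'])
               .stack()
               .rename_axis(['retail_item', 'week', 'mm'])
               .reset_index(name='vals'))

# usage (price_gap is assumed to be computed already, as above)
df_ret_df1 = weekly_agg(df_re, 'number_of_ads', 'sum')
dff_ret_df2 = weekly_agg(df_re, 'price_gap', 'mean')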

Line plot

This is what I did to get the line chart:

import seaborn as sns

for g, d in df_ret_df1.groupby('retail_item'):
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    sns.lineplot(x='week', y='vals', hue='mm', data=d, alpha=.8)
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.week, y1=y1.vals, y2=y2.vals)

    for year in df['year'].unique():
        data = dff_rei1[(dff_rei1.date.dt.year == year) & (dff_rei1.retail_item == g)]
        sns.lineplot(x='week', y='price_gap', ci=None, data=data, label=year, alpha=.8)

I want to minimize this duplication so that I can aggregate on different columns and build a stacked line chart, where the subplots share the week as the x axis and the y axes show the number of ads and the price_gap respectively. I don't know whether there is a better way of doing this. I am doing this because I want a stacked line chart (two vertical subplots): one showing the number of ads on the y axis and the other showing the price ranges for the same items across 52 weeks. Can anyone suggest a possible way of doing this? Any ideas?
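
Roughly the layout I am after, as a sketch only, assuming the df_ret_df1 and dff_ret_df2 frames built above and looking at a single product:

item = df_ret_df1['retail_item'].iloc[0]  # pick one product for illustration
fig, (ax_ads, ax_gap) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
for ax, frame, label in [(ax_ads, df_ret_df1, 'number_of_ads'),
                         (ax_gap, dff_ret_df2, 'price_gap')]:
    d = frame[frame['retail_item'] == item]
    mean, lo, hi = (d[d.mm == m] for m in ('mean', 'min', 'max'))
    ax.plot(mean['week'], mean['vals'], label='mean')
    ax.fill_between(lo['week'], lo['vals'], hi['vals'], alpha=0.3, label='min/max')
    ax.set_ylabel(label)
    ax.legend()
ax_gap.set_xlabel('week')
plt.show()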

This answer builds on the one by Andreas, who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to apply that solution to your specific case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:

  • The dates in the original dataset are already on a weekly frequency, so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
  • This example uses pandas and matplotlib, so the variables do not need to be stacked as when using seaborn.
  • The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each top-level variable by using df.xs (a short example follows the first code block below).
  • The date is set as the index of the aggregated dataframe and used as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
  • It is not clear from the question how the data for separate years should be displayed (in separate figures?), so here the entire time series is shown in a single figure.

Import the dataset and aggregate it as needed

import pandas as pd              # v 1.2.3
import matplotlib.pyplot as plt  # v 3.3.4

# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])

# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': [min, max, 'mean'],
            'price_gap': [min, max, 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
                .reset_index('retail_item'))
df_gbeef_agg
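
As a quick illustration of the df.xs access mentioned above, this is a sketch of how to pull out the aggregated columns of one top-level variable (the printed values depend on the dataset):

# Select the min/max/mean columns that belong to one top-level variable
ads = df_gbeef_agg.xs('number_of_ads', axis=1)
print(ads.columns.tolist())  # ['min', 'max', 'mean']
print(ads.head())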



Plot the aggregated variables in a single figure containing small multiples

variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()

fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W') # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        
        # Format the subplot according to its position within the figure
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)

fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);

[figure: small multiples]

I am not sure whether this answers your question fully, but based on your headline I guess it all boils down to:

import pandas as pd

url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])

# define which columns to group and in which way
dct = {'low_price': [max, min],
       'high_price': min,
       'year': 'mean'}

# actually group the columns
df.groupby(['region']).agg(dct)

Output:

              low_price       high_price         year
                    max   min        min         mean
region
ALASKA            16.99  1.33       1.33  2020.792123
HAWAII            12.99  1.33       1.33  2020.738318
MIDWEST           28.73  0.99       0.99  2020.690159
NORTHEAST         19.99  1.20       1.99  2020.709916
NORTHWEST         16.99  1.33       1.33  2020.736397
SOUTH CENTRAL     28.76  1.20       1.49  2020.700980
SOUTHEAST         21.99  1.33       1.48  2020.699655
SOUTHWEST         16.99  1.29       1.29  2020.704341
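
If you prefer flat column names instead of the MultiIndex columns that the dict-based agg produces, named aggregation (available since pandas 0.25) is one alternative; the output column names below are just examples:

df.groupby('region').agg(
    low_price_max=('low_price', 'max'),
    low_price_min=('low_price', 'min'),
    high_price_min=('high_price', 'min'),
    year_mean=('year', 'mean'),
)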
