Is there a nicer way to aggregate multiple columns on the same grouped pandas DataFrame?
I am trying to figure out how to manipulate my data so that I can aggregate multiple columns of the same grouped pandas data. I need this to draw a stacked line chart that takes its data from different aggregations of the same grouped data. Is there a compact way to do this in pandas?
My current attempt:
import pandas as pd
import matplotlib.pyplot as plt
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
df_re = df[df['retail_item'].str.contains("GROUND BEEF")]
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_rei = df_rei.reset_index(level=[0,1])
df_rei['year'] = df_rei['date'].dt.year
# Week number with Monday as the first day of the week (not the ISO week)
df_rei['week'] = df_rei['date'].dt.strftime('%W').astype('uint8')
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
Similarly, I also need to aggregate the data like this:
df_re['price_gap'] = df_re['high_price'] - df_re['low_price']
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_rei1 = dff_rei1.reset_index(level=[0,1])
dff_rei1['year'] = dff_rei1['date'].dt.year
# Week number with Monday as the first day of the week (not the ISO week)
dff_rei1['week'] = dff_rei1['date'].dt.strftime('%W').astype('uint8')
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
Problem
When I aggregate the data, these lines are nearly identical:
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
and
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
I think a better way might be a custom function that takes *args and **kwargs to switch between the columns being aggregated, but then how should I draw a stacked line chart whose y axes show different quantities? Is that doable in pandas?
Line plot
I drew the line chart as follows:
import seaborn as sns

# Note: cmap is not defined anywhere in this snippet
for g, d in df_ret_df1.groupby('retail_item'):
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    sns.lineplot(x='week', y='vals', hue='mm', data=d, alpha=.8)
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.week, y1=y1.vals, y2=y2.vals)
    for year in df['year'].unique():
        data = df_rei[(df_rei.date.dt.year == year) & (df_rei.retail_item == g)]
        sns.lineplot(x='week', y='price_gap', ci=None, data=data, palette=cmap, label=year, alpha=.8)
I want to reduce this duplication so that I can aggregate different columns and draw a stacked line chart (two vertical subplots) sharing week on the x axis: one subplot shows the number of ads on its y axis and the other shows the price range of the same items over 52 weeks. Is there a better way to do this?
This answer builds on the one by Andreas, who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to apply that solution to your specific case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:
groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
The aggregated variables (min, max, mean) of each top-level variable can be selected with df.xs.
Import dataset and aggregate it as needed
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])
# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
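# String names avoid the FutureWarning that recent pandas versions emit for
# the min/max builtins; the resulting column labels are still 'min' and 'max'
agg_dict = {'number_of_ads': ['min', 'max', 'mean'],
            'price_gap': ['min', 'max', 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
                        .reset_index('retail_item'))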
df_gbeef_agg
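To make the shape of the aggregated frame concrete, here is a minimal sketch on a small synthetic frame. The values and the item name are made up for illustration and are not taken from the dataset. It shows that the dictionary aggregation produces two-level columns and that xs pulls out one top-level variable at a time.

```python
import pandas as pd

# Synthetic stand-in for the ground beef data (all values are invented)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-04', '2021-01-04',
                            '2021-01-11', '2021-01-11']),
    'retail_item': ['GROUND BEEF 80%'] * 4,
    'number_of_ads': [10, 30, 20, 40],
    'price_gap': [1.0, 3.0, 2.0, 2.0],
})

agg_dict = {'number_of_ads': ['min', 'max', 'mean'],
            'price_gap': ['min', 'max', 'mean']}
toy_agg = (toy.groupby(['date', 'retail_item']).agg(agg_dict)
              .reset_index('retail_item'))

# The columns form a two-level MultiIndex: (variable, statistic)
print(toy_agg.columns.tolist())

# xs selects one top-level variable and drops that level, leaving
# plain 'min', 'max' and 'mean' columns indexed by date
ads = toy_agg.xs('number_of_ads', axis=1)
print(ads)
```

Because xs drops the selected level, the result can be passed directly to plotting calls that expect columns named 'min', 'max' and 'mean'.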
Plot aggregated variables in a single figure containing small multiples
variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()
fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]

        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')

        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W')  # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)

        # Format subplot according to position within the figure
        # (Axes.is_first_row() and friends were removed in Matplotlib 3.6;
        # on newer versions use ax.get_subplotspec().is_first_row() etc.)
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)

fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);
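The comment in the plotting code notes that '%W' is not the ISO week. The sketch below illustrates the difference on three arbitrary sample dates that are not taken from the dataset. As far as I know, Series.dt.isocalendar() is also the replacement for the older DatetimeIndex.week attribute, which recent pandas versions no longer provide.

```python
import pandas as pd

# Arbitrary sample dates around a year boundary (not from the dataset)
dates = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-04', '2021-12-31']))

# %W: Monday-based week of the year; days before the first Monday are week 0
week_w = dates.dt.strftime('%W').astype(int)

# ISO 8601 week: week 1 is the week that contains the year's first Thursday,
# so early January days can belong to the last week of the previous year
week_iso = dates.dt.isocalendar().week.astype(int)

print(pd.DataFrame({'date': dates, 'week_W': week_w, 'week_iso': week_iso}))
```

Whichever convention is used, it should be the same one for grouping and for the tick labels, otherwise the labels near the start of a year will not match the plotted data.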
I am not sure if this answers your question fully, but based on your title I guess it all boils down to:
import pandas as pd
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
# define which columns to group and in which way
dct = {'low_price': [max, min],
'high_price': min,
'year': 'mean'}
# actually group the columns
df.groupby(['region']).agg(dct)
Output:
              low_price       high_price         year
                    max   min        min         mean
region
ALASKA            16.99  1.33       1.33  2020.792123
HAWAII            12.99  1.33       1.33  2020.738318
MIDWEST           28.73  0.99       0.99  2020.690159
NORTHEAST         19.99  1.20       1.99  2020.709916
NORTHWEST         16.99  1.33       1.33  2020.736397
SOUTH CENTRAL     28.76  1.20       1.49  2020.700980
SOUTHEAST         21.99  1.33       1.48  2020.699655
SOUTHWEST         16.99  1.29       1.29  2020.704341
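If the two-level column header is inconvenient, pandas also supports named aggregation, which yields flat column names in a single call. This is a minimal sketch on a synthetic frame. The values and the output column names (low_price_max and so on) are my own choices for illustration and do not come from the dataset.

```python
import pandas as pd

# Synthetic stand-in for the price data (all values are invented)
toy = pd.DataFrame({
    'region': ['ALASKA', 'ALASKA', 'HAWAII', 'HAWAII'],
    'low_price': [1.5, 3.0, 2.0, 4.0],
    'high_price': [2.5, 5.0, 3.0, 6.0],
})

# Named aggregation: new_column_name=(source_column, aggregation)
flat = toy.groupby('region').agg(
    low_price_max=('low_price', 'max'),
    low_price_min=('low_price', 'min'),
    high_price_min=('high_price', 'min'),
)
print(flat)
```

The trade-off is that flat names make single columns easier to reference, while the dictionary form keeps the statistics of each variable grouped under one top-level label, which is convenient with df.xs.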