How to group-by until different value in Pandas?
After I have all the data I need in df_base (which I won't include, for simplicity), I want to build df_product_final with these columns:
The first two columns are not a problem, since I simply copy them from df_base into df_product_final.
For SpeedAvg, I need to insert into df_product_final the average Speed of that product, up until a new product appears in the Product column.
My code:
df_product_final['Product'] = df_product_total['Product']
df_product_final['Speed'] = df_base['production'] / df_base['time_production']
df_product_final = df_product_final.fillna(0)
df_product_final['SpeedAvg'] = df_product_final['Speed'].groupby(df_product_final['Product']).mean()
df_product_final['newindex'] = df_base['date_key'] + df_base['hour'] + df_base['minute']
df_product_final['newindex'] = pd.to_datetime(df_product_final['newindex'], utc=1, format="%Y%m%d%H%M%S")
df_product_final.set_index('newindex', inplace=True)
df_product_final = df_product_final.fillna(0)
df_product_final:
newindex Product Speed SpeedAvg
2020-10-15 22:00:00+00:00 0 0.000000 52.944285
2020-10-15 23:00:00+00:00 0 0.000000 0.000000
2020-10-16 00:00:00+00:00 0 0.000000 0.000000
2020-10-16 01:00:00+00:00 0 0.000000 0.000000
2020-10-16 02:00:00+00:00 0 0.000000 0.000000
...
2020-10-16 20:00:00+00:00 0 154.000000 0.000000
2020-10-16 21:00:00+00:00 0 150.000000 0.000000
I want to get this result:
newindex Product Speed SpeedAvg
2020-10-15 22:00:00+00:00 0 0.000000 52.944285
2020-10-15 23:00:00+00:00 0 0.000000 52.944285
2020-10-16 00:00:00+00:00 0 0.000000 52.944285
2020-10-16 01:00:00+00:00 0 0.000000 52.944285
...
2020-10-16 20:00:00+00:00 0 154.000000 52.944285
2020-10-16 21:00:00+00:00 0 0.000000 52.944285
What makes things even more complicated is that the same product may appear again, separated by more than an hour. In that case my SpeedAvg should depend on those new values, not on the previous ones.
Example:
Product Speed SpeedAvg
newindex
2020-10-15 22:00:00+00:00 0 0.000000 52.944285
2020-10-15 23:00:00+00:00 0 0.000000 52.944285
2020-10-16 00:00:00+00:00 0 0.000000 52.944285
2020-10-16 01:00:00+00:00 0 0.000000 52.944285
2020-10-16 02:00:00+00:00 1 10.000000 10.000000
2020-10-16 03:00:00+00:00 1 10.000000 10.000000
2020-10-16 04:00:00+00:00 1 10.000000 10.000000
2020-10-16 05:00:00+00:00 1 10.000000 10.000000
2020-10-16 06:00:00+00:00 1 10.000000 10.000000
2020-10-16 07:00:00+00:00 0 0.000000 31.500000
2020-10-16 08:00:00+00:00 0 0.000000 31.500000
2020-10-16 16:00:00+00:00 0 183.000000 31.500000
2020-10-16 17:00:00+00:00 0 69.000000 31.500000
2020-10-16 18:00:00+00:00 0 0.000000 31.500000
2020-10-16 19:00:00+00:00 0 0.000000 31.500000
2020-10-16 20:00:00+00:00 0 0.000000 31.500000
2020-10-16 21:00:00+00:00 0 0.000000 31.500000
Apologies if I haven't been exhaustive; I will provide every piece of information needed to solve this.
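For reference, grouping consecutive equal values is commonly expressed with a shift/cumsum key plus transform('mean'), which broadcasts each group's mean back to its rows. A minimal sketch with made-up numbers, reusing the question's column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": [0, 0, 1, 1, 0, 0],
    "Speed":   [0.0, 4.0, 10.0, 10.0, 0.0, 63.0],
})

# A new group starts whenever Product differs from the previous row,
# so a Product value that reappears later forms a separate group.
group_key = df["Product"].ne(df["Product"].shift()).cumsum()

# transform('mean') returns a Series aligned to df's index,
# unlike mean(), which aggregates down to one row per group.
df["SpeedAvg"] = df.groupby(group_key)["Speed"].transform("mean")
```

The second run of Product 0 gets its own average (31.5 here) even though an earlier run of Product 0 exists, which matches the desired behaviour described above.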
Found another solution using group by. Let me know if this works for you.
def _mean(df):
    df['SpeedAvg'] = df['Speed'].mean()
    return df

df_product_final = df_product_final.groupby(
    df_product_final['Product'].ne(df_product_final['Product'].shift()).cumsum()
).apply(_mean)
Adapted from an answer to this post.
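A quick self-contained check of this approach on made-up data (group_keys=False keeps the original row index when apply returns a DataFrame):

```python
import pandas as pd

def _mean(g):
    g["SpeedAvg"] = g["Speed"].mean()
    return g

df = pd.DataFrame({
    "Product": [1, 1, 0, 0, 1],
    "Speed":   [10.0, 20.0, 4.0, 6.0, 7.0],
})

# Consecutive-run key: increments whenever Product changes from the previous row.
key = df["Product"].ne(df["Product"].shift()).cumsum()
df = df.groupby(key, group_keys=False).apply(_mean)
```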
I think I found a simpler solution to my problem:
Starting from an empty dictionary, I insert all the keys of df_base into it like this:
product_keys = {}
product_keys = df_base['product_key'].drop_duplicates().reset_index(inplace=False, drop=True).to_dict()
The resulting dictionary will look something like:
{0: 2,
1: 1,
2: 31
}
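This dictionary can be reproduced on a toy Series (the values below are assumptions standing in for df_base['product_key']):

```python
import pandas as pd

# Hypothetical stand-in for df_base['product_key'].
keys = pd.Series([2, 2, 1, 31, 31, 1], name="product_key")

# drop_duplicates keeps the first occurrence of each value;
# reset_index renumbers them 0..n-1; to_dict maps position -> product_key.
product_keys = keys.drop_duplicates().reset_index(drop=True).to_dict()
```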
After this step, using df.apply() I can iterate over every row of the dataframe, changing each row's product-key value to the corresponding key of the dictionary I just built:
df_product_final['Product'] = df_base['product_key']
# apply does not modify the frame in place, so assign the result back
df_product_final = df_product_final.apply(
    self.keys_from_value,
    dict=product_keys,
    axis='columns',
    raw=False,
    result_type='broadcast',
)
where self.keys_from_value is:
def keys_from_value(self, row, dict):
    if row is None:
        return
    else:
        row['Product'] = list(dict.keys())[list(dict.values()).index(row['Product'])]
        return row
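The same reverse lookup (value -> position) can be done without a per-row apply by inverting the dictionary once and using Series.map, which is vectorized; a sketch assuming the example dictionary from above:

```python
import pandas as pd

product_keys = {0: 2, 1: 1, 2: 31}                 # position -> product_key
inverse = {v: k for k, v in product_keys.items()}  # product_key -> position

# Hypothetical Product column still holding raw product_key values.
products = pd.Series([2, 1, 31, 2])
mapped = products.map(inverse)
```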
The final step is to compute and insert the correct SpeedAvg into the dataframe (this is straightforward: the first loop builds the group_id column from the rows just modified; the second loop inserts the SpeedAvg according to group_id):
gid = 0
for i, row in df_base.iterrows():
    if row['diff'] != 0:
        gid += 1
    df_base.at[i, 'group_id'] = gid

avg = df_product_final['Speed'].groupby(df_base['group_id']).mean()
# avg is a Pandas Series of all the SpeedAvg values, one per group_id

for i, row in df_product_final.iterrows():
    for row_avg in avg.index.values.tolist():
        if row.at['group_id'] == row_avg:
            df_product_final.at[i, 'SpeedAvg'] = avg[row_avg]
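Both loops have vectorized equivalents: cumsum over the diff flags builds group_id in one step, and transform('mean') broadcasts the per-group means without the nested iteration. A sketch on made-up data using the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    "diff":  [0, 0, 1, 0, 1, 0],
    "Speed": [10.0, 20.0, 3.0, 5.0, 8.0, 8.0],
})

# Equivalent of the first loop: gid increments wherever diff != 0.
df["group_id"] = df["diff"].ne(0).cumsum()

# Equivalent of the second loop: each row gets its group's mean Speed.
df["SpeedAvg"] = df.groupby("group_id")["Speed"].transform("mean")
```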
This is the dataframe (df_product_final) I get after these steps:
Product Speed SpeedAvg
newindex
2020-10-20 09:00:00+00:00 0 0.000000 0.000000
2020-10-20 09:00:00+00:00 1 0.000000 104.528338
2020-10-20 10:00:00+00:00 1 0.000000 104.528338
2020-10-20 11:00:00+00:00 1 0.000000 104.528338
2020-10-20 12:00:00+00:00 1 68.375000 104.528338
2020-10-20 13:00:00+00:00 1 188.074074 104.528338
2020-10-20 14:00:00+00:00 1 172.192982 104.528338
2020-10-20 15:00:00+00:00 1 162.553571 104.528338
2020-10-20 16:00:00+00:00 1 178.867925 104.528338
2020-10-20 17:00:00+00:00 1 181.844828 104.528338
2020-10-20 18:00:00+00:00 1 93.375000 104.528338
2020-10-19 20:00:00+00:00 0 0.000000 0.000000
2020-10-19 21:00:00+00:00 0 0.000000 0.000000
2020-10-19 22:00:00+00:00 0 0.000000 0.000000
2020-10-19 23:00:00+00:00 0 0.000000 0.000000
2020-10-20 00:00:00+00:00 0 0.000000 0.000000
2020-10-20 01:00:00+00:00 0 0.000000 0.000000
2020-10-20 02:00:00+00:00 0 0.000000 0.000000
2020-10-20 03:00:00+00:00 0 0.000000 0.000000
2020-10-20 04:00:00+00:00 0 0.000000 0.000000
2020-10-20 05:00:00+00:00 0 0.000000 0.000000
2020-10-20 06:00:00+00:00 0 0.000000 0.000000
2020-10-20 07:00:00+00:00 0 0.000000 0.000000
2020-10-20 08:00:00+00:00 0 0.000000 0.000000
2020-10-20 09:00:00+00:00 2 0.000000 95.025762
2020-10-20 10:00:00+00:00 2 0.000000 95.025762
2020-10-20 11:00:00+00:00 2 0.000000 95.025762
2020-10-20 12:00:00+00:00 2 68.375000 95.025762
2020-10-20 13:00:00+00:00 2 188.074074 95.025762
2020-10-20 14:00:00+00:00 2 172.192982 95.025762
2020-10-20 15:00:00+00:00 2 162.553571 95.025762
2020-10-20 16:00:00+00:00 2 178.867925 95.025762
2020-10-20 17:00:00+00:00 2 181.844828 95.025762
2020-10-20 18:00:00+00:00 2 93.375000 95.025762
2020-10-20 19:00:00+00:00 2 0.000000 95.025762