
What is the most efficient way to create inventory history

I am trying to get, for each day of the year, how many cars I had in stock and how many days each car had been in stock as of that date.

I have the full history of movements (the timestamps at which each car was moved in and out of stock: for rent, for sale, repair and so on). Like this:

car      in                out               status_id  operation
PZR4010  08/02/2018 08:55  08/02/2018 16:29  12         out_stock
QRX0502  07/02/2018 09:00  07/02/2018 10:28  7          in_stock
PYR8269  06/02/2018 17:10  09/02/2018 21:22  12         in_stock
QRG6455  06/02/2018 12:39                    8          sold
QRU1867  08/02/2018 08:00  09/02/2018 11:07  12         in_stock
PZR8528  06/02/2018 17:51  07/02/2018 07:46  10         out_stock
PZR7184  06/02/2018 16:00  08/02/2018 12:10  7          in_stock
PZR0386  08/02/2018 09:02  14/02/2018 14:53  10         out_stock
PZR8600  06/02/2018 16:00  07/02/2018 07:34  7          in_stock
PZR1787  06/02/2018 17:02  20/02/2018 17:33  12         in_stock

So, for each car, I have to merge all the consecutive periods during which it was in stock, to know how long it stayed in that state.

So for instance:

car      in                out               status_id  operation
QRX0502  08/02/2018 08:55  09/02/2018 16:29  7          in_stock
QRX0502  07/02/2018 09:00  08/02/2018 08:55  7          in_stock
QRX0502  06/02/2018 17:10  07/02/2018 09:00  7          in_stock

Will become simply:

car      in                out               status_id  operation
QRX0502  06/02/2018 17:10  09/02/2018 16:29  7          in_stock

That is, taking the min timestamp in the 'in' column and the max timestamp in the 'out' column.

I have tried to use groupby + shift:

# 'mov' is the dataframe with all the stock movements
# I create a helper column to make the groupby filter simpler

mov['aux']=mov['car']+" - "+mov['operation']

#creating the base dataframe to be the output

hist_mov=pd.DataFrame(columns=list(mov.columns))

for line, operation in mov.groupby(mov['aux'].ne(mov['aux'].shift()).cumsum()):
    g_temp=operation.groupby(['car','operation',
        'aux']).agg({'in':'min','out':'max'}).reset_index()
    hist_mov=hist_mov.append(g_temp,sort=True)

The problem is that the whole database takes about 16 hours to run, and I have to run it every day to update the inventory status.

I want to build something like:

Every new row added to the history is checked against my new base (hist_mov) to see whether it is consecutive to an existing row. If so, update that row. If not, add it as a new row.
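A minimal sketch of that incremental idea, using only the standard library. It assumes that rows arrive in chronological order for each car, and that "consecutive" means the new row has the same operation as that car's most recent period. The function name `add_movement`, the dict layout of `hist` and the final `out_stock` row are hypothetical and only serve as illustration.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"  # timestamp layout used in the sample rows


def add_movement(hist, car, operation, t_in, t_out):
    """Merge one movement into hist, a dict mapping car -> list of periods.

    Assumes movements for a given car arrive in chronological order.
    """
    periods = hist.setdefault(car, [])
    if periods and periods[-1]["operation"] == operation:
        # Consecutive to the car's latest period: extend that period.
        periods[-1]["out"] = t_out
    else:
        # Different state (or first row for this car): start a new period.
        periods.append({"operation": operation, "in": t_in, "out": t_out})


hist = {}
rows = [
    ("QRX0502", "in_stock", "06/02/2018 17:10", "07/02/2018 09:00"),
    ("QRX0502", "in_stock", "07/02/2018 09:00", "08/02/2018 08:55"),
    ("QRX0502", "in_stock", "08/02/2018 08:55", "09/02/2018 16:29"),
    # Hypothetical extra row, added to show a state change.
    ("QRX0502", "out_stock", "09/02/2018 16:29", "10/02/2018 10:00"),
]
for car, op, t_in, t_out in rows:
    add_movement(hist, car, op,
                 datetime.strptime(t_in, FMT), datetime.strptime(t_out, FMT))

for period in hist["QRX0502"]:
    print(period["operation"], period["in"], period["out"])
```

Because each new row only touches the last period of one car, the daily update cost grows with the number of new rows rather than with the size of the full history.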

Any ideas? Thanks!

I think something like this might be what you are after:

cols = ["car", "operation"]
pd.merge(df.groupby(cols)["in"].min().reset_index(), 
         df.groupby(cols)["out"].max().reset_index(), on=cols, how="outer")

Edit:

Hopefully this alleviates the problem outlined in the comments, using a trans_id column to recognise separate instances of a car coming back in and out:

# Start a new trans_id whenever the car or the operation changes
# (assumes df is already sorted by car and "in").
new_run = df['car'].ne(df['car'].shift()) | df['operation'].ne(df['operation'].shift())
df['trans_id'] = new_run.cumsum()
cols = ["car", "trans_id", "operation"]
df_grouped = pd.merge(df.groupby(cols)["in"].min().reset_index(),
         df.groupby(cols)["out"].max().reset_index(), on=cols, how="outer")
df_grouped.drop('trans_id', axis=1, inplace=True)
df_grouped

I have found the answer!

The code that I had first posted was almost right, but it had an unnecessary loop.

1- First I sort the items by car and date of status change:

    mov=mov.sort_values(['car','in'],ascending=False)

2- Then I cluster by car and operation:

    mov['aux']=mov['car']+" - "+mov['operation']
    mov['cluster']=(mov.aux != mov.aux.shift()).cumsum()

3- Finally I can just group by this cluster id, and get the min "in" value and the max "out" value:

    hist_mov=mov.groupby(['cluster','car','operation']).agg({'in':'min',
          'out':'max'}).reset_index().copy()
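
For reference, the three steps can be put together on a few rows as a runnable sketch. The sample DataFrame and the `days` column are assumptions added for illustration; the timestamps are parsed with `pd.to_datetime` first, because min/max on dd/mm/yyyy strings would compare them alphabetically rather than chronologically.

```python
import pandas as pd

# Hypothetical sample in the question's layout (dd/mm/yyyy timestamps).
mov = pd.DataFrame({
    "car": ["QRX0502", "QRX0502", "QRX0502", "PZR4010"],
    "in": ["08/02/2018 08:55", "07/02/2018 09:00",
           "06/02/2018 17:10", "08/02/2018 08:55"],
    "out": ["09/02/2018 16:29", "08/02/2018 08:55",
            "07/02/2018 09:00", "08/02/2018 16:29"],
    "operation": ["in_stock", "in_stock", "in_stock", "out_stock"],
})

# Parse the timestamps so that min/max compare dates, not strings.
for col in ("in", "out"):
    mov[col] = pd.to_datetime(mov[col], format="%d/%m/%Y %H:%M")

# 1- Sort so that each car's rows are adjacent and in time order.
mov = mov.sort_values(["car", "in"]).reset_index(drop=True)

# 2- Start a new cluster whenever the car or the operation changes.
changed = (mov["car"].ne(mov["car"].shift())
           | mov["operation"].ne(mov["operation"].shift()))
mov["cluster"] = changed.cumsum()

# 3- Collapse each cluster to its earliest 'in' and latest 'out'.
hist_mov = (
    mov.groupby(["cluster", "car", "operation"], as_index=False)
       .agg({"in": "min", "out": "max"})
)

# Illustrative extra: whole days spent in each consolidated period.
hist_mov["days"] = (hist_mov["out"] - hist_mov["in"]).dt.days
print(hist_mov)
```

Comparing the two columns directly gives the same clusters as building the concatenated 'aux' string, without creating the extra column.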
