Optimizing pandas reindex date_range

I have the following situation:

A dataframe that shows every inventory movement (buy/sell) for each product and store.

        date     sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-04  103993.0    001    1.0      8.0
3  2019-10-05  103993.0    001    0.0      8.0
4  2019-10-01  103994.0    002    0.0     12.0
5  2019-10-02  103994.0    002    1.0     11.0
6  2019-10-04  103994.0    002    1.0     10.0
7  2019-10-05  103994.0    002    0.0     10.0
8  2019-09-30  103991.0    012    0.0     12.0
9  2019-10-02  103991.0    012    1.0     11.0
10 2019-10-04  103991.0    012    1.0     10.0
11 2019-10-05  103991.0    012    0.0     10.0

Each product has a different start date; however, I want to bring all of them to the same end date.

Suppose today is 2019-10-08 and I want to update this dataframe by inserting rows for the days that were skipped between each product's first date and 2019-10-08.


  • Sku: 103993
  • Store: 001
  • First date: 2019-10-01 (it will be the first index)
  • End date: 2019-10-08

Dataframe:

        date     sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-04  103993.0    001    1.0      8.0
3  2019-10-05  103993.0    001    0.0      8.0

The expected output should be:

        date     sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-03  103993.0    001    NaN      NaN
3  2019-10-04  103993.0    001    1.0      8.0
4  2019-10-05  103993.0    001    0.0      8.0
5  2019-10-06  103993.0    001    NaN      NaN
6  2019-10-07  103993.0    001    NaN      NaN
7  2019-10-08  103993.0    001    NaN      NaN

To accomplish this, I came up with two solutions:

dfs = []
for _, d in df.groupby(['sku', 'store']):

    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')

    # Reindex this product on every day from its first date to the end date
    d = d.set_index('date')
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)

df = pd.concat(dfs)

And later on:

v = '2019-10-08'

df = df.groupby(['sku', 'store'])[['date', 'Units', 'balance']]  \
    .apply(lambda x: x.set_index('date')  \
    .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))

However, it takes too long when I have a dataframe with 100000 products.

Do you have any ideas for improving this function, ideally by vectorizing it with pandas?

If I understand correctly, this is the type of thing you're trying to do. This may be faster, because you're not repeatedly concatenating and appending the DataFrame as a whole. I'm really not sure, though. You'll have to test it.

import pandas as pd
import numpy as np 

def Insert_row(row_number, df, row_value):
    # From here: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    # Starting value of upper half 
    start_upper = 0
    # End value of upper half 
    end_upper = row_number 
    # Start value of lower half 
    start_lower = row_number 
    # End value of lower half 
    end_lower = df.shape[0] 
    # Create a list of upper_half index 
    upper_half = [*range(start_upper, end_upper, 1)] 
    # Create a list of lower_half index 
    lower_half = [*range(start_lower, end_lower, 1)] 
    # Increment the value of lower half by 1 
    lower_half = [x + 1 for x in lower_half] 
    # Combine the two lists 
    index_ = upper_half + lower_half 
    # Update the index of the dataframe 
    df.index = index_ 
    # Insert a row at the end 
    df.loc[row_number] = row_value 
    # Sort the index labels 
    df = df.sort_index() 
    # return the dataframe 
    return df 

# First ensure the column is datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1 # Start at the SECOND row
for i in range (1, df.shape[0], 1): # Loop through all the rows
    current_date  = df.iloc[location]["date"] # Date of the current row
    previous_date = df.iloc[location - 1]["date"] # Date of the previous row
    try: # Try to get a difference between the row's dates
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D') )
    except ValueError: # Raised when a date is NaT, so the difference is NaN
        difference = 0 # Treat it as "no gap" so that no rows are inserted
#    print(previous_date, " - ", current_date, "=", difference)
    if difference > 1: # If the difference is more than one day
        newdate = (pd.to_datetime(previous_date) + np.timedelta64(1, "D")) # Increment the date by one day        
        for d in range(1, difference, 1): # Loop for all missing rows
#            print("Inserting row with date {}".format(newdate))
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan] # Create the row
            df = Insert_row(location, df, row_value) # Insert the row
            location += 1 # Increment the location
            newdate = (pd.to_datetime(newdate) + np.timedelta64(1, "D")) # Increment the date for the next loop if it's needed- 
    location += 1 



Input:

         date       sku  store  Units  balance
0  2019-10-01  103993.0    1.0    0.0     10.0
1  2019-10-02  103993.0    1.0    1.0      9.0
2  2019-10-04  103993.0    1.0    1.0      8.0
3  2019-10-05  103993.0    1.0    0.0      8.0
4  2019-10-06  103994.0    2.0    0.0     12.0
5  2019-10-07  103994.0    2.0    1.0     11.0
6  2019-10-10  103994.0    2.0    1.0     10.0
7  2019-10-15  103994.0    2.0    0.0     10.0
8  2019-10-30  103991.0   12.0    0.0     12.0
9                   NaN    NaN    NaN      NaN

Output:

         date       sku  store  Units  balance
0  2019-10-01  103993.0    1.0    0.0     10.0
1  2019-10-02  103993.0    1.0    1.0      9.0
2  2019-10-03       NaN    NaN    NaN      NaN
3  2019-10-04  103993.0    1.0    1.0      8.0
4  2019-10-05  103993.0    1.0    0.0      8.0
5  2019-10-06  103994.0    2.0    0.0     12.0
6  2019-10-07  103994.0    2.0    1.0     11.0
7  2019-10-08       NaN    NaN    NaN      NaN
8  2019-10-09       NaN    NaN    NaN      NaN
9  2019-10-10  103994.0    2.0    1.0     10.0
10 2019-10-11       NaN    NaN    NaN      NaN
11 2019-10-12       NaN    NaN    NaN      NaN
12 2019-10-13       NaN    NaN    NaN      NaN
13 2019-10-14       NaN    NaN    NaN      NaN
14 2019-10-15  103994.0    2.0    0.0     10.0
15 2019-10-16       NaN    NaN    NaN      NaN
16 2019-10-17       NaN    NaN    NaN      NaN
17 2019-10-18       NaN    NaN    NaN      NaN
18 2019-10-19       NaN    NaN    NaN      NaN
19 2019-10-20       NaN    NaN    NaN      NaN
20 2019-10-21       NaN    NaN    NaN      NaN
21 2019-10-22       NaN    NaN    NaN      NaN
22 2019-10-23       NaN    NaN    NaN      NaN
23 2019-10-24       NaN    NaN    NaN      NaN
24 2019-10-25       NaN    NaN    NaN      NaN
25 2019-10-26       NaN    NaN    NaN      NaN
26 2019-10-27       NaN    NaN    NaN      NaN
27 2019-10-28       NaN    NaN    NaN      NaN
28 2019-10-29       NaN    NaN    NaN      NaN
29 2019-10-30  103991.0   12.0    0.0     12.0
30 2019-10-31       NaN    NaN    NaN      NaN
31 2019-11-01       NaN    NaN    NaN      NaN
32 2019-11-02       NaN    NaN    NaN      NaN
33 2019-11-03       NaN    NaN    NaN      NaN
34 2019-11-04       NaN    NaN    NaN      NaN
35 2019-11-05       NaN    NaN    NaN      NaN
36 2019-11-06       NaN    NaN    NaN      NaN
37 2019-11-07       NaN    NaN    NaN      NaN
38 2019-11-08       NaN    NaN    NaN      NaN
39 2019-11-09       NaN    NaN    NaN      NaN
40 2019-11-10       NaN    NaN    NaN      NaN
41 2019-11-11       NaN    NaN    NaN      NaN
42 2019-11-12       NaN    NaN    NaN      NaN
43 2019-11-13       NaN    NaN    NaN      NaN
44        NaT       NaN    NaN    NaN      NaN
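
The day differences that the loop above computes one row at a time can also be obtained in a single pass with `Series.diff()`. The following is only a minimal sketch on a made-up single-product frame (the sample data and the names `gap_days`, `rows_after_gap` and `missing_rows` are assumptions for illustration, not part of the answer above); it shows where the gaps are and how many rows would need inserting, without inserting them.

```python
import pandas as pd

# Hypothetical sample: one product with a single missing day (2019-10-03).
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-10-01", "2019-10-02", "2019-10-04", "2019-10-05"]),
    "Units": [0.0, 1.0, 1.0, 0.0],
    "balance": [10.0, 9.0, 8.0, 8.0],
})

# Days elapsed since the previous row; the first row has no predecessor (NaN).
gap_days = df["date"].diff().dt.days

# Rows that are preceded by at least one missing day.
rows_after_gap = df[gap_days > 1]

# Total number of rows that would have to be inserted to close every gap.
missing_rows = int((gap_days[gap_days > 1] - 1).sum())

print(rows_after_gap)
print("rows to insert:", missing_rows)
```

With several products in one frame, the same idea would probably need `df.groupby(['sku', 'store'])['date'].diff()` so that the difference is not taken across product boundaries.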

You can do all of this using pandas merge (or join) operations. A problem with this approach can arise when you have many products ('sku', 'store' combinations) and many distinct dates in total (ranging from the minimum date of your dataframe to now), because every product is paired with every date.

The following assumes that your data is in df and that the date column already holds datetime values (for example after pd.to_datetime).

import datetime
import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']

# We will first save away the minimum date for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)

# Then you find all possible product/date combinations. As said above, this might lead
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)

# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')

# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)

For your provided input data the output looks something like this:

         sku  store       date  Units  balance
0   103991.0     12 2019-09-30    0.0     12.0
1   103991.0     12 2019-10-01    NaN      NaN
2   103991.0     12 2019-10-02    1.0     11.0
3   103991.0     12 2019-10-03    NaN      NaN
4   103991.0     12 2019-10-04    1.0     10.0
5   103991.0     12 2019-10-05    0.0     10.0
6   103991.0     12 2019-10-06    NaN      NaN
7   103991.0     12 2019-10-07    NaN      NaN
8   103991.0     12 2019-10-08    NaN      NaN
9   103991.0     12 2019-10-09    NaN      NaN
10  103991.0     12 2019-10-10    NaN      NaN
12  103993.0      1 2019-10-01    0.0     10.0
13  103993.0      1 2019-10-02    1.0      9.0
14  103993.0      1 2019-10-03    NaN      NaN
15  103993.0      1 2019-10-04    1.0      8.0
16  103993.0      1 2019-10-05    0.0      8.0
17  103993.0      1 2019-10-06    NaN      NaN
18  103993.0      1 2019-10-07    NaN      NaN
19  103993.0      1 2019-10-08    NaN      NaN
20  103993.0      1 2019-10-09    NaN      NaN
21  103993.0      1 2019-10-10    NaN      NaN
23  103994.0      2 2019-10-01    0.0     12.0
24  103994.0      2 2019-10-02    1.0     11.0
25  103994.0      2 2019-10-03    NaN      NaN
26  103994.0      2 2019-10-04    1.0     10.0
27  103994.0      2 2019-10-05    0.0     10.0
28  103994.0      2 2019-10-06    NaN      NaN
29  103994.0      2 2019-10-07    NaN      NaN
30  103994.0      2 2019-10-08    NaN      NaN
31  103994.0      2 2019-10-09    NaN      NaN
32  103994.0      2 2019-10-10    NaN      NaN
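
If the full product/date cross join becomes too large, one possible variation is to generate only the days each product actually needs, from its own first date up to the end date, and then left-merge the original rows onto that skeleton. The sketch below is an assumption-laden illustration rather than a tested replacement for the answer above: the sample frame, the names `starts`, `n_days`, `offsets`, `full` and `result`, and the choice of `END_DATE` are all made up for the example, and no timing has been measured.

```python
import numpy as np
import pandas as pd

END_DATE = pd.Timestamp("2019-10-10")  # assumed common end date

# Hypothetical sample: two products that start on different days.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-10-07", "2019-10-09", "2019-10-08"]),
    "sku": [103993.0, 103993.0, 103994.0],
    "store": ["001", "001", "002"],
    "Units": [0.0, 1.0, 1.0],
    "balance": [10.0, 9.0, 11.0],
})

keys = ["sku", "store"]

# First date of every product.
starts = df.groupby(keys, as_index=False)["date"].min()

# Number of calendar days each product needs, counting both end points.
n_days = ((END_DATE - starts["date"]).dt.days + 1).to_numpy()

# Repeat each product's key row once per needed day.
full = starts.loc[starts.index.repeat(n_days), keys].reset_index(drop=True)

# Day offsets 0..n-1 for each product, concatenated in the same order.
offsets = np.concatenate([np.arange(n) for n in n_days])

# Start date of the product plus the offset gives the calendar day.
full["date"] = (
    np.repeat(starts["date"].to_numpy(), n_days)
    + offsets.astype("timedelta64[D]")
)

# The left merge keeps every generated day; days without a movement get NaN.
result = full.merge(df, on=keys + ["date"], how="left")
print(result)
```

The skeleton here has one row per product per needed day, so no rows have to be dropped afterwards. Whether this is actually faster for 100000 products would have to be measured on the real data.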
