
Optimizing pandas reindex date_range

I have the following situation:

A dataframe that shows every inventory movement (buy/sell) for each product and store.

        date     sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-04  103993.0    001    1.0      8.0
3  2019-10-05  103993.0    001    0.0      8.0
4  2019-10-01  103994.0    002    0.0     12.0
5  2019-10-02  103994.0    002    1.0     11.0
6  2019-10-04  103994.0    002    1.0     10.0
7  2019-10-05  103994.0    002    0.0     10.0
8  2019-09-30  103991.0    012    0.0     12.0
9  2019-10-02  103991.0    012    1.0     11.0
10 2019-10-04  103991.0    012    1.0     10.0
11 2019-10-05  103991.0    012    0.0     10.0

Each product has a different start date; however, I want to bring all of them to the same end date.

Suppose today is 2019-10-08 and I want to update this dataframe, inserting rows for any skipped days between each product's first date and 2019-10-08.

Example:

  • Sku: 103993
  • Store: 001
  • First date: 2019-10-01 (it will be the first index)
  • End date: 2019-10-08

Dataframe:

        date     sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-04  103993.0    001    1.0      8.0
3  2019-10-05  103993.0    001    0.0      8.0

The expected output should be:

        date     sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-03  103993.0    001    NaN      NaN
3  2019-10-04  103993.0    001    1.0      8.0
4  2019-10-05  103993.0    001    0.0      8.0
5  2019-10-06  103993.0    001    NaN      NaN
6  2019-10-07  103993.0    001    NaN      NaN
7  2019-10-08  103993.0    001    NaN      NaN
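
For a single product, this is exactly what reindexing onto a daily pd.date_range produces. A minimal sketch of the idea (the 103993/'001' filter just picks the example product, and the forward fill of sku/store reproduces the filled key columns shown above):

import pandas as pd

# one product's rows, as in the table above
one = df[(df['sku'] == 103993.0) & (df['store'] == '001')]
one = one.set_index(pd.to_datetime(one['date'])).drop(columns='date')

# daily index from the product's first date to the common end date
idx = pd.date_range(one.index.min(), pd.Timestamp('2019-10-08'))
out = one.reindex(idx)
out[['sku', 'store']] = out[['sku', 'store']].ffill()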

In order to accomplish this I came up with two solutions:

dfs = []
for _, d in df.groupby(['sku', 'store']):

    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')

    d.set_index('date', inplace=True)
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)

df = pd.concat(dfs)

And later on:

v = '2019-10-08'

df = df.groupby(['sku', 'store'])[['date', 'Units', 'balance']] \
    .apply(lambda x: x.set_index('date')
                      .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))

However, both take too long when I have a dataframe with 100,000 products.

Does anyone have an idea how to improve this, ideally by vectorizing it with pandas?
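
For reference, the per-group reindex can also be avoided entirely by building every required (sku, store, date) row in one vectorized pass and left-merging the original data onto it. A sketch under the assumption that the frame looks like the sample above (the helper name pad_to_date and the hard-coded end date are illustrative, not from the original post):

import pandas as pd

def pad_to_date(df, end='2019-10-08'):
    """Pad every (sku, store) group with one row per day, from the
    group's first date up to a common end date."""
    end = pd.Timestamp(end)
    df = df.assign(date=pd.to_datetime(df['date']))

    # first observed date and required number of daily rows per product
    first = (df.groupby(['sku', 'store'], as_index=False)['date'].min()
               .rename(columns={'date': 'start'}))
    n_days = (end - first['start']).dt.days + 1

    # repeat each product row once per day, then add a 0..n-1 day offset
    full = first.loc[first.index.repeat(n_days)].reset_index(drop=True)
    offset = full.groupby(['sku', 'store']).cumcount()
    full['date'] = full['start'] + pd.to_timedelta(offset, unit='D')

    # one left merge attaches Units/balance; missing days stay NaN
    return full.drop(columns='start').merge(
        df, on=['sku', 'store', 'date'], how='left')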

If I understand correctly, this is the type of thing you're trying to do. It may be faster, because you're not repeatedly concatenating and appending the DF as a whole. Really not sure; you'll have to test it.

import pandas as pd
import numpy as np

print(df)
print("--------------")

def Insert_row(row_number, df, row_value):
    """
    from here: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    """ 
    # Starting value of upper half 
    start_upper = 0
    # End value of upper half 
    end_upper = row_number 
    # Start value of lower half 
    start_lower = row_number 
    # End value of lower half 
    end_lower = df.shape[0] 
    # Index labels for the rows above the insertion point 
    upper_half = list(range(start_upper, end_upper)) 
    # Index labels for the rows below, shifted down by one to free the slot 
    lower_half = [x + 1 for x in range(start_lower, end_lower)] 
    # Combine the two lists 
    index_ = upper_half + lower_half 
    # Update the index of the dataframe 
    df.index = index_ 
    # Add the new row under the now-free label (it lands at the end for now) 
    df.loc[row_number] = row_value 
    # Sort the index labels 
    df = df.sort_index() 
    # return the dataframe 
    return df 

# First ensure the column is datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1 # Start at the SECOND row
for i in range(1, df.shape[0], 1): # Loop through all the rows
    current_date  = df.iloc[location]["date"] # Date of the current row
    previous_date = df.iloc[location - 1]["date"] # Date of the previous row
    try: # A NaT date makes the difference NaN, which int() cannot convert
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D'))
    except ValueError:
        location += 1 # Step past the bad row instead of re-checking it
        continue
#    print(previous_date, " - ", current_date, "=", difference)
    if difference > 1: # If the difference is more than one day
        newdate = (pd.to_datetime(previous_date) + np.timedelta64(1, "D")) # Increment the date by one day        
        for d in range(1, difference, 1): # Loop for all missing rows
#            print("Inserting row with date {}".format(newdate))
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan] # Create the row
            df = Insert_row(location, df, row_value) # Insert the row
            location += 1 # Increment the location
            newdate = (pd.to_datetime(newdate) + np.timedelta64(1, "D")) # Increment the date for the next loop if it's needed
    location += 1 

print(df)

OUTPUT:

         date       sku  store  Units  balance
0  2019-10-01  103993.0    1.0    0.0     10.0
1  2019-10-02  103993.0    1.0    1.0      9.0
2  2019-10-04  103993.0    1.0    1.0      8.0
3  2019-10-05  103993.0    1.0    0.0      8.0
4  2019-10-06  103994.0    2.0    0.0     12.0
5  2019-10-07  103994.0    2.0    1.0     11.0
6  2019-10-10  103994.0    2.0    1.0     10.0
7  2019-10-15  103994.0    2.0    0.0     10.0
8  2019-10-30  103991.0   12.0    0.0     12.0
9                   NaN    NaN    NaN      NaN
--------------
         date       sku  store  Units  balance
0  2019-10-01  103993.0    1.0    0.0     10.0
1  2019-10-02  103993.0    1.0    1.0      9.0
2  2019-10-03       NaN    NaN    NaN      NaN
3  2019-10-04  103993.0    1.0    1.0      8.0
4  2019-10-05  103993.0    1.0    0.0      8.0
5  2019-10-06  103994.0    2.0    0.0     12.0
6  2019-10-07  103994.0    2.0    1.0     11.0
7  2019-10-08       NaN    NaN    NaN      NaN
8  2019-10-09       NaN    NaN    NaN      NaN
9  2019-10-10  103994.0    2.0    1.0     10.0
10 2019-10-11       NaN    NaN    NaN      NaN
11 2019-10-12       NaN    NaN    NaN      NaN
12 2019-10-13       NaN    NaN    NaN      NaN
13 2019-10-14       NaN    NaN    NaN      NaN
14 2019-10-15  103994.0    2.0    0.0     10.0
15 2019-10-16       NaN    NaN    NaN      NaN
16 2019-10-17       NaN    NaN    NaN      NaN
17 2019-10-18       NaN    NaN    NaN      NaN
18 2019-10-19       NaN    NaN    NaN      NaN
19 2019-10-20       NaN    NaN    NaN      NaN
20 2019-10-21       NaN    NaN    NaN      NaN
21 2019-10-22       NaN    NaN    NaN      NaN
22 2019-10-23       NaN    NaN    NaN      NaN
23 2019-10-24       NaN    NaN    NaN      NaN
24 2019-10-25       NaN    NaN    NaN      NaN
25 2019-10-26       NaN    NaN    NaN      NaN
26 2019-10-27       NaN    NaN    NaN      NaN
27 2019-10-28       NaN    NaN    NaN      NaN
28 2019-10-29       NaN    NaN    NaN      NaN
29 2019-10-30  103991.0   12.0    0.0     12.0
30 2019-10-31       NaN    NaN    NaN      NaN
31 2019-11-01       NaN    NaN    NaN      NaN
32 2019-11-02       NaN    NaN    NaN      NaN
33 2019-11-03       NaN    NaN    NaN      NaN
34 2019-11-04       NaN    NaN    NaN      NaN
35 2019-11-05       NaN    NaN    NaN      NaN
36 2019-11-06       NaN    NaN    NaN      NaN
37 2019-11-07       NaN    NaN    NaN      NaN
38 2019-11-08       NaN    NaN    NaN      NaN
39 2019-11-09       NaN    NaN    NaN      NaN
40 2019-11-10       NaN    NaN    NaN      NaN
41 2019-11-11       NaN    NaN    NaN      NaN
42 2019-11-12       NaN    NaN    NaN      NaN
43 2019-11-13       NaN    NaN    NaN      NaN
44        NaT       NaN    NaN    NaN      NaN

You can do all of this using pandas merge (or join) operations. A problem with this approach arises when you have many 'products' ('sku', 'store' combinations) spanning many distinct dates (ranging from the minimum date of your dataframe to now): the intermediate frame has len(unique_products) times len(unique_dates) rows, so 100,000 products over a single year already means roughly 36.5 million rows.

The following assumes that your data is in df:

import datetime
import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']

# We will first save away the minimum date for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)

# Then you find all possible product/date combinations, as said above, this might lead 
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)

# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')

# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)
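
As an aside, on pandas 1.2 and later the dummy key column is not needed, because merge supports cross joins directly; the three key-related lines above can be replaced by:

# pandas >= 1.2: same cartesian product without the helper 'key' column
all_product_date_combinations = pd.merge(all_dates, all_products, how='cross')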

For your provided input data the output looks something like this:

         sku  store       date  Units  balance
0   103991.0     12 2019-09-30    0.0     12.0
1   103991.0     12 2019-10-01    NaN      NaN
2   103991.0     12 2019-10-02    1.0     11.0
3   103991.0     12 2019-10-03    NaN      NaN
4   103991.0     12 2019-10-04    1.0     10.0
5   103991.0     12 2019-10-05    0.0     10.0
6   103991.0     12 2019-10-06    NaN      NaN
7   103991.0     12 2019-10-07    NaN      NaN
8   103991.0     12 2019-10-08    NaN      NaN
9   103991.0     12 2019-10-09    NaN      NaN
10  103991.0     12 2019-10-10    NaN      NaN
12  103993.0      1 2019-10-01    0.0     10.0
13  103993.0      1 2019-10-02    1.0      9.0
14  103993.0      1 2019-10-03    NaN      NaN
15  103993.0      1 2019-10-04    1.0      8.0
16  103993.0      1 2019-10-05    0.0      8.0
17  103993.0      1 2019-10-06    NaN      NaN
18  103993.0      1 2019-10-07    NaN      NaN
19  103993.0      1 2019-10-08    NaN      NaN
20  103993.0      1 2019-10-09    NaN      NaN
21  103993.0      1 2019-10-10    NaN      NaN
23  103994.0      2 2019-10-01    0.0     12.0
24  103994.0      2 2019-10-02    1.0     11.0
25  103994.0      2 2019-10-03    NaN      NaN
26  103994.0      2 2019-10-04    1.0     10.0
27  103994.0      2 2019-10-05    0.0     10.0
28  103994.0      2 2019-10-06    NaN      NaN
29  103994.0      2 2019-10-07    NaN      NaN
30  103994.0      2 2019-10-08    NaN      NaN
31  103994.0      2 2019-10-09    NaN      NaN
32  103994.0      2 2019-10-10    NaN      NaN
