Optimizing pandas reindex date_range
I have the following situation:
A dataframe showing every stock movement (buy/sell) for each product and store.
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
4 2019-10-01 103994.0 002 0.0 12.0
5 2019-10-02 103994.0 002 1.0 11.0
6 2019-10-04 103994.0 002 1.0 10.0
7 2019-10-05 103994.0 002 0.0 10.0
8 2019-09-30 103991.0 012 0.0 12.0
9 2019-10-02 103991.0 012 1.0 11.0
10 2019-10-04 103991.0 012 1.0 10.0
11 2019-10-05 103991.0 012 0.0 10.0
Each product has a different start date, but I want to bring all of them up to the same end date.
Assuming today is 2019-10-08, I want to update this dataframe, inserting rows for the skipped days between each product's first date and 2019-10-08.
Example:
Dataframe:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
The expected output should be:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-03 103993.0 001 NaN NaN
3 2019-10-04 103993.0 001 1.0 8.0
4 2019-10-05 103993.0 001 0.0 8.0
5 2019-10-06 103993.0 001 NaN NaN
6 2019-10-07 103993.0 001 NaN NaN
7 2019-10-08 103993.0 001 NaN NaN
To achieve this, I came up with two solutions:
dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')
    d.set_index('date', inplace=True)
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
df = pd.concat(dfs)
And later:
v = '2019-10-08'
df = df.groupby(['sku', 'store'])['date', 'Units', 'balance'] \
       .apply(lambda x: x.set_index('date')
                         .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))
However, when I have a dataframe with 100,000 products, it takes far too long.
Do you have any ideas on how to improve this function with pandas vectorization?
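For reference, the first variant above is runnable as-is on the sample data; here is a minimal, self-contained reproduction (the today's date of 2019-10-08 is hard-coded, as in the question):

```python
import pandas as pd

# Sample data: one product/store from the question
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-04', '2019-10-05']),
    'sku': [103993.0] * 4,
    'store': ['001'] * 4,
    'Units': [0.0, 1.0, 1.0, 0.0],
    'balance': [10.0, 9.0, 8.0, 8.0],
})

dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]                 # first movement of this product
    end_date = pd.Timestamp('2019-10-08')       # common end date for all products
    d = d.set_index('date').reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
result = pd.concat(dfs)
```

This yields 8 daily rows per product, with NaN in Units/balance on the days that had no movement.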
If I understand correctly, this is what you're trying to do. It might be faster, since you are not repeatedly concatenating and appending whole DataFrames. I'm really not sure, though; you'd have to test it.
print(df)
print("--------------")

import pandas as pd
import numpy as np

def Insert_row(row_number, df, row_value):
    """
    from here: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    """
    # Starting value of upper half
    start_upper = 0
    # End value of upper half
    end_upper = row_number
    # Start value of lower half
    start_lower = row_number
    # End value of lower half
    end_lower = df.shape[0]
    # Create a list of upper_half index
    upper_half = [*range(start_upper, end_upper, 1)]
    # Create a list of lower_half index
    lower_half = [*range(start_lower, end_lower, 1)]
    # Increment the value of lower half by 1
    lower_half = [x + 1 for x in lower_half]
    # Combine the two lists
    index_ = upper_half + lower_half
    # Update the index of the dataframe
    df.index = index_
    # Insert a row at the end
    df.loc[row_number] = row_value
    # Sort the index labels
    df = df.sort_index()
    # return the dataframe
    return df

# First ensure the column is datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1  # Start at the SECOND row
for i in range(1, df.shape[0], 1):  # Loop through all the rows
    current_date = df.iloc[location]["date"]       # Date of the current row
    previous_date = df.iloc[location - 1]["date"]  # Date of the previous row
    try:  # Try to get a difference between the rows' dates
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D'))
    except ValueError as e:
        if "nan" in str(e).lower():  # must compare lowercase against the lowercased message
            continue
    # print(previous_date, " - ", current_date, "=", difference)
    if difference > 1:  # If the difference is more than one day
        newdate = pd.to_datetime(previous_date) + np.timedelta64(1, "D")  # First missing day
        for d in range(1, difference, 1):  # Loop for all missing rows
            # print("Inserting row with date {}".format(newdate))
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan]  # Create the row
            df = Insert_row(location, df, row_value)  # Insert the row
            location += 1  # Increment the location
            newdate = pd.to_datetime(newdate) + np.timedelta64(1, "D")  # Increment the date for the next loop if needed
    location += 1

print(df)
OUTPUT:
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-04 103993.0 1.0 1.0 8.0
3 2019-10-05 103993.0 1.0 0.0 8.0
4 2019-10-06 103994.0 2.0 0.0 12.0
5 2019-10-07 103994.0 2.0 1.0 11.0
6 2019-10-10 103994.0 2.0 1.0 10.0
7 2019-10-15 103994.0 2.0 0.0 10.0
8 2019-10-30 103991.0 12.0 0.0 12.0
9 NaT NaN NaN NaN NaN
--------------
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-03 NaN NaN NaN NaN
3 2019-10-04 103993.0 1.0 1.0 8.0
4 2019-10-05 103993.0 1.0 0.0 8.0
5 2019-10-06 103994.0 2.0 0.0 12.0
6 2019-10-07 103994.0 2.0 1.0 11.0
7 2019-10-08 NaN NaN NaN NaN
8 2019-10-09 NaN NaN NaN NaN
9 2019-10-10 103994.0 2.0 1.0 10.0
10 2019-10-11 NaN NaN NaN NaN
11 2019-10-12 NaN NaN NaN NaN
12 2019-10-13 NaN NaN NaN NaN
13 2019-10-14 NaN NaN NaN NaN
14 2019-10-15 103994.0 2.0 0.0 10.0
15 2019-10-16 NaN NaN NaN NaN
16 2019-10-17 NaN NaN NaN NaN
17 2019-10-18 NaN NaN NaN NaN
18 2019-10-19 NaN NaN NaN NaN
19 2019-10-20 NaN NaN NaN NaN
20 2019-10-21 NaN NaN NaN NaN
21 2019-10-22 NaN NaN NaN NaN
22 2019-10-23 NaN NaN NaN NaN
23 2019-10-24 NaN NaN NaN NaN
24 2019-10-25 NaN NaN NaN NaN
25 2019-10-26 NaN NaN NaN NaN
26 2019-10-27 NaN NaN NaN NaN
27 2019-10-28 NaN NaN NaN NaN
28 2019-10-29 NaN NaN NaN NaN
29 2019-10-30 103991.0 12.0 0.0 12.0
30 2019-10-31 NaN NaN NaN NaN
31 2019-11-01 NaN NaN NaN NaN
32 2019-11-02 NaN NaN NaN NaN
33 2019-11-03 NaN NaN NaN NaN
34 2019-11-04 NaN NaN NaN NaN
35 2019-11-05 NaN NaN NaN NaN
36 2019-11-06 NaN NaN NaN NaN
37 2019-11-07 NaN NaN NaN NaN
38 2019-11-08 NaN NaN NaN NaN
39 2019-11-09 NaN NaN NaN NaN
40 2019-11-10 NaN NaN NaN NaN
41 2019-11-11 NaN NaN NaN NaN
42 2019-11-12 NaN NaN NaN NaN
43 2019-11-13 NaN NaN NaN NaN
44 NaT NaN NaN NaN NaN
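The Insert_row helper can be sanity-checked in isolation; here is a minimal sketch (the toy frame and values are mine, not from the answer):

```python
import pandas as pd

def Insert_row(row_number, df, row_value):
    # Re-label the rows at or after row_number up by one, append the new
    # row under the freed label, then sort so it lands in the right spot
    upper_half = list(range(0, row_number))
    lower_half = [x + 1 for x in range(row_number, df.shape[0])]
    df.index = upper_half + lower_half
    df.loc[row_number] = row_value
    return df.sort_index()

df = pd.DataFrame({'a': [1, 3]})
df = Insert_row(1, df, [2])   # insert the value 2 between 1 and 3
```

After the call, df holds 1, 2, 3 with a clean sequential index.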
You can do all of this with pandas merge (or join) operations. A possible problem with this approach arises when you have many "products" ("sku", "store" combinations) with many different "total" dates (from the earliest date in the dataframe until now).
The following assumes your data is in df.
import datetime

import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']
# We will first save away the minimum date of for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)
# Then you find all possible product/date combinations, as said above, this might lead
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)
# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')
# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)
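As an aside, on pandas >= 1.2 the dummy 'key' column used above can be avoided: pd.merge supports how='cross' for exactly this kind of product/date grid (the key=0 trick works on all versions; the sample frames below are mine):

```python
import pandas as pd

all_dates = pd.DataFrame({'date': pd.date_range('2019-09-30', '2019-10-10')})
all_products = pd.DataFrame({'sku': [103991.0, 103993.0, 103994.0],
                             'store': ['012', '001', '002']})

# Every (date, product) pair in one call, no helper key column needed
combos = pd.merge(all_dates, all_products, how='cross')
```

With 11 dates and 3 products this produces the full 33-row grid, same as the key=0 merge.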
For the input data you provided, the output looks like this:
sku store date Units balance
0 103991.0 12 2019-09-30 0.0 12.0
1 103991.0 12 2019-10-01 NaN NaN
2 103991.0 12 2019-10-02 1.0 11.0
3 103991.0 12 2019-10-03 NaN NaN
4 103991.0 12 2019-10-04 1.0 10.0
5 103991.0 12 2019-10-05 0.0 10.0
6 103991.0 12 2019-10-06 NaN NaN
7 103991.0 12 2019-10-07 NaN NaN
8 103991.0 12 2019-10-08 NaN NaN
9 103991.0 12 2019-10-09 NaN NaN
10 103991.0 12 2019-10-10 NaN NaN
12 103993.0 1 2019-10-01 0.0 10.0
13 103993.0 1 2019-10-02 1.0 9.0
14 103993.0 1 2019-10-03 NaN NaN
15 103993.0 1 2019-10-04 1.0 8.0
16 103993.0 1 2019-10-05 0.0 8.0
17 103993.0 1 2019-10-06 NaN NaN
18 103993.0 1 2019-10-07 NaN NaN
19 103993.0 1 2019-10-08 NaN NaN
20 103993.0 1 2019-10-09 NaN NaN
21 103993.0 1 2019-10-10 NaN NaN
23 103994.0 2 2019-10-01 0.0 12.0
24 103994.0 2 2019-10-02 1.0 11.0
25 103994.0 2 2019-10-03 NaN NaN
26 103994.0 2 2019-10-04 1.0 10.0
27 103994.0 2 2019-10-05 0.0 10.0
28 103994.0 2 2019-10-06 NaN NaN
29 103994.0 2 2019-10-07 NaN NaN
30 103994.0 2 2019-10-08 NaN NaN
31 103994.0 2 2019-10-09 NaN NaN
32 103994.0 2 2019-10-10 NaN NaN
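The same grid-then-filter idea can also be expressed without the helper-key merge by building the full index up front and calling reindex once; here is a runnable sketch on a cut-down version of the sample data (variable names and the toy frame are mine):

```python
import pandas as pd

# Two products from the question's sample data
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-04', '2019-10-05',
                            '2019-09-30', '2019-10-02']),
    'sku': [103993.0] * 4 + [103991.0] * 2,
    'store': ['001'] * 4 + ['012'] * 2,
    'Units': [0.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'balance': [10.0, 9.0, 8.0, 8.0, 12.0, 11.0],
})

END_DATE = pd.Timestamp('2019-10-10')
dates = pd.date_range(df['date'].min(), END_DATE)

# Full (sku, store) x date grid as one MultiIndex -- same size caveat as above
products = df[['sku', 'store']].drop_duplicates()
full = pd.MultiIndex.from_tuples(
    [(s, st, d) for s, st in products.itertuples(index=False) for d in dates],
    names=['sku', 'store', 'date'],
)
out = df.set_index(['sku', 'store', 'date']).reindex(full).reset_index()

# Drop the days before each product's first movement
first = (df.groupby(['sku', 'store'], as_index=False)['date']
           .min().rename(columns={'date': 'first_date'}))
out = out.merge(first, on=['sku', 'store'])
out = out[out['date'] >= out['first_date']].drop(columns='first_date')
```

This keeps the single-reindex shape of the merge answer while making the "minimum date per product" filter explicit.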