Optimizing pandas reindex date_range
I have the following situation:
A dataframe showing every stock movement (buy/sell) for each product and store.
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
4 2019-10-01 103994.0 002 0.0 12.0
5 2019-10-02 103994.0 002 1.0 11.0
6 2019-10-04 103994.0 002 1.0 10.0
7 2019-10-05 103994.0 002 0.0 10.0
8 2019-09-30 103991.0 012 0.0 12.0
9 2019-10-02 103991.0 012 1.0 11.0
10 2019-10-04 103991.0 012 1.0 10.0
11 2019-10-05 103991.0 012 0.0 10.0
Each product has a different start date, but I want to bring all of them up to the same end date.
Assuming today is 2019-10-08, I want to update this dataframe, inserting rows for the skipped days between each product's first date and 2019-10-08.
Example:
Dataframe:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
The expected output should be:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-03 103993.0 001 NaN NaN
3 2019-10-04 103993.0 001 1.0 8.0
4 2019-10-05 103993.0 001 0.0 8.0
5 2019-10-06 103993.0 001 NaN NaN
6 2019-10-07 103993.0 001 NaN NaN
7 2019-10-08 103993.0 001 NaN NaN
To achieve this, I came up with two solutions:
dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')
    d.set_index('date', inplace=True)
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
df = pd.concat(dfs)
And later:
v = '2019-10-08'
df = df.groupby(['sku', 'store'])['date', 'Units', 'balance'] \
       .apply(lambda x: x.set_index('date')
                         .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))
However, when I have a dataframe with 100,000 products, it takes far too long.
Do you have any ideas on how to improve this function with pandas vectorization?
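For reference, the first variant above is runnable as-is on the sample data; here is a minimal, self-contained reproduction (the today's date of 2019-10-08 is hard-coded, as in the question):

```python
import pandas as pd

# Sample data: one product/store from the question
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-04', '2019-10-05']),
    'sku': [103993.0] * 4,
    'store': ['001'] * 4,
    'Units': [0.0, 1.0, 1.0, 0.0],
    'balance': [10.0, 9.0, 8.0, 8.0],
})

dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]                 # first movement of this product
    end_date = pd.Timestamp('2019-10-08')       # common end date for all products
    d = d.set_index('date').reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
result = pd.concat(dfs)
```

This yields 8 daily rows per product, with NaN in Units/balance on the days that had no movement.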
If I understand correctly, this is what you're trying to do. It might be faster, since you are not repeatedly concatenating and appending whole DataFrames. I'm really not sure, though; you'd have to test it.
print(df)
print("--------------")

import pandas as pd
import numpy as np

def Insert_row(row_number, df, row_value):
    """
    from here: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    """
    # Starting value of upper half
    start_upper = 0
    # End value of upper half
    end_upper = row_number
    # Start value of lower half
    start_lower = row_number
    # End value of lower half
    end_lower = df.shape[0]
    # Create a list of upper_half index
    upper_half = [*range(start_upper, end_upper, 1)]
    # Create a list of lower_half index
    lower_half = [*range(start_lower, end_lower, 1)]
    # Increment the value of lower half by 1
    lower_half = [x + 1 for x in lower_half]
    # Combine the two lists
    index_ = upper_half + lower_half
    # Update the index of the dataframe
    df.index = index_
    # Insert a row at the end
    df.loc[row_number] = row_value
    # Sort the index labels
    df = df.sort_index()
    # return the dataframe
    return df

# First ensure the column is datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1  # Start at the SECOND row
for i in range(1, df.shape[0], 1):  # Loop through all the rows
    current_date = df.iloc[location]["date"]       # Date of the current row
    previous_date = df.iloc[location - 1]["date"]  # Date of the previous row
    try:  # Try to get a difference between the rows' dates
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D'))
    except ValueError as e:
        if "nan" in str(e).lower():  # must compare lowercase against the lowercased message
            continue
    # print(previous_date, " - ", current_date, "=", difference)
    if difference > 1:  # If the difference is more than one day
        newdate = pd.to_datetime(previous_date) + np.timedelta64(1, "D")  # First missing day
        for d in range(1, difference, 1):  # Loop for all missing rows
            # print("Inserting row with date {}".format(newdate))
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan]  # Create the row
            df = Insert_row(location, df, row_value)  # Insert the row
            location += 1  # Increment the location
            newdate = pd.to_datetime(newdate) + np.timedelta64(1, "D")  # Increment the date for the next loop if needed
    location += 1

print(df)
OUTPUT:
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-04 103993.0 1.0 1.0 8.0
3 2019-10-05 103993.0 1.0 0.0 8.0
4 2019-10-06 103994.0 2.0 0.0 12.0
5 2019-10-07 103994.0 2.0 1.0 11.0
6 2019-10-10 103994.0 2.0 1.0 10.0
7 2019-10-15 103994.0 2.0 0.0 10.0
8 2019-10-30 103991.0 12.0 0.0 12.0
9 NaT NaN NaN NaN NaN
--------------
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-03 NaN NaN NaN NaN
3 2019-10-04 103993.0 1.0 1.0 8.0
4 2019-10-05 103993.0 1.0 0.0 8.0
5 2019-10-06 103994.0 2.0 0.0 12.0
6 2019-10-07 103994.0 2.0 1.0 11.0
7 2019-10-08 NaN NaN NaN NaN
8 2019-10-09 NaN NaN NaN NaN
9 2019-10-10 103994.0 2.0 1.0 10.0
10 2019-10-11 NaN NaN NaN NaN
11 2019-10-12 NaN NaN NaN NaN
12 2019-10-13 NaN NaN NaN NaN
13 2019-10-14 NaN NaN NaN NaN
14 2019-10-15 103994.0 2.0 0.0 10.0
15 2019-10-16 NaN NaN NaN NaN
16 2019-10-17 NaN NaN NaN NaN
17 2019-10-18 NaN NaN NaN NaN
18 2019-10-19 NaN NaN NaN NaN
19 2019-10-20 NaN NaN NaN NaN
20 2019-10-21 NaN NaN NaN NaN
21 2019-10-22 NaN NaN NaN NaN
22 2019-10-23 NaN NaN NaN NaN
23 2019-10-24 NaN NaN NaN NaN
24 2019-10-25 NaN NaN NaN NaN
25 2019-10-26 NaN NaN NaN NaN
26 2019-10-27 NaN NaN NaN NaN
27 2019-10-28 NaN NaN NaN NaN
28 2019-10-29 NaN NaN NaN NaN
29 2019-10-30 103991.0 12.0 0.0 12.0
30 2019-10-31 NaN NaN NaN NaN
31 2019-11-01 NaN NaN NaN NaN
32 2019-11-02 NaN NaN NaN NaN
33 2019-11-03 NaN NaN NaN NaN
34 2019-11-04 NaN NaN NaN NaN
35 2019-11-05 NaN NaN NaN NaN
36 2019-11-06 NaN NaN NaN NaN
37 2019-11-07 NaN NaN NaN NaN
38 2019-11-08 NaN NaN NaN NaN
39 2019-11-09 NaN NaN NaN NaN
40 2019-11-10 NaN NaN NaN NaN
41 2019-11-11 NaN NaN NaN NaN
42 2019-11-12 NaN NaN NaN NaN
43 2019-11-13 NaN NaN NaN NaN
44 NaT NaN NaN NaN NaN
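The Insert_row helper can be sanity-checked in isolation; here is a minimal sketch (the toy frame and values are mine, not from the answer):

```python
import pandas as pd

def Insert_row(row_number, df, row_value):
    # Re-label the rows at or after row_number up by one, append the new
    # row under the freed label, then sort so it lands in the right spot
    upper_half = list(range(0, row_number))
    lower_half = [x + 1 for x in range(row_number, df.shape[0])]
    df.index = upper_half + lower_half
    df.loc[row_number] = row_value
    return df.sort_index()

df = pd.DataFrame({'a': [1, 3]})
df = Insert_row(1, df, [2])   # insert the value 2 between 1 and 3
```

After the call, df holds 1, 2, 3 with a clean sequential index.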
You can do all of this with pandas merge (or join) operations. A possible problem with this approach arises when you have many "products" ("sku", "store" combinations) with many different "total" dates (from the earliest date in the dataframe until now).
The following assumes your data is in df.
import datetime

import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']
# We will first save away the minimum date of for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)
# Then you find all possible product/date combinations, as said above, this might lead
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)
# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')
# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)
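As an aside, on pandas >= 1.2 the dummy 'key' column used above can be avoided: pd.merge supports how='cross' for exactly this kind of product/date grid (the key=0 trick works on all versions; the sample frames below are mine):

```python
import pandas as pd

all_dates = pd.DataFrame({'date': pd.date_range('2019-09-30', '2019-10-10')})
all_products = pd.DataFrame({'sku': [103991.0, 103993.0, 103994.0],
                             'store': ['012', '001', '002']})

# Every (date, product) pair in one call, no helper key column needed
combos = pd.merge(all_dates, all_products, how='cross')
```

With 11 dates and 3 products this produces the full 33-row grid, same as the key=0 merge.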
For the input data you provided, the output looks like this:
sku store date Units balance
0 103991.0 12 2019-09-30 0.0 12.0
1 103991.0 12 2019-10-01 NaN NaN
2 103991.0 12 2019-10-02 1.0 11.0
3 103991.0 12 2019-10-03 NaN NaN
4 103991.0 12 2019-10-04 1.0 10.0
5 103991.0 12 2019-10-05 0.0 10.0
6 103991.0 12 2019-10-06 NaN NaN
7 103991.0 12 2019-10-07 NaN NaN
8 103991.0 12 2019-10-08 NaN NaN
9 103991.0 12 2019-10-09 NaN NaN
10 103991.0 12 2019-10-10 NaN NaN
12 103993.0 1 2019-10-01 0.0 10.0
13 103993.0 1 2019-10-02 1.0 9.0
14 103993.0 1 2019-10-03 NaN NaN
15 103993.0 1 2019-10-04 1.0 8.0
16 103993.0 1 2019-10-05 0.0 8.0
17 103993.0 1 2019-10-06 NaN NaN
18 103993.0 1 2019-10-07 NaN NaN
19 103993.0 1 2019-10-08 NaN NaN
20 103993.0 1 2019-10-09 NaN NaN
21 103993.0 1 2019-10-10 NaN NaN
23 103994.0 2 2019-10-01 0.0 12.0
24 103994.0 2 2019-10-02 1.0 11.0
25 103994.0 2 2019-10-03 NaN NaN
26 103994.0 2 2019-10-04 1.0 10.0
27 103994.0 2 2019-10-05 0.0 10.0
28 103994.0 2 2019-10-06 NaN NaN
29 103994.0 2 2019-10-07 NaN NaN
30 103994.0 2 2019-10-08 NaN NaN
31 103994.0 2 2019-10-09 NaN NaN
32 103994.0 2 2019-10-10 NaN NaN
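The same grid-then-filter idea can also be expressed without the helper-key merge by building the full index up front and calling reindex once; here is a runnable sketch on a cut-down version of the sample data (variable names and the toy frame are mine):

```python
import pandas as pd

# Two products from the question's sample data
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-04', '2019-10-05',
                            '2019-09-30', '2019-10-02']),
    'sku': [103993.0] * 4 + [103991.0] * 2,
    'store': ['001'] * 4 + ['012'] * 2,
    'Units': [0.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'balance': [10.0, 9.0, 8.0, 8.0, 12.0, 11.0],
})

END_DATE = pd.Timestamp('2019-10-10')
dates = pd.date_range(df['date'].min(), END_DATE)

# Full (sku, store) x date grid as one MultiIndex -- same size caveat as above
products = df[['sku', 'store']].drop_duplicates()
full = pd.MultiIndex.from_tuples(
    [(s, st, d) for s, st in products.itertuples(index=False) for d in dates],
    names=['sku', 'store', 'date'],
)
out = df.set_index(['sku', 'store', 'date']).reindex(full).reset_index()

# Drop the days before each product's first movement
first = (df.groupby(['sku', 'store'], as_index=False)['date']
           .min().rename(columns={'date': 'first_date'}))
out = out.merge(first, on=['sku', 'store'])
out = out[out['date'] >= out['first_date']].drop(columns='first_date')
```

This keeps the single-reindex shape of the merge answer while making the "minimum date per product" filter explicit.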