Optimizing pandas reindex date_range
I have the following situation:
A dataframe that shows every inventory movement (Buy/Sell) for each product and store.
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
4 2019-10-01 103994.0 002 0.0 12.0
5 2019-10-02 103994.0 002 1.0 11.0
6 2019-10-04 103994.0 002 1.0 10.0
7 2019-10-05 103994.0 002 0.0 10.0
8 2019-09-30 103991.0 012 0.0 12.0
9 2019-10-02 103991.0 012 1.0 11.0
10 2019-10-04 103991.0 012 1.0 10.0
11 2019-10-05 103991.0 012 0.0 10.0
Each product has a different start date; however, I want to bring each of them to the same end date.
Supposing today is 2019-10-08 and I want to update this dataframe, inserting rows for the days that were skipped between each product's first date and 2019-10-08.
Example:
Dataframe:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
The expected output should be:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-03 103993.0 001 NaN NaN
3 2019-10-04 103993.0 001 1.0 8.0
4 2019-10-05 103993.0 001 0.0 8.0
5 2019-10-06 103993.0 001 NaN NaN
6 2019-10-07 103993.0 001 NaN NaN
7 2019-10-08 103993.0 001 NaN NaN
In order to accomplish this I came up with two solutions:
dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')
    d.set_index('date', inplace=True)
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
df = pd.concat(dfs)
And later on:
v = '2019-10-08'
df = df.groupby(['sku', 'store'])[['date', 'Units', 'balance']] \
       .apply(lambda x: x.set_index('date')
                         .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))
However, it takes too long when I have a dataframe with 100000 products.
Do you have any ideas on how to improve this function, vectorizing with pandas?
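For reference, the first approach can be written as a self-contained, runnable snippet on toy data (the sample values below are illustrative, not the real inventory). Note that `reindex` leaves the group keys `sku`/`store` as NaN on the inserted rows, so they are restored explicitly:

```python
import pandas as pd

# Toy frame mirroring the structure above (values are hypothetical)
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-04', '2019-10-05']),
    'sku': [103993.0] * 4,
    'store': ['001'] * 4,
    'Units': [0.0, 1.0, 1.0, 0.0],
    'balance': [10.0, 9.0, 8.0, 8.0],
})

end_date = pd.Timestamp('2019-10-08')
dfs = []
for (sku, store), d in df.groupby(['sku', 'store']):
    # Reindex each group onto a daily range from its own first date to end_date
    start_date = d['date'].iloc[0]
    d = d.set_index('date').reindex(pd.date_range(start_date, end_date))
    # New rows are all-NaN, so restore the group keys on them
    d['sku'] = sku
    d['store'] = store
    dfs.append(d)

out = pd.concat(dfs).rename_axis('date').reset_index()
print(out)
```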
If I understand correctly, this is the type of thing you're trying to do. This may be faster, because you're not repeatedly concat'ing and appending the DF as a whole. Really not sure. You'll have to test it.
import pandas as pd
import numpy as np

print(df)
print("--------------")
def Insert_row(row_number, df, row_value):
    """
    From: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    """
    # Index labels of the upper half (rows before the insertion point)
    upper_half = list(range(0, row_number))
    # Index labels of the lower half, shifted down by one
    lower_half = [x + 1 for x in range(row_number, df.shape[0])]
    # Update the index of the dataframe, leaving row_number free
    df.index = upper_half + lower_half
    # Insert the new row at the now-free position
    df.loc[row_number] = row_value
    # Sort by index so the new row lands in place
    df = df.sort_index()
    return df
# First ensure the column is datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1  # Start at the SECOND row
for i in range(1, df.shape[0], 1):  # Loop through all the rows
    current_date = df.iloc[location]["date"]       # Date of the current row
    previous_date = df.iloc[location - 1]["date"]  # Date of the previous row
    try:  # Try to get the difference between the rows' dates
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D'))
    except ValueError as e:
        if "nan" in str(e).lower():
            continue
    if difference > 1:  # If the difference is more than one day
        # Start one day after the previous row's date
        newdate = pd.to_datetime(previous_date) + np.timedelta64(1, "D")
        for d in range(1, difference, 1):  # Loop over all missing days
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan]  # Create the row
            df = Insert_row(location, df, row_value)  # Insert the row
            location += 1  # Increment the location
            # Increment the date for the next loop iteration, if needed
            newdate = pd.to_datetime(newdate) + np.timedelta64(1, "D")
    location += 1
print(df)
OUTPUT:
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-04 103993.0 1.0 1.0 8.0
3 2019-10-05 103993.0 1.0 0.0 8.0
4 2019-10-06 103994.0 2.0 0.0 12.0
5 2019-10-07 103994.0 2.0 1.0 11.0
6 2019-10-10 103994.0 2.0 1.0 10.0
7 2019-10-15 103994.0 2.0 0.0 10.0
8 2019-10-30 103991.0 12.0 0.0 12.0
9 NaN NaN NaN NaN NaN
--------------
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-03 NaN NaN NaN NaN
3 2019-10-04 103993.0 1.0 1.0 8.0
4 2019-10-05 103993.0 1.0 0.0 8.0
5 2019-10-06 103994.0 2.0 0.0 12.0
6 2019-10-07 103994.0 2.0 1.0 11.0
7 2019-10-08 NaN NaN NaN NaN
8 2019-10-09 NaN NaN NaN NaN
9 2019-10-10 103994.0 2.0 1.0 10.0
10 2019-10-11 NaN NaN NaN NaN
11 2019-10-12 NaN NaN NaN NaN
12 2019-10-13 NaN NaN NaN NaN
13 2019-10-14 NaN NaN NaN NaN
14 2019-10-15 103994.0 2.0 0.0 10.0
15 2019-10-16 NaN NaN NaN NaN
16 2019-10-17 NaN NaN NaN NaN
17 2019-10-18 NaN NaN NaN NaN
18 2019-10-19 NaN NaN NaN NaN
19 2019-10-20 NaN NaN NaN NaN
20 2019-10-21 NaN NaN NaN NaN
21 2019-10-22 NaN NaN NaN NaN
22 2019-10-23 NaN NaN NaN NaN
23 2019-10-24 NaN NaN NaN NaN
24 2019-10-25 NaN NaN NaN NaN
25 2019-10-26 NaN NaN NaN NaN
26 2019-10-27 NaN NaN NaN NaN
27 2019-10-28 NaN NaN NaN NaN
28 2019-10-29 NaN NaN NaN NaN
29 2019-10-30 103991.0 12.0 0.0 12.0
30 2019-10-31 NaN NaN NaN NaN
31 2019-11-01 NaN NaN NaN NaN
32 2019-11-02 NaN NaN NaN NaN
33 2019-11-03 NaN NaN NaN NaN
34 2019-11-04 NaN NaN NaN NaN
35 2019-11-05 NaN NaN NaN NaN
36 2019-11-06 NaN NaN NaN NaN
37 2019-11-07 NaN NaN NaN NaN
38 2019-11-08 NaN NaN NaN NaN
39 2019-11-09 NaN NaN NaN NaN
40 2019-11-10 NaN NaN NaN NaN
41 2019-11-11 NaN NaN NaN NaN
42 2019-11-12 NaN NaN NaN NaN
43 2019-11-13 NaN NaN NaN NaN
44 NaT NaN NaN NaN NaN
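As an aside, if the dates are strictly increasing across the whole frame (as they happen to be in the sample above), the same gap-filling can be done in one vectorized step with `asfreq('D')`, which inserts an all-NaN row for every missing calendar day. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data with one missing day (2019-10-03)
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-04', '2019-10-05']),
    'sku': [103993.0, 103993.0, 103994.0, 103994.0],
    'store': [1.0, 1.0, 2.0, 2.0],
    'Units': [0.0, 1.0, 1.0, 0.0],
    'balance': [10.0, 9.0, 12.0, 11.0],
})

# asfreq('D') needs a unique, sorted DatetimeIndex; inserted rows are all-NaN
out = df.set_index('date').asfreq('D').reset_index()
print(out)
```

Like the row-insertion loop, this ignores product boundaries, so it only matches the behavior above when the dates are globally increasing across products.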
You can do all of this using pandas merge (or join) operations. A problem with this approach can arise when you have many 'products' ('sku', 'store' combinations) with many different 'total' dates (ranging from the minimum date of your dataframe to now).
The following assumes that your data is in df.
import datetime

import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']
# We will first save away the minimum date of for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)
# Then you find all possible product/date combinations, as said above, this might lead
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)
# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')
# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)
For your provided input data the output looks something like this:
sku store date Units balance
0 103991.0 12 2019-09-30 0.0 12.0
1 103991.0 12 2019-10-01 NaN NaN
2 103991.0 12 2019-10-02 1.0 11.0
3 103991.0 12 2019-10-03 NaN NaN
4 103991.0 12 2019-10-04 1.0 10.0
5 103991.0 12 2019-10-05 0.0 10.0
6 103991.0 12 2019-10-06 NaN NaN
7 103991.0 12 2019-10-07 NaN NaN
8 103991.0 12 2019-10-08 NaN NaN
9 103991.0 12 2019-10-09 NaN NaN
10 103991.0 12 2019-10-10 NaN NaN
12 103993.0 1 2019-10-01 0.0 10.0
13 103993.0 1 2019-10-02 1.0 9.0
14 103993.0 1 2019-10-03 NaN NaN
15 103993.0 1 2019-10-04 1.0 8.0
16 103993.0 1 2019-10-05 0.0 8.0
17 103993.0 1 2019-10-06 NaN NaN
18 103993.0 1 2019-10-07 NaN NaN
19 103993.0 1 2019-10-08 NaN NaN
20 103993.0 1 2019-10-09 NaN NaN
21 103993.0 1 2019-10-10 NaN NaN
23 103994.0 2 2019-10-01 0.0 12.0
24 103994.0 2 2019-10-02 1.0 11.0
25 103994.0 2 2019-10-03 NaN NaN
26 103994.0 2 2019-10-04 1.0 10.0
27 103994.0 2 2019-10-05 0.0 10.0
28 103994.0 2 2019-10-06 NaN NaN
29 103994.0 2 2019-10-07 NaN NaN
30 103994.0 2 2019-10-08 NaN NaN
31 103994.0 2 2019-10-09 NaN NaN
32 103994.0 2 2019-10-10 NaN NaN
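On pandas 1.2 or newer, the dummy-`key` trick above can be replaced by an explicit `merge(..., how='cross')`. A condensed sketch of the same pipeline, on toy data (the sample values are hypothetical):

```python
import pandas as pd

# Toy data: two products with different start dates
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-01', '2019-10-04', '2019-10-02', '2019-10-05']),
    'sku': [103993.0, 103993.0, 103994.0, 103994.0],
    'store': ['001', '001', '002', '002'],
    'Units': [0.0, 1.0, 1.0, 0.0],
    'balance': [10.0, 8.0, 11.0, 10.0],
})
END_DATE = pd.Timestamp('2019-10-06')

# All product/date combinations via an explicit cross join (pandas >= 1.2)
all_dates = pd.DataFrame({'date': pd.date_range(df['date'].min(), END_DATE)})
products = df[['sku', 'store']].drop_duplicates()
combos = products.merge(all_dates, how='cross')

# Left-join the real rows onto the full grid, then drop dates that fall
# before each product's own first movement
full = combos.merge(df, on=['sku', 'store', 'date'], how='left')
first_date = df.groupby(['sku', 'store'])['date'].min().rename('first_date')
full = full.join(first_date, on=['sku', 'store'])
full = full[full['date'] >= full['first_date']].drop(columns='first_date')
print(full)
```

As in the answer above, the full grid is `len(products) * len(dates)` rows before filtering, so memory still grows with the number of products and the overall date span.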