I have the following situation:
A dataframe that shows every inventory movement (Buy/Sell) for each product and store.
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
4 2019-10-01 103994.0 002 0.0 12.0
5 2019-10-02 103994.0 002 1.0 11.0
6 2019-10-04 103994.0 002 1.0 10.0
7 2019-10-05 103994.0 002 0.0 10.0
8 2019-09-30 103991.0 012 0.0 12.0
9 2019-10-02 103991.0 012 1.0 11.0
10 2019-10-04 103991.0 012 1.0 10.0
11 2019-10-05 103991.0 012 0.0 10.0
Each product has a different start date; however, I want to bring all of them to the same end date.
Suppose today is 2019-10-08 and I want to update this dataframe, inserting rows for the skipped days between each product's first date and 2019-10-08.
Example:
Dataframe:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
The expected output should be:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-03 103993.0 001 NaN NaN
3 2019-10-04 103993.0 001 1.0 8.0
4 2019-10-05 103993.0 001 0.0 8.0
5 2019-10-06 103993.0 001 NaN NaN
6 2019-10-07 103993.0 001 NaN NaN
7 2019-10-08 103993.0 001 NaN NaN
In order to accomplish this I came up with two solutions:
dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')
    d.set_index('date', inplace=True)
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
df = pd.concat(dfs)
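Note that, as written, the reindexed rows also end up with NaN in sku and store, not just in Units and balance. If those columns should be carried over as in the expected output above, one possible tweak (a sketch, writing each group's key back onto the new rows) is:
dfs = []
for (sku, store), d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')
    d = d.set_index('date').reindex(pd.date_range(start_date, end_date))
    # write the group key back so the inserted rows keep their sku/store
    d['sku'] = sku
    d['store'] = store
    dfs.append(d)
df = pd.concat(dfs)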
And later on:
v = '2019-10-08'
df = df.groupby(['sku', 'store'])[['date', 'Units', 'balance']] \
       .apply(lambda x: x.set_index('date')
                         .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))
However, it takes too long when I have a dataframe with 100,000 products.
Do you have any ideas for improving this, e.g. by vectorizing it with pandas?
If I understand correctly, this is the type of thing you're trying to do. It may be faster, because it doesn't repeatedly concat and append the whole DataFrame, but I'm really not sure; you'll have to test it.
import pandas as pd
import numpy as np

print(df)
print("--------------")

def Insert_row(row_number, df, row_value):
    """
    from here: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    """
    # Starting value of upper half
    start_upper = 0
    # End value of upper half
    end_upper = row_number
    # Start value of lower half
    start_lower = row_number
    # End value of lower half
    end_lower = df.shape[0]
    # Create a list of upper_half index
    upper_half = [*range(start_upper, end_upper, 1)]
    # Create a list of lower_half index
    lower_half = [*range(start_lower, end_lower, 1)]
    # Increment the value of lower half by 1
    lower_half = [x + 1 for x in lower_half]
    # Combine the two lists
    index_ = upper_half + lower_half
    # Update the index of the dataframe
    df.index = index_
    # Insert a row at the end
    df.loc[row_number] = row_value
    # Sort the index labels
    df = df.sort_index()
    # return the dataframe
    return df

# First ensure the column holds datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1                                        # Start at the SECOND row
for i in range(1, df.shape[0], 1):                  # Loop through all the rows
    current_date = df.iloc[location]["date"]        # Date of the current row
    previous_date = df.iloc[location - 1]["date"]   # Date of the previous row
    try:                                            # Try to get the difference between the rows' dates
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D'))
    except ValueError as e:                         # NaT dates produce a NaN difference
        if "nan" in str(e).lower():
            continue
    # print(previous_date, " - ", current_date, "=", difference)
    if difference > 1:                              # If the difference is more than one day
        newdate = pd.to_datetime(previous_date) + np.timedelta64(1, "D")  # Increment the date by one day
        for d in range(1, difference, 1):           # Loop over all missing rows
            # print("Inserting row with date {}".format(newdate))
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan]  # Create the row
            df = Insert_row(location, df, row_value)               # Insert the row
            location += 1                           # Increment the location
            newdate = pd.to_datetime(newdate) + np.timedelta64(1, "D")  # Increment the date for the next loop iteration if needed
    location += 1

print(df)
OUTPUT:
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-04 103993.0 1.0 1.0 8.0
3 2019-10-05 103993.0 1.0 0.0 8.0
4 2019-10-06 103994.0 2.0 0.0 12.0
5 2019-10-07 103994.0 2.0 1.0 11.0
6 2019-10-10 103994.0 2.0 1.0 10.0
7 2019-10-15 103994.0 2.0 0.0 10.0
8 2019-10-30 103991.0 12.0 0.0 12.0
9 NaT NaN NaN NaN NaN
--------------
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-03 NaN NaN NaN NaN
3 2019-10-04 103993.0 1.0 1.0 8.0
4 2019-10-05 103993.0 1.0 0.0 8.0
5 2019-10-06 103994.0 2.0 0.0 12.0
6 2019-10-07 103994.0 2.0 1.0 11.0
7 2019-10-08 NaN NaN NaN NaN
8 2019-10-09 NaN NaN NaN NaN
9 2019-10-10 103994.0 2.0 1.0 10.0
10 2019-10-11 NaN NaN NaN NaN
11 2019-10-12 NaN NaN NaN NaN
12 2019-10-13 NaN NaN NaN NaN
13 2019-10-14 NaN NaN NaN NaN
14 2019-10-15 103994.0 2.0 0.0 10.0
15 2019-10-16 NaN NaN NaN NaN
16 2019-10-17 NaN NaN NaN NaN
17 2019-10-18 NaN NaN NaN NaN
18 2019-10-19 NaN NaN NaN NaN
19 2019-10-20 NaN NaN NaN NaN
20 2019-10-21 NaN NaN NaN NaN
21 2019-10-22 NaN NaN NaN NaN
22 2019-10-23 NaN NaN NaN NaN
23 2019-10-24 NaN NaN NaN NaN
24 2019-10-25 NaN NaN NaN NaN
25 2019-10-26 NaN NaN NaN NaN
26 2019-10-27 NaN NaN NaN NaN
27 2019-10-28 NaN NaN NaN NaN
28 2019-10-29 NaN NaN NaN NaN
29 2019-10-30 103991.0 12.0 0.0 12.0
30 2019-10-31 NaN NaN NaN NaN
31 2019-11-01 NaN NaN NaN NaN
32 2019-11-02 NaN NaN NaN NaN
33 2019-11-03 NaN NaN NaN NaN
34 2019-11-04 NaN NaN NaN NaN
35 2019-11-05 NaN NaN NaN NaN
36 2019-11-06 NaN NaN NaN NaN
37 2019-11-07 NaN NaN NaN NaN
38 2019-11-08 NaN NaN NaN NaN
39 2019-11-09 NaN NaN NaN NaN
40 2019-11-10 NaN NaN NaN NaN
41 2019-11-11 NaN NaN NaN NaN
42 2019-11-12 NaN NaN NaN NaN
43 2019-11-13 NaN NaN NaN NaN
44 NaT NaN NaN NaN NaN
You can do all of this using pandas merge (or join) operations. A problem of this approach can arise when you have many 'products' ('sku', 'store' combinations) with many different 'total' dates (ranging from the minimum date of your dataframe to now).
The following assumes that your data is in df.
import datetime
import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']
# We will first save away the minimum date of for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)
# Then you find all possible product/date combinations, as said above, this might lead
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)
# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')
# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)
For your provided input data the output looks something like this:
sku store date Units balance
0 103991.0 12 2019-09-30 0.0 12.0
1 103991.0 12 2019-10-01 NaN NaN
2 103991.0 12 2019-10-02 1.0 11.0
3 103991.0 12 2019-10-03 NaN NaN
4 103991.0 12 2019-10-04 1.0 10.0
5 103991.0 12 2019-10-05 0.0 10.0
6 103991.0 12 2019-10-06 NaN NaN
7 103991.0 12 2019-10-07 NaN NaN
8 103991.0 12 2019-10-08 NaN NaN
9 103991.0 12 2019-10-09 NaN NaN
10 103991.0 12 2019-10-10 NaN NaN
12 103993.0 1 2019-10-01 0.0 10.0
13 103993.0 1 2019-10-02 1.0 9.0
14 103993.0 1 2019-10-03 NaN NaN
15 103993.0 1 2019-10-04 1.0 8.0
16 103993.0 1 2019-10-05 0.0 8.0
17 103993.0 1 2019-10-06 NaN NaN
18 103993.0 1 2019-10-07 NaN NaN
19 103993.0 1 2019-10-08 NaN NaN
20 103993.0 1 2019-10-09 NaN NaN
21 103993.0 1 2019-10-10 NaN NaN
23 103994.0 2 2019-10-01 0.0 12.0
24 103994.0 2 2019-10-02 1.0 11.0
25 103994.0 2 2019-10-03 NaN NaN
26 103994.0 2 2019-10-04 1.0 10.0
27 103994.0 2 2019-10-05 0.0 10.0
28 103994.0 2 2019-10-06 NaN NaN
29 103994.0 2 2019-10-07 NaN NaN
30 103994.0 2 2019-10-08 NaN NaN
31 103994.0 2 2019-10-09 NaN NaN
32 103994.0 2 2019-10-10 NaN NaN
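As a side note, if you are on pandas 1.2 or newer, the dummy 'key' column trick can be replaced by a cross merge. A minimal sketch, reusing the all_dates and all_products frames from above (without adding the 'key' columns):
# pandas >= 1.2: merge supports how='cross' for a full cartesian product
all_product_date_combinations = pd.merge(all_dates, all_products, how='cross')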