I have a pandas dataframe with 5 different columns:
Product_ID, Start_Date, End_Date, Turnover, cumcount
The Product_IDs are not unique, so cumcount keeps track of the occurrence number of each Product_ID; it goes from 0 to 5. The table is sorted by Product_ID and Start_Date.
Because the date ranges of the same Product_ID can overlap, I only want to include the occurrences that fall outside the first one.
The code snippet is as follows:
import numpy as np

df = df.sort_values(by=["Product_ID", "Start_Date"])
# True where the row has the same Product_ID as the previous row
check1 = df["Product_ID"] == df["Product_ID"].shift(1)
conditions = [
    check1 & (df["End_Date"].shift(df["cumcount"]) > df["Start_Date"]),
    check1 & (df["End_Date"].shift(df["cumcount"]) < df["Start_Date"]),
    ~check1,
]
choices = [0, 1, 1]
df["result"] = np.select(conditions, choices)
The idea is to shift back by as many rows as there are previous occurrences, so that each row is checked against the first occurrence of its Product_ID.
When I execute this I get a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Does anyone have any ideas how I could make this work (without hardcoding 1/2)?
Edit: Sample of the data
{'Product_ID': {0: 'CJ48HL',
1: 'CL23P3',
2: 'CL5WKS',
3: 'DA0AAM',
4: 'DA0AAM'},
'Start_Date': {0: Timestamp('2022-02-11 00:00:00'),
1: Timestamp('2022-11-11 00:00:00'),
2: Timestamp('2022-10-24 00:00:00'),
3: Timestamp('2022-04-01 00:00:00'),
4: Timestamp('2022-04-06 00:00:00')},
'Turnover': {0: 1143845.0,
1: 512476.0,
2: 178382.0,
3: 2104083.0,
4: 1300434.0},
'count': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
'End_Date': {0: Timestamp('2022-02-25 00:00:00'),
1: Timestamp('2022-11-25 00:00:00'),
2: Timestamp('2022-11-07 00:00:00'),
3: Timestamp('2022-04-15 00:00:00'),
4: Timestamp('2022-04-20 00:00:00')}}
Edit2: Desired Output
  Product_ID  Start_Date   Turnover  count    End_Date  result
0     CJ48HL  2022-02-11  1143845.0      0  2022-02-25       1
1     CL23P3  2022-02-11   512476.0      0  2022-11-07       1
2     CL5WKS  2022-10-24   178382.0      0  2022-11-07       1
3     DA0AAM  2022-04-01  2104083.0      0  2022-04-15       1
4     DA0AAM  2022-04-06  1300434.0      1  2022-04-20       0
5     DA0AAM  2022-04-10  1451521.0      2  2022-04-24       0
6     DA0AAM  2022-04-20  2501520.0      3  2022-05-04       1
If I understand correctly what you want, the code below should solve your problem:
import numpy as np

# Sort by Product_ID and Start_Date
df.sort_values(by=['Product_ID', 'Start_Date'], ignore_index=True, inplace=True)
# Add a column holding the Product_ID of the next row
df['Product_lead1'] = df['Product_ID'].shift(-1)
# result is 0.0 if the next row has the same Product_ID, otherwise 1.0
df['result'] = np.where(df['Product_ID'] != df['Product_lead1'], 1.0, 0.0)
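Note that the snippet above only compares each row with the Product_ID of the next row; it does not look at the dates. If the goal is the one described in the question (keep the first occurrence, plus any later occurrence that starts after the first one has ended), a possible sketch is to broadcast the first End_Date of each product to all of its rows with groupby(...).transform("first") instead of a per-row shift. This is only an illustration under that assumption: the small frame below is rebuilt from the desired-output table, and the name first_end is introduced here and is not in the original post. The ValueError in the question presumably arises because shift expects a fixed number of periods rather than a Series of per-row offsets.

```python
import pandas as pd

# Illustrative rows copied from the desired-output table in the question.
df = pd.DataFrame({
    "Product_ID": ["CJ48HL", "DA0AAM", "DA0AAM", "DA0AAM", "DA0AAM"],
    "Start_Date": pd.to_datetime(
        ["2022-02-11", "2022-04-01", "2022-04-06", "2022-04-10", "2022-04-20"]
    ),
    "End_Date": pd.to_datetime(
        ["2022-02-25", "2022-04-15", "2022-04-20", "2022-04-24", "2022-05-04"]
    ),
})
df = df.sort_values(by=["Product_ID", "Start_Date"], ignore_index=True)

# Occurrence counter per product (0 for the first row of each Product_ID).
df["cumcount"] = df.groupby("Product_ID").cumcount()

# End_Date of the first occurrence, repeated on every row of the same product.
# (first_end is a hypothetical helper name used only in this sketch.)
first_end = df.groupby("Product_ID")["End_Date"].transform("first")

# Keep the first occurrence, and any later one that starts after the first ended.
df["result"] = ((df["cumcount"] == 0) | (df["Start_Date"] > first_end)).astype(int)
print(df)
```

On these rows the DA0AAM occurrences starting on 2022-04-06 and 2022-04-10 fall inside the first range (ending 2022-04-15) and get 0, while the one starting on 2022-04-20 gets 1, which agrees with the desired output shown in the question. The strict > mirrors the comparison in the question's own conditions; whether a start date equal to the first end date should count as overlapping is left as an assumption.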