
pandas shift dependent on column value

I have a pandas dataframe with 5 different columns:

Product_ID, Start_Date, End_Date, Turnover, cumcount

The Product_IDs are not unique, so a product can appear several times; cumcount keeps track of the occurrence within each product, so it goes from 0 to 5. The table is sorted by Product_ID and Start_Date.

Because the date ranges of the same Product_ID can overlap, I only want to include the occurrences that fall outside of the first one, i.e. that start after the first occurrence has ended.

The code snippet is as follows:

df = df.sort_values(by=["Product_ID", "Start_Date"])

check1 = df["Product_ID"] == df["Product_ID"].shift(1)

conditions = [
    check1 & (df["End_Date"].shift(df["cumcount"]) > df["Start_Date"]),
    check1 & (df["End_Date"].shift(df["cumcount"]) < df["Start_Date"]),
    ~check1,
]

choices = [0, 1, 1]

df["result"] = np.select(conditions, choices)

The idea is that each row is shifted back by as many rows as its occurrence count, so that its Start_Date is compared with the End_Date of the first occurrence of the same product.

When I execute this I get a ValueError:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Does anyone have any ideas how I could make this work (without hardcoding 1/2)?

Edit: Sample of the data

{'Product_ID': {0: 'CJ48HL',
  1: 'CL23P3',
  2: 'CL5WKS',
  3: 'DA0AAM',
  4: 'DA0AAM'},
 'Start_Date': {0: Timestamp('2022-02-11 00:00:00'),
  1: Timestamp('2022-11-11 00:00:00'),
  2: Timestamp('2022-10-24 00:00:00'),
  3: Timestamp('2022-04-01 00:00:00'),
  4: Timestamp('2022-04-06 00:00:00')},
 'Turnover': {0: 1143845.0,
  1: 512476.0,
  2: 178382.0,
  3: 2104083.0,
  4: 1300434.0},
 'count': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
 'End_Date': {0: Timestamp('2022-02-25 00:00:00'),
  1: Timestamp('2022-11-25 00:00:00'),
  2: Timestamp('2022-11-07 00:00:00'),
  3: Timestamp('2022-04-15 00:00:00'),
  4: Timestamp('2022-04-20 00:00:00')}}

Edit2: Desired Output

   Product_ID  Start_Date Turnover    count  End_Date   result
0     CJ48HL   2022-02-11  1143845.0    0   2022-02-25    1
1     CL23P3   2022-11-11   512476.0    0   2022-11-25    1
2     CL5WKS   2022-10-24   178382.0    0   2022-11-07    1
3     DA0AAM   2022-04-01  2104083.0    0   2022-04-15    1
4     DA0AAM   2022-04-06  1300434.0    1   2022-04-20    0
5     DA0AAM   2022-04-10  1451521.0    2   2022-04-24    0
6     DA0AAM   2022-04-20  2501520.0    3   2022-05-04    1

If I understand correctly what you want, the code below should solve your problem:

import numpy as np

# Sort by Product_ID and Start_Date
df.sort_values(by=['Product_ID', 'Start_Date'], ignore_index=True, inplace=True)

# Create a helper column holding the Product_ID of the next row, then compare.
# If the next row has the same Product_ID the result is 0.0, otherwise it is 1.0.
df['Product_lead1'] = df['Product_ID'].shift(-1)
df['result'] = np.where(df['Product_ID'] != df['Product_lead1'], 1.0, 0.0)
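As far as I can tell, the ValueError in the question comes from passing a whole Series as the number of periods: shift expects one integer, not a different value per row. Shifting back by cumcount always lands on the first occurrence of the same product, so a possible alternative is to look up that first End_Date per group. The following is only a sketch under that reading of the question. The data mirrors the question's sample plus the extra DA0AAM rows from the desired output, cumcount is recomputed here with groupby().cumcount(), and the names first_end, first_end_by_pos and is_first are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: the question's sample plus the two extra DA0AAM rows
# shown in the desired output (Turnover omitted, it is not needed here).
df = pd.DataFrame({
    "Product_ID": ["CJ48HL", "CL23P3", "CL5WKS",
                   "DA0AAM", "DA0AAM", "DA0AAM", "DA0AAM"],
    "Start_Date": pd.to_datetime(["2022-02-11", "2022-11-11", "2022-10-24",
                                  "2022-04-01", "2022-04-06", "2022-04-10",
                                  "2022-04-20"]),
    "End_Date": pd.to_datetime(["2022-02-25", "2022-11-25", "2022-11-07",
                                "2022-04-15", "2022-04-20", "2022-04-24",
                                "2022-05-04"]),
})

df = df.sort_values(["Product_ID", "Start_Date"], ignore_index=True)
df["cumcount"] = df.groupby("Product_ID").cumcount()

# Variant 1: End_Date of the first occurrence of each product,
# broadcast to every row of that product.
first_end = df.groupby("Product_ID")["End_Date"].transform("first")

# Variant 2: a literal "shift by cumcount" done with positional indexing.
# Row i looks at row i - cumcount[i], which is the first row of its group.
pos = np.arange(len(df)) - df["cumcount"].to_numpy()
first_end_by_pos = df["End_Date"].to_numpy()[pos]

# Keep the first occurrence, plus later ones that start after it has ended.
is_first = df["cumcount"].eq(0)
df["result"] = (is_first | (df["Start_Date"] > first_end)).astype(int)

print(df[["Product_ID", "Start_Date", "End_Date", "cumcount", "result"]])
```

Both variants give the same End_Date to compare against; the groupby version does not depend on cumcount being consistent with the row order, while the positional version assumes the frame is sorted and has a fresh 0..n-1 index. A row whose Start_Date equals the first End_Date gets 0 here, which matches the strict comparisons in the question but is a choice you may want to change.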

