
Define new column based on matching values between multiple columns in two dataframes

I'm currently trying to define a class label for a dataset I'm building. I have two different datasets that I need to consult, with df_port_call being the one that will ultimately contain the class label.

A row in df_port_call should receive a class label of 1 when both if conditions below are satisfied. In other words, if any row in df_deficiency matches those conditions, the Class column in df_port_call should be set to 1. I'm not sure how to vectorize this, and the loop is very slow (it would take about 8 days to finish). Any assistance here would be great!

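from tqdm import tqdm

df_port_call["Class"] = 0

for index, row in tqdm(df_port_call.iterrows(), total=len(df_port_call)):
    for index_def, row_def in df_deficiency.iterrows():
        # Same vessel: MMSI or IMO equals the primary VIN, or the names match
        if row['MMSI'] == row_def['Primary VIN'] or row['IMO'] == row_def['Primary VIN'] or row['SHIP NAME'] == row_def['Vessel Name']:
            # Inspection date falls inside the port call window
            if row['ARRIVAL IN USA (UTC)'] <= row_def['Inspection Date'] <= row['DEPARTURE (UTC)']:
                # iterrows() yields copies, so write to the frame itself (assumes a unique index)
                df_port_call.at[index, 'Class'] = 1
                break  # one match is enough for this port call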

Without input data and the expected outcome, it's difficult to answer. However, you can use something like this with np.where:

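import numpy as np

# Note: this compares the two frames row by row, so it only works if
# df_port_call and df_deficiency have the same length and index.
same_vessel = (df_port_call['MMSI'].eq(df_deficiency['Primary VIN'])
               | df_port_call['IMO'].eq(df_deficiency['Primary VIN'])
               | df_port_call['SHIP NAME'].eq(df_deficiency['Vessel Name']))
in_window = df_deficiency['Inspection Date'].between(df_port_call['ARRIVAL IN USA (UTC)'],
                                                     df_port_call['DEPARTURE (UTC)'])
# Combine the two masks separately: & binds more tightly than |
df_port_call['Class'] = np.where(same_vessel & in_window, 1, 0)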

Adapt to your code but I think this is the right way.
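The question asks whether any matching row exists in df_deficiency, which a row-by-row comparison only checks when the two frames happen to be aligned. One possible way to express the "any match" logic without nested loops is to merge on each key pair and then filter by the date window. This is only a sketch under assumptions: the helper name label_port_calls and the call_pos column are made up for illustration, the key columns are assumed to have compatible dtypes (for example MMSI, IMO and Primary VIN all numeric), and the date columns are assumed to be datetimes already.

```python
import pandas as pd

def label_port_calls(df_port_call, df_deficiency):
    """Return a 0/1 Series aligned with df_port_call.index (1 = a matching inspection exists)."""
    calls = df_port_call[["MMSI", "IMO", "SHIP NAME",
                          "ARRIVAL IN USA (UTC)", "DEPARTURE (UTC)"]].copy()
    # Hypothetical helper column: positional id, safe even if the index has duplicates
    calls["call_pos"] = range(len(calls))

    defs = df_deficiency[["Primary VIN", "Vessel Name", "Inspection Date"]]

    # (port call column, deficiency column) pairs that identify the same vessel
    key_pairs = [("MMSI", "Primary VIN"),
                 ("IMO", "Primary VIN"),
                 ("SHIP NAME", "Vessel Name")]

    matched = set()
    for call_key, def_key in key_pairs:
        # Drop missing keys first: merge would otherwise pair NaN with NaN
        left = calls.dropna(subset=[call_key])
        right = defs.dropna(subset=[def_key])
        pairs = left.merge(right, left_on=call_key, right_on=def_key, how="inner")
        # Keep only inspections inside the port call window (bounds inclusive)
        in_window = pairs["Inspection Date"].between(
            pairs["ARRIVAL IN USA (UTC)"], pairs["DEPARTURE (UTC)"])
        matched.update(pairs.loc[in_window, "call_pos"])

    flags = calls["call_pos"].isin(matched).astype(int)
    return pd.Series(flags.to_numpy(), index=df_port_call.index, name="Class")

# Usage, with the frames from the question:
# df_port_call["Class"] = label_port_calls(df_port_call, df_deficiency)
```

Each merge produces one row per (port call, inspection) pair that shares a key, so memory use grows if a vessel has many port calls and many inspections. If that becomes a problem, processing df_port_call in chunks should keep the intermediate frames small.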
