Defining a new column based on matching values across multiple columns in two dataframes

Question

I am currently trying to define a class label for a dataset I am building. I need to consult two different datasets, where df_port_call is the one that will ultimately contain the class label.

The conditions in the if statements below must be satisfied for a row to receive a class label of 1. In other words, if a row exists in df_deficiency that matches those conditions, the Class column in df_port_call should get a label of 1. I am not sure how to vectorize this, though, and the loop runs very slowly (it would take roughly 8 days to finish). Any help here would be great!

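from tqdm import tqdm

df_port_call["Class"] = 0

for index, row in tqdm(df_port_call.iterrows()):
    for index_def, row_def in df_deficiency.iterrows():
        # Match on any one of the three identifiers
        if row['MMSI'] == row_def['Primary VIN'] or row['IMO'] == row_def['Primary VIN'] or row['SHIP NAME'] == row_def['Vessel Name']:
            # Inspection must fall within the port call window (inclusive)
            if row_def['Inspection Date'] >= row['ARRIVAL IN USA (UTC)'] and row_def['Inspection Date'] <= row['DEPARTURE (UTC)']:
                # iterrows() yields copies, so write through the DataFrame itself
                df_port_call.at[index, 'Class'] = 1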

Answer 1

Without the input data and the expected result, it is hard to answer. However, you could use something like this with np.where:

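import numpy as np

# Note: & binds tighter than |, so the three identifier checks are grouped
# in parentheses before being combined with the date check.
# The comparisons are element-wise, so this assumes the rows of the two
# frames line up one to one (same index).
df_port_call['Class'] = \
np.where((df_port_call['MMSI'].eq(df_deficiency['Primary VIN'])
          | df_port_call['IMO'].eq(df_deficiency['Primary VIN'])
          | df_port_call['SHIP NAME'].eq(df_deficiency['Vessel Name']))
         & df_deficiency['Inspection Date'].between(df_port_call['ARRIVAL IN USA (UTC)'],
                                                    df_port_call['DEPARTURE (UTC)']),
         1, 0)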

Adapt it to your code, but I think this is the right approach.
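
If the two frames do not line up row by row, one possible alternative is to merge on each identifier pair and then filter by the date window. The following is only a minimal sketch under a few assumptions: the function name label_port_calls and the helper column _call_id are made up for illustration, the key columns being compared share a dtype (merge raises an error on mismatched key dtypes such as int vs. string), and the date columns are already datetimes. The toy data is invented purely to exercise the logic.

```python
import pandas as pd


def label_port_calls(df_port_call, df_deficiency):
    """Return a copy of df_port_call with Class = 1 where any deficiency row matches."""
    calls = df_port_call.copy()
    calls["_call_id"] = range(len(calls))  # hypothetical positional row id

    # (port call column, deficiency column) pairs taken from the question's loop
    key_pairs = [("MMSI", "Primary VIN"),
                 ("IMO", "Primary VIN"),
                 ("SHIP NAME", "Vessel Name")]

    matched_ids = set()
    for call_col, def_col in key_pairs:
        # Drop missing keys first: merge would otherwise pair NaN with NaN
        left = calls[["_call_id", call_col,
                      "ARRIVAL IN USA (UTC)", "DEPARTURE (UTC)"]].dropna(subset=[call_col])
        right = df_deficiency[[def_col, "Inspection Date"]].dropna(subset=[def_col])
        pairs = left.merge(right, left_on=call_col, right_on=def_col, how="inner")
        # between() is inclusive on both ends, like >= and <= in the loop
        in_window = pairs["Inspection Date"].between(pairs["ARRIVAL IN USA (UTC)"],
                                                     pairs["DEPARTURE (UTC)"])
        matched_ids.update(pairs.loc[in_window, "_call_id"].tolist())

    calls["Class"] = calls["_call_id"].isin(matched_ids).astype(int)
    return calls.drop(columns="_call_id")


# Invented toy data, only to show the behaviour
df_port_call = pd.DataFrame({
    "MMSI": ["111", "222", "333", "444"],
    "IMO": ["900", "901", "902", "903"],
    "SHIP NAME": ["ALPHA", "BRAVO", "CHARLIE", "DELTA"],
    "ARRIVAL IN USA (UTC)": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]),
    "DEPARTURE (UTC)": pd.to_datetime(["2021-01-10", "2021-02-05", "2021-03-03", "2021-04-02"]),
})
df_deficiency = pd.DataFrame({
    "Primary VIN": ["111", "901", "000"],
    "Vessel Name": ["OTHER", "X", "CHARLIE"],
    "Inspection Date": pd.to_datetime(["2021-01-05", "2021-03-01", "2021-03-03"]),
})

labeled = label_port_calls(df_port_call, df_deficiency)
print(labeled["Class"].tolist())  # [1, 0, 1, 0]
```

Keep in mind that each merge produces one row per matching pair, so a key that repeats many times in both frames (a common ship name, for example) can make the intermediate result large.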

Answer 1
0 2022-01-01 22:20:17
