
How to compare datetimes in a pandas DataFrame

I have three columns with date information. They indicate events that need to happen in a particular order, and I would like to check whether the order is incorrect for any row in the DataFrame.

I prepared each column with pd.to_datetime().

Let's say the rule should be column a < b < c, so I tried this:

drop_rows = []  # indices of rows that violate the order
count = 0
for idx, _ in df.iterrows():
    if df.loc[idx, 'a'] > df.loc[idx, 'b']:
        print(f"Invalid b in line {idx}")
        print(f"{df.loc[idx, 'a']} {df.loc[idx, 'b']}")
        drop_rows.append(idx)
        count+=1
    if df.loc[idx, 'b'] > df.loc[idx, 'c']:
        print(f"Invalid c in line {idx}") 
        drop_rows.append(idx)
        count+=1
print(f"{count} invalid rows")

It works for almost all rows, but for 36 rows that are actually correct I still get output like the following:

Invalid b in line 5883 2014-03-06 00:00:00 2014-03-06 00:00:00
Invalid b in line 24442 2011-11-14 00:00:00 2011-11-14 00:00:00

I also replaced if df.loc[idx, 'a'] > df.loc[idx, 'b']: with if not df.loc[idx, 'a'] <= df.loc[idx, 'b']: but these correct entries are still reported as wrong.

Why does Python think these are not the same dates, and how can I change that?

Also, is there a faster way to go through the DataFrame than iterrows?

You don't necessarily need to iterate (potentially slowly) through your DataFrame rows. You can filter the DataFrame down to all rows that meet either condition, like so:

abc_errors = df.loc[(df['a'] > df['b']) | (df['b'] > df['c'])] 
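
To see how that filter behaves, here is a minimal sketch on made-up data. The sample dates are hypothetical and only the column names a, b and c come from the question. Row 0 has equal dates in a and b, and the strict > comparison does not flag it.

```python
import pandas as pd

# Hypothetical sample data; only the column names a, b, c follow the question.
df = pd.DataFrame({
    "a": ["2014-03-06", "2011-11-14", "2012-01-10", "2013-05-01"],
    "b": ["2014-03-06", "2011-11-13", "2012-01-11", "2013-05-02"],
    "c": ["2014-03-07", "2011-11-15", "2012-01-09", "2013-05-03"],
})
for col in ["a", "b", "c"]:
    df[col] = pd.to_datetime(df[col])

# Boolean mask: True where a is after b, or b is after c.
mask = (df["a"] > df["b"]) | (df["b"] > df["c"])
abc_errors = df.loc[mask]
print(abc_errors)
```

The comparison runs on whole columns at once, so no explicit Python loop over the rows is needed.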

Alternatively, you can filter for a/b errors and b/c errors separately:

ab_errors = df.loc[(df['a'] > df['b'])] 
bc_errors = df.loc[(df['b'] > df['c'])] 
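
The count and the drop_rows list from the question can be derived from the same masks. The following is only a sketch on hypothetical sample data. The names invalid_count, clean and report are made up for illustration. The last part prints the column dtypes and the difference between a and b for flagged rows, which may help when inspecting rows that look identical when printed. It does not by itself explain the 36 rows from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names a, b, c follow the question.
df = pd.DataFrame({
    "a": pd.to_datetime(["2014-03-06", "2011-11-14", "2012-01-10"]),
    "b": pd.to_datetime(["2014-03-06", "2011-11-13", "2012-01-11"]),
    "c": pd.to_datetime(["2014-03-07", "2011-11-15", "2012-01-09"]),
})

ab_mask = df["a"] > df["b"]
bc_mask = df["b"] > df["c"]
invalid = ab_mask | bc_mask

# Number of rows violating at least one rule (each row is counted once).
invalid_count = int(invalid.sum())
print(f"{invalid_count} invalid rows")

# Keep only the rows that are in the expected order.
clean = df.loc[~invalid]

# Diagnostic: confirm the dtypes and show the gap between a and b for flagged rows.
print(df.dtypes)
report = df.loc[ab_mask, ["a", "b"]].assign(gap=df["a"] - df["b"])
print(report)
```

Unlike the loop in the question, this counts a row once even if it breaks both rules.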

