Validate dataframe dates and return the non-matching values

Question

This is the workflow I need to accomplish:

Validate the date format of all dates in column_1 and column_2.
If a date is in neither format (mm/dd/yy hh:mm or mm/dd/yyyy hh:mm):
Need assistance: print the non-matching values.

Note: I do not know what format the dates will be in, and some will not be dates at all.

Sample input data CSV:

column_1    column_2
8/22/22 15:27   8/24/22 15:27
8/23/22 15:27   Tuesday, August 23, 2022
8/24/22 15:27   abc123
8/25/22 15:27   8/25/2022 15:27
8/26/22 15:27   8/26/2022 18:27
8/26/22 15:27   8/22/22

The following method always throws an exception, as designed, when the to_datetime() function raises a ValueError. How can I validate the date and then capture the values that do not match format_one or format_two?

df = pd.read_csv('input.csv', encoding='ISO-8859-1', dtype=str)

date_columns = ['column_1', 'column_2']

format_one = '%m/%d/%y %H:%M'
format_two = '%m/%d/%Y %H:%M'

for column in date_columns:
    for item in df[column]:
        try:
            if pd.to_datetime(df[item], format=format_one):
                print('format 1: ' + item)
            elif pd.to_datetime(df[item], format=format_two):
                print('format 2: ' + item)   
            else:
                print('unknown format: ' + item)
        except Exception as e:
            print('Exception:' )
            print(e)

Output:

Exception:
'8/22/22 15:27'
Exception:
'8/23/22 15:27'
Exception:
'8/24/22 15:27'
Exception:
'8/25/22 15:27'
Exception:
'8/26/22 15:27'
Exception:
'8/26/22 15:27'
Exception:
'8/24/22 15:27'
Exception:
'Tuesday, August 23, 2022'
Exception:
'abc123'
Exception:
'8/25/2022 15:27'
Exception:
'8/26/2022 18:27'
Exception:
'8/22/22'

Desired output:

Exception:
'Tuesday, August 23, 2022'
Exception:
'abc123'
Exception:
'8/22/22'

Thank you.

Answer 1

You'll need to test each allowed format individually (in the example given in the question, they're all in the same try block at the moment). A general solution could use a mask that marks the values that cannot be converted by any of the formats. That could look like

import pandas as pd

allowed = ('%m/%d/%y %H:%M', '%m/%d/%Y %H:%M')

# dummy df
df = pd.DataFrame({"date": ["8/24/22 15:27", "Tuesday, August 23, 2022",
                            "abc123", "8/25/2022 15:27"]})

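# this will be our mask, where the input format is invalid.
# initially, assume all invalid. the mask shares the df's index so that
# boolean indexing stays aligned even if the index is not 0..n-1.
m = pd.Series(True, index=df.index)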

# for each allowed format, test where the result is not NaT, i.e. valid.
# update the mask accordingly.
for fmt in allowed:
    m[pd.to_datetime(df["date"], format=fmt, errors="coerce").notna()] = False
    
# invalid format:
print(df["date"][m])
# 1    Tuesday, August 23, 2022
# 2                      abc123
# Name: date, dtype: object

Applied to the specific example from the question, that could look like

# for reference:
df
        column_1                  column_2
0  8/22/22 15:27             8/24/22 15:27
1  8/23/22 15:27  Tuesday, August 23, 2022
2  8/24/22 15:27                    abc123
3  8/25/22 15:27           8/25/2022 15:27
4  8/26/22 15:27           8/26/2022 18:27
5  8/26/22 15:27                   8/22/22


date_columns = ['column_1', 'column_2']

for column in date_columns:
    m = pd.Series(True, index=df.index)
    for fmt in allowed:
        m[pd.to_datetime(df[column], format=fmt, errors="coerce").notna()] = False

    print(f"{column}\n", df[column][m])

# column_1
#  Series([], Name: column_1, dtype: object)
 
# column_2
#  1    Tuesday, August 23, 2022
# 2                      abc123
# 5                     8/22/22
# Name: column_2, dtype: object
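
As a side note on why the loop in the question reports every value: `df[item]` looks up a column named after the cell value, which raises a KeyError, and the except block catches it before any format is tested (the quoted values in the output are the KeyError messages). If a per-value check is preferred over the vectorized mask, a minimal sketch could look like the following. It assumes the standard library's `datetime.strptime` is acceptable in place of `pd.to_datetime`; the helper name `matching_format` and the shortened sample frame are made up for illustration.

```python
from datetime import datetime

import pandas as pd

ALLOWED = ('%m/%d/%y %H:%M', '%m/%d/%Y %H:%M')


def matching_format(value, formats=ALLOWED):
    """Return the first format that parses value, or None if none does."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except (ValueError, TypeError):
            # ValueError: the string does not fit this format.
            # TypeError: the value is not a string (e.g. a missing cell).
            continue
    return None


# shortened sample data, not the full CSV from the question
df = pd.DataFrame({
    "column_1": ["8/22/22 15:27", "8/23/22 15:27"],
    "column_2": ["8/24/22 15:27", "abc123"],
})

# collect, per column, the values that match none of the allowed formats
non_matching = {
    column: [v for v in df[column] if matching_format(v) is None]
    for column in df.columns
}
print(non_matching)
```

This checks one value at a time, so it will be slower than the mask approach on large frames, but it also tells you which format matched.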

Answer 2

Just sharing the logic; I think it technically works. Please try it and let me know if it doesn't work.

import pandas as pd
df = pd.DataFrame({'date': {0: '8/24/22 15:27', 1: '24/8/22 15:27', 2: 'a,b,c', 3: 'Tuesday, August 23, 2022'}})

       
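# rows that fail the first format (month/day/year)
mask1 = df.loc[pd.to_datetime(df['date'], errors='coerce',format='%m/%d/%y %H:%M').isnull()]
# rows that fail the second format (day/month/year)
mask2 = df.loc[pd.to_datetime(df['date'], errors='coerce',format='%d/%m/%y %H:%M').isnull()]

# keep only the rows that fail both formats
df = pd.merge(mask1,mask2,on = ['date'],how ='inner')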

print(df)

Sample observations

Input df

                       date
0             8/24/22 15:27
1             24/8/22 15:27
2                     a,b,c
3  Tuesday, August 23, 2022

Output

                       date
0                     a,b,c
1  Tuesday, August 23, 2022
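
One thing to be aware of with the merge: it renumbers the rows, and if the same invalid value occurs more than once, an inner merge on the value pairs up the duplicates and repeats them. A possible alternative sketch, assuming the same two formats as above, combines the two boolean masks with `&` so the original row labels are kept. The names `fails_first`, `fails_second` and `invalid` are illustrative, and the sample frame has an extra duplicate row added to show the effect.

```python
import pandas as pd

# same sample as above, plus a repeated invalid value
df = pd.DataFrame({"date": ["8/24/22 15:27", "24/8/22 15:27", "a,b,c",
                            "Tuesday, August 23, 2022", "a,b,c"]})

# True where the value cannot be parsed with the given format
fails_first = pd.to_datetime(df["date"], format="%m/%d/%y %H:%M",
                             errors="coerce").isna()
fails_second = pd.to_datetime(df["date"], format="%d/%m/%y %H:%M",
                              errors="coerce").isna()

# a value is invalid only if it fails both formats
invalid = df[fails_first & fails_second]
print(invalid)
```

Here the result keeps the original index labels (2, 3 and 4), and the repeated `a,b,c` appears once per input row.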

Answer 1
1 Accepted 2022-11-15 16:42:35

Answer 2
1 2022-11-15 16:47:52
