
Regex pattern for checking all types of date formats

I want to find which columns of a DataFrame contain date values and convert those columns to datetime. The column dtype can initially be object, and the dates can be in any of the formats below, so I am looking for a regex pattern that matches all of them.

  1. 04/10/2022
  2. 10/04/2022
  3. 2022/04/10
  4. 2022/10/04
  5. 2022-12-20 00:00:00
  6. 04-10-2022

Can someone please suggest a regex pattern that matches all of these date formats?

I have tried the code below:

    import re
    import pandas as pd

    for colName in df.columns:
        count = 0
        # Iterate over the column values directly
        # (DataFrame.iteritems was removed in pandas 2.0).
        for k in df[colName]:
            # This pattern only covers dd/dd/dddd, not the other formats.
            if re.match(r"[0-9]{2}/[0-9]{2}/[0-9]{4}", str(k)):
                count = count + 1
        if count > 5:
            df[colName] = pd.to_datetime(df[colName])
        print(df.dtypes)

Consider the following DataFrame df, which contains all the date formats listed in the question:

df = pd.DataFrame({'date': ['04/10/2022', '10/04/2022', '2022/04/10', '2022/10/04', '2022-12-20 00:00:00', '04-10-2022']})

[Out]:
                  date
0           04/10/2022
1           10/04/2022
2           2022/04/10
3           2022/10/04
4  2022-12-20 00:00:00
5           04-10-2022

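Assuming the goal is to convert to datetime, one can use pandas.to_datetime. It accepts the parameter infer_datetime_format, which can be used as follows (in pandas 2.0 and later this parameter is deprecated, and format='mixed' is the documented way to parse each element individually):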

df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)

[Out]:

        date
0 2022-04-10
1 2022-10-04
2 2022-04-10
3 2022-10-04
4 2022-12-20
5 2022-04-10

For this case, it does the job. Note that ambiguous values such as 04/10/2022 are read month-first, as the output shows.


Note:

  • To see how the function is implemented, check the pandas source code on GitHub.
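When format inference behaves differently across pandas versions, an explicit alternative is to try a fixed list of candidate formats. The following is a minimal standard-library sketch, not part of the original answer; the names CANDIDATE_FORMATS and parse_date are hypothetical, and the month-first ordering is an assumption chosen to match the output shown above.

```python
from datetime import datetime

# Assumed candidate formats. Their order decides how ambiguous strings
# such as "04/10/2022" are read (month-first here).
CANDIDATE_FORMATS = [
    "%m/%d/%Y",
    "%Y/%m/%d",
    "%Y-%m-%d %H:%M:%S",
    "%m-%d-%Y",
]

def parse_date(text, formats=CANDIDATE_FORMATS):
    """Return a datetime for the first format that fits, or None."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
    return None

samples = ["04/10/2022", "10/04/2022", "2022/04/10",
           "2022/10/04", "2022-12-20 00:00:00", "04-10-2022"]
parsed = [parse_date(s) for s in samples]
```

Passing a day-first list instead, for example parse_date("04/10/2022", ["%d/%m/%Y"]), reads the same string as 4 October rather than 10 April, so the format list has to reflect how the data was actually produced.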

Why not simply use pandas.to_datetime without providing any format?

for col in df.columns:
    df[col] = pd.to_datetime(df[col])

# Output:

print(df)
        Col1       Col2       Col3       Col4
0 2022-04-10        NaT        NaT        NaT
1        NaT 2022-10-04        NaT        NaT
2        NaT        NaT 2022-04-10        NaT
3 2022-10-04        NaT        NaT        NaT
4        NaT        NaT        NaT 2022-12-20
5 2022-04-10        NaT        NaT        NaT

# Input used:

         Col1        Col2        Col3                 Col4
0  04/10/2022         NaN         NaN                  NaN
1         NaN  10/04/2022         NaN                  NaN
2         NaN         NaN  2022/04/10                  NaN
3  2022/10/04         NaN         NaN                  NaN
4         NaN         NaN         NaN  2022-12-20 00:00:00
5  04-10-2022         NaN         NaN                  NaN      
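Converting every column blindly may not be wanted when the frame also holds non-date columns, and the question asks which columns contain dates in the first place. One way to decide is to measure the share of non-missing values that look like dates. The sketch below is an illustration only: it works on a plain dict of lists with None for missing values, and DATE_RE, date_like_columns and the 0.8 threshold are all hypothetical choices. With a real DataFrame one would drop missing values first, for instance with df[col].dropna().

```python
import re

# Hypothetical pattern: three digit groups separated by "/" or "-",
# optionally followed by a time such as " 00:00:00".
DATE_RE = re.compile(
    r"\d{1,4}[/-]\d{1,2}[/-]\d{1,4}(?:[ T]\d{2}:\d{2}(?::\d{2})?)?"
)

def date_like_columns(table, threshold=0.8):
    """Return names of columns whose non-missing values mostly look like dates."""
    result = []
    for name, values in table.items():
        present = [str(v) for v in values if v is not None]
        if not present:
            continue  # nothing to judge in an all-missing column
        hits = sum(1 for v in present if DATE_RE.fullmatch(v))
        if hits / len(present) >= threshold:
            result.append(name)
    return result

table = {
    "Col1": ["04/10/2022", None, None, "2022/10/04", None, "04-10-2022"],
    "Col4": [None, None, None, None, "2022-12-20 00:00:00", None],
    "name": ["a", "b", "c", "d", "e", "f"],
}
print(date_like_columns(table))  # ['Col1', 'Col4']
```

Only the columns returned by such a check would then be passed to pd.to_datetime.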

Here is an idea. The code below matches all the formats, but it cannot tell day from month when both values are 12 or less, as in 04/10/2022. That issue goes beyond the scope of the question.

The regex I came up with looks for groups of one or more digits, [0-9]+, separated by either a dash or a slash, [\/\-]; the backslashes escape the special symbols. Because re.match only anchors at the start of the string, the trailing time in 2022-12-20 00:00:00 is simply ignored.

dates="""04/10/2022
10/04/2022
2022/04/10
2022/10/04
2022-12-20 00:00:00
04-10-2022"""

import re
dre = re.compile(r"([0-9]+)[\/\-]([0-9]+)[\/\-]([0-9]+)")

for date in dates.split("\n"):
    m = dre.match(date)
    if m:  # match() returns None when the line does not start with a date
        print(m.group(1), m.group(2), m.group(3))
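The pattern above accepts any run of digits, so strings like 1/2/3 or 04/10/2022abc also match. A stricter variant could require a four-digit year, allow an optional time, and match the whole string. The following is a rough sketch under those assumptions; DATE_PATTERN and describe are hypothetical names, and the pattern still cannot say whether 04/10 means April or October.

```python
import re

DATE_PATTERN = re.compile(
    r"""
    (?:
        (?P<y1>\d{4})[/-](?P<a1>\d{1,2})[/-](?P<b1>\d{1,2})   # year first
      |
        (?P<a2>\d{1,2})[/-](?P<b2>\d{1,2})[/-](?P<y2>\d{4})   # year last
    )
    (?:[ ]\d{2}:\d{2}:\d{2})?                                 # optional time
    """,
    re.VERBOSE,
)

def describe(text):
    """Return 'year-first', 'year-last', or None if text is not date-like."""
    m = DATE_PATTERN.fullmatch(text)
    if m is None:
        return None
    return "year-first" if m.group("y1") else "year-last"

for s in ["2022/04/10", "04-10-2022", "2022-12-20 00:00:00", "04/10/2022abc"]:
    print(s, "->", describe(s))
```

Using fullmatch instead of match rejects trailing garbage, and the named groups show which position holds the year, which is the only component these formats identify unambiguously.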
