
Reformat date inside string using pandas replace with regex

I have a column of strings like below that contain date information, and I need to add leading zeros to single-digit months and days. I've run into some issues trying to do this purely with pandas.DataFrame.replace and regular expressions.

import pandas as pd
df = pd.DataFrame({'Key':['0123456789_1/2/2019','0123456789_11/23/2019','0145892367_10/2/2019','0145892367_4/13/2019']})

df
Out[323]: 
                     Key
0    0123456789_1/2/2019
1  0123456789_11/23/2019
2   0145892367_10/2/2019
3   0145892367_4/13/2019

For the above column, the output I'd want after reformatting would be:

                     Key
0  0123456789_01/02/2019
1  0123456789_11/23/2019
2  0145892367_10/02/2019
3  0145892367_04/13/2019

By now I've figured out I can do this by splitting the strings:

r = df['Key'].str.split('_|/', expand=True)
df2 = r[0] + '_' + r[1].str.zfill(2) + '/' + r[2].str.zfill(2) + '/' + r[3]

df2
Out[333]: 
0    0123456789_01/02/2019
1    0123456789_11/23/2019
2    0145892367_10/02/2019
3    0145892367_04/13/2019
dtype: object

But when I initially tried to do it with pandas.DataFrame.replace, the closest I was able to get was:

df2 = df.replace(r'(_|/)([1-9]/)',r'\1 0\2',regex=True)

df2
Out[335]: 
                      Key
0   0123456789_ 01/2/2019
1   0123456789_11/23/2019
2  0145892367_10/ 02/2019
3  0145892367_ 04/13/2019

There are two problems with this that I'd like to know more about:

  1. In cases like row 0 where both the month and day are single-digit, it only finds the month. How can I get it to match both?
  2. I don't want the spaces, but when I try to replace using r'\10\2', I get an error because it is read as a reference to group 10, and there is no such group in the pattern. If I try r'(\1)0\2', it works, except that it prints the literal parentheses. Why does it do this, and how can I write the replacement so that it emits group 1 immediately followed by a literal zero?

Edit for clarification: I'm aware I could also fix it by parsing the dates, but I'm specifically interested in the regex solution, both as a learning exercise and because a single replace is much faster for large dataframes.
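Both problems can be reproduced outside pandas with Python's re module. The sketch below assumes that pandas regex replacement passes the pattern and replacement string to the same engine, which the backreference behavior shown above suggests; the variable names are illustrative only. It shows that the original pattern consumes the trailing slash (so the day's leading slash is no longer available for a second match), that parentheses in a replacement string are literal characters rather than grouping syntax, and that \g<1> is the unambiguous way to write "group 1" before a digit.

```python
import re

keys = ['0123456789_1/2/2019', '0123456789_11/23/2019',
        '0145892367_10/2/2019', '0145892367_4/13/2019']

# Problem 1: ([1-9]/) consumes the '/', so after matching '_1/' the scan
# resumes at '2/2019' and the day has no leading '/' left to match against.
consuming = re.sub(r'(_|/)([1-9]/)', r'\1 0\2', keys[0])

# Problem 2a: '\10' is parsed as a reference to group 10, which does not exist.
try:
    re.sub(r'(_|/)([1-9]/)', r'\10\2', keys[0])
    group_error = None
except re.error as exc:
    group_error = exc

# Problem 2b: parentheses have no special meaning in a replacement string,
# so they are copied into the result as-is.
literal_parens = re.sub(r'(_|/)([1-9]/)', r'(\1)0\2', keys[0])

# One possible fix: a lookahead (?=/) checks for the slash without consuming
# it, and \g<1> ends the group reference before the literal '0'.
fixed = [re.sub(r'([_/])([1-9])(?=/)', r'\g<1>0\2', k) for k in keys]

print(consuming)       # 0123456789_ 01/2/2019
print(literal_parens)  # 0123456789(_)01/2/2019
print(fixed)
```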

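If I understand correctly, you can split off the date part, parse it with pd.to_datetime, and format it back with zero padding: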

df.Key=df.Key.str.split("_").str[0]+"_"+pd.to_datetime(df.Key.str.split("_")
            .str[1]).dt.strftime('%m/%d/%Y')
print(df)

                     Key
0  0123456789_01/02/2019
1  0123456789_11/23/2019
2  0145892367_10/02/2019
3  0145892367_04/13/2019

Using the datetime module:

from datetime import datetime  # needed for strptime below
df['Key'] = df.Key.str.split('_').apply(lambda x: x[0]+'_'+datetime.strptime(x[1], "%m/%d/%Y").strftime("%m/%d/%Y"))

Output

                     Key
0  0123456789_01/02/2019
1  0123456789_11/23/2019
2  0145892367_10/02/2019
3  0145892367_04/13/2019

