
Convert a column to datetime in pandas (data cleaning)

I have a column (col3, a date field) whose values are in the format h:m:s.ms dd/mm/yy, but not consistently. As the unique values of the column show, some rows have the year as 04 and others as 2004, and some contain tab characters. I want to clean the column and convert it to a datetime data type; currently it is the object data type. Converting it directly with my current code does not work. How can I overcome this problem?

First, strip the leading and trailing spaces and \t characters, replace any remaining \t with a space, then split each value on whitespace and join the two parts in swapped order (date first, then time).

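import pandas as pd

# `a` holds the raw column values, taken from
# https://raw.githubusercontent.com/sprabhala-cpu/Machine-Learning/main/datetime.txt
df = pd.DataFrame({'col3': a})

# strip outer spaces/tabs, replace inner tabs with spaces, split on whitespace
s = df['col3'].str.strip(' \t').str.replace('\t', ' ').str.split()
# swap the order: date first, then time
dates_str = s.str[1] + ' ' + s.str[0]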
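
To see what the cleaning step produces, here is a minimal sketch on three made-up values that imitate the inconsistencies in the column (the `sample` Series is an assumption for illustration, not the real data):

```python
import pandas as pd

# Hypothetical sample imitating the column: double spaces, tabs, trailing space
sample = pd.Series(['23:59:00.00  31/07/04',
                    '59:00.0\t4/11/1930',
                    '23:59:00.00\t12/30/04 '])

# Strip outer spaces/tabs, turn inner tabs into spaces, split on whitespace
parts = sample.str.strip(' \t').str.replace('\t', ' ', regex=False).str.split()

# Put the date part first, then the time part
dates_str = parts.str[1] + ' ' + parts.str[0]

print(dates_str.tolist())
# ['31/07/04 23:59:00.00', '4/11/1930 59:00.0', '12/30/04 23:59:00.00']
```

Because str.split() without arguments splits on any run of whitespace, the double spaces need no special handling.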

Next, the formats are defined in a list. The function parses the column once per format, keeping only the values that match (the rest become NaT), and Series.combine_first merges the results into the final datetimes.
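
As a small illustration of how Series.combine_first fills gaps, here is a sketch with two made-up Series (`a` and `b` are hypothetical, chosen only to show the behavior):

```python
import pandas as pd

# Two hypothetical parse results; None becomes NaT (no match)
a = pd.to_datetime(pd.Series(['2004-07-31', None, None]))
b = pd.to_datetime(pd.Series(['1999-01-01', '1930-04-11', None]))

# Values from `a` win; `b` only fills positions where `a` is NaT
combined = a.combine_first(b)
print(combined)
```

Row 0 keeps the value from a, row 1 is filled from b, and row 2 stays NaT because neither Series has a value there.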

*Important: the order of the formats in the list matters, because some day/month combinations can be interpreted two ways.

For example, is 23:59:00.00 08.07.2004 in August or in July? The list specifies %m.%d.%Y %H:%M:%S.%f first and %d.%m.%Y %H:%M:%S.%f second, so the first number is parsed as the month: August. If you need July, put %d.%m.%Y %H:%M:%S.%f before %m.%d.%Y %H:%M:%S.%f.
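
The effect of the ordering can be checked on that single ambiguous value. This is a sketch only; `first_match` is a hypothetical helper that follows the same parse-then-combine_first idea:

```python
from functools import reduce
import pandas as pd

def first_match(s, formats):
    # Parse with every format; non-matching values become NaT
    parsed = [pd.to_datetime(s, format=f, errors='coerce') for f in formats]
    # Earlier formats take priority, later ones only fill NaT
    return reduce(lambda l, r: l.combine_first(r), parsed)

s = pd.Series(['08.07.2004 23:59:00.00'])

month_first = first_match(s, ['%m.%d.%Y %H:%M:%S.%f', '%d.%m.%Y %H:%M:%S.%f'])
day_first = first_match(s, ['%d.%m.%Y %H:%M:%S.%f', '%m.%d.%Y %H:%M:%S.%f'])

print(month_first[0])  # 2004-08-07 23:59:00 (August)
print(day_first[0])    # 2004-07-08 23:59:00 (July)
```

Both formats match the string, so whichever comes first in the list decides the result.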

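from functools import reduce

def convert_formats_to_datetimes(s1, formats):
    # parse with each format; values that do not match become NaT
    out = [pd.to_datetime(s1, format=x, errors='coerce') for x in formats]
    # fill NaT from left to right, so earlier formats take priority
    return reduce(lambda l,r: pd.Series.combine_first(l,r), out)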

formats = ['%d/%m/%y %H:%M:%S.%f', '%d/%m/%Y %H:%M:%S.%f',
           '%m/%d/%y %H:%M:%S.%f','%m.%d.%Y %H:%M:%S.%f',
           '%d.%m.%Y %H:%M:%S.%f',
           '%m/%d/%y %M:%S.%f', '%d/%m/%y %M:%S.%f',
           '%m/%d/%Y %M:%S.%f', '%m.%d.%Y %M:%S.%f',
           ]

df['date_time'] = convert_formats_to_datetimes(dates_str, formats)

print (df)
                        col3           date_time
0      23:59:00.00  31/07/04 2004-07-31 23:59:00
1         59:00.0\t4/11/1930 1930-04-11 00:59:00
2        23:59:00.00  2/4/14 2014-04-02 23:59:00
3      23:59:00.00  30/06/04 2004-06-30 23:59:00
4      23:59:00.00  31/05/04 2004-05-31 23:59:00
..                       ...                 ...
315    23:59:00.00\t12/30/04 2004-12-30 23:59:00
316    23:59:00.00  30/05/04 2004-05-30 23:59:00
317   23:59:00.00\t4/10/1930 1930-10-04 23:59:00
318     23:59:00.00 30/07/04 2004-07-30 23:59:00
319     23:59:00.00  3/30/04 2004-03-30 23:59:00

[320 rows x 2 columns]
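
Since errors='coerce' turns every unmatched value into NaT, it may be worth checking afterwards whether any rows were left unparsed. A minimal sketch, using a made-up two-row frame and a single format (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: one parsable value, one that matches no format
df = pd.DataFrame({'dates_str': ['31/07/04 23:59:00.00', 'not a date']})
df['date_time'] = pd.to_datetime(df['dates_str'],
                                 format='%d/%m/%y %H:%M:%S.%f',
                                 errors='coerce')

# Rows where parsing failed; a format may need to be added for these
unparsed = df[df['date_time'].isna()]
print(unparsed)
```

If unparsed is empty, every value matched at least one format.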
