I have a column in the format h:m:s:ms dd/mm/yy, but the format is not consistent. As you can see from the unique values of the column, some rows have the year as 04 and some as 2004, and some contain tab characters. I want to clean the column and convert it to a datetime data type; currently it is object. I am using this code directly and it doesn't work. How can I overcome this problem?
First, it is necessary to strip leading and trailing spaces and \t characters, replace any remaining \t with a space, then split each value and join the two parts in swapped order (date first, then time).
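import pandas as pd
# a is assumed to hold the raw strings from
# https://raw.githubusercontent.com/sprabhala-cpu/Machine-Learning/main/datetime.txt
df = pd.DataFrame({'col3': a})
# strip surrounding spaces/tabs, turn inner tabs into spaces, split into [time, date]
s = df['col3'].str.strip(' \t').str.replace('\t', ' ').str.split()
# put the date first, then the time
dates_str = s.str[1] + ' ' + s.str[0]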
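To see what the cleaning step produces, here is a minimal sketch on three sample strings. The first two are taken from the output shown in this answer; the trailing " \t" on the third is a hypothetical addition to exercise the strip. The names sample, parts and swapped are only for illustration.

```python
import pandas as pd

# Sample values; the trailing " \t" on the last one is a made-up illustration
sample = pd.Series([
    "23:59:00.00 31/07/04",
    "59:00.0\t4/11/1930",
    "23:59:00.00\t12/30/04 \t",
])

# Strip surrounding spaces/tabs, replace inner tabs with spaces, split on whitespace
parts = sample.str.strip(" \t").str.replace("\t", " ", regex=False).str.split()

# Swap the order: date first, then time
swapped = parts.str[1] + " " + parts.str[0]
print(swapped.tolist())
# ['31/07/04 23:59:00.00', '4/11/1930 59:00.0', '12/30/04 23:59:00.00']
```

After this step every value has the same "date time" layout, so only the date style and the presence of the hour part still vary.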
Then the candidate formats are defined in a list. The function parses the column once per format, keeping only the values that match that format, and the results are combined with Series.combine_first into the final datetimes.
Important: the order of the formats in the list matters, because some day/month combinations can be interpreted in two ways. For example, is 23:59:00.00 08.07.2004 August or July? In formats, %m.%d.%Y %H:%M:%S.%f is listed first and %d.%m.%Y %H:%M:%S.%f second, so the value is parsed month-first, as August. If you need July, list %d.%m.%Y %H:%M:%S.%f first and %m.%d.%Y %H:%M:%S.%f after it.
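The ambiguity can be checked directly by parsing the same swapped string with both formats. This is only a small sketch; the variable names are illustrative.

```python
import pandas as pd

value = "08.07.2004 23:59:00.00"

# Month-first format: 08 is read as the month
month_first = pd.to_datetime(value, format="%m.%d.%Y %H:%M:%S.%f")
# Day-first format: 08 is read as the day
day_first = pd.to_datetime(value, format="%d.%m.%Y %H:%M:%S.%f")

print(month_first)  # 2004-08-07 23:59:00
print(day_first)    # 2004-07-08 23:59:00
```

Both formats match, so whichever one comes first in the list decides the result for such values.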
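from functools import reduce

def convert_formats_to_datetimes(s1, formats):
    # parse with each format; values that do not match become NaT
    out = [pd.to_datetime(s1, format=x, errors='coerce') for x in formats]
    # earlier formats take priority; later ones only fill remaining NaT
    return reduce(lambda l, r: pd.Series.combine_first(l, r), out)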
formats = ['%d/%m/%y %H:%M:%S.%f', '%d/%m/%Y %H:%M:%S.%f',
           '%m/%d/%y %H:%M:%S.%f', '%m.%d.%Y %H:%M:%S.%f',
           '%d.%m.%Y %H:%M:%S.%f',
           '%m/%d/%y %M:%S.%f', '%d/%m/%y %M:%S.%f',
           '%m/%d/%Y %M:%S.%f', '%m.%d.%Y %M:%S.%f',
          ]
df['date_time'] = convert_formats_to_datetimes(dates_str, formats)
print(df)
                       col3           date_time
0      23:59:00.00 31/07/04 2004-07-31 23:59:00
1        59:00.0\t4/11/1930 1930-04-11 00:59:00
2        23:59:00.00 2/4/14 2014-04-02 23:59:00
3      23:59:00.00 30/06/04 2004-06-30 23:59:00
4      23:59:00.00 31/05/04 2004-05-31 23:59:00
..                      ...                 ...
315   23:59:00.00\t12/30/04 2004-12-30 23:59:00
316    23:59:00.00 30/05/04 2004-05-30 23:59:00
317  23:59:00.00\t4/10/1930 1930-10-04 23:59:00
318    23:59:00.00 30/07/04 2004-07-30 23:59:00
319     23:59:00.00 3/30/04 2004-03-30 23:59:00

[320 rows x 2 columns]
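Because errors='coerce' silently turns anything that matches no format into NaT, it can be worth listing the values that were left unparsed. A minimal sketch, assuming a small hypothetical series raw (already in "date time" order) and a shortened format list:

```python
from functools import reduce

import pandas as pd

# Hypothetical input; "not a date" stands in for a value no format matches
raw = pd.Series(["31/07/04 23:59:00.00", "not a date", "4/11/1930 59:00.0"])
formats = ["%d/%m/%y %H:%M:%S.%f", "%m/%d/%Y %M:%S.%f"]

# Same approach as above: parse once per format, then fill NaT from later formats
candidates = [pd.to_datetime(raw, format=f, errors="coerce") for f in formats]
parsed = reduce(lambda l, r: l.combine_first(r), candidates)

# Values that are still NaT matched none of the formats
unparsed = raw[parsed.isna()]
print(unparsed.tolist())  # ['not a date']
```

If this list is not empty on the real data, another format probably needs to be added to formats.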