
Converting string to a datetime when the string doesn't match a specific date format

I'm having some trouble converting the following string to a datetime object in Python. I have a large CSV file (over 10k lines) and I need to transform a column of dates from the following format:

Jun 1, 2020 12:11:49 AM PDT

to:

06/01/20

My first thought was to use datetime.strptime, which requires passing in the string and the date format it is in, because then I can just reformat one date type to another real easy. The problem I'm having is I don't know how to represent this string as a date format, mostly due to the timezone.

My best guess for the date format I need is '%mmm %dd, %yyyy %H:%M:%S %aa' but I can't figure out how to represent the timezone here (and I'm also not sure about AM/PM being %aa).

I've tried looking at other threads but they all seem to have easily match-able strings.

Thanks!

The format codes are documented in the following table. In particular, AM/PM is %p and the timezone name is %Z:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

However, in your case, I would suggest not bothering with a hand-written format at all and relying on dateutil to do the parsing. It is more flexible, since it can usually work out the format on its own.

I'd be fine cutting out the time and timezone completely

Then you have lots of choices. As already mentioned, dateutil is cool and would work great. But if you wanted to stay in datetime for some reason you could:

  • Parse the whole thing, but know that the timezone is ignored

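datetime.strptime can parse the whole string, but it doesn't really understand or convert timezones. With %Z it only matches the timezone name and then discards it, so the result is a naive datetime (tzinfo is None) with the wall-clock time unchanged. It is not converted to UTC.

>>> from datetime import datetime
>>> str(datetime.strptime("Jun 1, 2020 12:11:49 AM PDT", "%b %d, %Y %I:%M:%S %p %Z"))
'2020-06-01 00:11:49'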

You could also throw away the time portion before handing it to strptime(), but that's probably more trouble than it's worth given the other options.


Oops. I didn't realize that %Z only parses certain timezone names: UTC, GMT, and whatever names appear in time.tzname on your machine. So if you can't control that, it's not going to work. On my machine 'PDT' parses and 'EDT' fails.
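To see which names %Z accepts on a given machine, something like the following sketch may help. The helper name parse_with_tz_name is made up for illustration, and the example assumes an English locale so that %b and %p match "Jun" and "AM". UTC is used because the documentation lists it as always accepted; XYZ stands in for any name the machine doesn't know.

```python
import time
from datetime import datetime

FMT_WITH_TZ = "%b %d, %Y %I:%M:%S %p %Z"

def parse_with_tz_name(text):
    """Return a datetime, or None if strptime rejects the string
    (for example because the %Z name is unknown on this machine)."""
    try:
        return datetime.strptime(text, FMT_WITH_TZ)
    except ValueError:
        return None

# Local names accepted for %Z on this machine, in addition to UTC and GMT.
print(time.tzname)

# "UTC" is always accepted, but the name is only matched, not attached:
utc_result = parse_with_tz_name("Jun 1, 2020 12:11:49 AM UTC")
print(utc_result, utc_result.tzinfo)

# A name that is neither UTC/GMT nor in time.tzname is rejected:
print(parse_with_tz_name("Jun 1, 2020 12:11:49 AM XYZ"))
```

Whether 'PDT' or 'EDT' parses depends on what time.tzname reports, which is why the same format string can work on one machine and fail on another.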

Given that, I'd throw away the timezone. If it's always in this format, then maybe something like:

>>> ts = "Jun 1, 2020 12:11:49 AM PDT"
>>> str(datetime.strptime(ts.rpartition(" ")[0], "%b %d, %Y %I:%M:%S %p"))
'2020-06-01 00:11:49'
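Applied to the CSV file from the question, the same idea could look roughly like the sketch below. The column name "date" and the helper names are assumptions, since the question doesn't show the file layout, and an in-memory StringIO stands in for the real file.

```python
import csv
import io
from datetime import datetime

IN_FMT = "%b %d, %Y %I:%M:%S %p"   # input format without the timezone token
OUT_FMT = "%m/%d/%y"               # target format, e.g. 06/01/20

def reformat_date(text):
    """Drop the trailing timezone token, parse the rest, return MM/DD/YY."""
    without_tz = text.rpartition(" ")[0]
    return datetime.strptime(without_tz, IN_FMT).strftime(OUT_FMT)

def convert_column(src, dst, column="date"):
    """Copy CSV rows from src to dst, rewriting one column.
    The default column name is hypothetical."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = reformat_date(row[column])
        writer.writerow(row)

# In-memory stand-in for the real file; real files should be opened with newline="".
sample = io.StringIO('id,date\n1,"Jun 1, 2020 12:11:49 AM PDT"\n')
out = io.StringIO()
convert_column(sample, out)
print(out.getvalue())
```

Because rows are streamed one at a time, 10k lines (or far more) shouldn't be a problem. A row whose date doesn't match IN_FMT raises ValueError, which is probably what you want while checking the data.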

As @adrtam already suggested, you can use dateutil's parser to conveniently parse such a string. To parse the time zone correctly, you can supply it with a mapping dict:

from dateutil import parser, tz

s = 'Jun 1, 2020 12:11:49 AM PDT'

tzmapping = {'PDT': tz.gettz('US/Pacific')} # assuming PDT means Pacific daylight saving time

dt = parser.parse(s, tzinfos=tzmapping)

dt
Out[2]: datetime.datetime(2020, 6, 1, 0, 11, 49, tzinfo=tzfile('US/Pacific'))

Now you can easily format to string:

s_reformatted = dt.strftime('%m/%d/%y')

s_reformatted
Out[4]: '06/01/20'
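If installing dateutil is not an option, a similar abbreviation mapping can be approximated with the standard library alone. This is only a sketch, not how dateutil works internally: the TZ_OFFSETS table and parse_aware are made-up names, and the fixed offsets assume PDT means UTC-7 and PST means UTC-8.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping from abbreviation to a fixed UTC offset; extend as needed.
TZ_OFFSETS = {
    "PDT": timezone(timedelta(hours=-7), "PDT"),
    "PST": timezone(timedelta(hours=-8), "PST"),
}

def parse_aware(text):
    """Parse 'Jun 1, 2020 12:11:49 AM PDT' into a timezone-aware datetime."""
    stamp, _, abbrev = text.rpartition(" ")
    naive = datetime.strptime(stamp, "%b %d, %Y %I:%M:%S %p")
    # Raises KeyError for abbreviations missing from the table.
    return naive.replace(tzinfo=TZ_OFFSETS[abbrev])

dt = parse_aware("Jun 1, 2020 12:11:49 AM PDT")
print(dt.isoformat())            # 2020-06-01T00:11:49-07:00
print(dt.strftime("%m/%d/%y"))   # 06/01/20
```

The trade-off is that every abbreviation has to be listed by hand, whereas tz.gettz gives a full zone with daylight saving rules.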
