
How to efficiently parse Time/Date string into datetime object?

I'm scraping data from a news site and want to store the time and date these articles were posted. The good thing is that I can pull these timestamps right from the page of the articles.

When the articles I scrape were posted today, the output looks like this:

17:22 ET
02:41 ET
06:14 ET

When the articles were posted earlier than today, the output looks like this:

Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET

Current problem: I can't order my database by the time the articles were posted, because whenever I run the program, articles that were posted today are stored only with a time. Over multiple days, this will create a lot of articles with a stamp that looks as if they were posted on the day you look at the database - since there is only a time.

What I want: Add the current month/day/year in front of the time stamp on the basis of the already given format.

My idea: I have a hard time understanding how regex works. My idea would be to check the length of the imported string. If it is exactly 8, I want to add the month, day and year in front. But I don't know a) whether this is the most efficient approach and b) most importantly, how to code this seemingly easy idea.

I would gladly appreciate it if someone could help me code this. The current line which grabs the time looks like this:

article_time = item.select_one('h3 small').text

Try this out; others can correct me if I overlooked something:

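from datetime import datetime, timedelta, timezone

def get_datetime_from_time(time_string):
    # Split off the trailing timezone abbreviation: "17:22 ET" -> "17:22", "ET"
    stamp, tz_abbr = time_string.rsplit(' ', 1)
    if ',' in stamp:
        # Full stamp with a date, e.g. "Mar 10, 2021, 16:05"
        article_time = datetime.strptime(stamp, "%b %d, %Y, %H:%M")
    else:
        # Time only: parse it, then put it on today's date
        article_time = datetime.strptime(stamp, "%H:%M")
        hour, minute = article_time.hour, article_time.minute
        # 'ET' alone does not say whether daylight time is in effect; this
        # branch assumes UTC-4 for 'ET' and UTC-5 for anything else
        if tz_abbr == 'ET':
            hours = -4
        else:
            hours = -5
        now_utc = datetime.now(timezone.utc).replace(tzinfo=None)  # naive UTC "now"
        article_time = (now_utc + timedelta(hours=hours)).replace(
            hour=hour, minute=minute, second=0, microsecond=0)  # Adjust for timezone
    return article_time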

article_time = item.select_one('h3 small').text
article_time = get_datetime_from_time(article_time)

What I'm doing here is checking whether there is a comma in your time string. If there is, the string includes a date; otherwise it is a time only. Then I check the timezone, since daylight time and standard time have different UTC offsets (4 and 5 hours behind UTC for US Eastern), so there is a statement that picks an offset of 4 or 5 hours. Then I take the current UTC time (regardless of your machine's timezone), shift it by that offset, and overwrite the hour and minute with the parsed values. strptime is a function that parses a string into a datetime according to the format you give it.

Note that this does not take into account an empty time string.
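
One possible way to cover the empty-string case is sketched below. It is not part of the answer above: parse_or_none is a made-up helper name, the timezone label is simply dropped, and time-only stamps are assumed to belong to today's date on the local clock.

```python
from datetime import datetime

def parse_or_none(raw):
    """Return a naive datetime for a scraped stamp, or None if it is empty or malformed."""
    text = (raw or "").strip()
    if not text:
        return None
    # Drop the trailing timezone label ("ET"); it is ignored in this sketch.
    stamp = text.rsplit(" ", 1)[0]
    for fmt in ("%b %d, %Y, %H:%M", "%H:%M"):
        try:
            parsed = datetime.strptime(stamp, fmt)
        except ValueError:
            continue  # try the next format
        if fmt == "%H:%M":
            # Time only: assume the article was posted today (local clock).
            parsed = datetime.now().replace(
                hour=parsed.hour, minute=parsed.minute, second=0, microsecond=0)
        return parsed
    return None  # nothing matched
```

Returning None lets the caller decide whether to skip the article or store it without a timestamp.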

Handling timezones properly can get fairly involved, since the standard library barely supports them (and recommends using the third-party pytz module to do so). This would be especially true if you need it

So, one "quick and dirty" way to deal with them would be to just ignore that information and add the current day, month, and year to any timestamp that doesn't include them. The code below demonstrates how to do that.

from datetime import datetime


scrapped = '''
17:22 ET
02:41 ET
06:14 ET
Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET
'''

def get_datetime(string):
    string = string[:-3]  # Remove timezone.
    try:
        r = datetime.strptime(string, "%b %d, %Y, %H:%M")
    except ValueError:
        try:
            today = datetime.today()
            daytime = datetime.strptime(string, "%H:%M")
            r = today.replace(hour=daytime.hour, minute=daytime.minute, second=0, microsecond=0)
        except ValueError:
            r = None
    return r

for line in scrapped.splitlines():
    if line:
        r = get_datetime(line)
        print(f'{line=}, {r=}')

"I can't order my database" - to be able to do so, you'll either have to convert the strings to datetime objects or to an ordered format (from low to high resolution: year, month, day, and so on), which would allow you to sort the strings correctly.

"I have a hard time understanding how regex works" - while you could use regular expressions here to parse and modify the strings you have, you don't need to.

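For reference, the idea from the question (put today's date in front of time-only stamps, keeping the site's format) also works without regex. The following is only a minimal sketch: add_missing_date and its today parameter are made-up names, and it assumes that a stamp without a comma is a time-only stamp and that month abbreviations are formatted in an English locale.

```python
from datetime import date

def add_missing_date(stamp, today=None):
    """Prefix 'Mon DD, YYYY, ' to stamps that contain only a time."""
    if "," in stamp:
        return stamp  # already has a date, leave untouched
    if today is None:
        today = date.today()
    # Same layout as the dated stamps on the site, e.g. "Mar 10, 2021, 16:05 ET"
    return f"{today.strftime('%b %d, %Y')}, {stamp}"

print(add_missing_date("17:22 ET"))
print(add_missing_date("Mar 10, 2021, 16:05 ET"))
```

The result is still a string, so it would need parsing (or a sortable format) before it can be ordered reliably.
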
#1 If you want a convenient option that leaves you with datetime objects, here's one using dateutil:

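import dateutil.parser  # submodules are imported explicitly; a bare
import dateutil.tz      # "import dateutil" may not make them available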

times = ["17:22 ET", "02:41 ET", "06:14 ET",
         "Mar 10, 2021, 16:05 ET", "Mar 08, 2021, 08:00 ET", "Feb 26, 2021, 11:23 ET"]

tzmapping = {'ET': dateutil.tz.gettz('US/Eastern')}

for t in times:
    print(f"{t:>22} -> {dateutil.parser.parse(t, tzinfos=tzmapping)}")

Output:

              17:22 ET -> 2021-03-13 17:22:00-05:00
              02:41 ET -> 2021-03-13 02:41:00-05:00
              06:14 ET -> 2021-03-13 06:14:00-05:00
Mar 10, 2021, 16:05 ET -> 2021-03-10 16:05:00-05:00
Mar 08, 2021, 08:00 ET -> 2021-03-08 08:00:00-05:00
Feb 26, 2021, 11:23 ET -> 2021-02-26 11:23:00-05:00

Note that you can easily tell dateutil's parser to use a certain time zone (e.g. to map 'ET' to US/Eastern), and that it automatically adds today's date if the input contains no date.

#2 If you want to do more of the parsing yourself (probably more efficient), you can do so by extracting the time zone first, then parsing the rest and adding a date where needed:

from datetime import datetime
from zoneinfo import ZoneInfo  # Python < 3.9: you can use backports.zoneinfo

# same input as in #1
times = ["17:22 ET", "02:41 ET", "06:14 ET",
         "Mar 10, 2021, 16:05 ET", "Mar 08, 2021, 08:00 ET", "Feb 26, 2021, 11:23 ET"]

# add more entries if you have time zones other than ET...
tzmapping = {'ET': ZoneInfo('US/Eastern')}

# get tuples of the input string with tz stripped off and timezone object
times_zones = [(t[:t.rfind(' ')], tzmapping[t.split(' ')[-1]]) for t in times]

# parse to datetime
dt = []
for t, z in times_zones:
    if len(t) > 5:  # time and date...
        dt.append(datetime.strptime(t, '%b %d, %Y, %H:%M').replace(tzinfo=z))
    else:  # time only: combine today's date in that zone with the parsed time
        dt.append(datetime.combine(datetime.now(z).date(),
                                   datetime.strptime(t, '%H:%M').time()).replace(tzinfo=z))

for t, dtobj in zip(times, dt):
    print(f"{t:>22} -> {dtobj}")
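
Once the stamps are timezone-aware datetime objects, the "ordered format" mentioned at the start of this answer could look like the sketch below. This is just one option, not something the question asked for explicitly: the fixed UTC-5 offset is an assumption that stands in for ZoneInfo('US/Eastern') outside daylight saving time (to keep the example dependency-free), and the names EASTERN_STANDARD and utc_rows are made up.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for ZoneInfo('US/Eastern') outside daylight saving time (UTC-5).
# The fixed offset is an assumption made only for this sketch.
EASTERN_STANDARD = timezone(timedelta(hours=-5), "EST")

raw = ["Mar 10, 2021, 16:05", "Feb 26, 2021, 11:23", "Mar 08, 2021, 08:00"]
aware = [datetime.strptime(s, "%b %d, %Y, %H:%M").replace(tzinfo=EASTERN_STANDARD)
         for s in raw]

# Normalise to UTC and render as ISO 8601 text (year-month-day first),
# so that plain string comparison matches chronological order.
utc_rows = [dt.astimezone(timezone.utc).isoformat() for dt in aware]

for row in sorted(utc_rows):
    print(row)
```

Strings in this form can be stored in a text column and ordered directly, and datetime.fromisoformat turns them back into datetime objects.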


 