How to extract multiple times from the same string in Python?

I'm trying to extract times from strings that also contain text other than the times. An example is s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'.

I've tried using the datefinder module like this:

from datetime import datetime as dt
import datefinder as dfn
for m in dfn.find_dates(s):
    print(dt.strftime(m, "%H:%M:%S"))

Which gives me this:

17:58:00

In this case the time "06:00" is missed. Now if I try without datefinder, using only the datetime module like this:

dt.strftime(s, "%H:%M")

It tells me that the input must already be a datetime object, not a string, with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'str'

So I tried to use the dateutil module to parse the string s into a datetime object:

from dateutil.parser import parse
parse(s)

But now it says that my string is not in a proper format (and in most cases it will not be in any fixed format), showing me this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1358, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '12/Jul/2019 12/Aug/2019 MEISHAN BRIDGE 06:00 17:58')

I have thought of getting the times with a regex like this:

import re
p = r"\d{2}\:\d{2}"
times = [i.group() for i in re.finditer(p, s)]
# Gives me ['06:00', '17:58']

But doing it this way means I have to check again whether the chunks matched by the regex are actually times, because even "99:99" would match the regex and be wrongly reported as a time. Is there any workaround without regex to get all the times from a single string?

Please note that the string may or may not contain a date, but it will always contain a time. Even if it contains a date, the date format could be anything, and the string may or may not contain other irrelevant text.

I don't see many options here, so I would go with a heuristic. I would run the following against the whole dataset and extend the config/regexes until it covers all or most of the cases:

import re
import logging
from datetime import datetime as dt

s = 'Dates : 12/Jul/2019 12/08/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58:59'


SUPPORTED_DATE_FMTS = {
    re.compile(r"(\d{2}/\w{3}/\d{4})"): "%d/%b/%Y",
    re.compile(r"(\d{2}/\d{2}/\d{4})"): "%d/%m/%Y",
    re.compile(r"(\d{2}/\w{3}\w+/\d{4})"): "%d/%B/%Y",
    # Capture more here
}

SUPPORTED_TIME_FMTS = {
    # (?!:) keeps "17:58" out of "17:58:59" and still matches at end of string
    re.compile(r"((?:[0-1][0-9]|2[0-3]):[0-5][0-9])(?!:)"): "%H:%M",
    re.compile(r"((?:[0-1][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9])"): "%H:%M:%S",
    # Capture more here
}


def extract_supported_dt(config, s):
    """
    Loop thru the given config (keys are regexes, values are date/time format)
    and attempt to gather all valid data.
    """
    valid_data = []
    for regex, fmt in config.items():
        # Extract what you think looks like date
        valid_ish_data = regex.findall(s)
        if not valid_ish_data:
            continue
        print("Checking " + str(valid_ish_data))

        # validate it
        for d in valid_ish_data:
            try:
                valid_data.append(dt.strptime(d, fmt))
            except ValueError:
                pass

    return valid_data


# Handle dates
dates = extract_supported_dt(SUPPORTED_DATE_FMTS, s)
# Handle times
times = extract_supported_dt(SUPPORTED_TIME_FMTS, s)

print("Found dates: ")
for date in dates:
    print("\t" + str(date.date()))

print("Found times: ")
for t in times:
    print("\t" + str(t.time()))

Example output:

Checking ['12/Jul/2019']
Checking ['12/08/2019']
Checking ['06:00']
Checking ['17:58:59']
Found dates:
    2019-07-12
    2019-08-12
Found times:
    06:00:00
    17:58:59

This is a trial-and-error approach, but I do not think there is an alternative in your case. My goal here is therefore to make it as easy as possible to add support for more date/time formats, rather than to find a solution that covers 100% of the data on day one. This way, the more data you run it against, the more complete your config becomes.
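As a rough illustration of what extending the config could look like, the sketch below adds two more date patterns and runs them through a compact version of the same regex-then-strptime loop. The names EXTRA_DATE_FMTS and extract, and the two formats themselves, are assumptions made for the example; they are not taken from the question's data.

```python
import re
from datetime import datetime

# Hypothetical extra entries: these formats are guesses at what more data
# might contain, used only to show how the config grows.
EXTRA_DATE_FMTS = {
    re.compile(r"(\d{4}-\d{2}-\d{2})"): "%Y-%m-%d",
    re.compile(r"(\d{2}\.\d{2}\.\d{4})"): "%d.%m.%Y",
}


def extract(config, text):
    """Return every regex match that also parses with its paired format."""
    found = []
    for regex, fmt in config.items():
        for candidate in regex.findall(text):
            try:
                found.append(datetime.strptime(candidate, fmt))
            except ValueError:
                # Looked like a date but is not one, e.g. month 13.
                pass
    return found


sample = "Shipped 2019-07-12, delivered 15.08.2019, bogus 2019-13-40"
extra_dates = [d.date().isoformat() for d in extract(EXTRA_DATE_FMTS, sample)]
print(extra_dates)  # ['2019-07-12', '2019-08-15']
```

The regex only has to be loose enough to find candidates; strptime is what rejects impossible values such as 2019-13-40.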

One thing to note is that you will have to detect strings that appear to contain no dates and log them somewhere. Later you will need to review them manually and see whether something that was missed could be captured.
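The same idea applies to times. A minimal sketch of the "log what you could not match" step might look like this, assuming the standard logging module is acceptable as the place to record misses; the logger name, TIME_RE and times_or_log are hypothetical names for the example.

```python
import logging
import re
from datetime import datetime

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("time_extractor")  # hypothetical logger name

# HH:MM not glued to other digits or colons on either side.
TIME_RE = re.compile(r"(?<![\d:])((?:[01]\d|2[0-3]):[0-5]\d)(?![\d:])")


def times_or_log(text):
    """Return HH:MM times found in text; log the text if none are found."""
    times = [datetime.strptime(m, "%H:%M").time() for m in TIME_RE.findall(text)]
    if not times:
        log.warning("No time recognised in: %r", text)
    return times


rows = [
    "Loc : MEISHAN BRIDGE, Time : 06:00 17:58",
    "Loc : MEISHAN BRIDGE, Time : 6am",
]
results = [times_or_log(row) for row in rows]
# Rows that need a manual look, and possibly a new config entry.
unmatched = [row for row, found in zip(rows, results) if not found]
```

Here the second row ends up in unmatched and in the log, which is the cue to decide whether a format like "6am" deserves its own pattern.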

Now, assuming that your data is generated by another system, sooner or later you will be able to match 100% of it. If the data is entered by humans, you will probably never reach 100% (people tend to make spelling mistakes and sometimes enter random stuff... date=today :) ).

Use a regex, but something like this:

[0-1][0-9]:[0-5][0-9]|2[0-3]:[0-5][0-9]

This matches:

00:00, 00:59, 01:00, 01:59, 02:00, 02:59, 09:00, 10:00, 11:59, 20:00, 21:59, 23:59

It does not match:

99:99, 23:99, 01:99

You can check on Repl.it whether it works for you.
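One way to check a pattern like this locally is to run it over lists of values that should and should not match. The sketch below writes the pattern as a raw string with a plain colon and uses fullmatch so that partial hits do not count; the extra "24:00" case and the variable names are additions for the example.

```python
import re

# Hours 00-19 or 20-23, then minutes 00-59.
TIME_RE = re.compile(r"[0-1][0-9]:[0-5][0-9]|2[0-3]:[0-5][0-9]")

should_match = ["00:00", "00:59", "01:59", "09:00", "11:59", "20:00", "23:59"]
should_not_match = ["99:99", "23:99", "01:99", "24:00"]

accepted = [t for t in should_match if TIME_RE.fullmatch(t)]
rejected = [t for t in should_not_match if not TIME_RE.fullmatch(t)]

# Caveat: without boundaries, a search inside free text can match part of a
# longer digit run.
partial = TIME_RE.findall("123:45")  # ['23:45']
```

If the pattern is used with findall or finditer on free text, adding lookarounds such as (?<!\d) and (?!\d) around it avoids the partial match shown in the last line.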


If you only need the time, this regex should work fine:

r"(?:[01][0-9]|2[0-3]):[0-5][0-9]"

If there may be spaces inside the time, as in 23 : 59, use this instead:

r"(?:[01][0-9]|2[0-3])\s*:\s*[0-5][0-9]"

You could use a dictionary:

my_dict = {}

for i in s.split(', '):
    m = i.strip().split(' : ', 1)
    my_dict[m[0]] = m[1].split()


my_dict
Out: 
{'Dates': ['12/Jul/2019', '12/Aug/2019'],
 'Loc': ['MEISHAN', 'BRIDGE'],
 'Time': ['06:00', '17:58']}
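Building on the dictionary idea, the tokens under the 'Time' key could then be validated with datetime.strptime, which addresses the "99:99" concern from the question. This is only a sketch and it assumes the string always follows the "Key : value, Key : value" layout, which the question says is not guaranteed; fields and valid_times are names chosen for the example.

```python
from datetime import datetime

s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'

# Split "Key : value" pairs; partition avoids an IndexError if ' : ' is missing.
fields = {}
for part in s.split(', '):
    key, _, value = part.strip().partition(' : ')
    fields[key] = value.split()

# Keep only tokens that parse as a real HH:MM time.
valid_times = []
for token in fields.get('Time', []):
    try:
        valid_times.append(datetime.strptime(token, '%H:%M').time())
    except ValueError:
        pass  # not a valid time, e.g. "99:99"

print(valid_times)  # [datetime.time(6, 0), datetime.time(17, 58)]
```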
