Matching multiple possibilities with regular expressions in Python

Question

I am trying to process a log file using Python, extract the date, time and log message of each entry, and store them in a list of dicts. I am using the re.search() and group() methods for this purpose.
The problem is that the date/time takes various formats, such as:

dd/mm/yy, hh:mm AM - logs
dd/mm/yyyy, hh:mm a.m. - logs
dd/mm/yy HH:mm - logs

My program looks something like this:

import re

loglist = []
with open('logfile.txt', 'r') as infile:
    for aline in infile:
        line = re.search(r'^(\d?\d/\d?\d/\d\d), (\d?\d:\d?\d \w\w) - (.*)', aline)
        if line:
            # Build a new dict per entry so the list items do not share state
            logdict = {}
            logdict['date'] = line.group(1)
            logdict['time'] = line.group(2)
            logdict['logmsg'] = line.group(3)
            loglist.append(logdict)

However, this matches only the first of the formats above.
How can I match the other formats as well while keeping the groups? Or is there an easier way to do this?

Answer 1

You can use {m,n} after a pattern to indicate between m and n repetitions, so \d{1,2} matches 1 or 2 digits. You can also use alternation to indicate multiple possibilities, e.g. \d{2}|\d{4} for 2- or 4-digit years.
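A small sketch of these two constructs in isolation may help (the variable name year is only for illustration). Note that alternation is tried left to right, so \d{2}|\d{4} only reaches the 4-digit branch when something after it, or a full-string match, forces the regex engine to backtrack:

```python
import re

# \d{1,2} accepts one or two digits, but not three.
one_digit_ok = bool(re.fullmatch(r'\d{1,2}', '7'))
three_digits_ok = bool(re.fullmatch(r'\d{1,2}', '123'))
print(one_digit_ok, three_digits_ok)

# Alternatives are tried in order: an unanchored match stops at the
# first branch that succeeds, so only two digits are consumed here.
year = re.compile(r'\d{2}|\d{4}')
prefix_only = year.match('2017').group()

# fullmatch must consume the whole string, which forces the 4-digit branch.
whole_year = year.fullmatch('2017').group()
print(prefix_only, whole_year)
```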

So the regexp can be:

^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? (\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)
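As a minimal sketch of how this pattern could slot into the loop from the question: the sample lines and the name LOG_RE below are made up for illustration, and a fresh dict is built for every line so the entries stay independent.

```python
import re

# The pattern from this answer; LOG_RE is a hypothetical name.
LOG_RE = re.compile(
    r'^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? '
    r'(\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)'
)

# Made-up sample lines, one per format listed in the question.
sample_lines = [
    "09/01/17, 8:38 AM - first message",
    "09/01/2017, 8:38 a.m. - second message",
    "09/01/17 20:38 - third message",
]

loglist = []
for aline in sample_lines:
    m = LOG_RE.search(aline)
    if m:
        # The three groups are the same as in the question's pattern.
        loglist.append({
            'date': m.group(1),
            'time': m.group(2),
            'logmsg': m.group(3),
        })

for entry in loglist:
    print(entry)
```

The optional AM/PM part stays inside group 2, so the time group holds either a 24-hour time or a 12-hour time with its suffix.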

Answer 2

I would first extract the data with a regex and then validate it manually. I wouldn't use the regex for two things at once, validation and extraction.

For clarity I would also assign names to these regexes and make sure that each individual regex returns an atom such as a time, a date or am_pm, and then string them together to form the sentence. Note: I have not assigned names to the groups; I think it is possible but I am not sure how.

However, in the end you could take your date_time and split it, e.g. date_time.split("/"), which would give you the day, month and year, which you can then validate or use.

import re

log_records = ["10/10/1960, 10:50 AM - logs",
               "5/15/2001, 23:11 a.m. - logs",
               "50/100/1069 300:100 - logs"]
parsed_records = []

date_month_year_ptrn = r"((\d+/){2,2}\d+)"
time_ptrn = r"(\d+:\d+)"
morning_evening_ptrn = r"((\w+\.?)+)?"
everything_else_ptrn = r"(.*)"

log_record_ptrn = r"^{date_ptrn},?\s+{time_ptrn}\s+{morn_even_ptrn}\s*-\s+{log_msg}$"
log_record_ptrn = log_record_ptrn.format(date_ptrn=date_month_year_ptrn,
                                         time_ptrn=time_ptrn,
                                         morn_even_ptrn=morning_evening_ptrn,
                                         log_msg=everything_else_ptrn)

def extract_log_record_from_match(matcher):
    # Use the match object passed in rather than a global variable
    if matcher:
        # I am pretty sure you can attach names to these numbers
        # but not sure how to do this
        date_time = matcher.group(1)
        time_ = matcher.group(3)
        am_pm = matcher.group(4)
        log_message = matcher.group(6)
        return date_time, time_, am_pm, log_message
    return None

def print_records(records):
    # Iterate over the argument, not the module-level list
    for record in records:
        if record:
            print(record)


for log_record in log_records:
    log_record_match = re.search(log_record_ptrn, log_record, re.IGNORECASE)
    parsed_records.append(extract_log_record_from_match(log_record_match))

print_records(parsed_records)
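
On the open point about naming groups: Python's re module supports named groups through the (?P<name>...) syntax, and match.groupdict() returns them as a dict. The following is a rough sketch rather than this answer's own code. The names LOG_RECORD_RE, parse_record and is_valid_date are hypothetical, the AM/PM sub-pattern is narrowed to [ap]\.?m\.? as an assumption, and the validation step assumes day/month/year order.

```python
import re
from datetime import datetime

# Each atom gets a name via (?P<name>...); all names here are illustrative.
LOG_RECORD_RE = re.compile(
    r"^(?P<date>\d+/\d+/\d+),?\s+"
    r"(?P<time>\d+:\d+)\s*"
    r"(?P<am_pm>[ap]\.?m\.?)?\s*-\s+"   # assumption: only AM/PM-like suffixes
    r"(?P<log_msg>.*)$",
    re.IGNORECASE,
)


def parse_record(line):
    """Extraction only: return a dict of named fields, or None if no match."""
    match = LOG_RECORD_RE.search(line)
    return match.groupdict() if match else None


def is_valid_date(date_text):
    """Validation, kept separate from extraction (assumes day/month/year)."""
    try:
        day, month, year = date_text.split("/")
        fmt = "%d/%m/%Y" if len(year) == 4 else "%d/%m/%y"
        datetime.strptime(date_text, fmt)
        return True
    except ValueError:
        return False


log_records = ["10/10/1960, 10:50 AM - logs",
               "5/15/2001, 23:11 a.m. - logs",
               "50/100/1069 300:100 - logs"]

for log_record in log_records:
    record = parse_record(log_record)
    if record:
        print(record, "valid date:", is_valid_date(record["date"]))
```

All three sample records are extracted, but only the first passes the date check, which is the separation of extraction and validation described above.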

Answer 1: 2, 2017-01-09 08:38:41

Answer 2: 0, 2017-01-09 09:28:58
