
Matching multiple possibilities with regex in Python

I am trying to process a log file using Python, extract the date, time and log message of each entry, and store them in a list of dicts. I am using the re.search() and group() methods for this purpose.
The problem is that the date/time takes various formats, such as:

dd/mm/yy, hh:mm AM - logs
dd/mm/yyyy, hh:mm a.m. - logs
dd/mm/yy HH:mm - logs

My program looks something like this:

import re

loglist = []
with open('logfile.txt', 'r') as infile:
    for aline in infile:
        line = re.search(r'^(\d?\d/\d?\d/\d\d), (\d?\d:\d?\d \w\w) - (.*)', aline)
        if line:
            # Build a new dict per entry; reusing a single dict would make
            # every list element refer to the same (last) record.
            logdict = {
                'date': line.group(1),
                'time': line.group(2),
                'logmsg': line.group(3),
            }
            loglist.append(logdict)

However, this matches only the first of the above-mentioned formats.
How can I match the other formats as well while keeping the groups? Or is there an easier way of doing this?

You can use {m,n} after a pattern to indicate that it may repeat between m and n times, so \d{1,2} matches 1 or 2 digits. You can also use alternation to indicate multiple possibilities, e.g. \d{2}|\d{4} for 2- or 4-digit years.

So the regexp can be:

^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? (\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)
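
As a quick check, the pattern can be run against one sample line per format. This is only a minimal sketch: the sample log lines and the names LOG_LINE_RE and parse_line are made up for illustration and do not come from the question.

```python
import re

# The pattern from the answer, split over two lines for readability.
LOG_LINE_RE = re.compile(
    r'^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? '
    r'(\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)'
)

def parse_line(line):
    """Return a dict with date, time and logmsg, or None if the line does not match."""
    match = LOG_LINE_RE.search(line)
    if match is None:
        return None
    return {
        'date': match.group(1),
        'time': match.group(2),
        'logmsg': match.group(3),
    }

# Hypothetical sample lines, one per format listed in the question.
samples = [
    "12/05/21, 9:30 AM - server started",
    "12/05/2021, 9:30 a.m. - disk check",
    "12/05/21 21:30 - backup done",
]

for sample in samples:
    print(parse_line(sample))
```

The AM/PM part sits inside the time group as an optional non-capturing group, so the group numbers stay 1, 2 and 3 for all three formats.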

I would first extract the data with a regex and then validate it manually. I wouldn't use the regex for both things, validation and extraction.

For clarity, I would also assign names to these regexes and make sure that each individual regex returns one atom, such as a time, a date or am_pm, and then string them together to form the full pattern. Note: I have not assigned names to the groups; I think it is possible, but I am not sure how.

In the end, however, you can take your date_time and split it, e.g. date_time.split("/"), which returns the day, month and year that you can then validate or use.
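
The split-and-validate step might look like the sketch below. The function name validate_date is hypothetical, and the day/month/year order is an assumption based on the dd/mm/yy formats in the question; two-digit years are returned as-is without guessing a century.

```python
def validate_date(date_text):
    """Split a 'dd/mm/yyyy' or 'dd/mm/yy' string and check the numeric ranges.

    Returns (day, month, year) as ints, or None if the value is implausible.
    """
    try:
        # Raises ValueError if there are not exactly three parts
        # or if any part is not an integer.
        day, month, year = (int(part) for part in date_text.split("/"))
    except ValueError:
        return None
    # Coarse range check only; it does not know about month lengths.
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return None
    return day, month, year

print(validate_date("10/10/1960"))   # (10, 10, 1960)
print(validate_date("50/100/1069"))  # None
```

This check is deliberately coarse. For stricter checking (for example rejecting 31/02), datetime.datetime.strptime with a format such as "%d/%m/%Y" could be used instead, since it raises ValueError for dates that do not exist.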

import re

log_records = ["10/10/1960, 10:50 AM - logs",
               "5/15/2001, 23:11 a.m. - logs",
               "50/100/1069 300:100 - logs"]
parsed_records = []

date_month_year_ptrn = r"((\d+/){2,2}\d+)"
time_ptrn = r"(\d+:\d+)"
morning_evening_ptrn = r"((\w+\.?)+)?"
everything_else_ptrn = r"(.*)"

log_record_ptrn = r"^{date_ptrn},?\s+{time_ptrn}\s+{morn_even_ptrn}\s*-\s+{log_msg}$"
log_record_ptrn = log_record_ptrn.format(date_ptrn=date_month_year_ptrn,
                                         time_ptrn=time_ptrn,
                                         morn_even_ptrn=morning_evening_ptrn,
                                         log_msg=everything_else_ptrn)

def extract_log_record_from_match(matcher):
    # Use the match object that was passed in, not a global variable.
    if matcher:
        # I am pretty sure you can attach names to these numbers
        # but not sure how to do this
        # Groups 2 and 5 are the inner repeated groups, so they are skipped.
        date_time = matcher.group(1)
        time_ = matcher.group(3)
        am_pm = matcher.group(4)
        log_message = matcher.group(6)
        return date_time, time_, am_pm, log_message
    return None

def print_records(records):
    # Iterate over the argument rather than the global list.
    for record in records:
        if record:
            print(record)


for log_record in log_records:
    log_record_match = re.search(log_record_ptrn, log_record, re.IGNORECASE)
    parsed_records.append(extract_log_record_from_match(log_record_match))

print_records(parsed_records)
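
On the question of naming the groups: Python's re module supports named groups with the (?P<name>...) syntax, and match.groupdict() returns them as a dict. The sketch below rewrites the pattern above that way. The group names (date, time, am_pm, msg) and the names NAMED_LOG_RE and parse_named are chosen here for illustration, and the inner repeated groups are made non-capturing so that only the named groups are reported.

```python
import re

# Each sub-pattern returns one named "atom".
date_ptrn = r"(?P<date>(?:\d+/){2}\d+)"
time_ptrn = r"(?P<time>\d+:\d+)"
am_pm_ptrn = r"(?P<am_pm>(?:\w+\.?)+)?"
msg_ptrn = r"(?P<msg>.*)"

NAMED_LOG_RE = re.compile(
    rf"^{date_ptrn},?\s+{time_ptrn}\s+{am_pm_ptrn}\s*-\s+{msg_ptrn}$",
    re.IGNORECASE,
)

def parse_named(line):
    """Return a dict keyed by group name, or None if the line does not match."""
    match = NAMED_LOG_RE.search(line)
    return match.groupdict() if match else None

print(parse_named("10/10/1960, 10:50 AM - logs"))
# {'date': '10/10/1960', 'time': '10:50', 'am_pm': 'AM', 'msg': 'logs'}
```

A group that does not take part in the match, such as am_pm on a 24-hour line, shows up as None in the dict, which makes the later validation step straightforward.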
