Matching multiple possibilities with regex in Python
I am trying to process a log file using Python, extract the date, time and log message of each entry, and store them in a list of dicts. I am using the re.search() and group() methods for this purpose.
The problem is that the date/time takes various formats, such as:
dd/mm/yy, hh:mm AM - logs
dd/mm/yyyy, hh:mm a.m. - logs
dd/mm/yy HH:mm - logs
My program looks something like this:
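import re

loglist = []
with open('logfile.txt', 'r') as infile:
    for aline in infile:
        # Matches only the first format: "dd/mm/yy, hh:mm AM - logs".
        # The message group is greedy; a lazy (.*?) at the end of the
        # pattern would always capture an empty string.
        line = re.search(r'^(\d?\d/\d?\d/\d\d), (\d?\d:\d?\d \w\w) - (.*)', aline)
        if line:
            # Build a new dict per entry so the list does not end up
            # holding many references to one shared dict.
            logdict = {
                'date': line.group(1),
                'time': line.group(2),
                'logmsg': line.group(3),
            }
            loglist.append(logdict)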
However, this matches only the first of the above-mentioned formats.
How can I match the other formats as well and still keep the groups? Or is there an easier method of doing this?
You can use {m,n} after a pattern to indicate that there can be between m and n repetitions. So use \d{1,2} to indicate 1 or 2 digits. And you can use an alternation to indicate multiple possibilities, e.g. \d{2}|\d{4} for 2- or 4-digit years.
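As a small illustration of these two constructs, re.fullmatch can be used to check whether a whole string fits a pattern. The sample strings and the fits helper are made up for the demonstration:

```python
import re

# \d{1,2} accepts one or two digits, nothing more.
one_or_two_digits = re.compile(r"\d{1,2}")
# Alternation: a year written with either two or four digits.
year = re.compile(r"\d{2}|\d{4}")


def fits(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


for sample in ["5", "15", "150"]:
    print(sample, fits(one_or_two_digits, sample))

for sample in ["01", "2001", "201"]:
    print(sample, fits(year, sample))
```

Note that alternatives are tried left to right, so with re.match or re.search the bare pattern \d{2}|\d{4} stops after the first two digits of "2001". Inside a larger pattern, the text that has to follow the year forces the engine to backtrack to the four-digit branch.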
So the regexp can be:
^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? (\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)
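A quick way to check this pattern is to run it over one sample line per format. The sample lines and the parse_line helper below are invented for the test, not taken from a real log:

```python
import re

LOG_LINE = re.compile(
    r"^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? "
    r"(\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)"
)

# Hypothetical sample lines, one per format from the question.
samples = [
    "01/02/21, 9:05 AM - server started",
    "01/02/2021, 9:05 a.m. - server started",
    "01/02/21 21:05 - server stopped",
]


def parse_line(line):
    """Return a dict for a matching line, or None otherwise."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "date": match.group(1),
        "time": match.group(2),
        "logmsg": match.group(3),
    }


# Keep only the lines that matched, each as its own dict.
loglist = [entry for entry in map(parse_line, samples) if entry is not None]
for entry in loglist:
    print(entry)
```

The three capture groups stay in the same positions for every format, so the group(1), group(2), group(3) calls from the question keep working unchanged.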
I would first extract the data with a regex and then validate it manually. I wouldn't use the regex for two things at once, validation and extraction.
For clarity I would also assign names to these regexes and make sure that each individual regex returns an atom, such as a time, a date or am_pm, and then string them together to form the sentence. Note: I have not assigned names to the groups; I think it's possible, but I'm not sure how.
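On the point about naming groups: Python's re module supports this through the (?P<name>...) syntax, and the match object then offers group("name") and groupdict(). A minimal sketch, in which the group names, the simplified am/pm pattern and the sample line are chosen for illustration only:

```python
import re

# Each fragment captures one "atom" under a descriptive name.
date_ptrn = r"(?P<date>\d+/\d+/\d+)"
time_ptrn = r"(?P<time>\d+:\d+)"
am_pm_ptrn = r"(?P<am_pm>[ap]\.?m\.?)?"  # optional; simplified for the sketch
log_msg_ptrn = r"(?P<log_msg>.*)"

# A raw f-string keeps \s literal while substituting the fragments.
log_record_ptrn = re.compile(
    rf"^{date_ptrn},?\s+{time_ptrn}\s*{am_pm_ptrn}\s*-\s+{log_msg_ptrn}$",
    re.IGNORECASE,
)

match = log_record_ptrn.search("5/15/2001, 11:23 a.m. - logs")
if match:
    print(match.group("date"))   # access a single group by name
    print(match.groupdict())     # all named groups as a dict
```

Because groupdict() already returns a dict keyed by group name, it can be appended directly to the list of log entries.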
However, in the end you could take your date_time and split it, for example with date_time.split("/"), which returns the day, month and year that you can then validate or use.
import re

log_records = ["10/10/1960, 10:50 AM - logs",
               "5/15/2001, 23:11 a.m. - logs",
               "50/100/1069 300:100 - logs"]

# Each pattern captures one "atom"; they are combined further down.
date_month_year_ptrn = r"((\d+/){2,2}\d+)"   # groups 1 and 2
time_ptrn = r"(\d+:\d+)"                     # group 3
morning_evening_ptrn = r"((\w+\.?)+)?"       # groups 4 and 5
everything_else_ptrn = r"(.*)"               # group 6

# Raw string so that \s reaches the regex engine untouched.
log_record_ptrn = r"^{date_ptrn},?\s+{time_ptrn}\s+{morn_even_ptrn}\s*-\s+{log_msg}$"
log_record_ptrn = log_record_ptrn.format(date_ptrn=date_month_year_ptrn,
                                         time_ptrn=time_ptrn,
                                         morn_even_ptrn=morning_evening_ptrn,
                                         log_msg=everything_else_ptrn)


def extract_log_record_from_match(log_record_match):
    """Return (date, time, am_pm, log_message), or None if nothing matched."""
    if log_record_match:
        # I am pretty sure you can attach names to these numbers
        # but not sure how to do this.
        # Groups 2 and 5 are the inner repetition groups and are skipped.
        date_time = log_record_match.group(1)
        time_ = log_record_match.group(3)
        am_pm = log_record_match.group(4)
        log_message = log_record_match.group(6)
        return date_time, time_, am_pm, log_message
    return None


def print_records(records):
    """Print every record that was parsed successfully."""
    for record in records:
        if record:
            print(record)


parsed_records = []
for log_record in log_records:
    log_record_match = re.search(log_record_ptrn, log_record, re.IGNORECASE)
    parsed_records.append(extract_log_record_from_match(log_record_match))

print_records(parsed_records)
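The manual validation step mentioned earlier (splitting the date on "/" and checking the parts) might look like the sketch below. It assumes the day/month/year order from the question and, as a further assumption, treats two-digit years as 20yy; validate_date is a hypothetical helper name.

```python
from datetime import date


def validate_date(date_text):
    """Return a datetime.date for a valid 'dd/mm/yy' or 'dd/mm/yyyy'
    string, or None when the parts do not form a real calendar date."""
    parts = date_text.split("/")
    if len(parts) != 3 or not all(part.isdecimal() for part in parts):
        return None
    day, month, year = (int(part) for part in parts)
    if len(parts[2]) == 2:
        year += 2000  # assumption: two-digit years mean 20yy
    try:
        return date(year, month, day)  # raises ValueError when out of range
    except ValueError:
        return None


for text in ["10/10/1960", "5/15/2001", "50/100/1069"]:
    print(text, validate_date(text))
```

Letting datetime.date do the range checks covers month lengths and leap years, which the deliberately loose \d+ patterns leave unchecked.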