Why is PyParsing taking so much longer to parse vs RegEx? Is it because it creates objects instead of dicts?
Is PyParsing slower than RegEx because it creates objects instead of dicts? And if so, can this be improved?
I have an output of almost 400,000 lines that describes 40,000 items in a router table.
I have two parsers, one written in PyParsing and one in RegEx, that do the same task. The difference in performance is around 1:15 to 1:18 in favor of RegEx, and ParserElement.enablePackrat() makes things worse.
I suspect that PyParsing is working "harder" because it generates objects, while RegEx generates dicts.
Did I miss something in the PyParsing grammar that makes it run slower than it should?
Is PyParsing intended for output at this scale?
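The two listings below report different entry counts (7000 and 700), so time per entry is a fairer basis for comparison than raw elapsed time. The following is a minimal sketch of such a measurement; the helper name `time_per_entry` and the stand-in parser `split_entries` are hypothetical and only illustrate the idea, they are not part of either parser.

```python
import time


def time_per_entry(parse_func, text, repeats=3):
    """Return (best_seconds, entry_count, seconds_per_entry) for parse_func(text).

    parse_func is any callable that returns a sized collection of parsed entries.
    The best of several runs is used to reduce timing noise.
    """
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        result = parse_func(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        count = len(result)
    per_entry = best / count if count else float("nan")
    return best, count, per_entry


def split_entries(text):
    # Stand-in parser for illustration only: one entry per blank-line-separated block.
    return [block for block in text.split("\n\n") if block]


sample = "entry one\nline\n\nentry two\nline\n\n" * 1000
seconds, entries, per_entry = time_per_entry(split_entries, sample)
print(f"{entries} entries in {seconds:.6f}s ({per_entry * 1e6:.2f} us per entry)")
```

Passing each real parser as `parse_func` with the same input text would give directly comparable per-entry figures.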
Footnote: I am using the parsers as part of an automation framework. My main goal is to provide users and future maintainers with code that is easy to understand, use, and maintain. PyParsing allows for that, and I prefer using it. In 99% of cases the number of lines to parse is not this high, so PyParsing is the preferred tool. Following @PaulMcG's reply, I will look into refining the parser.
RegEx
Output
started parse
finished parse, took 0.40858237000065856s, processed 7000 entries
RegEx
Code
import re
import time
output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
bundle-201.1, Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
bundle-201.1 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
lo5, Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
bundle-1215, Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
bundle-1218, Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000
sg_line_pattern = re.compile(r'\((?P<source>[\d\.\*]+),'
r' (?P<group>[\d\.]+)\)'
r' (?P<group_type>\w+),'
r' Uptime: (?P<uptime>[\d:\.]+)'
r'(, RP-Address: (?P<rp_address>[\d\.]+))?')
join_prune_line_pattern = re.compile(r'Upstream Join/Prune: (?P<upstream_join_prune>.*?),'
r' RPF: (?P<rpf>[\d\.\*]+),'
r' Flags:(?P<flags>.*)')
iif_lines_pattern = re.compile(r'Incoming Interface:(?P<iif_lines>\n(.|\n)*?)Output Interface List:')
oil_lines_pattern = re.compile(r'Output Interface List:(?P<oil_lines>\n(.|\n)*)')
iif_oil_line_pattern = re.compile(r'\s*(?P<interface>[\w\d\-\./]+)(,)?'
r'\s+Uptime: (?P<uptime>[\d:\.]+),'
r' status: (?P<status>\w+),'
r'( (?P<join_prune_state>[\w\s\d():]+),)?'
r'\s+Flags:(?P<flags>.*)')
print(f"started parse")
start_time = time.monotonic()
output = output.split("\n\n")
output = [entry for entry in output if entry]
result = []
for entry in output:
    iifs = None
    oils = None
    sg_line = re.search(sg_line_pattern, entry).groupdict()
    join_prune_line = re.search(join_prune_line_pattern, entry).groupdict()
    iif_lines = re.search(iif_lines_pattern, entry)
    oil_lines = re.search(oil_lines_pattern, entry)
    if iif_lines:
        iif_lines = iif_lines.groupdict()
        iifs = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, iif_lines['iif_lines'])]
        iifs = {entry["interface"]: entry for entry in iifs}
    if oil_lines:
        oil_lines = oil_lines.groupdict()
        oils = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, oil_lines['oil_lines'])]
        oils = {entry["interface"]: entry for entry in oils}
    group = sg_line['group']
    source = sg_line['source']
    entry_dict = {**sg_line, **join_prune_line, "iifs": iifs, "oils": oils}
    result.append(entry_dict)
end_time = time.monotonic()
print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('RegEx')
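To make the "RegEx generates dicts" point concrete, here is a minimal sketch of what the interface-line pattern from the code above yields for two of the sample lines. The pattern is copied unchanged; the variable names `with_join` and `without_join` are only for illustration.

```python
import re

# Same interface-line pattern as in the RegEx parser above.
iif_oil_line_pattern = re.compile(r'\s*(?P<interface>[\w\d\-\./]+)(,)?'
                                  r'\s+Uptime: (?P<uptime>[\d:\.]+),'
                                  r' status: (?P<status>\w+),'
                                  r'( (?P<join_prune_state>[\w\s\d():]+),)?'
                                  r'\s+Flags:(?P<flags>.*)')

# One line with a JOIN(...) state and one without it.
with_join = iif_oil_line_pattern.search(
    "bundle-112, Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:"
).groupdict()
without_join = iif_oil_line_pattern.search(
    "bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I"
).groupdict()

print(with_join)
print(without_join)
```

Each match is a plain dict keyed by group name. The optional `join_prune_state` group is `None` when absent, and `flags` keeps the leading space that follows `Flags:`, so callers may want to strip it.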
PyParsing
Output:
started parse
finished parse, took 1.3630201699997997s, processed 700 entries
PyParsing
Code:
from pyparsing import Word, Keyword, nums, OneOrMore, Optional, Suppress, Literal, alphanums, LineEnd, \
Group, SkipTo, ParserElement, Dict
import time
output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
bundle-201.1, Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
bundle-201.1 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
lo5, Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
bundle-1215, Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
bundle-1218, Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000
# ParserElement.enablePackrat()
ParserElement.setDefaultWhitespaceChars(" \t")
SkipToNL = Suppress(SkipTo(LineEnd()) + LineEnd())
IpAddress = Word(nums + '.')
ParserUptime = Word(nums + ':.')
SgLine = (Suppress(Literal('(')) + (IpAddress | Literal('*'))('source') + Suppress(Literal(',')) +
IpAddress('group') + Literal(')') +
Word(alphanums)('group_type') + Literal(',') +
Keyword('Uptime:') + ParserUptime('group_uptime') +
Optional(Keyword(', RP-Address:') + IpAddress('rp_address')) +
SkipToNL)
SgFlagsLine = (Keyword('Upstream Join/Prune:') + SkipTo(",")('upstream_join_prune') + Literal(',') +
Keyword('RPF:') + (IpAddress | Literal('*'))('rpf') + Literal(',') +
Keyword('Flags:') + SkipTo(LineEnd())('sg_flags') + LineEnd())
IifStartLine = (Keyword('Incoming Interface:') + SkipToNL)
IifOilLine = (Word(alphanums + r'-./')('interface_name') + Optional(Literal(',')) +
Keyword('Uptime:') + ParserUptime('uptime') + Literal(',') +
Keyword('status:') + Word(alphanums) + Literal(',') +
Optional(Word(alphanums + r'(): ')('join_prune_state') + Literal(',')) +
Keyword('Flags:') + SkipTo(LineEnd())('interface_flags') + LineEnd())
IifLines = Dict(OneOrMore(Group(IifOilLine)))("iif")
OilLines = Dict(OneOrMore(Group(IifOilLine)))("oil")
OilStartLine = (Literal('Output Interface List:') + SkipToNL)
grammar = OneOrMore(Group((SgLine +
SgFlagsLine +
IifStartLine +
IifLines +
OilStartLine +
OilLines +
Optional(SkipToNL))
)
)
grammar.setDefaultWhitespaceChars(" \t")
print(f"started parse")
start_time = time.monotonic()
result = grammar.parseString(output)
end_time = time.monotonic()
print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('PyParsing')
... redo some of the low-level terms using Regex instead of Word(word_chars) (though Word uses a regex internally anyway, so this is unlikely to gain much).
I do note that your terms aren't really doing much pattern matching in their parsing. For instance, using Word(nums+":") to parse a time given in the form 00:00:00 is a bit of cheating, since that term would also match "::::", ":0:0:", and any integer. The same goes for defining IpAddress as any word composed of digits and ".".
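The looseness described above can be sketched with the standard `re` module alone. The character class below behaves like a Word built from digits, ":" and ".", while the stricter patterns are assumptions based on the sample output (uptime always `HH:MM:SS`, addresses in dotted-quad form); they are not the answer author's patterns. In a pyparsing grammar, such patterns could be wrapped in the `Regex` class.

```python
import re

# Any run of digits, ':' and '.', which is what a Word over those characters accepts.
loose_time = re.compile(r"[\d:.]+")

# Assumed stricter form: exactly HH:MM:SS, as seen in the sample output.
strict_time = re.compile(r"\d{2}:\d{2}:\d{2}")

# Assumed stricter IPv4 shape: four dot-separated groups of 1-3 digits
# (the 0-255 range of each group is not checked).
strict_ip = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

for candidate in ["00:40:58", "::::", ":0:0:", "42"]:
    loose_ok = bool(loose_time.fullmatch(candidate))
    strict_ok = bool(strict_time.fullmatch(candidate))
    print(f"{candidate!r}: loose={loose_ok} strict={strict_ok}")
```

The loose pattern accepts all four candidates, while the strict one accepts only the well-formed time. If real uptimes can exceed the `HH:MM:SS` shape, the strict pattern would need widening.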