Why is PyParsing taking so much longer to parse vs RegEx? Is it because it creates objects instead of dicts?

Is PyParsing slower than RegEx because it creates objects instead of dicts? And if so, can this be improved?

I have an output of almost 400,000 lines that describes 40,000 items in a router table.

I have two parsers, one written in PyParsing and one in RegEx, that do the same task. The difference in performance is around 1:15 to 1:18 in favor of RegEx, and ParserElement.enablePackrat() makes things worse.

  1. I suspect that PyParsing is working "harder" because it generates objects, while RegEx generates dicts.

  2. Did I miss something in the PyParsing grammar that makes it run slower than it should?

  3. Is PyParsing intended for output of this scale?


Footnote: I am using the parsers as part of an automation framework. My main goal is to give users and future maintainers code that is easy to understand, use, and maintain. PyParsing allows for that, and I prefer using it. In 99% of cases the number of lines to parse is not this high, so PyParsing is the preferred tool. Following @PaulMcG's reply, I will look into refining the parser.


RegEx

Output

started parse
finished parse, took 0.40858237000065856s, processed 7000 entries
RegEx

Code

import re
import time

output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
  bundle-201.1,                 Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
  bundle-201.1                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
  lo5,                          Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
  bundle-1215,                  Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
  bundle-1218,                  Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000

sg_line_pattern = re.compile(r'\((?P<source>[\d\.\*]+),'
                             r' (?P<group>[\d\.]+)\)'
                             r' (?P<group_type>\w+),'
                             r' Uptime: (?P<uptime>[\d:\.]+)'
                             r'(, RP-Address: (?P<rp_address>[\d\.]+))?')

join_prune_line_pattern = re.compile(r'Upstream Join/Prune: (?P<upstream_join_prune>.*?),'
                                     r' RPF: (?P<rpf>[\d\.\*]+),'
                                     r' Flags:(?P<flags>.*)')

iif_lines_pattern = re.compile(r'Incoming Interface:(?P<iif_lines>\n(.|\n)*?)Output Interface List:')

oil_lines_pattern = re.compile(r'Output Interface List:(?P<oil_lines>\n(.|\n)*)')

iif_oil_line_pattern = re.compile(r'\s*(?P<interface>[\w\d\-\./]+)(,)?'
                                  r'\s+Uptime: (?P<uptime>[\d:\.]+),'
                                  r' status: (?P<status>\w+),'
                                  r'( (?P<join_prune_state>[\w\s\d():]+),)?'
                                  r'\s+Flags:(?P<flags>.*)')

print(f"started parse")
start_time = time.monotonic()

output = output.split("\n\n")
output = [entry for entry in output if entry]

result = []

for entry in output:
    iifs = None
    oils = None
    sg_line = re.search(sg_line_pattern, entry).groupdict()
    join_prune_line = re.search(join_prune_line_pattern, entry).groupdict()
    iif_lines = re.search(iif_lines_pattern, entry)
    oil_lines = re.search(oil_lines_pattern, entry)
    if iif_lines:
        iif_lines = iif_lines.groupdict()
        iifs = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, iif_lines['iif_lines'])]
        iifs = {entry["interface"]: entry for entry in iifs}
    if oil_lines:
        oil_lines = oil_lines.groupdict()
        oils = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, oil_lines['oil_lines'])]
        oils = {entry["interface"]: entry for entry in oils}

    entry_dict = {**sg_line, **join_prune_line, "iifs": iifs, "oils": oils}
    result.append(entry_dict)

end_time = time.monotonic()

print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('RegEx')

PyParsing

Output:

started parse
finished parse, took 1.3630201699997997s, processed 700 entries
PyParsing

Code:

from pyparsing import Word, Keyword, nums, OneOrMore, Optional, Suppress, Literal, alphanums, LineEnd, \
    Group, SkipTo, ParserElement, Dict
import time

output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
  bundle-201.1,                 Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
  bundle-201.1                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
  lo5,                          Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
  bundle-1215,                  Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
  bundle-1218,                  Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000

# ParserElement.enablePackrat()

ParserElement.setDefaultWhitespaceChars(" \t")

SkipToNL = Suppress(SkipTo(LineEnd()) + LineEnd())
IpAddress = Word(nums + '.')
ParserUptime = Word(nums + ':.')

SgLine = (Suppress(Literal('(')) + (IpAddress | Literal('*'))('source') + Suppress(Literal(',')) +
          IpAddress('group') + Literal(')') +
          Word(alphanums)('group_type') + Literal(',') +
          Keyword('Uptime:') + ParserUptime('group_uptime') +
          Optional(Keyword(', RP-Address:') + IpAddress('rp_address')) +
          SkipToNL)
SgFlagsLine = (Keyword('Upstream Join/Prune:') + SkipTo(",")('upstream_join_prune') + Literal(',') +
               Keyword('RPF:') + (IpAddress | Literal('*'))('rpf') + Literal(',') +
               Keyword('Flags:') + SkipTo(LineEnd())('sg_flags') + LineEnd())
IifStartLine = (Keyword('Incoming Interface:') + SkipToNL)
IifOilLine = (Word(alphanums + r'-./')('interface_name') + Optional(Literal(',')) +
              Keyword('Uptime:') + ParserUptime('uptime') + Literal(',') +
              Keyword('status:') + Word(alphanums) + Literal(',') +
              Optional(Word(alphanums + r'(): ')('join_prune_state') + Literal(',')) +
              Keyword('Flags:') + SkipTo(LineEnd())('interface_flags') + LineEnd())

IifLines = Dict(OneOrMore(Group(IifOilLine)))("iif")
OilLines = Dict(OneOrMore(Group(IifOilLine)))("oil")
OilStartLine = (Literal('Output Interface List:') + SkipToNL)

grammar = OneOrMore(Group((SgLine +
                           SgFlagsLine +
                           IifStartLine +
                           IifLines +
                           OilStartLine +
                           OilLines +
                           Optional(SkipToNL))
                          )
                    )

grammar.setDefaultWhitespaceChars(" \t")
print(f"started parse")
start_time = time.monotonic()
result = grammar.parseString(output)
end_time = time.monotonic()

print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('PyParsing')
  1. Primarily, pyparsing is slower because it runs in pure Python. Python's regex engine is implemented in C, so it is inherently faster. Also, pyparsing's matching logic is broken up across many objects, each with its own separate parse function to nibble away at the input string. A compiled re pattern runs its matching logic in a single C function call.
  2. I tried redoing a few of your low-level terms using pyparsing Regex instead of Word(word_chars) (though Word uses a regex internally anyway, so there is unlikely to be much gain). I do note that your terms aren't really doing much pattern matching in their parsing. For instance, using Word(nums+":") to parse a time given in the form 00:00:00 is a bit of cheating, since that term would also match "::::", ":0:0:", and any integer. The same goes for defining IpAddress as any word composed of digits and ".". If the re's you are comparing against are just as tolerant of badly formatted data, then I'm sure they will be fast.
  3. On my machine, pyparsing parses 1000 elements per second, so about 40 seconds for your list of 40,000 elements. If you are processing that router output once a day, 40 seconds seems fast enough. If you are doing it once a minute, then pyparsing will not be the right tool.
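
To make point 2 concrete, here is a minimal sketch using only the standard re module. The loose patterns mirror the character-set approach of Word(nums + ':.') and Word(nums + '.'). The stricter patterns are an assumption about the intended formats (HH:MM:SS uptimes and dotted-quad IPv4 addresses), and the helper name `accepts` is made up for illustration; none of this is taken from the original parsers.

```python
import re

# Loose patterns: any run of the allowed characters, similar in spirit to
# Word(nums + ':.') and Word(nums + '.') in the pyparsing grammar.
LOOSE_UPTIME = re.compile(r'[\d:.]+')
LOOSE_IP = re.compile(r'[\d.]+')

# Stricter patterns (assumed formats, not taken from the question):
# exactly HH:MM:SS, and four dot-separated groups of 1 to 3 digits.
STRICT_UPTIME = re.compile(r'\d{2}:\d{2}:\d{2}')
STRICT_IP = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}')


def accepts(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == '__main__':
    samples = ['00:40:58', '::::', ':0:0:', '42', '205.1.0.0', '...']
    for text in samples:
        print(f'{text!r:12}'
              f' loose_uptime={accepts(LOOSE_UPTIME, text)}'
              f' strict_uptime={accepts(STRICT_UPTIME, text)}'
              f' loose_ip={accepts(LOOSE_IP, text)}'
              f' strict_ip={accepts(STRICT_IP, text)}')
```

If stricter matching is wanted on the pyparsing side, the same pattern strings could be wrapped in pyparsing's Regex class in place of the Word terms. Whether that changes the timing noticeably would have to be measured.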
