Why does PLY treat regular expressions differently from Python/re?

Question

Some background:

I am writing a parser to retrieve information from sites that use a markup language. Standard libraries such as wikitools, ... do not work for me, as I need to be more specific, and adapting them to my needs puts a layer of complexity between me and the problem. With Python plus "simple" regexes I had difficulty identifying the dependencies between the different "tokens" in the markup language in a transparent manner, so I ended up at PLY.

Now it seems that PLY identifies tokens via regexes differently from Python, but I can't find anything about it. I don't want to move on without understanding how PLY determines the tokens within its lexer (otherwise I would have no control over the logic I depend on and would fail at a later stage).

Here we go:

import re
import ply.lex as lex

text = r'--- 123456 ---'
token1 = r'-- .* --'
tokens = (
   'TEST',
)
t_TEST = token1

lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
    print tok.type, tok.value, tok.lineno, tok.lexpos

results in:

lex: tokens   = ('TEST',)
lex: literals = ''
lex: states   = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0

The last line is surprising: I would have expected the first and the last - to be missing from --- 123456 --- if it is comparable to "search" (and nothing at all if it is comparable to "match"). This matters because otherwise -- cannot be distinguished from --- (or == from ===), i.e. headlines, enumerations, ... cannot be differentiated.

So why does PLY behave differently from standard Python/regex? (And how? I couldn't find anything in the documentation, or here on Stack Overflow.)

I would guess the problem is my understanding of PLY, as the tool has been around for quite some time, i.e. this behavior is presumably intentional. The only somewhat related information I could find deals with different groups but does not explain a different behavior in matching the regexes themselves. I found nothing in ply-hack either.

Am I overlooking something stupidly simple?

For comparison, here is standard Python/regex:

import re

text = r'--- 123456 ---'
token1 = r'-- .* --'

p = re.compile(token1)

m = p.search(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

m = p.match(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

gives:

Match found:  -- 123456 --
No match

(as expected, the first is the result of "search", the second of "match")

My settings: I am working with Spyder; this is the terminal display at start:

Python 2.7.5+ (default, Sep 19 2013, 13:49:51) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.

Thanks for your time and help.

Answer 1

The answer to "ply lexmatch regular expression has different groups than a usual re" helps here too. In lex.py:

c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)

Notice the VERBOSE flag. It means the re engine ignores whitespace characters in your regexps (except inside a character class or when escaped with a backslash). So r'-- .* --' really means r'--.*--', which indeed matches a string like '--- foobar ---' in its entirety. See the documentation of re.VERBOSE for more details.
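
To see the effect without PLY, here is a minimal sketch using only the re module. The helper compile_like_ply is hypothetical (it is not part of PLY); it just mirrors the lex.py line quoted above by wrapping the rule in a named group and adding re.VERBOSE. It also shows two ways to keep the spaces significant: escaping them or putting them in a character class.

```python
import re

def compile_like_ply(name, pattern, reflags=0):
    # Hypothetical helper, not a PLY API: it imitates the quoted lex.py line
    # by wrapping the rule in a named group and compiling with re.VERBOSE.
    return re.compile("(?P<%s>%s)" % (name, pattern), re.VERBOSE | reflags)

text = '--- 123456 ---'

# Unescaped spaces are dropped under re.VERBOSE, so this acts like '--.*--'.
loose = compile_like_ply('t_TEST', r'-- .* --', re.UNICODE)
loose_match = loose.match(text)

# Escaped spaces stay significant, so the rule is really '-- .* --' again.
strict = compile_like_ply('t_TEST', r'--\ .*\ --', re.UNICODE)
strict_match = strict.match(text)
strict_search = strict.search(text)

# A character class is another way to keep a literal space in verbose mode.
bracket = compile_like_ply('t_TEST', r'--[ ].*[ ]--', re.UNICODE)
bracket_search = bracket.search(text)

print(loose_match.group('t_TEST'))     # the whole input string
print(strict_match)                    # None: the input starts with '---'
print(strict_search.group('t_TEST'))   # '-- 123456 --'
```

Assuming the rule is written with escaped spaces in the lexer as well (for example t_TEST = r'--\ .*\ --'), the token would no longer swallow the extra dashes, and since the lexer matches at the current position, a leading '---' would then need its own rule.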

Answer 1: 4, accepted, 2014-02-22 22:19:09
