
Why does PLY treat regular expressions differently from Python/re?

Some background:

I am writing a parser to retrieve information from sites that use a markup language. Standard libraries such as wikitools do not work for me: I need to be more specific, and adapting them to my needs puts a layer of complexity between me and the problem. With Python plus "simple" regexes I had difficulty identifying the dependencies between the different "tokens" of the markup language in a transparent manner, so I ended up at PLY at the end of this journey.

Now it seems that PLY identifies tokens via regex differently from Python, but I can't find anything about it. I don't want to move on without understanding how PLY determines the tokens within its lexer (otherwise I would have no control over the logic I depend on and would fail at a later stage).

Here we go:

import re  # needed for re.UNICODE below
import ply.lex as lex

text = r'--- 123456 ---'
token1 = r'-- .* --'
tokens = (
   'TEST',
)
t_TEST = token1

lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
    print tok.type, tok.value, tok.lineno, tok.lexpos

results in:

lex: tokens   = ('TEST',)
lex: literals = ''
lex: states   = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0

The last line is surprising: I would have expected the first and the last - of --- 123456 --- to be missing if this is comparable to "search" (and no match at all if it is comparable to "match"). Obviously this is important, because otherwise -- cannot be distinguished from --- (or == from ===), i.e. headlines, enumerations, ... cannot be differentiated.

So why does PLY behave differently from standard Python/regex? (And how? I couldn't find anything in the documentation or here on Stack Overflow.)

I would guess the problem is my understanding of PLY, as the tool has been around for quite some time, i.e. I assume this behavior is intentional. The only somewhat related information I could find deals with different groups but does not explain why the regexes themselves are identified differently. I found nothing in ply-hack either.

Am I overlooking something stupidly simple?

For comparison, here is standard Python/regex:

import re

text = r'--- 123456 ---'
token1 = r'-- .* --'

p = re.compile(token1)

m = p.search(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

m = p.match(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

gives:

Match found:  -- 123456 --
No match

(As expected, the first is the result of "search", the second of "match".)

My settings: I am working with Spyder; this is the terminal display at start:

Python 2.7.5+ (default, Sep 19 2013, 13:49:51) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.

Thanks for your time and help.

The answer to "ply lexmatch regular expression has different groups than a usual re" helps here too. In lex.py:

c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)

Notice the VERBOSE flag. It means the re engine ignores unescaped whitespace in your regexps (outside character classes). So r'-- .* --' really means r'--.*--', which indeed matches a string like '--- foobar ---' in its entirety. See the documentation of re.VERBOSE for more details.
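
The effect can be reproduced without PLY. The following is a minimal sketch using only the re module; the helper name match_at_start is hypothetical, and wrapping the pattern in a named group merely imitates the lex.py line quoted above. It assumes that the lexer tries its master regex with match() at the current position, which would explain why the token starts at lexpos 0. Escaping the spaces (or writing them as [ ]) should keep them significant under re.VERBOSE:

```python
import re

text = '--- 123456 ---'
pattern = r'-- .* --'

# Plain compile: the spaces in the pattern are literal characters.
plain = re.compile(pattern)

# Imitation of the lex.py line: named group plus re.VERBOSE.
# The unescaped spaces are dropped, so this behaves like r'--.*--'.
verbose = re.compile('(?P<t_TEST>%s)' % pattern, re.VERBOSE)

# Escaped spaces stay significant even under re.VERBOSE.
escaped = re.compile(r'(?P<t_TEST>--\ .*\ --)', re.VERBOSE)


def match_at_start(regex, s):
    """Return the text matched at position 0, or None (an anchored match)."""
    m = regex.match(s)
    return m.group() if m else None


print(match_at_start(plain, text))     # None
print(match_at_start(verbose, text))   # --- 123456 ---
print(match_at_start(escaped, text))   # None
print(escaped.search(text).group())    # -- 123456 --
```

With the escaped pattern, a PLY token rule would no longer match at position 0 of this input, so the lexer would presumably also need a rule (or a t_error handler) that deals with the leading -.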
