I'm using Python regexes in a criminally inefficient manner

My goal here is to create a very simple template language. At the moment, I'm working on replacing a variable with a value, like this:

This input:

<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>

Should produce this output:

The Web This Is A Test Variable

I've got it working. But looking at my code, I'm running multiple identical regexes on the same strings -- that just offends my sense of efficiency. There's got to be a better, more Pythonic way. (It's the two "while" loops that really offend.)

This does pass the unit tests, so if this is silly premature optimization, tell me -- I'm willing to let this go. There may be dozens of these variable definitions and uses in a document, but not hundreds. But I suspect there are obvious (to other people) ways of improving this, and I'm curious what the StackOverflow crowd will come up with.

import re

def stripMatchedQuotes(item):
    MatchedSingleQuotes = re.compile(r"'(.*)'", re.LOCALE)
    MatchedDoubleQuotes = re.compile(r'"(.*)"', re.LOCALE)
    item = MatchedSingleQuotes.sub(r'\1', item, 1)
    item = MatchedDoubleQuotes.sub(r'\1', item, 1)
    return item




def processVariables(item):
    VariableDefinition = re.compile(r'<%(.*?)=(.*?)%>', re.LOCALE)
    VariableUse = re.compile(r'<%(.*?)%>', re.LOCALE)
    Variables={}

    while VariableDefinition.search(item):
        VarName, VarDef = VariableDefinition.search(item).groups()
        VarName = stripMatchedQuotes(VarName).upper().strip()
        VarDef = stripMatchedQuotes(VarDef.strip())
        Variables[VarName] = VarDef
        item = VariableDefinition.sub('', item, 1)

    while VariableUse.search(item):
        VarName = stripMatchedQuotes(VariableUse.search(item).group(1).upper()).strip()
        item = VariableUse.sub(Variables[VarName], item, 1)

    return item

The first thing that may improve things is to move the re.compile outside the function. The compilation is cached, but there is a speed hit in checking the cache to see if it's compiled.
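A minimal sketch of that first suggestion, assuming Python 3. The upper-case pattern names and the snake_case function name are illustrative, not from the question, and re.LOCALE is left out because Python 3 rejects that flag for str patterns.

```python
import re

# Compiled once at import time instead of on every call.
# Assumption: re.LOCALE from the original is dropped (Python 3 only
# allows it with bytes patterns).
MATCHED_SINGLE_QUOTES = re.compile(r"'(.*)'")
MATCHED_DOUBLE_QUOTES = re.compile(r'"(.*)"')


def strip_matched_quotes(item):
    """Remove one pair of single quotes, then one pair of double quotes."""
    item = MATCHED_SINGLE_QUOTES.sub(r"\1", item, count=1)
    item = MATCHED_DOUBLE_QUOTES.sub(r"\1", item, count=1)
    return item
```

The function body is unchanged apart from the names; only the compile calls have moved, so each call skips the lookup in the re module's cache.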

Another possibility is to use a single regex as below:

MatchedQuotes = re.compile(r"(['\"])(.*)\1", re.LOCALE)
item = MatchedQuotes.sub(r'\2', item, 1)

Finally, you can combine this into the regex in processVariables. Taking Torsten Marek's suggestion to use a function for re.sub, this improves and simplifies things dramatically.

VariableDefinition = re.compile(r'<%(["\']?)(.*?)\1=(["\']?)(.*?)\3%>', re.LOCALE)
VarRepl = re.compile(r'<%(["\']?)(.*?)\1%>', re.LOCALE)

def processVariables(item):
    vars = {}
    def findVars(m):
        vars[m.group(2).upper()] = m.group(4)
        return ""

    item = VariableDefinition.sub(findVars, item)
    return VarRepl.sub(lambda m: vars[m.group(2).upper()], item)

print processVariables('<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>')

Here are my timings for 100000 runs:

Original       : 13.637
Global regexes : 12.771
Single regex   :  9.095
Final version  :  1.846

[Edit] Add missing non-greedy specifier

[Edit2] Added .upper() calls so it is case-insensitive like the original version

sub can take a callable as its argument rather than a simple string. Using that, you can replace all variables with one function call:

>>> import re
>>> var_matcher = re.compile(r'<%(.*?)%>', re.LOCALE)
>>> string = '<%"TITLE"%> <%"SHMITLE"%>'
>>> values = {'"TITLE"': "I am a title.", '"SHMITLE"': "And I am a shmitle."}
>>> var_matcher.sub(lambda m: values[m.group(1)], string)
'I am a title. And I am a shmitle.'

Follow eduffy.myopenid.com's advice and keep the compiled regexes around.

The same recipe can be applied to the first loop, only there you need to store the value of the variable first, and always return "" as replacement.

Never create your own programming language. Ever. (I used to have an exception to this rule, but not any more.)

There is always an existing language you can use which suits your needs better. If you elaborated on your use-case, people may help you select a suitable language.

Creating a templating language is all well and good, but shouldn't one of the goals of the templating language be easy readability and efficient parsing? The example you gave seems to be neither.

As Jamie Zawinski famously said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions!" Now they have two problems.

If regular expressions are a solution to a problem you have created, the best bet is not to write a better regular expression, but to redesign your approach to eliminate their use entirely. Regular expressions are complicated, expensive, hugely difficult to maintain, and (ideally) should only be used for working around a problem someone else created.

You can match both kinds of quotes in one go with r"(\"|')(.*?)\1" - the \1 refers to the first group, so it will only match matching quotes.
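As a rough illustration of that backreference, here is a small sketch in Python 3. The names MATCHED_QUOTES and strip_matched_quotes are made up for the example; the pattern is the one quoted above.

```python
import re

# \1 must repeat whatever group 1 captured, so the closing quote has to
# be the same character as the opening one.
MATCHED_QUOTES = re.compile(r"(\"|')(.*?)\1")


def strip_matched_quotes(item):
    """Strip the first matched pair of quotes, keeping the text between them."""
    # Group 2 is the quoted text; group 1 is the quote character itself.
    return MATCHED_QUOTES.sub(r"\2", item, count=1)
```

A string that opens with one kind of quote and closes with the other is left untouched, because no closing character satisfies \1.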

You're calling re.compile quite a bit. A global variable for these wouldn't hurt here.

If a regexp only contains one .* wildcard and literals, then you can use find and rfind to locate the opening and closing delimiters.

If it contains only a series of .*? wildcards and literals, then you can just use a series of finds to do the work.

If the code is time-critical, this switch away from regexps altogether might give a little more speed.
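Both cases can be sketched with plain string methods. This is only an illustration under assumptions: the helper names and the default delimiters are mine, and unlike `.` in a regex without re.DOTALL, find and rfind will happily span newlines.

```python
def find_tags(text, opener="<%", closer="%>"):
    """Yield (start, end, body) for each opener...closer span, left to right.

    Mirrors a non-greedy opener(.*?)closer pattern using two find calls.
    """
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            return
        end = text.find(closer, start + len(opener))
        if end == -1:
            return
        yield start, end + len(closer), text[start + len(opener):end]
        pos = end + len(closer)


def strip_outer_quotes(item, quote='"'):
    """Mirror a greedy quote(.*)quote match: first quote and last quote."""
    first = item.find(quote)
    last = item.rfind(quote)
    if first == -1 or last <= first:
        return item  # fewer than two quote characters, nothing to strip
    return item[:first] + item[first + 1:last] + item[last + 1:]
```

Whether this is actually faster than the compiled regexes would need measuring on real input.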

Also, it looks to me like this is an LL-parsable language. You could look for a library that can already parse such things for you. You could also use recursive calls to do a one-pass parse -- for example, you could implement your processVariables function to only consume up to the first quote, and then call a quote-matching function to consume up to the next quote, etc.

Why not use Mako? Seriously. What feature do you require that Mako doesn't have? Perhaps you can adapt or extend something that already works.

Don't call search twice in a row (in the loop conditional, and the first statement in the loop). Call (and cache the result) once before the loop, and then in the final statement of the loop.

Why not use XML and XSLT instead of creating your own template language? What you want to do is pretty easy in XSLT.
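Applied to the first loop of the question, that advice might look like the sketch below (Python 3). The function name collect_definitions is hypothetical, re.LOCALE is omitted, and the quote handling is simplified to str.strip, which is not identical to the question's stripMatchedQuotes.

```python
import re

VARIABLE_DEFINITION = re.compile(r"<%(.*?)=(.*?)%>")


def collect_definitions(item):
    """Remove every <%name=value%> tag; return the remaining text and a dict."""
    variables = {}
    match = VARIABLE_DEFINITION.search(item)      # one search before the loop
    while match:
        name, value = match.groups()
        # Simplification: strip whitespace, then any quote characters at the ends.
        variables[name.strip().strip("\"'").upper()] = value.strip().strip("\"'")
        # Reuse the cached match to cut the tag out instead of calling sub().
        item = item[:match.start()] + item[match.end():]
        match = VARIABLE_DEFINITION.search(item)  # one search at the end
    return item, variables
```

Each iteration now runs the pattern once rather than three times (loop condition, groups, sub).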
