How do you translate this regular-expression idiom from Perl into Python?

I switched from Perl to Python about a year ago and haven't looked back. There is only one idiom that I've ever found I can do more easily in Perl than in Python:

if ($var =~ /foo(.+)/) {
  # do something with $1
} elsif ($var =~ /bar(.+)/) {
  # do something with $1
} elsif ($var =~ /baz(.+)/) {
  # do something with $1
}

The corresponding Python code is not so elegant since the if statements keep getting nested:

m = re.search(r'foo(.+)', var)
if m:
  # do something with m.group(1)
else:
  m = re.search(r'bar(.+)', var)
  if m:
    # do something with m.group(1)
  else:
    m = re.search(r'baz(.+)', var)
    if m:
      # do something with m.group(1)

Does anyone have an elegant way to reproduce this pattern in Python? I've seen anonymous function dispatch tables used, but those seem kind of unwieldy to me for a small number of regular expressions...

Using named groups and a dispatch table:

import re

r = re.compile(r'(?P<cmd>foo|bar|baz)(?P<data>.+)')

def do_foo(data):
    ...

def do_bar(data):
    ...

def do_baz(data):
    ...

dispatch = {
    'foo': do_foo,
    'bar': do_bar,
    'baz': do_baz,
}


m = r.match(var)
if m:
    dispatch[m.group('cmd')](m.group('data'))

With a little bit of introspection you can auto-generate the regexp and the dispatch table.
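As a rough sketch of that introspection idea, the regexp and the table could be derived from method names. The `Commands` class, the `do_` prefix convention and the `build_dispatch` helper are assumptions made up for illustration; they are not part of the answer above.

```python
import re


class Commands:
    """Hypothetical handler collection; every do_<name> method is a command."""

    def do_foo(self, data):
        return ('foo', data)

    def do_bar(self, data):
        return ('bar', data)

    def do_baz(self, data):
        return ('baz', data)


def build_dispatch(obj, prefix='do_'):
    """Build (compiled regex, dispatch table) from obj's do_* methods."""
    table = {
        name[len(prefix):]: getattr(obj, name)
        for name in dir(obj)
        if name.startswith(prefix) and callable(getattr(obj, name))
    }
    # Longest names first, so a command that is a prefix of another
    # (e.g. 'foo' vs. 'foobar') cannot shadow the longer one.
    names = sorted(table, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in names)
    regex = re.compile(r'(?P<cmd>%s)(?P<data>.+)' % alternation)
    return regex, table


regex, dispatch = build_dispatch(Commands())
m = regex.match('bar stuff')
result = dispatch[m.group('cmd')](m.group('data')) if m else None
print(result)
```

Adding a new command then only requires adding a new `do_...` method; the pattern and the table follow automatically.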

Yeah, it's kind of annoying. Perhaps this will work for your case.


import re

class ReCheck(object):
    def __init__(self):
        self.result = None
    def check(self, pattern, text):
        self.result = re.search(pattern, text)
        return self.result

var = 'bar stuff'
m = ReCheck()
if m.check(r'foo(.+)',var):
    print(m.result.group(1))
elif m.check(r'bar(.+)',var):
    print(m.result.group(1))
elif m.check(r'baz(.+)',var):
    print(m.result.group(1))

EDIT: Brian correctly pointed out that my first attempt did not work. Unfortunately, this attempt is longer.

r"""
This is an extension of the re module. It stores the last successful
match object and lets you access its methods and attributes via
this module.

This module exports the following additional functions:
    expand  Return the string obtained by doing backslash substitution on a
            template string.
    group   Returns one or more subgroups of the match.
    groups  Return a tuple containing all the subgroups of the match.
    start   Return the indices of the start of the substring matched by
            group.
    end     Return the indices of the end of the substring matched by group.
    span    Returns a 2-tuple of (start(), end()) of the substring matched
            by group.

This module defines the following additional public attributes:
    pos         The value of pos which was passed to the search() or match()
                method.
    endpos      The value of endpos which was passed to the search() or
                match() method.
    lastindex   The integer index of the last matched capturing group.
    lastgroup   The name of the last matched capturing group.
    re          The regular expression object which was passed to search() or
                match().
    string      The string passed to match() or search().
"""

import re as re_

from re import *
from functools import wraps

__all__ = re_.__all__ + [ "expand", "group", "groups", "groupdict", "start", "end",
        "span", "last_match", "pos", "endpos", "lastindex", "lastgroup", "re", "string" ]

last_match = pos = endpos = lastindex = lastgroup = re = string = None

def _set_match(match=None):
    global last_match, pos, endpos, lastindex, lastgroup, re, string
    if match is not None:
        last_match = match
        pos = match.pos
        endpos = match.endpos
        lastindex = match.lastindex
        lastgroup = match.lastgroup
        re = match.re
        string = match.string
    return match

@wraps(re_.match)
def match(pattern, string, flags=0):
    return _set_match(re_.match(pattern, string, flags))


@wraps(re_.search)
def search(pattern, string, flags=0):
    return _set_match(re_.search(pattern, string, flags))

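@wraps(re_.findall)
def findall(pattern, string, flags=0):
    # re.findall returns strings (or tuples), not match objects, so the
    # last match object is recorded via finditer instead.
    last = None
    for last in re_.finditer(pattern, string, flags):
        pass
    _set_match(last)
    return re_.findall(pattern, string, flags)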

@wraps(re_.finditer)
def finditer(pattern, string, flags=0):
    for match in re_.finditer(pattern, string, flags):
        yield _set_match(match)

def expand(template):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.expand(template)

def group(*indices):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.group(*indices)

def groups(default=None):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.groups(default)

def groupdict(default=None):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.groupdict(default)

def start(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.start(group)

def end(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.end(group)

def span(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.span(group)

del wraps  # Not needed past module compilation

For example:

if gre.match("foo(.+)", var):
  # do something with gre.group(1)
elif gre.match("bar(.+)", var):
  # do something with gre.group(1)
elif gre.match("baz(.+)", var):
  # do something with gre.group(1)

I'd suggest this, as it uses the least regex to accomplish your goal. It is still functional code, but no worse than your old Perl.

import re
var = "barbazfoo"

# group(1) is the keyword that matched; group(2) is the text after it
m = re.search(r'(foo|bar|baz)(.+)', var)
if m is None:
    pass  # none of the keywords matched
elif m.group(1) == 'foo':
    print(m.group(2))
    # do something with m.group(2)
elif m.group(1) == "bar":
    print(m.group(2))
    # do something with m.group(2)
elif m.group(1) == "baz":
    print(m.group(2))
    # do something with m.group(2)

Starting with Python 3.8, which introduced assignment expressions (PEP 572, the := operator), we can capture the condition value re.search(pattern, text) in a variable match, both to check that it is not None and to reuse it within the body of the condition:

if match := re.search(r'foo(.+)', text):
  # do something with match.group(1)
elif match := re.search(r'bar(.+)', text):
  # do something with match.group(1)
elif match := re.search(r'baz(.+)', text):
  # do something with match.group(1)
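
A self-contained version of the same idea, as a sketch that needs Python 3.8 or later. The `classify` function name and the sample strings are made up for illustration.

```python
import re


def classify(text):
    """Return (label, captured text) for the first pattern that matches."""
    if match := re.search(r'foo(.+)', text):
        return 'foo', match.group(1)
    elif match := re.search(r'bar(.+)', text):
        return 'bar', match.group(1)
    elif match := re.search(r'baz(.+)', text):
        return 'baz', match.group(1)
    return None  # no pattern matched


print(classify('bar stuff'))
print(classify('nothing here'))
```

Each `elif` is only evaluated when the previous search failed, so no more searches run than in the Perl original.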

With thanks to this other SO question:

import re

class DataHolder:
    def __init__(self, value=None, attr_name='value'):
        self._attr_name = attr_name
        self.set(value)
    def __call__(self, value):
        return self.set(value)
    def set(self, value):
        setattr(self, self._attr_name, value)
        return value
    def get(self):
        return getattr(self, self._attr_name)

string = u'test bar 123'
save_match = DataHolder(attr_name='match')
if save_match(re.search(r'foo (\d+)', string)):
    print("Foo")
    print(save_match.match.group(1))
elif save_match(re.search(r'bar (\d+)', string)):
    print("Bar")
    print(save_match.match.group(1))
elif save_match(re.search(r'baz (\d+)', string)):
    print("Baz")
    print(save_match.match.group(1))

Alternatively, something not using regular expressions at all:

prefix, data = var[:3], var[3:]
if prefix == 'foo':
    # do something with data
elif prefix == 'bar':
    # do something with data
elif prefix == 'baz':
    # do something with data
else:
    # do something with var

Whether that is suitable depends on your actual problem. Don't forget, regular expressions aren't the Swiss Army knife that they are in Perl; Python has different constructs for doing string manipulation.
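
For instance, a prefix can be tested with str.startswith and sliced off without any regular expression, which also copes with prefixes of different lengths. This is only a sketch; the `split_command` helper and the `PREFIXES` tuple are assumptions for illustration.

```python
PREFIXES = ('foo', 'bar', 'baz')


def split_command(var, prefixes=PREFIXES):
    """Return (prefix, rest) for the first matching prefix, else (None, var)."""
    for prefix in prefixes:
        if var.startswith(prefix):
            # Slice by the prefix's own length rather than a fixed 3.
            return prefix, var[len(prefix):]
    return None, var


print(split_command('bar stuff'))
print(split_command('qux'))
```

Like the slicing code above, this only looks at the start of the string, whereas re.search would find the keyword anywhere in it.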

import re

def find_first_match(string, *regexes):
    # Try each (regex, handler) pair in order and call the first handler that matches.
    for regex, handler in regexes:
        m = re.search(regex, string)
        if m:
            handler(m)
            return
    else:
        raise ValueError

find_first_match(
    foo,
    (r'foo(.+)', handle_foo),
    (r'bar(.+)', handle_bar),
    (r'baz(.+)', handle_baz))

To speed it up, one could turn all regexes into one internally and create the dispatcher on the fly. Ideally, this would be turned into a class then.
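
One possible shape for such a class is sketched below. It wraps every pattern in a generated named group, joins them with `|`, and uses the match object's lastgroup attribute to find out which alternative matched. The `FirstMatchDispatcher` name, the generated `ruleN` group names and the lambda handlers are assumptions for illustration, and here the handlers receive the matched text rather than the match object.

```python
import re


class FirstMatchDispatcher:
    """Hypothetical helper: one combined regex plus a handler table."""

    def __init__(self, *rules):
        parts = []
        self._handlers = {}
        for index, (pattern, handler) in enumerate(rules):
            name = 'rule%d' % index  # generated group name
            parts.append('(?P<%s>%s)' % (name, pattern))
            self._handlers[name] = handler
        self._regex = re.compile('|'.join(parts))

    def dispatch(self, string):
        m = self._regex.search(string)
        if m is None:
            raise ValueError('no pattern matched %r' % string)
        # lastgroup names the alternative that actually matched.
        name = m.lastgroup
        return self._handlers[name](m.group(name))


dispatcher = FirstMatchDispatcher(
    (r'foo.+', lambda text: ('foo', text)),
    (r'bar.+', lambda text: ('bar', text)),
    (r'baz.+', lambda text: ('baz', text)),
)
print(dispatcher.dispatch('xx bar stuff'))
```

Note one difference from trying the regexes one after another: the combined pattern returns the leftmost match in the string, so rule order only decides between alternatives that match at the same position.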

Here's the way I solved this issue:

matched = False

m = re.match("regex1", var)
if not matched and m:
    # do something
    matched = True

m = re.match("regex2", var)
if not matched and m:
    # do something else
    matched = True

m = re.match("regex3", var)
if not matched and m:
    # do yet something else
    matched = True

Not nearly as clean as the original pattern. However, it is simple, straightforward and doesn't require extra modules or changes to the original regexes.

Expanding on the solution by Pat Notz a bit, I found it even more elegant to:
- name the methods the same as re provides (e.g. search() instead of check()) and
- implement the necessary methods like group() on the holder object itself:

class Re(object):
    def __init__(self):
        self.result = None

    def search(self, pattern, text):
        self.result = re.search(pattern, text)
        return self.result

    def group(self, index):
        return self.result.group(index)

Example

Instead of e.g. this:

m = re.search(r'set ([^ ]+) to ([^ ]+)', line)
if m:
    vars[m.group(1)] = m.group(2)
else:
    m = re.search(r'print ([^ ]+)', line)
    if m:
        print(vars[m.group(1)])
    else:
        m = re.search(r'add ([^ ]+) to ([^ ]+)', line)
        if m:
            vars[m.group(2)] += vars[m.group(1)]

One does just this:

m = Re()
...
if m.search(r'set ([^ ]+) to ([^ ]+)', line):
    vars[m.group(1)] = m.group(2)
elif m.search(r'print ([^ ]+)', line):
    print(vars[m.group(1)])
elif m.search(r'add ([^ ]+) to ([^ ]+)', line):
    vars[m.group(2)] += vars[m.group(1)]

It looks very natural in the end, does not need too many code changes when moving from Perl, and avoids the problems with global state that some other solutions have.

How about using a dictionary?

match_objects = {}

if match_objects.setdefault( 'mo_foo', re_foo.search( text ) ):
  # do something with match_objects[ 'mo_foo' ]

elif match_objects.setdefault( 'mo_bar', re_bar.search( text ) ):
  # do something with match_objects[ 'mo_bar' ]

elif match_objects.setdefault( 'mo_baz', re_baz.search( text ) ):
  # do something with match_objects[ 'mo_baz' ]

...

However, you must ensure there are no duplicate match_objects dictionary keys (mo_foo, mo_bar, ...), ideally by giving each regular expression its own name and naming the match_objects keys accordingly. Otherwise match_objects.setdefault() returns the value already stored under that key instead of the new match object produced by re_xxx.search(text).
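
The pitfall can be seen in a small sketch. The key names, sample text and compiled patterns are chosen for illustration only.

```python
import re

text = 'bar stuff'
re_foo = re.compile(r'foo(.+)')
re_bar = re.compile(r'bar(.+)')

# Reusing one key: the failed foo search stores None under 'mo', and the
# second setdefault() returns that stored None instead of the new match.
shared = {}
first = shared.setdefault('mo', re_foo.search(text))
second = shared.setdefault('mo', re_bar.search(text))
print(first, second)

# Distinct keys: each search result is stored and returned as expected.
match_objects = {}
if match_objects.setdefault('mo_foo', re_foo.search(text)):
    captured = match_objects['mo_foo'].group(1)
elif match_objects.setdefault('mo_bar', re_bar.search(text)):
    captured = match_objects['mo_bar'].group(1)
else:
    captured = None
print(captured)
```

For the same reason the dictionary has to be fresh (or cleared) for every new input string, because results from a previous string would otherwise be returned.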

A minimalist DataHolder:

class Holder(object):
    def __call__(self, *x):
        if x:
            self.x = x[0]
        return self.x

data = Holder()

if data(re.search(r'foo (\d+)', string)):
    print(data().group(1))

or as a singleton function:

def data(*x):
    if x:
        data.x = x[0]
    return data.x

My solution would be:

import re

class Found(Exception): pass

try:        
    for m in re.finditer('bar(.+)', var):
        # Do something
        raise Found

    for m in re.finditer('foo(.+)', var):
        # Do something else
        raise Found

except Found: pass

Here is a RegexDispatcher class that dispatches its subclass methods by regular expression.

Each dispatchable method is annotated with a regular expression, e.g.

def plus(self, regex: r"\+", **kwargs):
...

In this case, the annotation is called 'regex' and its value is the regular expression to match on, r"\+", which matches the + sign. These annotated methods are put in subclasses, not in the base class.

When the dispatch(...) method is called on a string, the class finds the method with an annotation regular expression that matches the string and calls it. Here is the class:

import inspect
import re


class RegexMethod:
    def __init__(self, method, annotation):
        self.method = method
        self.name = self.method.__name__
        self.order = inspect.getsourcelines(self.method)[1] # The line in the source file
        self.regex = self.method.__annotations__[annotation]

    def match(self, s):
        return re.match(self.regex, s)

    # Make it callable
    def __call__(self, *args, **kwargs):
        return self.method(*args, **kwargs)

    def __str__(self):
        return "Line: %s, method name: %s, regex: %s" % (self.order, self.name, self.regex)


class RegexDispatcher:
    def __init__(self, annotation="regex"):
        self.annotation = annotation
        # Collect all the methods that have an annotation that matches self.annotation
        # For example, methods that have the annotation "regex", which is the default
        self.dispatchMethods = [RegexMethod(m[1], self.annotation) for m in
                                inspect.getmembers(self, predicate=inspect.ismethod) if
                                (self.annotation in m[1].__annotations__)]
        # Be sure to process the dispatch methods in the order they appear in the class!
        # This is because the order in which you test regexes is important.
        # The most specific patterns must always be tested BEFORE more general ones
        # otherwise they will never match.
        self.dispatchMethods.sort(key=lambda m: m.order)

    # Finds the FIRST match of s against a RegexMethod in dispatchMethods, calls the RegexMethod and returns
    def dispatch(self, s, **kwargs):
        for m in self.dispatchMethods:
            if m.match(s):
                return m(self.annotation, **kwargs)
        return None

To use this class, subclass it to create a class with annotated methods. By way of example, here is a simple RPNCalculator that inherits from RegexDispatcher. The methods to be dispatched are (of course) the ones with the 'regex' annotation. The parent dispatch() method is invoked in __call__.

from RegexDispatcher import *
import math

class RPNCalculator(RegexDispatcher):
    def __init__(self):
        RegexDispatcher.__init__(self)
        self.stack = []

    def __str__(self):
        return str(self.stack)

    # Make RPNCalculator objects callable
    def __call__(self, expression):
        # Calculate the value of expression
        for t in expression.split():
            self.dispatch(t, token=t)
        return self.top()  # return the top of the stack

    # Stack management
    def top(self):
        return self.stack[-1] if len(self.stack) > 0 else []

    def push(self, x):
        return self.stack.append(float(x))

    def pop(self, n=1):
        return self.stack.pop() if n == 1 else [self.stack.pop() for n in range(n)]

    # Handle numbers
    def number(self, regex: r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", **kwargs):
        self.stack.append(float(kwargs['token']))

    # Binary operators
    def plus(self, regex: r"\+", **kwargs):
        a, b = self.pop(2)
        self.push(b + a)

    def minus(self, regex: r"\-", **kwargs):
        a, b = self.pop(2)
        self.push(b - a)

    def multiply(self, regex: r"\*", **kwargs):
        a, b = self.pop(2)
        self.push(b * a)

    def divide(self, regex: r"\/", **kwargs):
        a, b = self.pop(2)
        self.push(b / a)

    def pow(self, regex: r"exp", **kwargs):
        a, b = self.pop(2)
        self.push(b ** a)

    def logN(self, regex: r"logN", **kwargs):
        a, b = self.pop(2)
        self.push(math.log(a,b))

    # Unary operators
    def neg(self, regex: r"neg", **kwargs):
        self.push(-self.pop())

    def sqrt(self, regex: r"sqrt", **kwargs):
        self.push(math.sqrt(self.pop()))

    def log2(self, regex: r"log2", **kwargs):
        self.push(math.log2(self.pop()))

    def log10(self, regex: r"log10", **kwargs):
        self.push(math.log10(self.pop()))

    def pi(self, regex: r"pi", **kwargs):
        self.push(math.pi)

    def e(self, regex: r"e", **kwargs):
        self.push(math.e)

    def deg(self, regex: r"deg", **kwargs):
        self.push(math.degrees(self.pop()))

    def rad(self, regex: r"rad", **kwargs):
        self.push(math.radians(self.pop()))

    # Whole stack operators
    def cls(self, regex: r"c", **kwargs):
        self.stack=[]

    def sum(self, regex: r"sum", **kwargs):
        self.stack=[math.fsum(self.stack)]


if __name__ == '__main__':
    calc = RPNCalculator()

    print(calc('2 2 exp 3 + neg'))

    print(calc('c 1 2 3 4 5 sum 2 * 2 / pi'))

    print(calc('pi 2 * deg'))

    print(calc('2 2 logN'))

I like this solution because there are no separate lookup tables. The regular expression to match on is embedded in the method to be called as an annotation. For me, this is as it should be. It would be nice if Python allowed more flexible annotations, because I would rather put the regex annotation on the method itself rather than embed it in the method parameter list. However, this isn't possible at the moment.
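
One possible way to attach the pattern to the method itself is a decorator that stores the compiled regex as a function attribute. It is offered only as a sketch and is not part of the RegexDispatcher class above; the `regex` decorator, `DecoratorDispatcher` and `Greeter` are hypothetical names.

```python
import re


def regex(pattern):
    """Hypothetical decorator: record a pattern on the decorated function."""
    def decorate(func):
        func.regex = re.compile(pattern)
        return func
    return decorate


class DecoratorDispatcher:
    def dispatch(self, s):
        # vars(type(self)) lists only the methods defined directly on the
        # subclass, in definition order, so earlier methods are tested first.
        for attr in vars(type(self)).values():
            pattern = getattr(attr, 'regex', None)
            if pattern is not None and pattern.match(s):
                return attr(self, s)
        return None


class Greeter(DecoratorDispatcher):
    @regex(r'hello')
    def hello(self, token):
        return 'greeting: ' + token

    @regex(r'\d+')
    def number(self, token):
        return 'number: ' + token


g = Greeter()
print(g.dispatch('hello'))
print(g.dispatch('42'))
```

This keeps the pattern next to the method without a separate lookup table, at the cost of not seeing decorated methods inherited from intermediate base classes.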

For interest, take a look at the Wolfram language in which functions are polymorphic on arbitrary patterns, not just on argument types. A function that is polymorphic on a regex is a very powerful idea, but we can't get there cleanly in Python. The RegexDispatcher class is the best I could do.

import re

s = '1.23 Million equals to 1230000'

s = re.sub(r"([\d.]+)(\s*)Million", lambda m: str(round(float(m.groups()[0]) * 1000_000))+m.groups()[1], s)

print(s)

1230000 equals to 1230000
