How do you translate this regular-expression idiom from Perl into Python?

I switched from Perl to Python about a year ago and haven't looked back. There is only one idiom that I've ever found I can do more easily in Perl than in Python:

if ($var =~ /foo(.+)/) {
  # do something with $1
} elsif ($var =~ /bar(.+)/) {
  # do something with $1
} elsif ($var =~ /baz(.+)/) {
  # do something with $1
}

The corresponding Python code is not so elegant since the if statements keep getting nested:

m = re.search(r'foo(.+)', var)
if m:
  # do something with m.group(1)
else:
  m = re.search(r'bar(.+)', var)
  if m:
    # do something with m.group(1)
  else:
    m = re.search(r'baz(.+)', var)
    if m:
      # do something with m.group(1)

Does anyone have an elegant way to reproduce this pattern in Python? I've seen anonymous function dispatch tables used, but those seem kind of unwieldy to me for a small number of regular expressions...

Using named groups and a dispatch table:

r = re.compile(r'(?P<cmd>foo|bar|baz)(?P<data>.+)')

def do_foo(data):
    ...

def do_bar(data):
    ...

def do_baz(data):
    ...

dispatch = {
    'foo': do_foo,
    'bar': do_bar,
    'baz': do_baz,
}


m = r.match(var)
if m:
    dispatch[m.group('cmd')](m.group('data'))

With a little bit of introspection you can auto-generate the regexp and the dispatch table.
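
As a rough sketch of what that introspection might look like: the version below assumes handlers are methods whose names start with do_ (a naming convention chosen here for illustration), and the Commands class, build_dispatch and run are hypothetical names, not part of the answer above.

```python
import re


class Commands:
    """Handlers are discovered by their do_ prefix (an assumed convention)."""

    def do_foo(self, data):
        return ('foo', data)

    def do_bar(self, data):
        return ('bar', data)

    def do_baz(self, data):
        return ('baz', data)


def build_dispatch(obj, prefix='do_'):
    # Map 'foo' -> obj.do_foo for every callable attribute named do_*.
    table = {
        name[len(prefix):]: getattr(obj, name)
        for name in dir(obj)
        if name.startswith(prefix) and callable(getattr(obj, name))
    }
    # Longest names first, so a shorter command cannot shadow a longer one
    # that shares its prefix.
    alternatives = '|'.join(
        re.escape(cmd) for cmd in sorted(table, key=len, reverse=True)
    )
    pattern = re.compile(r'(?P<cmd>%s)(?P<data>.+)' % alternatives)
    return pattern, table


def run(var, pattern, table):
    m = pattern.match(var)
    if m:
        return table[m.group('cmd')](m.group('data'))
    return None


pattern, table = build_dispatch(Commands())
result = run('bar stuff', pattern, table)
```

Adding a new command then only requires writing another do_ method; the regexp and the table are rebuilt from the method names.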

Yeah, it's kind of annoying. Perhaps this will work for your case.


import re

class ReCheck(object):
    def __init__(self):
        self.result = None
    def check(self, pattern, text):
        self.result = re.search(pattern, text)
        return self.result

var = 'bar stuff'
m = ReCheck()
if m.check(r'foo(.+)', var):
    print(m.result.group(1))
elif m.check(r'bar(.+)', var):
    print(m.result.group(1))
elif m.check(r'baz(.+)', var):
    print(m.result.group(1))

EDIT: Brian correctly pointed out that my first attempt did not work. Unfortunately, this attempt is longer.

r"""
This is an extension of the re module. It stores the last successful
match object and lets you access its methods and attributes via
this module.

This module exports the following additional functions:
    expand  Return the string obtained by doing backslash substitution on a
            template string.
    group   Returns one or more subgroups of the match.
    groups  Return a tuple containing all the subgroups of the match.
    start   Return the indices of the start of the substring matched by
            group.
    end     Return the indices of the end of the substring matched by group.
    span    Returns a 2-tuple of (start(), end()) of the substring matched
            by group.

This module defines the following additional public attributes:
    pos         The value of pos which was passed to the search() or match()
                method.
    endpos      The value of endpos which was passed to the search() or
                match() method.
    lastindex   The integer index of the last matched capturing group.
    lastgroup   The name of the last matched capturing group.
    re          The regular expression object which was passed to search() or
                match().
    string      The string passed to match() or search().
"""

import re as re_

from re import *
from functools import wraps

__all__ = re_.__all__ + [ "expand", "group", "groups", "groupdict", "start",
        "end", "span", "last_match", "pos", "endpos", "lastindex", "lastgroup",
        "re", "string" ]

last_match = pos = endpos = lastindex = lastgroup = re = string = None

def _set_match(match=None):
    global last_match, pos, endpos, lastindex, lastgroup, re, string
    if match is not None:
        last_match = match
        pos = match.pos
        endpos = match.endpos
        lastindex = match.lastindex
        lastgroup = match.lastgroup
        re = match.re
        string = match.string
    return match

@wraps(re_.match)
def match(pattern, string, flags=0):
    return _set_match(re_.match(pattern, string, flags))


@wraps(re_.search)
def search(pattern, string, flags=0):
    return _set_match(re_.search(pattern, string, flags))

@wraps(re_.findall)
def findall(pattern, string, flags=0):
    # re.findall() returns strings (or tuples of strings), not match objects,
    # so there is nothing to record here; use finditer() to track matches.
    return re_.findall(pattern, string, flags)

@wraps(re_.finditer)
def finditer(pattern, string, flags=0):
    for match in re_.finditer(pattern, string, flags):
        yield _set_match(match)

def expand(template):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.expand(template)

def group(*indices):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.group(*indices)

def groups(default=None):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.groups(default)

def groupdict(default=None):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.groupdict(default)

def start(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.start(group)

def end(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.end(group)

def span(group=0):
    if last_match is None:
        raise TypeError("No successful match yet.")
    return last_match.span(group)

del wraps  # Not needed past module compilation

For example:

if gre.match("foo(.+)", var):
  # do something with gre.group(1)
elif gre.match("bar(.+)", var):
  # do something with gre.group(1)
elif gre.match("baz(.+)", var):
  # do something with gre.group(1)

I'd suggest this, as it uses the least regex to accomplish your goal. It is still functional code, but no worse than your old Perl.

import re
var = "barbazfoo"

m = re.search(r'(foo|bar|baz)(.+)', var)
if m is None:
    pass  # none of the prefixes matched
elif m.group(1) == 'foo':
    print(m.group(2))
    # do something with m.group(2)
elif m.group(1) == 'bar':
    print(m.group(2))
    # do something with m.group(2)
elif m.group(1) == 'baz':
    print(m.group(2))
    # do something with m.group(2)

Starting with Python 3.8 and its assignment expressions (PEP 572, the := operator), we can capture the result of re.search(pattern, text) in a variable match, both to check that it is not None and to reuse it within the body of the condition:

if match := re.search(r'foo(.+)', text):
  # do something with match.group(1)
elif match := re.search(r'bar(.+)', text):
  # do something with match.group(1)
elif match := re.search(r'baz(.+)', text):
  # do something with match.group(1)
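
For reference, here is a small runnable version of the same chain, assuming Python 3.8 or later. The classify function and its return values are made up for illustration.

```python
import re


def classify(text):
    # Each branch binds `match` and tests it in the same expression,
    # so the chain stays flat like the Perl original.
    if match := re.search(r'foo(.+)', text):
        return 'foo', match.group(1)
    elif match := re.search(r'bar(.+)', text):
        return 'bar', match.group(1)
    elif match := re.search(r'baz(.+)', text):
        return 'baz', match.group(1)
    return None


print(classify('xx bar stuff'))
```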

With thanks to this other SO question:

import re

class DataHolder:
    def __init__(self, value=None, attr_name='value'):
        self._attr_name = attr_name
        self.set(value)
    def __call__(self, value):
        return self.set(value)
    def set(self, value):
        setattr(self, self._attr_name, value)
        return value
    def get(self):
        return getattr(self, self._attr_name)

string = u'test bar 123'
save_match = DataHolder(attr_name='match')
if save_match(re.search(r'foo (\d+)', string)):
    print("Foo")
    print(save_match.match.group(1))
elif save_match(re.search(r'bar (\d+)', string)):
    print("Bar")
    print(save_match.match.group(1))
elif save_match(re.search(r'baz (\d+)', string)):
    print("Baz")
    print(save_match.match.group(1))

Alternatively, something not using regular expressions at all:

prefix, data = var[:3], var[3:]
if prefix == 'foo':
    # do something with data
elif prefix == 'bar':
    # do something with data
elif prefix == 'baz':
    # do something with data
else:
    # do something with var

Whether that is suitable depends on your actual problem. Don't forget that regular expressions aren't the Swiss Army knife in Python that they are in Perl; Python has other constructs for string manipulation.

def find_first_match(string, *regexes):
    for regex, handler in regexes:
        m = re.search(regex, string)
        if m:
            handler(m)
            return
    else:
        raise ValueError

find_first_match(
    foo, 
    (r'foo(.+)', handle_foo), 
    (r'bar(.+)', handle_bar), 
    (r'baz(.+)', handle_baz))

To speed it up, one could turn all regexes into one internally and create the dispatcher on the fly. Ideally, this would be turned into a class then.
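
A minimal sketch of that idea, under some assumptions: FirstMatchDispatcher and the generated ruleN group names are invented here, and the rule patterns are assumed to contain no numbered backreferences, inline flags, or clashing group names. Note one behavioral difference: a combined alternation picks the leftmost match in the string, whereas the loop above picks the first rule that matches anywhere. The two agree whenever at most one rule can match.

```python
import re


class FirstMatchDispatcher:
    """Hypothetical helper: one combined regex selects the rule, then the
    rule's own pattern is re-applied so handlers see their usual groups."""

    def __init__(self, *rules):
        self._rules = {}
        parts = []
        for i, (regex, handler) in enumerate(rules):
            name = 'rule%d' % i  # generated group name, one per rule
            self._rules[name] = (re.compile(regex), handler)
            parts.append('(?P<%s>%s)' % (name, regex))
        self._combined = re.compile('|'.join(parts))

    def __call__(self, string):
        m = self._combined.search(string)
        if m is None:
            raise ValueError('no rule matched %r' % (string,))
        # The outer ruleN group closes last, so lastgroup names the rule.
        pattern, handler = self._rules[m.lastgroup]
        # Re-match at the same position to restore the rule's own numbering.
        return handler(pattern.match(string, m.start()))


dispatcher = FirstMatchDispatcher(
    (r'foo(.+)', lambda m: ('foo', m.group(1))),
    (r'bar(.+)', lambda m: ('bar', m.group(1))),
    (r'baz(.+)', lambda m: ('baz', m.group(1))),
)
print(dispatcher('xx bar stuff'))
```

Re-matching costs a second, anchored match per call, but it keeps m.group(1) meaningful inside each handler.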

Here's the way I solved this issue:

matched = False

m = re.match("regex1", var)
if not matched and m:
    #do something
    matched = True

m = re.match("regex2", var)
if not matched and m:
    #do something else
    matched = True

m = re.match("regex3", var)
if not matched and m:
    #do yet something else
    matched = True

Not nearly as clean as the original pattern. However, it is simple and straightforward, and it doesn't require extra modules or changes to the original regexes.

Expanding on the solution by Pat Notz a bit, I found it even more elegant to:
- name the methods the same as those re provides (e.g. search() instead of check()) and
- implement the necessary methods like group() on the holder object itself:

class Re(object):
    def __init__(self):
        self.result = None

    def search(self, pattern, text):
        self.result = re.search(pattern, text)
        return self.result

    def group(self, index):
        return self.result.group(index)

Example

Instead of, e.g., this:

m = re.search(r'set ([^ ]+) to ([^ ]+)', line)
if m:
    vars[m.group(1)] = m.group(2)
else:
    m = re.search(r'print ([^ ]+)', line)
    if m:
        print(vars[m.group(1)])
    else:
        m = re.search(r'add ([^ ]+) to ([^ ]+)', line)
        if m:
            vars[m.group(2)] += vars[m.group(1)]

One does just this:

m = Re()
...
if m.search(r'set ([^ ]+) to ([^ ]+)', line):
    vars[m.group(1)] = m.group(2)
elif m.search(r'print ([^ ]+)', line):
    print(vars[m.group(1)])
elif m.search(r'add ([^ ]+) to ([^ ]+)', line):
    vars[m.group(2)] += vars[m.group(1)]

Looks very natural in the end, does not need too many code changes when moving from Perl and avoids the problems with global state like some other solutions.

How about using a dictionary?

match_objects = {}

if match_objects.setdefault( 'mo_foo', re_foo.search( text ) ):
  # do something with match_objects[ 'mo_foo' ]

elif match_objects.setdefault( 'mo_bar', re_bar.search( text ) ):
  # do something with match_objects[ 'mo_bar' ]

elif match_objects.setdefault( 'mo_baz', re_baz.search( text ) ):
  # do something with match_objects[ 'mo_baz' ]

...

However, you must ensure that the match_objects keys ( mo_foo, mo_bar, ... ) are unique, ideally by giving each regular expression its own name and naming the keys accordingly. Otherwise match_objects.setdefault() returns the match object already stored under that key instead of the result of the new re_xxx.search( text ) call.
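
The pitfall is easy to reproduce. In this small demonstration the key name 'mo' and the sample text are made up, and the same key is deliberately reused:

```python
import re

text = 'bar stuff'
match_objects = {}

# First use of the key: the search result is stored and returned.
first = match_objects.setdefault('mo', re.search(r'bar(.+)', text))

# Reusing the key: the new search still runs, but setdefault() ignores its
# result and returns the match object that is already stored.
second = match_objects.setdefault('mo', re.search(r'stuff', text))

print(second is first)   # True
print(second.group(0))   # bar stuff
```

The same applies to a failed search: None gets stored under the key, and a later setdefault() with that key keeps returning None.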

A minimalist DataHolder:

class Holder(object):
    def __call__(self, *x):
        if x:
            self.x = x[0]
        return self.x

data = Holder()

if data(re.search(r'foo (\d+)', string)):
    print(data().group(1))

or as a singleton function:

def data(*x):
    if x:
        data.x = x[0]
    return data.x

My solution would be:

import re

class Found(Exception): pass

try:        
    for m in re.finditer('bar(.+)', var):
        # Do something
        raise Found

    for m in re.finditer('foo(.+)', var):
        # Do something else
        raise Found

except Found: pass

Here is a RegexDispatcher class that dispatches its subclass methods by regular expression.

Each dispatchable method is annotated with a regular expression, e.g.:

def plus(self, regex: r"\+", **kwargs):
    ...

In this case, the annotation is called 'regex' and its value is the regular expression to match on, '\\+', which is the + sign. These annotated methods are put in subclasses, not in the base class.

When the dispatch(...) method is called on a string, the class finds the method with an annotation regular expression that matches the string and calls it. Here is the class:

import inspect
import re


class RegexMethod:
    def __init__(self, method, annotation):
        self.method = method
        self.name = self.method.__name__
        self.order = inspect.getsourcelines(self.method)[1] # The line in the source file
        self.regex = self.method.__annotations__[annotation]

    def match(self, s):
        return re.match(self.regex, s)

    # Make it callable
    def __call__(self, *args, **kwargs):
        return self.method(*args, **kwargs)

    def __str__(self):
        return "Line: %s, method name: %s, regex: %s" % (self.order, self.name, self.regex)


class RegexDispatcher:
    def __init__(self, annotation="regex"):
        self.annotation = annotation
        # Collect all the methods that have an annotation that matches self.annotation
        # For example, methods that have the annotation "regex", which is the default
        self.dispatchMethods = [RegexMethod(m[1], self.annotation) for m in
                                inspect.getmembers(self, predicate=inspect.ismethod) if
                                (self.annotation in m[1].__annotations__)]
        # Be sure to process the dispatch methods in the order they appear in the class!
        # This is because the order in which you test regexes is important.
        # The most specific patterns must always be tested BEFORE more general ones
        # otherwise they will never match.
        self.dispatchMethods.sort(key=lambda m: m.order)

    # Finds the FIRST match of s against a RegexMethod in dispatchMethods, calls the RegexMethod and returns
    def dispatch(self, s, **kwargs):
        for m in self.dispatchMethods:
            if m.match(s):
                return m(self.annotation, **kwargs)
        return None

To use this class, subclass it to create a class with annotated methods. By way of example, here is a simple RPNCalculator that inherits from RegexDispatcher. The methods to be dispatched are (of course) the ones with the 'regex' annotation. The parent dispatch() method is invoked in __call__.

from RegexDispatcher import *
import math

class RPNCalculator(RegexDispatcher):
    def __init__(self):
        RegexDispatcher.__init__(self)
        self.stack = []

    def __str__(self):
        return str(self.stack)

    # Make RPNCalculator objects callable
    def __call__(self, expression):
        # Calculate the value of expression
        for t in expression.split():
            self.dispatch(t, token=t)
        return self.top()  # return the top of the stack

    # Stack management
    def top(self):
        return self.stack[-1] if len(self.stack) > 0 else []

    def push(self, x):
        return self.stack.append(float(x))

    def pop(self, n=1):
        return self.stack.pop() if n == 1 else [self.stack.pop() for n in range(n)]

    # Handle numbers
    def number(self, regex: r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", **kwargs):
        self.stack.append(float(kwargs['token']))

    # Binary operators
    def plus(self, regex: r"\+", **kwargs):
        a, b = self.pop(2)
        self.push(b + a)

    def minus(self, regex: r"\-", **kwargs):
        a, b = self.pop(2)
        self.push(b - a)

    def multiply(self, regex: r"\*", **kwargs):
        a, b = self.pop(2)
        self.push(b * a)

    def divide(self, regex: r"\/", **kwargs):
        a, b = self.pop(2)
        self.push(b / a)

    def pow(self, regex: r"exp", **kwargs):
        a, b = self.pop(2)
        self.push(a ** b)

    def logN(self, regex: r"logN", **kwargs):
        a, b = self.pop(2)
        self.push(math.log(a,b))

    # Unary operators
    def neg(self, regex: r"neg", **kwargs):
        self.push(-self.pop())

    def sqrt(self, regex: r"sqrt", **kwargs):
        self.push(math.sqrt(self.pop()))

    def log2(self, regex: r"log2", **kwargs):
        self.push(math.log2(self.pop()))

    def log10(self, regex: r"log10", **kwargs):
        self.push(math.log10(self.pop()))

    def pi(self, regex: r"pi", **kwargs):
        self.push(math.pi)

    def e(self, regex: r"e", **kwargs):
        self.push(math.e)

    def deg(self, regex: r"deg", **kwargs):
        self.push(math.degrees(self.pop()))

    def rad(self, regex: r"rad", **kwargs):
        self.push(math.radians(self.pop()))

    # Whole stack operators
    def cls(self, regex: r"c", **kwargs):
        self.stack=[]

    def sum(self, regex: r"sum", **kwargs):
        self.stack=[math.fsum(self.stack)]


if __name__ == '__main__':
    calc = RPNCalculator()

    print(calc('2 2 exp 3 + neg'))

    print(calc('c 1 2 3 4 5 sum 2 * 2 / pi'))

    print(calc('pi 2 * deg'))

    print(calc('2 2 logN'))

I like this solution because there are no separate lookup tables. The regular expression to match on is embedded in the method to be called as an annotation. For me, this is as it should be. It would be nice if Python allowed more flexible annotations, because I would rather put the regex annotation on the method itself rather than embed it in the method parameter list. However, this isn't possible at the moment.
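
One way to keep the pattern on the method rather than in its parameter list might be a decorator that stores it as a function attribute. The sketch below is only an illustration of that alternative, not the RegexDispatcher above: the regex decorator, DecoratedDispatcher and Tokens are hypothetical names, and this version only looks at methods defined directly on the instance's own class.

```python
import re


def regex(pattern):
    """Hypothetical decorator: attach a compiled pattern to the function."""
    def decorate(func):
        func.regex = re.compile(pattern)
        return func
    return decorate


class DecoratedDispatcher:
    def dispatch(self, s, **kwargs):
        # A class namespace preserves definition order, so patterns are
        # tried in the order the methods appear in the class body.
        for attr in vars(type(self)).values():
            pattern = getattr(attr, 'regex', None)
            if pattern is not None and pattern.match(s):
                return attr(self, **kwargs)
        return None


class Tokens(DecoratedDispatcher):
    @regex(r'\+')
    def plus(self, **kwargs):
        return 'plus'

    @regex(r'[0-9]+')
    def number(self, **kwargs):
        return 'number %s' % kwargs['token']


tokens = Tokens()
print(tokens.dispatch('42', token='42'))
```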

For interest, take a look at the Wolfram language in which functions are polymorphic on arbitrary patterns, not just on argument types. A function that is polymorphic on a regex is a very powerful idea, but we can't get there cleanly in Python. The RegexDispatcher class is the best I could do.

import re

s = '1.23 Million equals to 1230000'

s = re.sub(r"([\d.]+)(\s*)Million", lambda m: str(round(float(m.groups()[0]) * 1000_000))+m.groups()[1], s)

print(s)

1230000 equals to 1230000
