How can I perform a single substitution, and use the regex captures afterwards?

Question

I am porting a parsing tool from Perl to Python:

my $lineno = 1;
my @data;
for my $line (split /\R/, $source) {
    $line =~ s/^([ ]*)//;
    my $indent = length $1;
    push @data, [$lineno++, $indent, $line];
}

This

splits the input into lines, using Unicode line separators,
strips leading space (only U+0020 space characters),
determines the indentation level from the stripped space.

I am finding it difficult to translate this to idiomatic Python because re.sub() only returns the string after replacement, but not the match object (which I need to count the removed spaces).

In this particular example, I could simply compare the length of the string before and after the substitution. But I'm interested in a general solution to this kind of problem:

How can I perform a single substitution while also accessing the regex captures?

Attempt 1 – exfiltrate the match object through a substitution function:

import re

lineno = 1
data = []
re_leading_space = re.compile(r'^([ ]*)')
for line in source.splitlines():  # TODO handle Unicode line seps
    m = None
    def exfiltrate(the_match):
        nonlocal m
        m = the_match
        return ''
    line = re_leading_space.sub(exfiltrate, line, count=1)
    indent = len(m.group(1)) if m is not None else 0
    data.append((lineno, indent, line))
    lineno += 1

Disadvantage: weird nonlocal data flow.

Attempt 2 – perform the substitution manually:

lineno = 1
data = []
re_leading_space = re.compile(r'^([ ]*)')
for line in source.splitlines():  # TODO handle Unicode line seps
    m = re_leading_space.match(line)
    indent = 0
    if m is not None:
        line = line[m.end():]  # remove matched prefix
        indent = len(m.group(1))
    data.append((lineno, indent, line))
    lineno += 1

Disadvantage: while otherwise fairly clear, it just ends up being a bad reimplementation of the standard library.

Attempt 3 – perform a match, then match the regex again as a substitution:

lineno = 1
data = []
re_leading_space = re.compile(r'^([ ]*)')
for line in source.splitlines():  # TODO handle Unicode line seps
    m = re_leading_space.match(line)
    line = re_leading_space.sub('', line, count=1)
    indent = len(m.group(1)) if m is not None else 0
    data.append((lineno, indent, line))
    lineno += 1

Disadvantage: while comparatively concise, this needlessly matches the pattern twice. Care has to be taken to provide the same flags etc. to match() and sub().

So what would be the Pythonic solution to this problem? I couldn't find “one and only one obvious way to do it.” Maybe I'm missing a particular idiom?

Answer 1

I strongly doubt you'll find any way to do regular expressions in Python that's as natural as it is in Perl. Regex are part of Perl's design at a very low level, while they're not nearly as central to Python.

So my first suggestion is to consider whether you can avoid regex altogether. For your example problem that would be easy: just use line.lstrip(' ') and compare lengths to figure out how much indentation was removed. Some of the other problems you have in mind may also have easy implementations using string methods rather than regex.
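A minimal sketch of that string-method approach, for illustration only. The helper name split_indent and the sample source string are made up here, not taken from the question:

```python
def split_indent(line):
    """Return (indent, stripped), counting only leading U+0020 spaces."""
    stripped = line.lstrip(' ')  # a tab or other whitespace is left alone
    return len(line) - len(stripped), stripped

# Hypothetical sample input; the third line starts with a tab, not a space.
source = "top\n  two\n\tx\n    four"
data = [
    (lineno, *split_indent(line))
    for lineno, line in enumerate(source.splitlines(), start=1)
]
```

This only works because the removed text is a plain prefix of one known character; it does not generalize to arbitrary patterns.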

I really doubt there is any solution for general regex substitutions that is massively better than all of the options you've considered. I'd probably use something like your Attempt 2 myself, or maybe Attempt 1 where the indentation amount was saved by the inner function, rather than the match object itself.
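The Attempt 1 variant I have in mind might look roughly like this. The wrapper function strip_indent is my own naming, and wrapping the callback in a function is just one way to keep the nonlocal variable contained:

```python
import re

RE_LEADING_SPACE = re.compile(r'^([ ]*)')

def strip_indent(line):
    """Strip leading spaces with sub() and report how many were removed."""
    indent = 0

    def record(match):
        # Save only the derived value, not the match object itself.
        nonlocal indent
        indent = len(match.group(1))
        return ''

    stripped = RE_LEADING_SPACE.sub(record, line, count=1)
    return indent, stripped
```

The nonlocal data flow is still there, but it stays inside one small function instead of leaking into the loop body.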

Answer 2

Match objects have an expand method, which is documented as:

Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

This allows matching only once and doing the substitution using the match, like this:

data = []
re_leading_space = re.compile(r'^([ ]*)(.*)')
for lineno, line in enumerate(source.splitlines(), start=1):  # TODO handle Unicode line seps
    m = re_leading_space.match(line)
    indent = 0
    if m is not None:
        line = m.expand(r'\2')
        indent = len(m.group(1))
    data.append((lineno, indent, line))
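For the general case, the same idea can be wrapped in a small helper so the pattern does not need an extra (.*) group. This is only a sketch: sub_once is a hypothetical name, not something in the re module, and it assumes a string template rather than a callable replacement:

```python
import re

def sub_once(pattern, template, string, flags=0):
    """Replace the first match of pattern and return (new_string, match).

    If nothing matches, the string is returned unchanged with None.
    """
    m = re.search(pattern, string, flags)
    if m is None:
        return string, None
    # Splice the expanded template between the unmatched prefix and suffix.
    new_string = string[:m.start()] + m.expand(template) + string[m.end():]
    return new_string, m

line, m = sub_once(r'^([ ]*)', '', '    x = 1')
indent = len(m.group(1)) if m is not None else 0
```

The pattern is matched only once, and the caller gets both the result and the match object, so any captures remain available afterwards.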

solution1
2 ACCEPTED 2018-02-19 21:09:53

solution2
0 2018-02-19 21:02:40
