
Is there a way to have python regex find the next instance of a character instead of the last instance?

I'm a noob with RegEx patterns, so I'll apologize right away. :) I'm trying to take this string (note that there are some nested parentheses):

"(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)"

And remove all the parenthetical statements which begin with "(See §" or "(see §" in order to get the following result:

"(i) Test text and additional test text along with more test text (including a parenthetical) to test the text. (Hello World!)"

I've tried using .split() and re.sub(), but can't seem to find a good solution. Here's the closest I've gotten:

import re

txt = '(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)'

x = re.sub(r'\((s|S)ee §.*\)', r'', txt)

print(x)

It seems that the '.*\)' is finding the last instance of ')' as opposed to the next instance. Is there a way to override this behavior, or rather, is there a better solution that I've missed entirely?

Thanks very much!

It's been said that this cannot (realistically) be achieved with a regular expression.
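
To see why, here is a small sketch comparing the greedy '.*' from the question with the non-greedy '.*?'. The sample string is a shortened, hypothetical stand-in for the real text. The non-greedy form does stop at the next ')', but that is the ')' of the nested group, so part of the parenthetical is left behind:

```python
import re

# Hypothetical short sample: the first "(see § ...)" group contains a nested "(Y)".
sample = "a (see § 1 of X (Y) z) b (see § 2) c"

# Greedy: .* runs to the LAST ')' in the string and swallows everything in between.
greedy = re.sub(r"\([sS]ee §.*\)", "", sample)

# Non-greedy: .*? stops at the NEXT ')', which here closes the nested "(Y)".
lazy = re.sub(r"\([sS]ee §.*?\)", "", sample)

print(repr(greedy))  # 'a  c'
print(repr(lazy))    # 'a  z) b  c'
```

Neither quantifier counts parentheses, which is why the nesting has to be tracked some other way.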

Here's a way that it could be done:

txt = "(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)"
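inparens = []  # (start, end) spans to delete, stored last-first
pstart = 0     # index of the '(' that opened the current top-level group
stack = 0      # current nesting depth (a counter, not an actual stack)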

for i, c in enumerate(txt):
    if c == '(':
        if stack == 0:
            pstart = i
        stack += 1
    elif c == ')':
        if stack > 0:
            if (stack := stack - 1) == 0:
                segment = txt[pstart:i+1]
                if segment.lower().startswith('(see §'):
                    inparens.insert(0, (pstart, i+1))
for s, e in inparens:
    txt = txt[:s].rstrip() + txt[e:]
print(txt)

Output:

(i) Test text and additional test text along with more test text (including a parenthetical) to test the text. (Hello World!)

Essentially, this works through the string character by character, maintaining a "stack" (really a depth counter) to allow for embedded segments with parentheses. Once the counter drops back to zero, i.e. the right parenthesis closing a top-level segment has been seen, we know the extent of the segment and can inspect it for relevance. The matching start/end tuples are collected in reverse order, and the string is finally rebuilt by cutting out the unwanted segments from the end backwards, so that the earlier indices stay valid.

This is very probably not the most efficient approach, but it does seem to work.
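
If the repeated slicing is a concern, the same depth-counting idea could be wrapped in a function that collects the kept pieces in a single pass and joins them once. This is only a sketch: the function name and the prefix parameter are made up for illustration, and like the code above it only looks at top-level parenthesised groups.

```python
def strip_see_parentheticals(text, prefix="(see §"):
    """Drop top-level "(...)" groups that start with prefix (case-insensitive)."""
    pieces = []      # chunks of text to keep
    depth = 0        # current parenthesis nesting depth
    group_start = 0  # index of the '(' that opened the current top-level group
    keep_from = 0    # start of the text not yet copied into pieces
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                group_start = i
            depth += 1
        elif ch == ")" and depth > 0:  # an unmatched ')' is ignored
            depth -= 1
            if depth == 0 and text[group_start:i + 1].lower().startswith(prefix.lower()):
                # keep everything before the group, minus the space in front of it
                pieces.append(text[keep_from:group_start].rstrip())
                keep_from = i + 1
    pieces.append(text[keep_from:])
    return "".join(pieces)


txt = '(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)'
print(strip_see_parentheticals(txt))
```

For the string from the question this prints the same result as the code above.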

As I suggested in a comment, you can use a stack to track the start and end indices of the parts you want to remove. Once you have those indices, you can build a new string, as shown below:

sample_string = '(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)'

# track the start and end index of sub_string to be removed
stack = []
index_tracker = []

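start = -1  # index of the '(' opening the span being tracked, or -1 if none
for index, char in enumerate(sample_string):
    if char == '(':
        # only start tracking at the top level, so a nested "(see §" can't reset start
        if not stack and sample_string[index+1: index+6].casefold() == 'see §'.casefold():
            start = index
        stack.append('(')
    elif char == ')':
        if stack:  # ignore an unmatched ')' instead of raising IndexError
            stack.pop()

    if not stack and start != -1:
        index_tracker.append((start, index))
        start = -1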

# create a new string from the parts outside the tracked spans
pieces = []
prev_end = 0  # index just past the previously removed span
for start, end in index_tracker:
    # rstrip() drops the space in front of the removed span, if there is one
    pieces.append(sample_string[prev_end:start].rstrip())
    prev_end = end + 1
pieces.append(sample_string[prev_end:])  # also covers the case of no spans at all
result_string = ''.join(pieces)


print(result_string)

Output:

(i) Test text and additional test text along with more test text (including a parenthetical) to test the text. (Hello World!)

Contrary to what most people here are saying, this problem can be solved with a regex if the text has at most two levels of nested parentheses. In other words, a removed part may contain parentheses, but only a single level of them, with nothing nested deeper. I assume that this is the case for your text.

For example, the following string would not be handled, because "(Proper Name (PN))" sits inside a "(see § ...)" group and so creates three levels of nested parentheses:

(i) Test text (see § 123.1 of this (Proper Name (PN)) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)

Here's the code:

import re

txt = '(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)'

x = re.sub(r"\((s|S)ee §([^()]*\([^()]*\))*[^()]*\) ?", r'', txt)

print(x)

I've also added ' ?' (an optional space) at the end of the pattern to prevent double spaces where text is removed. Remove it if you don't want that.
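
As a quick check of the two-level limit, here is a small sketch that applies the same pattern to two shortened, made-up strings (not from the question): one with a single nested group and one with a third level.

```python
import re

# Same pattern as above: "(see §" or "(See §", then text that may contain
# single-level "(...)" groups, then the closing ")" and an optional space.
PATTERN = re.compile(r"\((s|S)ee §([^()]*\([^()]*\))*[^()]*\) ?")

two_levels = "Text (see § 1 of this Proper Name (PN) part) and more."
three_levels = "Text (see § 1 of this (Proper Name (PN)) part) and more."

print(PATTERN.sub("", two_levels))    # Text and more.
print(PATTERN.sub("", three_levels))  # unchanged: the pattern finds no match
```

With the third level, the pattern does not match at all, so the parenthetical stays in the text instead of being partially removed.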
