python regular expression remove matching brackets file

I have a LaTeX file where a lot of text is marked with \red{}, but there may also be braces inside the \red{}, like \red{here is \underline{underlined} text}. I want to remove the red color, and after some googling I wrote this Python script:

import re
import sys

# Run from a terminal with:
#   python RedRemover.py filename
# sys.argv[1] then holds the file name.
ifn = sys.argv[1]

# Read the whole file into the string c.
with open(ifn, "r") as f:
    c = f.read()

# Remove occurrences of \red{...} in c.
c = re.sub(r'\\red\{(?:[^\}|]*\|)?([^\}|]*)\}', r'\1', c)

# Write c to a new file.
with open("RedRemoved_" + ifn, "w") as nf:
    nf.write(c)

But this will convert

\red{here is \underline{underlined} text}

to

here is \underline{underlined text}

which is not what I want. I want

here is \underline{underlined} text

You can't match an arbitrary depth of nested braces with the re module, since it doesn't support recursion. To solve that, you can use the third-party regex module (installable from PyPI):

import regex

c = r'\red{here is \underline{underlined} text}'

c = regex.sub(r'\\red({((?>[^{}]+|(?1))*)})', r'\2', c)

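Here (?1) is a recursive call to capture group 1 (a brace pair together with its content), and \2 in the replacement is the content of the outermost pair without its braces.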
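
If installing the regex module is not an option, the same result can probably be had with the standard library alone by counting braces by hand. The following is only a minimal sketch and not part of the answer above: the function name strip_red is made up for illustration, it assumes the braces are balanced, and it does not treat escaped braces such as \{ specially.

```python
def strip_red(text, macro="\\red"):
    # Hypothetical helper: removes each macro{...} wrapper and keeps the
    # (possibly nested) content. Assumes balanced, unescaped braces.
    opener = macro + "{"
    out = []
    i = 0
    while i < len(text):
        start = text.find(opener, i)
        if start == -1:
            # No further wrapper: copy the remainder unchanged.
            out.append(text[i:])
            break
        out.append(text[i:start])
        # Walk forward from just after the opening brace, tracking depth.
        depth = 1
        j = start + len(opener)
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth != 0:
            # Unbalanced input: leave the rest untouched.
            out.append(text[start:])
            break
        # text[j - 1] is the closing brace that matches the opener.
        inner = text[start + len(opener):j - 1]
        # Recurse so that a \red nested inside another \red is removed too.
        out.append(strip_red(inner, macro))
        i = j
    return "".join(out)


print(strip_red(r'\red{here is \underline{underlined} text}'))
```

Because the opener is the literal string \red{, a macro such as \redux{...} is left alone, but so is \red {...} written with a space before the brace.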

I think you need to keep the curlies. Consider this case, where the braces also limit the scope of \bf: \red{\bf test}

import re

c = r'\red{here is \underline{underlined} text} and \red{more}'
d = c

# Dropping only the macro name may be less painful and sufficient, and even more correct.
c = re.sub(r'\\red\b', r'', c)
print("1ST:", c)

# If you want to get rid of the curlies as well (handles one level of nesting):
d = re.sub(r'\\red{([^{]*(?:{[^}]*}[^}]*)*)}', r'\1', d)
print("2ND:", d)

Gives:

1ST: {here is \underline{underlined} text} and {more}
2ND: here is \underline{underlined} text and more
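
To see why the braces can matter, here is a small sketch that applies both substitutions to the \red{\bf test} case mentioned above. \bf is a switch that stays in effect until the enclosing group ends, so dropping the braces lets the bold run on into the text that follows. The sample string is made up for illustration.

```python
import re

sample = r'\red{\bf test} rest of the sentence'

# Remove only the macro name; the group still limits the scope of \bf.
keep_braces = re.sub(r'\\red\b', '', sample)

# Remove the macro name and its braces; \bf now applies to what follows.
drop_braces = re.sub(r'\\red{([^{]*(?:{[^}]*}[^}]*)*)}', r'\1', sample)

print(keep_braces)  # {\bf test} rest of the sentence
print(drop_braces)  # \bf test rest of the sentence
```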
