I have a txt file downloaded directly from an HTML page, with contents like the following.
<TYPE>GRAPHIC
<TEXT>
.....
Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\ G$O\IORU\W:1YV\MKK(UK1# (I guess are some kind of non-Ascii characters)
.....
</TEXT>
I want to remove everything between <TYPE>GRAPHIC and </TEXT>. I tried re.sub('<TYPE>GRAPHIC(.*)</TEXT>', '', reader), but it doesn't work.
Honestly, I think this is a legitimate question. It has probably been asked before, but re.sub behaves oddly and takes a lot of getting used to, and most answers don't really explain it. The fact that it replaces the whole match, not just the capture groups, is especially confusing, so I don't see why you were downvoted.
Anyway, these two solutions should work:
1.
>>> import re
>>> # Raw string: in Python 3 the "\U" in the sample data would otherwise be a SyntaxError.
>>> reader = r'''<TYPE>GRAPHIC
<TEXT>
.....
Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\
G$O\IORU\W:1YV\MKK(UK1#
(I guess are some kind of non-Ascii characters)
.....
</TEXT>'''
>>> re.sub(r"(?<=<TYPE>GRAPHIC)[\S\s]+(?=</TEXT>)", "", reader)
'<TYPE>GRAPHIC</TEXT>'
(?<=<TYPE>GRAPHIC) is a lookbehind: whatever is matched must be preceded by <TYPE>GRAPHIC, but <TYPE>GRAPHIC itself is not part of the match, so re.sub does not remove it.
[\S\s]+ greedily matches every character, newlines included.
(?=</TEXT>) is a lookahead: the matched text must be followed by </TEXT>, but </TEXT> itself is not part of the match, so it also stays in the re.sub result.
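The original attempt from the question fails because, by default, "." does not match a newline, so ".*" can never span the lines between the two tags. A minimal sketch of this, assuming a shortened, made-up stand-in for the file contents (the sample string is hypothetical), compares the original pattern with and without re.DOTALL:

```python
import re

# Hypothetical, shortened stand-in for the file contents.
sample = "<TYPE>GRAPHIC\n<TEXT>\nM%$2G]...\n</TEXT>"

# Without re.DOTALL, '.' stops at newlines, so nothing matches
# and the string comes back unchanged.
unchanged = re.sub(r'<TYPE>GRAPHIC(.*)</TEXT>', '', sample)

# With re.DOTALL, '.' also matches newlines, so the whole block is removed.
removed = re.sub(r'<TYPE>GRAPHIC(.*)</TEXT>', '', sample, flags=re.DOTALL)

print(unchanged == sample)  # True
print(repr(removed))        # ''
```

Note that this variant removes the tags as well, whereas the lookaround pattern keeps them.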
2.
>>> import re
>>> # Raw string: in Python 3 the "\U" in the sample data would otherwise be a SyntaxError.
>>> reader = r'''<TYPE>GRAPHIC
<TEXT>
.....
Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\
G$O\IORU\W:1YV\MKK(UK1#
(I guess are some kind of non-Ascii characters)
.....
</TEXT>'''
>>> parsed = re.sub(r'(<TYPE>GRAPHIC)[\S\s]+(</TEXT>)', r'\1\n\n\2', reader)
>>> print(parsed)
<TYPE>GRAPHIC

</TEXT>
The "r" prefix on the strings passed to re.sub makes them raw strings, so Python leaves the backslashes alone and hands them to the regex engine unchanged.
In r'\1\n\n\2', \1 keeps captured group 1, \n\n places two newline characters after it, and \2 keeps captured group 2. Everything else in the match is dropped from the result.
Here, try this:
re.sub(r"(?<=<TYPE>GRAPHIC)\n(?:.|\n)+(?=</TEXT>)", "", text)
'<TYPE>GRAPHIC</TEXT>\n'
There are some complex regex patterns here; if you are curious about what they are, see the references for lookahead and lookbehind.
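One caveat with the greedy patterns above: if the file holds more than one <TYPE>GRAPHIC block, a greedy quantifier runs to the last </TEXT> in the file and takes any text in between with it. A sketch of a non-greedy variant, assuming a made-up two-block sample string and a hypothetical helper name strip_graphic_blocks:

```python
import re

# Hypothetical input with two graphic blocks and ordinary text between them.
sample = (
    "<TYPE>GRAPHIC\n<TEXT>\nM%$2G]\n</TEXT>\n"
    "keep this line\n"
    "<TYPE>GRAPHIC\n<TEXT>\n5K*'5E\n</TEXT>\n"
)

# '+?' is non-greedy, so each match stops at the nearest </TEXT>.
GRAPHIC_BODY = re.compile(r'(?<=<TYPE>GRAPHIC)[\S\s]+?(?=</TEXT>)')

def strip_graphic_blocks(text):
    """Empty every <TYPE>GRAPHIC ... </TEXT> block, keeping the tags."""
    return GRAPHIC_BODY.sub('', text)

cleaned = strip_graphic_blocks(sample)
print(cleaned)
```

With this sample, the line between the two blocks survives, and each block is reduced to its opening and closing tag.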