Remove all non-ascii characters between “<TYPE>GRAPHIC” and “</TEXT>”

Question

I have a txt file downloaded directly from an HTML page. Its contents look like this:

<TYPE>GRAPHIC
<TEXT>
.....
Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\ G$O\IORU\W:1YV\MKK(UK1# (I guess are some kind of non-Ascii characters)
.....
</TEXT>

I want to remove everything between <TYPE>GRAPHIC and </TEXT>. I tried re.sub('<TYPE>GRAPHIC(.*)</TEXT>', '', reader), but it doesn't work.

Answer 1

Honestly, I think this is a legitimate question. It has probably been asked before, but re.sub behaves oddly and takes a lot of getting used to, and most answers don't really explain it. The fact that it replaces the whole match, not just what is inside a capture group, is especially confusing, so I don't see why you were downvoted.

Anyway, these two solutions should work:

1.

>>> import re

>>> reader = r'''<TYPE>GRAPHIC
    <TEXT>
    .....
    Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\ 
    G$O\IORU\W:1YV\MKK(UK1# 
    (I guess are some kind of non-Ascii characters)
    .....
    </TEXT>''' 

>>> re.sub(r"(?<=<TYPE>GRAPHIC)[\S\s]+(?=</TEXT>)", "", reader)
'<TYPE>GRAPHIC</TEXT>'

With (?<=<TYPE>GRAPHIC) I'm saying that whatever is matched must be preceded by <TYPE>GRAPHIC. Because it is a lookbehind, <TYPE>GRAPHIC itself is not part of the match, so it is not removed.
With [\S\s]+ I'm saying match any character, newlines included, and do it greedily, so as much text as possible is taken.
With (?=</TEXT>) I'm saying that the matched text must be followed by </TEXT>. Because it is a lookahead, </TEXT> is not part of the match either, so it stays in the re.sub result.
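One caveat about the greedy [\S\s]+: if a file holds more than one GRAPHIC block, the match runs from the first <TYPE>GRAPHIC to the last </TEXT> and takes everything in between with it. A minimal sketch of the difference, assuming a made-up two-block input (the doc string below is hypothetical, not from the question):

```python
import re

# Hypothetical input: two GRAPHIC blocks with ordinary text between them.
doc = (
    "<TYPE>GRAPHIC\n<TEXT>\nM%$2G]\n</TEXT>\n"
    "keep this line\n"
    "<TYPE>GRAPHIC\n<TEXT>\n5K*'5E\n</TEXT>\n"
)

# Greedy quantifier: one match, from the first <TYPE>GRAPHIC to the LAST </TEXT>.
greedy = re.sub(r"(?<=<TYPE>GRAPHIC)[\s\S]+(?=</TEXT>)", "", doc)

# Lazy quantifier (+?): each match stops at the nearest </TEXT>.
lazy = re.sub(r"(?<=<TYPE>GRAPHIC)[\s\S]+?(?=</TEXT>)", "", doc)

print(repr(greedy))
print(repr(lazy))
```

With the lazy version the line between the two blocks survives. With the greedy one it is removed along with both payloads.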

2.

>>> import re

>>> reader = r'''<TYPE>GRAPHIC
    <TEXT>
    .....
    Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\ 
    G$O\IORU\W:1YV\MKK(UK1# 
    (I guess are some kind of non-Ascii characters)
    .....
    </TEXT>'''


>>> parsed = re.sub(r'(<TYPE>GRAPHIC)[\S\s]+(</TEXT>)', r'\1\n\n\2', reader)
>>> print(parsed)
<TYPE>GRAPHIC

</TEXT>

The r prefix makes the pattern and the replacement Python raw string literals, so the backslashes reach the re module unchanged.
That way \1 and \2 in the replacement are read as references to the captured groups.
The caveat is that using re.sub this way works the opposite way round from what you'd expect when the goal is to eliminate text: you spell out what to keep rather than what to remove.
By supplying the argument r'\1\n\n\2' I'm telling it to keep captured group 1 (via \1), place two newline characters after it, and keep captured group 2 (via \2). Everything else in the match is dropped.
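For what it's worth, the attempt in the question most likely fails because "." does not match a newline by default, so (.*) can never get from <TYPE>GRAPHIC to a </TEXT> on a later line. A rough sketch of the same group-based replacement using the re.DOTALL flag instead of [\S\s], with a shortened stand-in for reader (the sample string is an assumption):

```python
import re

# Shortened stand-in for the question's data.
reader = "<TYPE>GRAPHIC\n<TEXT>\nM%$2G]U?HQM7L\n</TEXT>"

# Without a flag, "." stops at the first newline, so the pattern never matches
# and the string comes back unchanged.
unchanged = re.sub(r"<TYPE>GRAPHIC(.*)</TEXT>", "", reader)

# With re.DOTALL, "." also matches "\n". The lazy ".*?" stops at the nearest
# </TEXT>, and \1 / \2 put the two markers back.
cleaned = re.sub(r"(<TYPE>GRAPHIC).*?(</TEXT>)", r"\1\n\2", reader, flags=re.DOTALL)

print(unchanged == reader)
print(cleaned)
```

Whether to use re.DOTALL or [\S\s] is mostly a matter of taste. The flag changes "." for the whole pattern, while the character class only affects the spot where it is written.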

Answer 2

Here try this:

re.sub(r"(?<=<TYPE>GRAPHIC)\n(?:.|\n)+(?=</TEXT>)", "", text)
'<TYPE>GRAPHIC</TEXT>\n'

There are some complex regex constructs here. If you are curious about them, look up the references for lookahead and lookbehind assertions.
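To make the lookaround idea concrete, here is a small sketch, assuming a tiny sample string (text below is made up for illustration). A lookbehind checks what comes before the match and a lookahead checks what comes after it, and neither is included in the matched text:

```python
import re

# Tiny made-up sample with one GRAPHIC block.
text = "<TYPE>GRAPHIC\nM%$2G]\n</TEXT>\n"

# (?<=...) is a lookbehind, (?=...) is a lookahead. Both are zero-width:
# they constrain where the match may sit but are not part of it.
m = re.search(r"(?<=<TYPE>GRAPHIC)\n(?:.|\n)+(?=</TEXT>)", text)
inner = m.group(0)
print(repr(inner))

# The markers lie just outside the matched span.
before_ok = text[:m.start()].endswith("<TYPE>GRAPHIC")
after_ok = text[m.end():].startswith("</TEXT>")

# For contrast: a negative lookahead (?!...) placed right before "\n" is always
# satisfied, because the next character is a newline and not "<". It does not
# require the marker to be present at all.
no_marker = re.search(r"(?!<TYPE>GRAPHIC)\n", "abc\n")
print(no_marker is not None)
```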

solution1
1 2019-03-12 17:09:07

solution2
0 ACCEPTED 2019-03-12 11:05:04
