
Remove all non-ascii characters between “<TYPE>GRAPHIC” and “</TEXT>”

I have a txt file, downloaded directly from an HTML page, whose contents look like this:

<TYPE>GRAPHIC
<TEXT>
.....
Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\ G$O\IORU\W:1YV\MKK(UK1# (I guess are some kind of non-Ascii characters)
.....
</TEXT>

I want to remove everything between <TYPE>GRAPHIC and </TEXT>. I tried re.sub('<TYPE>GRAPHIC(.*)</TEXT>', '', reader), but it doesn't work.

Honestly, I think this is a legitimate question. It has probably been asked before, but re.sub behaves oddly and takes a lot of getting used to, and most answers don't really explain it. The fact that it replaces the whole match, not just the capture groups, is especially confusing, so I don't see why you were downvoted.

Anyway, these two solutions should work:

1.

>>> import re

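>>> # raw string: the sample contains "\U", which a normal Python 3 string literal rejects as a broken escape
>>> reader = r'''<TYPE>GRAPHIC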
    <TEXT>
    .....
    Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\ 
    G$O\IORU\W:1YV\MKK(UK1# 
    (I guess are some kind of non-Ascii characters)
    .....
    </TEXT>''' 

>>> re.sub(r"(?<=<TYPE>GRAPHIC)[\S\s]+(?=</TEXT>)", "", reader)
'<TYPE>GRAPHIC</TEXT>'
  • With (?<=<TYPE>GRAPHIC) (a lookbehind) I'm saying that the match must be preceded by <TYPE>GRAPHIC . <TYPE>GRAPHIC itself is not part of the match, so it is not removed
  • With [\S\s]+ I'm saying match one or more of any character, newlines included (a plain . skips newlines by default). The + is greedy, so it grabs as much text as it can
  • With (?=</TEXT>) (a lookahead) I'm saying that the match must be followed by </TEXT> . </TEXT> itself is not part of the match, so it stays in the re.sub result
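
As a side note on why the pattern from the question fails: by default . does not match a newline, so (.*) cannot span the lines between the two tags. A minimal sketch, assuming a shortened sample string (sample is a made-up stand-in for the real file contents):

```python
import re

# Hypothetical, shortened stand-in for the file contents.
sample = "<TYPE>GRAPHIC\n<TEXT>\nM%$2G]\n</TEXT>"

# Without re.DOTALL, '.' stops at the newline, so nothing matches.
unchanged = re.sub(r"<TYPE>GRAPHIC(.*)</TEXT>", "", sample)

# With re.DOTALL, '.' also matches newlines and the whole block is removed.
removed = re.sub(r"<TYPE>GRAPHIC(.*)</TEXT>", "", sample, flags=re.DOTALL)

# [\S\s] matches any character, newline included, without needing a flag.
kept_tags = re.sub(r"(?<=<TYPE>GRAPHIC)[\S\s]+(?=</TEXT>)", "", sample)

print(unchanged == sample)  # True
print(repr(removed))        # ''
print(repr(kept_tags))      # '<TYPE>GRAPHIC</TEXT>'
```

Whether you want the tags removed too (the re.DOTALL version) or kept (the lookaround version) depends on what the rest of your processing expects.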


2.

>>> import re

>>> reader = r'''<TYPE>GRAPHIC
    <TEXT>
    .....
    Example of omitted part: M%$2G]\U?HQM7L^!5K*'5E/1@0?IQ5\S^0/\ 
    G$O\IORU\W:1YV\MKK(UK1# 
    (I guess are some kind of non-Ascii characters)
    .....
    </TEXT>'''


>>> parsed = re.sub(r'(<TYPE>GRAPHIC)[\S\s]+(</TEXT>)', r'\1\n\n\2', reader)
>>> print(parsed)
<TYPE>GRAPHIC

</TEXT>
  • The "r" in front of each string makes it a Python raw string, so backslashes reach the regex engine unchanged (without it, \1 would be read as an escape sequence instead of a group reference)
  • The parentheses in the pattern create two capture groups, which I can then refer to in the replacement string
  • The extra caveat is that re.sub used this way works the opposite way round from what you'd expect when your goal is to eliminate text: instead of describing what to delete, you replace the whole match with the parts you want to keep
  • By supplying the argument r'\1\n\n\2' I'm telling it to keep my Captured Group 1 (via the \1 ), place two newline characters after it, and keep my Captured Group 2 (via the \2 ). Everything else in the match is dropped.
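
One thing to watch: [\S\s]+ is greedy, so if the file held several <TYPE>GRAPHIC blocks it would run from the first <TYPE>GRAPHIC to the last </TEXT> and swallow everything in between. Here is a sketch of a non-greedy variant, under the assumption (not stated in the question) that the file can contain more than one such block. The doc string is a made-up sample:

```python
import re

# Made-up sample with two graphic blocks and ordinary text between them.
doc = (
    "<TYPE>GRAPHIC\n<TEXT>\nM%$2G]\n</TEXT>\n"
    "keep this line\n"
    "<TYPE>GRAPHIC\n<TEXT>\nG$O\\IORU\n</TEXT>\n"
)

# '+?' is non-greedy: each match stops at the nearest </TEXT>.
pattern = re.compile(r"(<TYPE>GRAPHIC)[\S\s]+?(</TEXT>)")
cleaned = pattern.sub(r"\1\n\2", doc)

# Greedy version for comparison: one match spanning both blocks.
greedy = re.sub(r"(<TYPE>GRAPHIC)[\S\s]+(</TEXT>)", r"\1\n\2", doc)

print(cleaned)
print("keep this line" in greedy)  # False
```

With the non-greedy pattern, each block is emptied separately and the line between the blocks survives.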

Here try this:

re.sub(r"(?<=<TYPE>GRAPHIC)\n(?:.|\n)+(?=</TEXT>)", "", text)
'<TYPE>GRAPHIC</TEXT>\n'

There are some complex regex patterns here. If you are curious about what they are, look up the references for lookahead and lookbehind.
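
For a quick feel of the difference, here is a small sketch (the strings are invented for illustration). A lookbehind (?<=...) checks what comes before the current position, a lookahead (?=...) checks what comes after, and neither consumes any characters:

```python
import re

s = "<TYPE>GRAPHIC\nabc\n</TEXT>"  # invented sample

# Lookbehind + lookahead: match only what sits between the two markers.
between = re.search(r"(?<=<TYPE>GRAPHIC)[\S\s]+(?=</TEXT>)", s).group()
print(repr(between))  # '\nabc\n'

# A negative lookahead (?!...) asserts that the text ahead does NOT match.
# A position where "\n" matches can never start with "<TYPE>GRAPHIC", so the
# assertion is always true here and does not tie the match to the opening tag.
first = re.search(r"(?!<TYPE>GRAPHIC)\n", "x\ny\n<TYPE>GRAPHIC\n").start()
print(first)  # 1
```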
