
Remove binary data from html file using Java Regex

I have an HTML file that has tags containing binary data, like:

<HTML>
  <BODY STYLE="font: 10pt Times New Roman, Times, Serif">
    <TEXT>
      begin 644 image_002.jpg
        M_]C_X  02D9)1@ ! 0   0 !  #_VP!#  @&!@<&!0@'!P<)"0@*#!0-# L+
        M#!D2$P\4'1H?'AT:'!P@)"XG("(L(QP<*#<I+# Q-#0T'R<Y/3@R/"XS-#+_
        MVP!# 0D)"0P+#!@-#1@R(1PA,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R
       ,Z4]1]: %HHHIB/_9
    end
   </TEXT>
   <TEXT>losses occurring in the third quarter and from weather  </TEXT>
  </BODY>
</HTML>

So I am trying to remove all "TEXT" tags that contain binary data using Java regex. I tried the Jsoup library, but it only removes HTML tags. I saw the same question here, but it does not use Java regex.

Is there a standard way to remove this binary data from an HTML file?

It is well known that you shouldn't use a regex to handle XHTML.

I would use jsoup to remove the whole tag and then add it back empty.

But if you want to use a regex, you can use one like this. Note that it empties every TEXT element, not only the ones that contain binary data:

"your html here".replaceAll("(?s)<TEXT>.*?<\\/TEXT>", "<TEXT></TEXT>")
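If the plain-text TEXT elements have to survive, one option is to match every TEXT element but only empty those whose body looks like uuencoded data. The following is a minimal sketch, not a definitive implementation. The class and method names are hypothetical, and it assumes a binary body always starts with a header line of the form "begin 644 image_002.jpg", as in the sample from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UuencodedTextFilter {
    // Matches any <TEXT>...</TEXT> element; group 1 is the body.
    // DOTALL lets '.' span line breaks, like (?s) in the one-liner above.
    private static final Pattern TEXT_ELEMENT =
            Pattern.compile("<TEXT>(.*?)</TEXT>", Pattern.DOTALL);

    // Assumption: a binary body begins with "begin <mode> <filename>".
    private static final Pattern UU_HEADER =
            Pattern.compile("\\s*begin \\d+ \\S+");

    static String emptyBinaryText(String html) {
        Matcher m = TEXT_ELEMENT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // lookingAt() anchors at the start of the body only.
            boolean binary = UU_HEADER.matcher(m.group(1)).lookingAt();
            String replacement = binary ? "<TEXT></TEXT>" : m.group();
            // quoteReplacement keeps '$' and '\' in the kept text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<TEXT>\n begin 644 image_002.jpg\n M_]C_X\n end\n</TEXT>\n"
                + "<TEXT>losses occurring in the third quarter</TEXT>";
        System.out.println(emptyBinaryText(html));
    }
}
```

Checking the body in a second step keeps each regex simple, at the cost of a second match per element.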


   val regex =  """<TEXT>\s*begin \d+ (?>[^e]+|e(?!nd\s*<\/TEXT>))*end\s*<\/TEXT>"""
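Since the question asks for Java, here is a rough Java port of the pattern above. It is only a sketch: the class and method names are hypothetical, and it assumes the whole matching TEXT element should be removed rather than left empty. Java's java.util.regex supports both the atomic group (?>...) and the negative lookahead (?!...) used here, and the forward slash does not need escaping.

```java
import java.util.regex.Pattern;

class UuencodedBlockRemover {
    // <TEXT>, optional whitespace, a "begin <mode> " header, then any run of
    // characters that does not reach "end" followed by </TEXT>, then the
    // closing "end" line and tag.
    private static final Pattern UU_TEXT_BLOCK = Pattern.compile(
            "<TEXT>\\s*begin \\d+ (?>[^e]+|e(?!nd\\s*</TEXT>))*end\\s*</TEXT>");

    static String removeBinaryText(String html) {
        // Elements that do not start with a "begin <mode> " header are untouched.
        return UU_TEXT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<BODY><TEXT>\n begin 644 image_002.jpg\n M_]C_X\n end\n</TEXT>"
                + "<TEXT>losses occurring in the third quarter</TEXT></BODY>";
        System.out.println(removeBinaryText(html));
    }
}
```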

Full example here
