I have html file that have tags for binary data like:
<HTML>
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">
<TEXT>
begin 644 image_002.jpg
M_]C_X 02D9)1@ ! 0 0 ! #_VP!# @&!@<&!0@'!P<)"0@*#!0-# L+
M#!D2$P\4'1H?'AT:'!P@)"XG("(L(QP<*#<I+# Q-#0T'R<Y/3@R/"XS-#+_
MVP!# 0D)"0P+#!@-#1@R(1PA,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R
,Z4]1]: %HHHIB/_9
end
</TEXT>
<TEXT>losses occurring in the third quarter and from weather </TEXT>
</BODY>
</HTML>
so I am trying to remove all "TEXT" tags those have binary data using Java Regex. I tried Jsoup library But it only remove html tags. I saw the same question here . But it is not using Java Regex.
Is any standard way to remove this binary data from html file?
It is well know that you shouldn't use a regex to handle xhtml.
I would use jsoup to remove the whole tag and later add it empty.
But if you want to use a regex, then you can use a regex like this:
"your html here".replaceAll("(?s)<TEXT>.*?<\\/TEXT>", "<TEXT></TEXT>")
val regex = """<TEXT>\s*begin \d+ (?>[^e]+|e(?!nd\s*<\/TEXT>))*end\s*<\/TEXT>"""
完整示例在这里
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.