How to use a regex pattern to remove a piece of code from an HTML page?
I am extracting some information from a website.
Unfortunately, the code isn't very well organized, and some pieces of code (XML and styles) appear in the middle of the HTML structure.
I put all the HTML code in a string using Java, and I want to get rid of things like this:
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif]-->
(This code appears in one part of the page...)
Or more complex ones, like this:
<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:EnableOpenTypeKerning/>
<w:DontFlipMirrorIndents/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="267">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
<w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
<w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
</w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-priority:99;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin-top:0in;
mso-para-margin-right:0in;
mso-para-margin-bottom:10.0pt;
mso-para-margin-left:0in;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;
mso-fareast-language:EN-US;}
</style>
<![endif]-->
This also appears on the same page.
I noticed the if and endif markers, so I tried to use the replaceAll function to remove every part of the string that matches that pattern.
I am using the following pattern:
html = html.replaceAll("(<!--(.*)-->)*?", "");
I also tried these:
html = html.replaceAll("(<!--(.*)-->)", "");
html = html.replaceAll("(<!--(.*)<!\\\\[endif\\\\]-->)", "");
They are pretty vague, but every other variation I have tried doesn't work at all.
Unfortunately, these don't work either: they remove only the first block, and the large one remains.
What am I doing wrong?
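You need to make your regex match line breaks as well. By default, the dot does not match line terminators in Java, so .* can never span a multi-line comment. The (?s) flag (DOTALL) lifts that restriction, and the reluctant .*? stops at the first --> so that each comment is removed separately: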
html = html.replaceAll("(?s)<!--.*?-->", "");
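For reference, here is a minimal sketch of that fix as a small runnable class. The class name, the method names, and the sample string are made up for illustration and are not part of the original question. The second method is an optional, narrower variant that assumes you only want to drop the [if ...] ... [endif] conditional comments and keep ordinary HTML comments.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class ConditionalCommentStripper {

    // (?s) enables DOTALL so '.' also matches line terminators;
    // .*? is reluctant, so each match ends at the first "-->".
    private static final Pattern ANY_COMMENT =
            Pattern.compile("(?s)<!--.*?-->");

    // Narrower variant (an assumption, not from the answer): only
    // comments that open with "[if ...]>" and close with "<![endif]-->".
    private static final Pattern CONDITIONAL_COMMENT =
            Pattern.compile("(?s)<!--\\[if[^\\]]*\\]>.*?<!\\[endif\\]-->");

    // Removes every HTML comment, including multi-line ones.
    static String stripAllComments(String html) {
        return ANY_COMMENT.matcher(html).replaceAll("");
    }

    // Removes only conditional comments and leaves plain comments alone.
    static String stripConditionalComments(String html) {
        return CONDITIONAL_COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>before</p>\n"
                + "<!--[if gte mso 9]><xml>\n"
                + "<o:AllowPNG/>\n"
                + "</xml><![endif]--><!--[if gte mso 10]>\n"
                + "<style>table.MsoNormalTable {line-height:115%;}</style>\n"
                + "<![endif]-->\n"
                + "<p>after</p>";
        System.out.println(stripAllComments(html));
    }
}
```

Precompiling the pattern is just a convenience when the same cleanup runs over many pages; calling html.replaceAll with the same pattern string behaves the same way for a single page.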