How to use a regex pattern to remove a piece of code from an HTML page?

I am extracting some information from a website.

Unfortunately, the markup isn't very well organized, and some pieces of code (XML and styles) appear in the middle of the HTML structure.

I put all the HTML code in a string using Java, and I want to get rid of things like this:

<!--[if gte mso 9]><xml>
 <o:OfficeDocumentSettings>
  <o:AllowPNG/>
 </o:OfficeDocumentSettings>
</xml><![endif]-->

(This code appears in one part of the page...)

Or more complex ones, like this:

<!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:View>Normal</w:View>
  <w:Zoom>0</w:Zoom>
  <w:LidThemeAsian>X-NONE</w:LidThemeAsian>
  <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
  <w:Compatibility>
   <w:EnableOpenTypeKerning/>
   <w:DontFlipMirrorIndents/>
   <m:naryLim m:val="undOvr"/>
  </m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
  DefSemiHidden="true" DefQFormat="false" DefPriority="99"
  LatentStyleCount="267">
  <w:LsdException Locked="false" Priority="0" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
  <w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
  <w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
 </w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-tstyle-colband-size:0;
    mso-style-noshow:yes;
    mso-style-priority:99;
    mso-style-parent:"";
    mso-padding-alt:0in 5.4pt 0in 5.4pt;
    mso-para-margin-top:0in;
    mso-para-margin-right:0in;
    mso-para-margin-bottom:10.0pt;
    mso-para-margin-left:0in;
    line-height:115%;
    mso-pagination:widow-orphan;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-ascii-font-family:Calibri;
    mso-ascii-theme-font:minor-latin;
    mso-hansi-font-family:Calibri;
    mso-hansi-theme-font:minor-latin;
    mso-bidi-font-family:"Times New Roman";
    mso-bidi-theme-font:minor-bidi;
    mso-fareast-language:EN-US;}
</style>
<![endif]-->

This also appears on the same page.

I noticed the if and endif markers, so I tried to use replaceAll to remove every part of the string that matches that pattern.

I am using the following pattern: html = html.replaceAll("(<!--(.*)-->)*?", "");

I also tried these: html = html.replaceAll("(<!--(.*)-->)", ""); html = html.replaceAll("(<!--(.*)<!\\\\[endif\\\\]-->)", "");

They are pretty vague, but none of the other variations I have tried works at all.

Unfortunately these don't work either: they only remove the first block, and the large one remains.

What am I doing wrong?

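You need to make your regex match newlines as well. By default, . does not match line terminators, so a pattern like <!--(.*)--> can never span a comment that covers several lines. The (?s) flag turns on DOTALL mode, and the reluctant .*? stops each match at the nearest --> instead of the last one.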

html = html.replaceAll("(?s)<!--.*?-->", "");
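As a rough illustration of that one-liner, here is a minimal, self-contained sketch. The class name CommentStripper, both method names, and the sample HTML are made up for this example. The second pattern, which targets only <!--[if ...]> ... <![endif]--> blocks, is an assumption about what the asker may want and is not part of the answer above.

```java
import java.util.regex.Pattern;

public class CommentStripper {
    // (?s) enables DOTALL, so '.' also matches line terminators.
    // .*? is reluctant, so each match ends at the nearest "-->".
    private static final Pattern ANY_COMMENT =
            Pattern.compile("(?s)<!--.*?-->");

    // Hypothetical narrower variant: only conditional comments.
    // Pattern.DOTALL is the flag-constant equivalent of the inline (?s).
    private static final Pattern CONDITIONAL_COMMENT =
            Pattern.compile("<!--\\[if\\b.*?<!\\[endif\\]-->", Pattern.DOTALL);

    // Removes every HTML comment, including multi-line ones.
    public static String stripComments(String html) {
        return ANY_COMMENT.matcher(html).replaceAll("");
    }

    // Removes only <!--[if ...]> ... <![endif]--> blocks.
    public static String stripConditionalComments(String html) {
        return CONDITIONAL_COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up sample: one multi-line conditional comment, one plain comment.
        String html = "<p>before</p>"
                + "<!--[if gte mso 9]><xml>\n <o:AllowPNG/>\n</xml><![endif]-->"
                + "<p>middle</p>"
                + "<!-- plain comment -->"
                + "<p>after</p>";

        // The pattern from the question (no DOTALL) cannot cross the
        // newlines, so the multi-line block is left in place.
        System.out.println(html.replaceAll("(<!--(.*)-->)", ""));

        System.out.println(stripComments(html));
        System.out.println(stripConditionalComments(html));
    }
}
```

With this sample input, the first call should leave the multi-line block untouched, while stripComments removes both comments. Precompiling the Pattern is only a convenience for repeated use; html.replaceAll("(?s)<!--.*?-->", "") behaves the same way for a single call.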
