How can I use a regex pattern to remove a piece of code from an HTML page?

Question

I am extracting some information from a website.

Unfortunately, the code isn't very well organized, and some pieces of code (XML and styles) appear in the middle of the HTML structure.

I put all the HTML code in a string using Java, and I want to get rid of things like this:

<!--[if gte mso 9]><xml>
 <o:OfficeDocumentSettings>
  <o:AllowPNG/>
 </o:OfficeDocumentSettings>
</xml><![endif]-->

(This code appears in one part of the page...)

Or more complex ones, like this:

<!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:View>Normal</w:View>
  <w:Zoom>0</w:Zoom>
  <w:LidThemeAsian>X-NONE</w:LidThemeAsian>
  <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
  <w:Compatibility>
   <w:EnableOpenTypeKerning/>
   <w:DontFlipMirrorIndents/>
   <m:naryLim m:val="undOvr"/>
  </m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
  DefSemiHidden="true" DefQFormat="false" DefPriority="99"
  LatentStyleCount="267">
  <w:LsdException Locked="false" Priority="0" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
  <w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
  <w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
 </w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-tstyle-colband-size:0;
    mso-style-noshow:yes;
    mso-style-priority:99;
    mso-style-parent:"";
    mso-padding-alt:0in 5.4pt 0in 5.4pt;
    mso-para-margin-top:0in;
    mso-para-margin-right:0in;
    mso-para-margin-bottom:10.0pt;
    mso-para-margin-left:0in;
    line-height:115%;
    mso-pagination:widow-orphan;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-ascii-font-family:Calibri;
    mso-ascii-theme-font:minor-latin;
    mso-hansi-font-family:Calibri;
    mso-hansi-theme-font:minor-latin;
    mso-bidi-font-family:"Times New Roman";
    mso-bidi-theme-font:minor-bidi;
    mso-fareast-language:EN-US;}
</style>
<![endif]-->

This also appears on the same page.

I noticed the if and endif markers, so I tried to use the replaceAll function to remove every part of the string that matches that pattern.

I am using the following pattern: html = html.replaceAll("()*?", "");

I also tried this: html = html.replaceAll("()", "");

They are pretty vague, but none of the other variations I have tried works at all.

Unfortunately these don't work either: they remove only the first block, and the larger one remains.

What am I doing wrong?

Answer 1

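You need your regex to match line breaks as well. In Java, . does not match line terminators by default; the (?s) flag (DOTALL mode) enables that.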

html = html.replaceAll("(?s)<!--.*?-->", "");
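To illustrate the one-liner above, here is a minimal sketch that wraps it in two helper methods. The class and method names (CommentStripper, stripAllComments, stripConditionalComments) are made up for this example. The second, narrower pattern is an assumption that goes beyond the answer: it targets only comments of the form shown in the question, which open with `<!--[if` and close with `<![endif]-->`, and leaves ordinary comments alone.

```java
import java.util.regex.Pattern;

class CommentStripper {
    // Pattern.DOTALL is the same switch as the inline (?s) flag: it lets '.'
    // match line terminators, so a comment spanning many lines is one match.
    // '.*?' is reluctant, so each match ends at the first "-->" rather than
    // running on to the end of the last comment in the page.
    private static final Pattern ANY_COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Assumption (not from the answer): match only conditional comments that
    // start with "<!--[if" and end with "<![endif]-->".
    private static final Pattern CONDITIONAL_COMMENT =
            Pattern.compile("<!--\\[if.*?<!\\[endif\\]-->", Pattern.DOTALL);

    // Removes every HTML comment, conditional or not.
    static String stripAllComments(String html) {
        return ANY_COMMENT.matcher(html).replaceAll("");
    }

    // Removes only the conditional-comment blocks and keeps other comments.
    static String stripConditionalComments(String html) {
        return CONDITIONAL_COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>before</p>\n"
                + "<!--[if gte mso 9]><xml>\n"
                + " <o:AllowPNG/>\n"
                + "</xml><![endif]-->"
                + "<!-- ordinary comment -->\n"
                + "<p>after</p>";
        System.out.println(stripAllComments(html));
        System.out.println(stripConditionalComments(html));
    }
}
```

Without DOTALL, `.*?` stops at the first line break, so only comments that fit on a single line can match. The narrower pattern is probably the safer choice if the page contains ordinary comments you want to keep, but it assumes every conditional block is written as in the question's samples.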

Solution 1
1 2015-06-23 05:46:52

如何使用正则表达式模式从HTML页面中删除一段代码？

问题描述

1 个解决方案

解决方案1 1 2015-06-23 05:46:52

解决方案1
1 2015-06-23 05:46:52