
HTML attributes cleaning with Java

I have a school assignment: strip everything from HTML tags except a few attributes, such as class, id, alt, src, name and href.

For example, given this HTML file:

<div class="wrapper">
<h1 value="something" class=header>Header</h1>
<div id="article1" class="article" name="something" >
<img clsas="mistake" src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html" title="More">Více</a>
</div>

The result should look like this:

<div class="wrapper">
<h1 class=header>Header</h1>
<div id="article1" class="article" >
<img src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html">Více</a>
</div>

I tried something like this:

String opr = html.replaceAll("<([a-zA-Z]+)[^<>]*(class|id)(=\".+?\")[^<]*(class|id)(=\".+?\")[^<]*>", "<$1 $2$3 $4$5 >");

But it only works for HTML tags that have both a class and an id attribute. Can anyone help?
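The limitation described above can be reproduced with a small sketch. It applies the same pattern to two tags from the example: one carrying both id and class, and one carrying only class. The class name RegexLimitDemo and the method clean are hypothetical names chosen for illustration, and the h1 tag is written with a quoted class value because the pattern only recognises quoted values.

```java
final class RegexLimitDemo {
    // The pattern from the question: it needs TWO class/id attributes in the same tag.
    static final String PATTERN =
        "<([a-zA-Z]+)[^<>]*(class|id)(=\".+?\")[^<]*(class|id)(=\".+?\")[^<]*>";

    // Rewrites every matching tag so that only the two captured attributes remain.
    static String clean(String html) {
        return html.replaceAll(PATTERN, "<$1 $2$3 $4$5 >");
    }

    public static void main(String[] args) {
        // Both id and class are present, so the tag matches and name is dropped.
        System.out.println(clean("<div id=\"article1\" class=\"article\" name=\"something\" >"));
        // Only class is present, so the pattern never matches and value survives.
        System.out.println(clean("<h1 value=\"something\" class=\"header\">Header</h1>"));
    }
}
```

The second tag comes back unchanged, which is why attributes such as value or title are left in place on tags that do not carry both class and id.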

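Avoid regular expressions for this kind of task: getting a regex right for HTML is very complicated, which makes it hard to maintain. Use an HTML parser such as Jsoup instead, and clean each element by removing every attribute you don't want, as follows: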

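import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse("<html>\n" +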
    " <head></head>\n" +
    " <body>\n" +
    "<table><div class=\"wrapper\">\n" +
    "<h1 value=\"something\" class=header>Header</h1>\n" +
    "<div id=\"article1\" class=\"article\" name=\"something\" >\n" +
    "<img clsas=\"mistake\" src=\"picture.jpg\" id=\"pict1\" class=\"image_article\" alt=\"picture\" />\n" +
    "<p class=\"article_text\" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>\n" +
    "<a href=\"article.html\" title=\"More\">Více</a>\n" +
    "</div></body></html>"
);
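for (Element element : doc.getAllElements()) {
    // asList() returns a copy, so removing attributes inside the loop is safe.
    // Iterating element.attributes() directly while removing may skip entries
    // or throw ConcurrentModificationException, depending on the jsoup version.
    for (Attribute attribute : element.attributes().asList()) {
        switch (attribute.getKey()) {
            // Attributes to keep.
            case "class":
            case "id":
            case "alt":
            case "src":
            case "name":
            case "href":
                break;
            default:
                // Everything else is removed.
                element.removeAttr(attribute.getKey());
        }
    }
}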
System.out.println(doc);

Output:

<html>
 <head></head> 
 <body> 
  <div class="wrapper"> 
   <h1 class="header">Header</h1> 
   <div id="article1" class="article" name="something"> 
    <img src="picture.jpg" id="pict1" class="image_article" alt="picture"> 
    <p class="article_text">Lorem ipsum dolor sit amet, consectetur adipiscing. </p> 
    <a href="article.html">Více</a> 
   </div>
  </div>
  <table></table>
 </body>
</html>
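The keep-list idea itself does not depend on the parser. As a minimal sketch, assuming a plain LinkedHashMap as a stand-in for one element's attributes (this is not Jsoup's API, and AttributeKeepList and retainAllowed are hypothetical names), the allowed names can live in a Set so that changing the list means editing one line:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class AttributeKeepList {
    // Attribute names to keep, taken from the assignment.
    static final Set<String> ALLOWED = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList("class", "id", "alt", "src", "name", "href")));

    // Removes every entry whose key is not in ALLOWED; the remaining order is preserved.
    static void retainAllowed(Map<String, String> attributes) {
        attributes.keySet().retainAll(ALLOWED);
    }

    public static void main(String[] args) {
        // Stand-in for the attributes of the <img> tag in the example.
        Map<String, String> img = new LinkedHashMap<>();
        img.put("clsas", "mistake");
        img.put("src", "picture.jpg");
        img.put("id", "pict1");
        img.put("class", "image_article");
        img.put("alt", "picture");
        retainAllowed(img);
        System.out.println(img); // {src=picture.jpg, id=pict1, class=image_article, alt=picture}
    }
}
```

Removing "name" from the set would also drop name="something", as in the expected result shown in the question.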

