Cleaning HTML attributes with Java

For a school assignment, I have to strip every attribute from HTML tags except a few: class, id, alt, src, name and href.

For example, given this HTML file:

<div class="wrapper">
<h1 value="something" class=header>Header</h1>
<div id="article1" class="article" name="something" >
<img clsas="mistake" src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html" title="More">Více</a>
</div>

The result should look like this:

<div class="wrapper">
<h1 class=header>Header</h1>
<div id="article1" class="article" >
<img src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html">Více</a>
</div>

I tried something like this:

String opr = html.replaceAll("<([a-zA-Z]+)[^<>]*(class|id)(=\".+?\")[^<]*(class|id)(=\".+?\")[^<]*>", "<$1 $2$3 $4$5 >");

But it only works on HTML tags that have both a class and an id attribute. Can someone help, please?

Avoid regular expressions for this: a pattern that handles every case correctly becomes very complex and hard to maintain. Use an HTML parser such as Jsoup instead, then clean up each element by removing all the unwanted attributes, as follows:

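import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse("<html>\n" +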
    " <head></head>\n" +
    " <body>\n" +
    "<table><div class=\"wrapper\">\n" +
    "<h1 value=\"something\" class=header>Header</h1>\n" +
    "<div id=\"article1\" class=\"article\" name=\"something\" >\n" +
    "<img clsas=\"mistake\" src=\"picture.jpg\" id=\"pict1\" class=\"image_article\" alt=\"picture\" />\n" +
    "<p class=\"article_text\" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>\n" +
    "<a href=\"article.html\" title=\"More\">Více</a>\n" +
    "</div></body></html>"
);
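for (Element element : doc.getAllElements()) {
    // Iterate over a copy (asList()) so that removing attributes
    // does not disturb the iteration over the live attribute set
    for (Attribute attribute : element.attributes().asList()) {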
        switch (attribute.getKey()) {
            case "class":
            case "id":
            case "alt":
            case "src":
            case "name":
            case "href":
                break;
            default:
                element.removeAttr(attribute.getKey());
        }
    }
}
System.out.println(doc);

Output:

<html>
 <head></head> 
 <body> 
  <div class="wrapper"> 
   <h1 class="header">Header</h1> 
   <div id="article1" class="article" name="something"> 
    <img src="picture.jpg" id="pict1" class="image_article" alt="picture"> 
    <p class="article_text">Lorem ipsum dolor sit amet, consectetur adipiscing. </p> 
    <a href="article.html">Více</a> 
   </div>
  </div>
  <table></table>
 </body>
</html>
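
Note that the Jsoup output above is re-serialized: `class=header` comes back as `class="header"`, the `<img ... />` loses its trailing slash, and the fragment is wrapped in `<html>`, `<head>` and `<body>`. If the assignment requires the original formatting to be preserved, or a parser is not allowed, one possible fallback is a two-pass regex: one pattern finds each opening tag, a second walks its attributes one by one and drops those not on the allow-list. This is only a sketch, not a general HTML cleaner. The class name `AttributeFilter` and its methods are hypothetical, Java 9+ is assumed, and it will misbehave on attribute values containing `>` and on `<script>` or comment content that looks like a tag.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeFilter {
    // Attributes to keep, taken from the question.
    private static final Set<String> ALLOWED =
            Set.of("class", "id", "alt", "src", "name", "href");

    // Pass 1: an opening tag. Group 1 is the tag name, group 2 is
    // everything between the name and the closing '>'.
    private static final Pattern TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>");

    // Pass 2: one attribute with its leading whitespace. Group 1 is the
    // attribute name; the value (double-quoted, single-quoted or bare)
    // is optional.
    private static final Pattern ATTR = Pattern.compile(
            "\\s+([^\\s\"'=<>/]+)(?:\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s\"'=<>`]+))?");

    public static String clean(String html) {
        Matcher tag = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (tag.find()) {
            String rebuilt = "<" + tag.group(1) + filterAttributes(tag.group(2)) + ">";
            tag.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        tag.appendTail(out);
        return out.toString();
    }

    // Removes disallowed attributes in place; kept attributes, spacing
    // and a trailing "/" are left exactly as written.
    private static String filterAttributes(String attrs) {
        Matcher attr = ATTR.matcher(attrs);
        StringBuilder kept = new StringBuilder();
        while (attr.find()) {
            String name = attr.group(1).toLowerCase(Locale.ROOT);
            String replacement = ALLOWED.contains(name) ? attr.group() : "";
            attr.appendReplacement(kept, Matcher.quoteReplacement(replacement));
        }
        attr.appendTail(kept);
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("<h1 value=\"something\" class=header>Header</h1>"));
        System.out.println(clean("<a href=\"article.html\" title=\"More\">More</a>"));
    }
}
```

Because each attribute is matched on its own, the number and order of kept attributes no longer matter, which is what the single pattern in the question could not handle. For anything beyond a controlled input like the sample, the parser-based approach remains the safer choice.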
