Getting rid of HTML attributes and their values when parsing with jsoup

Question

I am trying to parse multiple HTML documents so that I get only the tags, discarding all of their attributes and values. Can someone help me, please?

For example: <img src="pic_trulli.jpg" alt="Italian Trulli">

changes to

<img>

Similarly, I want this to work for all the tags in an HTML document.

Answer 1

To remove the attributes of a single element you can use this:

element.attributes().asList()
        .stream().map(Attribute::getKey)
        .forEach(element::removeAttr);

To remove the attributes of all elements you can use this in combination with document.getAllElements():

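import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;

Document document = Jsoup.parse("<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">");
// Collect each element's attribute keys, then remove them by key
document.getAllElements()
        .forEach(e -> e.attributes().asList()
                .stream().map(Attribute::getKey)
                .forEach(e::removeAttr));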

The result will be this:

<html>
 <head></head>
 <body>
  <img>
 </body>
</html>

Answer 2

You can iterate over all elements of the document and then over each element's attributes, removing each attribute by its key.

Demo:

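import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">" +
        "<div class=\"foo\"><a href=\"pic_trulli.jpg\" alt=\"Italian Trulli\" non-standard></div>";
Document doc = Jsoup.parse(html);

System.out.println(doc); // document before stripping
for (Element el : doc.getAllElements()) {
    // Iterate over the list from asList(), not the live Attributes object
    for (Attribute atr : el.attributes().asList()) {
        el.removeAttr(atr.getKey());
    }
}
System.out.println("-----");
System.out.println(doc); // document after stripping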

Output:

<html>
 <head></head>
 <body>
  <img src="pic_trulli.jpg" alt="Italian Trulli">
  <div class="foo">
   <a href="pic_trulli.jpg" alt="Italian Trulli" non-standard></a>
  </div>
 </body>
</html>
-----
<html>
 <head></head>
 <body>
  <img>
  <div>
   <a></a>
  </div>
 </body>
</html>
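
The loops above remove attributes while iterating. This works because the iteration runs over the list obtained from asList() rather than over the element's live attribute collection. The same pattern can be sketched with plain JDK collections. This is only an illustration under stated assumptions: the class SnapshotRemoval, its methods, and the Map it uses are hypothetical stand-ins, not jsoup types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SnapshotRemoval {
    // Hypothetical stand-in for one element's attributes (not a jsoup type).
    static Map<String, String> sampleAttributes() {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("src", "pic_trulli.jpg");
        attrs.put("alt", "Italian Trulli");
        return attrs;
    }

    // Removes every entry by looping over a copy of the keys.
    // Looping over attrs.keySet() directly while calling attrs.remove(...)
    // would typically throw ConcurrentModificationException.
    static int removeAll(Map<String, String> attrs) {
        List<String> keys = new ArrayList<>(attrs.keySet()); // snapshot of the keys
        for (String key : keys) {
            attrs.remove(key);
        }
        return keys.size();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = sampleAttributes();
        int removed = removeAll(attrs);
        System.out.println(removed + " removed, remaining: " + attrs);
        // prints "2 removed, remaining: {}"
    }
}
```

Copying the keys first keeps the loop independent of the collection being modified, which is the same reason the jsoup loops can call removeAttr safely.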

Answer 3

If your aim is to get a bare document structure, you need to remove text and data nodes as well. Consider the following snippet.

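import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

// get() performs the HTTP request and may throw IOException
Document document = Jsoup.connect("http://example.com").get();
document.getAllElements().forEach(element -> {
    // Strip attributes, then text nodes and data nodes (e.g. <style> contents)
    element.attributes().asList().forEach(attr -> element.removeAttr(attr.getKey()));
    element.textNodes().forEach(Node::remove);
    element.dataNodes().forEach(Node::remove);
});
System.out.println(document);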

Output:

<!doctype html>
<html>
 <head>
  <title></title>
  <meta>
  <meta>
  <meta>
  <style></style>
 </head>
 <body>
  <div>
   <h1></h1>
   <p></p>
   <p><a></a></p>
  </div>
 </body>
</html>

3 answers

Answer 1
1 2019-04-29 17:26:18

Answer 2
0 2019-04-29 17:22:52

Answer 3
0 2019-04-29 19:21:37
