Getting rid of HTML attributes and their values while parsing with jsoup
I am trying to parse multiple HTML documents so that I keep only the tags and discard all of their attributes and values. Can someone help me, please?
For example:
<img src="pic_trulli.jpg" alt="Italian Trulli">
changes to
<img>
I want this to work for every tag in an HTML document.
To remove the attributes of a single element you can use this:
element.attributes().asList()
        .stream().map(Attribute::getKey)
        .forEach(element::removeAttr);
To remove the attributes of all elements, combine this with document.getAllElements():
Document document = Jsoup.parse("<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">");
document.getAllElements()
        .forEach(e -> e.attributes().asList()
                .stream().map(Attribute::getKey)
                .forEach(e::removeAttr));
The result will be this:
<html>
<head></head>
<body>
<img>
</body>
</html>
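As a shorter alternative to the stream above, jsoup's Element also offers clearAttributes(), which removes every attribute of an element in one call. The sketch below assumes your jsoup version includes that method. The class name AttributeStripper and the helper stripAttributes are hypothetical names made up for illustration, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AttributeStripper {

    // Hypothetical helper (not part of jsoup): parses the given HTML,
    // clears all attributes on every element, and returns the body's inner HTML.
    static String stripAttributes(String html) {
        Document document = Jsoup.parse(html);
        for (Element element : document.getAllElements()) {
            // Assumption: the jsoup version in use provides Element.clearAttributes().
            element.clearAttributes();
        }
        return document.body().html();
    }

    public static void main(String[] args) {
        String html = "<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">";
        System.out.println(stripAttributes(html));
    }
}
```

If clearAttributes() is not available in your jsoup version, the removeAttr loop shown earlier does the same job.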
You can iterate over all elements of the document and then over each element's attributes, removing each attribute as you go.
Demo:
String html = "<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">" +
        "<div class=\"foo\"><a href=\"pic_trulli.jpg\" alt=\"Italian Trulli\" non-standard></div>";
Document doc = Jsoup.parse(html);
System.out.println(doc);
// asList() returns a separate list, so removing attributes while looping over it is safe
for (Element el : doc.getAllElements()) {
    for (Attribute atr : el.attributes().asList()) {
        el.removeAttr(atr.getKey());
    }
}
System.out.println("-----");
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<img src="pic_trulli.jpg" alt="Italian Trulli">
<div class="foo">
<a href="pic_trulli.jpg" alt="Italian Trulli" non-standard></a>
</div>
</body>
</html>
-----
<html>
<head></head>
<body>
<img>
<div>
<a></a>
</div>
</body>
</html>
If your aim is to get a bare document structure, you also need to remove text and data nodes. Consider the following snippet:
Document document = Jsoup.connect("http://example.com").get();
document.getAllElements().forEach(element -> {
    element.attributes().asList().forEach(attr -> element.removeAttr(attr.getKey()));
    element.textNodes().forEach(Node::remove);
    element.dataNodes().forEach(Node::remove);
});
System.out.println(document);
Output:
<!doctype html>
<html>
<head>
<title></title>
<meta>
<meta>
<meta>
<style></style>
</head>
<body>
<div>
<h1></h1>
<p></p>
<p><a></a></p>
</div>
</body>
</html>