JSoup parse attribute="value"=""

My use case

  1. Download an HTML file from a website
  2. Parse it with JSoup
  3. Transform it to valid XML with JSoup
  4. Read elements and attributes from that XML document with XPath (javax.xml.xpath)

(This is implemented and works as expected in most cases.)

Problem / cause

There is one case that fails:

  1. The source HTML file contains something invalid like this: <div someattribute="somevalue"=""></div>
  2. JSoup transforms it to the equally invalid string <div someattribute="somevalue" =""=""></div>
  3. The XML parser behind XPath cannot parse the invalid XML that JSoup outputs.

Questions and solution approaches

  1. Is it possible to give JSoup a hint so that it produces valid output for this invalid input?
  2. Is it possible to give XPath a hint so that it parses that invalid input (= JSoup output)?
  3. As a fallback, I could filter the invalid "="" out of the HTML string and replace it with ", but why do that myself when there is a library that can parse invalid HTML?
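
For reference, the string-level fallback from point 3 could look roughly like the sketch below. The class name StrayAttributeFilter and the method strip are made up for illustration, and the regex is an assumption about what the broken markup looks like (a closing quote directly followed by one or more ="" sequences). It works on raw text and does not understand HTML, so it may also touch matching text outside of tags.

```java
import java.util.regex.Pattern;

final class StrayAttributeFilter {
    // A double quote directly followed by one or more stray ="" sequences.
    private static final Pattern STRAY = Pattern.compile("\"(?:=\"\")+");

    // Replaces each match with a single double quote (hypothetical helper).
    static String strip(String html) {
        return STRAY.matcher(html).replaceAll("\"");
    }

    public static void main(String[] args) {
        String html = "<div someattribute=\"somevalue\"=\"\"></div>";
        // prints <div someattribute="somevalue"></div>
        System.out.println(strip(html));
    }
}
```

A legitimately empty attribute such as x="" is left alone, because there the quote is followed by another quote rather than by an equals sign.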

Technical details

Unfortunately, the HTML document I want JSoup to parse contains something like this snippet:

<div someattribute="somevalue"=""></div>

Calling JSoup with this configuration...

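Document doc = Jsoup.parse(html);
// Serialize with XML syntax instead of HTML syntax
doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml).charset(StandardCharsets.UTF_8);
// Use a new variable name; html is already declared as the input above
String xml = doc.html();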

... returns an HTML document that contains this snippet:

<div someattribute="somevalue" =""=""></div>

XPath then aborts parsing this document with this message:

Auf Elementtyp "div" müssen entweder Attributspezifikationen, ">" oder "/>" folgen.

In English, this is roughly:

Element type "div" must be followed by either attribute specifications, ">" or "/>".
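
The failure can be reproduced without JSoup or XPath by handing the serialized snippet straight to the JDK's XML parser. This is only a sketch: the class name WellFormedCheck and the method isWellFormed are invented for illustration, and it assumes the error comes from the XML parsing step that runs before any XPath expression is evaluated.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class WellFormedCheck {
    // Returns true if the JDK XML parser accepts the string (hypothetical helper).
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed XML, e.g. an attribute without a name
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken = "<div someattribute=\"somevalue\" =\"\"=\"\"></div>";
        String fixed = "<div someattribute=\"somevalue\"></div>";
        System.out.println(isWellFormed(broken)); // prints false
        System.out.println(isWellFormed(fixed));  // prints true
    }
}
```

The default parser may additionally log the fatal error to standard error before the exception is caught.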

jsoup includes a converter to the W3C DOM, which filters out invalid attributes during conversion. You can then run XPath queries on that object directly. This not only works, it is also more efficient than serializing to XML and re-parsing the result.

See the documentation for org.jsoup.helper.W3CDom

Here's an example:

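import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Node;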
...

String html = "<div someattribute=\"somevalue\"=\"\"></div>";
org.jsoup.nodes.Document jdoc = Jsoup.parse(html);
Document w3doc = W3CDom.convert(jdoc);

String query = "//div";
XPathExpression xpath = XPathFactory.newInstance().newXPath().compile(query);
Node div = (Node) xpath.evaluate(w3doc, XPathConstants.NODE);

System.out.printf("Tag: %s, Attribute: %s",
        div.getNodeName(),
        div.getAttributes().getNamedItem("someattribute"));

(Note that Document and Node here are W3C DOM types, not the jsoup DOM.)

That gives us:

Tag: div, Attribute: someattribute="somevalue"
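
If only the attribute value is needed, XPath can return it as a string directly. The sketch below uses JDK classes only: the hand-built W3C Document stands in for the result of W3CDom.convert, and the names AttributeValueQuery, sampleDocument and attributeValue are invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class AttributeValueQuery {
    // Builds <div someattribute="somevalue"/> as a stand-in for a converted jsoup document.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            div.setAttribute("someattribute", "somevalue");
            doc.appendChild(div);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates the expression against the document and returns its string value.
    static String attributeValue(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints somevalue
        System.out.println(attributeValue(sampleDocument(), "//div/@someattribute"));
    }
}
```

An expression that matches nothing yields an empty string rather than null with this two-argument evaluate call.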
