JSoup: parsing attribute="value"=""

My use case

  1. Download an HTML file from a website
  2. Parse it with JSoup
  3. Transform it to valid XML with JSoup
  4. Read elements and attributes from that XML document with XPath (javax.xml.xpath)

(This is implemented and works in most cases as expected.)

Problem / cause

There is one case that fails:

  1. The source HTML file contains something invalid like this <div someattribute="somevalue"=""></div>
  2. JSoup transforms it into the equally invalid string <div someattribute="somevalue" =""=""></div>
  3. XPath is not able to parse the invalid JSoup output XML.

Questions and solution approaches

  1. Is it possible to give JSoup a hint so that it produces valid output for this invalid input?
  2. Is it possible to give XPath a hint so that it parses that invalid input (i.e. the JSoup output)?
  3. As a fallback, I could replace the invalid "="" in the HTML string with " myself, but why do that when there is a library that can parse invalid HTML?
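
For reference, the fallback in point 3 could look roughly like the sketch below. It is a crude, regex-based pre-filter, not part of JSoup; the class name StrayEqualsFilter and the method strip are hypothetical. It assumes the stray ="" always follows a double-quoted attribute value directly, and it would also rewrite the same character sequence if it happened to appear in text content.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes a stray ="" that directly follows a
// double-quoted attribute value, e.g. a="b"="" becomes a="b".
final class StrayEqualsFilter {

    // Group 1 captures the quoted value; the trailing ="" is dropped.
    private static final Pattern STRAY = Pattern.compile("(\"[^\"<>]*\")=\"\"");

    static String strip(String html) {
        return STRAY.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("<div someattribute=\"somevalue\"=\"\"></div>"));
    }
}
```

The cleaned string would then be passed to Jsoup.parse as before.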

Technical details

Unfortunately the HTML document I want to get parsed by JSoup contains something like this snippet:

<div someattribute="somevalue"=""></div>

Calling JSoup with this configuration...

Document doc = Jsoup.parse(html);
// Serialize using XML syntax rules instead of HTML
doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml).charset(StandardCharsets.UTF_8);
String xml = doc.html();

... returns an HTML document that contains this snippet:

<div someattribute="somevalue" =""=""></div>

XPath then aborts parsing this document with this message:

Auf Elementtyp "div" müssen entweder Attributspezifikationen, ">" oder "/>" folgen.

In English, this reads roughly as follows:

Element type "div" must be followed by either attribute specifications, ">" or "/>".
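
The failure can be reproduced without JSoup or XPath, since it comes from the XML parser underneath. The sketch below uses only the JDK's DocumentBuilder; the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reports whether a string parses as well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for fatal errors such as a stray = inside a start tag
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The JSoup output from the question
        System.out.println(isWellFormed("<div someattribute=\"somevalue\" =\"\"=\"\"></div>"));
        // The same element without the stray =""=""
        System.out.println(isWellFormed("<div someattribute=\"somevalue\"></div>"));
    }
}
```

The first call should report false and the second true. Depending on the JDK, the default error handler may also print the parser message to stderr.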

Answer

jsoup includes a converter to the W3C DOM model that filters out invalid attribute names during conversion. You can then run XPath queries on the resulting object directly, which not only works but is also more efficient than serializing to XML and re-parsing it.

See the documentation for org.jsoup.helper.W3CDom

Here's an example:

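import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
...

// The invalid input from the question
String html = "<div someattribute=\"somevalue\"=\"\"></div>";
org.jsoup.nodes.Document jdoc = Jsoup.parse(html);
// Convert the jsoup DOM to a W3C DOM instead of serializing and re-parsing
Document w3doc = W3CDom.convert(jdoc);

String query = "//div";
XPathExpression xpath = XPathFactory.newInstance().newXPath().compile(query);
Node div = (Node) xpath.evaluate(w3doc, XPathConstants.NODE);

System.out.printf("Tag: %s, Attribute: %s",
        div.getNodeName(),
        div.getAttributes().getNamedItem("someattribute"));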

(Note that Document and Node here are the W3C DOM types, not the jsoup DOM types.)

That gives us:

Tag: div, Attribute: someattribute="somevalue"
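
The output above prints the whole attribute node. If only the attribute value is needed, one option is to select the attribute in the XPath query and evaluate it to a string. In the sketch below, AttributeValueQuery and its methods are hypothetical names. So that it runs without jsoup, the W3C document is built from already valid XML with the JDK parser as a stand-in for a converted document. Whether the same query matches a document produced by W3CDom may depend on the jsoup version and its namespace settings.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: evaluates an XPath query to its string value.
final class AttributeValueQuery {

    static String evaluateToString(Document doc, String query) {
        try {
            // Returns "" when the query selects nothing
            return XPathFactory.newInstance().newXPath().evaluate(query, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + query, e);
        }
    }

    // Stand-in for a converted document: parses already valid XML with the JDK
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div someattribute=\"somevalue\"></div>");
        System.out.println(evaluateToString(doc, "//div/@someattribute"));
    }
}
```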
