JSoup parse attribute=“value”=“”

Question

My use case

Download an HTML file from a website
Parse it with JSoup
Transform it to valid XML with JSoup
Read elements and attributes from that XML document with XPath (javax.xml.xpath)

(This is implemented and works in most cases as expected.)

Problem / cause

There is one case that fails:

The source HTML file contains something invalid like this <div someattribute="somevalue"=""></div>
JSoup transforms it to the also invalid string <div someattribute="somevalue" =""=""></div>
XPath is not able to parse the invalid JSoup output XML.

Questions and solution approaches

Is it possible to give JSoup a hint so that it produces valid output for this invalid input?
Is it possible to give XPath a hint so that it parses that invalid input (=JSoup output)?
As a fallback I could replace the invalid "="" in the HTML string with a plain " myself, but why do that by hand when I am already using a library that can parse invalid HTML?
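For reference, that string-level fallback could be sketched roughly as follows. This is only an illustration under assumptions: the class name DanglingValueFilter and the regular expression are made up for this example and are not part of jsoup, and the pattern is deliberately crude (it would also rewrite a matching sequence inside text or script content).

```java
import java.util.regex.Pattern;

class DanglingValueFilter {
    // A quoted attribute value that is immediately followed by a stray ="".
    // Group 1 captures the quoted value so it can be kept.
    private static final Pattern DANGLING = Pattern.compile("(\"[^\"]*\")=\"\"");

    // Returns the input with every stray ="" after a quoted value removed.
    static String strip(String html) {
        return DANGLING.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<div someattribute=\"somevalue\"=\"\"></div>";
        // prints: <div someattribute="somevalue"></div>
        System.out.println(strip(html));
    }
}
```

The cleaned string could then be handed to Jsoup.parse as before. Input that does not contain the pattern passes through unchanged.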

Technical details

Unfortunately the HTML document I want to get parsed by JSoup contains something like this snippet:

<div someattribute="somevalue"=""></div>

Calling JSoup with this configuration...

Document doc = Jsoup.parse(html);
doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml).charset(StandardCharsets.UTF_8);
String xml = doc.html(); // new variable name: "html" is already declared as the input

... returns an HTML document that contains this snippet:

<div someattribute="somevalue" =""=""></div>

XPath then aborts parsing this document with this message:

Auf Elementtyp "div" müssen entweder Attributspezifikationen, ">" oder "/>" folgen.

In English, this is roughly:

Element type "div" must be followed by either attribute specifications, ">" or "/>".
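The failure can be reproduced without jsoup by feeding the snippet to the JDK's XML parser. The following is a minimal sketch, assuming the default javax.xml.parsers implementation; the class name WellFormedCheck and the helper isWellFormed are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the JDK XML parser accepts the string, false if it
    // rejects it with a SAXException (for example a SAXParseException).
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The serialized output from above: rejected, prints false
        System.out.println(isWellFormed("<div someattribute=\"somevalue\" =\"\"=\"\"></div>"));
        // The same element without the stray attribute: accepted, prints true
        System.out.println(isWellFormed("<div someattribute=\"somevalue\"></div>"));
    }
}
```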

Answer 1

jsoup includes a converter to the W3C DOM model, which filters attributes during conversion. You can then run XPath queries on that object directly, which will not only work but will also be more efficient than serializing to XML and then re-parsing it.

See the documentation for org.jsoup.helper.W3CDom.

Here's an example:

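import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;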

String html = "<div someattribute=\"somevalue\"=\"\"></div>";
org.jsoup.nodes.Document jdoc = Jsoup.parse(html);
Document w3doc = W3CDom.convert(jdoc);

String query = "//div";
XPathExpression xpath = XPathFactory.newInstance().newXPath().compile(query);
Node div = (Node) xpath.evaluate(w3doc, XPathConstants.NODE);

System.out.printf("Tag: %s, Attribute: %s",
        div.getNodeName(),
        div.getAttributes().getNamedItem("someattribute"));

(Note that Document and Node here are the W3C DOM types, not the jsoup DOM types.)

That gives us:

Tag: div, Attribute: someattribute="somevalue"
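If only the attribute value is needed, the XPath expression can select the attribute itself and return it as a string. The sketch below uses only the JDK so that it runs without jsoup on the classpath: the Document is built with DocumentBuilder as a stand-in for the result of the W3CDom conversion, and the names AttributeXPath, parseXml and attributeValue are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeXPath {
    // Stand-in for the jsoup conversion: builds a W3C Document from valid XML.
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates the expression against an in-memory W3C Document and
    // returns the result as a string (empty if nothing matches).
    static String attributeValue(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(expression, doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml("<div someattribute=\"somevalue\"></div>");
        // prints: somevalue
        System.out.println(attributeValue(doc, "//div/@someattribute"));
    }
}
```

With jsoup available, the Document passed to attributeValue would presumably come from the W3CDom conversion shown above instead of parseXml.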

1 answer

solution1
1 ACCEPTED 2021-01-09 00:18:34
