Use jsoup to parse XML - prevent jsoup from “cleaning” <link> tags

Question

In most cases, I have no problem using jsoup to parse XML. However, if there are <link> tags in the XML document, jsoup changes <link>some text here</link> to <link />some text here. This makes it impossible to extract the text inside the <link> tag using a CSS selector.

So how can I prevent jsoup from "cleaning" <link> tags?

Answer 1

In jsoup 1.6.2 I have added an XML parser mode, which parses the input as-is, without applying the HTML5 parse rules (contents of elements, document structure, etc.). This mode keeps the text in a <link> tag, allows multiple <link> elements, and so on.

Here's an example:

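import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// Parse with the XML parser so <link> keeps its text content
String xml = "<link>One</link><link>Two</link>";
Document xmlDoc = Jsoup.parse(xml, "", Parser.xmlParser());

Elements links = xmlDoc.select("link");
System.out.println("Link text 1: " + links.get(0).text());
System.out.println("Link text 2: " + links.get(1).text());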

This prints:

Link text 1: One
Link text 2: Two

Answer 2

Do not store any text inside a <link> element; that is invalid HTML. If you need extra information, keep it in HTML5 data-* attributes. I'm sure jsoup won't touch those.

<link rel="..." data-city="Warsaw" />

Answer 3

There is a workaround for this: before passing the XML to jsoup, transform the XML file to replace all <link> tags with some dummy tag, then do what you want to do.
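The renaming step could be sketched roughly as below, using only the Java standard library. The class name LinkTagRenamer, the method renameLinkTags, and the placeholder tag name xlink-item are all hypothetical choices for illustration; any tag name that does not already occur in your XML should work. This is a plain text substitution, not a real XML transformation, so it will also rewrite matching text inside comments or CDATA sections.

```java
import java.util.regex.Pattern;

class LinkTagRenamer {
    // Hypothetical placeholder name; choose one that does not occur in your XML.
    static final String DUMMY_TAG = "xlink-item";

    // Matches "<link" or "</link" only when followed by whitespace, "/" or ">",
    // so longer names such as <linkage> are left untouched.
    private static final Pattern LINK_TAG =
            Pattern.compile("<(/?)link(?=[\\s/>])");

    // Returns the XML text with every <link> start/end tag renamed to DUMMY_TAG.
    static String renameLinkTags(String xml) {
        return LINK_TAG.matcher(xml).replaceAll("<$1" + DUMMY_TAG);
    }

    public static void main(String[] args) {
        String xml = "<item><link>http://example.com/a</link><linkage>x</linkage></item>";
        System.out.println(renameLinkTags(xml));
    }
}
```

Running main prints <item><xlink-item>http://example.com/a</xlink-item><linkage>x</linkage></item>. The renamed string can then be passed to jsoup and queried with a selector on the placeholder tag name instead of link.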

3 answers

solution1
35 ACCEPTED 2012-04-15 00:15:28

solution2
1 2011-12-28 10:48:01

solution3
-1 2011-10-20 14:37:00
