Parsing XML with jsoup - preventing jsoup from "cleaning" <link> tags

Question

In most cases, I have no problem using jsoup to parse XML. However, if the XML document contains <link> tags, jsoup changes <link>some text here</link> to <link />some text here. This makes it impossible to extract the text inside the <link> tag with a CSS selector.

So how can I prevent jsoup from "cleaning" <link> tags?

Answer 1

In jsoup 1.6.2 I added an XML parser mode, which parses the input as-is, without applying the HTML5 parse rules (contents of elements, document structure, etc.). This mode keeps the text inside a <link> tag, allows multiple <link> elements, and so on.

Here's an example:

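import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// Parse with the XML parser so <link> is treated as an ordinary element
String xml = "<link>One</link><link>Two</link>";
Document xmlDoc = Jsoup.parse(xml, "", Parser.xmlParser());

Elements links = xmlDoc.select("link");
System.out.println("Link text 1: " + links.get(0).text());
System.out.println("Link text 2: " + links.get(1).text());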

Output:

Link text 1: One
Link text 2: Two
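
To make the difference between the two parser modes concrete, here is a minimal sketch that runs the same input through both. The class name LinkParseDemo and the helper linkTexts are made up for illustration; only Jsoup.parse, Parser.xmlParser, select and text are jsoup calls. It assumes jsoup 1.6.2 or later is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class LinkParseDemo {

    // Hypothetical helper: parse the input in XML mode or in the default
    // HTML mode and collect the text of every <link> element found.
    public static List<String> linkTexts(String input, boolean xmlMode) {
        Document doc = xmlMode
                ? Jsoup.parse(input, "", Parser.xmlParser()) // input kept as-is
                : Jsoup.parse(input);                        // HTML5 parse rules apply
        List<String> texts = new ArrayList<>();
        for (Element link : doc.select("link")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<link>One</link><link>Two</link>";
        // XML mode: each <link> keeps its text.
        System.out.println("XML mode:  " + linkTexts(xml, true));
        // HTML mode: <link> is treated as an empty element, so its text is lost.
        System.out.println("HTML mode: " + linkTexts(xml, false));
    }
}
```

In HTML mode the text ends up next to the <link> element rather than inside it, which is the behaviour described in the question.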

Answer 2

Do not store any text inside a <link> element; in HTML it is a void element, so text content is invalid. If you need extra information, keep it in HTML5 data-* attributes. I'm sure jsoup won't touch those.

<link rel="..." data-city="Warsaw" />

Answer 3

There is a workaround for this. Before passing the XML to jsoup, transform it to replace every <link> tag with some dummy tag, then do what you want to do.
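
A rough sketch of that pre-processing step, using only the JDK. The class name LinkTagRenamer, the method renameLinkTags and the placeholder name xlink are all invented for illustration; the answer does not say which dummy tag to use. Note that this is a purely textual replacement, so it would also rewrite <link> occurrences inside comments or CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTagRenamer {

    // Matches the start of an opening or closing <link> tag. The lookahead
    // requires whitespace, '>' or '/' after the name, so <linked> is not matched.
    private static final Pattern LINK_TAG = Pattern.compile("<(/?)link(?=[\\s>/])");

    // Renames every <link> tag to the given placeholder tag name.
    public static String renameLinkTags(String xml, String placeholder) {
        String replacement = "<$1" + Matcher.quoteReplacement(placeholder);
        return LINK_TAG.matcher(xml).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String xml = "<link>One</link><link>Two</link>";
        // prints <xlink>One</xlink><xlink>Two</xlink>
        System.out.println(renameLinkTags(xml, "xlink"));
    }
}
```

The renamed document can then be handed to jsoup and queried with a selector on the placeholder name instead of link.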

3 answers

Answer 1
Score 35, accepted, 2012-04-15 00:15:28

Answer 2
Score 1, 2011-12-28 10:48:01

Answer 3
Score -1, 2011-10-20 14:37:00
