Charset of attribute values in Jsoup

Question

I am using Jsoup, and I need to read the attribute values of all tags in an ASCII-encoded HTML document exactly as they are written, without any conversion.

I have the following HTML document:

<!DOCTYPE html>
<head>
    <meta charset="ascii">        
</head>
<body>
    <div title="2 &gt; 1, 1 > 0, &agrave; vs &egrave;">
        3 &gt; 2,  1 > 0
    </div>
</body>

which I want to parse with Jsoup.

I need to extract the value of the title attribute exactly as it is written: 2 &gt; 1, 1 > 0, &agrave; vs &egrave;.

I've created a Document object doc as follows (the code is Kotlin, but I don't think that matters here):

val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.outputSettings().charset(charset)

When I print the doc with

println(doc.toString())

I get the following string:

<!doctype html>
<html>
 <head> 
  <meta charset="ascii"> 
 </head> 
 <body> 
  <div title="2 > 1, 1 > 0, &agrave; vs &egrave;">
    3 &gt; 2 
  </div> 
 </body>
</html>

which differs from the file content in the title attribute value (&gt; is converted to > in the string "2 &gt; 1"), while the rest of the document is fine.

Then, inspecting the attribute value with

 doc.body().select("div").forEach { div -> println("title = ${div.attr("title")}") }

produces the following string:

title = 2 > 1, 1 > 0, à vs è

Notice that &agrave; and &egrave; are converted to à and è.

My question is: in Jsoup, how can I get the attribute values of HTML tags exactly as they are written in the input file?

In the example above I need to get the string "2 &gt; 1, 1 > 0, &agrave; vs &egrave;" (as it is written in the input file), and neither "2 > 1, 1 > 0, &agrave; vs &egrave;" nor "2 > 1, 1 > 0, à vs è".

Answer 1

The attr() method returns a String with the HTML entities already decoded, and I could not find a way to keep the entities. However, you can use the Jsoup.clean() method to convert the characters in the string back to entities.

val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.body().select("div").forEach { div ->
    val title = Jsoup.clean("${div.attr("title")}", "", Whitelist.none(), Document.OutputSettings().charset(charset))
    println("title = $title")
}

The result is:

title = 2 &gt; 1, &agrave; vs &egrave;

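Of course, this might not be a good solution for your use case: it re-escapes the decoded string rather than recovering the original source text, so a character written literally in the file (such as the > in "1 > 0") may also come back as an entity.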
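If the exact source text is what matters, one option outside Jsoup is to read the attribute straight from the raw file contents. The following is only a minimal sketch: rawAttributeValues is a hypothetical helper (not part of Jsoup), and it assumes double-quoted attribute values with no literal quote inside them. A regular expression is not a real HTML parser, so comments, scripts, and unusual quoting are not handled.

```kotlin
// Hypothetical helper, not a Jsoup API: returns the raw, still-escaped text of
// every double-quoted occurrence of `attribute` found in the HTML source.
// Assumption: values are double-quoted and contain no literal '"' character.
fun rawAttributeValues(html: String, attribute: String): List<String> {
    // Whitespace, the attribute name, '=', then everything between the quotes.
    val pattern = Regex(
        "\\s" + Regex.escape(attribute) + "\\s*=\\s*\"([^\"]*)\"",
        RegexOption.IGNORE_CASE
    )
    return pattern.findAll(html).map { it.groupValues[1] }.toList()
}

fun main() {
    // Same div as in the question; entities stay untouched because nothing decodes them.
    val html = "<div title=\"2 &gt; 1, 1 > 0, &agrave; vs &egrave;\">3 &gt; 2</div>"
    rawAttributeValues(html, "title").forEach { println("title = $it") }
    // prints: title = 2 &gt; 1, 1 > 0, &agrave; vs &egrave;
}
```

In practice the html string would come from File("test.html").readText(charset), as in the question. Jsoup can still be used for everything else; this sketch only covers the one case where the undecoded source text is needed.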

Answer 1
0 2016-11-17 18:33:35
