Charset of the attribute values in Jsoup
I use Jsoup and I need to pick up the attribute values of all tags inside an ASCII-encoded HTML document, keeping them exactly as they are written, without converting them.
So, I have the following HTML document:
<!DOCTYPE html>
<head>
    <meta charset="ascii">
</head>
<body>
    <div title="2 > 1, 1 > 0, à vs è">
        3 > 2, 1 > 0
    </div>
</body>
which I want to parse by means of Jsoup.
I need to extract the value of the title attribute exactly as it is: 2 > 1, 1 > 0, à vs è.
I've created a Document object doc as below (it is in Kotlin, but I don't think that is important here):
import org.jsoup.Jsoup
import java.io.File
import java.nio.charset.Charset
val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.outputSettings().charset(charset)
When I print out the doc by means of
println(doc.toString())
I get the following string:
<!doctype html>
<html>
 <head>
  <meta charset="ascii">
 </head>
 <body>
  <div title="2 > 1, 1 > 0, à vs è">
   3 > 2
  </div>
 </body>
</html>
which differs from the file content in the title attribute value (the > in the string "2 > 1" is escaped differently from the input), while the rest of the document is OK.
Then, inspecting the attribute value
doc.body().select("div").forEach { div -> println("title = ${div.attr("title")}") }
produces the following string:
title = 2 > 1, 1 > 0, à vs è
Notice that the entities for à and è get transformed into the literal characters à and è.
My question is: in Jsoup, how can I get the attribute values of the HTML tags while preserving the way they are written in the input file?
In the example above I need to get the string "2 > 1, 1 > 0, à vs è" with its entities exactly as they are written in the input file, and neither the re-escaped form that doc.toString() prints nor the decoded form that attr() returns.
The attr() method returns a String with the HTML entities already decoded, and I could not find a way to keep them. However, you can use the Jsoup.clean() method to convert the characters in the string back to entities.
val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.body().select("div").forEach { div ->
    // Re-escape the decoded attribute value for an ASCII output charset.
    // Newer jsoup releases name this class Safelist instead of Whitelist.
    val title = Jsoup.clean(div.attr("title"), "", Whitelist.none(), Document.OutputSettings().charset(charset))
    println("title = $title")
}
The result is:
title = 2 &gt; 1, &agrave; vs &egrave;
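Of course, this might not be a good solution for your use case: the entities are regenerated from the decoded characters, so they are not guaranteed to match what the input file contained.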
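If the goal is the literal source text of the attribute, one possible workaround is to skip the parser for that step and scan the raw file content yourself. The following is only a minimal sketch in plain Kotlin, not something Jsoup provides: rawAttributeValues is a hypothetical helper name, and it assumes the attribute values are double-quoted and contain no double quotes. A regular expression is not a real HTML parser, so it can also match text that merely looks like an attribute.

```kotlin
// Hypothetical helper (not part of Jsoup): collects the raw, still-escaped
// values of one attribute from HTML source text.
// Assumption: values are double-quoted and contain no double quotes.
fun rawAttributeValues(html: String, attribute: String): List<String> {
    // The lookbehind requires whitespace right before the attribute name,
    // so looking for "title" does not match inside "subtitle".
    val pattern = Regex(
        "(?<=\\s)" + Regex.escape(attribute) + "\\s*=\\s*\"([^\"]*)\"",
        RegexOption.IGNORE_CASE
    )
    // Group 1 is the text between the quotes, returned without any decoding.
    return pattern.findAll(html).map { it.groupValues[1] }.toList()
}

fun main() {
    // Sample input with entities written out explicitly.
    val html = """<div title="2 &gt; 1, &agrave; vs &egrave;">text</div>"""
    rawAttributeValues(html, "title").forEach { println("title = $it") }
}
```

For the real file you would pass File("test.html").readText(charset) instead of the sample string. Jsoup can still be used for everything else; this step only recovers the text as it was written.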