Charset of the attribute values in Jsoup

I use Jsoup and need to read the attribute values of all tags in an ASCII-encoded HTML document exactly as they are written, without any conversion.

So, I have the following HTML document:

<!DOCTYPE html>
<head>
    <meta charset="ascii">        
</head>
<body>
    <div title="2 &gt; 1, 1 > 0, &agrave; vs &egrave;">
        3 &gt; 2,  1 > 0
    </div>
</body>

which I want to parse with Jsoup.

I need to extract the value of the title attribute exactly as it is: 2 &gt; 1, 1 > 0, &agrave; vs &egrave;

I've created a Document object doc as below (it is in Kotlin, but I don't think that matters here):

val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.outputSettings().charset(charset)

When I print the doc with

println(doc.toString())

I get the following string:

<!doctype html>
<html>
 <head> 
  <meta charset="ascii"> 
 </head> 
 <body> 
  <div title="2 > 1, 1 > 0, &agrave; vs &egrave;">
    3 &gt; 2 
  </div> 
 </body>
</html>

which differs from the file content in the title attribute value (&gt; is transformed into > in the string "2 > 1"), while the rest of the document is OK.

Then, inspecting the attribute value with

 doc.body().select("div").forEach { div -> println("title = ${div.attr("title")}") }

produces the following string:

title = 2 > 1, 1 > 0, à vs è

Notice that &agrave; and &egrave; are transformed into à and è.

My question is: in Jsoup, how can I get the attribute values of HTML tags exactly as they are written in the input file?

In the example above I need to get the string "2 &gt; 1, 1 > 0, &agrave; vs &egrave;" (as it is written in the input file), and neither "2 > 1, 1 > 0, &agrave; vs &egrave;" nor "2 &gt; 1, 1 &gt; 0, à vs è".

The attr() method returns a String with the HTML entities already decoded, and I could not find a way to keep them. However, you can use the Jsoup.clean() method to convert the characters in the string back to entities.

val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.body().select("div").forEach { div ->
    // attr() returns the decoded value; clean() re-escapes it for the ASCII output charset
    val title = Jsoup.clean(div.attr("title"), "", Whitelist.none(), Document.OutputSettings().charset(charset))
    println("title = $title")
}

The result is:

title = 2 &gt; 1, &agrave; vs &egrave;

Of course, this might not be a good solution for your use case: it re-encodes the decoded value, so it cannot tell whether a character was originally written literally or as an entity.
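
If the exact source spelling matters, one possible workaround is to read the attribute from the file text itself rather than from the parsed tree, since the parsed value no longer carries the original entities. The following is only a minimal sketch of that idea using the Kotlin standard library, not a Jsoup feature: rawAttributeValues is a hypothetical helper, it assumes attribute values are double-quoted, and a regular expression is not a real HTML parser (it will also match text that merely looks like an attribute inside comments or scripts).

```kotlin
// Hypothetical helper (not part of Jsoup): collects the raw, undecoded values
// of one attribute straight from the HTML source text.
// Assumption: values are double-quoted and contain no '"' character.
fun rawAttributeValues(html: String, name: String): List<String> {
    // The lookbehind rejects longer names such as "subtitle" or "data-title";
    // group 1 captures everything between the quotes, entities included.
    val pattern = Regex(
        "(?<![\\w-])" + Regex.escape(name) + "\\s*=\\s*\"([^\"]*)\"",
        RegexOption.IGNORE_CASE
    )
    return pattern.findAll(html).map { it.groupValues[1] }.toList()
}

fun main() {
    // Inline copy of the div from the question; in practice this would be
    // File("test.html").readText(charset).
    val html = """<div title="2 &gt; 1, 1 > 0, &agrave; vs &egrave;">"""
    rawAttributeValues(html, "title").forEach { println("title = $it") }
    // prints: title = 2 &gt; 1, 1 > 0, &agrave; vs &egrave;
}
```

This keeps both the literal > and the &gt; entity as written, at the cost of losing the tree structure that Jsoup provides, so it fits best when only a few known attributes have to be recovered verbatim.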
