HTML files with no http-equiv meta tag and a charset that may not be UTF-8

Question

We are using jsoup. It is excellent, thanks.

We may get HTML files that have no http-equiv meta tag and whose charset may be something other than UTF-8. What is the best way to handle this? We could keep a list of encodings and try each one, but I am not sure how to tell programmatically that a given choice is wrong. Would jsoup throw an IOException?

Answer 1

Jsoup tries to determine the encoding from the Content-Type header or the http-equiv meta tag. If neither is present, it falls back to UTF-8. I am not sure jsoup can do more for you here.

But you can try another approach:

Implement a class that reads the files for you and takes care of all encoding issues in one place. Such a class should hand back either a correctly decoded string or, at the very least, the encoding that was used for the input.

(html input) --> [encoding class] --normalized encoding--> [jsoup] --> (whatever)

Jsoup can now parse that input with a known encoding.
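A minimal sketch of such an encoding class, using only the JDK. The class name `EncodingGuesser`, its `Result` type and the candidate list are made up for illustration; adjust the list to the encodings you actually expect. The idea is to decode with a strict `CharsetDecoder` (`CodingErrorAction.REPORT`), so that a wrong guess surfaces as a `CharacterCodingException` instead of silently producing garbled text. As far as I know, jsoup itself does not throw an IOException for a wrong charset; you would just see wrong characters or U+FFFD replacement characters in the output.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: tries candidate charsets in order with a strict decoder. */
final class EncodingGuesser {

    /** The decoded text together with the charset that produced it. */
    static final class Result {
        final Charset charset;
        final String text;

        Result(Charset charset, String text) {
            this.charset = charset;
            this.text = text;
        }
    }

    // Assumed candidate list. Order matters: ISO-8859-1 accepts every byte
    // sequence, so it only makes sense as the last resort.
    static final List<Charset> CANDIDATES = Arrays.asList(
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    /** Returns the first candidate that decodes the bytes without any error. */
    static Result decode(byte[] bytes) {
        for (Charset cs : CANDIDATES) {
            // REPORT makes the decoder throw instead of inserting U+FFFD.
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                String text = decoder.decode(ByteBuffer.wrap(bytes)).toString();
                return new Result(cs, text);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next one.
            }
        }
        throw new IllegalArgumentException("No candidate charset could decode the input");
    }
}
```

You could read the file with `Files.readAllBytes(path)`, call `EncodingGuesser.decode(bytes)`, and pass the resulting `text` to `Jsoup.parse(String html)`. Note that this only proves the bytes are valid in the chosen charset, not that the charset is the right one. Many single-byte encodings accept almost any input, so for those a statistical detector (see the links below) is a better fit.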

I assume changing how the HTML is generated is not an option, is it?

Some further reading:

http://illegalargumentexception.blogspot.co.uk/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_autodetect
Character Encoding Detection Algorithm
What is the most accurate encoding detector? (includes a list of implementations)
Java Text File Encoding
Detect (or best guess of) incoming string encoding in Java

Answer 1
0 2014-03-10 23:12:14
