
How to save an HTML page with special chars (UTF-8) to a txt file

I need to write Java code that saves an HTML page to a txt file.

The problem is that the special characters in UTF-8 are broken.

Words like "Hamamélis" are saved as "Hamam�lis".

The code I wrote is listed below:

    URLConnection conn;
    conn = site.openConnection();
    conn.setReadTimeout(10000);
    Charset charset = Charset.forName("UTF8");
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    buff = in.readLine();

And after:

    out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(Nome), "UTF-8"));
    out.write(buff);
    out.close();

Can anyone suggest a solution?

One possible error is omitting the hyphen from "UTF-8" in the 4th line of your first piece of code. See the Charset documentation.

Otherwise, the code seems correct. But of course we cannot test it directly, as we do not have your data.

For comparison, here is a little class I wrote. In a manner similar to your code, this class correctly writes the accented 'é' in your "Hamamélis" example as the two octets UTF-8 uses for that single precomposed character (U+00E9): hex C3 and A9.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.BufferedWriter;
    import java.io.IOException;

    public class ReaderWriter {
        public static void main(String[] args) {
            try {
                String content = "Hamamélis. Written: " + new java.util.Date();

                File file = new File("some_text.txt");

                // Create file if not already existent.
                if (!file.exists()) {
                    file.createNewFile();
                }

                FileOutputStream fileOutputStream = new FileOutputStream( file );
                OutputStreamWriter outputStreamWriter = new OutputStreamWriter( fileOutputStream, "UTF-8" );
                BufferedWriter bufferedWriter = new BufferedWriter( outputStreamWriter );
                bufferedWriter.write( content );
                bufferedWriter.close();

                System.out.println("ReaderWriter 'main' method is done. " + new java.util.Date() );

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

As icktoofay commented, you should dig deeper to discover exactly which octets are involved. Use a hex editor, such as the "File Viewer" app I found today on the Mac App Store, to see the exact octets in your saved file.
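If you do not have a hex editor at hand, a few lines of Java can print the octets instead. This is only a minimal sketch: the class name `HexDump` and the helper `toHex` are made up for illustration, and the file path is whatever you pass on the command line. With no argument it prints what correct UTF-8 for the sample word should look like, so you have something to compare against.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexDump {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Dump the file named on the command line, e.g. the saved page.
            System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
        } else {
            // No file given: show the expected UTF-8 octets for "Hamamelis"
            // with an accented e (U+00E9) in the sixth position.
            byte[] sample = "Hamam\u00E9lis".getBytes(StandardCharsets.UTF_8);
            System.out.println(toHex(sample)); // 48 61 6D 61 6D C3 A9 6C 69 73
        }
    }
}
```

For a large file you would want to dump only the region around the suspect character rather than the whole thing.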

If the octets are C3 and A9, then the problem is simply that the text editor you used to view the file decoded it with the wrong character encoding. For example, you can open that text file in a web browser and use its menu commands to re-interpret the file as UTF-8.
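To see what "wrong character encoding in the viewer" means in practice, here is a rough sketch that decodes the same two octets twice. I am assuming the viewer falls back to ISO-8859-1, which is only one of several possibilities. The class and method names are invented for this example, and the code prints code points rather than the characters themselves so that the console's own encoding does not confuse matters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongViewerEncoding {
    // Decodes the bytes with the given charset, as a text viewer would.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Lists the code points of a string, e.g. "U+00C3 U+00A9".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The two octets UTF-8 uses for U+00E9 (e with acute accent).
        byte[] octets = { (byte) 0xC3, (byte) 0xA9 };

        // Correct interpretation: one character.
        System.out.println(codePoints(decode(octets, StandardCharsets.UTF_8)));      // U+00E9

        // Assumed wrong interpretation: each octet becomes its own character.
        System.out.println(codePoints(decode(octets, StandardCharsets.ISO_8859_1))); // U+00C3 U+00A9
    }
}
```

The file is identical in both cases; only the interpretation differs, which is why re-opening it as UTF-8 fixes the display.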

If the octets are not C3 and A9, I would go further back and examine the input's octets.
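One way to examine the input is to ask whether the raw bytes are well-formed UTF-8 at all. The sketch below is a guess at your situation, not a diagnosis: it assumes the server actually sends the page as ISO-8859-1, where 'é' is the single octet E9. The class name `InputOctets` and the method `isValidUtf8` are hypothetical. A strict decoder rejects such input, while lenient decoding (the default behavior of `new String` and `InputStreamReader`) silently substitutes U+FFFD, which is the "�" you are seeing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class InputOctets {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed input: the sample word as an ISO-8859-1 server would send it,
        // with the accented e as the single octet E9.
        byte[] latin1 = "Hamam\u00E9lis".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(latin1)); // false

        // Lenient UTF-8 decoding replaces the stray E9 with U+FFFD.
        String lenient = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(lenient.indexOf('\uFFFD')); // 5

        // Decoding with the charset the bytes were really written in works.
        String correct = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(correct.equals("Hamam\u00E9lis")); // true
    }
}
```

If the check fails on your real input, look at the charset declared in the response's Content-Type header or in the page's meta tag, and pass that charset to `InputStreamReader` instead of hard-coding "UTF-8".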

If you do not understand that text files in computers actually contain numbers (not text in the human sense), then take a break from coding to read this entertaining article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
