
Polish characters in a file in Java

I am using Jsoup in my project. I read a docx file and convert it to HTML. I want to write the result to a file, but I have a problem: FileOutputStream does not write Polish characters. For example, instead of

Wiersz nad którym znajduje się aktualnie kursor myszy

I have

Wiersz nad kt?rym znajduje si� aktualnie kursor myszy

This is my method where I parse the HTML:

public String parseHTML(String html) {
    int i = 0;
    Document doc = Jsoup.parse(html);
    doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml).charset("ISO-8859-2");
    for (Element element : doc.select("img[src]")) {
        element.attr("src", "resources/images/img" + i + ".png");
        i++;
    }
    return doc.toString();
}

And here I write to the file:

public void saveHelpFile(byte[] document) throws IOException {
    File file = new File(
            "path/to/file");
    String s = new String(document, "ISO-8859-2");
    PrintWriter writer = new PrintWriter(file, "ISO-8859-2");
    try {
        writer.write(s);
    } finally {
        writer.close();
    }
}

Here is my method where I read the file:

public void uploadFile() throws XWPFConverterException, IOException {
        InputStream in = new FileInputStream(new File("path/to/file"));
        XWPFDocument document = new XWPFDocument(in);

        XHTMLOptions options = XHTMLOptions.create();
        XHTMLConverter.getInstance().convert(document, out, options);

        String html = out.toString();
        html = html.replaceAll("<html>",
                "<html xmlns='http://www.w3.org/1999/xhtml' " + "\n" + " xmlns:h='http://java.sun.com/jsf/html' " + "\n"
                        + " xmlns:f='http://java.sun.com/jsf/core' " + "\n" + " xmlns:p='http://primefaces.org/ui ' "
                        + "\n" + " xmlns:ui='http://java.sun.com/jsf/facelets' " + "\n"
                        + " xmlns:pe='http://primefaces.org/ui/extensions' " + "\n"
                        + " xmlns:components='http://java.sun.com/jsf/composite/components' >");

        html = parseHTML(html, extractPhoto(document));
        html = html.replaceAll("<body>", "<h:body>").replaceAll("</body>", "</h:body>");
        saveHelpFile(html.getBytes("ISO-8859-2"));
    }

Your String is fine: it contains the correct text. The problem appears when you write it to the file with the charset "ISO-8859-2". A file does not store the charset it was written with, so whatever application reads the file has to know or guess that charset. That is why it is generally recommended to write files in UTF-8 or UTF-16.

So nothing needs to change in how you obtain your String. Only when you write to the file, change the charset to UTF-8. The reason this works is that you "told" the String constructor that your bytes represent text in the charset "ISO-8859-2" and should be interpreted as such, so the String is built correctly. Internally, Java keeps all Strings in Unicode (UTF-16), so you can now write the String to any other destination (a file in your case) in any valid charset and Java will know how to encode it.

In your case you can write it in "ISO-8859-2", in "UTF-8", or in any other charset that supports Polish (for instance "UTF-16"). Since UTF-8 is the generally accepted de facto standard, it is the recommended choice.
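To illustrate the point about the reader having to know the charset, here is a minimal, self-contained sketch that uses only the JDK. The class name `Utf8FileDemo` and the helpers `saveUtf8` and `read` are hypothetical names made up for this example; they are not part of the code in the question. It writes a short Polish string as UTF-8 and reads it back, then writes the same string as ISO-8859-2 and reads it as UTF-8 to reproduce the kind of damage shown in the question.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the names here are illustrative only.
public class Utf8FileDemo {

    // Writes the text to the file, encoding it as UTF-8.
    static void saveUtf8(Path file, String text) throws IOException {
        try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads the whole file and decodes its bytes with the given charset.
    static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Polish sample text, written with Unicode escapes so that the
        // encoding of this source file itself does not matter.
        String text = "kt\u00f3ry si\u0119";
        Path file = Files.createTempFile("help", ".xhtml");
        try {
            // Written as UTF-8 and read as UTF-8: the text survives.
            saveUtf8(file, text);
            System.out.println(read(file, StandardCharsets.UTF_8).equals(text));

            // Written as ISO-8859-2 but read as UTF-8: the Polish letters
            // are invalid UTF-8 and decode to replacement characters.
            Files.write(file, text.getBytes(Charset.forName("ISO-8859-2")));
            System.out.println(read(file, StandardCharsets.UTF_8).contains("\uFFFD"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Both checks should report `true`. In the second case the file itself is not wrong; it is only being decoded with a different charset than the one used to write it, which is the situation the answer describes.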


 