Creating a UTF-8 File in Java

I'm currently making a program that saves Chinese words to a text file. I create the text file in Java and then try to write words to it. However, the text file I create is never encoded in UTF-8. This is the code I'm using; why doesn't it work? I was told that there is a bug inherent in Java, but I have no idea how to get around it.

public void createFile(String name) {
    try {
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(name +".txt"), "UTF-8"));
        out.write("");
    }
    catch(java.io.IOException e) {
        System.err.println("Something went wrong.");
    }
}

Also, do I have another option aside from text files with which I could still use UTF encoding?

Also, I'm testing its encoding by opening the TextEdit application and trying to type Chinese characters. Could this also be a problem?

First, files themselves don't have encodings. They're a bunch of 0s and 1s. If you write "asdf" in UTF-8, the bytes are completely indistinguishable from plain old 7-bit ASCII.
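To make that concrete, here is a minimal sketch that encodes the same string under both charsets and compares the raw bytes. The class and method names (AsciiVersusUtf8, sameBytes) are made up for illustration, and the Chinese sample is written with \u escapes so the source compiles regardless of the compiler's source encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiVersusUtf8 {
    // Encode the same text as US-ASCII and as UTF-8 and compare the byte arrays.
    static boolean sameBytes(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // ASCII-only text: both encodings produce identical bytes.
        System.out.println(sameBytes("asdf"));          // prints "true"
        // Two Chinese characters (U+4E2D U+6587): ASCII cannot represent them,
        // so getBytes substitutes '?' while UTF-8 uses three bytes per character.
        System.out.println(sameBytes("\u4E2D\u6587"));  // prints "false"
    }
}
```

So an editor looking at a file that contains only ASCII characters has nothing in the bytes to tell the two apart.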

If you were writing in, say, UTF-16, then the byte-order mark (BOM) would be a pretty clear indication that it's written in UTF-16, even with an empty string, but UTF-8 does not require such a marker to be present.

Therefore, your editor has no way of knowing that this file is supposed to be UTF-8. You could write UTF-8's BOM to your file with:

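// write(int) emits only the low 8 bits, so pass the three BOM bytes as an array
out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});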

However, in this case, out would have to be an OutputStream, such as the FileOutputStream. (BufferedWriter and OutputStreamWriter do not accept byte arrays as input.)
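Putting the pieces together, a rough sketch of writing the BOM on the raw stream and then the text through a writer might look like this. BomWriter, UTF8_BOM and writeWithBom are hypothetical names, and the demo in main writes to a temporary file rather than to any path from the question.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomWriter {
    // The UTF-8 encoding of U+FEFF, which many editors treat as a UTF-8 signature.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    static void writeWithBom(String path, String text) throws IOException {
        try (OutputStream raw = new FileOutputStream(path);
             Writer out = new OutputStreamWriter(raw, StandardCharsets.UTF_8)) {
            raw.write(UTF8_BOM);  // raw bytes go straight to the file, before any text
            out.write(text);      // text is encoded as UTF-8 and flushed when out closes
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        writeWithBom(tmp.toString(), "\u4E2D\u6587\n");  // U+4E2D U+6587 plus newline
        System.out.println("wrote " + Files.size(tmp) + " bytes to " + tmp);
    }
}
```

Whether you want a BOM at all depends on what reads the file afterwards: some tools use it as a hint, while others treat those three bytes as unexpected content at the start of the text.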

Try the following code. It worked for me. The file was written out as UTF-8. I was able to open it with Notepad++, which verified that the encoding was UTF-8. The characters were encoded correctly. I got the characters from http://www.khngai.com/chinese/charmap/tbluni.php .

package testutf8;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

public class TestUTF8 {
  public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException, IOException {
    String str = "Unicode Character Map, 0x4E00 - 0x4FFF\n" +
                 "4E00   一   丁   丂   七   丄   丅   丆   万   丈   三   上   下   丌   不   与   丏\n" +
                 "4E10   丐   丑   丒   专   且   丕   世   丗   丘   丙   业   丛   东   丝   丞   丟\n" +
                 "4E20   丠   両   丢   丣   两   严   並   丧   丨   丩   个   丫   丬   中   丮   丯\n" +
                 "4E30   丰   丱   串   丳   临   丵   丶   丷   丸   丹   为   主   丼   丽   举   丿\n" +
                 "4E40   乀   乁   乂   乃   乄   久   乆   乇   么   义   乊   之   乌   乍   乎   乏\n" +
                 "4E50   乐   乑   乒   乓   乔   乕   乖   乗   乘   乙   乚   乛   乜   九   乞   也\n" +
                 "4E60   习   乡   乢   乣   乤   乥   书   乧   乨   乩   乪   乫   乬   乭   乮   乯\n" +
                 "4E70   买   乱   乲   乳   乴   乵   乶   乷   乸   乹   乺   乻   乼   乽   乾   乿\n" +
                 "4E80   亀   亁   亂   亃   亄   亅   了   亇   予   争   亊   事   二   亍   于   亏\n" +
                 "4E90   亐   云   互   亓   五   井   亖   亗   亘   亙   亚   些   亜   亝   亞   亟\n" +
                 "4EA0   亠   亡   亢   亣   交   亥   亦   产   亨   亩   亪   享   京   亭   亮   亯\n" +
                 "4EB0   亰   亱   亲   亳   亴   亵   亶   亷   亸   亹   人   亻   亼   亽   亾   亿\n" +
                 "4EC0   什   仁   仂   仃   仄   仅   仆   仇   仈   仉   今   介   仌   仍   从   仏\n" +
                 "4ED0   仐   仑   仒   仓   仔   仕   他   仗   付   仙   仚   仛   仜   仝   仞   仟\n" +
                 "4EE0   仠   仡   仢   代   令   以   仦   仧   仨   仩   仪   仫   们   仭   仮   仯\n" +
                 "4EF0   仰   仱   仲   仳   仴   仵   件   价   仸   仹   仺   任   仼   份   仾   仿\n" +
                 "4F00   伀   企   伂   伃   伄   伅   伆   伇   伈   伉   伊   伋   伌   伍   伎   伏\n" +
                 "4F10   伐   休   伒   伓   伔   伕   伖   众   优   伙   会   伛   伜   伝   伞   伟\n" +
                 "4F20   传   伡   伢   伣   伤   伥   伦   伧   伨   伩   伪   伫   伬   伭   伮   伯\n" +
                 "4F30   估   伱   伲   伳   伴   伵   伶   伷   伸   伹   伺   伻   似   伽   伾   伿\n" +
                 "4F40   佀   佁   佂   佃   佄   佅   但   佇   佈   佉   佊   佋   佌   位   低   住\n" +
                 "4F50   佐   佑   佒   体   佔   何   佖   佗   佘   余   佚   佛   作   佝   佞   佟\n" +
                 "4F60   你   佡   佢   佣   佤   佥   佦   佧   佨   佩   佪   佫   佬   佭   佮   佯\n" +
                 "4F70   佰   佱   佲   佳   佴   併   佶   佷   佸   佹   佺   佻   佼   佽   佾   使\n" +
                 "4F80   侀   侁   侂   侃   侄   侅   來   侇   侈   侉   侊   例   侌   侍   侎   侏\n" +
                 "4F90   侐   侑   侒   侓   侔   侕   侖   侗   侘   侙   侚   供   侜   依   侞   侟\n" +
                 "4FA0   侠   価   侢   侣   侤   侥   侦   侧   侨   侩   侪   侫   侬   侭   侮   侯\n" +
                 "4FB0   侰   侱   侲   侳   侴   侵   侶   侷   侸   侹   侺   侻   侼   侽   侾   便\n" +
                 "4FC0   俀   俁   係   促   俄   俅   俆   俇   俈   俉   俊   俋   俌   俍   俎   俏\n" +
                 "4FD0   俐   俑   俒   俓   俔   俕   俖   俗   俘   俙   俚   俛   俜   保   俞   俟\n" +
                 "4FE0   俠   信   俢   俣   俤   俥   俦   俧   俨   俩   俪   俫   俬   俭   修   俯\n" +
                 "4FF0   俰   俱   俲   俳   俴   俵   俶   俷   俸   俹   俺   俻   俼   俽   俾   俿\n";

    // try-with-resources flushes and closes the writer even if write() throws
    try (Writer out = new OutputStreamWriter(new FileOutputStream("tmp.txt"), "UTF-8")) {
      out.write(str);
    }
  }
}

This may be a TextEdit usage issue.

If there are no non-ASCII characters in the file you're writing, TextEdit's algorithm for determining the encoding will likely land on ASCII or a Latin-1 variant.

You can specify a text file's encoding in the File->Open dialog. I'm not sure whether TextEdit remembers this decision on future double-clicks of this file.
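If you'd rather not depend on an editor's guess, one option is to check the bytes yourself. The sketch below uses a strict decoder to test whether a file's contents are well-formed UTF-8; Utf8Check and isValidUtf8 are names I made up, and note that a pure-ASCII file will also pass, for the reason given above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns true when the bytes decode as UTF-8 without any malformed sequences.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "well-formed UTF-8" : "not UTF-8");
    }
}
```

Running it against the file your program produced tells you what is actually on disk, independent of how TextEdit chooses to display it.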

Try UTF-8 instead of UTF8. This might solve your problem.
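If you want to see how a given spelling is resolved on your JVM, a small sketch like the following can help. CharsetNames and canonicalName are hypothetical names; on the OpenJDK builds I'm aware of, "UTF8" is registered as an alias of UTF-8, but treat that as an assumption and check it on your own runtime. Using the StandardCharsets.UTF_8 constant (Java 7+) sidesteps the spelling question entirely.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetNames {
    // Resolve a charset name or alias to the canonical name the JVM uses for it.
    // Throws an unchecked exception if the JVM does not recognise the name.
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("UTF-8"));
        System.out.println(canonicalName("utf-8"));  // charset names are case-insensitive
        System.out.println(canonicalName("UTF8"));   // assumed alias, see the note above
        System.out.println(StandardCharsets.UTF_8.name());
    }
}
```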

I noticed that you didn't close your stream:

out.close();

Of course, you didn't include the code that writes the actual characters either...
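For what it's worth, here is one way the question's createFile could be sketched so that the writer is always closed and some text actually gets written. The contents parameter and the class name CreateUtf8File are my additions, not part of the original code, and Files.newBufferedWriter is swapped in for the nested stream constructors.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CreateUtf8File {
    // Writes contents to <name>.txt as UTF-8. try-with-resources flushes and
    // closes the writer even if write() throws.
    static void createFile(String name, String contents) throws IOException {
        Path path = Paths.get(name + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            out.write(contents);
        }
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary directory so the demo does not clutter the working directory.
        Path dir = Files.createTempDirectory("utf8-demo");
        String base = dir.resolve("words").toString();
        createFile(base, "\u4E2D\u6587\n");  // U+4E2D U+6587 plus newline
        System.out.println(Files.size(Paths.get(base + ".txt")) + " bytes written");
    }
}
```

A BufferedWriter holds text in memory until it is flushed or closed, so without the close the characters may never reach the file at all.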
