Write to a file with a specific encoding in Java
This might be related to my previous question (on how to convert "fÃ¶r" to "för").
So I have a file that I create in my code. Right now I create it with the following code:
FileWriter fwOne = new FileWriter(wordIndexPath);
BufferedWriter wordIndex = new BufferedWriter(fwOne);
followed by a few
wordIndex.write(wordBuilder.toString()); //that's a StringBuilder
ending (after a while-loop) with a
wordIndex.close();
Now the problem is that later on this file is huge, and I want (need) to jump around in it without going through the entire file. The seek(long pos) method of RandomAccessFile lets me do this.
Here's my problem: the characters in the file I've created seem to be encoded in UTF-8, and the only information I have when I seek is the character position I want to jump to. seek(long pos), on the other hand, jumps in bytes, so I don't end up in the right place, since a UTF-8 character can be more than one byte.
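To make the mismatch concrete, here is a minimal sketch that counts how many UTF-8 bytes the first n chars of a string occupy. The names CharVsByteOffset and utf8ByteOffset are made up for illustration, and the sketch assumes the text contains no characters outside the Basic Multilingual Plane (so a substring never splits a surrogate pair).

```java
import java.nio.charset.StandardCharsets;

class CharVsByteOffset {
    // Number of bytes that the first charCount chars of text occupy in UTF-8.
    // This is the byte position seek() would need for that char position.
    static int utf8ByteOffset(String text, int charCount) {
        return text.substring(0, charCount).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "f\u00F6r"; // "för", written with an escape to avoid source-encoding issues
        for (int i = 0; i <= text.length(); i++) {
            // 'ö' (U+00F6) takes two bytes in UTF-8, so the offsets diverge after it.
            System.out.println("char index " + i + " -> byte offset " + utf8ByteOffset(text, i));
        }
    }
}
```

Encoding the whole prefix like this costs time proportional to the position, so it only illustrates the problem; it is not a practical way to seek in a huge file.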
Here's my question: when I write the file, can I write it in ISO-8859-15 instead (where a character is a byte)? That way seek(long pos) will get me to the right position. Or should I instead try to use an alternative to RandomAccessFile (is there an alternative where you can jump to a character position)?
First, the worrying part: FileWriter and FileReader are old utility classes that use the default platform encoding of the computer they run on. Run elsewhere, the same code may produce a different file, and may not be able to read a file written on another machine.
ISO-8859-15 is a single-byte encoding. But Java holds text in Unicode, so it can combine all scripts, and a char is a UTF-16 code unit. In general a char index will not be a byte index, but in your case it probably works. Note, though, that a line break might be one (\n) or two (\r\n) chars/bytes, depending on the platform.
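Whether "it probably works" holds for a given text can be checked up front. The sketch below (the class name SingleByteCheck and its method are hypothetical) asks a CharsetEncoder whether every char is representable in ISO-8859-15; if so, each char becomes exactly one byte and char index equals byte index. It assumes the JVM ships the ISO-8859-15 charset, which is common but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;

class SingleByteCheck {
    // Assumption: this charset is available on the running JVM.
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // True if every char of text can be encoded in ISO-8859-15,
    // in which case a char index in text equals the byte index in the file.
    static boolean charIndexEqualsByteIndex(String text) {
        return LATIN9.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(charIndexEqualsByteIndex("f\u00F6r \u20AC")); // "för €"
        System.out.println(charIndexEqualsByteIndex("\u4E2D"));          // a CJK character
        // A Windows-style line break is two chars and therefore two bytes.
        System.out.println("a\r\nb".getBytes(LATIN9).length);
    }
}
```

Writing an explicit "\n" instead of calling BufferedWriter.newLine() keeps line breaks at one byte on every platform, which keeps positions predictable.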
Personally I think UTF-8 is well established, and it is easier to use:
byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
string = new String(bytes, StandardCharsets.UTF_8);
That way all special quotes, the euro sign, and so on will always be available.
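If you stay with UTF-8, one possible approach (not something the question's code does) is to remember the byte offset of each entry while writing, and seek to those offsets later. The sketch below is only an illustration under that assumption: the class Utf8OffsetIndex and its methods are hypothetical, entries are assumed to contain no line breaks, and each entry is terminated by a single "\n".

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class Utf8OffsetIndex {
    // Writes each entry as UTF-8 followed by '\n' and returns the byte offset of every entry.
    static long[] writeEntries(Path file, List<String> entries) {
        long[] offsets = new long[entries.size()];
        long pos = 0;
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
            for (int i = 0; i < entries.size(); i++) {
                byte[] bytes = (entries.get(i) + "\n").getBytes(StandardCharsets.UTF_8);
                offsets[i] = pos;    // byte position where this entry starts
                out.write(bytes);
                pos += bytes.length; // advance by bytes written, not by chars
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    // Seeks to a recorded byte offset and reads one entry up to the next '\n'.
    static String readEntry(Path file, long offset) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int b;
            // The byte 0x0A never occurs inside a multi-byte UTF-8 sequence.
            while ((b = raf.read()) != -1 && b != '\n') {
                buf.write(b);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The offsets array would have to be stored somewhere (for example in a second file) for later runs; reading byte by byte from a RandomAccessFile is slow and is kept here only for brevity.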
At least specify the encoding:
Files.newBufferedWriter(file.toPath(), Charset.forName("ISO-8859-15")); // takes a java.nio.charset.Charset, not a String
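Putting the pieces together, a rough sketch of the ISO-8859-15 route could look like the following. The class Latin9Seek and its methods are hypothetical names, and it assumes all text is representable in ISO-8859-15 and that line breaks are written as a plain "\n". As far as I know, a writer obtained from Files.newBufferedWriter throws an IOException when a character cannot be encoded, rather than silently writing a replacement.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class Latin9Seek {
    // Assumption: this charset is available on the running JVM.
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // Writes text with an explicit single-byte encoding instead of the platform default.
    static void write(Path file, String text) {
        try (BufferedWriter writer = Files.newBufferedWriter(file, LATIN9)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads length chars starting at charPos; with one byte per char,
    // the char position can be passed to seek() directly.
    static String readAt(Path file, long charPos, int length) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(charPos);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return new String(buf, LATIN9);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("wordIndex", ".txt");
        write(file, "f\u00F6r \u20AC100\nnext");
        // Reads the four chars starting at char index 4 (the euro sign and "100").
        System.out.println(readAt(file, 4, 4));
        Files.delete(file);
    }
}
```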