
Interpret a string from one encoding to another in Java

I've looked around for answers to this (I'm sure they're out there), and I'm not sure it's possible.

I have a huge file that contains the word "för". I'm using RandomAccessFile because I know roughly where the word is, so I can use seek() to get there.

To know that I've found it, I have a String "för" in my program that I check for equality. Here's the problem: I ran the debugger, and when I get to "för", what I actually get to compare is "fÃ¶r".

So my program terminates without finding any "för".

This is the code I use to get a word:

private static String getWord(RandomAccessFile file) throws IOException {
    StringBuilder stb = new StringBuilder();
    String word;
    char c;
    c = (char)file.read();
    int end;
    do {
        stb.append(c);
        end = file.read();
        if(end==-1)
            return "-1";
        c = (char)end;

    } while (c != ' ');
    word = stb.toString();
    word = word.trim();
    return word;
}

So basically I return all the characters from the current point in the file up to the first ' ' character. I get the word, but since (char)file.read() reads a byte (I think), the UTF-8 'ö' becomes the two characters 'Ã' and '¶'?

One reason for this guess is that if I open my file with encoding UTF-8 it's "för", but if I open the file with ISO-8859-15, in the same place we now have exactly what my getWord method returns: "fÃ¶r".
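The guess can be checked without the big file. Here is a minimal sketch that encodes "för" as UTF-8 and then casts each byte to a char, the way the getWord loop does. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class ByteCastDemo {
    // Mimics the question's loop: every byte becomes one char on its own.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // read() returns 0..255, so mask away the sign of the Java byte
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "f\u00f6r".getBytes(StandardCharsets.UTF_8); // "för"
        System.out.println(utf8.length);                 // 4: the ö takes two bytes
        System.out.println(castEachByte(utf8).length()); // 4 chars instead of 3
    }
}
```

The two bytes of 'ö' are 0xC3 and 0xB6, which as individual chars are 'Ã' and '¶'.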

So my question:

When I'm sitting with a "för" and a "fÃ¶r", is there any way to fix this? Like saying "read "fÃ¶r" as if it was a UTF-8 string" to get "för"?

If you have to use a RandomAccessFile, you should read the content into a byte[] first and then convert the complete array to a String, something along the lines of:

byte[] buffer = new byte[whatever];
file.read(buffer);
String result = new String(buffer,"UTF-8");

This is only meant to give you a general idea of what to do; you'll have to add some length handling etc.
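One possible way to do the length handling, as a sketch rather than the definitive version: collect the raw bytes up to the next space and decode them all at once. Splitting on the space byte is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or higher. The class name and the choice to return null at end of file are assumptions, not part of the question's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class Utf8WordReader {
    // Reads bytes from the current file position up to the next space (or end
    // of file) and decodes them together as UTF-8. Returns null when there is
    // nothing left to read.
    static String getWord(RandomAccessFile file) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int b;
        while ((b = file.read()) != -1 && b != ' ') {
            bytes.write(b);
        }
        if (b == -1 && bytes.size() == 0) {
            return null;
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The result can then be compared directly with `"för"` using `equals`.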

This will not work correctly if you start reading in the middle of a UTF-8 sequence, but neither will any other method.
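If seek() might land in the middle of a character, one common workaround is to skip forward past any continuation bytes before decoding. UTF-8 continuation bytes always have the bit pattern 10xxxxxx. A rough sketch, with hypothetical helper names, that assumes the file really is valid UTF-8:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class Utf8Align {
    // Continuation bytes in UTF-8 look like 10xxxxxx.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Moves the file pointer forward to the first byte at or after the current
    // position that is not a continuation byte (or to end of file).
    static void alignToCharStart(RandomAccessFile file) throws IOException {
        long pos = file.getFilePointer();
        int b;
        while ((b = file.read()) != -1 && isContinuation(b)) {
            pos++;
        }
        file.seek(pos);
    }
}
```

This drops the partial character you landed in, which is usually acceptable when you only know the position approximately.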

You are using RandomAccessFile.read(). This reads single bytes. UTF-8 sometimes uses several bytes for one character.

Different methods to read UTF-8 from a RandomAccessFile are discussed here: Java: reading strings from a random access file with buffered input

If you don't necessarily need a RandomAccessFile, you should definitely switch to reading characters instead of bytes.
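As a sketch of what reading characters could look like: the same loop as in the question, but over a java.io.Reader, so that read() returns a decoded character instead of a raw byte. The class name and the null return at end of input are assumptions made for this example.

```java
import java.io.IOException;
import java.io.Reader;

class CharWordReader {
    // Same loop shape as the question's getWord, but each read() here yields
    // one decoded char, so a multi-byte 'ö' arrives as a single character.
    static String getWord(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1 && c != ' ') {
            sb.append((char) c);
        }
        if (c == -1 && sb.length() == 0) {
            return null; // end of input
        }
        return sb.toString();
    }
}
```

A suitable reader for a UTF-8 file can be obtained with `Files.newBufferedReader(path, StandardCharsets.UTF_8)`. The price is that you lose seek().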

If possible, I would suggest Scanner.next(), which searches for the next word by default.
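A small sketch of the Scanner approach, assuming the input is UTF-8 and that words are separated by whitespace (Scanner's default delimiter). The class and method names are invented for the example.

```java
import java.io.InputStream;
import java.util.Scanner;

class ScannerSearch {
    // Returns true if any whitespace-separated token equals the target word.
    static boolean containsWord(InputStream in, String target) {
        // The charset is given explicitly so the platform default is not used.
        Scanner scanner = new Scanner(in, "UTF-8");
        while (scanner.hasNext()) {
            if (scanner.next().equals(target)) {
                return true;
            }
        }
        return false;
    }
}
```

For a file you would pass something like a `FileInputStream`. Note that this reads from the start of the stream, so it does not replace seek() on a huge file.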

import java.nio.charset.Charset;
String encodedString = new String(originalString.getBytes("ISO-8859-15"), Charset.forName("UTF-8"));
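To make the idea concrete, here is a self-contained sketch of repairing an already mangled string. It uses ISO-8859-1 rather than ISO-8859-15, on the assumption that the string was built by casting each byte to a char as in the question: that cast maps byte values 0 to 255 onto the same code points, which is exactly ISO-8859-1. For 'Ã' and '¶' both charsets give the same bytes. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Turns the chars back into the original bytes, then decodes those bytes
    // as UTF-8. Only valid if every char in the input is in the range 0..255.
    static String reinterpretAsUtf8(String mangled) {
        byte[] raw = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mangled = "f\u00c3\u00b6r"; // "fÃ¶r"
        System.out.println(reinterpretAsUtf8(mangled).equals("f\u00f6r")); // true
    }
}
```

Using the StandardCharsets constants also avoids the checked UnsupportedEncodingException that getBytes(String) declares.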


 