Java: convert a String from windows-1251 to UTF-8

Scanner sc = new Scanner(System.in);
System.out.println("Enter text: ");
String text = sc.nextLine();
try {
    String result = new String(text.getBytes("windows-1251"), Charset.forName("UTF-8"));
    System.out.println(result);
} catch (UnsupportedEncodingException e) {
    System.out.println(e);
}

I'm trying to change the keyboard layout: input from a Cyrillic keyboard, output in Latin. Example: qwerty => йцукен

It doesn't work. Can anyone tell me what I'm doing wrong?

First, text in Java (String, char, Reader, Writer) is internally Unicode, so it can combine all scripts. This is a major difference from, for instance, C/C++, where there is no such standard.

Now, System.in is an InputStream for historical reasons, so it delivers bytes and needs an indication of the encoding used.

Scanner sc = new Scanner(System.in, "Windows-1251");

The above explicitly sets the conversion for System.in to Cyrillic (Windows-1251). Without this optional parameter the default encoding is used. Unless the software changed it, that is the platform encoding, so the original call without the parameter might have been correct too.
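To check this without a real console, a minimal sketch can feed Windows-1251 bytes to a Scanner through a ByteArrayInputStream. The class name ScannerCharsetDemo and the helper readLine are made up for illustration, and the sketch assumes the JVM supports the windows-1251 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Scanner;

class ScannerCharsetDemo {
    // Hypothetical helper: reads the first line, decoding the bytes with the named charset.
    static String readLine(InputStream in, String charsetName) {
        Scanner sc = new Scanner(in, charsetName);
        return sc.nextLine();
    }

    public static void main(String[] args) {
        String expected = "\u0439\u0446\u0443\u043A\u0435\u043D"; // "йцукен" written as escapes
        // Simulate a console that delivers Windows-1251 bytes.
        byte[] raw = (expected + "\n").getBytes(Charset.forName("windows-1251"));
        String text = readLine(new ByteArrayInputStream(raw), "windows-1251");
        System.out.println(text.equals(expected)); // true
    }
}
```

The Unicode escapes keep the example independent of the encoding of the source file itself.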

Now text is correct, containing the Cyrillic from System.in as Unicode.

You would get the UTF-8 bytes as:

byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
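As a rough illustration of what getBytes does, the same String yields different byte sequences depending on the charset. The sketch below only compares lengths; EncodedLengthDemo and encodedLength are hypothetical names, and windows-1251 support in the JVM is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u0439\u0446\u0443\u043A\u0435\u043D"; // "йцукен", six letters
        // Each Cyrillic letter takes two bytes in UTF-8 ...
        System.out.println(encodedLength(text, StandardCharsets.UTF_8)); // 12
        // ... and one byte in Windows-1251.
        System.out.println(encodedLength(text, Charset.forName("windows-1251"))); // 6
    }
}
```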

The old "recoding" of text was wrong; drop that line. It encodes the text as Windows-1251 bytes and then decodes those bytes as if they were UTF-8, but not all Windows-1251 byte sequences are valid UTF-8 multi-byte sequences.

String result = text;

System.out.println(result);
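To see why the dropped line could not work, here is a small sketch that repeats the round trip from the question on a Latin and a Cyrillic input. WrongRecodingDemo and wrongRecode are illustrative names only, and windows-1251 support in the JVM is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongRecodingDemo {
    // Reproduces the question's approach: encode as Windows-1251, then decode as UTF-8.
    static String wrongRecode(String text) {
        byte[] cp1251 = text.getBytes(Charset.forName("windows-1251"));
        return new String(cp1251, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASCII letters have the same single byte in both charsets, so nothing changes.
        System.out.println(wrongRecode("qwerty").equals("qwerty")); // true

        // Cyrillic bytes in Windows-1251 are not valid UTF-8, so the decoder
        // substitutes the replacement character U+FFFD and the text is lost.
        String cyrillic = "\u0439\u0446\u0443\u043A\u0435\u043D"; // "йцукен"
        String broken = wrongRecode(cyrillic);
        System.out.println(broken.equals(cyrillic));   // false
        System.out.println(broken.contains("\uFFFD")); // true
    }
}
```

This also shows why Latin input came back unchanged: the round trip is a no-op for ASCII and destructive for everything else, so it never maps one alphabet onto another.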

System.out is a PrintStream, a rather rarely used historic class. It prints using the default platform encoding. One more or less has to rely on that default encoding being correct.

System.out.println(result);
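If the default cannot be relied on, one possible workaround is to wrap the stream in a PrintStream with an explicit encoding. This is only a sketch: Utf8PrintStreamDemo and utf8 are invented names, and whether the console actually displays UTF-8 correctly depends on the terminal, which this code cannot control.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8PrintStreamDemo {
    // Hypothetical helper: an auto-flushing PrintStream that encodes characters as UTF-8.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("\u0439\u0446\u0443\u043A\u0435\u043D"); // "йцукен", written as UTF-8 bytes
    }
}
```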

For printing to a UTF-8 encoded file:

// Needs java.nio.charset.StandardCharsets and java.nio.file.{Files, Path, Paths}
byte[] bytes = ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
Path path = Paths.get("C:/Temp/test.txt");
Files.write(path, bytes); // throws IOException

Here I have added a Unicode BOM character in front, so Windows Notepad may recognize the encoding as UTF-8. In general one should avoid using a BOM. It is a zero-width space (that is, invisible) and plays havoc with all kinds of formats: CSV, XML, file concatenation, cut-copy-paste.
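The kind of trouble a BOM causes can be sketched in a few lines: Java's UTF-8 decoder keeps the BOM as an ordinary character, so code that reads such a file back has to strip it itself. BomDemo, withBom and stripBom are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    static final String BOM = "\uFEFF";

    // Encodes the text as UTF-8 with a leading BOM (bytes EF BB BF).
    static byte[] withBom(String text) {
        return (BOM + text).getBytes(StandardCharsets.UTF_8);
    }

    // Removes a leading BOM character if one is present.
    static String stripBom(String s) {
        return s.startsWith(BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] bytes = withBom("abc");
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.length());           // 4: the invisible BOM is still there
        System.out.println(stripBom(decoded).length()); // 3
    }
}
```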

The reason you got an answer to a different question, and nobody answered yours, is that your title doesn't fit the question. You were not attempting to convert between charsets, but rather between keyboard layouts.

Here you shouldn't worry about character encoding at all. Simply read the line, convert it to an array of characters, go through them, and replace each one using a predefined map.

The code will be something like this:

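// Needs java.util.Map and java.util.TreeMap.
// Generic type arguments must be the wrapper type Character, not the primitive char.
Map<Character, Character> table = new TreeMap<>();
table.put('q', 'й');
table.put('Q', 'Й');
table.put('w', 'ц');
// .... etc

String text = sc.nextLine();
char[] cArr = text.toCharArray();
for (int i = 0; i < cArr.length; ++i)
{
  // Characters without an entry in the table are left unchanged
  if (table.containsKey(cArr[i]))
  {
    cArr[i] = table.get(cArr[i]);
  }
}
text = new String(cArr);
System.out.println(text);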

Now, I don't have time to test that code, but you should get the idea of how to do your task.
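For reference, a self-contained variant of the same idea could look like the sketch below. It builds the table from two parallel strings instead of individual put calls. The class name LayoutConverter, the method toCyrillic and the choice of a US QWERTY to standard Russian (ЙЦУКЕН) mapping are assumptions; shifted punctuation keys such as { or : are not covered, and the source file is assumed to be compiled as UTF-8.

```java
import java.util.HashMap;
import java.util.Map;

class LayoutConverter {
    // Keys of a US QWERTY layout and the letters on the same physical keys
    // of the standard Russian layout, in the same order (lower case only).
    private static final String LATIN    = "qwertyuiop[]asdfghjkl;'zxcvbnm,.";
    private static final String CYRILLIC = "йцукенгшщзхъфывапролджэячсмитьбю";

    private static final Map<Character, Character> TABLE = new HashMap<>();

    static {
        for (int i = 0; i < LATIN.length(); i++) {
            char from = LATIN.charAt(i);
            char to = CYRILLIC.charAt(i);
            TABLE.put(from, to);
            // Add the upper-case pair only for letter keys, so '[' keeps mapping to 'х'.
            if (Character.isLetter(from)) {
                TABLE.put(Character.toUpperCase(from), Character.toUpperCase(to));
            }
        }
    }

    // Replaces every character that has a table entry; everything else is kept as-is.
    static String toCyrillic(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCyrillic("qwerty")); // йцукен
    }
}
```

Going the other way (Cyrillic to Latin) would need a second table with the key and value of each pair swapped.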
