Reading Unicode characters in Java

I'm a bit new to Java. When I assign a Unicode string to a variable and print it:

  String str = "\u0142o\u017Cy\u0142";
  System.out.println(str);

  final StringBuilder stringBuilder = new StringBuilder();
  InputStream inStream = new FileInputStream("C:/a.txt");
  final InputStreamReader streamReader = new InputStreamReader(inStream, "UTF-8");
  final BufferedReader bufferedReader = new BufferedReader(streamReader);
  String line = "";
  while ((line = bufferedReader.readLine()) != null) {
      System.out.println(line);
      stringBuilder.append(line);
  }

Why are the results different in the two cases? The file a.txt contains the same string, but when I print the contents of the file it prints z\u0142o\u017Cy\u0142 instead of the actual Unicode characters. Any idea how I can get the file content printed the same way the string literal is printed?

Your code should be correct, but I guess that the file "a.txt" does not contain the Unicode characters encoded with UTF-8, but the escaped string "\u0142o\u017Cy\u0142".

Please check whether the text file is correct, using a UTF-8 aware editor such as recent versions of Notepad or Notepad++ on Windows. Or inspect it with your favorite hex editor; it should not contain backslashes.
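
One way to run that check without a hex editor is to dump the file's raw bytes from Java. The sketch below is only an illustration: the class name `HexDump` and the helper `toHex` are made up for this example, and the default path is the one from the question. A file holding real UTF-8 text shows `c5 82` for ł, whereas a file holding the escaped text shows `5c 75 30 31 34 32` (the ASCII characters `\u0142`).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    // Render each byte as two lowercase hex digits, separated by spaces.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Path from the question; pass a different one as the first argument if needed.
        String path = args.length > 0 ? args[0] : "C:/a.txt";
        byte[] raw = Files.readAllBytes(Paths.get(path));
        System.out.println(toHex(raw));

        // A 0x5c byte (backslash) suggests the file holds escape text, not the characters.
        boolean hasBackslash =
                new String(raw, StandardCharsets.ISO_8859_1).indexOf('\\') >= 0;
        System.out.println("contains backslash: " + hasBackslash);
    }
}
```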

I tried it with "€" as the UTF-8-encoded content of the file and it gets printed correctly. Note that not all Unicode characters can be printed, depending on your terminal encoding (really a hassle on Windows) and font.
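
If the console is the weak link, one thing to try is wrapping `System.out` in a `PrintStream` with an explicit encoding. This is a sketch only, assuming the terminal itself is set to UTF-8 (for example via `chcp 65001` on a Windows console) and uses a font that has the glyphs. The class name `Utf8Console` and the helper `encode` are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Return the bytes a UTF-8 PrintStream emits for the given text.
    public static byte[] encode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Print through a stream that encodes as UTF-8 regardless of the default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("\u20ac \u0142o\u017Cy\u0142");
        // The euro sign alone takes three bytes in UTF-8.
        out.println(encode("\u20ac").length);
    }
}
```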

Java interprets Unicode escapes in the source code, such as your \u0142, as if you had actually typed that character (Latin small letter L with stroke) into the source. Java does not interpret Unicode escapes that it reads from a file.

If you take your String str = "\u0142o\u017Cy\u0142"; and write it to a file a.txt from your Java program, then open the file in an editor, you'll see the characters themselves in the file, not the \uNNNN sequences.

If you then take your original posted program and read that a.txt file, you should see what you expected.
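
A minimal sketch of that round trip, assuming a temporary file in place of C:/a.txt so that it runs anywhere. The class name `RoundTrip` and the helper `writeThenRead` are invented for this illustration. It writes the literal with an explicit UTF-8 writer and reads it back with the same `InputStreamReader` setup as the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RoundTrip {
    // Write the text to a temp file as UTF-8, then read the first line back as UTF-8.
    public static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("a", ".txt");
            try {
                try (BufferedWriter writer =
                             Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(text);
                }
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The compiler turns these escapes into real characters before the program runs.
        String str = "\u0142o\u017Cy\u0142";
        String fromFile = writeThenRead(str);
        System.out.println("same characters: " + str.equals(fromFile));
    }
}
```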

You can use Apache Commons Lang.

import org.apache.commons.lang3.StringEscapeUtils;

// Open the file as ASCII and read it into a string; a literal stands in for that step here.
// Each backslash is doubled so that the compiler leaves the escapes in the string as text.
String escapedStr = "\\u0938\\u093e\\u0935\\u0928@\\u0928\\u093f\\u0915\\u094d\\u0938\\u0940.\\u092d\\u093e\\u0930\\u0924";

String hindiStr = StringEscapeUtils.unescapeJava( escapedStr );

System.out.println(hindiStr);

It sounds as though your file literally contains the text z\u0142o\u017Cy\u0142, i.e. it has Unicode escape sequences in it.

There's probably a library for decoding these, but you could do it yourself. According to the Java Language Specification, an escape sequence is always of the form \uxxxx, so you could take the 4-digit hex value xxxx for the character, convert it to an integer with Integer.parseInt (radix 16), cast it to a character, and finally replace the whole \uxxxx sequence with that character.
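
The steps above can be sketched as a small hand-written parser. This is an illustration under simplifying assumptions, not a full implementation of the language specification's rules: it handles only a single `u` followed by exactly four hex digits, and it does not treat a doubled backslash specially. The class name `UnicodeUnescaper` and its methods are made up for this example.

```java
public class UnicodeUnescaper {
    // True if every character in input[from, to) is a hexadecimal digit.
    private static boolean isHex(String input, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.digit(input.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Replace each backslash, 'u', four-hex-digit sequence with the character it names.
    public static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 6 <= input.length()
                    && input.charAt(i + 1) == 'u'
                    && isHex(input, i + 2, i + 6)) {
                // Parse the four hex digits and append the resulting UTF-16 code unit.
                int codeUnit = Integer.parseInt(input.substring(i + 2, i + 6), 16);
                out.append((char) codeUnit);
                i += 6;
            } else {
                // Anything else, including an incomplete escape, is copied unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes keep the escapes as plain text, as they would be in the file.
        String fromFile = "z\\u0142o\\u017Cy\\u0142";
        System.out.println(unescape(fromFile));
    }
}
```

Because each escape yields one UTF-16 code unit, a surrogate pair written as two consecutive escapes is reassembled correctly without extra handling.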

So, you want to unescape Unicode code points? There is no public API available for this. java.util.Properties has a loadConvert() method which does exactly this, but it's private. Check the Java source if you'd like to reuse it. It does the conversion by simple parsing. I wouldn't use regex for this, since that is too error-prone in very specific circumstances.

Or perhaps you should be using java.util.Properties, or its i18n counterpart java.util.ResourceBundle, with a .properties file instead of a plain .txt file.
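
A short sketch of the Properties route, assuming the text is stored in `key=value` form. The key name `word`, the class name `PropertiesEscapes` and the helper `loadValue` are made up for this example, and a `StringReader` stands in for the .properties file. The public `Properties.load` method decodes the escapes while loading.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesEscapes {
    // Load properties-format text and return the value stored under the given key.
    public static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            // load() decodes Unicode escapes found in keys and values.
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Stands in for the file content; doubled backslashes keep the escapes as text.
        String fileContent = "word=z\\u0142o\\u017Cy\\u0142\n";
        System.out.println(loadValue(fileContent, "word"));
    }
}
```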


I think it's just "UTF8", not "UTF-8".

Here I saw it: Source

I posted Java code to unescape ("descape"?) this kind of thing, and many other things, in this answer.

You have used FileInputStream, which is a byte reader, not a character reader. Try using FileReader instead.

Something like:

BufferedReader inputStream = new BufferedReader(new FileReader("C:/a.txt"));

Then you can use the line-oriented BufferedReader to read each line. FileInputStream is low-level I/O that you should avoid here. You're writing characters to your file, not bytes, so the best approach is to use character streams for writing and reading unless you need to write bytes or binary data.
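
A sketch of the character-stream approach with the encoding named explicitly, since `FileReader(String)` falls back to the platform default charset on Java versions before 18. `Files.newBufferedReader` is a standard call (Java 7 and later); the class name `ReadLines` and the helper `readLines` are invented for this example, and the default path is the one from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ReadLines {
    // Read every line of the file through a character stream that decodes UTF-8.
    public static List<String> readLines(Path file) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Path from the question; pass a different one as the first argument if needed.
        Path file = Paths.get(args.length > 0 ? args[0] : "C:/a.txt");
        for (String line : readLines(file)) {
            System.out.println(line);
        }
    }
}
```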
