
Cannot read special letters from a UTF-8 .txt file in Java

I have a problem with UTF-8 encoding in Java. I have a UTF-8 encoded .txt file, and I have checked in Notepad++ that the file really is UTF-8 encoded. When I try to read the file, the special letters are not shown correctly.

I use the following piece of code:

    try {
        Scanner sc = new Scanner(new FileInputStream("file.txt"), "UTF-8");
        String str;

        while (sc.hasNextLine()) {
            str = sc.nextLine();
            roadNames.add(str);
            System.out.println(str);
        }

        sc.close();
    } catch (IOException e1) {
        System.out.println("The file was not found....");
    }

The special letters are shown correctly in Eclipse, where I have set the default encoding to UTF-8, but not when I run my generated jar file.

The only thing that actually works for me is to make a .bat file containing the command "java -Dfile.encoding=utf-8 -jar executable.jar", but I do not think that is a good solution.

Furthermore, this also works:

PrintStream out = new PrintStream(System.out, true, "UTF-8"); 
out.println(str);

Update

When I say "the special letters are not shown correctly", I mean that System.out.println prints a string in which the special letters are replaced, for example ├à instead of å.

It turns out the

PrintStream out = new PrintStream(System.out, true, "UTF-8"); 
out.println(str);

does not work after all. Sorry about that.

The real problem is not printing the contents of the text document to the console. Each line in the text document contains a name, and each name is added to an ArrayList. I then have a JTextField which, as I type in it, tries to autocomplete what I typed by searching the ArrayList for the best matching name. This would work perfectly were it not for the encoding problem: the special letters inside the JTextField are not shown correctly. They are only shown correctly when I use the -Dfile.encoding=utf-8 argument.

Java will use the platform default encoding, unless you specify something else.

It sounds like your platform default (a Windows setting) is not UTF-8, so whenever you neither set the file.encoding property nor pass an encoding to the PrintStream constructor, the default encoding is used. In that case, when a character cannot be encoded, the encoder's replacement character is used instead. This is usually ' ' or '?'.

The operating system is indicating that it may not be able to display some of the characters you wish to print. You can ignore that hint and hope for the best, or you can replace the troublesome characters with something that is guaranteed to display. The default is to replace; you have to be explicit if you want to use the riskier approach.
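
To make the replacement behaviour concrete, here is a minimal sketch. The class and method names (ReplacementDemo, encodeLenient, canEncodeStrict) are made up for illustration, and US-ASCII merely stands in for any charset that cannot represent å. String.getBytes(Charset) substitutes the charset's replacement bytes, while a CharsetEncoder set to REPORT throws instead:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ReplacementDemo {

    /** Lenient encoding: unmappable characters become the charset's replacement bytes. */
    static byte[] encodeLenient(String text, Charset charset) {
        return text.getBytes(charset);
    }

    /** Strict encoding: returns false if any character cannot be represented. */
    static boolean canEncodeStrict(String text, Charset charset) {
        try {
            charset.newEncoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \u00e5 is 'å'; the escape keeps this source file independent of its own encoding.
        String text = "\u00e5";
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        byte[] ascii = encodeLenient(text, StandardCharsets.US_ASCII);
        System.out.println("Lenient US-ASCII result: " + (char) ascii[0]);

        System.out.println("Strict US-ASCII possible: "
                + canEncodeStrict(text, StandardCharsets.US_ASCII));
        System.out.println("Strict UTF-8 possible: "
                + canEncodeStrict(text, StandardCharsets.UTF_8));
    }
}
```

The first line of output depends on your machine, which is exactly the point: anything that relies on the default charset behaves differently from one system to the next.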


Update: Based on the information provided in updates to the original question, it sounds like the problem lies in reading the file, not its output.

Using the platform default encoding is an exceptional case. The general pattern you should follow is to specify the encoding explicitly each time you are decoding a sequence of bytes to a string of characters. The encoding is inherent to the stream you are reading, and generally independent of the system that your code happens to be running on. Exceptions would be when you are reading from the console, or similar. Otherwise, there should be some metadata or convention that specifies the encoding, like an HTTP header, an attribute embedded in the file, or some standard that requires a particular encoding.
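
As a small illustration of why the encoding belongs to the bytes and not to the machine, the sketch below decodes the same UTF-8 bytes twice. The names DecodeDemo, decode and codePoints are hypothetical, and ISO-8859-1 is only an example of a "wrong" charset; your platform default may be a different one and produce different garbage:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    /** Decodes the given bytes using the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Renders each code point as U+XXXX so the output does not depend on the console. */
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'å' (U+00E5) is the two bytes 0xC3 0xA5 in UTF-8.
        byte[] utf8Bytes = "\u00e5".getBytes(StandardCharsets.UTF_8);

        // Correct charset: one character comes back.
        System.out.println(codePoints(decode(utf8Bytes, StandardCharsets.UTF_8)));      // U+00E5

        // Wrong charset: each byte is decoded as a separate character.
        System.out.println(codePoints(decode(utf8Bytes, StandardCharsets.ISO_8859_1))); // U+00C3 U+00A5
    }
}
```

One special letter turning into two unrelated characters is the typical sign that UTF-8 bytes were decoded with a single-byte charset somewhere along the way.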

Here's how to read your road names from a UTF-8–encoded file:

Set<String> roadNames = new TreeSet<>();
try (InputStream bytes = new FileInputStream("file.txt")) {
  /* See how I'm specifying the UTF-8 encoding explicitly? */
  Reader chars = new InputStreamReader(bytes, StandardCharsets.UTF_8);
  BufferedReader lines = new BufferedReader(chars);
  while (true) {
    String line = lines.readLine();
    if (line == null)
      break;
    roadNames.add(line);
  }
}
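
If you can use java.nio.file (Java 7 or later), the same idea fits in a few lines. This is only a sketch: the RoadNameReader class, the readRoadNames method and the temporary file with two sample names are made up so that the example runs on its own. In your program you would pass the path to your real file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

class RoadNameReader {

    /** Reads every line of the file as UTF-8, regardless of the platform default. */
    static Set<String> readRoadNames(Path file) throws IOException {
        return new TreeSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data, written as UTF-8 bytes to a temporary file.
        Path file = Files.createTempFile("roads", ".txt");
        try {
            Files.write(file, "\u00c5vej\nB\u00e6kvej\n".getBytes(StandardCharsets.UTF_8));
            Set<String> roadNames = readRoadNames(file);
            System.out.println(roadNames.size() + " names read");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Unlike a Scanner, Files.readAllLines throws an exception on bytes that are not valid in the given charset, so a file that is not really UTF-8 fails loudly instead of yielding garbled names.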

I had the same problem. Use Charset.forName("cp866") and it should help.

BufferedReader brI = new BufferedReader(new InputStreamReader(cmd.getInputStream(), Charset.forName("cp866")));
String result;
while ((result = brI.readLine()) != null) {
    System.out.println(result);
}
