
Reading special characters from File - Java

I am reading data from a text file with following properties:

Encoding: ANSI
File Type: PC

Now, the file contains a lot of special characters, such as the degree symbol (º). I am reading this file using the following code:

File file = new File("C:\\X\\Y\\SpecialCharacter.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));

If the file encoding is ANSI, the above code does not read the special characters properly. For example, given this line in the file:
"Lower heat and simmer until product reaches internal temperature of 165ºF"
reader.readLine() returns:
"Lower heat and simmer until product reaches internal temperature of 165 F"

When I changed the file's encoding to UTF-8, the line was read exactly as it appears in the file, with the special characters intact.

My question, at what point does the data get messed up? When storing the data in the file or when reading it from the file? Opening the file in Notepad displays all the special characters properly. How does that happen?

Hexdump output:

          -0 -1 -2 -3  -4 -5 -6 -7  -8 -9 -A -B  -C -D -E -F

00000000- 4C 6F 77 65  72 20 68 65  61 74 20 61  6E 64 20 73 [Lower heat and s]
00000001- 69 6D 6D 65  72 20 75 6E  74 69 6C 20  70 72 6F 64 [immer until prod]
00000002- 75 63 74 20  72 65 61 63  68 65 73 20  69 6E 74 65 [uct reaches inte]
00000003- 72 6E 61 6C  20 74 65 6D  70 65 72 61  74 75 72 65 [rnal temperature]
00000004- 20 6F 66 20  31 36 35 BA  46                       [ of 165.F       ]

"ANSI" is not a particular encoding - it's a whole collection of encodings. You need to use the right encoding when reading the file. For example, it's entirely possible that you're using the Windows-1252 encoding, which means you may want to try passing in "Cp1252" as the encoding name.

In fact, you're passing in "UTF-8" which isn't one of the encodings typically referred to as ANSI. You need to find out the exact encoding that the file uses, and then specify that in the InputStreamReader parameter.
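As a minimal sketch of that change, the question's reader can be opened with an explicit charset instead of "UTF-8". Windows-1252 here is only an assumption about what "ANSI" means for this particular file, and the class and method names (Cp1252FileReader, readFirstLine) are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class Cp1252FileReader {
    // Assumption: the "ANSI" file was saved as Windows-1252.
    // Replace this with the file's real encoding once you know it.
    static final Charset ASSUMED_CHARSET = Charset.forName("windows-1252");

    // Reads the first line of the file, decoding bytes with ASSUMED_CHARSET.
    static String readFirstLine(File file) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), ASSUMED_CHARSET))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question.
        File file = new File("C:\\X\\Y\\SpecialCharacter.txt");
        System.out.println(readFirstLine(file));
    }
}
```

Passing a Charset object rather than a name string also avoids the checked UnsupportedEncodingException that the String overload of InputStreamReader declares.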

My question, at what point does the data get messed up? When storing the data in the file or when reading it from the file?

Assuming the encoding is capable of representing all the characters you're interested in, it's only when you read the file. Basically, you're trying to read it as if it's in one encoding, when it's actually in another. Notepad is either performing some sort of heuristic encoding detection, or it happens to use the right default for this particular situation.
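To see that the bytes on disk are intact and only the decoding differs, here is a small sketch that decodes the last five bytes from the question's hexdump (31 36 35 BA 46) in two ways. The class name DecodeComparison and its helpers are hypothetical, and Windows-1252 is again an assumed guess at the file's real encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeComparison {
    // Last five bytes of the hexdump in the question: "165", 0xBA, "F".
    static final byte[] TAIL = { 0x31, 0x36, 0x35, (byte) 0xBA, 0x46 };

    // Decodes the same bytes with whichever charset the caller chooses.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Formats a string as a list of code points, so the result does not
    // depend on the console's own encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A lone 0xBA byte is not valid UTF-8, so it becomes U+FFFD.
        System.out.println(codePoints(decode(TAIL, StandardCharsets.UTF_8)));
        // prints: U+0031 U+0036 U+0035 U+FFFD U+0046

        // In Windows-1252, 0xBA maps to U+00BA (º).
        System.out.println(codePoints(decode(TAIL, Charset.forName("windows-1252"))));
        // prints: U+0031 U+0036 U+0035 U+00BA U+0046
    }
}
```

The input bytes are identical in both calls, so the special character is lost at read time, not when the file was stored.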

new InputStreamReader(new FileInputStream(file), "UTF-8") is for reading UTF-8-encoded files: if you're reading a file encoded differently (e.g. Windows-1252), you should change the second parameter accordingly.

A text file is never "messed-up" encoding-wise: it is stored in some encoding and you should use that same encoding when reading it, so that the system can interpret that raw stream of bytes and associate each [group of] byte[s] with the proper character [or Unicode codepoint, if we're doing Unicode], for you to be able to see the "right" glyphs.
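If you are unsure whether a guessed encoding matches the bytes, one option is to decode strictly, so that invalid byte sequences raise an error instead of being silently replaced. This is only a sketch: StrictDecoding and decodesCleanly are made-up names, and a clean decode does not prove the guess is right, because single-byte encodings such as Windows-1252 accept almost any byte:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecoding {
    // Returns true if every byte sequence is valid in the given charset.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For the bytes 31 36 35 BA 46 from the question, this check fails for UTF-8 (a lone 0xBA is malformed) and passes for Windows-1252, which at least rules UTF-8 out.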

Hope this clarifies a little.

Cheers
