
Reading a file with accented characters in Java

I came across two special characters that do not seem to be covered by the ISO-8859-1 character set, i.e. they don't make it through to my program.

The German ß and the Norwegian ø.

I'm reading the files as follows:

FileInputStream inputFile = new FileInputStream(corpus[i]);
InputStreamReader ir = new InputStreamReader(inputFile, "ISO-8859-1") ;

Is there a way for me to read these characters without having to apply manual replacement as a workaround?

[EDIT]

This is how it looks on screen. Note that I have no problems with other accented characters, e.g. è and the like.

[screenshot]

Both characters are present in ISO-Latin-1 (check my name to see why I've looked into this).

If the characters are not read correctly, the most likely cause is that the text in the file was not saved in that encoding but in some other one.

Depending on your operating system and the origin of the file, possible encodings could be UTF-8 or a Windows OEM (console) code page such as 850 or 437.

The easiest way to find out is to open the file in a hex editor and report back the exact byte values stored for these two characters.
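If no hex editor is at hand, the same check can be done from Java. The following is only a minimal sketch: the class name `HexProbe` and its helper methods are made up for illustration. It prints how ß and ø would be stored in ISO-8859-1 and in UTF-8, and, if a file path is passed as an argument, lists every non-ASCII byte in that file so the values can be compared.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original question.
class HexProbe {

    /** Formats bytes as space-separated, two-digit uppercase hex values. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Shows how a string would be stored on disk in the given charset. */
    static String encodedHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file independent of its own encoding.
        String sample = "\u00DF\u00F8"; // ß followed by ø

        System.out.println("ISO-8859-1: " + encodedHex(sample, StandardCharsets.ISO_8859_1)); // DF F8
        System.out.println("UTF-8:      " + encodedHex(sample, StandardCharsets.UTF_8));      // C3 9F C3 B8

        if (args.length > 0) {
            // List every byte above 0x7F together with its offset in the file.
            byte[] raw = Files.readAllBytes(Paths.get(args[0]));
            for (int i = 0; i < raw.length; i++) {
                if ((raw[i] & 0x80) != 0) {
                    System.out.printf("offset %d: %02X%n", i, raw[i] & 0xFF);
                }
            }
        }
    }
}
```

If the file shows `C3 9F` where a ß is expected, it is most likely UTF-8; a single `DF` would point to ISO-8859-1.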

Assuming that your file is probably UTF-8 encoded, try this:

InputStreamReader ir = new InputStreamReader(inputFile, "UTF-8");
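To make the effect of the charset argument visible, here is a small self-contained sketch. It assumes the file really is UTF-8, which the question does not confirm, and the class name `CharsetReadDemo` and method `readAll` are invented for the example. The same four bytes are decoded once as UTF-8 and once as ISO-8859-1.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class CharsetReadDemo {

    /** Reads the whole stream as text, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The UTF-8 encoding of ß (C3 9F) followed by ø (C3 B8).
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0x9F, (byte) 0xC3, (byte) 0xB8};

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8 the bytes yield two characters; decoded as
        // ISO-8859-1 every byte becomes its own character, so four come out.
        System.out.println(right.length()); // 2
        System.out.println(wrong.length()); // 4
    }
}
```

For a real file, a `FileInputStream` would take the place of the `ByteArrayInputStream`. Using the `StandardCharsets` constants rather than a string name also avoids the checked `UnsupportedEncodingException`.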

ISO-8859-1 covers ß and ø, so the file is probably saved in a different encoding. You should pass the file's actual encoding to new InputStreamReader().
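When the file's encoding is not known, one way to test a guess is to decode the bytes strictly and see whether the decoder complains. This is a rough sketch rather than a real detector, and the class name `StrictDecodeCheck` and method `decodesCleanly` are hypothetical. It relies on `CharsetDecoder` with `CodingErrorAction.REPORT`, which throws instead of silently substituting replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class StrictDecodeCheck {

    /** Returns true if the bytes form a valid sequence in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xDF, (byte) 0xF8};                           // ß ø in ISO-8859-1
        byte[] utf8 = {(byte) 0xC3, (byte) 0x9F, (byte) 0xC3, (byte) 0xB8};   // ß ø in UTF-8

        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8)); // false
        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));   // true
    }
}
```

Note that this can only rule an encoding out. ISO-8859-1 assigns a character to every byte value, so any file decodes "cleanly" as ISO-8859-1 even when the result is garbage.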
