Reading from a file that contains unmappable characters

Question

I am attempting to use File and Scanner to read through a .txt file and extract the useful information into a separate file. Some of these files contain Chinese characters, which causes my Scanner to fail with "java.nio.charset.UnmappableCharacterException". The Chinese characters are of no importance, so how do I make the Scanner ignore them and keep searching the rest of the file for useful information?

Here is the code:

            try {
                File source = new File(this.parentDirectory + File.separator + this.fileName.getText());
                Scanner reader = new Scanner(source);
                StringBuilder str = new StringBuilder();
                while (reader.hasNextLine()) {
                    str.append(reader.nextLine());
                    str.append("\n");
                }
                if (reader.ioException() != null) {
                    throw reader.ioException();
                }
                reader.close();
                this.input.setText(str.toString());
            } catch (FileNotFoundException e1) {
                JOptionPane.showMessageDialog(this, "File not found!");
                return;
            } catch (IOException e1) {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }

Answer 1

A Scanner implicitly decodes an external sequence of bytes into the 16-bit UTF-16 code units that all Java Strings use.

You need to know the actual encoding used for the external data (i.e., the file content). Then you declare your Scanner as

  Scanner reader = new Scanner(file, charset);

If the charset matches the file's real encoding, there should be no 'unmappable' characters.
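As a rough illustration of the explicit-charset approach, the sketch below rewrites the question's read loop around a Scanner that is given a charset. The class name ExplicitCharsetRead and the method readAll are hypothetical names chosen for this example, and the caller is assumed to know the file's real encoding.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.util.Scanner;

// Hypothetical helper for illustration; not part of the original question's code.
final class ExplicitCharsetRead {

    /** Reads every line of the file, decoding its bytes with the given charset. */
    static String readAll(Path file, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the Scanner (and the file) even if reading fails
        try (Scanner reader = new Scanner(file, charset.name())) {
            while (reader.hasNextLine()) {
                text.append(reader.nextLine()).append('\n');
            }
            // Scanner swallows I/O errors, so check for one before returning
            IOException failure = reader.ioException();
            if (failure != null) {
                throw failure;
            }
        }
        return text.toString();
    }
}
```

A call might look like `ExplicitCharsetRead.readAll(Paths.get("data.txt"), StandardCharsets.UTF_8)`, with the second argument replaced by whichever charset the files were actually written in.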

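If you don't specify the charset explicitly, the platform default is used. That is UTF-8 on Java 18 and later; on older JVMs it depends on the operating system and locale, so it may be a legacy code page (windows-1252 is common on Windows).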
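To see which default applies on a particular machine, a small check like the following may help. DefaultCharsetCheck is a hypothetical class name; Charset.defaultCharset() is the standard call that reports the JVM's default.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration.
final class DefaultCharsetCheck {

    /** Returns the canonical name of the charset the JVM uses when none is specified. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
    }
}
```

If the printed name differs from the encoding the files were saved in, that mismatch is a likely cause of the exception.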

Alternatively, it seems that you're not really using the Scanner to any significant degree; you're just using it to collect lines. You could drop down a level, use a FileInputStream to read the file as a sequence of bytes, and apply whatever heuristics you think appropriate to pick out the 'useful' parts of the file.
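One possible heuristic, sketched below, keeps only printable ASCII bytes plus tabs and line breaks and drops everything else. It assumes the useful information is plain ASCII and that the file's encoding represents non-ASCII characters using only bytes of 0x80 and above, which holds for UTF-8 but not for encodings such as GBK or UTF-16. AsciiOnlyRead and readAsciiOnly are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; the filtering rule is an assumption, not the answer's method.
final class AsciiOnlyRead {

    /** Returns the file's content with every byte outside printable ASCII (and tab/CR/LF) dropped. */
    static String readAsciiOnly(String path) throws IOException {
        StringBuilder text = new StringBuilder();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                boolean printable = b >= 0x20 && b <= 0x7E;
                boolean whitespace = b == '\n' || b == '\r' || b == '\t';
                if (printable || whitespace) {
                    // Safe cast: the byte is known to be a single ASCII character
                    text.append((char) b);
                }
            }
        }
        return text.toString();
    }
}
```

Because no charset decoding takes place, this path cannot raise UnmappableCharacterException; the cost is that any non-ASCII text is silently discarded.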

Answer 1
0 · Accepted · 2020-09-01 03:06:29
