
Cannot find ZERO WIDTH NO-BREAK SPACE when reading file

I've run into a problem when trying to parse a JSON string that I read from a file. The zero width no-break space character (U+FEFF) is at the beginning of my string when I read it in, and I cannot get rid of it. I don't want to use a regex because there may be other hidden characters with different code points.

Here's what I have:

StringBuilder content = new StringBuilder();
try {
    BufferedReader br = new BufferedReader(new FileReader("src/test/resources/getStuff.json"));
    String currentLine;
    while((currentLine = br.readLine()) != null) {
        content.append(currentLine);
    }
    br.close();
} catch(Exception e) {
    Assert.fail();
}

And this is the start of the JSON file (it's too long to copy and paste the whole thing, but I have confirmed it is valid):

{"result":{"data":{"request":{"year":null,"timestamp":1413398641246,...

Here's what I've tried so far:

  • Copying the JSON file to Notepad++ and showing all characters
  • Copying the file to Notepad++ and converting it to UTF-8 without BOM, and to ISO 8859-1
  • Opening the JSON file in other text editors such as Sublime Text and saving it as UTF-8
  • Copying the JSON file to a txt file and reading that in
  • Using Scanner instead of BufferedReader
  • In IntelliJ, View -> Active Editor -> Show Whitespaces

How can I read this file in without having the zero width no-break space character at the beginning of the string?

0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. If 0xFEFF exists at the front of your String, it means you created a UTF-encoded text file with a BOM. A UTF-16 BOM could appear as-is as 0xFEFF, whereas a UTF-8 BOM would only appear as 0xFEFF if the BOM itself were decoded from UTF-8 to UTF-16 (meaning the reader decoded the BOM as an ordinary character instead of skipping it). In fact, it is known that Java does not handle UTF-8 BOMs (see bugs JDK-4508058 and JDK-6378911).
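
To make the point about decoding concrete, here is a minimal sketch that decodes the three UTF-8 BOM bytes the way a UTF-8 reader would. The class and method names (BomDecodeDemo, decodeUtf8) are made up for illustration and are not part of the original question or answer.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Decodes raw bytes as UTF-8, the way a UTF-8 reader would.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the two characters "{}"
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '{', '}'};
        String text = decodeUtf8(withBom);

        // The three BOM bytes become a single char, U+FEFF, in front of "{}"
        System.out.println(text.length());                   // 3
        System.out.println((int) text.charAt(0) == 0xFEFF);  // true
    }
}
```

The decoder treats the BOM like any other character, which is why it ends up as the first char of the string.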

If you read the FileReader documentation, it says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
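
As a rough sketch of what that documentation suggests, the following builds an InputStreamReader on a FileInputStream with an explicit charset and then drops a single leading U+FEFF without any regex. It assumes the file really is UTF-8. The class name, method name, and temp file are invented for this illustration.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {
    // Reads the whole file as UTF-8 and drops one leading U+FEFF if present.
    // Line terminators are discarded, as in the question's readLine() loop.
    static String readUtf8WithoutBom(String path) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line);
            }
        }
        // A plain character check; no regex involved
        if (content.length() > 0 && content.charAt(0) == '\uFEFF') {
            content.deleteCharAt(0);
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file: UTF-8 BOM followed by "{}"
        Path tmp = Files.createTempFile("bom-demo", ".json");
        Files.write(tmp, new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '{', '}'});
        System.out.println(readUtf8WithoutBom(tmp.toString())); // prints {}
        Files.delete(tmp);
    }
}
```

This only handles the case where the encoding is known in advance. If the file could be UTF-16 instead, the charset has to be chosen from the BOM before decoding.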

You need to read the file content using a reader that recognizes charsets, preferably one that will read the BOM for you and adjust itself internally as needed. In the worst case, you could just open the file yourself, read the first few bytes to detect whether a BOM is present, and then construct a reader using an appropriate charset to read the rest of the file. Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:

(from https://stackoverflow.com/a/13988345/65863 )

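import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    // This constructor detects a UTF-8 BOM and excludes it from the data that is read
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    // Fall back to the default encoding when no BOM was found
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}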
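
If adding Commons IO is not an option, the "open the file yourself" approach can be sketched with the JDK alone. This is only an illustration under stated assumptions: the names ManualBomReader and open are hypothetical, it recognizes just the three BOMs listed earlier (UTF-8, UTF-16BE, UTF-16LE), and it does not try to tell a UTF-16LE BOM apart from a UTF-32LE one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ManualBomReader {
    // Peeks at up to three bytes, picks a charset from the BOM (if any),
    // skips the BOM, and returns a reader over the rest of the stream.
    static Reader open(InputStream in, Charset defaultCharset) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break; // stream is shorter than three bytes
            n += r;
        }

        Charset charset = defaultCharset;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Put back whatever was read beyond the BOM so the reader sees it
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, charset);
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file: UTF-8 BOM followed by "{}"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '{', '}'};
        try (BufferedReader br = new BufferedReader(
                open(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            System.out.println(br.readLine()); // prints {}
        }
    }
}
```

For a real file, a FileInputStream would be passed in place of the ByteArrayInputStream used in the demo.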
