How to get rid of "rogue chars" in a .txt file encoded as UTF-8

My program reads from a .txt file encoded as UTF-8. I'm using UTF-8 so that I can handle the characters åäö. The problem is that when the lines are read, some "rogue" characters seem to sneak into the string, which causes problems when I try to store those lines in variables. Here's the code:

public void Läsochlista()
{
    String Content = "";
    String[] Argument = new String[50];
    int index = 0;
    Log.d("steg1", "steg1");
    try{
        InputStream inputstream = openFileInput("text.txt");
        if(inputstream != null)
        {
            Log.d("steg2", "steg2");
            //InputStreamReader inputstreamreader = new InputStreamReader(inputstream);
            //BufferedReader bufferreader = new BufferedReader(inputstreamreader);
            BufferedReader in = new BufferedReader(new InputStreamReader(inputstream, "UTF-8"));
            String reciveString = "";
            StringBuilder stringbuilder = new StringBuilder();

            while ((reciveString = in.readLine()) != null)
            {
                Argument[index] = reciveString;
                index++;
                if(index == 6)
                {
                    Log.d(Argument[0], String.valueOf((Argument[0].length())));
                    AllaPlatser.add(new Platser(Float.parseFloat(Argument[0]), Float.parseFloat(Argument[1]), Integer.parseInt(Argument[2]), Argument[3], Argument[4], Integer.parseInt(Argument[5])));
                    Log.d("En ny plats skapades", Argument[3]);
                    Arrays.fill(Argument, null);
                    index = 0;
                }
            }
            inputstream.close();
            Content = stringbuilder.toString();
        }
    }
    catch (FileNotFoundException e){
        Log.e("Filen", " Hittades inte");
    } catch (IOException e){
        Log.e("Filen", " Ej läsbar");
    }
}

Now I'm getting the error:

Invalid float: "61.193521"

where the line contains only the characters "61.193521". When I print the length of the string as read by the program, the output is "10", which is one more character than the string is supposed to contain. The question: how do I get rid of those invisible "rogue" chars, and why are they there in the first place?

When you save a file as "UTF-8", your editor may be writing a byte-order mark (BOM) at the beginning of the file.
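To see why a BOM would produce exactly the symptom described, here is a minimal sketch that builds the bytes of such a line in memory and decodes them as UTF-8. The class and method names (BomLengthDemo, decodeUtf8, withBom) are made up for illustration, and the assumption is that the file really does start with the three BOM bytes EF BB BF.

```java
import java.nio.charset.StandardCharsets;

class BomLengthDemo {
    // Decodes raw bytes as UTF-8, much like InputStreamReader(stream, "UTF-8") does.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: prepends the UTF-8 BOM bytes (EF BB BF) to the text.
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) {
        String line = decodeUtf8(withBom("61.193521"));
        // The decoder keeps the BOM as one invisible char (U+FEFF) at index 0.
        System.out.println("length = " + line.length());
        System.out.println("first char = U+" + Integer.toHexString(line.charAt(0)).toUpperCase());
        try {
            Float.parseFloat(line);
        } catch (NumberFormatException e) {
            System.out.println("parse failed: " + e.getMessage());
        }
    }
}
```

The three BOM bytes decode to a single char, which would explain a length of 10 for a line with 9 visible characters. Float.parseFloat only trims ordinary whitespace, so the extra char makes the parse fail.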

See if there's an option in your editor to save UTF-8 without the BOM.

Apparently the BOM is just a pain in the butt: What's different between UTF-8 and UTF-8 without BOM?

I know you want to be able to have extended characters in your data; however, you may want to pick a different encoding like Latin-1 (ISO 8859-1).

Or you can read and discard the first three bytes (the UTF-8 BOM, EF BB BF) from the input stream before you wrap it with the reader.
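A rough sketch of that idea, with one safeguard added: it checks that the first three bytes really are a BOM and pushes them back if they are not, so a file saved without a BOM doesn't lose its first characters. The names BomSkippingReader, open and firstLine are invented for this example. In the question's code you would pass the stream from openFileInput("text.txt") to open(...).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomSkippingReader {
    // Returns a UTF-8 reader over raw that starts after the BOM, if one is present.
    static BufferedReader open(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may deliver fewer bytes than asked for, so loop until 3 bytes or end of stream.
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            in.unread(head, 0, n); // not a BOM: put the bytes back so nothing is lost
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Convenience for trying it out on in-memory bytes instead of a real file.
    static String firstLine(byte[] fileBytes) {
        try (BufferedReader reader = open(new ByteArrayInputStream(fileBytes))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Checking before discarding costs a few extra lines, but it means the same reading code keeps working if the file is later re-saved without the BOM.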

Unfortunately, you have not provided a sample text file, so I can't test against your exact input; what follows is an educated guess at the cause. This looks like a BOM-related issue that you will have to handle. Some related detail is given here: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html And some more information here: What is XML BOM and how do I detect it?

Basically, there are several situations:

  1. In the first, we run into problems because we don't read and write using the correct encoding.
  2. In the second, we use an editor or reader that doesn't support UTF-8.
  3. In the third, we use the correct encoding for reading and writing and see no problem in a text editor, but we do see one in some other application or program. I think your issue falls under the third case.

In the third situation, we may have to remove the BOM with a program or otherwise deal with it according to our context. Here is a solution you may find interesting: UTF-8 file reading: the first character issue

You can use the code given in this thread's answer, or use Apache Commons to deal with it: Byte order mark screws up file reading in Java
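If you would rather not add a library, another plain-JDK option is to strip the BOM after decoding, since it shows up as the single char U+FEFF at the start of the first line. This is only a sketch and not the code from the linked threads; BomStripper and stripLeadingBom are hypothetical names.

```java
class BomStripper {
    // The BOM as it appears after UTF-8 decoding: one char, U+FEFF.
    static final char BOM = '\uFEFF';

    // Returns the line without a leading BOM; other lines come back unchanged.
    static String stripLeadingBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String first = BOM + "61.193521"; // simulates the first line read from the file
        System.out.println(Float.parseFloat(stripLeadingBom(first)));
    }
}
```

In the question's loop this would be applied to the first line returned by readLine() only, before it is stored in Argument[0].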
