
How to get rid of “Rogue Chars” in a .txt file encoded as UTF-8

My program reads from a .txt file encoded as UTF-8. I'm using UTF-8 in order to handle the characters åäö. The problem is that when the lines are read, some "rogue" characters seem to sneak into the strings, which causes problems when I try to store those lines in variables. Here's the code:

public void Läsochlista()
{
    String Content = "";
    String[] Argument = new String[50];
    int index = 0;
    Log.d("steg1", "steg1");
    try{
        InputStream inputstream = openFileInput("text.txt");
        if(inputstream != null)
        {
            Log.d("steg2", "steg2");
            //InputStreamReader inputstreamreader = new InputStreamReader(inputstream);
            //BufferedReader bufferreader = new BufferedReader(inputstreamreader);
            BufferedReader in = new BufferedReader(new InputStreamReader(inputstream, "UTF-8"));
            String reciveString = "";
            StringBuilder stringbuilder = new StringBuilder();

            while ((reciveString = in.readLine()) != null)
            {
                Argument[index] = reciveString;
                index++;
                if(index == 6)
                {
                    Log.d(Argument[0], String.valueOf((Argument[0].length())));
                    AllaPlatser.add(new Platser(Float.parseFloat(Argument[0]), Float.parseFloat(Argument[1]), Integer.parseInt(Argument[2]), Argument[3], Argument[4], Integer.parseInt(Argument[5])));
                    Log.d("En ny plats skapades", Argument[3]);
                    Arrays.fill(Argument, null);
                    index = 0;
                }
            }
            inputstream.close();
            Content = stringbuilder.toString();
        }
    }
    catch (FileNotFoundException e){
        Log.e("Filen", " Hittades inte");
    } catch (IOException e){
        Log.e("Filen", " Ej läsbar");
    }
}

Now, I'm getting the error

Invalid float: "61.193521"

where the line contains only the characters "61.193521". When I print the length of the string as read by the program, the output is "10", which is one more character than the string is supposed to contain. The question: how do I get rid of those invisible "rogue" characters, and why are they there in the first place?

When you save a file as "UTF-8", your editor may be writing a byte-order mark (BOM) at the beginning of the file.

See if there's an option in your editor to save UTF-8 without the BOM.
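To see why the reported length is 10 rather than 9, here is a small sketch that decodes the line with the three UTF-8 BOM bytes (EF BB BF) in front of it. The class and method names are made up for illustration, and it assumes the file really does start with a BOM, which the question does not confirm.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the asker's code.
class BomLengthDemo {
    // Builds the bytes EF BB BF followed by "61.193521" and decodes them as UTF-8.
    static String decodeWithBom() {
        byte[] digits = "61.193521".getBytes(StandardCharsets.UTF_8);
        byte[] bytes = new byte[digits.length + 3];
        bytes[0] = (byte) 0xEF;
        bytes[1] = (byte) 0xBB;
        bytes[2] = (byte) 0xBF;
        System.arraycopy(digits, 0, bytes, 3, digits.length);
        // Java's UTF-8 decoder keeps the BOM as the single character U+FEFF.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String line = decodeWithBom();
        System.out.println(line.length());                   // 10
        System.out.println(line.charAt(0) == '\uFEFF');      // true
    }
}
```

The extra character is invisible when logged, which would match the "Invalid float" message showing what looks like a perfectly good number.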

Apparently the BOM is just a pain in the butt; see: What's different between UTF-8 and UTF-8 without BOM?

I know you want to be able to have extended characters in your data; however, you could pick a different encoding such as Latin-1 (ISO 8859-1), which also covers å, ä and ö.

Or you can just read and discard the first three bytes (the UTF-8 BOM, EF BB BF) from the input stream before you wrap it with the reader.
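A minimal sketch of that last idea, with one safety tweak: it checks that the first three bytes really are EF BB BF and pushes them back if they are not, so a file saved without a BOM is not damaged. The class name `BomSkipper` and the method `skipUtf8Bom` are hypothetical helpers, not part of any library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class BomSkipper {
    // Returns a stream positioned after a leading UTF-8 BOM, if one is present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() may return fewer.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }
        boolean hasBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the file: BOM followed by "61.19".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '6', '1', '.', '1', '9'};
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(Float.parseFloat(reader.readLine()));
        }
    }
}
```

In the question's code this would presumably mean wrapping the result of `openFileInput("text.txt")` with `skipUtf8Bom(...)` before passing it to the `InputStreamReader`.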

Unfortunately you have not provided the sample text file, so testing with your exact code is not possible; this answer is a guess at the likely cause. It looks like a BOM-related issue that you will have to handle. Some related detail is given here: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html and some more information here: What is XML BOM and how do I detect it?

Basically there are various situations:

  1. In the first, we run into issues because we don't read and write using the correct encoding.
  2. In the second, we use an editor or reader that doesn't support UTF-8.
  3. In the third, we use the correct encoding for reading and writing and see no problem in a text editor, but we do see a problem in some other application or program. I think your issue falls under this third case.

In the third situation we may have to remove the BOM programmatically or deal with it according to our context. Here is a solution you may find interesting: UTF-8 file reading: the first character issue

You can use the code given in this thread's answer, or use Apache Commons to deal with it: Byte order mark screws up file reading in Java
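As a rough sketch of removing the BOM in the program itself, one option is to strip it after decoding: once the bytes have gone through a UTF-8 reader, the BOM shows up as the single character U+FEFF at the start of the first line. The class `BomStripper` and method `stripLeadingBom` below are hypothetical names, and this is not the code from the linked threads.

```java
// Hypothetical helper; names are illustrative only.
class BomStripper {
    // Removes one leading U+FEFF (the decoded BOM) if present; otherwise returns the line unchanged.
    static String stripLeadingBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String firstLine = "\uFEFF61.193521"; // what readLine() would return for a file with a BOM
        String cleaned = stripLeadingBom(firstLine);
        System.out.println(cleaned.length());         // 9
        System.out.println(Float.parseFloat(cleaned)); // parses without an exception
    }
}
```

Only the first line of the file needs this treatment, so in the question's loop it could be applied to `reciveString` the first time through. If Apache Commons IO is available, its `BOMInputStream` class is the library route mentioned above.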
