
Comparing UTF-8 strings in Java

In my Java program, I am retrieving some data from an XML document. The XML contains a few international characters and is encoded in UTF-8. I read it with an XML parser. Once I retrieve a particular international string from the parser, I need to compare it with a set of predefined strings. The problem is that when I use String.equals on an international string, the comparison fails.

How do I compare strings containing international characters in Java? I am using SAXParser and XMLReader to read strings from the XML.

Here's the line that compares strings

 String country;
 country = getXMLNodeString();

 if (country.equals("Côte d'Ivoire"))
 {

 }

 String getXMLNodeString() throws Exception
 {
     /* Get a SAXParser from the SAXParserFactory. */
     SAXParserFactory spf = SAXParserFactory.newInstance();
     SAXParser sp = spf.newSAXParser();

     /* Get the XMLReader of the SAXParser we created. */
     XMLReader xr = sp.getXMLReader();

     /* Create a new ContentHandler and apply it to the XMLReader. */
     XmlParser xmlParser = new XmlParser();  // my class to parse xml
     xr.setContentHandler(xmlParser);

     /* Parse the XML data from our URL. */
     xr.parse(new InputSource(url.openStream()));
     /* Parsing has finished. */

     // return string here
 }

Java Strings are sequences of chars, which are 16-bit unsigned values (UTF-16 code units). This was based on an earlier Unicode standard that supported 64K characters.

Your String constant "Côte d'Ivoire" is in this format. If the character encoding of your XML document is declared correctly, then the String read from it will also be in the correct format. So possible errors are:

  1. The XML document doesn't declare a character encoding;

  2. The declared character encoding does not match the actual character encoding used.
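A rough sketch of what the second case does to the string, assuming the bytes really are UTF-8 but get decoded with some other charset. The class and method names here are made up for illustration, and a real XML parser may report an error on invalid bytes instead of silently substituting characters:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encodes the text as UTF-8, then decodes those bytes with a different
    // charset, imitating a document whose declared encoding is wrong.
    public static String decodeWith(String text, Charset declared) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, declared);
    }

    public static void main(String[] args) {
        // \u00f4 is the o with circumflex; the escape keeps this example
        // independent of the encoding of the source file.
        String expected = "C\u00f4te d'Ivoire";

        String asLatin1 = decodeWith(expected, StandardCharsets.ISO_8859_1);
        String asAscii = decodeWith(expected, StandardCharsets.US_ASCII);

        System.out.println(expected.equals(asLatin1)); // false
        System.out.println(expected.equals(asAscii));  // false

        // The two UTF-8 bytes of the accented letter become two separate chars.
        System.out.println(expected.length() + " vs " + asLatin1.length());
    }
}
```

With the correct charset the round trip is lossless, so equals() succeeds; with the wrong one the accented letter turns into two other characters and the comparison fails.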

Perhaps the XML string is being treated as US-ASCII instead of UTF-8. I would output both and eyeball them. If they look the same, compare them character by character to see where the comparison fails. You may also want to compare the UTF-8 encoding of your constant String to what's in the XML document:

byte[] bytes = "Côte d'Ivoire".getBytes("UTF-8");
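The character-by-character comparison could be sketched roughly like this. StringDiff, firstDifference and dump are hypothetical helper names, not part of any library. The second string in main is an assumed example of a look-alike: the same letter written as a plain o followed by a combining circumflex, which prints identically but is not equal.

```java
public class StringDiff {

    // Returns the index of the first differing char, or -1 if the strings are equal.
    public static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // Same common prefix: equal if the lengths match, otherwise they
        // differ at the end of the shorter string.
        return a.length() == b.length() ? -1 : n;
    }

    // Renders every char as U+XXXX so invisible differences become visible.
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String constant = "C\u00f4te d'Ivoire";  // precomposed letter
        String fromXml = "Co\u0302te d'Ivoire";  // assumed input: o + combining circumflex

        System.out.println(firstDifference(constant, fromXml)); // 1
        System.out.println(dump(constant));
        System.out.println(dump(fromXml));
    }
}
```

If the dump shows a decomposed sequence like this, java.text.Normalizer can bring both strings to the same form before comparing; if it shows replacement or garbage characters, the decoding step is the more likely culprit.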

It gets more complicated when you start getting into "supplementary characters". These are characters whose code points (Unicode's term for the number assigned to a character) lie beyond the originally intended 64K range. See Supplementary Characters in the Java Platform. This isn't an issue with any of the characters you're using but it's worth noting for completeness.
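For completeness, a small sketch of how a supplementary character shows up in a String: it occupies two chars (a surrogate pair) but counts as one code point. The class name is made up for illustration, and U+1F600 is just an arbitrary character outside the 64K range.

```java
public class SupplementaryDemo {

    // Number of Unicode code points, as opposed to the number of 16-bit chars.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String supplementary = "\uD83D\uDE00";  // U+1F600 written as a surrogate pair
        String country = "C\u00f4te d'Ivoire";  // only characters below U+FFFF

        System.out.println(supplementary.length());     // 2
        System.out.println(codePoints(supplementary));  // 1

        // For the country name both counts agree, so none of this affects it.
        System.out.println(country.length() == codePoints(country)); // true
    }
}
```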

Since you're comparing with a string literal, you need to make sure that you're saving your source file in the same encoding that javac is expecting. You can also specify what encoding your source files are in with the -encoding argument to javac.

That seems like the most likely "gotcha" in this scenario.

Note that I'm talking about the encoding of your Java source code, not the XML document.
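One way to take the source file encoding out of the picture, sketched here as an option rather than as the fix, is to write the non-ASCII character as a Unicode escape. The literal then contains only ASCII in the source file, so it compiles to the same String whatever encoding javac assumes. The class, constant and method names below are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class EscapedLiteral {

    // \u00f4 is the o with circumflex, so this is the same String as the
    // literal from the question, written with ASCII characters only.
    public static final String COTE_DIVOIRE = "C\u00f4te d'Ivoire";

    // Null-safe comparison against the constant.
    public static boolean isCoteDIvoire(String country) {
        return COTE_DIVOIRE.equals(country);
    }

    public static void main(String[] args) {
        // Round-trip through UTF-8 bytes to imitate text decoded from a file.
        byte[] utf8 = COTE_DIVOIRE.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(isCoteDIvoire(decoded)); // true
    }
}
```

If the comparison starts working after switching to the escaped form, the original source file was most likely being compiled with a different encoding than it was saved in.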

Java strings are always UTF-16. Your XML parser should be converting the file's UTF-8 characters into UTF-16 while reading, and your own strings are already UTF-16 in memory, so you can compare them with an ordinary equals() call. If they aren't comparing equal when you think they should, the problem is likely something else.
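A minimal, self-contained sketch of that round trip, assuming a tiny in-memory document with a single country element in place of the real XML (the element name and the SaxCompareDemo class are made up for the example). It also accumulates text in a StringBuilder, because a SAX parser is allowed to deliver one text node through several characters() calls, which is one of the "something else" causes worth ruling out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCompareDemo {

    // Parses the XML bytes and returns all character data, trimmed.
    public static String readText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called more than once per text node, so append.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<country>C\u00f4te d'Ivoire</country>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        // The parser decodes the UTF-8 bytes; a plain equals() is enough.
        System.out.println("C\u00f4te d'Ivoire".equals(readText(bytes))); // true
    }
}
```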

If your XML file is tagged as UTF-8 in its XML declaration and the text file is saved as an actual UTF-8 file, you can use contentEquals(literal or string) like so:

if (strMyvalue.contentEquals("Côte d'Ivoire")) {
    // execute
}
