
Reading XML file encoded in UTF16 in Java

I am trying to read a UTF-16 xml file with Java. The file was written with C#.

Here's the java code:

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XMLReadTest
{
    public static void main(String[] s)
    {
        try
        {
            File fXmlFile = new File("C:\\my_file.xml");

            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(fXmlFile);

            doc.getDocumentElement().normalize();

            NodeList nList = doc.getElementsByTagName("row");

            for (int temp = 0; temp < nList.getLength(); temp++)
            {
                Node nNode = nList.item(temp);

                if (nNode.getNodeType() == Node.ELEMENT_NODE)
                {
                    Element eElement = (Element) nNode;

                    System.out.println("FILE_NAME: " + eElement.getElementsByTagName("FILE_NAME").item(0).getTextContent());
                }
            }
        }
        catch(Exception ex)
        {
            ex.printStackTrace();
        }
    }
}

And here's the xml file:

<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<docMetadata>
  <row>
    <FILE_NAME>Выписка_Винтовые насосы.pdf</FILE_NAME>
    <FILE_CAT>GENERAL</FILE_CAT>
  </row>
</docMetadata>

When I run this code in Eclipse with the encoding on the 'Common' tab (the last tab of the Run/Debug settings window) left at Default - Inherited (Cp1253), the output is wrong:

FILE_NAME: ???????_???????? ??????.pdf

When the selected encoding on the same tab is UTF-8, the output is correct:

FILE_NAME: Выписка_Винтовые насосы.pdf

What am I doing wrong?

How can I get the correct output with the default encoding (cp 1253) in eclipse project settings?

This code runs in a server where I don't want to change the default encoding of the virtual machine.

I have tested this code with both Java 7 and Java 8.

The problem has nothing to do with the XML itself. Java strings are UTF-16 encoded, and the Document is correctly decoding the XML data to UTF-16 strings. The real problem is that you have Eclipse set to use cp1253 (Windows-1253 Greek, which is slightly different from ISO-8859-7 Greek) for its console charset. Most of the Unicode characters you are trying to output (the Russian letters) simply do not exist in that charset, so they are replaced with ? when the text is encoded for the console. That also explains why the output is correct when the console charset is set to UTF-8: UTF-8 <-> UTF-16 conversions are lossless.
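The replacement can be reproduced without Eclipse or XML at all. The following is a minimal sketch, not part of the original code: the class name ConsoleCharsetDemo and the helper roundTrip are made up for illustration, and the sample string is the first word of the file name written with Unicode escapes so the source compiles under any source encoding. It falls back to ISO-8859-1 (which also lacks Cyrillic) if the JVM does not ship windows-1253.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleCharsetDemo
{
    // Encodes the text to bytes in the given charset and decodes it back.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character the charset cannot represent.
    public static String roundTrip(String text, Charset charset)
    {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args)
    {
        // "Выписка.pdf" written with escapes (hypothetical sample value)
        String name = "\u0412\u044B\u043F\u0438\u0441\u043A\u0430.pdf";

        // Assumption: windows-1253 is usually available, but it is not a guaranteed charset
        Charset greek = Charset.isSupported("windows-1253")
                ? Charset.forName("windows-1253")
                : StandardCharsets.ISO_8859_1;

        System.out.println(greek + " keeps the text intact: " + roundTrip(name, greek).equals(name));
        System.out.println("UTF-8 keeps the text intact: " + roundTrip(name, StandardCharsets.UTF_8).equals(name));
        System.out.println(greek + " result: " + roundTrip(name, greek));
    }
}
```

The single-byte charset reports false and yields ???????.pdf, while UTF-8 reports true, which mirrors the two console outputs in the question.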

Try setting the encoding explicitly on the input stream:

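// DocumentBuilder.parse() has no Reader overload, so wrap the reader in an org.xml.sax.InputSource
Document doc = dBuilder.parse(new InputSource(new InputStreamReader(new FileInputStream(fXmlFile), "UTF-16")));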
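To check the parsing step independently of any console, here is a small self-contained sketch under a few assumptions: the class name Utf16ParseCheck and the helper firstFileName are hypothetical, the XML is built in memory instead of being read from C:\my_file.xml, and only the first word of the file name is used. It prints code points rather than the characters themselves, so the result is readable even on a cp1253 console.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Utf16ParseCheck
{
    // Parses UTF-16 encoded XML bytes with an explicitly decoded reader
    // and returns the text of the first FILE_NAME element.
    public static String firstFileName(byte[] utf16Xml) throws Exception
    {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(utf16Xml), StandardCharsets.UTF_16);
        Document doc = builder.parse(new InputSource(reader));
        return doc.getElementsByTagName("FILE_NAME").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception
    {
        // In-memory stand-in for the file from the question (sample value, escaped Cyrillic)
        String xml = "<?xml version=\"1.0\" encoding=\"utf-16\" standalone=\"yes\"?>"
                + "<docMetadata><row><FILE_NAME>\u0412\u044B\u043F\u0438\u0441\u043A\u0430.pdf</FILE_NAME></row></docMetadata>";

        String text = firstFileName(xml.getBytes(StandardCharsets.UTF_16));

        // Print each character as a code point so the console charset cannot hide the result
        for (char c : text.toCharArray())
        {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

If the printed code points start with U+0412 (the Cyrillic letter В), the string in memory is intact and any ? characters come from the output side, not from the parser.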

How can I get the correct output with the default encoding (cp 1253) in eclipse project settings?

You can't. To show the correct output, the console charset must be able to represent the characters you want to display.

This code runs in a server where I don't want to change the default encoding of the virtual machine.

You could write a UTF-8/16 log file where you can see the output with cat from another console or a text editor.

            if (nNode.getNodeType() == Node.ELEMENT_NODE)
            {
                Element eElement = (Element) nNode;
                String message = "FILE_NAME: " + eElement.getElementsByTagName("FILE_NAME").item(0).getTextContent();
                System.out.println(message);
                // Append FILE_NAME to logfile.txt as UTF-8 (quick and dirty);
                // the "true" argument appends instead of overwriting the file on every row
                OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(new File("logfile.txt"), true), "UTF-8");
                writer.write(message + System.lineSeparator());
                writer.close();
            }

I ran this code in Eclipse with ISO-8859-1 as the encoding in the run configuration.

Eclipse output: FILE_NAME: ???????_???????? ??????.pdf

logfile output: FILE_NAME: Выписка_Винтовые насосы.pdf
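A slightly tidier variant of the same idea, as a sketch only: the class Utf8Log and the method appendLine are hypothetical names, and the file name logfile.txt is taken from the snippet above. It uses try-with-resources so the writer is closed even if writing fails, and opens the file in append mode.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class Utf8Log
{
    // Appends one line to the log file, always encoded as UTF-8,
    // independent of the JVM default charset or the console charset.
    public static void appendLine(Path logFile, String message) throws IOException
    {
        try (BufferedWriter writer = Files.newBufferedWriter(logFile, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND))
        {
            writer.write(message);
            writer.newLine();
        }
    }

    public static void main(String[] args) throws IOException
    {
        // Sample message with escaped Cyrillic ("Выписка.pdf")
        appendLine(Paths.get("logfile.txt"), "FILE_NAME: \u0412\u044B\u043F\u0438\u0441\u043A\u0430.pdf");
    }
}
```

The log file then has to be opened with a viewer that reads UTF-8; the JVM default encoding on the server stays untouched.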

I was using an old dom4j library to parse the XML, and that was causing the problem. Using the parser built into the Java 1.7 runtime solved the problem:

import java.io.File;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public XMLDoc()
    {
        try
        {
            File xmlFile = new File("C:\\my_file.xml");
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(xmlFile);
            doc.getDocumentElement().normalize();

            NodeList nList = doc.getElementsByTagName("row");
            for (int i = 0; i < nList.getLength(); i++)
            {
                Node nNode = nList.item(i);

                if (nNode.getNodeType() == Node.ELEMENT_NODE)
                {
                    Element eElement = (Element) nNode;
                    Node itemNode = eElement.getElementsByTagName("FILE_NAME").item(0);
                    String text = itemNode != null ? itemNode.getTextContent() : "";

                    // russian text is fine here
                }
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
