
Convert Windows-1252 xml file to UTF-8

Is there any way to convert a large XML file (500+ MB) from "Windows-1252" encoding to "UTF-8" encoding in Java?

Sure:

  • Open a FileInputStream wrapped in an InputStreamReader with the Windows-1252 encoding for the input
  • Open a FileOutputStream wrapped in an OutputStreamWriter with the UTF-8 encoding for the output
  • Create a buffer char array (e.g. 16K)
  • Repeatedly read into the array and write however much has been read:

     char[] buffer = new char[16 * 1024];
     int charsRead;
     while ((charsRead = input.read(buffer)) > 0) {
         output.write(buffer, 0, charsRead);
     }
  • Don't forget to close the output afterwards! (Otherwise there could be buffered data which never gets written to disk.)
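
Putting the steps above together, a minimal sketch might look like the following. The class name Cp1252ToUtf8 and the method convert are made up for illustration, and the code assumes the JRE ships the windows-1252 charset (most do, but the Java specification does not guarantee it). The try-with-resources block closes both streams, which also flushes the writer.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Cp1252ToUtf8 {

    // Copies inPath to outPath, decoding as windows-1252 and encoding as UTF-8.
    static void convert(String inPath, String outPath) throws IOException {
        // Assumption: this charset is available in the running JRE.
        Charset cp1252 = Charset.forName("windows-1252");
        try (Reader input = new InputStreamReader(new FileInputStream(inPath), cp1252);
             Writer output = new OutputStreamWriter(
                     new FileOutputStream(outPath), StandardCharsets.UTF_8)) {
            char[] buffer = new char[16 * 1024];
            int charsRead;
            // Only one buffer's worth of characters is held at a time.
            while ((charsRead = input.read(buffer)) > 0) {
                output.write(buffer, 0, charsRead);
            }
        } // Closing the writer here flushes any buffered data to disk.
    }

    public static void main(String[] args) throws IOException {
        convert(args[0], args[1]);
    }
}
```

This copies the XML declaration unchanged, so the output may still claim to be Windows-1252.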

Note that since it's XML, you may well need to change the XML declaration manually too, as it presumably declares the encoding as Windows-1252.
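
If editing the declaration by hand is not practical, one possible approach is to rewrite it in code. The sketch below is an assumption of mine rather than part of the answer: XmlDeclarationFixer and withUtf8Encoding are hypothetical names, and the regular expression only handles a declaration at the very start of the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of any library.
final class XmlDeclarationFixer {

    // Matches the encoding pseudo-attribute of a leading XML declaration.
    // Group 1 is everything up to the quote, group 2 is the quote character.
    private static final Pattern ENCODING = Pattern.compile(
            "^(<\\?xml[^>]*?\\bencoding\\s*=\\s*)([\"'])[A-Za-z][A-Za-z0-9._-]*\\2");

    // Returns head with the declared encoding replaced by UTF-8,
    // or head unchanged when it has no leading declaration with an encoding.
    static String withUtf8Encoding(String head) {
        Matcher m = ENCODING.matcher(head);
        return m.find() ? m.replaceFirst("$1$2UTF-8$2") : head;
    }
}
```

In a streaming copy, this could be applied to the first chunk of characters before writing it, on the assumption that the whole declaration arrives in that first read.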

Because this works on a streaming basis, you don't need to worry about the size of the file: it only holds up to 16K characters in memory at a time.

Is this a one-off or a job that you need to run repeatedly and make efficient?

If it's a one-off, I don't see the need for Java coding. Just run the query ".", for example

java net.sf.saxon.Query -s:input.xml -qs:. -o:output.xml

making sure you allocate, say, 3 GB of memory to the JVM (for example with -Xmx3g).

If you're doing it repeatedly and want a streamed approach, you have to choose between handling it as text (as Jon Skeet suggests) or as XML. The advantage of doing it as XML is primarily that the XML declaration will get taken care of, and character references will be converted to characters. The simplest way is to use a JAXP identity transformation:

Source in = new StreamSource(new File("input.xml"));
TransformerFactory f = TransformerFactory.newInstance();
Result out = new StreamResult(new File("output.xml"));
f.newTransformer().transform(in, out);
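
For completeness, here is the same identity transformation as a self-contained sketch with its imports. The class name IdentityTranscode and the method transcode are illustrative, and the explicit OutputKeys.ENCODING setting is my addition to make the target encoding visible; the snippet above relies on the transformer's default output encoding instead.

```java
import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper class; the names are illustrative only.
final class IdentityTranscode {

    // Parses input using the encoding named in its XML declaration
    // and serializes the same document to output as UTF-8.
    static void transcode(File input, File output) throws TransformerException {
        Source in = new StreamSource(input);
        Result out = new StreamResult(output);
        // newTransformer() with no stylesheet yields an identity transformer.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Request UTF-8 output explicitly (an addition to the original snippet).
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(in, out);
    }

    public static void main(String[] args) throws TransformerException {
        transcode(new File(args[0]), new File(args[1]));
    }
}
```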

If this is a one-off, Java may not be the most appropriate tool. Consider iconv :

iconv -f windows-1252 -t utf-8 <source.xml >target.xml

This has all the benefits of streaming without requiring you to write any code.

Unlike Michael's solution, this won't take care of the XML declaration. Edit it manually if necessary or, now that you're using UTF-8, omit it (XML parsers assume UTF-8 when no encoding is declared).

