
Convert Windows-1252 XML file to UTF-8

Is there any way to convert a large XML file (500+ MB) from "Windows-1252" encoding to "UTF-8" encoding in Java?

Sure:

  • Open a FileInputStream wrapped in an InputStreamReader with the Windows-1252 encoding for the input
  • Open a FileOutputStream wrapped in an OutputStreamWriter with the UTF-8 encoding for the output
  • Create a buffer char array (e.g. 16K)
  • Repeatedly read into the array and write however much has been read:

     char[] buffer = new char[16 * 1024];
     int charsRead;
     while ((charsRead = input.read(buffer)) > 0) {
         output.write(buffer, 0, charsRead);
     }

  • Don't forget to close the output afterwards! (Otherwise there could be buffered data which never gets written to disk.)
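
Put together, the steps above might look like the following minimal sketch. The class name Cp1252ToUtf8 and the convert method are made up for illustration; the try-with-resources block handles the "don't forget to close" step, so buffered data is flushed even if an exception is thrown.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class Cp1252ToUtf8 {

    // Copies sourcePath to targetPath, decoding Windows-1252 and encoding UTF-8.
    static void convert(String sourcePath, String targetPath) throws IOException {
        Charset cp1252 = Charset.forName("windows-1252");
        // try-with-resources closes (and therefore flushes) both streams.
        try (Reader input = new InputStreamReader(new FileInputStream(sourcePath), cp1252);
             Writer output = new OutputStreamWriter(new FileOutputStream(targetPath),
                                                    StandardCharsets.UTF_8)) {
            char[] buffer = new char[16 * 1024];
            int charsRead;
            // Only one buffer's worth of characters is held in this array at a time.
            while ((charsRead = input.read(buffer)) > 0) {
                output.write(buffer, 0, charsRead);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Cp1252ToUtf8.java input.xml output.xml
        convert(args[0], args[1]);
    }
}
```

This copies the text verbatim, so the XML declaration inside the file is left untouched.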

Note that as it's XML, you may well need to manually change the XML declaration as well, since it presumably declares the encoding as Windows-1252.
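
If you would rather not edit the declaration by hand, one option is to patch it in code. The following is only a sketch: the class XmlDeclarationFixer and the method withUtf8Declaration are hypothetical names, and it assumes the whole declaration sits at the very start of the string you pass in (for example the first chunk of characters read from the file).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class XmlDeclarationFixer {

    // Matches the encoding pseudo-attribute of a leading <?xml ... ?> declaration,
    // with either double or single quotes around the value.
    private static final Pattern ENCODING =
            Pattern.compile("^(<\\?xml\\b[^>]*?\\bencoding\\s*=\\s*)([\"'])[^\"']*\\2");

    // Returns head with the declared encoding replaced by UTF-8.
    // Input without a leading declaration (or without an encoding) is returned unchanged.
    static String withUtf8Declaration(String head) {
        Matcher m = ENCODING.matcher(head);
        return m.find() ? m.replaceFirst("$1$2UTF-8$2") : head;
    }
}
```

For example, passing in `<?xml version="1.0" encoding="windows-1252"?><root/>` yields the same text with `encoding="UTF-8"`.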

Because this works on a streaming basis, you don't need to worry about the size of the file: it only holds up to 16K characters in memory at a time.

Is this a one-off or a job that you need to run repeatedly and make efficient?

If it's a one-off, I don't see the need for Java coding. Just run the query "." with Saxon, for example:

java net.sf.saxon.Query -s:input.xml -qs:. -o:output.xml

making sure you allocate, say, 3 GB of memory to the JVM (for example with the -Xmx3g option).

If you're doing it repeatedly and want a streamed approach, you have to choose between handling it as text (as Jon Skeet suggests) or as XML. The advantage of doing it as XML is primarily that the XML declaration will get taken care of, and character references will be converted to characters. The simplest is to use a JAXP identity transformation:

import java.io.File;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

Source in = new StreamSource(new File("input.xml"));
TransformerFactory f = TransformerFactory.newInstance();
Result out = new StreamResult(new File("output.xml"));
f.newTransformer().transform(in, out);
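
As a self-contained variant of the same idea, here is a sketch that also sets the output encoding explicitly rather than relying on the transformer's default. The class name IdentityTransformToUtf8 and the convert method are hypothetical; the JAXP calls themselves are the standard javax.xml.transform API.

```java
import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper class, named for illustration only.
class IdentityTransformToUtf8 {

    // Parses source as XML (honouring its declared encoding) and re-serializes it as UTF-8.
    static void convert(File source, File target) throws TransformerException {
        // newTransformer() without a stylesheet gives an identity transformer.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // UTF-8 is normally the default output encoding; it is set here to make the intent explicit.
        identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        identity.transform(new StreamSource(source), new StreamResult(target));
    }

    public static void main(String[] args) throws TransformerException {
        // Usage: java IdentityTransformToUtf8.java input.xml output.xml
        convert(new File(args[0]), new File(args[1]));
    }
}
```

Because the input is parsed as XML, this relies on the source file's declaration correctly naming Windows-1252; the serializer then writes a new declaration for the output.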

If this is a one-off, Java may not be the most appropriate tool. Consider iconv:

iconv -f windows-1252 -t utf-8 <source.xml >target.xml

This has all the benefits of streaming without requiring you to write any code.

Unlike Michael's solution, this won't take care of the XML declaration. Edit this manually if necessary or, now that you're using UTF-8, omit it (an XML file without an encoding declaration or byte order mark is read as UTF-8).

