Convert Windows-1252 XML file to UTF-8
Is there any way to convert a large XML file (500+ MB) from "Windows-1252" encoding to "UTF-8" encoding in Java?
Sure:
Open a FileInputStream wrapped in an InputStreamReader with the Windows-1252 encoding for the input.
Open a FileOutputStream wrapped in an OutputStreamWriter with the UTF-8 encoding for the output.
Repeatedly read into a char array and write out however much has been read:
char[] buffer = new char[16 * 1024];
int charsRead;
while ((charsRead = input.read(buffer)) > 0) {
    output.write(buffer, 0, charsRead);
}
Note that as it's XML, you may well need to manually change the XML declaration too, since it presumably states that the file is encoded in Windows-1252.
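As a rough illustration of that manual change, the sketch below rewrites the encoding pseudo-attribute of a leading XML declaration to UTF-8. The class and method names (XmlDeclarationFixer, withUtf8Encoding) are made up for this example, and it assumes the declaration sits at the very start of the string with no byte order mark in front of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class XmlDeclarationFixer {
    // Matches the text up to and including the quoted encoding value
    // of an XML declaration at the very start of the input.
    private static final Pattern ENCODING = Pattern.compile(
            "^(<\\?xml[^>]*?\\bencoding\\s*=\\s*)([\"'])[^\"']*\\2");

    // Returns the input with the declared encoding replaced by UTF-8,
    // or the input unchanged if no such declaration is found.
    static String withUtf8Encoding(String head) {
        Matcher m = ENCODING.matcher(head);
        return m.find() ? m.replaceFirst("$1$2UTF-8$2") : head;
    }
}
```

One way to use it would be to apply it to the first chunk of characters read from the input before writing that chunk out, on the assumption that the whole declaration fits inside the first buffer.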
Because this works on a streaming basis, you don't need to worry about the size of the file: the buffer holds at most 16K characters in memory at a time.
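The steps above can be put together into a small program. This is a minimal sketch: the class name Cp1252ToUtf8 and the use of command-line arguments for the two file names are assumptions made for illustration, and the XML declaration is copied through unchanged.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class name; the loop is the one described in the answer.
class Cp1252ToUtf8 {
    // Decodes source as windows-1252 and writes the same characters to target as UTF-8.
    static void convert(File source, File target) throws IOException {
        Charset cp1252 = Charset.forName("windows-1252");
        // try-with-resources closes (and flushes) both streams, even on failure.
        try (Reader input = new InputStreamReader(new FileInputStream(source), cp1252);
             Writer output = new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8)) {
            char[] buffer = new char[16 * 1024];
            int charsRead;
            while ((charsRead = input.read(buffer)) > 0) {
                output.write(buffer, 0, charsRead);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the input file, args[1] the output file.
        convert(new File(args[0]), new File(args[1]));
    }
}
```

Passing the charsets explicitly matters here: without them, the reader and writer would fall back to the platform default encoding.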
Is this a one-off, or a job that you need to run repeatedly and make efficient?
If it's a one-off, I don't see the need for Java coding. Just run the query "." with Saxon, for example:
java net.sf.saxon.Query -s:input.xml -qs:. -o:output.xml
making sure you allocate, say, 3 GB of memory to the JVM (for example with -Xmx3g).
If you're doing it repeatedly and want a streamed approach, you have to choose between handling it as text (as Jon Skeet suggests) or as XML. The main advantage of handling it as XML is that the XML declaration gets taken care of, and character references are converted to characters. The simplest way is to use a JAXP identity transformation:
Source in = new StreamSource(new File("input.xml"));
TransformerFactory f = TransformerFactory.newInstance();
Result out = new StreamResult(new File("output.xml"));
f.newTransformer().transform(in, out);
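For reference, the same identity transformation can be written as a self-contained class that states the output encoding explicitly rather than relying on the serializer's default. This is only a sketch: the class name IdentityTranscode and the command-line arguments are assumptions, and the exact form of the declaration written to the output depends on the JAXP implementation in use.

```java
import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name; the JAXP calls are the ones used in the answer.
class IdentityTranscode {
    // Parses input using the encoding named in its XML declaration
    // and serializes the same document to output as UTF-8.
    static void transcode(File input, File output) throws TransformerException {
        // newTransformer() with no stylesheet gives an identity transformation.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        Source in = new StreamSource(input);
        Result out = new StreamResult(output);
        identity.transform(in, out);
    }

    public static void main(String[] args) throws TransformerException {
        // args[0] is the input file, args[1] the output file.
        transcode(new File(args[0]), new File(args[1]));
    }
}
```

Because the input is parsed as XML, this relies on the source file's declaration naming its real encoding (Windows-1252) correctly.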
If this is a one-off, Java may not be the most appropriate tool. Consider iconv:
iconv -f windows-1252 -t utf-8 <source.xml >target.xml
This has all the benefits of streaming without requiring you to write any code.
Unlike Michael's solution, this won't take care of the XML declaration. Edit it manually if necessary or, now that you're using UTF-8, omit it.