Expat-based XML parsing script does not work on Linux, works on Windows

Question

I'm writing a set of tools in Python to extract data from XML files generated by traffic simulation software. As the resulting files can be quite big, I use xml.parsers.expat to parse them.

The issue is that when I run my scripts at work on a Windows XP machine they work perfectly, but at home, on Ubuntu 10.10, with the very same file I get the following error:
ExpatError: not well-formed (invalid token): line 1, column 0

The file was originally encoded in UTF-8 and the encoding declared in the tag was ascii, so I tried changing it to utf-8 (or UTF8 or utf8), without success. As the BOM was absent I tried writing one, still without success. I also tried replacing Windows line breaks (CR/LF) with Unix ones (LF), also without success.

Also, the Python version at work is 2.7.1 and on my Ubuntu box it's 2.6.6, but I don't think my issue is related to that: I upgraded my work computer's Python from 2.6 to 2.7 a few weeks ago without trouble.

As I'm not an expert here, I'm running out of ideas. Any hints?

Edit: After further investigation (I've got a headache now, I hate Unicode-related trouble) it looks like the issue was solved by properly setting the system environment variables LANG, LC_ALL and LANGUAGE to (in my case) "fr_FR.utf-8". I don't understand why they weren't set at first, nor why it works now...

Thank you guys for the hand!

Answer 1

I had the same problem. Instead of trying to parse the file directly, like this:

document = xmltodict.parse("myfile.xml") # Passes the file name itself as the XML text, not the file's contents

I parsed it indirectly, by first opening the XML document through a file object, like this:

import xmltodict

with open("myfile.xml", "r") as document_file: # Open the file in read-only mode
    original_doc = document_file.read() # Read the whole document into a string
document = xmltodict.parse(original_doc) # Parse the document string that was read

and it worked.
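The difference between handing a parser a path and handing it the document can be sketched with the standard library alone. This is only an illustration, assuming Python 3 and using xml.parsers.expat in place of xmltodict; the temporary sample file and the helper name element_names are made up for the demonstration.

```python
import os
import tempfile
import xml.parsers.expat


def element_names(path):
    """Return the element names found in the XML file at `path`."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    with open(path, "rb") as handle:  # bytes, so expat does the decoding itself
        parser.ParseFile(handle)
    return names


# Hypothetical sample file, created only for this demonstration.
fd, sample_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as handle:
    handle.write(b'<?xml version="1.0" encoding="utf-8"?><root><item/></root>')

try:
    # The path is not XML text, so parsing it as a document fails.
    xml.parsers.expat.ParserCreate().Parse(sample_path, True)
    path_was_accepted = True
except xml.parsers.expat.ExpatError:
    path_was_accepted = False

names_from_file = element_names(sample_path)
os.remove(sample_path)
print(path_was_accepted, names_from_file)
```

Reading the file in binary mode leaves the decoding to expat, which then follows the encoding given in the XML declaration rather than the system locale.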

Answer 2

Excerpts from the documentation:

xml.parsers.expat.XML_ERROR_INVALID_TOKEN
Raised when an input byte could not properly be assigned to a character; for example, a NUL byte (value 0) in a UTF-8 input stream.

ExpatError.lineno
Line number on which the error was detected. The first line is numbered 1.

ExpatError.offset
Character offset into the line where the error occurred. The first column is numbered 0.

The above tends to indicate that you have a problem with the very first byte in your file.

Start with the original file, the one that worked on Windows. Edit your question to show the results of doing this:
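As a rough sketch of how those attributes can be read in practice (assuming Python 3; the helper name first_error and the sample inputs are invented for illustration), the following catches an ExpatError and reports its line, column and message. The bad input places a NUL byte inside an otherwise valid document, which is the case the documentation excerpt mentions.

```python
import xml.parsers.expat


def first_error(data):
    """Parse `data` and return (lineno, offset, message), or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None


good = first_error(b"<root/>")
bad = first_error(b"<root>\x00</root>")  # NUL byte inside the document
print(good)
print(bad)
```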

python -c "print repr(open('win_ok_file.xml', 'rb').read(200))"

which will show unambiguously what is in the first 200 bytes of your file.

Also show us a cut-down version of your code that you have checked works on Windows, getting past the initial error, but reproduces the problem on Linux.
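If reading the raw repr is awkward, a small helper can classify the leading bytes. This is a sketch under assumptions: it targets Python 3, the function name describe_start is made up here, and it only distinguishes a UTF-8 BOM, a UTF-16 BOM, a plain '<', and anything else.

```python
import codecs


def describe_start(data):
    """Describe what the leading bytes of an XML document look like."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16 BOM"
    if data.startswith(b"<"):
        return "starts with '<'"
    return "unexpected leading bytes: %r" % data[:4]


# With the real file this would be, for example:
#   describe_start(open('win_ok_file.xml', 'rb').read(200))
print(describe_start(b"\xef\xbb\xbf<?xml version='1.0'?><root/>"))
print(describe_start(b"<?xml version='1.0'?><root/>"))
```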

Some assertions, for what they are worth:

"The file was originally encoded in utf-8 and the encoding declared in the tag was ascii" ... If the encoding in the XML declaration is "ascii" but there are non-ASCII characters in the file, conforming parsers should raise an exception. Are you sure of what you report?
The default encoding for XML documents is UTF-8. In other words, if the encoding is not mentioned in the XML declaration, or there is no XML declaration at all, the parser is required to decode using UTF-8.
Putting a UTF-8 BOM at the start is more likely to hinder than help.
The XML standard requires parsers to accept CR as a valid byte in an XML document and then immediately pretend it didn't exist (except maybe in an element with xml:space="preserve"). Changing CR LF to LF is not a good idea.

And some questions: How many bytes are in a "quite big" file? Have you considered using iterparse() from xml.etree.cElementTree or lxml?
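For reference, an incremental pass with iterparse() might look like the sketch below. It assumes Python 3, where xml.etree.ElementTree takes the place of cElementTree; the element name vehicle and the in-memory sample are hypothetical stand-ins for the simulation output.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` without building the whole tree in memory."""
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # drop the element's children and text once handled
    return count


# Hypothetical stand-in for a large simulation output file.
sample = io.BytesIO(
    b"<simulation><vehicle id='1'/><vehicle id='2'/></simulation>"
)
vehicle_count = count_elements(sample, "vehicle")
print(vehicle_count)
```

Passing a file name or a file opened in binary mode instead of the BytesIO object works the same way, and clearing each element after use keeps memory usage roughly flat on large inputs.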

Answer 1
3 2014-04-04 13:07:24

Answer 2
3 accepted 2011-02-22 19:12:09
