
ElementTree will not parse special characters with Python 2.7

I had to rewrite my Python script from Python 3 to Python 2, and after that I ran into a problem parsing special characters with ElementTree.

This is a piece of my XML:

<account number="89890000" type="Kostnad" taxCode="597" vatCode="">Avsättning egenavgifter</account>

This is the output when I parse this row:

('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avs\xc3\xa4ttning egenavgifter')

So it seems to be a problem with the character "ä".

This is how I do it in the code:

sys.setdefaultencoding( "UTF-8" )
xmltree = ET()

xmltree.parse("xxxx.xml")

printAccountPlan(xmltree)

def printAccountPlan(xmltree):
    print("account:",str(i.attrib['number']),      "AccountType:",str(i.attrib['type']),"Name:",str(i.text))

Does anyone have an idea how to get ElementTree to parse the character "ä", so that the result looks like this:

('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')

You're running into two separate differences between Python 2 and Python 3 at the same time, which is why you're getting unexpected results.

The first difference is one you're probably already aware of: Python's print statement in version 2 became a print function in version 3. That change creates a special circumstance in your case, which I'll get to a little later. But briefly, this is the difference in how 'print' works:

In Python 3:

>>> # Two arguments, 'Hi' and 'there', are passed to the function 'print'.
>>> # They are joined with a space separator and printed.
>>> print('Hi', 'there')
Hi there

In Python 2:

>>> # 'print' is a statement, which doesn't need parentheses.
>>> # The parentheses instead create a tuple containing two elements,
>>> # 'Hi' and 'there'. This tuple is then printed.
>>> print('Hi', 'there')
('Hi', 'there')
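To see both behaviours side by side without switching interpreters, here is a small sketch that runs unchanged on Python 2.7 and Python 3. The variable names are only illustrative; the point is that the parentheses build a tuple, and the tuple's own text form is what the Python 2 statement ends up displaying.

```python
# Runs unchanged on Python 2.7 and Python 3.
pair = ('Hi', 'there')  # parentheses plus a comma build a tuple

# What the Python 2 statement shows for print('Hi', 'there'):
# the text form of the tuple itself.
python2_view = str(pair)

# What the Python 3 function shows: the arguments joined by one space.
python3_view = ' '.join(pair)

print(python2_view)  # ('Hi', 'there')
print(python3_view)  # Hi there
```

On Python 2.7, placing `from __future__ import print_function` as the first statement of the module makes print behave as the Python 3 function, so the tuple is never created.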

The second problem in your case is that tuples print themselves by calling repr() on each of their elements. In Python 3, repr() displays Unicode as you want. But in Python 2, repr() uses escape sequences for any byte values that fall outside the printable ASCII range (e.g., larger than 127). This is why you're seeing them.
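The escaping can be reproduced on a current interpreter. The following is a rough Python 3 sketch in which a `bytes` object stands in for the Python 2 byte string that `str(i.text)` produced; that stand-in is an assumption made for illustration, not code from the question.

```python
# Python 3 sketch: bytes plays the role of Python 2's byte-string type.
name = "Avs\u00e4ttning egenavgifter"
encoded = name.encode("utf-8")  # "ä" becomes the two bytes 0xC3 0xA4

# A tuple builds its text form from repr() of each element, and repr()
# of a byte string escapes every byte above 127.
as_tuple = str(("Name:", encoded))
print(as_tuple)  # ('Name:', b'Avs\xc3\xa4ttning egenavgifter')

# Decoding the same bytes gives back the original text, so nothing was lost.
round_trip = encoded.decode("utf-8")
```

The `\xc3\xa4` pair is simply the UTF-8 encoding of "ä" written out byte by byte, which suggests the parser read the character correctly and only the display differs.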

You may decide to resolve this issue, or not, depending on what your goal is with your code. The representation of a tuple in Python 2 uses escape characters because it's not designed to be displayed to an end user. It's more for your internal convenience as a developer, for troubleshooting and similar tasks. If you're simply printing it for yourself, then you may not need to change a thing, because Python is showing you that the encoded bytes for that non-ASCII character are correctly present in your string. If you do want to display something to the end user in the format of a tuple, then one way to do it (which retains correct printing of Unicode) is to create the formatting manually, like this:

def printAccountPlan(xmltree):
    data = (i.attrib['number'], i.attrib['type'], i.text)
    print "('account:', '%s', 'AccountType:', '%s', 'Name:', '%s')" % data
# Produces this:
# ('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')
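For a version that can be run on its own, here is a minimal sketch that parses the row from the question and applies the same manual formatting. The `<accounts>` wrapper element, the inline `XML` string, and the helper name `format_account` are assumptions added for illustration; the original script reads "xxxx.xml" and its loop is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the <accounts> wrapper is an assumption.
XML = (u'<accounts><account number="89890000" type="Kostnad" taxCode="597" '
       u'vatCode="">Avs\u00e4ttning egenavgifter</account></accounts>')


def format_account(elem):
    """Build the tuple-style line by hand so the text never goes through repr()."""
    data = (elem.get('number'), elem.get('type'), elem.text)
    return u"('account:', '%s', 'AccountType:', '%s', 'Name:', '%s')" % data


# Hand the parser UTF-8 bytes, which both Python 2.7 and Python 3 accept.
root = ET.fromstring(XML.encode('utf-8'))
lines = [format_account(elem) for elem in root.iter('account')]
```

Printing each entry of `lines` should then show "Avsättning egenavgifter" unescaped, provided the terminal encoding can represent "ä".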
