
Python UTF-8 XML parsing (SUDS): Removing 'invalid token'

Here's a common error when dealing with UTF-8: 'invalid token'.

In my case, it comes from a SOAP service provider that had no respect for Unicode characters. It simply truncates values to 100 bytes, ignoring that the 100th byte may fall in the middle of a multi-byte character. For example:

<name xsi:type="xsd:string">浙江家庭教会五十人遭驱散及抓打 圣诞节聚会被断电及抢走物品(图、视频\xef\xbc</name>

The last two bytes are what remains of a 3-byte UTF-8 character after the truncation knife assumed that the world uses 1-byte characters. Next stop, the SAX parser, and:

xml.sax._exceptions.SAXParseException: <unknown>:1:2392: not well-formed (invalid token)
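
The failure can be reproduced without the SOAP service. This is a minimal sketch, assuming Python 3 (the question itself uses Python 2); `truncate_bytes` and `parses_cleanly` are hypothetical helper names, and the sample text is shortened so that the cut lands after two bytes of a 3-byte character, as in the reply above.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def truncate_bytes(text, limit):
    """Cut the UTF-8 encoding of text at a byte count, ignoring character boundaries."""
    return text.encode("utf-8")[:limit]


def parses_cleanly(xml_bytes):
    """Return True if the SAX parser accepts the document, False otherwise."""
    try:
        xml.sax.parseString(xml_bytes, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Two 3-byte characters (6 bytes), then a 3-byte character cut after 2 bytes.
value = truncate_bytes("视频（图）", 8)
document = b"<name>" + value + b"</name>"
```

Here `value` ends in the stray bytes `\xef\xbc`, and `parses_cleanly(document)` reports that the parser rejects the document.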

I don't care about this character anymore. It should be removed from the document so that the SAX parser can function.

The XML reply is valid in every other respect except for these values.

Question: How do you remove this character without parsing the entire document and re-inventing UTF-8 decoding to check every byte?

Using: Python + SUDS

Turns out, SUDS sees the XML as type 'string' (not unicode), so these are encoded values.

1) The FILTER:

badXML = "your bad utf-8 xml here"  # type str (a byte string in Python 2)

# Turn it into a Python unicode string; errors='ignore' drops the invalid bytes
decoded = badXML.decode('utf-8', errors='ignore')  # type unicode

# Turn it back into a byte string, using UTF-8 encoding
goodXML = decoded.encode('utf-8')  # type str
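
On Python 3 the same round trip works on a `bytes` object. The following is a sketch under that assumption, not part of the original answer; `strip_invalid_utf8` is a hypothetical name and the sample reply is made up for illustration.

```python
def strip_invalid_utf8(data):
    """Drop byte sequences that are not valid UTF-8 from a bytes object."""
    # Decoding with errors="ignore" discards the invalid bytes;
    # re-encoding yields clean UTF-8 bytes again.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


bad_xml = b"<name>abc\xef\xbc</name>"  # value cut off inside a 3-byte character
good_xml = strip_invalid_utf8(bad_xml)
```

Input that is already valid UTF-8 passes through unchanged, so the filter can be applied to every reply rather than only to the known-bad ones.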

2) SUDS: see https://fedorahosted.org/suds/wiki/Documentation#MessagePlugin

from suds.plugin import MessagePlugin
class UnicodeFilter(MessagePlugin):
    def received(self, context):
        decoded = context.reply.decode('utf-8', errors='ignore')
        reencoded = decoded.encode('utf-8')
        context.reply = reencoded

and

from suds.client import Client
client = Client(WSDL_url, plugins=[UnicodeFilter()])
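
The body of `received` can be checked without a live SOAP endpoint. This is only a sketch, assuming Python 3: `UnicodeFilterLogic` is a hypothetical plain class that copies the method body without importing suds, and a `SimpleNamespace` stands in for the suds context object.

```python
import xml.sax
from types import SimpleNamespace
from xml.sax.handler import ContentHandler


class UnicodeFilterLogic(object):
    """Same body as UnicodeFilter.received, minus the suds base class."""

    def received(self, context):
        decoded = context.reply.decode("utf-8", errors="ignore")
        context.reply = decoded.encode("utf-8")


# Hypothetical reply: a value truncated after two bytes of a 3-byte character.
truncated_value = "视频（".encode("utf-8")[:-1]
context = SimpleNamespace(reply=b"<name>" + truncated_value + b"</name>")

UnicodeFilterLogic().received(context)

# Raises SAXParseException if the reply is still malformed.
xml.sax.parseString(context.reply, ContentHandler())
```

After the filter runs, the stray bytes are gone and the SAX parser accepts the reply.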

Hope this helps someone.


Note: Thanks to John Machin!

See: Why is python decode replacing more than the invalid bytes from an encoded string?

Python issue8271 regarding errors='ignore' can get in your way here. Without this bug fixed in Python, 'ignore' will consume the next few bytes to satisfy the length announced by the start byte. The fix is described as:

"during the decoding of an invalid UTF-8 byte sequence, only the start byte and the continuation byte(s) are now considered invalid, instead of the number of bytes specified by the start byte"

Issue was fixed in:
Python 2.6.6 rc1
Python 2.7.1 rc1 (and all future releases of 2.7)
Python 3.1.3 rc1 (and all future releases of 3.x)

Python 2.5 and below will contain this issue.

In the example above, "\xef\xbc</name".decode('utf-8', errors='ignore') should return "</name", but in 'bugged' versions of Python it returns "/name".

The first four bits (0xe) describe a 3-byte UTF-8 character, so the bytes 0xef, 0xbc, and then (erroneously) 0x3c ('<') are consumed.

0x3c is not a valid continuation byte, which is what makes the 3-byte sequence invalid in the first place.

Fixed versions of Python remove only the start byte and any valid continuation bytes, leaving 0x3c unconsumed.
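
The byte arithmetic above can be written out as a small sketch. The helper names are hypothetical, and `sequence_length` only reads the length bits of the start byte; it does not reject start bytes that the UTF-8 standard disallows. The last function probes whether the running interpreter has the fixed 'ignore' behavior.

```python
def sequence_length(lead):
    """Number of bytes a UTF-8 start byte announces (0 if it is not a start byte)."""
    if lead < 0x80:
        return 1  # plain ASCII
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3  # 0xef falls here: its top four bits are 0xe
    if lead >> 3 == 0b11110:
        return 4
    return 0  # continuation byte or invalid start byte


def is_continuation(byte):
    """Continuation bytes have the bit pattern 10xxxxxx."""
    return byte >> 6 == 0b10


def ignore_handler_is_fixed():
    """True if errors='ignore' leaves the '<' after the broken sequence intact."""
    return b"\xef\xbc</name".decode("utf-8", "ignore") == "</name"
```

With these, `sequence_length(0xef)` is 3, `is_continuation(0xbc)` is true, and `is_continuation(0x3c)` is false, which is why a fixed decoder stops discarding bytes at '<'.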

@FlipMcF's is the correct answer. I'm just posting my own filter for his solution, because the original one didn't work for me (I had some emoji characters in my XML, which were correctly encoded in UTF-8, but they still crashed XML parsers):

from StringIO import StringIO  # Python 2 module

from lxml import etree
from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        # recover=True is important here: lxml then tries to parse past broken input
        parser = etree.XMLParser(recover=True)
        doc = etree.parse(StringIO(context.reply), parser)
        context.reply = etree.tostring(doc)
