
Unescape _xHHHH_ XML escape sequences using Python

I'm using Python 2.x [not negotiable] to read XML documents [created by others] that allow the content of many elements to contain characters that are not valid XML characters, escaping them using the _xHHHH_ convention, e.g. ASCII BEL aka U+0007 is represented by the 7-character sequence u"_x0007_". Neither the functionality that allows representation of any old character in the document nor the manner of escaping is negotiable. I'm parsing the documents using cElementTree or lxml [semi-negotiable].

Here is my best attempt at unescaping the parser output as efficiently as possible:

import re

def unescape(s,
    # bound as default arguments so the lookups happen once, at definition time
    subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
    repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
    ):
    # cheap guard: only invoke the regex machinery if an escape might be present
    if "_" in s:
        return subber(repl, s)
    return s
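For illustration, a quick interactive check of the function above (the sample strings here are made up, not from the original documents):

>>> unescape(u"abc_x0007_def")
u'abc\x07def'
>>> unescape(u"no escapes here")   # fast path: string returned unchanged
u'no escapes here'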

The above is biased by the observation that "_" occurs with very low frequency in typical text, and that avoiding the regex apparatus where possible more than doubles the speed.
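If you want to verify that kind of claim yourself, a rough timeit comparison along the following lines would do; the sample string and iteration count are arbitrary, and the guarded call is measured against an unconditional regex substitution:

import re
import timeit

# hypothetical, mostly escape-free sample text
sample = u"typical element text with no escapes in it"
subber = re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub
repl = lambda m: unichr(int(m.group(0)[2:6], 16))

def unescape(s):
    # guarded version: skip the regex machinery when no "_" is present
    if "_" in s:
        return subber(repl, s)
    return s

print timeit.timeit(lambda: unescape(sample), number=100000)       # guarded
print timeit.timeit(lambda: subber(repl, sample), number=100000)   # regex every time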

The question: Any better ideas out there?

You might as well check for '_x' rather than just _; that won't matter much, but surely the two-character sequence is even rarer than the single underscore. Apart from such details, you do seem to be making the best of a bad situation!
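A minimal sketch of that variant, assuming the rest of the question's function is unchanged and only the guard tests for the two-character prefix:

import re

def unescape(s,
    subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
    repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
    ):
    # "_x" is rarer than a lone "_" in ordinary text, so the cheap
    # fast path (returning the string untouched) is taken more often
    if "_x" in s:
        return subber(repl, s)
    return s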
