简体   繁体   English

如何使用Python处理lxml中的转义字符串

[英]How to handle escaped string in lxml with Python

I'm trying to use lxml to help me parse some XML files and output it. 我正在尝试使用lxml来帮助我解析一些XML文件并将其输出。 However,there are some special characters in the XML file. 但是,XML文件中有一些特殊字符。 I don't want to replace it because it is too complicated to escape it and unescape it. 我不想替换它,因为它太复杂而无法转义和取消转义。 Also I can't force the others to produce a well-formed XML. 而且我不能强迫其他人生成格式正确的XML。

Is there any way Python can let me handle the non-well-formed XML with lxml? Python有什么办法可以让我使用lxml处理格式不正确的XML?

I can read it in properly: 我可以正确阅读它:

  parser = etree.XMLParser(recover=True)
  root = etree.parse(sys.argv[1],parser=parser)

But when I want to print the element text, it can only print the content until the special character occurs. 但是,当我要打印元素文本时,它只能打印内容,直到出现特殊字符为止。

  for element in root.iter("content"):
    print("%s - %s  attr - %s" % (element.tag, element.text, element.get("name"))) 

lxml does unescaping for you transparently. lxml会为您透明地进行转义。 So you could try to fix invalid characters in the input first and then feed the result to lxml. 因此,您可以尝试先修复输入中的无效字符,然后将结果提供给lxml。 For example, you could try a simple regex-based solution to escape invalid characters . 例如,您可以尝试使用基于正则表达式的简单解决方案来转义无效字符

A popular option to deal with less-than-perfect markup language files in Python is to use Beautiful Soup . 在Python中处理不完美的标记语言文件的一种流行选择是使用Beautiful Soup It can use many parsers, including lxml. 它可以使用许多解析器,包括lxml。

Can you post some of the XML that's giving you problems? 您可以发布一些给您带来麻烦的XML吗?

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM