
Remove non-Unicode characters from an XML database with Python

So I have a 9000-line XML database, saved as a .txt file, which I want to load in Python so I can do some formatting and remove unnecessary tags (I only need some of the tags, but there is a lot of unnecessary information) to make it readable. However, I am getting a UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 608814: character maps to <undefined>, which I assume means that the program ran into a non-Unicode character. I am quite positive that these characters are not important to the program (the data I am looking for is all plain text, with no special symbols), so how can I remove all of them from the txt file when I can't read the file without getting the UnicodeDecodeError?

One crude workaround is to decode the bytes from the file yourself and specify the error handling. E.g.:

# Assumes somefile was opened in binary mode, e.g. open(path, 'rb'), so each line is bytes.
for line in somefile:
    uline = line.decode('ascii', errors='ignore')

That will turn the line into a Unicode string in which any non-ASCII bytes have been dropped. This is not a generally recommended approach: ideally you'd want to process XML with a proper parser, or at least know your file's encoding and open it appropriately (the exact details depend on your Python version). But if you're entirely certain you only care about ASCII characters, this is a simple fallback.
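To make the trade-off concrete, here is a minimal sketch of that fallback applied to a small byte string. The helper name `strip_non_ascii` and the sample data are made up for illustration; they are not from the question.

```python
def strip_non_ascii(raw: bytes) -> str:
    """Decode bytes as ASCII, silently dropping every byte >= 0x80."""
    return raw.decode('ascii', errors='ignore')


# Hypothetical sample: a UTF-8 encoded 'é' (two bytes) and a stray 0x8d byte.
sample = b'<name>caf\xc3\xa9 \x8d ok</name>\n'
cleaned = strip_non_ascii(sample)
print(repr(cleaned))
```

Note that the accented letter disappears along with the stray byte, which is why this only makes sense when you are sure nothing outside ASCII matters.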

The error suggests that you're using the open() function without specifying an explicit character encoding. locale.getpreferredencoding(False) is used in this case (e.g., cp1252). The error says that it is not an appropriate encoding for the input.
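If you want to check this on your own machine, the following sketch prints the default encoding and reproduces the same kind of failure on a single byte. It assumes the default really was cp1252, which the 'charmap' wording in the traceback suggests but does not prove.

```python
import locale

# The encoding open() falls back to when none is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Byte 0x8d has no mapping in Python's cp1252 codec, so decoding it fails.
try:
    b'\x8d'.decode('cp1252')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)
```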

An XML document may contain a declaration at the very beginning that specifies the encoding explicitly. Otherwise the encoding is defined by the BOM, or it is utf-8. If your copy-pasting and saving the file hasn't messed up the encoding and you don't see a line such as <?xml version="1.0" encoding="iso-8859-1" ?> then open the file using utf-8:

with open('input-xml-like.txt', encoding='utf-8', errors='ignore') as file:
    ...
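The rule above (declaration first, then BOM, then utf-8) could be roughed out as below. `sniff_xml_encoding` is a hypothetical helper, not a standard-library function, and it is deliberately simplified: it ignores UTF-32 and assumes the declaration itself is written in an ASCII-compatible encoding.

```python
import codecs
import re


# Hypothetical helper, not part of the standard library.
def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding from a BOM or an XML declaration; default to utf-8."""
    boms = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]
    for bom, name in boms:
        if head.startswith(bom):
            return name
    # Look for encoding="..." inside a leading <?xml ... ?> declaration.
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    if match:
        return match.group(1).decode('ascii')
    return 'utf-8'


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="iso-8859-1" ?><a/>'))
```

In practice you would read the first couple of hundred bytes of the file in binary mode, pass them to the helper, and then reopen the file with the returned encoding.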

If the input is actual XML, then just pass it to an XML parser instead:

import xml.etree.ElementTree as etree

tree = etree.parse('input.xml')
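Since the goal in the question is to keep only some of the tags, here is a small sketch of how that might look once the parser has the document. The sample data and the tag names (`records`, `record`, `title`, `internal`) are invented for illustration. Passing bytes lets the parser honor the encoding declaration itself.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample data; the tag names are invented for illustration.
xml_bytes = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>'
    b'<records>'
    b'<record><title>First</title><internal>x</internal></record>'
    b'<record><title>Caf\xe9</title><internal>y</internal></record>'
    b'</records>'
)

# Parse from bytes so the declared encoding is applied by the parser.
root = etree.fromstring(xml_bytes)

# Keep only the text of the tags you care about.
titles = [element.text for element in root.iter('title')]
print(titles)
```

If the file is not well-formed XML, the parser raises etree.ParseError, which is a hint that the "XML-like" text needs the open()-based approach instead.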
