
Python lxml: how to deal with encoding errors parsing xml strings?

I need help with parsing xml data. Here's the scenario:

  1. I have xml files loaded as strings to a postgresql database.
  2. I downloaded them to a text file for further analysis. Each line corresponds to an xml file.
  3. The strings have different encodings. Some explicitly specify utf-8, others windows-1252. There might be other encodings as well; some don't specify the encoding in the string.
  4. I need to parse these strings for data. The best approach I've found is the following:
from lxml import etree

encoded_string = bytes(bytearray(xml_data, encoding='utf-8'))
root = etree.fromstring(encoded_string)

When it doesn't work, I get error messages like the following:

"Extra content at the end of the document, line 1, column x (<string>, line 1)" 
# x varies with string; I think it corresponds to the last character in the line

Looking at the lines that raise exceptions, it appears the Extra content error comes from files with a windows-1252 encoding.

I need to be able to parse every string, ideally without having to alter them in any way after download. I've tried the following:

  1. Applying 'windows-1252' as the encoding instead.
  2. Reading the string as binary and then applying the encoding.
  3. Reading the string as binary and converting it directly with etree.fromstring.

The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
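For reference, a minimal sketch of what triggers that ValueError and the bytes-based workaround lxml expects (the sample string here is an assumption, not data from the database):

from lxml import etree

sample = '<?xml version="1.0" encoding="windows-1252"?><root>caf\xe9</root>'

# A str that still carries an encoding declaration raises ValueError,
# because lxml cannot apply the declaration to already-decoded text.
try:
    etree.fromstring(sample)
except ValueError as err:
    print(err)

# Passing bytes instead lets lxml decode according to the declaration.
root = etree.fromstring(sample.encode('windows-1252'))
print(root.text)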

What can I do? I need to be able to read these strings but can't figure out how to parse them. The xml strings with the windows encoding all start with <?xml version="1.0" encoding="windows-1252"?>

Given that the table column is text, all the XML content is being presented to Python in UTF-8; as a result, attempting to parse a string with a conflicting XML encoding attribute will cause problems.

Maybe try stripping that attribute from the string before parsing.
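A rough sketch of that idea in Python (the regular expression and helper name are illustrative, not from the original answer):

import re
from lxml import etree

def parse_without_declaration(xml_text):
    # Drop an optional leading <?xml ... ?> declaration so the string no
    # longer claims an encoding that conflicts with the text we actually hold.
    cleaned = re.sub(r'^\s*<\?xml[^>]*\?>', '', xml_text)
    return etree.fromstring(cleaned)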

I solved the problem by removing the encoding information, newline literals, and carriage return literals. Every string parsed successfully once I opened the files that returned errors in vim and ran the following three commands:

:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g

Then lxml parsed the strings without issue.
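For strings that cannot be fixed in the source files, the same three substitutions can be applied in Python before parsing; this is only a sketch of that approach (the helper name is illustrative):

import re
from lxml import etree

def clean_and_parse(raw):
    # Mirror the vim substitutions: drop literal \r and \n sequences
    # and any <?...?> declaration, then parse what remains.
    cleaned = re.sub(r'\\r|\\n', '', raw)
    cleaned = re.sub(r'<\?.*?\?>', '', cleaned)
    return etree.fromstring(cleaned)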

Update:

I have a better solution. The problem was the \n and \r literals in the UTF-8-encoded strings I was copying to text files. I just needed to remove these characters from the strings with regexp_replace, like so:

select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;

Now I can run the following and read the data with lxml without further processing:

psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
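With the literals removed at export time, each line of output.txt should parse directly; a minimal sketch of the reading side (only the file name comes from the command above, the rest is illustrative):

from lxml import etree

with open('output.txt', encoding='utf-8') as handle:
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Re-encode to bytes so lxml can apply any declared encoding itself.
        root = etree.fromstring(line.encode('utf-8'))
        print(root.tag)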
