
Python lxml: how to deal with encoding errors parsing xml strings?

I need help with parsing xml data. Here's the scenario:

  1. I have xml files loaded as strings to a postgresql database.
  2. I downloaded them to a text file for further analysis. Each line corresponds to an xml file.
  3. The strings have different encodings. Some explicitly specify utf-8, others windows-1252. There might be other encodings as well; some don't specify the encoding in the string.
  4. I need to parse these strings for data. The best approach I've found is the following:
from lxml import etree

encoded_string = bytes(bytearray(xml_data, encoding='utf-8'))
root = etree.fromstring(encoded_string)

When it doesn't work, I get error messages like the following:

"Extra content at the end of the document, line 1, column x (<string>, line 1)" 
# x varies with string; I think it corresponds to the last character in the line

Looking at the lines that raise exceptions, it appears the Extra content error comes from files with a windows-1252 encoding.

I need to be able to parse every string, ideally without having to alter them in any way after download. I've tried the following:

  1. Applying 'windows-1252' as the encoding instead.
  2. Reading the string as binary and then applying the encoding.
  3. Reading the string as binary and converting it directly with etree.fromstring.

The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
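For reference, a minimal sketch of what triggers that ValueError and the bytes-based workaround lxml expects (the sample string here is an assumption, not data from the database):

from lxml import etree

sample = '<?xml version="1.0" encoding="windows-1252"?><root>caf\xe9</root>'

# A str that still carries an encoding declaration raises ValueError,
# because lxml cannot apply the declaration to already-decoded text.
try:
    etree.fromstring(sample)
except ValueError as err:
    print(err)

# Passing bytes instead lets lxml decode according to the declaration.
root = etree.fromstring(sample.encode('windows-1252'))
print(root.text)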

What can I do? I need to be able to read these strings but can't figure out how to parse them. The xml strings with the windows encoding all start with <?xml version="1.0" encoding="windows-1252"?>

Given that the table column is text, all the XML content is being presented to Python in UTF-8; as a result, attempting to parse a string with a conflicting XML encoding attribute will cause problems.

Maybe try stripping that attribute from the string before parsing.
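A rough sketch of that idea in Python (the regular expression and helper name are illustrative, not from the original answer):

import re
from lxml import etree

def parse_without_declaration(xml_text):
    # Drop an optional leading <?xml ... ?> declaration so the string no
    # longer claims an encoding that conflicts with the text we actually hold.
    cleaned = re.sub(r'^\s*<\?xml[^>]*\?>', '', xml_text)
    return etree.fromstring(cleaned)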

I solved the problem by removing the encoding information, newline literals, and carriage return literals. Every string parsed successfully once I opened the files that returned errors in vim and ran the following three commands:

:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g

Then lxml parsed the strings without issue.
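For strings that cannot be fixed in the source files, the same three substitutions can be applied in Python before parsing; this is only a sketch of that approach (the helper name is illustrative):

import re
from lxml import etree

def clean_and_parse(raw):
    # Mirror the vim substitutions: drop literal \r and \n sequences
    # and any <?...?> declaration, then parse what remains.
    cleaned = re.sub(r'\\r|\\n', '', raw)
    cleaned = re.sub(r'<\?.*?\?>', '', cleaned)
    return etree.fromstring(cleaned)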

Update:

I have a better solution. The problem was the \n and \r literals in the UTF-8-encoded strings I was copying to text files. I just needed to remove these characters from the strings with regexp_replace, like so:

select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;

Now I can run the following and read the data with lxml without further processing:

psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
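With the literals removed at export time, each line of output.txt should parse directly; a minimal sketch of the reading side (only the file name comes from the command above, the rest is illustrative):

from lxml import etree

with open('output.txt', encoding='utf-8') as handle:
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Re-encode to bytes so lxml can apply any declared encoding itself.
        root = etree.fromstring(line.encode('utf-8'))
        print(root.tag)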
