
HTML encoding and lxml parsing

I'm trying to finally solve some encoding issues that pop up when scraping HTML with lxml. Here are three sample HTML documents that I've encountered:

1.

<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: 은 —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>

2.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: 은 —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

3.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>

My basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)  # raw_html holds the bytes of one sample document
title = doc.xpath('//title/text()')[0]
print title

The results are:

Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’

So, obviously, there is an issue with sample 1 and its missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8, and so it is functionally equivalent to my original code.

The lxml docs appear conflicted:

From here, the example seems to suggest we should use UnicodeDammit to decode the markup to unicode.

from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))

However here it says:

[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the first suggestion in the lxml docs, my code is now:

from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title

I now get the following results:

Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.

Sample 1 now works correctly, but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> declaration.
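One workaround for this specific failure (this is not from the original post, and the helper name below is made up) is to strip the XML declaration from the decoded string before parsing, since that declaration is exactly what triggers the ValueError:

import re

# Matches a leading <?xml ... ?> declaration, if one is present.
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>')

def strip_xml_declaration(markup):
    # lxml only rejects the declaration in already-decoded strings,
    # so removing it lets fromstring() accept the unicode markup.
    return XML_DECLARATION.sub('', markup, count=1)

doc = fromstring(strip_xml_declaration(dammit.unicode_markup))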

Is there a correct way to handle all of these cases? Is there a better solution than the following?

dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)

lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:

#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()  # keep the raw bytes around
        doc = UnicodeDammit(content, is_html=True)  # used only to detect the encoding

    # Hand lxml the bytes plus an explicit encoding; let it do the decoding.
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)

Output

Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
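A note on why this works: lxml raises the ValueError only when it is handed an already-decoded Python unicode string that still carries an encoding declaration. Feeding it raw bytes together with an explicit parser encoding, as above, sidesteps that check entirely, which is why sample 3 no longer errors.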

The problem probably stems from the fact that <meta charset> is a relatively new standard (introduced with HTML5, if I'm not mistaken; it wasn't really used before that).

Until the lxml.html library is updated to reflect it, you will need to handle that case specially.

If you only care about ISO-8859-* and UTF-8, and can afford to throw away encodings that are not ASCII-compatible (such as UTF-16 or the traditional East Asian charsets), you can do a regular-expression substitution on the byte string, replacing the newer <meta charset> form with the older http-equiv form, as sketched below.
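A rough sketch of that substitution (the pattern and helper name are mine, not the answerer's, and it assumes an ASCII-compatible byte string such as UTF-8 or ISO-8859-*):

import re

# Matches <meta charset="..."> with optional quotes and optional self-closing slash.
META_CHARSET = re.compile(
    br"<meta\s+charset=[\"']?([\w-]+)[\"']?\s*/?>",
    re.IGNORECASE)

def downgrade_meta_charset(raw_html):
    # Rewrite <meta charset=...> into the older http-equiv form,
    # preserving the declared charset via the backreference.
    return META_CHARSET.sub(
        br'<meta http-equiv="Content-Type" content="text/html; charset=\1" />',
        raw_html,
        count=1)

The rewritten bytes can then be passed to fromstring as before, since lxml already understands the http-equiv form, as samples 2 and 3 show.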

Otherwise, if you need a proper solution, your best bet is to patch the library yourself (and contribute the fix while you're at it). You might want to ask the lxml developers whether they have any half-baked code lying around for this particular bug, or whether they are tracking it on their bug tracker in the first place.
