The following Python code ...
html_data = urllib2.urlopen(some_url).read()
f = codecs.open(filename, 'w', encoding='utf-8')
f.write(html_data)
f.close()
... sometimes fails with UnicodeDecodeError
...
File "/.../lib/python2.6/codecs.py", line 686, in write
return self.writer.write(data)
File "/.../lib/python2.6/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 5605: ordinal not in range(128)
My questions:
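Does the urllib2.urlopen(some_url).read() call always return UTF-8?
What is it about the codecs.open(...) call that prevents it from storing my data to disk in UTF-8 encoding?
The problem is not with codecs.open; it's with passing .write a byte string rather than a unicode object. Python 2 tries to decode that byte string as ASCII before encoding it as UTF-8, and the \xd0 byte makes that implicit decode fail. read() returns the raw bytes the server sent, in whatever encoding the page uses; given the \xd0 byte, that is probably some ISO-8859-* or related codec, although \xd0 can also begin a multi-byte UTF-8 sequence.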
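The failure can be reproduced without any network access. The sketch below is an illustration rather than the asker's code: the sample bytes are made up, and ISO-8859-5 and UTF-8 are picked only as examples of codecs under which a \xd0 byte is legal.

```python
# Hypothetical sample data: two non-ASCII bytes, starting with 0xd0.
raw = b'\xd0\xb0'

# This is the decode that Python 2 attempts implicitly when a byte
# string is passed to write() on a codecs.open()-ed file.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The same bytes decode cleanly once a codec is named explicitly,
# but the resulting text depends on which codec that is.
as_utf8 = raw.decode('utf-8')          # one character (U+0430)
as_latin5 = raw.decode('iso-8859-5')   # two characters
```

The bytes alone do not say which interpretation is right, which is why the encoding has to come from the response metadata.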
urllib2.urlopen returns a response object which, besides file-like behavior, has the extra method info(), which returns the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers).
In particular, the Content-Type header, for text-like contents, should have a charset parameter specifying the encoding it uses, e.g. Content-Type: text/html; charset=ISO-8859-4. You need to parse and isolate the charset and use it to decode the contents into Unicode (so your codecs.open-ed file-like object always gets unicode arguments to write and properly writes them out in utf-8).
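One way to isolate the charset is to let the standard library's email parser read the header value. This is a minimal sketch, not the only approach: the function names charset_from_content_type and decode_body are hypothetical, and the header string is assumed to have been obtained already (in Python 2, for instance, from response.info().getheader('Content-Type')).

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lower-cases the name and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


def decode_body(raw_bytes, content_type):
    """Decode the raw response bytes using the declared charset."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        raise ValueError('no charset in Content-Type: %r' % content_type)
    return raw_bytes.decode(charset)
```

For example, charset_from_content_type('text/html; charset=ISO-8859-4') yields 'iso-8859-4', and the text returned by decode_body can then be handed to the write method of a codecs.open-ed file.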
If charset is missing, or using it to decode the text results in errors (suggesting charset is wrong), as the last hope of salvation you can try the Universal Encoding Detector, which uses heuristics for the purpose (after all, many pages on the web have horrible metadata errors, as well as broken HTML and so forth).
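The fallback order described above could be sketched as follows. The function name decode_with_fallback is hypothetical, the Universal Encoding Detector is assumed to be installed as the third-party chardet package (the code still runs without it), and the final latin-1 step is an added assumption of this sketch, not something the answer prescribes.

```python
try:
    import chardet  # third-party Universal Encoding Detector; optional
except ImportError:
    chardet = None


def decode_with_fallback(raw_bytes, declared_charset=None):
    """Decode bytes, trusting the declared charset before guessing."""
    # 1. Use the declared charset if it exists and actually works.
    if declared_charset:
        try:
            return raw_bytes.decode(declared_charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec name, or bytes that do not fit it

    # 2. Otherwise ask the heuristic detector, if it is installed.
    if chardet is not None:
        guess = chardet.detect(raw_bytes)['encoding']
        if guess:
            try:
                return raw_bytes.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass

    # 3. Assumed last resort: latin-1 maps every byte to a character,
    #    so this never raises, but the text may be wrong.
    return raw_bytes.decode('latin-1')
```

Step 3 trades correctness for robustness; raising an error instead would be the stricter choice if wrong characters on disk are worse than a failed download.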
Example:
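import urllib2
response = urllib2.urlopen(uri)
# charset declared in the Content-Type header; None if absent, so handle that case
encoding = response.info().getparam('charset')
data = response.read().decode(encoding)
f = open(file_name, 'wb')
f.write(data.encode('utf-8'))
f.close()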