How do I write data to disk in UTF-8 encoding in Python?

The following Python code ...

import codecs
import urllib2

html_data = urllib2.urlopen(some_url).read()
f = codecs.open(filename, 'w', encoding='utf-8')
f.write(html_data)
f.close()

... sometimes fails with UnicodeDecodeError ...

File "/.../lib/python2.6/codecs.py", line 686, in write
  return self.writer.write(data)
File "/.../lib/python2.6/codecs.py", line 351, in write
  data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 5605: ordinal not in range(128)

My questions:

  • How do I make sure my urllib2.urlopen(some_url).read() call always returns UTF-8?
  • Is there anything wrong with my codecs.open(...) call that prevents it from storing my data to disk in UTF-8 encoding?

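The problem is not with codecs.open; it is with passing .write a byte string rather than unicode. In Python 2, the UTF-8 writer has to encode whatever it is given, and encoding a byte string first decodes it implicitly with the ascii codec, which is why the traceback reports a UnicodeDecodeError at the first non-ASCII byte. Given the \xd0 byte in it, your data is probably encoded in some ISO-8859-* or related codec, although \xd0 is also a valid lead byte in UTF-8, so that byte alone does not settle which one.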
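A minimal sketch of the fix this implies: decode the bytes to unicode first, then hand the text to the codecs.open writer. The helper name save_as_utf8, the sample bytes, and the choice of ISO-8859-5 as the source encoding are assumptions made for illustration, not something taken from the question.

```python
import codecs
import os
import tempfile


def save_as_utf8(raw_bytes, source_encoding, path):
    """Decode raw_bytes from source_encoding, then write the text as UTF-8."""
    text = raw_bytes.decode(source_encoding)  # bytes -> unicode text
    with codecs.open(path, 'w', encoding='utf-8') as f:
        f.write(text)  # the writer only has to encode, never decode


# Illustrative input: in ISO-8859-5, byte 0xd0 is the Cyrillic letter U+0430.
sample = b'abc\xd0'
path = os.path.join(tempfile.mkdtemp(), 'page.txt')
save_as_utf8(sample, 'iso-8859-5', path)

# Read the file back in binary mode to inspect the bytes on disk.
with open(path, 'rb') as fh:
    written = fh.read()
```

The single source byte 0xd0 ends up on disk as the two-byte UTF-8 sequence for U+0430, which shows the file really is UTF-8 rather than a copy of the original bytes.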

urllib2.urlopen returns a response object which, besides file-like behavior, has the extra method:

info(): return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

In particular, the Content-Type header, for text-like content, should have a charset parameter specifying the encoding it uses, e.g. Content-Type: text/html; charset=ISO-8859-4. You need to parse and isolate the charset and use it to decode the contents into Unicode (so the file-like object returned by codecs.open always gets unicode arguments to write and properly writes them out in UTF-8).
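One way to isolate the charset, as a sketch rather than the only approach: let the standard library's email.message.Message parse the header value. The function names charset_from_content_type and decode_body are hypothetical, and the header string is passed in directly here; in Python 2 it would presumably come from something like response.info().getheader('Content-Type').

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the charset lower-cased, or None.
    return msg.get_content_charset() or default


def decode_body(body, content_type):
    """Decode a response body using the charset declared in its header."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        raise ValueError('no charset declared in %r' % content_type)
    return body.decode(charset)


# Illustrative values only: 0xe4 is U+00E4 in ISO-8859-4.
header = 'text/html; charset=ISO-8859-4'
text = decode_body(b'<p>\xe4</p>', header)
```

The resulting text is unicode, so it can be passed straight to the write method of a file opened with codecs.open(filename, 'w', encoding='utf-8').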

If charset is missing, or using it to decode the text results in errors (suggesting charset is wrong), as the last hope of salvation you can try the Universal Encoding Detector which uses heuristics for the purpose (after all, many pages on the web have horrible metadata errors, as well as broken HTML and so forth).
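The fallback order described here could be sketched as below. The function name decode_with_fallback and the detect parameter are hypothetical: detect stands for any callable that takes bytes and returns an encoding name or None. Trying UTF-8 as a further candidate and finishing with a lossy decode are my own assumptions, not part of the answer.

```python
def decode_with_fallback(body, declared=None, detect=None):
    """Try the declared charset, then a detector's guess, then UTF-8.

    Returns a (text, encoding_used) pair.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    if detect is not None:
        guess = detect(body)  # heuristic guess; may be None
        if guess:
            candidates.append(guess)
    candidates.append('utf-8')

    for name in candidates:
        try:
            return body.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate

    # Last resort: never raises, but replaces undecodable bytes with U+FFFD.
    return body.decode('utf-8', 'replace'), 'utf-8 (lossy)'


# The declared charset is wrong here, so the UTF-8 candidate is used instead.
text, used = decode_with_fallback(b'\xd0\xb0', declared='ascii')
```

If the chardet package (the Universal Encoding Detector) is installed, something like detect=lambda b: chardet.detect(b)['encoding'] should fit the detect parameter; treat that wiring as an assumption to check against the chardet version you have.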

  1. AFAIK, you cannot do that. However, you can detect the encoding from the headers or the HTML and re-encode.
  2. I don't know. I have always used binary mode for writing and it has always worked.

Example:

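from urllib2 import urlopen

# encoding is the page's charset, detected from the headers or the HTML
data = urlopen(uri).read().decode(encoding)
f = open(file_name, 'wb')
f.write(data.encode('utf-8'))  # write the re-encoded bytes in binary mode
f.close()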
