Unicode error when saving an html file

Question

I'm using Python 2.6 and having loads of issues with the requests module and character encodings.

Boiled down to its simplest form, here's my code and the resulting error (including the actual site causing my issue):

import requests

sites = ['www.ddelectricmotors.com', 'www.stearnswood.com']
for domain in sites:
    r = requests.get('http://' + domain)
    f = open(domain, 'w')
    f.write(r.text)
    f.close()

The page for DDElectric Motors loads and saves fine, but the Stearnswood attempt yields the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 13186: ordinal not in range(128)

Ideally, I'd prefer to just force the encoding to ASCII, because I'm loading the result into scikit-learn, which seems to prefer ASCII. I'd be fine with just removing the characters that can't be encoded.

Answer 1

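Files contain bytes, but your r object's text attribute is a Unicode string (a sequence of code points). In Python 2, writing it to a file opened with 'w' implicitly encodes it as ASCII, which fails on u'\u2019' (a right single quotation mark). Encode it explicitly before writing instead: f.write(r.text.encode('UTF-8')). If you really want ASCII only, r.text.encode('ascii', 'ignore') drops every character that ASCII cannot represent.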
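As a rough sketch of the encode-before-write idea, here is a small offline example. The helper name save_text, the sample string page (a stand-in for r.text), and the temporary output directory are all assumptions made for illustration; none of them come from the question. It writes the same text once as UTF-8 and once as ASCII with unencodable characters dropped.

```python
# -*- coding: utf-8 -*-
import os
import tempfile


def save_text(path, text, encoding='utf-8', errors='strict'):
    """Hypothetical helper: encode a unicode string and write the bytes."""
    data = text.encode(encoding, errors)
    # Binary mode, so no implicit ASCII encoding happens on write.
    f = open(path, 'wb')
    try:
        f.write(data)
    finally:
        f.close()
    return data


# Stand-in for r.text: contains U+2019, the character from the traceback.
page = u'<p>It\u2019s here</p>'

out_dir = tempfile.mkdtemp()
utf8_path = os.path.join(out_dir, 'page_utf8.html')
ascii_path = os.path.join(out_dir, 'page_ascii.html')

# Lossless: keeps the curly quote as a three-byte UTF-8 sequence.
utf8_bytes = save_text(utf8_path, page)

# Lossy: 'ignore' silently drops anything outside ASCII.
ascii_bytes = save_text(ascii_path, page, 'ascii', 'ignore')
```

The 'ignore' handler deletes characters without warning (here "It's" becomes "Its"), so the UTF-8 variant is probably the safer default unless the downstream tool truly requires ASCII.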

The bigger issue is writing unsanitized data from the internet, obtained over an unsecured channel, into a file in an automated process. Be very careful how that file is used. Consider at minimum using HTTPS if you trust the site.

solution1
2 ACCEPTED 2012-11-20 23:47:03