
Unicode error when saving an html file

I'm using Python 2.6 and having loads of issues with the requests module and character encodings.

Boiled down to its simplest form, here's my code and the resulting error (including the actual site causing my issue):

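import requests

sites = ['www.ddelectricmotors.com', 'www.stearnswood.com']
for domain in sites:
    r = requests.get('http://' + domain)
    # Save each page under a file named after its domain.
    f = open(domain, 'w')
    # r.text is a unicode string; this write is where the error is raised.
    f.write(r.text)
    f.close()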

The page for DDElectric Motors loads and saves fine, but the Stearnswood attempt yields the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 13186: ordinal not in range(128)

Ideally, I'd prefer to just force encoding to ascii, because I'm loading it into scikit-learn, which seems to prefer ascii. I'd be fine with just removing the unknown char.

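Files contain bytes, while your r object's text attribute is a unicode (code point) string. In Python 2, writing a unicode string to a file opened with open() makes Python encode it implicitly with the 'ascii' codec, which fails on u'\u2019' (a right single quotation mark). Encode explicitly instead: f.write(r.text.encode('UTF-8')). If you really want ASCII only and are fine with dropping everything else, use f.write(r.text.encode('ascii', 'ignore')).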
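
As a minimal sketch of both options (explicit UTF-8 versus dropping non-ASCII characters), the snippet below avoids the network and works on a sample string. The helper names to_ascii and save_text are hypothetical, not part of requests, and the sample text merely stands in for r.text:

```python
# -*- coding: utf-8 -*-
import io


def to_ascii(text):
    """Return text with every non-ASCII character dropped (hypothetical helper)."""
    return text.encode('ascii', 'ignore').decode('ascii')


def save_text(path, text, encoding='utf-8'):
    """Write a unicode string to path with an explicit encoding (hypothetical helper)."""
    # io.open encodes on write, so no implicit 'ascii' conversion takes place.
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)


# Stand-in for r.text: contains the same u'\u2019' character as the failing page.
sample = u'It\u2019s a motor'
ascii_only = to_ascii(sample)
```

Calling save_text(domain, r.text) keeps the page intact as UTF-8, while save_text(domain, to_ascii(r.text)) stores an ASCII-only copy with the offending characters removed.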

The bigger issue is writing unsanitized data from the internet, obtained over an unsecured channel, into a file in an automated process. Be very careful how that file is used. Consider at minimum using HTTPS if you trust the site.


 