
Python: 'ascii' codec can't encode characters

I am using the following code to scrape a webpage that contains Japanese characters:

import urllib2
import bs4
import time

url = 'http://www.city.sapporo.jp/eisei/tiiki/toban.html'

pagecontent = urllib2.urlopen(url)
soup = bs4.BeautifulSoup(pagecontent.read().decode("utf8"))

print(soup.prettify())
print(soup)

On some machines the code works fine, and the last two statements print the result successfully. On other machines, however, the second-to-last statement raises the error

UnicodeEncodeError 'ascii' codec can't encode characters in position 485-496: ordinal not in range(128),

and the last statement prints strange squares for all Japanese characters.

Why does the same code behave differently on two machines? How can I fix this?

Python version: 2.6.6

bs4 version: 4.1.0

You need to configure your environment locale correctly; once your locale is set, Python will pick it up automatically when printing to a terminal.

Check your locale with the locale command:

$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

Note the .UTF-8 in my locale settings; it tells programs running in the terminal that my terminal uses the UTF-8 codec, one that supports all of Unicode.
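You can check from inside the interpreter which codec Python will actually use; a minimal diagnostic sketch (works on both Python 2 and 3):

```python
import sys
import locale

# Codec Python uses for print when stdout is a terminal
# (may be None on Python 2 when output is piped or redirected).
print(sys.stdout.encoding)

# Codec implied by the environment locale (LC_CTYPE / LANG).
print(locale.getpreferredencoding())
```

If the second line reports an ASCII-based codec (e.g. `US-ASCII` or `ANSI_X3.4-1968`), the locale is misconfigured and printing non-ASCII text will fail.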

You can set all of your locale in one step with the LANG environment variable:

export LANG="en_US.UTF-8"

for a US locale (how dates and numbers are printed) with the UTF-8 codec. To be precise, the LC_CTYPE category determines the output codec; when it is not set explicitly it falls back to the LANG value (and LC_ALL, if set, overrides everything else).
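If you cannot change the machine's locale, Python 2.6 and later also honour the PYTHONIOENCODING environment variable, which overrides the codec used for the standard streams. A sketch, with `scrape.py` as a hypothetical name for your script:

```shell
# Force UTF-8 for Python's stdin/stdout/stderr, regardless of locale
export PYTHONIOENCODING=utf-8
python scrape.py   # hypothetical script name
```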

Also see the very comprehensive UTF-8 and Unicode FAQ for Unix/Linux.
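As a last resort, you can encode explicitly before printing, so Python 2 never attempts the implicit locale-dependent ASCII encode. A minimal sketch, with a sample Japanese string standing in for `soup.prettify()`:

```python
# -*- coding: utf-8 -*-
text = u'札幌'  # sample Japanese text standing in for soup.prettify()

# Encode to UTF-8 bytes yourself; on Python 2, printing these bytes
# bypasses the implicit ASCII encode that raises UnicodeEncodeError.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```

This is a workaround rather than a fix: configuring the locale correctly remains the right solution, since explicit encoding assumes the terminal itself understands UTF-8.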
