
Python: 'ascii' codec can't encode characters

I am using the following code to scrape a webpage that contains Japanese characters:

import urllib2
import bs4
import time

url = 'http://www.city.sapporo.jp/eisei/tiiki/toban.html'

pagecontent = urllib2.urlopen(url)
soup = bs4.BeautifulSoup(pagecontent.read().decode("utf8"))

print(soup.prettify())
print(soup)

On some machines the code works fine, and the last two statements print the result successfully. On other machines, however, the second-to-last statement raises the error

UnicodeEncodeError 'ascii' codec can't encode characters in position 485-496: ordinal not in range(128),

and the last statement prints strange squares for all Japanese characters.

Why does the same code behave differently on two machines? How can I fix this?

Python version: 2.6.6

bs4 version: 4.1.0

You need to configure your environment locale correctly; once your locale is set, Python will pick it up automatically when printing to a terminal.

Check your locale with the locale command:

$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

Note the .UTF-8 in my locale settings; it tells programs running in the terminal that my terminal uses the UTF-8 codec, one that supports all of Unicode.
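You can check from inside the interpreter which codec Python will actually use; a minimal diagnostic sketch (works on both Python 2 and 3):

```python
import sys
import locale

# Codec Python uses for print when stdout is a terminal
# (may be None on Python 2 when output is piped or redirected).
print(sys.stdout.encoding)

# Codec implied by the environment locale (LC_CTYPE / LANG).
print(locale.getpreferredencoding())
```

If the second line reports an ASCII-based codec (e.g. `US-ASCII` or `ANSI_X3.4-1968`), the locale is misconfigured and printing non-ASCII text will fail.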

You can set all of your locale in one step with the LANG environment variable:

export LANG="en_US.UTF-8"

for a US locale (how dates and numbers are printed) with the UTF-8 codec. To be precise, the LC_CTYPE category determines the output codec; when it is not set explicitly it falls back to the LANG value (and LC_ALL, if set, overrides everything else).
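If you cannot change the machine's locale, Python 2.6 and later also honour the PYTHONIOENCODING environment variable, which overrides the codec used for the standard streams. A sketch, with `scrape.py` as a hypothetical name for your script:

```shell
# Force UTF-8 for Python's stdin/stdout/stderr, regardless of locale
export PYTHONIOENCODING=utf-8
python scrape.py   # hypothetical script name
```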

Also see the very comprehensive UTF-8 and Unicode FAQ for Unix/Linux.
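As a last resort, you can encode explicitly before printing, so Python 2 never attempts the implicit locale-dependent ASCII encode. A minimal sketch, with a sample Japanese string standing in for `soup.prettify()`:

```python
# -*- coding: utf-8 -*-
text = u'札幌'  # sample Japanese text standing in for soup.prettify()

# Encode to UTF-8 bytes yourself; on Python 2, printing these bytes
# bypasses the implicit ASCII encode that raises UnicodeEncodeError.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```

This is a workaround rather than a fix: configuring the locale correctly remains the right solution, since explicit encoding assumes the terminal itself understands UTF-8.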
