Converting funny, special Latin characters to unicode (foreign characters)

Question

I'm trying to scrape a website that has content in Hebrew.

The Hebrew portions of the site, however, appear like this:

úåìåòô

How do I convert these characters into their proper letters?

I am using Python with BeautifulSoup.

Answer 1

You need to give BeautifulSoup the right codec to use, because otherwise it will make an educated guess and get it wrong (some of the time).

If you are using urllib2 to load the page, you can pass along any encoding the server set with:

from bs4 import BeautifulSoup
# response is the object returned by urllib2.urlopen(url) (Python 2)
soup = BeautifulSoup(response.read(),
                     from_encoding=response.info().getparam('charset'))

See the encodings section of the BeautifulSoup documentation.
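
On Python 3, urllib2 and getparam() no longer exist, so the same idea might look roughly like the sketch below. The helper names charset_from_header and fetch_soup, and the fallback parameter, are made up for illustration; it assumes the bs4 package with the built-in "html.parser".

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def fetch_soup(url, fallback=None):
    """Fetch url and parse it with the charset the server declared.

    fallback is a hypothetical parameter: the codec to use when the server
    declares none. With None, BeautifulSoup falls back to guessing.
    """
    # Imported here so charset_from_header works without bs4 installed.
    from bs4 import BeautifulSoup

    with urlopen(url) as response:
        header = response.headers.get("Content-Type", "text/html")
        charset = charset_from_header(header) or fallback
        return BeautifulSoup(response.read(), "html.parser",
                             from_encoding=charset)
```

If the server sends no charset at all, calling fetch_soup(url, fallback="windows-1255") would be one way to force a Hebrew codec instead of relying on the guess.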

Answer 2

According to the Standard Encodings table in the Python documentation, these codecs cover Hebrew:

Codec       Aliases                Languages
cp424       EBCDIC-CP-HE, IBM424   Hebrew
cp856                              Hebrew
cp862       862, IBM862            Hebrew
cp1255      windows-1255           Hebrew
iso8859_8   iso-8859-8, hebrew     Hebrew
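
To find out which of these fits, one option is to undo the wrong decode and try each codec in turn. This is only a sketch in Python 3: the function name try_hebrew_codecs is made up here, and it assumes the garbled text came from decoding the page bytes as Latin-1.

```python
# Hebrew codecs from the table above, spelled as Python's codec names.
HEBREW_CODECS = ["cp424", "cp856", "cp862", "cp1255", "iso8859_8"]


def try_hebrew_codecs(garbled, wrong_codec="latin-1"):
    """Undo a wrong decode, then re-decode with each Hebrew codec.

    Assumption: garbled was produced by decoding the page bytes with
    wrong_codec. Try "cp1252" if "latin-1" does not round-trip.
    """
    raw = garbled.encode(wrong_codec)  # recover the original bytes
    results = {}
    for codec in HEBREW_CODECS:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None  # these bytes are not valid in this codec
    return results


# The sample from the question.
for codec, text in try_hebrew_codecs("úåìåòô").items():
    print(codec, repr(text))
```

For the sample in the question, cp1255 and iso8859_8 both produce Hebrew letters, so either name is a reasonable candidate to pass as from_encoding.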