
Converting funny, special Latin characters to unicode (foreign characters)

I'm trying to scrape a website that has content in Hebrew.

The Hebrew portions of the site, however, appear like this:

úåìåòô

How do I convert these characters into their proper letters?

I am using Python with BeautifulSoup.

You need to tell BeautifulSoup which codec to use; otherwise it makes an educated guess about the encoding, and some of the time it guesses wrong.

If you are using urllib2 to load the page, you can pass along any encoding the server set with:

# response is the object returned by urllib2.urlopen(url) (Python 2)
soup = BeautifulSoup(response.read(),
                     from_encoding=response.info().getparam('charset'))

See the encodings section of the BeautifulSoup documentation.
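For Python 3, where urllib2 no longer exists, the same idea can be sketched with the standard library alone. This is a minimal sketch, not the answer's own code: the helper names charset_from_content_type and decode_body are made up for illustration, and the windows-1255 fallback is an assumption that only makes sense for a page you already know is Hebrew.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(body, content_type, fallback="windows-1255"):
    """Decode raw bytes with the declared charset, else with the fallback.

    The fallback is an assumption for a Hebrew page, not a general default.
    """
    encoding = charset_from_content_type(content_type, fallback)
    return body.decode(encoding, errors="replace")


# Stand-in for response.read(): a Hebrew word encoded the way the server might send it.
sample_body = "שלום".encode("windows-1255")
print(decode_body(sample_body, "text/html; charset=windows-1255"))
```

With a live urllib.request response, response.headers.get_content_charset() returns the same value directly, and it can be passed to BeautifulSoup as from_encoding just as in the snippet above.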

According to the Standard Encodings table in the Python documentation, these codecs cover Hebrew:

Codec      Aliases                 Languages
cp424      EBCDIC-CP-HE, IBM424    Hebrew
cp856                              Hebrew
cp862      862, IBM862             Hebrew
cp1255     windows-1255            Hebrew
iso8859_8  iso-8859-8, hebrew      Hebrew
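To see which of these codecs fits the sample from the question, one rough approach is to turn the garbled characters back into bytes and try each codec in turn. The sketch below assumes the text was originally mis-decoded as Latin-1, which is a guess based on the characters shown; the function name repair is hypothetical.

```python
HEBREW_CODECS = ("cp424", "cp856", "cp862", "cp1255", "iso8859_8")


def repair(garbled, codec, wrong_codec="latin-1"):
    """Undo a wrong decode: recover the original bytes, then decode them again.

    wrong_codec is an assumption about how the text was garbled in the first place.
    """
    raw = garbled.encode(wrong_codec)
    return raw.decode(codec, errors="replace")


garbled = "úåìåòô"  # the sample from the question
for codec in HEBREW_CODECS:
    print(codec, repr(repair(garbled, codec)))
```

For this sample, cp1255 and iso8859_8 both yield Hebrew letters, while the other codecs map the same bytes to unrelated characters, which suggests the page is served in one of those two encodings.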

