I'm trying to scrape a website that has content in Hebrew.
The Hebrew portions of the site, however, appear like this:
úåìåòô
How do I convert these characters into their proper letters?
I am using Python with BeautifulSoup.
You need to give BeautifulSoup the right codec to use, because otherwise it will make an educated guess and get it wrong some of the time.
If you are using urllib2 to load the page, you can pass along any encoding the server set with:
soup = BeautifulSoup(response.read(),
                     from_encoding=response.info().getparam('charset'))
See the encodings section of the BeautifulSoup documentation.
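When the server does not declare a charset, the codec can also be named explicitly. The following is a minimal sketch for Python 3 with the bs4 package, not the exact code from the answer above; the helper name make_soup and the sample markup are hypothetical, and the page is assumed to be encoded as windows-1255.

```python
from bs4 import BeautifulSoup


def make_soup(body, charset=None):
    """Parse raw bytes, passing along the declared charset when there is one.

    If charset is None, BeautifulSoup falls back to its own detection.
    """
    return BeautifulSoup(body, "html.parser", from_encoding=charset)


# Hypothetical page body: Hebrew text encoded as windows-1255 (cp1255).
raw_page = "<html><body><p>שלום</p></body></html>".encode("cp1255")

soup = make_soup(raw_page, charset="cp1255")
paragraph_text = soup.p.get_text()
print(paragraph_text)
```

With Python 3's urllib.request, the charset sent by the server should be available as response.headers.get_content_charset(), which returns None when no charset was sent.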
According to the Standard Encodings table in the Python documentation, these codecs cover Hebrew:
Codec       Aliases                Language
cp424       EBCDIC-CP-HE, IBM424   Hebrew
cp856       (none)                 Hebrew
cp862       862, IBM862            Hebrew
cp1255      windows-1255           Hebrew
iso8859_8   iso-8859-8, hebrew     Hebrew
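To check which of these codecs fits, one option is to take the garbled sample from the question and undo the wrong decoding. This is only a sketch under an assumption: that the bytes were really windows-1255 and were misread as Latin-1. The actual encoding of the site is not known from the question.

```python
# The garbled sample from the question.
garbled = "úåìåòô"

# Assumption: the page's bytes were windows-1255 but were decoded as Latin-1.
# Re-encoding as Latin-1 recovers the original bytes unchanged.
raw = garbled.encode("latin-1")

# Decoding those bytes with the Hebrew codec yields Hebrew letters.
repaired = raw.decode("cp1255")
print(repaired)
```

Under this assumption the result consists of Hebrew letters, but they appear to come out in reversed reading order, which suggests the page may store its text in visual order, as some older Hebrew pages do. If so, reversing each line would be an extra step after decoding.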