I have this script, which reads the text from web page:
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page);
paragraphs = soup.findAll('p');
for p in paragraphs:
content = content+p.text+" ";
In the web page I have this string:
Möddinghofe
My script reads it as:
Möddinghofe
How can I read it as it is?
Hope this would help you
from BeautifulSoup import BeautifulStoneSoup
import cgi
def HTMLEntitiesToUnicode(text):
"""Converts HTML entities to unicode. For example '&' becomes '&'."""
text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
return text
def unicodeToHTMLEntities(text):
"""Converts unicode to HTML entities. For example '&' becomes '&'."""
text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
return text
text = "&, ®, <, >, ¢, £, ¥, €, §, ©"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &, ®, <, >, ¢, £, ¥, €, §, ©
我建议您看一看BeautifulSoup文档的编码部分。
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.