
BeautifulSoup findAll with class attribute: unicode encode error

I am using BeautifulSoup to extract news stories (just the titles) from Hacker News, and this is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html =[]

    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")

    return titles_html

print get_stories(get_page())

When I run the code, however, it gives an error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

How do I get this to work?

BeautifulSoup works internally with unicode strings. Printing a unicode string to the console makes Python convert it to its default encoding, which is usually ASCII, and that conversion fails for any web page containing non-ASCII characters. You can learn the basics of Python and Unicode by searching for "python unicode". In the meantime, encode your unicode strings to UTF-8 before printing them:

print some_unicode_string.encode('utf-8')
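
As a minimal sketch of the two directions involved: encode turns a unicode string into bytes, and decode turns bytes back into a unicode string. The sample title below is made up for illustration; it contains u'\xe2', the character named in the traceback. The snippet avoids print so that it should behave the same under Python 2 and Python 3.3+.

```python
# Hypothetical sample title containing u'\xe2', the character from the traceback.
title = u"Caf\xe2 society"

# Encoding to ASCII fails because u'\xe2' has no ASCII representation.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds: u'\xe2' becomes the two bytes 0xC3 0xA2.
utf8_bytes = title.encode("utf-8")

# Decoding reverses the step and recovers the original unicode string.
round_trip = utf8_bytes.decode("utf-8")
```

Here ascii_ok ends up False, which mirrors the failure in the traceback, while the UTF-8 encoding goes through.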

One thing to note about your code: findAll returns a list (in this case a list of BeautifulSoup tag objects), but you only want the titles. Use find to get the first link in each cell and take its string, rather than printing the list of tag objects. The following works fine, for example:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []

    for td in soup.findAll("td", { "class":"title" }):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)

    return titles

print get_stories(get_page())

So now get_stories() returns a list of unicode objects, which prints out as you'd expect.

It works fine; what's broken is the output. Either explicitly encode to your console's charset, or find a different way to run your code (e.g., from within IDLE).
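
A rough sketch of encoding explicitly to the console's charset is shown here. The helper name encode_for_console is hypothetical, and the UTF-8 fallback is an assumption for the case where sys.stdout reports no encoding (for example when output is piped).

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode a unicode string to bytes suitable for the console.

    encode_for_console is a hypothetical helper, not part of BeautifulSoup.
    """
    # Prefer the caller's choice, then the console's reported encoding,
    # then UTF-8 as an assumed fallback when nothing is reported.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes '?' for characters the target charset lacks,
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")


ascii_version = encode_for_console(u"caf\xe2", "ascii")
utf8_version = encode_for_console(u"caf\xe2", "utf-8")
```

Under Python 2, which the code in this thread targets, print encode_for_console(title) would write those bytes to the terminal. The "replace" handler trades fidelity for robustness: a character the console cannot show comes out as "?" rather than stopping the script.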
