How to escape a unicode error when converting a BeautifulSoup object to a string

Question

I have been working with the following bit of code, attempting to extract the text elements of this webpage.

site= 'http://football.fantasysports.yahoo.com/f1/1785/4/team?&week=4'
print site
response = urllib2.urlopen(site)
html = response.read()

soup = BeautifulSoup(html)
position = soup.find_all('span', class_="Fz-xxs")
for j in range(0,13):
    positionlist = str(position[j].get_text())

print (positionlist)

Unfortunately, the text being put into the positionlist string contains many hyphens (e.g. SEA-RB) that apparently cannot be encoded. When I run the code as it is, I get the following error:

Traceback (most recent call last):
  File "/Users/masongardner/Desktop/TestSorter.py", line 20, in <module>
    positionlist = str(position[j].get_text())
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue002' in position 0: ordinal not in range(128)

I am aware that the hyphen cannot be encoded, but I am not sure how to change the encoding so that the hyphen is handled as unicode if possible, or otherwise how to ignore the hyphen and just encode the text before and after it in each instance. This project is purely for my own use, so a hackish approach is not a problem!

Thanks Everyone!

Answer 1

Don't cast to str; just print the unicode string you get from get_text():

import urllib2
from bs4 import BeautifulSoup

site = 'http://football.fantasysports.yahoo.com/f1/1785/4/team?&week=4'

print site
response = urllib2.urlopen(site)
html = response.read()

soup = BeautifulSoup(html)
position = soup.find_all('span', class_="Fz-xxs")
for j in range(0, 13):
    positionlist = position[j].get_text()  # unicode string, no str() call
    print positionlist

Output:

Viewing Info for League: The League (ID# 1785)
(U+E002)  # private-use icon character, see http://chars.suikawiki.org/char/E002




Since '08
Jax - QB

Atl - WR

Ten - WR

You are seeing exactly what is in the source, where the icon span holds the single character U+E002: <span class="F-icon Fz-xxs">\ue002</span></a>

If you want to ignore that character, use if positionlist != u"\ue002":
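As a minimal sketch of the "ignore that character" idea, the snippet below drops every Private Use character (Unicode category Co, which includes U+E002) rather than comparing against one hard-coded value. The helper names and the sample strings are hypothetical stand-ins for the get_text() results, and the code is written to run on both Python 2.7 and Python 3.

```python
import unicodedata


def is_private_use(ch):
    """Return True for Private Use code points (Unicode category 'Co')."""
    return unicodedata.category(ch) == "Co"


def strip_private_use(text):
    """Drop Private Use characters and trim surrounding whitespace."""
    return u"".join(ch for ch in text if not is_private_use(ch)).strip()


# Hypothetical sample values standing in for position[j].get_text()
samples = [u"\ue002", u"Jax - QB", u"Atl - WR", u"\ue002Ten - WR"]

cleaned = [strip_private_use(s) for s in samples]
cleaned = [s for s in cleaned if s]  # skip spans that held only the icon

for entry in cleaned:
    print(entry)
```

Filtering by category keeps the ordinary hyphens and letters intact, so it only removes the icon-font glyphs.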

You can also use unicodedata to decompose the text and then drop everything that is not ASCII:

import unicodedata
print unicodedata.normalize('NFKD', positionlist).encode('ascii', 'ignore')
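To illustrate what the normalize-then-encode approach does, here is a small sketch that should run on both Python 2.7 and Python 3. The function name to_ascii and the sample strings are assumptions made for the example. NFKD splits an accented letter into a base letter plus a combining mark, and the 'ignore' handler then silently discards whatever ASCII cannot represent.

```python
import unicodedata


def to_ascii(text):
    """Decompose text (NFKD), then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() yields bytes; decode back so the result is text again
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"\ue002Sea - RB"))  # the icon character is dropped
print(to_ascii(u"Caf\u00e9"))       # the accent is dropped, the base letter is kept
```

Note that this is lossy: a character with no ASCII base form, such as U+E002, disappears without any warning.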

Answer 2

You can also do this:

try:
    print(word)
except Exception:
    print(str(word.encode("utf-8", 'ignore')))
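A slightly narrower variant of this fallback, offered only as a sketch: it checks whether the text fits the target encoding and, if not, substitutes "?" for the offending characters. The names make_printable and safe_print are hypothetical, and using sys.stdout.encoding (falling back to ASCII when it is unset) is an assumption about where the text is going.

```python
import sys


def make_printable(text, encoding):
    """Return text unchanged if it fits the encoding, else with '?' substitutes."""
    try:
        text.encode(encoding)
        return text
    except UnicodeEncodeError:
        # 'replace' swaps each unencodable character for '?'
        return text.encode(encoding, "replace").decode(encoding)


def safe_print(text):
    """Print text using whatever encoding the current stdout reports."""
    encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    print(make_printable(text, encoding))


safe_print(u"\ue002Sea - RB")
```

Catching UnicodeEncodeError specifically, rather than Exception, avoids hiding unrelated errors.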

Answer 3

get_text() (as the name suggests) already returns text, i.e. a Unicode string. You should not call str(); you can print Unicode text directly:

>>> str(u'\N{SNOWMAN}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)
>>> print u'\N{SNOWMAN}'
☃

If you need to convert a Unicode string to bytes, call its .encode() method (don't use str()):

bytestring = unicode_text.encode(character_encoding)
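As a rough illustration of .encode() with an explicit encoding, the sketch below encodes one sample string to UTF-8 and then to ASCII with each of the standard error handlers. The sample text is made up for the example; the code should run on both Python 2.7 and Python 3.

```python
unicode_text = u"\u2603 Sea - RB"  # hypothetical sample: snowman plus ASCII text

# Encoding to UTF-8 is lossless: decoding gives back the original text.
utf8_bytes = unicode_text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Encoding to ASCII needs an error handler to deal with the snowman.
handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
ascii_results = dict((h, unicode_text.encode("ascii", h)) for h in handlers)

for handler in handlers:
    print("%-17s %r" % (handler, ascii_results[handler]))
```

Which handler is appropriate depends on whether the lost character should vanish, leave a visible marker, or stay recoverable.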

Answer 4

position[j].get_text() returns a unicode object, which cannot be converted to str (which is really a byte string) without an encoding. When you call str() without specifying one, Python assumes ASCII and raises an error as soon as it finds a non-ASCII character.

You don't need to convert to str if you only want to print to the console. More likely, though, you want to send the text somewhere else, so specify the encoding; if you don't know which one to use, stick to UTF-8, since most applications use it. Also, as already mentioned, look into how to ignore non-ASCII characters.
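For the "send it somewhere" case, one possible sketch is writing the text to a file with an explicit UTF-8 encoding through io.open, which behaves the same on Python 2.7 and Python 3. The file name positions.txt, the temporary directory, and the sample lines are all assumptions made for the example.

```python
import io
import os
import tempfile

# Hypothetical values standing in for the scraped get_text() results
lines = [u"\ue002", u"Jax - QB", u"Atl - WR"]

path = os.path.join(tempfile.mkdtemp(), "positions.txt")

# io.open encodes on write, so no str() or manual .encode() call is needed
with io.open(path, "w", encoding="utf-8") as handle:
    for line in lines:
        handle.write(line + u"\n")

# Reading back with the same encoding restores the original text
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read().splitlines()

print(round_tripped == lines)
```

Naming the encoding at the file boundary keeps everything inside the program as Unicode text, including the icon character.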

solution1
0 ACCEPTED 2015-01-23 18:16:30

solution2
0 2015-01-23 18:35:56

solution3
0 2015-01-23 18:42:15

solution4
0 2015-01-23 18:54:20
