
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef'

I am scraping Amazon customer reviews. The script runs for a while, but after a certain point I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "custreviewscrap.py", line 73, in <module>
    strcomment = str(k.getText())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2937: ordinal not in range(128)

I tried the following, but neither worked:

1) strcomment = str(k.getText()).encode('utf8')
2) strcomment = str(k.getText())
   strcomment = strcomment.encode('ascii', 'ignore')

Thank you very much!

for k in bsreview2.findAll('div',{"style":"margin-left:0.5em;"}):
    # The next part cleans the comments. Sorry, this part is really messy; I should have written a function.
    # The comment is surrounded by different text depending on the kind of review: video, pictures, or text.
    strcomment = str(k.getText())
    patcomment = re.compile(r'(.*(\(Electronics\)|\(Health and Beauty\)))')
    patcomment2 = re.compile(r'Help other customers find.*')
    patcomment3 = re.compile(r'(Customer review from the Amazon Vine Program(.|\n)*Length::)|(\<\!(.|\n)*Length::)|(Customer review from the Amazon Vine Program\(What\'s this\?\)|(.*See all my reviews))')

    cleancomment = re.sub(patcomment, '', strcomment)
    cleancomment = re.sub('&nbsp;', '', cleancomment)
    cleancomment = re.sub(patcomment2, '', cleancomment)
    cleancomment = re.sub(',' ,'.', cleancomment)
    cleancomment = re.sub(patcomment3, '', cleancomment)
    strdate = str(k.nobr.getText())
    cleandate = re.sub(',','',strdate)

    print (k.span.getText())[0:1]+','+ cleandate +',' + cleancomment
    csvtext = csvtext + (k.span.getText())[0:1]+','+ cleandate +',' + a +','+ cleancomment + '\n'

Assuming k.getText() returns Unicode, the following would work (where s is the result of k.getText()):

>>> s = u'\xef'
>>> s.encode('utf-8')
'\xc3\xaf'

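Note that the str() call is no longer needed. In Python 2, calling str() on a Unicode string implicitly encodes it with the ASCII codec, which is what raises the error. That is also why both attempts in the question fail: str() runs (and raises) before .encode() is ever reached.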
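To show how that fits into the cleaning step, here is a minimal sketch. It assumes the text from k.getText() is Unicode; clean_comment is a hypothetical helper (not in the original script) that applies only two of the substitutions from the question, and raw is made-up sample text standing in for a real review. The idea is to keep the text as Unicode through all the re.sub work and encode once at the end.

```python
import re

# Hypothetical helper (not in the original script): cleans one review's text.
def clean_comment(text):
    # text stays Unicode throughout; there is no str() call,
    # so no implicit ASCII encoding can happen here.
    text = re.sub(u'Help other customers find.*', u'', text)
    return text.replace(u',', u'.')

# Made-up stand-in for k.getText(); u'\xef' is the character from the traceback.
raw = u'Na\xefve design, but it works. Help other customers find helpful reviews'

cleaned = clean_comment(raw)

# Encode once, at the output boundary (e.g. just before writing the CSV data).
encoded = cleaned.encode('utf-8')
```

If the result goes to a file, another option is io.open(path, 'w', encoding='utf-8'), which accepts Unicode text directly and does the encoding on write.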


 