
Selenium/BeautifulSoup Web Scraper in Python Keeps Raising UnicodeEncodeError

I have a web scraper up and running. For some pages my code works fine, but for others (which must contain special characters) writing the page to a file fails with the dreaded UnicodeEncodeError. I have tried a number of solutions, including UnicodeDammit and the .encode('utf-8', 'ignore') method, which, judging from the other threads, experienced programmers despise because it just throws out data. The problem is, I still have no idea how to fix my code. Ah, the joys of being a rookie programmer! Do you gurus have any ideas on how to fix this?

The code in question is below (assume the necessary imports and variable definitions are in place, because they are).

LBfull = browser2.page_source
LBfullsoup = BeautifulSoup(LBfull, 'html.parser', from_encoding='UTF-8')


LBfileready = str(LBfullsoup.prettify())
unicodedata.normalize('NFKD', LBfileready).encode('utf-8','ignore')
file = open('D:/PATH/'+date+citynames[i]+'LB.txt', 'w')
file.write(LBfileready)
file.close()

The dreaded traceback is here:

Traceback (most recent call last):
  File "fitbitloop.py", line 95, in <module>
    file.write(LBfileready)
  File "C:\python351\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1209190-1209191: character maps to <undefined>

No matter what I do, I cannot get rid of this error. Is there some kind of error-checking code I can use to throw out characters that map to <undefined>? The website I am working on is global, so there could admittedly be all kinds of special characters. Since I can't write to a file, I haven't been able to look up the character in question. It just comes up blank in the Python shell when I pull it out of the string, which I assume is because my little command prompt window can't display it either. So how do I defeat this unpleasant problem? Any help is once again greatly appreciated. If you could point me to the thread that solves the problem, that would also be appreciated; there are so many threads on this particular topic that it is hard to find the "right answer."
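For anyone who wants to see which characters are involved without printing them to a console that cannot display them, here is a minimal sketch. It assumes the target encoding is cp1252, as the traceback suggests; `find_unencodable` and the sample string are hypothetical names made up for illustration, not part of the original script.

```python
import unicodedata


def find_unencodable(text, encoding="cp1252"):
    """Return (position, escaped char, Unicode name) for every character
    in `text` that `encoding` cannot represent."""
    problems = []
    for pos, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # ascii() yields a pure-ASCII escape such as '\u2603', which any
            # console can print; the name lookup falls back to "UNKNOWN".
            problems.append((pos, ascii(ch), unicodedata.name(ch, "UNKNOWN")))
    return problems


# Hypothetical sample standing in for the scraped page text.
sample = "caf\u00e9 \u4e2d\u6587 \u2603"
for pos, escaped, name in find_unencodable(sample):
    print(pos, escaped, name)
```

Checking one character at a time is slow on a page of over a million characters, but it should be adequate for a one-off diagnosis.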

Opening the file in 'wb' mode allowed me to avoid the error mentioned above. Hat tip to Adam Van Prooyen. Thanks for the help!
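As a rough illustration of why this helps: in text mode without an explicit encoding, Python falls back to the platform default (cp1252 in the traceback above), whereas binary mode writes whatever bytes it is given. The sketch below shows two ways to get UTF-8 onto disk under that assumption. The function names, the sample string, and the temporary directory are all hypothetical stand-ins for the original script's variables and path.

```python
import os
import tempfile


def save_text_utf8(path, text):
    # Text mode with an explicit encoding, so the platform default
    # (cp1252 in the traceback) is never consulted.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


def save_bytes_utf8(path, text):
    # Binary mode: encode the string ourselves and write raw bytes.
    with open(path, "wb") as fh:
        fh.write(text.encode("utf-8"))


# Hypothetical stand-in for the prettified page source.
page = "<p>caf\u00e9 \u4e2d\u6587 \u2603</p>\n"

out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "text_mode.txt")
bytes_path = os.path.join(out_dir, "binary_mode.txt")

save_text_utf8(text_path, page)
save_bytes_utf8(bytes_path, page)

# Read the text-mode file back to confirm nothing was lost.
with open(text_path, encoding="utf-8") as fh:
    print(fh.read() == page)
```

Either variant keeps every character, unlike encoding with 'ignore'. Note also that str methods and unicodedata.normalize return new strings rather than modifying their argument, so the result of the normalize(...).encode(...) line in the question is discarded unless it is assigned to a variable.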
