
Python 3 | Unicode Error | Requests and BeautifulSoup

I'm still learning Python 3 and I am trying to write a program that uses Requests and BeautifulSoup. I'm new to both of these modules.

I'm getting this Unicode-related error because I'm trying to save the page source to a file before analysing it.

Error:

    Traceback (most recent call last):
      File "C:\Users\Gonçalo\Desktop\Coding\Python\Web Crawler\Image Retriver.py", line 25, in <module>
        saveFile.write(soup)
      File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in position 145890: character maps to <undefined>

Code:

    import requests
    from bs4 import BeautifulSoup
    import os


    url = "https://www.google.pt/search?q=hello"
    req = requests.get(url)
    resp = req.text
    soup = BeautifulSoup(resp,"html.parser")
    soup = soup.prettify()


    dir_list = os.listdir()
    if "Image Retriever Files" not in dir_list:
        os.makedir("Image Retriever Files")
    curDir = os.curdir
    filename = curDir+"/Image Retriever Files/Search Results.html"
    saveFile = open(filename,"w")
    saveFile.write(soup)
    saveFile.close()

Thanks for any help!

This can get you closer, but you will have other issues. You should wrap this in a try/except block and extend that if with an else, because you didn't state what you want to happen if the folder already exists (os.mkdir raises an error if it does).

    import requests
    from bs4 import BeautifulSoup
    import os

    url = "https://www.google.pt/search?q=hello"
    req = requests.get(url)
    resp = req.text
    soup = BeautifulSoup(resp, "html.parser")
    soup = soup.prettify()

    # Create the output folder only if it does not already exist
    if not os.path.isdir("imageFile"):
        os.mkdir("imageFile")
    curDir = os.curdir
    filename = curDir + "/imageFile/SearchResults.html"
    # A file opened in binary mode ("wb") only accepts bytes, so encode the text first
    saveFile = open(filename, "wb")
    saveFile.write(soup.encode("utf-8"))
    saveFile.close()

I hope this sets you on the right path. If it did, hit the check box; if not, I am here to help. Regards, Jason
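
The try/except and folder handling suggested in this answer could be sketched roughly as follows. The helper name `save_page` and its parameters are hypothetical and not part of the original code. The sketch assumes the page text has already been downloaded and stores it as UTF-8 bytes.

```python
import os


def save_page(text, folder, name):
    """Write text into folder/name as UTF-8 bytes; return the path or None."""
    try:
        # exist_ok=True makes an already existing folder a non-error,
        # so no separate if/else around the folder check is needed.
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, name)
        # Binary mode only accepts bytes, so the text is encoded first.
        with open(path, "wb") as save_file:
            save_file.write(text.encode("utf-8"))
        return path
    except OSError as err:
        # Covers problems such as missing permissions or a regular file
        # that already uses the folder name.
        print("Could not save the page:", err)
        return None
```

With the variables from the code above, it would be called as save_page(soup, "imageFile", "SearchResults.html").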

This is similar to this question. Your problem is a feature of Unicode. In the beginning there was ASCII, and 128 characters were all anyone ever needed.

And then some bright people saw that 8 bits for a character would give them 256 characters, and thus were born code pages, where different systems would use the values 128-255 for symbols and letters from other languages. And all was good until people wanted to represent more than one language in a file, or heaven forbid, a language with more than 256 symbols.

And then some other bright people said, "Use more bits!" But how many? 16? 32? And what if I don't want my file size to double or quadruple? And more smart people said, "Simple, we'll use an encoding," and thus were born UTF-8, UTF-16 and their ilk. And more smart people said, "Let's give every character and symbol its one true number," and thus was born Unicode.
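
The difference between a character's number and its encoded size can be seen in a few lines of Python. This is only an illustrative sketch; the sample string is made up for the example.

```python
text = "h\u00e9llo \u200e"  # "é" is U+00E9, "\u200e" is the left-to-right mark

# Every character has one Unicode code point, independent of any encoding.
code_points = [ord(ch) for ch in text]

# An encoding decides how those code points become bytes.
utf8_bytes = text.encode("utf-8")       # 1 to 4 bytes per character
utf16_bytes = text.encode("utf-16-le")  # 2 or 4 bytes per character
utf32_bytes = text.encode("utf-32-le")  # always 4 bytes per character

print(code_points)
# [104, 233, 108, 108, 111, 32, 8206]
print(len(text), len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))
# 7 10 14 28
```

The same seven characters take 10, 14 or 28 bytes depending on the encoding, which is the file-size trade-off described above.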

'\u200e' is a Unicode character (the left-to-right mark) indicating text that displays left to right. It is invisible and has no keyboard equivalent.
saveFile = open(filename,"w") opens a text file with the platform's default encoding, which on your machine is cp1252 (see the traceback), an encoding that stores every character as a single 8-bit value. '\u200e' has a decimal value of 8,206, so it does not fit. To solve your problem you need to choose an encoding such as utf-8 explicitly, for example saveFile = open(filename, "w", encoding="utf-8"), so that your strings can be written to the file in a readable manner. Just changing the mode to saveFile = open(filename, "wb") kicks the can down the road to when you try to read the file.
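
A minimal sketch of both the failure and the fix, assuming a made-up sample string and a temporary directory instead of the real download and output folder:

```python
import os
import tempfile

text = "left-to-right mark: \u200e"

# cp1252 is a single-byte code page with no slot for U+200E, so encoding fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Naming the encoding explicitly lets the file hold any Unicode character.
path = os.path.join(tempfile.mkdtemp(), "Search Results.html")
with open(path, "w", encoding="utf-8") as save_file:
    save_file.write(text)

# Reading back with the same encoding returns the original string.
with open(path, "r", encoding="utf-8") as saved:
    round_trip = saved.read()

print(cp1252_ok, round_trip == text)
# False True
```

Passing the same encoding when reading is what avoids the problem that writing raw bytes would only postpone.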

Check out this article from Joel Spolsky on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

