I need to scrape web pages. Specifically, I use PycURL and BytesIO.
The following code:
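import certifi
import pycurl
from io import BytesIO

# userAgent and URL are defined elsewhere
c = pycurl.Curl()
page = BytesIO()
c.setopt(c.INTERFACE, "tun0")  # send the request through the tun0 interface
c.setopt(c.USERAGENT, userAgent)
c.setopt(pycurl.CAINFO, certifi.where())  # CA bundle for TLS verification
c.setopt(c.URL, URL)
c.setopt(c.WRITEDATA, page)  # write the response body into the buffer
c.perform()
c.close()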
Until yesterday, page.getvalue() returned the HTML of the page, which I then passed to bs4. Today, however, it returns bytes that I cannot decode as UTF-8; the attempt raises:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
How can I get the content of the URL as a str, so that I can pass it to bs4 and scrape it?
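The data you retrieved is not valid UTF-8, so it cannot be decoded directly. The byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so the server is most likely now sending a gzip-compressed body, which has to be decompressed before it can be decoded.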
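Since 0x8b at position 1 is what a gzip stream looks like (gzip data starts with 0x1f 0x8b), one way to get a str is to gunzip the bytes before decoding. The following is a minimal sketch under that assumption; the helper name body_to_text is made up for illustration, and the sample bytes stand in for page.getvalue() so the snippet runs without a network connection.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def body_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: return the body as str, gunzipping it first if needed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Stand-in for page.getvalue(): a gzip-compressed HTML body.
compressed = gzip.compress("<html><body>hello</body></html>".encode("utf-8"))
html = body_to_text(compressed)
print(html)
```

In the original code this would be used as body_to_text(page.getvalue()), and the result passed to bs4. Alternatively, setting c.setopt(pycurl.ACCEPT_ENCODING, "") before c.perform() should make libcurl advertise the encodings it supports and decompress the response itself, in which case page.getvalue() already holds plain HTML bytes. Both approaches assume the page itself is UTF-8 encoded.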