pyCurl and BytesIO for scraping a website

I need to scrape a website, and I use pycurl and BytesIO to fetch the page.

I use the following code:

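import certifi
import pycurl
from io import BytesIO

# userAgent and URL are assumed to be defined earlier in the script
c = pycurl.Curl()
page = BytesIO()  # buffer that receives the raw response body
c.setopt(c.INTERFACE, "tun0")
c.setopt(c.USERAGENT, userAgent)
c.setopt(pycurl.CAINFO, certifi.where())
c.setopt(c.URL, URL)
c.setopt(c.WRITEDATA, page)
c.perform()
c.close()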

Until yesterday, page.getvalue() returned the HTML of the page, which I then passed to bs4. Today, however, it returns bytes that I cannot decode as UTF-8; the attempt fails with this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

How can I get the content of the URL as a str, so that I can pass it to bs4 and scrape it?

The data you retrieved is not valid UTF-8, so it cannot be decoded as UTF-8 as-is.

  • Use the headers returned with the response (for example Content-Type and Content-Encoding) to identify what encoding the body is supposed to be in. If the encoding is not UTF-8, decode using the correct encoding.
  • If the body is claimed to be in UTF-8 but contains invalid data, use the second argument to bytes.decode (errors, for example "replace" or "ignore") to specify what to do about the invalid data.
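
The two points above can be sketched with the standard library alone. A byte 0x8b at position 1 is consistent with the gzip signature (0x1f 0x8b), so the body in the question may simply be gzip-compressed; that is an inference from the error message, not something the question confirms. The helper names charset_from_content_type and body_to_text are hypothetical, and the header value is assumed to have been captured separately (the original code does not collect response headers).

```python
import gzip
from email.message import Message

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def body_to_text(raw, content_type=None, errors="strict"):
    """Turn a raw response body (bytes) into str.

    The gzip magic bytes are sniffed instead of trusting Content-Encoding,
    so a body that was already decompressed is left alone.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    charset = charset_from_content_type(content_type)
    # errors is the second argument of bytes.decode:
    # "strict" raises, "replace" substitutes U+FFFD, "ignore" drops bad bytes
    return raw.decode(charset, errors)


if __name__ == "__main__":
    # Simulated gzip-compressed response body, standing in for page.getvalue()
    compressed = gzip.compress("<p>café</p>".encode("utf-8"))
    print(body_to_text(compressed, "text/html; charset=utf-8"))
```

With the code from the question, the call would be something like body_to_text(page.getvalue(), content_type), and the resulting str can be passed to bs4. An alternative that may avoid manual decompression is c.setopt(pycurl.ACCEPT_ENCODING, ""), which asks libcurl to request the compression schemes it supports and decompress the response itself.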

