pyCurl and BytesIO for scraping a website

Question

I need to scrape a website, and I use PycURL with BytesIO to do it.

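I run the following code:

from io import BytesIO

import certifi
import pycurl

# userAgent and URL are assumed to be defined earlier
c = pycurl.Curl()
page = BytesIO()  # buffer that receives the response body
c.setopt(c.INTERFACE, "tun0")  # send the request through the tun0 interface
c.setopt(c.USERAGENT, userAgent)
c.setopt(pycurl.CAINFO, certifi.where())  # CA bundle used for TLS verification
c.setopt(c.URL, URL)
c.setopt(c.WRITEDATA, page)
c.perform()
c.close()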

Until yesterday, page.getvalue() returned the HTML of the page, which I then passed to bs4. Today, however, it returns bytes that I cannot decode as UTF-8. The attempt fails with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

How can I get the content of the URL as a str, so that I can pass it to bs4 and scrape it?

Answer 1

The data you retrieved is not valid UTF-8, so it cannot be decoded as UTF-8.

Use the headers returned with the response to identify which encoding the body is supposed to be in. If the encoding is not UTF-8, decode with the correct codec.
If the body claims to be UTF-8 but contains invalid data, use the second argument to bytes.decode (errors, for example "replace" or "ignore") to specify how the invalid bytes are handled.
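
The two steps above can be sketched roughly as follows. This is a minimal illustration, not a drop-in fix. The helper names charset_from_content_type and decode_body are hypothetical, and the headers are assumed to be available as a dict with lower-cased names (with PycURL they could be collected through the HEADERFUNCTION option). One more thing may be worth checking: byte 0x8b at position 1 matches the second byte of the gzip signature (0x1f 0x8b), so the body may simply be gzip-compressed, in which case the response would normally carry a Content-Encoding: gzip header. The sketch handles that case as well, by looking only at the signature bytes.

```python
import gzip
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present
    return msg.get_content_charset() or default


def decode_body(raw, headers, errors="replace"):
    """Turn raw response bytes into str.

    `headers` is assumed to be a dict with lower-cased header names.
    """
    # A gzip stream starts with the signature bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    charset = charset_from_content_type(headers.get("content-type", ""))
    try:
        # The second argument tells decode() what to do with invalid bytes.
        return raw.decode(charset, errors)
    except LookupError:
        # The server announced a charset label Python does not know.
        return raw.decode("utf-8", errors)


# Example: a gzip-compressed UTF-8 body decodes back to the original markup.
body = gzip.compress("<p>hi</p>".encode("utf-8"))
print(decode_body(body, {"content-type": "text/html; charset=utf-8"}))
```

In the question's code, the input would be page.getvalue(), and the resulting str can be passed to bs4. If the body does turn out to be compressed, an alternative is to let libcurl decompress it by setting its Accept-Encoding option (ACCEPT_ENCODING in PycURL), which would make the signature check unnecessary.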

solution1
0 2021-01-14 23:47:04