
How do I convert Python crawled Bing web page content to human-readable?

I'm playing around with crawling the Bing web search page using Python. The raw content I receive looks like bytes, but my attempt to decompress it has failed. Does anyone have a clue what kind of data this is, and how I should extract readable text from it? Thanks!

My code displays the raw content and then tries to gunzip it, so you can see both the raw content and the error from the decompression. Because the raw content is too long, I've pasted only the first few lines below.

Code:

import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')  # urlopen needs a full URL with scheme
req.add_header('upgrade-insecure-requests', 1)
ResPage = Request.urlopen(req).read()
print("RAW Content: %s" % ResPage)  # show raw content of the page
print("Try decompression:")
print(gzip.decompress(ResPage))     # attempt gzip decompression

Result:

RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...

Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')


Process finished with exit code 1

It's much easier to get started with the requests library, which is also the most commonly used library for HTTP requests nowadays.

Install requests in your Python environment:

pip install requests

In your .py file:

import requests

r = requests.get("http://www.bing.com")

print(r.text)
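One reason r.text comes back readable is that requests negotiates compression for you: it sends an Accept-Encoding header and transparently decompresses a gzipped body. By contrast, gzip.decompress only accepts a stream that starts with the gzip magic bytes \x1f\x8b, which is why the question's payload beginning with b'+p' raised "Not a gzipped file". A minimal offline sketch of that distinction (the HTML sample is made up for illustration):

```python
import gzip

html = b"<html><body>Hello Bing</body></html>"  # stand-in for a real response body
compressed = gzip.compress(html)

# A genuine gzip stream always starts with the magic bytes \x1f\x8b;
# gzip.decompress rejects anything else, e.g. the asker's b'+p' prefix.
print(compressed[:2])                           # b'\x1f\x8b'
print(gzip.decompress(compressed).decode("utf-8"))
```

requests performs the equivalent of that decompress step internally before exposing r.text, so no manual gzip handling is needed.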

In addition to Zilong Li's answer, you need to pass a user-agent in the request headers so the request looks like a visit from a real user.

If no user-agent is passed in the request headers when using the requests library, it defaults to python-requests, so Bing or another search engine can tell the request comes from a bot/script and block it. Check what your user-agent is.

Pass user-agent using requests library:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)
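You can verify both the default identity and the override without hitting the network by preparing the request and inspecting its headers. A short sketch (the search URL and query are illustrative):

```python
import requests

# The identity requests sends when you don't set a User-Agent yourself:
print(requests.utils.default_user_agent())  # "python-requests/<version>"

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    )
}

# Prepare the request (no network call) to see exactly what would be sent:
req = requests.Request("GET", "https://www.bing.com/search",
                       params={"q": "python"}, headers=headers)
prepared = req.prepare()
print(prepared.url)                      # https://www.bing.com/search?q=python
print(prepared.headers["User-Agent"])    # the browser-like string above
```

Inspecting the prepared request is a handy way to debug scraping issues before sending anything to the server.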

See also: how to reduce the chance of being blocked while web scraping search engines.


Alternatively, you can achieve the same thing with the Bing Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines; instead, you can focus on extracting the data you need from the structured JSON. Check out the playground.

Disclaimer, I work for SerpApi.
