
How do I convert Python crawled Bing web page content to human-readable?

I'm playing with crawling the Bing web search page using Python. I find that the raw content received looks like bytes, but my attempt to decompress it failed. Does anyone know what kind of data this is, and how I can extract something readable from it? Thanks!

My code displays the raw content and then tries to gunzip it, so you can see the raw content as well as the error from the decompression. Since the raw content is too long, I've pasted only the first few lines below.

Code:

import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')  # the URL needs a scheme
req.add_header('upgrade-insecure-requests', 1)
ResPage = Request.urlopen(req).read()
print("RAW Content: %s" % ResPage)  # show raw content of the page
print("Try decompression:")
print(gzip.decompress(ResPage))     # try decompression

Result:

RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...

Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')


Process finished with exit code 1

It's much easier to get started with the requests library. Plus, it is also the most commonly used library for HTTP requests nowadays.

Install requests in your Python environment:

pip install requests

In your .py file:

import requests

r = requests.get("http://www.bing.com")

print(r.text)
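As a side note, a quick way to see why the gzip attempt in the question failed: every gzip stream begins with the magic bytes 0x1f 0x8b, while the raw content above starts with b'+p', so it is some other encoding (requests handles the content decoding for you either way). A minimal sketch of the check:

```python
import gzip

def looks_like_gzip(data: bytes) -> bool:
    # Every gzip stream begins with the two magic bytes 0x1f 0x8b
    return data[:2] == b"\x1f\x8b"

print(looks_like_gzip(gzip.compress(b"hello")))  # True
print(looks_like_gzip(b"+p\xe70\x0bi{)"))        # False: not a gzip stream
```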

In addition to Zilong Li's answer, you need to pass a user-agent in the request headers so the request looks like a "real" user visit.

If no user-agent is passed in the request headers while using the requests library, it defaults to python-requests, so Bing or another search engine can tell it's a bot/script and block the request. Check what your user-agent is.

Pass a user-agent using the requests library:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)
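To see the difference this makes, you can inspect the default identifier requests would otherwise send, and check that a prepared request actually carries your custom value (a sketch; the exact version number depends on your installed requests):

```python
import requests

# The User-Agent requests sends when you don't set one yourself,
# e.g. "python-requests/2.31.0" -- an obvious bot signature.
print(requests.utils.default_user_agent())

# With an explicit header, the prepared request carries your value instead.
headers = {"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
prepared = requests.Request("GET", "https://www.bing.com",
                            headers=headers).prepare()
print(prepared.headers["User-Agent"])
```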

How to reduce the chance of being blocked while web scraping search engines.


Alternatively, you can achieve the same thing by using the Bing Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, you can focus on the data that needs to be extracted from the structured JSON. Check out the playground.

Disclaimer: I work for SerpApi.
