
Can I scrape all URL results using Python from a google search without getting blocked?

I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.

What I would like to do is use Python to scrape all of the URLs from a Google search, which I can then use in a separate script to do text analysis of a large corpus (news sites mainly). This seems relatively straightforward, but none of the attempts I've tried have worked properly.

This is as close as I got:

# `search` comes from the third-party "google" package (pip install google)
from google import search

for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)

This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.

First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
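As a hypothetical sketch of the "txt file of URLs" part (this is an illustrative assumption, not code from the question): the same third-party "google" package accepts a `pause` keyword that spaces out requests, and a longer pause makes blocking less likely, though it is no guarantee.

```python
def save_urls(urls, path):
    """Write one URL per line so another script can read the corpus list."""
    with open(path, "w") as f:
        for url in urls:
            f.write(url + "\n")

# Usage (requires `pip install google` and network access):
# from google import search
# query = 'site:cbc.ca "kinder morgan" and "trans mountain" and protest*'
# save_urls(search(query, stop=100, pause=30.0), "urls.txt")
```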

It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason why you've been blocked.

The relevant clause in Google's Terms of Service:

Don't misuse our Services. For example, don't interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.

I haven't been able to find a definite number, but it seems like their limit on the number of search queries per day is rather strict too: 100 search queries per day, according to their JSON Custom Search API documentation.
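If you do go the official route, a request to the JSON Custom Search API might look like the sketch below. This is an illustrative assumption, not code from the answer: you would first need to create a Programmable Search Engine and an API key (the `api_key` and `engine_id` parameters here are placeholders), and the API pages results 10 at a time.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/customsearch/v1"

def extract_links(response_json):
    """Pull the result URLs out of one page of an API response."""
    return [item["link"] for item in response_json.get("items", [])]

def fetch_page(query, api_key, engine_id, start=1):
    """Fetch one 10-result page; `start` is 1-based (1, 11, 21, ...)."""
    params = urllib.parse.urlencode(
        {"key": api_key, "cx": engine_id, "q": query, "start": start})
    with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
        return json.loads(resp.read())
```

Note the hard cap: even paging through every `start` offset, this API tops out well below the ~1000 results you see in the browser.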

Nonetheless, there's no harm trying out other alternatives to see if they work better:

  1. BeautifulSoup
  2. Scrapy
  3. ParseHub - this one is not in code, but is a useful piece of software with good documentation. Link to their tutorial on how to scrape a list of URLs.
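For the BeautifulSoup option, keep in mind it only parses HTML you have already fetched, so it does not solve the blocking problem by itself. A minimal sketch of pulling every absolute link out of a page (a generic illustration, with a hypothetical `page_links` helper, not code from the answer):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def page_links(html):
    """Return all absolute URLs found in <a href="..."> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]
```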
