
Can I scrape all URL results using Python from a google search without getting blocked?

I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.

What I would like to do is use Python to scrape all of the URLs from a Google search, which I can then use in a separate script to do text analysis of a large corpus (news sites mainly). This seems relatively straightforward, but none of the attempts I've tried have worked properly.

This is as close as I got:

# `search` comes from the third-party "google" package (pip install google)
from google import search

for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)

This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.

First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
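As a hypothetical sketch of the "txt file of URLs" part (this is an illustrative assumption, not code from the question): the same third-party "google" package accepts a `pause` keyword that spaces out requests, and a longer pause makes blocking less likely, though it is no guarantee.

```python
def save_urls(urls, path):
    """Write one URL per line so another script can read the corpus list."""
    with open(path, "w") as f:
        for url in urls:
            f.write(url + "\n")

# Usage (requires `pip install google` and network access):
# from google import search
# query = 'site:cbc.ca "kinder morgan" and "trans mountain" and protest*'
# save_urls(search(query, stop=100, pause=30.0), "urls.txt")
```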

It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason why you've been blocked.

The relevant clause in Google's Terms of Service:

Don't misuse our Services. For example, don't interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.

I haven't been able to find a definite number, but it seems like their limit on the number of search queries per day is rather strict too: 100 search queries per day, according to their JSON Custom Search API documentation.
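If you do go the official route, a request to the JSON Custom Search API might look like the sketch below. This is an illustrative assumption, not code from the answer: you would first need to create a Programmable Search Engine and an API key (the `api_key` and `engine_id` parameters here are placeholders), and the API pages results 10 at a time.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/customsearch/v1"

def extract_links(response_json):
    """Pull the result URLs out of one page of an API response."""
    return [item["link"] for item in response_json.get("items", [])]

def fetch_page(query, api_key, engine_id, start=1):
    """Fetch one 10-result page; `start` is 1-based (1, 11, 21, ...)."""
    params = urllib.parse.urlencode(
        {"key": api_key, "cx": engine_id, "q": query, "start": start})
    with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
        return json.loads(resp.read())
```

Note the hard cap: even paging through every `start` offset, this API tops out well below the ~1000 results you see in the browser.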

Nonetheless, there's no harm trying out other alternatives to see if they work better:

  1. BeautifulSoup
  2. Scrapy
  3. ParseHub - this one is not in code, but is a useful piece of software with good documentation. Link to their tutorial on how to scrape a list of URLs.
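For the BeautifulSoup option, keep in mind it only parses HTML you have already fetched, so it does not solve the blocking problem by itself. A minimal sketch of pulling every absolute link out of a page (a generic illustration, with a hypothetical `page_links` helper, not code from the answer):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def page_links(html):
    """Return all absolute URLs found in <a href="..."> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]
```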
