

Need to throttle scraper to only hit website every 4s from a python list of URLS - scraperapi, scrapy, python

I'm scraping a Python list of web domains and would like to put a 4-second delay between each scrape in order to comply with robots.txt. I'd also like each iteration to run asynchronously, so the loop keeps firing every 4 seconds regardless of whether the scrape for that particular page has finished.

I have tried implementing asyncio.gather and coroutines, and was beginning to attempt throttling. However, my solutions were getting very complex, and I believe there must be a simpler way, or that I am missing something here. In one of my past versions I just put a sleep(4) inside the for loop, though as I now understand it this is bad because it blocks the entire interpreter, so the other loops won't run asynchronously during that time.

import csv

import requests
from bs4 import BeautifulSoup

csvFile = open('test.csv', 'w+', newline='')  # newline='' avoids blank rows from csv on Windows

urls = [
    'domain1', 'domain2', 'domain3',  # ...
]

YOURAPIKEY = '<KEY>'  # placeholder API key

writer = csv.writer(csvFile)
writer.writerow(('Scraped text', 'other info 1', 'other info 2'))

lastI = len(urls) - 1

for i, a in enumerate(urls):
    payload = {'api_key': YOURAPIKEY, 'url': a}
    r = requests.get('http://api.scraperapi.com', params=payload)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Leftover from an earlier Scrapy attempt; it is never called here
    # (and scrapy is not imported), so it has no effect:
    # def parse(self, response):
    #     scraper_url = 'http://api.scraperapi.com/?api_key=YOURAPIKEY&url=' + a
    #     yield scrapy.Request(scraper_url, self.parse)

    price_cells = soup.select('.step > b.whb:first-child')
    lastF = len(price_cells) - 1
    for f, price_cell in enumerate(price_cells):
        writer.writerow((price_cell.text.rstrip(), '...', '...'))
        print(price_cell.text.rstrip())

        # Close the CSV file after the last cell of the last URL has been written.
        if i == lastI and f == lastF:
            print('closing now')
            csvFile.close()

There are no errors with the above code that I can tell. I just want each iteration of the loop to keep firing at 4-second intervals, with the results coming back from each fetch written to the CSV file as they arrive.
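For reference, a minimal sketch of the staggered behaviour described above using asyncio (an illustration, not the original code: it assumes Python 3.9+ for asyncio.to_thread and reuses the placeholder domains and API key; the blocking requests.get call is handed to a worker thread so the 4-second stagger keeps ticking no matter how long each scrape takes):

import asyncio

import requests

urls = ['domain1', 'domain2', 'domain3']  # placeholder domains
YOURAPIKEY = '<KEY>'                      # placeholder API key

async def scrape(url):
    payload = {'api_key': YOURAPIKEY, 'url': url}
    # requests.get blocks, so run it in a worker thread; the event loop keeps running
    r = await asyncio.to_thread(requests.get, 'http://api.scraperapi.com', params=payload)
    print(url, r.status_code)

async def main():
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(scrape(url)))  # start this scrape immediately
        await asyncio.sleep(4)                          # wait 4 s before launching the next one
    await asyncio.gather(*tasks)                        # let any still-running scrapes finish

asyncio.run(main())

Unlike time.sleep(4), asyncio.sleep(4) suspends only the main coroutine, so scrapes that are still in flight keep running during the wait.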

In Scrapy the appropriate setting in the settings.py file would be:

DOWNLOAD_DELAY

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported.

DOWNLOAD_DELAY = 4 # 4s of delay

https://doc.scrapy.org/en/latest/topics/settings.html
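As an illustration of how that setting can be applied without touching settings.py, here is a hedged sketch of a spider that sets the delay through custom_settings (the spider name, start URL, and CSS selector are placeholders mirroring the question, not from the original post):

import scrapy

class PriceSpider(scrapy.Spider):
    name = 'prices'  # hypothetical spider name
    start_urls = ['http://api.scraperapi.com/?api_key=YOURAPIKEY&url=domain1']
    custom_settings = {
        'DOWNLOAD_DELAY': 4,                # wait 4 s between requests to the same site
        'RANDOMIZE_DOWNLOAD_DELAY': False,  # keep the delay fixed at exactly 4 s
    }

    def parse(self, response):
        # placeholder selector, matching the BeautifulSoup one used in the question
        for text in response.css('.step > b.whb:first-child::text').getall():
            yield {'scraped_text': text.strip()}

Scrapy applies the delay per download slot (per domain by default), and its downloader is asynchronous, so later requests are not held up by slow responses, which is essentially the behaviour asked for.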
