
time.sleep() function not working within Scrapy recursive webscraper

I am using Python.org version 2.7 64-bit on Windows Vista 64-bit. I have some recursive web scraping code that is being caught by anti-scraping measures on a site I am looking at:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()), 
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]
    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for script in scripts:  # renamed from 'scripts' to avoid shadowing the selector list
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')
            time.sleep(5)  # pauses this callback, but does not throttle requests already scheduled

execute(['scrapy','crawl','goal3'])

In order to stop this from happening, I have tried adding a basic time.sleep() call to slow down the rate at which requests are submitted. However, when running the code via the Command Prompt, it does not seem to have any effect: the code continues to run at the same speed, and all the requests come back as HTTP 403.

Can anyone see why this might not be working?

Thanks

Don't reinvent the wheel. Scrapy schedules and downloads requests concurrently through its own event loop, so a time.sleep() inside a parse callback is not a reliable way to throttle them. The DOWNLOAD_DELAY setting is what you are looking for:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.
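A minimal sketch of how this could be configured, assuming a standard project layout; the 5-second value simply mirrors the time.sleep(5) from the question, and the RANDOMIZE_DOWNLOAD_DELAY line is an optional extra, not part of the original answer:

# settings.py
DOWNLOAD_DELAY = 5   # wait 5 seconds between consecutive requests to the same site
# Randomize the actual wait to 0.5-1.5 x DOWNLOAD_DELAY so the request
# pattern looks less mechanical (this is Scrapy's default behaviour).
RANDOMIZE_DOWNLOAD_DELAY = True

The same delay can also be set per spider through a download_delay attribute on the spider class, which takes precedence over the project-wide setting.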

There are other techniques, like rotating User Agents and IP addresses; see the Avoid Getting Banned section for more, and the sketch below.
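For example, a rough sketch of a User-Agent rotating downloader middleware; the class name, the USER_AGENTS list, and the middleware priority below are illustrative assumptions, not code from the linked section:

# middlewares.py (illustrative sketch)
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware(object):
    """Attach a randomly chosen User-Agent header to each outgoing request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

# settings.py -- enable it, and disable the stock middleware so it does
# not interfere (the 400 priority is an assumption):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RandomUserAgentMiddleware': 400,
#     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
# }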

Also, make sure you know what the Terms of Use of the website are. Make sure they don't prohibit web crawling, and check whether the site provides an API.
