
Scrapy gets stuck crawling a long list of urls

I am scraping a large list of URLs (around 1000), and after a set time the crawler gets stuck at 0 pages/min. The problem always occurs at the same spot in the crawl. The list of URLs is retrieved from a MySQL database. I am fairly new to Python and Scrapy, so I don't know where to start debugging, and I fear that, due to my inexperience, the code itself is also a bit of a mess. Any pointers to where the issue lies are appreciated.

I used to retrieve the entire list of URLs in one go, and the crawler worked fine. However, I had problems writing the results back into the database, and I didn't want to read the whole large list of URLs into memory, so I changed the spider to step through the database one URL at a time, which is when the problem appeared. I am fairly certain the URL itself isn't the issue: when I start the crawl from the problem URL, it works without issue and then gets stuck further down the line at a different, but again consistent, spot.

The relevant parts of the code are as follows. Note that the script is supposed to be run as a standalone script, which is why I define the necessary settings in the spider itself (a sketch of a typical standalone entry point follows the code).

from scrapy import Request
from scrapy.spiders import CrawlSpider

# db, cursor, i and n_urls are module-level names; the MySQL connection
# setup is omitted from this snippet.

class MySpider(CrawlSpider):
    name = "mySpider"
    item = []
    #spider settings
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DEPTH_LIMIT': 1,
        'DNS_TIMEOUT': 5,
        'DOWNLOAD_TIMEOUT':5,
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1
    }


    def start_requests(self):
        # i is a module-level counter that parse() and errback() advance;
        # with CONCURRENT_REQUESTS = 1 this generator is only resumed after
        # the previous request has been handled.
        while i < n_urls:
            cursor = db.cursor()
            # parameterized query instead of string concatenation
            cursor.execute("SELECT url FROM database WHERE id = %s", (i,))
            row = cursor.fetchone()  # one (url,) tuple for this id
            yield Request(str(row[0]), callback=self.parse, errback=self.errback)

    def errback(self, failure):
        # mark this row as failed and advance to the next id
        global i
        sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
        val = ('Error', str(i))
        cursor.execute(sql, val)
        db.commit()
        i += 1


    def parse(self, response):
        global i
        item = myItem()
        item["result"] = response.xpath("//item to search")
        if item["result"] is None or len(item["result"]) == 0:
            sql = "UPDATE db SET, item = %s, scrape_time = now() WHERE id = %s"
            val = ('None', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
        else:
            sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
            val = ('Item', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
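
For context, here is a minimal sketch of a typical standalone entry point for such a spider. The question doesn't show this part, so the exact invocation is an assumption; CrawlerProcess applies the spider's custom_settings automatically.

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # hypothetical standalone entry point, not shown in the question
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes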

The scraper gets stuck, showing the following message:

2019-01-14 15:10:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET someUrl> from <GET anotherUrl>
2019-01-14 15:11:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 9 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:12:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:13:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:14:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:15:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:16:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Everything works fine up until that point. Any help you could give me is appreciated!

The reason Scrapy says 0 items is that it counts the data you yield, and you never yield anything; you only insert into the database.
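
For illustration, a minimal sketch of the fix this answer implies: yield the item from parse so that Scrapy's stats and item pipelines actually see it (myItem and the XPath placeholder are the question's own):

    def parse(self, response):
        item = myItem()
        item["result"] = response.xpath("//item to search")
        # ... database bookkeeping as in the question ...
        yield item  # without a yield, logstats keeps reporting "scraped 0 items"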

I just had this happen to me, so I want to share what caused the bug, in case someone encounters the exact same issue.

Apparently, if you don't specify a callback for a Request, it defaults to the spider's parse method as the callback (my intention was for those requests to have no callback at all).

In my spider, I used the parse method to make most of the Requests, so this behavior caused many unnecessary requests that eventually led to Scrapy crashing. Simply adding an empty callback function (lambda a: None) to those requests solved my issue.
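
A short, self-contained sketch of the behavior this answer describes; the spider name and URLs are hypothetical:

from scrapy import Request, Spider

class PingSpider(Spider):
    name = "ping"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # With no callback argument, the response is routed back to parse():
        yield Request("https://example.com/a")  # implicit callback=self.parse
        # An explicit no-op callback keeps parse() out of the loop entirely:
        yield Request("https://example.com/b", callback=lambda response: None)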


 