
Crawl again from the exceptional page using Python

I use a for loop to crawl web pages. However, I run into an IP request limit error when crawling some pages. I have tried making Python sleep for a few seconds after every 20 pages, but the error persists. After Python sleeps for 60 seconds, I can start crawling again.

The problem is that each time there is an exception, I lose a page of information: with the try-except approach, Python simply skips over the page that raised the exception.

I am wondering whether the best way is to restart crawling from the page that encountered the exception.

My question is how to restart crawling from the exceptional page.

import time

pageNum = 0

for page in range(1, 200):
    pageNum += 1
    if pageNum % 20 == 0:  # every 20 pages, sleep 180 secs
        print 'sleep 180 secs'
        time.sleep(180)  # to stay under the IP request limit
    try:
        for obj in api.repost_timeline(id=id, count=200, page=page):
            mid = obj.id
            # my code here to store data
    except:
        print "IP request limit", page
        time.sleep(60)

Use a stack of pages: pop a page, and if it fails, append it again.

from collections import deque

# push pages in reverse so page 1 is popped first
page_stack = deque()
for page in range(199, 0, -1):
    page_stack.append(page)

while page_stack:
    page = page_stack.pop()

    try:
        pass  # do the crawling for `page` here
    except IPLimitException:  # placeholder for whatever your API raises
        page_stack.append(page)  # push the failed page back to retry it

This code can run into an infinite loop. Depending on your needs, you can set a threshold on the number of attempts: keep a counter and stop appending the page back onto the stack once that threshold is exhausted, as sketched below.
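A minimal sketch of that idea, assuming a hypothetical fetch_page function and a MAX_RETRIES threshold (neither name comes from the original answer):

from collections import deque

MAX_RETRIES = 5  # assumed threshold; tune to your needs

# store (page, retries) pairs so each page carries its own counter
page_stack = deque((page, 0) for page in range(199, 0, -1))

while page_stack:
    page, retries = page_stack.pop()
    try:
        fetch_page(page)  # placeholder for your crawling code
    except IPLimitException:
        if retries + 1 < MAX_RETRIES:
            # push back with an incremented counter to retry later
            page_stack.append((page, retries + 1))
        else:
            print "giving up on page", page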

To keep the code as close as possible to yours, you could do something like:

import time

pageNum = 0

for page in range(1, 200):
    pageNum += 1
    if pageNum % 20 == 0:  # every 20 pages, sleep 180 secs
        print 'sleep 180 secs'
        time.sleep(180)  # to stay under the IP request limit
    succeeded = False
    while not succeeded:
        try:
            for obj in api.repost_timeline(id=id, count=200, page=page):
                mid = obj.id
                # my code here to store data
            succeeded = True
        except:
            print "IP request limit", page
            time.sleep(60)

Of course, you may want to include some sort of retry limit instead of risking an endless loop, as in the sketch below. By the way, you can also get rid of pageNum (just use page).
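A sketch combining both suggestions, with a retry cap and pageNum removed (MAX_RETRIES is an assumed name; the bare except is kept from the original code):

import time

MAX_RETRIES = 5  # assumed cap on attempts per page

for page in range(1, 200):
    if page % 20 == 0:  # pageNum always equals page here, so use page directly
        print 'sleep 180 secs'
        time.sleep(180)
    for attempt in range(MAX_RETRIES):
        try:
            for obj in api.repost_timeline(id=id, count=200, page=page):
                mid = obj.id
                # my code here to store data
            break  # page succeeded, move on
        except:
            print "IP request limit", page
            time.sleep(60)
    else:
        # for-else: runs only if no break happened, i.e. all attempts failed
        print "giving up on page", page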
