
Global Variable Reset not working in Google App Engine

I am calling a web-crawling function from a handler in GAE; it retrieves a few images and then displays them. It works just fine on the first call, but the next time it displays all the same images and the crawler starts up from where the last run left off. I think the problem is that my global variables are not being reset correctly.

Every time I redeploy the app it works correctly the first time, but then the problem begins.

Here is my code; please let me know if you need me to clarify anything, but I think it should make sense.

Here is the scraper function:

from collections import deque

#module-level crawl state -- this is what fails to reset between requests
visited_pages = []
visit_queue = deque([])
collected_pages = []
collected_pics = []
count = 0
pic_count = 0

def scrape_pages(url, root_url, keywords=[], recurse=True):
    #variables
    max_count = 16
    pic_num = 100

    global count
    global pic_count
    global collected_pics
    global collected_pages

    print 'the keywords and url are'
    print keywords
    print url

    #this is all of the links that have been scraped
    the_links = []

    soup = soupify_url(url)

    #only add new pages onto the queue if the recursion argument is true    
    if recurse:
        #find all the links on the page
        try:
            for tag in soup.findAll('a'):
                the_links.append(tag.get('href'))
        except AttributeError:
            return

        try:
            external_links, internal_links, root_links, primary_links = categorize_links(the_links, url, root_url)
        except TypeError:
            return


        #change it so this depends on the input
        links_to_visit = external_links + internal_links + root_links

        #build the queue
        for link in links_to_visit:
            if link not in visited_pages and link not in visit_queue:
                visit_queue.append(link)

    visited_pages.append(url)
    count = count + 1
#    print 'number of pages visited'
#    print count

    #add pages to collected_pages depending on the criteria given if any keywords are given
    if keywords:
        page_to_add = find_pages(url, soup, keywords)

#        print 'page to add'
#        print page_to_add
        if page_to_add and page_to_add not in collected_pages:
            collected_pages.append(page_to_add)


    pics_to_add = add_pics(url, soup)
#    print 'pics to add'
#    print pics_to_add
    if pics_to_add:
        collected_pics.extend(pics_to_add)
        pic_count += len(pics_to_add)  #without this, pic_count never grows and the pic_num cap is never reached

    #here is where the actual recursion happens by finishing the queue
    while visit_queue:
        if count >= max_count:
            return

        if pic_count > pic_num:
            return

        link = visit_queue.popleft()
#        print link
        scrape_pages(link, root_url, keywords)

#    print '***done***'
    ###done with the recursive scraping function here

#here I just get a list of links from Bing, add them to the queue, go through them, and then reset all the global variables
def scrape_bing_src(keywords):
    #declare the globals before the first assignment below; otherwise
    #Python treats visit_queue as a local and warns about the later global
    global collected_pics
    global pic_count
    global count
    global visited_pages
    global visit_queue

    visit_queue, the_url = scrape_bing.get_links(keywords, a_list=False)
    scrape_pages(visit_queue.popleft(), the_url, keywords, recurse=True)

    pic_count = 0
    count = 0
    visited_pages = []
    visit_queue = deque([])

    pics_to_return = collected_pics
    collected_pics = []
    return pics_to_return
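
One way to sidestep the reset problem entirely would be to keep no crawl state at module level and instead build it fresh for every request. Here is a minimal sketch of that idea (the CrawlState holder is hypothetical, not part of the original code; scrape_pages would take the state object as a parameter instead of declaring globals):

from collections import deque

class CrawlState(object):
    """All per-crawl state, created fresh for each request."""
    def __init__(self):
        self.visited_pages = []
        self.visit_queue = deque([])
        self.collected_pages = []
        self.collected_pics = []
        self.count = 0
        self.pic_count = 0

def scrape_bing_src(keywords):
    state = CrawlState()
    #...pass `state` into scrape_pages instead of touching globals...
    return state.collected_pics

Because nothing survives the request, there is nothing to reset and nothing for a recycled instance to remember.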

Here is the handler that calls the scraper function:

#this simply displays the images
class Try(BlogHandler):
    def get(self, keyword):
        keyword = str(keyword)
        keyword_list = keyword.split()
        img_list = scraper.scrape_bing_src(keyword_list)

        for img in img_list:
            self.response.write("<br><img src='%s'>" % img)

        self.response.write('we are done here')

Your code isn't run inside only one "server" and one instance; you have probably already noticed the Instances tab in the admin console. So there is a chance that even between calls you will be switched to a different server, or the process will be "restarted" (you can read more here). During the warmup process your application is read from disk into memory and then starts to handle requests. So each time you may get a new precached Python instance with its own global variable values.
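
A tiny handler makes this visible (CounterDemo is hypothetical, just for illustration): the module-level counter keeps growing across requests served by the same instance, but starts over whenever GAE spins up a fresh one.

request_count = 0

class CounterDemo(BlogHandler):
    def get(self):
        global request_count
        request_count += 1
        self.response.write('requests served by this instance: %d' % request_count)

Hitting this handler repeatedly can return 1, 2, 3 from one warm instance and then suddenly 1 again when a request lands on a fresh one; conversely, state you expected to be gone can still be there when a later request lands on the same warm instance.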

In your case it is better to use memcache.
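
A minimal sketch of that approach, assuming the existing scraper module (the cached_scrape wrapper and the key format are hypothetical):

from google.appengine.api import memcache

import scraper

def cached_scrape(keywords):
    #cache the scraped image list under a key derived from the keywords,
    #so any instance can reuse it instead of relying on module globals
    key = 'pics:' + ' '.join(keywords)
    pics = memcache.get(key)
    if pics is None:
        pics = scraper.scrape_bing_src(keywords)
        #keep the result for an hour; memcache may still evict it earlier
        memcache.set(key, pics, time=3600)
    return pics

The handler would then call cached_scrape(keyword_list) instead of scraper.scrape_bing_src(keyword_list); the crawl runs at most once per keyword set per hour, and the cached result is shared across all instances.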
