
Memory usage while parsing XML file into Google App Engine datastore

I'm trying to parse a large (5 GB) XML file (a product catalog) into the Google App Engine datastore. The problem is that it uses a lot of memory. I was able to bring down the memory used by the parsing step by reading the file line by line and deleting elements as I go. However, something is still being retained.

My code is at http://pastebin.com/ESARQikC

I believe the issue is occurring in this specific function:
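
For reference, the incremental parsing described above can be sketched roughly as follows. This is a minimal illustration using the standard library's `xml.etree.ElementTree.iterparse`, not the exact code from the pastebin; the `iter_products` helper, the `<catalog>` root and the sample data are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_products(source):
    """Yield one dict per <product>, freeing each element once it is used."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "product":
            # Copy out the child text before the element is cleared.
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop this product's children and text
            root.clear()  # drop the root's references to finished products


# Tiny in-memory stand-in for the real catalog file (assumed structure).
sample = io.BytesIO(
    b"<catalog>"
    b"<product><name>A</name><sku>1</sku></product>"
    b"<product><name>B</name><sku>2</sku></product>"
    b"</catalog>"
)
products = list(iter_products(sample))
```

Clearing the root as well as the finished element matters: otherwise the root keeps a reference to every `<product>` it has seen, and memory still grows with the size of the file.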

import time

# Child tags of <product>. process_element assumes they arrive in this
# order, because it indexes self.Plist by position.
FIELD_TAGS = ('programname', 'name', 'description', 'sku', 'manufacturer',
              'price', 'buyurl', 'imageurl', 'advertisercategory')

def process_element(self, item):
    if item.tag in FIELD_TAGS:
        # Collect each field's text until the enclosing <product> closes.
        self.Plist.append(item.text)
    elif item.tag == 'product':
        Product(
            programname=("%s" % self.Plist[0]),
            name=("%s" % self.Plist[1]),
            description=("%s" % self.Plist[2][0:500]),
            sku=("%s" % self.Plist[3]),
            manufacturer=("%s" % self.Plist[4]),
            price=("%s" % self.Plist[5]),
            buyurl=("%s" % self.Plist[6]),
            imageurl=("%s" % self.getBigImageUrl(self.Plist[7])),
            advertisercategory=("%s" % self.Plist[8])).put()

        self.count += 1
        print self.count
        if self.count % 15000 == 0:
            time.sleep(10000)  # argument is in seconds
        for ob in self.Plist:
            del ob  # only unbinds the loop variable
        del self.Plist
        self.Plist = []
    del item  # only unbinds the local name

When I comment out the Product().put() line and run it, it can go through a huge number of lines without much memory impact. I added the sleep because I thought the subprocesses that GAE spawns to write the data to the datastore might need some time to finish. So I waited after adding 15000 items to see whether any RAM would be freed (I purged memory on the OS side as well), but it did not help. Is this something in my code, or something I can't change that is inherent to adding data to the datastore? I'm stuck and confused after hours/days of playing around with it.

Are you running this code on the development server? There is a known problem with the dev server's datastore using up memory; see "Why memory leaks occurs when using DataStore API on dev server". The reason is that it keeps all of your entities in an in-memory map.
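
If you are not sure which environment the import is running in, a check along these lines can tell the two apart. It is only a sketch: it assumes the Python runtime's convention that the `SERVER_SOFTWARE` environment variable starts with `Development` on the dev server, and `running_on_dev_server` is a hypothetical helper name, not part of the App Engine SDK.

```python
import os


def running_on_dev_server(environ=None):
    """Return True when SERVER_SOFTWARE looks like the local dev server."""
    # Assumption: the dev server reports something like "Development/2.0",
    # while production reports "Google App Engine/<version>".
    environ = os.environ if environ is None else environ
    return environ.get("SERVER_SOFTWARE", "").startswith("Development")
```

If this returns True during the import, the memory growth is more likely to come from the dev server's datastore than from the parsing code.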
