I am reading a 2.5 GB .osm file. The process takes around 15 min and about 4 GB of RAM (using the 64-bit version). After all lines are done and `print count_nodes-count` reaches zero, RAM usage skyrockets (HDD activity too) and the PC freezes. It never prints the last line, print 'last step-closing', ("--- %s seconds ---" % (time.time() - start_time)).
What is happening during execution? Any suggestions to avoid this?
My code:
import time
import xml.etree.ElementTree as etree

file = 'california.osm'
context = etree.iterparse(file)
start_time = time.time()
localtime = time.asctime(time.localtime(time.time()))
print "Start time :", localtime
count_nodes = 6132755
count = 0
list = []
with open('new_file.txt', 'w') as f:
    for event, elem in context:
        dict = {}
        if elem.tag == "node":
            count += 1
            lat = elem.get('lat')
            lon = elem.get('lon')
            dict['lat'] = lat
            dict['lon'] = lon
            for child in elem:
                key = child.get('k')
                val = child.get('v')
                dict[key] = val
                child.clear()
            elem.clear()
            if len(dict) > 2:
                i = str(dict)
                f.write(i)
                f.write('\n')
            print count_nodes - count
print 'last step-closing', ("--- %s seconds ---" % (time.time() - start_time))
f.close
I suppose that Python is flushing the buffer and writing the data to the hard drive (into f). Try adding the following line after the 'last step' print:
sys.stdout.flush()
And do not forget to import sys. If it is still too slow, switch to a faster language such as C++ or even Java; they have XML parsers too, and unless you depend on something Python-specific, they are better suited to big data.
Or try an existing parser such as imposm for Python.
Why have you written f.close at the bottom like an attribute access? You can remove it; the file has already been closed once control leaves the "with open" block. I agree with asalic that the data is probably being flushed at that point, but this still seems like a very doable task for Python.
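A quick way to see that the with block closes the file by itself (the filename here is just an example, not the asker's data):

```python
with open('new_file.txt', 'w') as f:
    f.write('hello\n')

# Leaving the with block already called f.close() for us.
# Writing f.close without parentheses merely looks up the bound
# method and throws it away -- it does nothing at all.
print(f.closed)  # True
```

So the trailing f.close line in the question is a no-op either way.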
Since you are using iterparse(), I'm not sure that clearing the elements once you are done with them gains you much speed. That said, you should remove the intermediary variables and do only one file write per loop, like so:
dict['lat'] = elem.get('lat')
dict['lon'] = elem.get('lon')
for child in elem:
    dict[child.get('k')] = child.get('v')
if len(dict) > 2:
    f.write("%s\n" % str(dict))
Also, you should skip the print statement, since the dataset is rather large.
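On the memory blow-up itself: with ElementTree's iterparse(), every parsed element stays attached to the root element until you detach it, so for millions of node elements memory keeps growing even if you clear each node. A minimal sketch (the function name and the file layout are my own illustration, not from the question) that grabs the root and clears it as it goes:

```python
import xml.etree.ElementTree as etree

def stream_nodes(path):
    # Ask for 'start' events too, so we can grab the root element
    # from the very first event before any children arrive.
    context = etree.iterparse(path, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == 'node':
            d = {'lat': elem.get('lat'), 'lon': elem.get('lon')}
            for child in elem:
                d[child.get('k')] = child.get('v')
            if len(d) > 2:  # only nodes that carry at least one tag
                yield d
            # Detach finished elements from the root so they can be
            # garbage-collected; this is what keeps memory flat.
            root.clear()
```

With a generator like this the main loop reduces to iterating over stream_nodes('california.osm') and writing each dict out as one line.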