I am reading a 2.5 GB .osm file. The process takes around 15 min and about 4 GB of RAM (using the 64-bit version). After all lines are done and `print count_nodes-count` reaches zero, RAM usage skyrockets (HDD activity too) and the PC freezes. It never prints the last line, print 'last step-closing', ("--- %s seconds ---" % (time.time() - start_time)).
What is happening during execution? Any suggestions to avoid this?
My code:
import time
import xml.etree.ElementTree as etree

file = 'california.osm'
context = etree.iterparse(file)
start_time = time.time()
localtime = time.asctime(time.localtime(time.time()))
print "Start time :", localtime
count_nodes = 6132755
count = 0
list = []
with open('new_file.txt', 'w') as f:
    for event, elem in context:
        dict = {}
        if elem.tag == "node":
            count += 1
            lat = elem.get('lat')
            lon = elem.get('lon')
            dict['lat'] = lat
            dict['lon'] = lon
            for child in elem:
                key = child.get('k')
                val = child.get('v')
                dict[key] = val
                child.clear()
            elem.clear()
            if len(dict) > 2:
                i = str(dict)
                f.write(i)
                f.write('\n')
            print count_nodes - count
print 'last step-closing', ("--- %s seconds ---" % (time.time() - start_time))
f.close
I suppose that Python is flushing the buffer and writing the data to the hard drive (into f). Try adding the following line after the 'last step' print:
sys.stdout.flush()
And do not forget to import sys. If it is still too slow, switch to a faster language such as C++ or even Java; they have XML parsers too, and unless you depend on something Python-specific, they are better suited to big data.
Or try an existing parser such as imposm for Python.
Why have you written f.close at the bottom like an attribute access? You can remove it; the file has already been closed once control leaves the "with open" block. I agree with asalic that the data is probably being flushed at that point, but this still seems like a very doable task for Python.
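A quick way to see that the with block closes the file by itself (the filename here is just an example, not the asker's data):

```python
with open('new_file.txt', 'w') as f:
    f.write('hello\n')

# Leaving the with block already called f.close() for us.
# Writing f.close without parentheses merely looks up the bound
# method and throws it away -- it does nothing at all.
print(f.closed)  # True
```

So the trailing f.close line in the question is a no-op either way.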
Since you are using iterparse(), I'm not sure that clearing the elements once you are done with them gains you much speed. That said, you should remove the intermediary variables and do only one file write per loop, like so:
dict['lat'] = elem.get('lat')
dict['lon'] = elem.get('lon')
for child in elem:
    dict[child.get('k')] = child.get('v')
if len(dict) > 2:
    f.write("%s\n" % str(dict))
Also, you should skip the print statement, since the dataset is rather large.
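On the memory blow-up itself: with ElementTree's iterparse(), every parsed element stays attached to the root element until you detach it, so for millions of node elements memory keeps growing even if you clear each node. A minimal sketch (the function name and the file layout are my own illustration, not from the question) that grabs the root and clears it as it goes:

```python
import xml.etree.ElementTree as etree

def stream_nodes(path):
    # Ask for 'start' events too, so we can grab the root element
    # from the very first event before any children arrive.
    context = etree.iterparse(path, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == 'node':
            d = {'lat': elem.get('lat'), 'lon': elem.get('lon')}
            for child in elem:
                d[child.get('k')] = child.get('v')
            if len(d) > 2:  # only nodes that carry at least one tag
                yield d
            # Detach finished elements from the root so they can be
            # garbage-collected; this is what keeps memory flat.
            root.clear()
```

With a generator like this the main loop reduces to iterating over stream_nodes('california.osm') and writing each dict out as one line.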