EDIT:
I have a script that parses a sitemap XML and stores the `loc` values from the first pass in an array. It then refreshes the page, re-parses, and stores the desired XML tag values in a second array to check for any differences. This second array is rebuilt every 3 seconds on each refresh of the XML. However, the script seems to get hung up, and I am wondering what the problem is.
import urllib, time
from time import gmtime, strftime
from xml.dom import minidom

url = 'http://kutoa.com/sitemap_products_1.xml?from=1&to=999999999'

def main():
    primList = []
    secList = []
    # First pass: fetch and parse the sitemap, collect every <loc> value
    xml = urllib.urlopen(url).read()
    xmldoc = minidom.parseString(xml)
    loc_values = xmldoc.getElementsByTagName('loc')
    for loc_val in loc_values:
        item = loc_val.firstChild.nodeValue
        primList.append(item)
    # Copy the first-pass list into the comparison list
    for i in primList:
        secList.append(i)
    # Poll the sitemap every 3 seconds while the counts still match
    while len(secList) == len(primList):
        print str(strftime("%Y-%m-%d %H:%M:%S", gmtime())) + ' :: ' + str(len(secList)) + ' items indexed...'
        print 'destruct list'
        secList = []
        print 'empty list/reading url'
        xml = urllib.urlopen(url).read()
        print 'url read/parsing'
        xmldoc = minidom.parseString(xml)
        print 'parsed going for tags'
        loc_values = xmldoc.getElementsByTagName('loc')
        print 'adding tags'
        for loc_val in loc_values:
            item = loc_val.firstChild.nodeValue
            secList.append(item)
        print 'tags added to list'
        time.sleep(3)
        print 'sleep for 3\n'
    # Counts differ: report the change and restart
    if len(primList) > len(secList):
        print 'items removed'
        main()
    elif len(secList) > len(primList):
        print 'items added'
        main()

main()
With print statements added for troubleshooting, I can see that it gets hung up on opening the URL. Here is some recent output:
2015-12-26 18:30:21 :: 7 items indexed...
destruct list
empty list/reading url
url read/parsing
parsed going for tags
adding tags
tags added to list
sleep for 3
2015-12-26 18:30:24 :: 7 items indexed...
destruct list
empty list/reading url
url read/parsing
parsed going for tags
adding tags
tags added to list
sleep for 3
2015-12-26 18:30:27 :: 7 items indexed...
destruct list
empty list/reading url
and then nothing more is output; the program just hangs, un-terminated, after that last message. Is this network related? Any thoughts/remedies would be greatly appreciated!
At the beginning of your function, before calling urlopen, you might want to set the socket timeout to prevent the call from potentially hanging forever. This snippet sets the timeout to 3 seconds for consistency with your sleep value:
import socket

def main():
    socket.setdefaulttimeout(3)
    ...
Then, wrap your call to urlopen to catch the socket.timeout exception. This snippet just prints a string and continues your loop:
try:
    xml = urllib.urlopen(url).read()
except socket.timeout as e:
    print 'timeout reading url: %s' % e
    continue
print 'url read/parsing'
...
I haven't tested this so let me know how it goes for you.
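For reference, here is one way the two suggestions could fit together as a restructured sketch. This is only a sketch, not your exact script: `parse_locs`, `fetch_locs`, and `watch` are made-up helper names, and since Python 2's `urllib` can wrap a timed-out socket error in an `IOError` depending on where the timeout fires, the except clause catches both:

```python
import socket
import time
import urllib
from time import gmtime, strftime
from xml.dom import minidom

URL = 'http://kutoa.com/sitemap_products_1.xml?from=1&to=999999999'

def parse_locs(xml_string):
    # Pure parsing step: extract the text of every <loc> element.
    doc = minidom.parseString(xml_string)
    return [node.firstChild.nodeValue
            for node in doc.getElementsByTagName('loc')]

def fetch_locs(url):
    # Return the current list of <loc> values, or None on a network
    # error, so the caller can retry instead of hanging or crashing.
    try:
        return parse_locs(urllib.urlopen(url).read())
    except (socket.timeout, IOError) as e:
        print('error reading url: %s' % e)
        return None

def watch(url, interval=3):
    # Bound every blocking socket call so a stalled read cannot hang forever.
    socket.setdefaulttimeout(interval)
    baseline = None
    while baseline is None:
        baseline = fetch_locs(url)   # retry until the first read succeeds
        if baseline is None:
            time.sleep(interval)
    while True:
        print('%s :: %d items indexed...'
              % (strftime('%Y-%m-%d %H:%M:%S', gmtime()), len(baseline)))
        current = fetch_locs(url)
        if current is not None and len(current) != len(baseline):
            print('items removed' if len(current) < len(baseline)
                  else 'items added')
            baseline = current
        time.sleep(interval)
```

Structuring it this way also avoids the unbounded recursion of calling `main()` from inside `main()` every time the count changes.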