urlopen error 10048, 'address already in use' while downloading in Python 2.5 on Windows

Question

I'm writing code that will run on Linux, OS X, and Windows. It downloads a list of approximately 55,000 files from the server, then steps through the list of files, checking if the files are present locally. (With SHA hash verification and a few other goodies.) If the files aren't present locally or the hash doesn't match, it downloads them.

The server-side is plain-vanilla Apache 2 on Ubuntu over port 80.

The client side works perfectly on Mac and Linux, but gives me this error on Windows (XP and Vista) after downloading a number of files:

urllib2.URLError: <urlopen error <10048, 'Address already in use'>>

This link: http://bytes.com/topic/python/answers/530949-client-side-tcp-socket-receiving-address-already-use-upon-connect points me to TCP port exhaustion, but "netstat -n" never showed me more than six connections in "TIME_WAIT" status, even just before it errored out.

The code (called once for each of the 55,000 files it downloads) is this:

request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()
datastream = opener.open(request)
outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj.close()
    datastream.close()

UPDATE: Grepping the log shows that it enters the download routine exactly 3998 times. I've run this multiple times and it fails at 3998 each time. Given that the linked article states that the available ports are 5000-1025=3975 (and some are probably expiring and being reused), it's starting to look a lot more like the linked article describes the real issue. However, I'm still not sure how to fix this. Making registry edits is not an option.

Answer 1

If it really is a resource problem (the OS not freeing socket resources quickly enough), try this:

import urllib2

request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()

datastream = None
retries = 3  # number of attempts before giving up
while datastream is None:
    try:
        datastream = opener.open(request)
    except urllib2.URLError, ue:
        # ue.reason is usually a socket.error, so compare its string form
        if '10048' not in str(ue.reason):
            raise  # some other error: do not retry
        retries -= 1
        if retries == 0:
            raise urllib2.URLError("Address already in use / retries exhausted")

outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj.close()
    datastream.close()

If you want, you can insert a sleep between retries, or make the retry logic OS-dependent.

On my Windows XP machine the problem doesn't show up (I reached 5000 downloads).

I watch my processes and network with Process Hacker.
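
The optional sleep and the OS check mentioned above could be sketched roughly as follows. This is only an illustration, not the code I ran: `call_with_retries` and `is_address_in_use` are made-up helper names, and the one-second delay is an arbitrary assumption.

```python
import sys
import time


def call_with_retries(func, is_retryable, attempts=3, delay=1.0):
    """Call func(); on a retryable exception, sleep and try again.

    Gives up and re-raises after `attempts` tries or on a non-retryable error.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            exc = sys.exc_info()[1]  # works on both old and new Python versions
            last_try = (attempt == attempts - 1)
            if last_try or not is_retryable(exc):
                raise
            time.sleep(delay)  # give the OS time to release sockets


def is_address_in_use(exc):
    # Assumption: only retry on Windows, where 10048 means "address already in use".
    return sys.platform == 'win32' and '10048' in str(exc)
```

In the download code you would then wrap the open call, for example `datastream = call_with_retries(lambda: opener.open(request), is_address_in_use)`.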

Answer 2

Thinking outside the box, the problem you seem to be trying to solve has already been solved by a program called rsync. You might look for a Windows implementation and see if it meets your needs.

Answer 3

You should seriously consider copying and modifying this pyCurl example to download a large number of files efficiently.

Answer 4

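You should really use persistent HTTP connections instead of opening a new TCP connection for each request; take a look at urlgrabber (or just at keepalive.py to see how to add keep-alive connection support to urllib2).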
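
To illustrate the idea without urlgrabber or keepalive.py, here is a minimal sketch that uses only the standard library's HTTP client. The names `download_all` and `CHUNK_SIZE` are invented for this example, and it assumes all files live on one host and that the server keeps HTTP/1.1 connections alive between requests.

```python
try:
    import httplib                   # Python 2, as used in the question
except ImportError:
    import http.client as httplib    # same module under its Python 3 name

CHUNK_SIZE = 64 * 1024  # assumed value; use whatever the rest of your code uses


def download_all(host, port, jobs):
    """Fetch every (remote_path, local_path) pair in jobs over one connection."""
    conn = httplib.HTTPConnection(host, port)
    try:
        for remote_path, local_path in jobs:
            conn.request('GET', remote_path)
            response = conn.getresponse()
            if response.status != 200:
                response.read()  # drain the body before giving up
                raise IOError('%s returned HTTP %d' % (remote_path, response.status))
            outfileobj = open(local_path, 'wb')
            try:
                while True:
                    chunk = response.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    outfileobj.write(chunk)
            finally:
                outfileobj.close()
    finally:
        conn.close()
```

Each response has to be read to the end before the next request is sent on the same connection. If the server closes the connection after a response, the client opens a new one on the next request, so in the worst case this uses as many sockets as the original code, not more.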

Answer 5

All indications point to a lack of available sockets. Are you sure that only 6 are in TIME_WAIT status? If you're running so many download operations, it's very likely that the netstat output overruns your terminal buffer. I find that netstat overruns my terminal even during normal usage periods.

The solution is either to modify the code to reuse sockets or to introduce a timeout. It also wouldn't hurt to keep track of how many open sockets you have, to optimize waiting. The default TIME_WAIT timeout on Windows XP is 120 seconds, so you want to sleep for at least that long if you run out of sockets. Unfortunately, it doesn't look like there's an easy way to check from Python when a socket has closed and left the TIME_WAIT status.
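
As a rough back-of-the-envelope check, the numbers in this thread can be turned into a pacing budget. This is only a sketch: it assumes each download uses one port that then stays unavailable for the whole TIME_WAIT period, and `min_connect_interval` is a made-up helper name, not a library function.

```python
def min_connect_interval(first_port, last_port, time_wait_seconds):
    """Return the shortest average gap (in seconds) between new connections
    that never needs more ports than the range provides.

    Assumes every connection holds exactly one port for time_wait_seconds.
    """
    available_ports = last_port - first_port  # 5000 - 1025 = 3975 in the question
    max_rate = available_ports / float(time_wait_seconds)  # connections per second
    return 1.0 / max_rate


# Figures quoted in this thread: ports 1025-5000 and a 120-second timeout.
interval = min_connect_interval(1025, 5000, 120)
```

With those figures the budget is 3975 / 120, or about 33 new connections per second, which means roughly 0.03 seconds between downloads on average if you never want to hit the limit.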

Given the asynchronous nature of the requests and timeouts, the best way to do this might be in a thread. Make each thread sleep for 2 minutes before it finishes. You can either use a Semaphore or limit the number of active threads to ensure that you don't run out of sockets.
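
The Semaphore variant could look roughly like this. It is a sketch under assumptions, not tested against a real server: `run_limited` is a made-up name, `worker` stands for whatever function downloads one file, and `hold_seconds` would be the 2-minute wait.

```python
import threading
import time


def run_limited(jobs, worker, max_active, hold_seconds):
    """Run worker(job) in one thread per job, with at most max_active at once."""
    slots = threading.BoundedSemaphore(max_active)
    threads = []

    def guarded(job):
        try:
            worker(job)
        finally:
            # Keep the slot while the closed socket is assumed to sit in TIME_WAIT.
            time.sleep(hold_seconds)
            slots.release()

    for job in jobs:
        slots.acquire()  # blocks until one of the slots is free
        t = threading.Thread(target=guarded, args=(job,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()
```

The main loop blocks in `acquire()` instead of polling the thread count, so no new download starts until an earlier one has finished its wait.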

Here's how I'd handle it. You might want to add an exception clause to the inner try block of the fetch section, to warn you about failed fetches.

import time
import threading
import urllib2
import Queue

# Assumes url_queue is a Queue object populated with tuples of the form
# (url_to_fetch, temp_file).
# Also assumes that TotalUrls is the size of the queue before any threads are started.
# CHUNK_SIZE is the same constant used in the question's code.


class urlfetcher(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        try:  # handles the Empty exception raised by an empty queue
            file_remote_path, temp_file_path = self.queue.get_nowait()
        except Queue.Empty:
            return
        try:
            request = urllib2.Request(file_remote_path)
            opener = urllib2.build_opener()
            datastream = opener.open(request)
            try:
                outfileobj = open(temp_file_path, 'wb')
                try:
                    while True:
                        chunk = datastream.read(CHUNK_SIZE)
                        if chunk == '':
                            break
                        outfileobj.write(chunk)
                finally:
                    outfileobj.close()
            finally:
                datastream.close()
        finally:
            # Hold this thread's slot until the socket has left TIME_WAIT.
            time.sleep(120)
            self.queue.task_done()

Elsewhere:


for _ in xrange(TotalUrls):  # one thread per queued URL
    # Wait while we are at the hard limit of available ports.
    while threading.activeCount() >= 3975:
        time.sleep(2)
    t = urlfetcher(url_queue)
    t.start()

url_queue.join()

Sorry, my Python is a little rusty, so I wouldn't be surprised if I missed something.

solution1
5 ACCEPTED 2009-10-08 13:15:28

solution2
1 2009-10-02 23:44:41

solution3
1 2009-10-09 01:34:51

solution4
1 2009-10-11 17:03:06

solution5
1 2009-10-11 18:11:15
