urlopen error 10045, 'address already in use' while downloading in Python 2.5 on Windows

I'm writing code that will run on Linux, OS X, and Windows. It downloads a list of approximately 55,000 files from the server, then steps through the list of files, checking if the files are present locally. (With SHA hash verification and a few other goodies.) If the files aren't present locally or the hash doesn't match, it downloads them.
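The local check is conceptually along these lines (a simplified sketch, not the actual code; the hash algorithm, chunk size, and helper name are just illustrative):

import hashlib
import os

# Illustrative helper: decide whether a file needs to be (re)downloaded by
# comparing a SHA-1 digest of the local copy against the expected hash.
def needs_download(local_path, expected_sha1):
    if not os.path.exists(local_path):
        return True
    sha = hashlib.sha1()
    f = open(local_path, 'rb')
    try:
        for chunk in iter(lambda: f.read(65536), ''):
            sha.update(chunk)
    finally:
        f.close()
    return sha.hexdigest() != expected_sha1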

The server-side is plain-vanilla Apache 2 on Ubuntu over port 80.

The client side works perfectly on Mac and Linux, but gives me this error on Windows (XP and Vista) after downloading a number of files:

urllib2.URLError: <urlopen error <10048, 'Address already in use'>>

This link: http://bytes.com/topic/python/answers/530949-client-side-tcp-socket-receiving-address-already-use-upon-connect points me to TCP port exhaustion, but "netstat -n" never showed me more than six connections in "TIME_WAIT" status, even just before it errored out.

The code (called once for each of the 55,000 files it downloads) is this:

request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()
datastream = opener.open(request)
outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()

UPDATE: I find by grepping the log that it enters the download routine exactly 3998 times. I've run this multiple times and it fails at 3998 each time. Given that the linked article states that available ports are 5000-1025=3975 (and some are probably expiring and being reused), it's starting to look a lot more like the linked article describes the real issue. However, I'm still not sure how to fix this. Making registry edits is not an option.

If it is really a resource problem (freeing OS socket resources), try this:

request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()

datastream = None
retry = 3  # number of attempts
while retry:
    try:
        datastream = opener.open(request)
    except urllib2.URLError, ue:
        # str() works whether ue.reason is a socket.error or a plain string
        if str(ue.reason).find('10048') > -1:
            retry -= 1
            if not retry:
                raise urllib2.URLError("Address already in use / retries exhausted")
        else:
            raise
    if datastream:
        retry = 0

outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()

If you want, you can insert a sleep between retries, or make it OS-dependent.
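A minimal sketch of that idea (the delay and the platform check are just illustrative assumptions):

import sys
import time

# Illustrative only: back off before retrying, but only on Windows,
# where ephemeral-port exhaustion is the suspected cause.
def backoff_if_windows(seconds=5):
    if sys.platform.startswith('win'):
        time.sleep(seconds)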

On my Win XP the problem doesn't show up (I reached 5000 downloads).

I watch my processes and network with Process Hacker.

Thinking outside the box, the problem you seem to be trying to solve has already been solved by a program called rsync. You might look for a Windows implementation and see if it meets your needs.

You should seriously consider copying and modifying this pyCurl example for efficiently downloading a large number of files.
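Something along these lines, assuming pycurl is installed (the files_to_fetch list is a placeholder for however you track pending downloads); reusing one Curl handle lets libcurl keep the HTTP connection open between requests:

import pycurl

# Sketch only: reuse a single Curl handle so libcurl can keep the connection
# alive across downloads instead of opening a new socket for each file.
curl = pycurl.Curl()
for file_remote_path, temp_file_path in files_to_fetch:  # hypothetical (url, path) pairs
    outfileobj = open(temp_file_path, 'wb')
    try:
        curl.setopt(pycurl.URL, file_remote_path)
        curl.setopt(pycurl.WRITEFUNCTION, outfileobj.write)
        curl.perform()
    finally:
        outfileobj.close()
curl.close()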

You should really be using persistent HTTP connections instead of opening a new TCP connection for every request - have a look at urlgrabber (or just at keepalive.py for how to add keep-alive connection support to urllib2).
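A minimal sketch, assuming keepalive.py from urlgrabber is importable; the handler keeps HTTP connections open so 55,000 requests don't each consume a fresh ephemeral port:

import urllib2
from keepalive import HTTPHandler  # keepalive.py ships with urlgrabber

# Install a global opener whose handler reuses HTTP connections.
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)

# Subsequent urlopen calls go through the keep-alive handler.
datastream = urllib2.urlopen(file_remote_path)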

All indications point to a lack of available sockets. Are you sure that only 6 are in TIME_WAIT status? If you're running so many download operations it's very likely that netstat overruns your terminal buffer. I find that netstat overruns my terminal during normal usage periods.
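One way to get a real count instead of eyeballing scrolling output (a sketch that just shells out to netstat and counts matching lines):

import subprocess

# Count sockets currently in TIME_WAIT rather than scrolling through netstat.
output = subprocess.Popen(['netstat', '-n'],
                          stdout=subprocess.PIPE).communicate()[0]
time_wait = sum(1 for line in output.splitlines() if 'TIME_WAIT' in line)
print 'sockets in TIME_WAIT:', time_wait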

The solution is to either modify the code to reuse sockets, or introduce a timeout. It also wouldn't hurt to keep track of how many open sockets you have, to optimize waiting. The default timeout on Windows XP is 120 seconds, so you want to sleep for at least that long if you run out of sockets. Unfortunately it doesn't look like there's an easy way to check from Python when a socket has closed and left the TIME_WAIT status.

Given the asynchronous nature of the requests and timeouts, the best way to do this might be in a thread. Make each thread sleep for 2 minutes before it finishes. You can either use a Semaphore or limit the number of active threads to ensure that you don't run out of sockets.

Here's how I'd handle it. You might want to add an exception clause to the inner try block of the fetch section, to warn you about failed fetches.

import time
import threading
import Queue
import urllib2

# assumes url_queue is a Queue object populated with tuples in the form of (url_to_fetch, temp_file)
# also assumes that TotalUrls is the size of the queue before any threads are started.


class UrlFetcher(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        try:  # a non-blocking get raises Queue.Empty once the queue is drained
            file_remote_path, temp_file_path = self.queue.get(False)
        except Queue.Empty:
            return
        request = urllib2.Request(file_remote_path)
        opener = urllib2.build_opener()
        datastream = opener.open(request)
        outfileobj = open(temp_file_path, 'wb')
        try:
            while True:
                chunk = datastream.read(CHUNK_SIZE)
                if chunk == '':
                    break
                else:
                    outfileobj.write(chunk)
        finally:
            outfileobj.close()
            datastream.close()
            time.sleep(120)  # wait out the TIME_WAIT period before this thread exits
            self.queue.task_done()

elsewhere:


while not url_queue.empty():
    if threading.activeCount() < 3975:  # hard limit of available ephemeral ports
        t = UrlFetcher(url_queue)
        t.start()
    else:
        time.sleep(2)

url_queue.join()

Sorry, my python is a little rusty, so I wouldn't be surprised if I missed something.
