
Multiprocessing in Python <urlopen error ftp error> when Downloading Batches of Files

This question spawned from a previous one (see Downloading a LOT of files using Python), but is so much more specific than what I originally asked that I thought it deserved its own question.

When running python multiprocessing, if I try to download a batch of files at once using threading, it throws an error on only some of the files. This is the error; obviously urllib2 had a problem opening the file, but the question is, why?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/anaconda/lib/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/local/lib64/anaconda/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
urllib2.URLError: <urlopen error ftp error: >

What is weird is that if I download the files one at a time I do not get this error. And the error is (normally) not consistent: if I run the same process twice, it will throw the same error but on different files. This leads me to think that the problem is the interaction of the threads. Maybe two threads are trying to ping the site at the same time? Does anyone know what might be causing this?

The machine I am using is a Linux box running Red Hat with 32 cores.

Here is the code I am using:

from __future__ import division
import pandas as pd
import numpy as np
import urllib2
import os
import linecache
from multiprocessing import Pool
import time

#making our list of urls to download from
data2=pd.read_csv("edgar14A14C.csv")

flist=np.array(data2['filename'])
print len(flist)
print flist

os.chdir(os.path.join(os.getcwd(), 'edgar14A14C'))

###below we have a script to download all of the files in the data2 database
###here you will need to create a new directory named edgar14A14C in your CWD

def job(url):
    print "I'm doing something!"
    file_name = str(url.split('/')[-1])
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    print file_name
    f.close()


urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
pool = Pool(processes=20)
pool.map(job, urls)

Processes are limited to having only a specific number of file pointers open at a given time. urllib2.urlopen(url) opens a file pointer (the socket). After you are done with the data, be sure to close it: u.close()
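
As a minimal sketch of that fix (assuming the same Python 2 / urllib2 setup as the question), job() can be rewritten with contextlib.closing so the socket is released even when read() or write() raises:

from contextlib import closing
import urllib2

def job(url):
    file_name = str(url.split('/')[-1])
    # closing() guarantees u.close() runs even if read() or write() raises,
    # so each pool worker releases its socket instead of leaking file pointers
    with closing(urllib2.urlopen(url)) as u:
        with open(file_name, 'wb') as f:
            f.write(u.read())
    print file_name

With the handle closed after every request, each of the 20 workers holds at most one open socket at a time, which should keep the pool under the process's file-descriptor limit.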
