[英]Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?
似乎MapReduce框架的本质是要处理许多文件。 因此,当我收到告诉我使用过多文件的错误时,我怀疑我在做错什么。
如果我使用inline
运行器和三个目录运行该作业,则它可以工作:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
但是,如果我使用local
运行器(以及相同的三个目录)运行它,它将失败:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/
[...output clipped...]
> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
File "mr_gps_quality.py", line 53, in <module>
MRGPSQuality.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
self._invoke_step(step_num, 'mapper')
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
working_dir, env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
procs_args, output_path, working_dir, env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
cwd=working_dir, env=env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
proc = Popen(args, **proc_kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
errpipe_read, errpipe_write = self.pipe_cloexec()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
r, w = os.pipe()
OSError: [Errno 24] Too many open files
此外,如果我返回使用内联运行器,并在输入中包含更多目录(总共11个),那么我会再次遇到另一个错误:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
[...clipped...]
Traceback (most recent call last):
File "mr_gps_quality.py", line 53, in <module>
MRGPSQuality.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
self._invoke_sort(self._step_input_paths(), sort_output_path)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
check_call(args, stdout=output, stderr=err, env=env)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
retcode = call(*popenargs, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
return Popen(*popenargs, **kwargs).wait()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
raise child_exception
OSError: [Errno 7] Argument list too long
mrjob文档包括inline
和local
跑步者之间差异的讨论 ,但我不知道它如何解释这种行为。
最后,我要提到的是,我正在遍历的目录中的文件数量并不是很大( 确认 ):
$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do printf "%-25.25s : " "$dir"; find "$dir" -type f | wc -l; done | sort
./01 : 236
./02 : 169
./03 : 176
./04 : 185
./05 : 176
./06 : 235
./07 : 275
./08 : 265
./09 : 186
./10 : 171
./11 : 161
我认为这与工作本身无关,但在这里是:
from mrjob.job import MRJob
import numpy as np
import geohash
class MRGPSQuality(MRJob):
def mapper(self, _, line):
try:
lat = float(line.split(',')[1])
lng = float(line.split(',')[2])
horizontalAccuracy = float(line.split(',')[4])
gh = geohash.encode(lat, lng, precision=7)
yield gh, horizontalAccuracy
except:
pass
def reducer(self, key, values):
# Convert the generator straight back to array:
vals = np.fromiter(values, float)
count = len(vals)
mean = np.mean(vals)
if count > 50:
yield key, [count, mean]
if __name__ == '__main__':
MRGPSQuality.run()
“参数列表太长”的问题不是作业或python,而是bash。 命令行中用于启动作业的星号会扩展到每个匹配的文件,这是一个非常长的命令行,超过了bash限制。
该错误与ulimit无关,但错误“ u打开的文件太多”与ulimit有关,因此,如果命令实际上要运行,则会遇到ulimit。
您可以像这样检查炮弹极限(如果您有兴趣)... getconf ARG_MAX
要解决最大args问题,您可以通过执行以下操作将所有文件串联在一起。
for f in *; do cat "$f" >> ../directory/bigfile.log; done
然后运行指向大文件的mrjob。
如果文件很多,则可以使用gnu parallel使用多个线程来连接文件,因为上述命令是单线程且速度较慢。
ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"
*将8更改为所需的并行度
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.