Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?
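It seems that the very nature of a MapReduce framework is to work with many files. So when I get errors telling me that I am using too many files, I suspect I am doing something wrong.
If I run the job with the inline runner and three directories, it works: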
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
However, if I run it with the local runner (and the same three directories), it fails:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/
[...output clipped...]
> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
File "mr_gps_quality.py", line 53, in <module>
MRGPSQuality.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
self._invoke_step(step_num, 'mapper')
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
working_dir, env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
procs_args, output_path, working_dir, env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
cwd=working_dir, env=env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
proc = Popen(args, **proc_kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
errpipe_read, errpipe_write = self.pipe_cloexec()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
r, w = os.pipe()
OSError: [Errno 24] Too many open files
Furthermore, if I go back to the inline runner and include more directories in the input (11 in total), I get yet another, different error:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
[...clipped...]
Traceback (most recent call last):
File "mr_gps_quality.py", line 53, in <module>
MRGPSQuality.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
self._invoke_sort(self._step_input_paths(), sort_output_path)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
check_call(args, stdout=output, stderr=err, env=env)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
retcode = call(*popenargs, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
return Popen(*popenargs, **kwargs).wait()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
raise child_exception
OSError: [Errno 7] Argument list too long
The mrjob documentation includes a discussion of the difference between the inline and local runners, but I don't see how it explains this behavior.
Finally, I'll mention that the number of files in the directories I'm globbing is not huge, as the per-directory counts below confirm:
$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do printf "%-25.25s : " "$dir"; find "$dir" -type f | wc -l; done | sort
./01 : 236
./02 : 169
./03 : 176
./04 : 185
./05 : 176
./06 : 235
./07 : 275
./08 : 265
./09 : 186
./10 : 171
./11 : 161
I don't think this has anything to do with the job itself, but here it is:
from mrjob.job import MRJob
import numpy as np
import geohash


class MRGPSQuality(MRJob):

    def mapper(self, _, line):
        # Each input line is a comma-separated record: fields 1 and 2 are
        # read as latitude and longitude, field 4 as horizontal accuracy.
        try:
            fields = line.split(',')
            lat = float(fields[1])
            lng = float(fields[2])
            horizontalAccuracy = float(fields[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except Exception:
            # Skip lines that cannot be parsed or encoded.
            pass

    def reducer(self, key, values):
        # Convert the generator straight back to array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        # Only report geohash cells with more than 50 samples.
        if count > 50:
            yield key, [count, mean]


if __name__ == '__main__':
    MRGPSQuality.run()
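The "Argument list too long" problem comes neither from the job nor from Python. It is the operating system's limit on the length of a command line, which you normally run into from bash. The asterisk in the command that starts the job expands to every matching file, which produces a very long list of arguments; the traceback shows the limit being exceeded when mrjob hands a long list of file paths to the sort subprocess.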
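That error has nothing to do with ulimit. The "Too many open files" error, however, does, so even if the command got as far as running you would still hit the ulimit.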
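The limit behind "Too many open files" is the per-process cap on open file descriptors, which `ulimit -n` reports in the shell. As a minimal sketch, assuming a Unix system, the same value can be inspected (and the soft limit raised up to the hard limit) from Python with the standard `resource` module. The helper names `open_file_limits` and `raise_soft_limit` are made up for illustration and are not part of mrjob, and raising the limit may or may not be enough for the local runner on a given input.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft limit towards `target`; never lower it.

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # Already unlimited, nothing to raise.
        return soft
    # Cap the request at the hard limit unless the hard limit is unlimited.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft <= soft:
        return soft
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some systems reject values that the reported hard limit seems to
        # allow; in that case the old soft limit simply stays in place.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

The change only affects the current process and the children it starts, so it would have to run before the job launches its subprocesses.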
If you are interested, you can check the argument-length limit like this: getconf ARG_MAX
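The same limit can be read from Python through `os.sysconf`, which makes it possible to estimate whether a given list of paths would fit before handing it to a subprocess. This is a rough sketch in Python 3: the helper names are hypothetical, and the byte count is only an approximation of what the kernel actually charges (it also counts pointers and may apply per-argument limits).

```python
import os


def arg_max():
    """Return the system limit on combined argument and environment size."""
    return os.sysconf("SC_ARG_MAX")


def argv_bytes(args):
    """Approximate size of an argument list: each string plus its NUL byte."""
    return sum(len(os.fsencode(arg)) + 1 for arg in args)


def fits_on_command_line(args, env=None):
    """Estimate whether `args` plus the environment stay under ARG_MAX."""
    env = os.environ if env is None else env
    # Each environment entry is stored as "KEY=VALUE" plus a NUL byte.
    env_bytes = sum(
        len(os.fsencode(key)) + len(os.fsencode(value)) + 2
        for key, value in env.items()
    )
    return argv_bytes(args) + env_bytes < arg_max()
```

Passing the result of `glob.glob(...)` for the input pattern to `fits_on_command_line` gives a quick indication of whether the expanded file list is anywhere near the limit.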
To get around the maximum-arguments problem, you can concatenate all of the files into a single file by doing this:
for f in *; do cat "$f" >> ../directory/bigfile.log; done
Then run the mrjob pointing at the big file.
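The concatenation step can also be done in Python, which avoids shell glob expansion altogether because the file list never appears on a command line. The following is a minimal sketch; `concatenate` is a hypothetical helper, and like `cat` it copies bytes verbatim, so a file that does not end in a newline will have its last line joined to the first line of the next file.

```python
import shutil


def concatenate(paths, destination):
    """Append the contents of every file in `paths` to `destination`.

    Files are copied in sorted order, one at a time, so only two files
    are open at any moment. Returns the number of files copied.
    """
    count = 0
    with open(destination, "wb") as out:
        for path in sorted(paths):
            with open(path, "rb") as src:
                # Stream the file in chunks instead of reading it whole.
                shutil.copyfileobj(src, out)
            count += 1
    return count
```

For the layout in the question, something like `concatenate(glob.glob('/Volumes/Logs/gps/ByCityLogs/city1/*/*.log'), 'bigfile.log')` would build the single input file, with the output path chosen outside the directories being globbed.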
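If there are a lot of files, you can use GNU parallel to concatenate them with several jobs running at once, since the command above runs one file at a time and is slow.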
ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"
*Change 8 to the degree of parallelism you want.