簡體   English   中英

為什么在使用mrjob v0.4.4時,[Errno 7]參數列表過長且OSError:[Errno 24]打開的文件太多?

[英]Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?

似乎MapReduce框架的本質是要處理許多文件。 因此,當我收到告訴我使用過多文件的錯誤時,我懷疑我在做錯什么。

如果我使用inline運行器和三個目錄運行該作業,則它可以工作:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

但是,如果我使用local運行器(以及相同的三個目錄)運行它,它將失敗:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/

[...output clipped...]

> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
    self._invoke_step(step_num, 'mapper')
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
    working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
    procs_args, output_path, working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
    cwd=working_dir, env=env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
    proc = Popen(args, **proc_kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files

此外,如果我返回使用內聯運行器,並在輸入中包含更多目錄(總共11個),那么我會再次遇到另一個錯誤:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

[...clipped...]

Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run 
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run 
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
    self._invoke_sort(self._step_input_paths(), sort_output_path)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
    check_call(args, stdout=output, stderr=err, env=env)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long

mrjob文檔包括inlinelocal跑步者之間差異討論 ,但我不知道它如何解釋這種行為。

最后,我要提到的是,我正在遍歷的目錄中的文件數量並不是很大( 確認 ):

$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do   printf "%-25.25s : " "$dir";   find "$dir" -type f | wc -l; done | sort
./01                      :      236
./02                      :      169
./03                      :      176
./04                      :      185
./05                      :      176
./06                      :      235
./07                      :      275
./08                      :      265
./09                      :      186
./10                      :      171
./11                      :      161

我認為這與工作本身無關,但在這里是:

from mrjob.job import MRJob
import numpy as np
import geohash

class MRGPSQuality(MRJob):

    def mapper(self, _, line):

        try:
            lat = float(line.split(',')[1])
            lng = float(line.split(',')[2])
            horizontalAccuracy = float(line.split(',')[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except:
            pass

    def reducer(self, key, values):
        # Convert the generator straight back to array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        if count > 50:
            yield key, [count, mean]

if __name__ == '__main__':
    MRGPSQuality.run()

“參數列表太長”的問題不是作業或python,而是bash。 命令行中用於啟動作業的星號會擴展到每個匹配的文件,這是一個非常長的命令行,超過了bash限制。

該錯誤與ulimit無關,但錯誤“ u打開的文件太多”與ulimit有關,因此,如果命令實際上要運行,則會遇到ulimit。

您可以像這樣檢查炮彈極限(如果您有興趣)... getconf ARG_MAX

要解決最大args問題,您可以通過執行以下操作將所有文件串聯在一起。

for f in *; do cat "$f" >> ../directory/bigfile.log; done

然后運行指向大文件的mrjob。

如果文件很多,則可以使用gnu parallel使用多個線程來連接文件,因為上述命令是單線程且速度較慢。

ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"

*將8更改為所需的並行度

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM