
Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?

It seems like the nature of the MapReduce framework is to work with many files. So when I get errors that tell me I'm using too many files, I suspect I'm doing something wrong.

If I run the job with the inline runner and three directories, it works:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

But if I run it using the local runner (and the same three directories), it fails:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/

[...output clipped...]

> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
    self._invoke_step(step_num, 'mapper')
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
    working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
    procs_args, output_path, working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
    cwd=working_dir, env=env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
    proc = Popen(args, **proc_kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files

Furthermore, if I go back to using the inline runner and include even more directories (11 total) in my input, then I get a different error:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

[...clipped...]

Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run 
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run 
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
    self._invoke_sort(self._step_input_paths(), sort_output_path)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
    check_call(args, stdout=output, stderr=err, env=env)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long

The mrjob docs include a discussion of the differences between the inline and local runners, but I don't understand how it would explain this behavior.

Lastly, I'll mention that the number of files in the directories I'm globbing isn't huge (acknowledgement):

$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do   printf "%-25.25s : " "$dir";   find "$dir" -type f | wc -l; done | sort
./01                      :      236
./02                      :      169
./03                      :      176
./04                      :      185
./05                      :      176
./06                      :      235
./07                      :      275
./08                      :      265
./09                      :      186
./10                      :      171
./11                      :      161

I don't think this has to do with the job itself, but here it is:

from mrjob.job import MRJob
import numpy as np
import geohash

class MRGPSQuality(MRJob):

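    def mapper(self, _, line):
        try:
            # Split once; fields 1, 2 and 4 hold lat, lng and accuracy.
            fields = line.split(',')
            lat = float(fields[1])
            lng = float(fields[2])
            horizontalAccuracy = float(fields[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except Exception:
            # Skip lines that are malformed or cannot be geohashed.
            pass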

    def reducer(self, key, values):
        # Convert the generator straight back to array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        if count > 50:
            yield key, [count, mean]

if __name__ == '__main__':
    MRGPSQuality.run()

The "Argument list too long" problem is not caused by the job or by Python; it comes from the limit on command-line length. The asterisk in the command that kicks off the job expands to every matching file, which makes for a very long command line and exceeds the system's limit on argument length (ARG_MAX).

That error has nothing to do with ulimit. The "Too many open files" error, on the other hand, is related to ulimit, so you would run into that limit as well once the command actually ran.
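The ulimit in question is the per-process cap on open file descriptors, which Python exposes through the standard `resource` module on Unix-like systems. The following is a minimal sketch, not something the mrjob docs prescribe: the helper names `open_file_limits` and `raise_soft_limit` are made up for illustration, and whether raising the soft limit is enough for the local runner is an assumption you would have to verify.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft limit to `target` (hypothetical helper).

    The soft limit is never lowered and never pushed past the hard limit.
    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            # Some systems refuse values above a kernel-wide maximum.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("before:", open_file_limits())
    print("soft limit now:", raise_soft_limit(4096))
```

Child processes inherit these limits, so calling something like this before the job starts would also cover subprocesses the runner spawns. The shell equivalent of the check is `ulimit -n`.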

You can check the argument-length limit like this, if you are interested:

getconf ARG_MAX
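The same value can be read from Python, and comparing it with the size of an expanded file list gives a rough idea of how close a command is to the limit. This is a sketch assuming Python 3 on a Unix-like system; `command_line_bytes` is a hypothetical helper, and its result is only an estimate because the kernel also counts pointer overhead that is ignored here.

```python
import os


def arg_max():
    """Return the system limit (in bytes) on exec arguments plus environment."""
    return os.sysconf("SC_ARG_MAX")


def command_line_bytes(args, env=None):
    """Estimate the bytes used by an argument list and environment.

    Each argument costs its encoded length plus a terminating NUL byte;
    each environment entry is stored as "KEY=VALUE" plus a NUL byte.
    """
    env = os.environ if env is None else env
    arg_bytes = sum(len(os.fsencode(a)) + 1 for a in args)
    env_bytes = sum(
        len(os.fsencode(k)) + len(os.fsencode(v)) + 2 for k, v in env.items()
    )
    return arg_bytes + env_bytes


if __name__ == "__main__":
    import glob

    # Example pattern only; substitute the glob you pass to the job.
    paths = glob.glob("*.log")
    print(command_line_bytes(["sort"] + paths), "of", arg_max(), "bytes")
```

Note that tools which rewrite the inputs as longer absolute or temporary paths will use more space than the original glob suggests.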

To get around the maximum-arguments problem, you can concatenate all the files into one, like this:

for f in *; do cat "$f" >> ../directory/bigfile.log; done

Then run your mrjob pointed at the big file.
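If you would rather do the concatenation from Python, a sketch along these lines does the same thing while keeping only one input file open at a time, and the file list never passes through a command line. The function name `concatenate` is illustrative and not part of mrjob.

```python
import glob
import os
import shutil


def concatenate(pattern, destination):
    """Append every file matching `pattern` to `destination`, in sorted order.

    Returns the number of files copied. The destination is overwritten and
    is skipped if it happens to match the pattern itself.
    """
    destination = os.path.abspath(destination)
    copied = 0
    with open(destination, "wb") as out:
        for path in sorted(glob.glob(pattern)):
            if os.path.abspath(path) == destination:
                continue
            # Only one source file is open at any moment.
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            copied += 1
    return copied
```

For the layout in the question, the call might look like `concatenate("/Volumes/Logs/gps/ByCityLogs/city1/*/*.log", "bigfile.log")`. As with `cat`, a file that does not end in a newline will have its last line joined to the first line of the next file.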

If there are a lot of files, you can concatenate them in parallel with GNU parallel, because the command above runs as a single process and is slow.

ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"

*Change 8 to the amount of parallelism you want.
