Is it possible to pass arguments to an mrjob job?

Given the basic example from the mrjob site for a word count program:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

From the command line, this example can be run as python mrJobFilename.py mrJobFilename.py. This should run the program on itself and count the words in the file.

So given this example, what if I want to pass in an argument, say minCount = 3? With this argument, the reducer would only return words with counts greater than minCount.

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        X = sum(values)
        if X > minCount:
            yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

I tried passing minWord as an argument: python mrJobFilename.py mrJobFilename.py 3, but I get the error OSError: Input path 3 does not exist!

I also tried setting a variable with sys.argv:

if __name__ == '__main__':
    minWord = sys.argv[1]
    MRWordFrequencyCount.run()

When run with python mrJobFilename.py mrJobFilename.py < 3, I get the error bash: 3: No such file or directory. If I don't use the <, I get the previous "input path does not exist" error.

Finally, I tried inputting a second csv file. The csv file is 2 lines and looks like this:

minWord
3

It is meant to pass a parameter to mrjob, since I keep getting an error that the second argument is not an input file. I use mapper_raw to try to load it, but I get a strange error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 22: invalid start byte

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper_raw(self, input_arg1, input_arg2):
        import csv
        f = open(input_path2)
        reader = csv.reader(f)
        next(reader) # skip header
        yield(next(reader))

    def steps(self):
          return [
              MRStep(mapper_raw=self.mapper_raw)
          ]


if __name__ == '__main__':
    MRWordFrequencyCount.run()

How can I pass an argument to mrjob? Ultimately I need this to pass parameters for systems of differential equations that I want to solve in parallel.

You can follow the mrjob documentation to add a command-line argument, much as you would with argparse.

So your code should look something like this:

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def configure_args(self):
        super(MRWordFrequencyCount, self).configure_args()
        # type=int so the value can be compared with a numeric total
        self.add_passthru_arg("-m", "--minCount", type=int, default=0,
                              help="only yield keys whose total exceeds this value")

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # values is a one-shot iterator, so sum it once and reuse the total
        total = sum(values)
        if total > self.options.minCount:
            yield key, total


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Access the argument as self.options.minCount.

Run command:

python code.py input.txt --minCount 4
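The option handling and the threshold filter can also be tried without mrjob installed. The sketch below is only a stand-in: it uses plain argparse (which mrjob's add_passthru_arg mirrors, per its documentation) and an in-memory loop instead of a real mapper and reducer. The helper names parse_min_count and count_and_filter, and the default of 0, are assumptions made for illustration, not part of mrjob.

```python
import argparse
from collections import defaultdict


def parse_min_count(argv):
    """Hypothetical helper: parse input paths plus a --minCount option."""
    parser = argparse.ArgumentParser()
    # Bare positional values are treated as input paths, which is why
    # a trailing "3" on the command line was read as a file name.
    parser.add_argument("paths", nargs="*", help="input files")
    # type=int converts the raw string; without it the value stays a str.
    parser.add_argument("-m", "--minCount", type=int, default=0)
    return parser.parse_args(argv)


def count_and_filter(lines, min_count):
    """Hypothetical helper: same totals as the job, kept only if > min_count."""
    totals = defaultdict(int)
    for line in lines:
        totals["chars"] += len(line)
        totals["words"] += len(line.split())
        totals["lines"] += 1
    return {key: total for key, total in totals.items() if total > min_count}


options = parse_min_count(["input.txt", "--minCount", "4"])
result = count_and_filter(["a b c", "d e"], options.minCount)
print(options.paths, options.minCount, result)
```

Naming the option explicitly keeps it out of the list of input paths, and converting it once at parse time keeps the comparison in the filter numeric.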
