Python mrjob mapreduce: how to preprocess the input file

I am trying to pre-process an XML file to extract certain nodes before feeding it into mapreduce. I have the following code:

from mrjob.compat import jobconf_from_env
from mrjob.job import MRJob
from mrjob.util import cmd_line, bash_wrap

class MRCountLinesByFile(MRJob):
    def configure_options(self):
        super(MRCountLinesByFile, self).configure_options()
        self.add_file_option('--filter')

    def mapper_cmd(self):
        cmd = cmd_line([self.options.filter, jobconf_from_env('mapreduce.map.input.file')])
        return cmd



if __name__ == '__main__':
    MRCountLinesByFile.run()

And on the command line, I type:

python3 test_job_conf.py --filter ./filter.py -r local < test.txt

test.txt is an ordinary XML file (the original post linked to an example), and filter.py is a script that finds all title information.
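The original filter.py is not shown, so the following is only a hypothetical sketch of what a title-extracting filter of this kind might look like. It assumes the nodes of interest are `<title>` elements and that the script receives a filename as its first argument, as the traceback further down suggests. The function name `extract_titles` is made up for illustration.

```python
import sys
import xml.etree.ElementTree as ET


def extract_titles(source):
    """Yield the stripped text of every <title> element in `source`.

    `source` may be a filename or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any XML namespace prefix such as "{http://...}title".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "title" and elem.text:
            yield elem.text.strip()
        elem.clear()  # release elements that have already been handled


# Hypothetical command-line use: python filter.py some_file.xml
if __name__ == "__main__" and len(sys.argv) > 1:
    for title in extract_titles(sys.argv[1]):
        print(title)
```

Because `iterparse` streams the document, this approach does not need to hold the whole XML tree in memory at once.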

However, I am getting the following errors:

Creating temp directory /tmp/test_job_conf.vagrant.20160406.042648.689625
Running step 1 of 1...
Traceback (most recent call last):
  File "./filter.py", line 8, in <module>
    with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'None'
Step 1 of 1 failed: Command '['./filter.py', 'None']' returned non-zero exit status 1

It looks like mapreduce.map.input.file evaluates to None in this case. How can I make the mapper_cmd function read the file that mrjob is currently reading?
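A likely explanation for the literal 'None' can be sketched in plain Python. The sketch makes two assumptions. First, that job configuration values reach a task as environment variables with dots replaced by underscores, as in Hadoop Streaming. Second, that mapper_cmd() is evaluated while the job is being set up, not inside a running task, so the variable is not set yet. The helper `jobconf_lookup` is a simplified stand-in written for this illustration, not mrjob's actual code.

```python
import os


def jobconf_lookup(name, default=None, environ=os.environ):
    """Simplified stand-in for a jobconf lookup (not mrjob's real code)."""
    return environ.get(name.replace(".", "_"), default)


# Outside a running task the variable is unset, so the lookup yields None ...
outside_task = jobconf_lookup("mapreduce.map.input.file", environ={})

# ... and turning that value into a command-line argument gives the text 'None'.
command = ["./filter.py", str(outside_task)]
print(command)
```

Under these assumptions the value is fixed before any input file is known, which would match the failing command shown in the error output.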

As far as I understand, your self.add_file_option call should be given the path to your file.

self.add_file_option('--items', help='Path to u.item')

I am not sure I fully understand your scenario, but here is my understanding. You use the configure option to make sure a given file is sent to all the mappers for processing, for example when you want to do an ancillary lookup on data in a file other than the source. This ancillary lookup file is made available by self.add_file_option('--items', help='Path to u.item').

To preprocess something before a reducer or mapper phase, you use reducer_init or mapper_init. These init steps also need to be listed in your steps function, for example as shown below.

# Requires: from mrjob.step import MRStep
def steps(self):
    return [
        MRStep(mapper=self.mapper_get_name,
               reducer_init=self.reducer_init,
               reducer=self.reducer_count_name),
        MRStep(reducer=self.reducer_find_maxname),
    ]

Within your init function you do the actual pre-processing that has to happen before data is sent to the mapper or reducer. For example, open a file xyz and store the second field of each line under its first field in a dictionary, which the reducer can then use when producing its output.

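def reducer_init(self):
    # Build the lookup table once, before the reducer handles any keys.
    self.myNames = {}
    with open("xyz") as f:
        for line in f:
            fields = line.split('|')
            # Map the first field of each line to its second field.
            self.myNames[fields[0]] = fields[1]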
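The parsing done inside reducer_init can also be pulled out into a plain function, which makes it easy to check outside of mrjob. This is a minimal sketch. The name `load_lookup`, the default separator, and the skipping of malformed lines are assumptions added for illustration and are not part of the answer above.

```python
def load_lookup(path, sep="|"):
    """Map the first field of each line in `path` to its second field."""
    lookup = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(sep)
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            lookup[fields[0]] = fields[1]
    return lookup
```

An init method could then assign the result of load_lookup("xyz") to the attribute that the reducer reads.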

Hope this helps!
