
Multiple Inputs with MRJob

I'm trying to learn to use Yelp's Python API for MapReduce, MRJob. Their simple word-counter example makes sense, but I'm curious how one would handle an application involving multiple inputs. For instance, rather than simply counting the words in a document, multiplying a vector by a matrix. I came up with this solution, which functions, but feels silly:

from mrjob.job import MRJob

class MatrixVectMultiplyTask(MRJob):
    def multiply(self, key, line):
        # Each input line is one matrix column, with the matching vector entry last.
        line = map(float, line.split(" "))
        v, col = line[-1], line[:-1]

        for i in xrange(len(col)):
            yield i, col[i] * v

    def sum(self, i, occurrences):
        yield i, sum(occurrences)

    def steps(self):
        return [self.mr(self.multiply, self.sum)]

if __name__ == "__main__":
    MatrixVectMultiplyTask.run()

This code is run as ./matrix.py < input.txt, and the reason it works is that the matrix is stored in input.txt by columns, with the corresponding vector value at the end of each line.

So, the following matrix and vector:

[image: the matrix and the vector]

are represented in input.txt as:

[image: the contents of input.txt]
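The original images are not reproduced here, so as a purely hypothetical illustration (not the original figures): for a 2x2 matrix A with rows (1, 2) and (3, 4) and a vector x = (5, 6), storing one column of A per line with the matching vector entry appended gives

1 3 5
2 4 6

and the job then computes A·x = (1*5 + 2*6, 3*5 + 4*6) = (17, 39).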

In short, how would I go about storing the matrix and vector more naturally in separate files, and passing them both into MRJob?

If you need to process your raw data against another (or the same row_i, row_j) data set, you can either:

1) Create an S3 bucket to store a copy of your data. Pass the location of this copy to your task class, e.g. self.options.bucket and self.options.my_datafile_copy_location in the code below. Caveat: unfortunately, it seems that the whole file must be "downloaded" to the task machines before being processed. If the connection falters or takes too long to load, the job may fail. Here is some Python/MRJob code to do this.

Put this in your mapper function:

# Assumes `import boto` at the top of your job module.
d1 = line1.split('\t', 1)
v1, col1 = d1[0], d1[1]
conn = boto.connect_s3(aws_access_key_id=<AWS_ACCESS_KEY_ID>, aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>)
bucket = conn.get_bucket(self.options.bucket)  # bucket = conn.get_bucket(MY_UNIQUE_BUCKET_NAME_AS_STRING)
data_copy = bucket.get_key(self.options.my_datafile_copy_location).get_contents_as_string().rstrip()
### CAVEAT: Needs to get the whole file before processing the rest.
for line2 in data_copy.split('\n'):
    d2 = line2.split('\t', 1)
    v2, col2 = d2[0], d2[1]
    ## Now, insert code to do any operations between v1 and v2 (or c1 and c2) here:
    yield <your output key, value pairs>
conn.close()
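The snippet above reads self.options.bucket and self.options.my_datafile_copy_location but does not show where those options come from. A minimal sketch of defining them as mrjob passthrough options (the flag names below are hypothetical, and this uses the older configure_options / add_passthrough_option API):

from mrjob.job import MRJob

class MatrixVectMultiplyTask(MRJob):

    def configure_options(self):
        super(MatrixVectMultiplyTask, self).configure_options()
        # Hypothetical flag names; mrjob forwards passthrough options to the
        # task nodes, so self.options is populated inside the mapper.
        self.add_passthrough_option('--bucket', dest='bucket')
        self.add_passthrough_option('--my-datafile-copy-location',
                                    dest='my_datafile_copy_location')

They would then be supplied on the command line, e.g. ./matrix.py --bucket my-bucket --my-datafile-copy-location matrix_copy.txt input.txt.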

2) Create a SimpleDB domain, and store all of your data in there. Read about boto and SimpleDB here: http://code.google.com/p/boto/wiki/SimpleDbIntro

Your mapper code would look like this:

# Assumes `import boto` at the top of your job module.
dline = dline.strip()
d0 = dline.split('\t', 1)
v1, c1 = d0[0], d0[1]
sdb = boto.connect_sdb(aws_access_key_id=<AWS_ACCESS_KEY>, aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>)
domain = sdb.get_domain(MY_DOMAIN_STRING_NAME)
for item in domain:
    v2, c2 = item.name, item['column']
    ## Now, insert code to do any operations between v1 and v2 (or c1 and c2) here:
    yield <your output key, value pairs>
sdb.close()

This second option may perform better if you have very large amounts of data, since it can make requests for each row of data rather than fetching everything at once. Keep in mind that SimpleDB values can be at most 1024 characters long, so you may need to compress/decompress your values by some method if they are longer than that.
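For the 1024-character limit, one possible workaround (a sketch, not part of the original answer) is to zlib-compress and base64-encode each value before writing it to SimpleDB, and reverse that in the mapper:

import base64
import zlib

def pack_value(text):
    # Compress, then base64-encode so the result stays printable for SimpleDB.
    return base64.b64encode(zlib.compress(text))

def unpack_value(blob):
    return zlib.decompress(base64.b64decode(blob))

Note that base64 adds roughly a third in overhead, so this only helps for compressible values; anything still longer than 1024 characters would have to be split across multiple attributes.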

The actual answer to your question is that mrjob does not quite yet support the Hadoop streaming join pattern, which is to read the map_input_file environment variable (which exposes the map.input.file property) to determine which type of file you are dealing with, based on its path and/or name.
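For reference, a minimal sketch of what that pattern looks like in a plain Hadoop Streaming mapper (outside mrjob); it assumes Hadoop exposes map.input.file to the task as the map_input_file environment variable, and the 'matrix'/'vector' file-name test is purely illustrative:

#!/usr/bin/env python
import os
import sys

# Hadoop Streaming exports job configuration to the task environment,
# with dots replaced by underscores.
input_file = os.environ.get('map_input_file', '')

for line in sys.stdin:
    line = line.strip()
    if 'matrix' in input_file:
        print 'M\t%s' % line   # tag matrix records
    elif 'vector' in input_file:
        print 'V\t%s' % line   # tag vector records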

You might still be able to pull it off, if you can easily tell which type a record belongs to just by reading the data itself, as shown in this article:

http://allthingshadoop.com/2011/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/

However, that's not always possible...

Otherwise mrjob looks fantastic, and I wish they could add support for this in the future. Until then, this is pretty much a deal breaker for me.

This is how I use multiple inputs: based on the filename, I make suitable changes in the mapper phase.

Runner Program:

import os
import datetime

from mrjob.hadoop import *
from my_mrjob_module import MR_Job  # hypothetical module name; the job class is shown below


# Define all arguments

os.environ['HADOOP_HOME'] = '/opt/cloudera/parcels/CDH/lib/hadoop/'
print "HADOOP HOME is now set to : %s" % (str(os.environ.get('HADOOP_HOME')))
job_running_time = datetime.datetime.now().strftime('%Y-%m-%d_%H_%M_%S')
hadoop_bin = '/usr/bin/hadoop'
mode = 'hadoop'
hs = HadoopFilesystem([hadoop_bin])

input_file_names = ["hdfs:///app/input_file1/", "hdfs:///app/input_file2/"]
output_dir = "hdfs:///app/output/"  # placeholder: set this to wherever the job should write

aargs = ['-r', mode, '--jobconf', 'mapred.job.name=JobName', '--jobconf', 'mapred.reduce.tasks=3', '--no-output', '--hadoop-bin', hadoop_bin]
aargs.extend(input_file_names)
aargs.extend(['-o', output_dir])
print aargs
status_file = True

mr_job = MR_Job(args=aargs)
with mr_job.make_runner() as runner:
    runner.run()
os.environ['HADOOP_HOME'] = ''
print "HADOOP HOME is now set to : %s" % (str(os.environ.get('HADOOP_HOME')))

The MRJob Class:

from mrjob.job import MRJob
from mrjob.compat import get_jobconf_value  # renamed jobconf_from_env in newer mrjob


class MR_Job(MRJob):
    DEFAULT_OUTPUT_PROTOCOL = 'repr_value'

    def mapper(self, _, line):
        """
        This function reads lines from file.
        """
        try:
            # Branch on the name of the file this split came from.
            input_file_name = get_jobconf_value('map.input.file').split('/')[-2]
            """
            Mapper code
            """
        except Exception, e:
            print e

    def reducer(self, email_id, visitor_id__date_time):
        try:
            """
            Reducer Code
            """
        except:
            pass


if __name__ == '__main__':
    MR_Job.run()

In my understanding, you would not be using MrJob unless you wanted to leverage a Hadoop cluster or Hadoop services from Amazon, even if the example runs on local files.

MrJob in principle uses "Hadoop streaming" to submit the job.

This means that all inputs specified as files or folders from Hadoop are streamed to the mapper, and the subsequent results to the reducer. Each mapper obtains a slice of the input and considers all input to be schematically the same, so that it uniformly parses and processes a key, value pair from each data slice.

Deriving from this understanding, the inputs are schematically the same to the mapper. The only way to include two schematically different kinds of data is to interleave them in the same file, in such a manner that the mapper can understand which is vector data and which is matrix data.

You are actually doing it already.

You can simply improve on that by adding a specifier to each line saying whether it is matrix data or vector data. Once you see a vector line, the preceding matrix data is applied to it (a sketch of a mapper that dispatches on such a tag follows the sample lines below).

matrix, 1, 2, ...
matrix, 2, 4, ...
vector, 3, 4, ...
matrix, 1, 2, ...
.....
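A minimal sketch of a mapper that dispatches on such a tag (the class name, tag parsing, and yielded keys are illustrative, not from the original answer):

from mrjob.job import MRJob


class TaggedMatrixVectorJob(MRJob):

    def mapper_init(self):
        self.rows = []  # matrix rows buffered until the matching vector arrives

    def mapper(self, key, line):
        parts = line.split(',')
        tag, values = parts[0].strip(), [float(x) for x in parts[1:]]
        if tag == 'matrix':
            self.rows.append(values)
        elif tag == 'vector':
            # Apply the buffered matrix rows to this vector, then reset the buffer.
            for i, row in enumerate(self.rows):
                yield i, sum(a * x for a, x in zip(row, values))
            self.rows = []


if __name__ == '__main__':
    TaggedMatrixVectorJob.run()

This simple buffering only works if a matrix block and its vector line land in the same input split; for data that can span splits you would need an explicit grouping key instead.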

But the process that you have mentioned works well. You have to have all the schematic data in a single file.

This still has issues, though. The key/value MapReduce model works better when a complete schema is present in a single line, and that line contains a complete single processing unit.

From my understanding, you are already doing it correctly, but I guess Map-Reduce is not a suitable mechanism for this kind of data. I hope someone clarifies this even further than I could.

The MrJob Fundamentals state:

You can pass multiple input files, mixed with stdin (using the - character):

$ python my_job.py input1.txt input2.txt - < input3.txt
