
Calculate the median of a list of values in parallel using Hadoop map-reduce

I'm new to Hadoop mrjob. I have a text file where each line contains the data "id groupId value". I am trying to calculate the median of all values in the text file using Hadoop map-reduce, but I'm stuck when it comes to calculating only that single overall median. What I get instead is a median value for each id, like:

"123213"        5.0
"123218"        2
"231532"        1
"234634"        7
"234654"        2
"345345"        9
"345445"        4.5
"345645"        2
"346324"        2
"436324"        6
"436456"        2
"674576"        10
"781623"        1.5

The output should be like "median value of all values is: ####". I was influenced by this article: https://computehustle.com/2019/09/02/getting-started-with-mapreduce-in-python/ My Python file median-mrjob.py:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats, combiner=self.reducer_count_stats),
            MRStep(reducer=self.reducer_sort_by_values),
            MRStep(reducer=self.reducer_retrieve_median)
        ]

    def mapper_get_stats(self, _, line):
        line_arr = line.split(" ")
        values = int(float(line_arr[-1]))
        id = line_arr[0]
        yield id, values

    def reducer_count_stats(self, key, values):
        yield str(sum(values)).zfill(2), key

    def reducer_sort_by_values(self, values, ids):
        for id in ids:
            yield id, values

    def reducer_retrieve_median(self, id, values):
        valList=[]
        median = 0
        for val in values:
            valList.append(int(val))
        N = len(valList)
        #find the median
        if N % 2 == 0:
            #if N is even
            m1 = N / 2
            m2 = (N / 2) + 1
            #Convert to integer, match post
            m1 = int(m1) - 1
            m2 = int(m2) - 1
            median = (valList[m1] + valList[m2]) / 2 
        else:
            m = (N + 1) / 2
            # Convert to integer, match position
            m = int(m) - 1
            median = valList[m]
        yield (id, median)

if __name__ == '__main__':
   MRMedian.run()

My original text files are about 1 million and 1 billion lines of data, but I have created a test file with arbitrary data, named input.txt:

781623 2 2.3243
781623 1 1.1243
234654 1 2.122
123218 8 2.1245
436456 22 2.26346
436324 3 6.6667
346324 8 2.123
674576 1 10.1232
345345 1 9.56135
345645 7 2.1231
345445 10 6.1232
231532 1 1.1232
234634 6 7.124
345445 6 3.654376
123213 18 8.123
123213 2 2.1232

What I care about is the values; note that there might be duplicates. I run this command in the terminal to run the code: python median-mrjob.py input.txt

Update: The point of the assignment is not to use any libraries, so I need to sort the list manually (or at least part of it, as I understand it) and calculate the median manually (hardcoding); otherwise the goal of using MapReduce disappears. Using PySpark is not allowed in this assignment. Check this link for more inspiration: Computing median in map reduce

The output should be like "median value of all values is: ####"

Then you need to force all data to one reducer first (effectively defeating the purpose of using MapReduce).

You'd do that by not using the ID as the key, and discarding it:

def mapper_get_stats(self, _, line):
    line_arr = line.split()
    if line_arr:  # prevent empty lines
        value = float(line_arr[-1])
        yield None, value

After that, sort and find the median (I fixed your parameter order):

def reducer_retrieve_median(self, key, values):
    import statistics
    yield None, f"median value of all values is: {statistics.median(values)}"  # automatically sorts the data

So, only two steps:

class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats),
            MRStep(reducer=self.reducer_retrieve_median)
        ]
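
For completeness, the assembled script might look like this (the same pieces as above, sketched into one runnable file, with the import moved to module level):

from mrjob.job import MRJob
from mrjob.step import MRStep
import statistics

class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats),
            MRStep(reducer=self.reducer_retrieve_median)
        ]

    def mapper_get_stats(self, _, line):
        line_arr = line.split()
        if line_arr:  # prevent empty lines
            # discard the id and groupId; the single None key
            # routes every value to one reducer
            yield None, float(line_arr[-1])

    def reducer_retrieve_median(self, key, values):
        yield None, f"median value of all values is: {statistics.median(values)}"

if __name__ == '__main__':
    MRMedian.run()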

For the given file, you should see:

null    "median value of all values is: 2.2938799999999997"

original text files are about 1 million and 1 billion lines of data

Not that it matters, but which is it?

You should upload the file to HDFS first; then you could use better tools than mrjob for this, such as Hive or Pig.
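
For example, once the file is on HDFS, the same mrjob script can be pointed at the Hadoop runner with something like python median-mrjob.py -r hadoop hdfs:///path/to/input.txt (assuming mrjob can find your Hadoop installation; paths and configuration will vary with your cluster setup).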
