How can I reduce step differences in Hadoop?

Question

How can I reduce step differences in Hadoop?
I have trouble understanding Hadoop. I have two files, and first I joined them. One file is about countries and the other is about the clients in each country.
Example, clients.csv:

Bertram Pearcy  ,bueno,SO
Steven Ulman  ,regular,ZA

Countries.csv:

Name,Code   
Afghanistan,AF
Åland Islands,AX
Albania,AL  
…

I wrote one map/reduce job that tells me how many "good" (bueno) clients each country (ZA, SO) has, and with Countries.csv I know which country we are talking about.

I programmed:

def steps(self): 
        # Order the operations for execution.
        return [ 
            MRStep(mapper=self.mapper 
                   ,reducer=self.reducer),            
            MRStep(mapper=self.mapper1
                   ,combiner=self.combiner_cuenta_palabras
                   ,reducer=self.reducer2
                    ),
        ]

The result of my map/reduce is:

["South Georgia and the South Sandwich Islands"]    1
["South Sudan"] 1
["Spain"]   3

Now, I would like to know which one is the best.

I added one more reducer.

    def reducer3(self, _, values):            
        yield  _, max (values)
        
    def steps(self): 
        # Order the operations for execution.
        return [ 
            MRStep(mapper=self.mapper 
                   ,reducer=self.reducer),  
            MRStep(mapper=self.mapper1
                   ,combiner=self.combiner_cuenta_palabras
                   ,reducer=self.reducer2),
            MRStep(#mapper=self.mapper3,
                   reducer=self.reducer3
                   #,reducer=self.reducer3
            ),            
        ]

But I get the same answer as without that reducer.

I tried to use one map/reduce program, adding another reduce. It does not work.

With my first reduce I got:

A, 10
C, 2
D, 5

Now, I would like to use that result to get: A, 10

Additional comment:

INPUT [File1]+[File2] => [image]

MAP/REDUCE => OUT

[image]

Now, I need additional map/reduce steps (and I would like to reuse what I did) to get other answers.

First) One and only one answer. Example: 3 Spain

Second) All the countries with the best (highest) number, e.g. 3 Spain and 3 Guan.

I tried to use:

def reducer3(self, _, values):            
        yield  _, max (values)

And I added:

def steps(self): 
        # Order the operations for execution.
        return [ 
            MRStep(mapper=self.mapper 
                   ,reducer=self.reducer),  
            MRStep(mapper=self.mapper1
                   ,combiner=self.combiner_cuenta_palabras
                   ,reducer=self.reducer2),
            MRStep(reducer=self.reducer3
            ),            
        ]

But I still get the same result. I know that reducer3 is being used, because if I write max(values)+1000 I get the same results but with the numbers 1001, 1003.

Answer 1

Your reducer is getting 3 distinct keys, so it is called once per key and finds the max of each one, and values only has one element (try printing its length...). Therefore, you get 3 results.
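
To see why nothing changes, here is a minimal pure-Python sketch of what the framework does between map and reduce. It is not mrjob code: `run_reduce` is a hypothetical helper that imitates the shuffle (group by key, one reducer call per key), and the keys are assumed to be plain strings for simplicity.

```python
from itertools import groupby

# Output of the second step as (key, value) pairs (values from the question).
records = [
    ("South Georgia and the South Sandwich Islands", 1),
    ("South Sudan", 1),
    ("Spain", 3),
]

def reducer3(key, values):
    # Same logic as the reducer in the question.
    yield key, max(values)

def run_reduce(reducer, pairs):
    # Hypothetical stand-in for the shuffle: sort by key, then call the
    # reducer once per distinct key with only that key's values.
    out = []
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        out.extend(reducer(key, (v for _, v in group)))
    return out

result = run_reduce(reducer3, records)
print(result)
```

Each group holds a single value, so `max` returns that value and the records come out unchanged.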

You need a third mapper that returns (None, f'{key}|{value}') for example; then all records will be sent to one reducer, where you can then iterate, parse, and aggregate the results.
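
A minimal sketch of such a mapper, written as a plain function so it can be run without a cluster. The name `mapper3` and the string keys are assumptions; inside the job class it would take `self` as its first argument and, presumably, be wired in as `MRStep(mapper=self.mapper3, reducer=self.reducer3)`.

```python
def mapper3(key, value):
    # Emit every record under the same key (None) so that a single
    # reducer call receives all of them. The original key travels
    # inside the value, separated by '|'.
    yield None, f"{key}|{value}"

# Sample records shaped like the output of the second step.
records = [("South Sudan", 1), ("Spain", 3)]
mapped = [pair for k, v in records for pair in mapper3(k, v)]
print(mapped)
```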

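def reducer3(self, _, values):
    _max = float('-inf')
    k_out = None
    for x in values:
        # Split on the last '|' so keys that contain '|' still parse.
        k, v = x.rsplit('|', 1)
        v = int(v)  # keep the running max numeric, not a string
        if v > _max:
            _max = v
            k_out = k
    yield k_out, _max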

That'll only return one result for all values. If you want to capture ties for the max, I think you'll need to iterate over the list more than once, then yield within a loop over the max elements found.
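
One possible way to handle ties, as a hedged sketch rather than a tested mrjob job: since values is a one-shot iterator, it is copied into a list first, which assumes the per-country counts fit in memory on one reducer. The name `reducer3_with_ties` is hypothetical, and the function is shown without `self` so it runs standalone.

```python
def reducer3_with_ties(_, values):
    # values can only be iterated once, so materialize it for two passes.
    parsed = []
    for x in values:
        k, v = x.rsplit("|", 1)
        parsed.append((k, int(v)))
    if not parsed:
        return
    # First pass: find the maximum count.
    best = max(v for _, v in parsed)
    # Second pass: yield every key that reaches it.
    for k, v in parsed:
        if v == best:
            yield k, v

# Sample input using the example values from the question.
sample = iter(["Spain|3", "South Sudan|1", "Guan|3"])
winners = list(reducer3_with_ties(None, sample))
print(winners)
```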

Solution 1
0 Accepted 2022-11-13 13:08:17