Hadoop custom partitioner does not behave according to its logic

Question

Based on this example here, this works. I have tried the same approach on my dataset.

Sample dataset:

OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;

Treating each line as a string, my Mapper output is:

key -> string[2], value -> string.

My Partitioner code:

@Override
public int getPartition(Text key, Text value, int reducersDefined) {

    String keyStr = key.toString();
    if(keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}

In my dataset most IDs are 137176. Number of reducers declared: 2. I expect two output files, one for 137176 and a second for the remaining IDs. I am getting two output files, but the IDs are evenly distributed across both of them. What is going wrong in my program?

Answer 1

Explicitly tell the job, in the Driver method, that you want to use your custom Partitioner, by calling: job.setPartitionerClass(YourPartitioner.class);. If you don't do that, the default HashPartitioner is used.
Change the String comparison from == to .equals(), i.e., change if(keyStr == "137176") { to if(keyStr.equals("137176")) {.
To save some time, it may be faster to declare a Text variable at the beginning of the partitioner, like this: Text KEY = new Text("137176"); and then, instead of converting your input key to a String every time, just compare it with the KEY variable (again using the equals() method). But perhaps the two are equivalent. So, what I suggest is:
```
Text KEY = new Text("137176");

@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    // Text.equals compares the underlying bytes, so no String conversion is needed.
    return key.equals(KEY) ? 0 : 1 % reducersDefined;
}
```
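
For what it's worth, here is a small standalone sketch (plain Java, no Hadoop needed) of why the == check in the question can never match. The class and method names are made up for illustration, and new String(...) stands in for key.toString(), which builds a fresh String object on each call. Note that with == alone every key would fall through to partition 1, so an even spread over both files suggests the custom partitioner was probably not registered at all, which is why both fixes above matter.

```java
class StringCompareDemo {
    // Mirrors the question's check: == compares object references, not characters.
    static int partitionWithIdentity(String keyStr, int reducersDefined) {
        if (keyStr == "137176") {
            return 0;
        }
        return 1 % reducersDefined;
    }

    // Corrected check: equals() compares the character content.
    static int partitionWithEquals(String keyStr, int reducersDefined) {
        if ("137176".equals(keyStr)) {
            return 0;
        }
        return 1 % reducersDefined;
    }

    public static void main(String[] args) {
        // A String created at runtime is a different object from the literal "137176".
        String keyStr = new String("137176".toCharArray());
        System.out.println(partitionWithIdentity(keyStr, 2)); // prints 1
        System.out.println(partitionWithEquals(keyStr, 2));   // prints 0
    }
}
```

The 1 % reducersDefined part only matters when the job runs with a single reducer, where it keeps the returned partition number at 0.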

Another suggestion: if the network load is heavy, parse the map output key as a VIntWritable and change the Partitioner accordingly.
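
A rough sketch of what that could look like, assuming the mapper is changed to emit the ID field as a VIntWritable (for example by parsing the third field with Integer.parseInt). The class name IdPartitioner and the constant HOT_ID are invented here for illustration; they are not from the question.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner for a job whose map output key is a VIntWritable ID.
public class IdPartitioner extends Partitioner<VIntWritable, Text> {

    // The dominant ID from the question's dataset (assumed to fit in an int).
    private static final int HOT_ID = 137176;

    @Override
    public int getPartition(VIntWritable key, Text value, int reducersDefined) {
        // Primitive int comparison, so == is the right operator here.
        return key.get() == HOT_ID ? 0 : 1 % reducersDefined;
    }
}
```

The Driver would then also need job.setMapOutputKeyClass(VIntWritable.class); next to job.setPartitionerClass(IdPartitioner.class);, otherwise the key types will not match.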

Answer 1: score 0, accepted, 2015-07-20 14:05:47
