
What is the most effective way to find the distinct values of columns in Hadoop?

Original post 2013-02-23 21:11:14 7 2 java/ hadoop/ mapreduce

I have a 1 TB file, and we need to find the distinct values of 4 of its columns. For example, if the columns are A, B, C, D, E, F and so on, we need to find all the distinct values in column A and write them to one file in HDFS, and likewise for B, C and D.

Note: We have to do this for only 4 columns, not for the rest. The file has 300 columns in total.

We need to write a MapReduce job for this. What would be an effective way to address this problem? I'd appreciate your help. Thanks.

2 Answers

Have the mapper output one record for each column whose unique values you need. In your example, the map will output 4 records for a single input record, keyed by A, B, C and D.

In the reducer you can then handle all the values.

Depending on the details of what you need, you may want to use a key that looks something like this: "A:value of column A"
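To make the composite-key idea concrete, here is a minimal plain-Java sketch of the map and reduce logic. It deliberately leaves out the Hadoop `Mapper`/`Reducer` wiring so it can run on its own. The comma delimiter, the column positions 0 to 3, and the names `DistinctColumnsSketch`, `mapRecord` and `reduce` are assumptions made for illustration, not details from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class DistinctColumnsSketch {
    // Assumed layout: comma-separated records, wanted columns at indexes 0..3.
    static final String[] NAMES = {"A", "B", "C", "D"};
    static final int[] INDEXES = {0, 1, 2, 3};

    // Map step: one input record yields one "column:value" key per wanted column.
    static List<String> mapRecord(String line) {
        String[] fields = line.split(",", -1);
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++) {
            if (INDEXES[i] < fields.length) {
                keys.add(NAMES[i] + ":" + fields[INDEXES[i]]);
            }
        }
        return keys;
    }

    // Stand-in for shuffle + reduce: identical keys collapse into one entry,
    // and the distinct values are grouped under their column name.
    static Map<String, TreeSet<String>> reduce(List<String> keys) {
        Map<String, TreeSet<String>> distinct = new TreeMap<>();
        for (String key : keys) {
            int sep = key.indexOf(':');
            distinct.computeIfAbsent(key.substring(0, sep), k -> new TreeSet<>())
                    .add(key.substring(sep + 1));
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        for (String line : new String[] {"1,x,p,k,extra", "1,y,p,k,extra", "2,x,q,k,extra"}) {
            emitted.addAll(mapRecord(line));
        }
        System.out.println(reduce(emitted));
    }
}
```

In a real job the framework does the grouping: the reducer is called once per distinct key, so it only has to write that key out, and the column prefix tells it which output file the value belongs to.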

Basically you need to filter out duplicate records, which can be done in a few steps across the mapper, combiner and reducer. You can also use a Java Set. The mapper can output key='Column A', value='complete record'. Store the keys in a Set; if the Set already contains a key, don't emit the record. The same can be done in the combiner, and you may not need a reducer at all. You also need to work out how to keep the Set from causing an out-of-memory error, for example by clearing it once it reaches a specific size.
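The Set-based filter with a size cap could look roughly like the sketch below. It is plain Java with no Hadoop classes; the class name `BoundedDedupSketch`, the method `shouldEmit` and the cap of 2 entries are hypothetical choices for the demo. One caveat is worth stating: once the Set is cleared, a key seen earlier can be emitted again, and each map task only sees its own input split, so the filter reduces duplicates but does not guarantee exactly one copy without a later deduplication step.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BoundedDedupSketch {
    private final Set<String> seen = new HashSet<>();
    private final int maxEntries; // assumed cap; tune to the available heap

    BoundedDedupSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Returns true when the key has not been seen since the last clear,
    // meaning the caller should emit it.
    boolean shouldEmit(String key) {
        if (seen.contains(key)) {
            return false;
        }
        if (seen.size() >= maxEntries) {
            seen.clear(); // bound memory use; earlier keys may now repeat
        }
        seen.add(key);
        return true;
    }

    public static void main(String[] args) {
        BoundedDedupSketch dedup = new BoundedDedupSketch(2);
        List<String> emitted = new ArrayList<>();
        for (String key : new String[] {"A:1", "A:1", "A:2", "A:3", "A:1"}) {
            if (dedup.shouldEmit(key)) {
                emitted.add(key);
            }
        }
        // The last "A:1" is emitted again because the Set was cleared in between.
        System.out.println(emitted);
    }
}
```

Inside a mapper, a filter like this would guard each write, which cuts the volume sent through the shuffle while leaving the final guarantee of uniqueness to the reduce side.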